Blitvak Week 14

From LMU BioDB 2015
Jump to: navigation, search

Goals for Week 14

  • Consult with Anu to make modifications to TallyEngine/GenMAPP (share initial export results)
  • Use Excel to track discrepant IDs (reference: Using Microsoft Excel to Compare ID Lists)
  • Conduct gene database exports for any modified versions of GenMAPP builder that are created
    • Analyze any conducted exports and perform Q&A work

Initial Export Analysis

Overview of Week 12 findings

  • Using XMLPipeDB Match, 7127 unique matches were found that correlated with the OrderedLocusNames IDs outlined at the end of the week 12 assignment
  • TallyEngine reported that 337 OrderedLocusNames were present in the XML and within the PSQL database
  • Using select count(*) from genenametype where type = 'ordered locus' and value ~ 'p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Z, a-z]?';, it was verified that 337 OrderedLocusNames entries were present in B.cenocepacia_J2315_20151119_gmb3build5.
  • By looking at the data present in the genenametype table, it was found that the OrderedLocusName data was in the format of BceJ2315_#####

Steps taken for further analysis, conducted on 12/1

  • The UniProt XML file was opened , via first object XML editor, in order to investigate and verify the location/nature of the OrderedLocusName data.
  • A data entry was selected and the related data was looked into:

XML exploration GEN BL14 20151201.png

  • In this entry, and in numerous others, it was noticed that only the gene name in the format of BceJ2315_##### was tagged as being of the "ordered locus" type. The format that was being focused upon in previous work, that of p?BCA[M,S,L]###?[A,a]#[A-Z]?, was labeled as being of the type "ORF". It was noticed that all entires that contained an "ordered locus" gene name also contained an "ORF" name for the same gene; most entries, additionally, lacked an "ordered locus" name and only contained an "ORF" name.
  • GenMAPP builder, by default, is made to pick up and utilize the ordered locus data within the XML; it was realized that, with respect to the initial export, it was functioning properly. Since the XML data only contained 337 OrderedLocus names, only 337 made it to the database. Since 7127 matches were found, using XMLPipeDB Match, that correlated to an "ORF" name, it is assumed that most of the gene data is ignored by focusing on OrderedLocus names.
  • UniProt KB was referenced in order to further verify that all BceJ2315_##### gene names were coupled with one that was considered an "ORF" name
  • A search query was conducted that consisted of bcej2315 NOT gene:bca*; this query, it was hoped, would show the number of gene entries that contained just an OrderedLocusName ID.

UniProt Results 2 12.1 GEN BL14 20151201.png

Discrepant ID analysis for the Initial Export, Conducted on 12/3

  • The PSQL database for the initial export was opened up and the SQL query select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Z, a-z]?'; was run in order to find the ORF counts within the database (since TallyEngine, with the present used build of GenMAPP builder, did not incorporate the ORF data).
    • It was found that 7121 entries were within the database that corresponded to ORF data. select * from genenametype where type = 'ORF'; was run in order to observe the data; it was found that the data within the table corresponded to the gene name format of interest (p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Z, a-z]?).
  • However, it was found that there was a difference in count between what was previously reported by XMLPipeDB Match and by Postgres (7127 vs. 7121, a difference of 6 entries). It was realized that Excel should be utilized in order to track down the discrepant IDs.

Using Excel to track down discrepant IDs

  • Pg Admin III was initialized and the database that was the initial export was booted up.
  • The SQL query select * from genenametype where type = 'ORF' order by value was utilized in order to put the data in ascending order (lower ID #s come first); the results of query was then exported in a format that Excel can read (text file).
  • Using the windows command line, through XMLPipeDB match, the 7127 unique XML entries that fit the criteria of ORF gene name were exported as a text file using java -jar xmlpipedb-match-1.1.1.jar "p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Z, a-z]?" < uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml > MATCHIDs_GEN_BL14_20151203
  • Both files were opened with Excel and the proper settings were selected so that the gene name data ended up on its own column (for the Match data, a colon was selected as the divider between columns; PSQL data was comma separated).
  • The column of IDs from from the Match utility, and the one that was found in the PostgreSQL database, were put side by side in a new Excel document with no spaces between them; it was ensured that each column was in ascending order. The column of IDs from the Match utility was given the label of "MATCH IDs"; the one from the PSQL database was given the label of "IDs FROM postgreSQL.
  • Two new columns were created to the right of the ID columns, one was given the label of "MATCH: 1 to 2", the other that of "MATCH: 2 to 1". The plan, at this point, was to utilize Excel MATCH commands in order to compare the two sets of IDs with eachother; it was hoped that these commands would indicate which IDs were present in one set but not in the other.
  • MATCH commands were then written the 2 MATCH columns and applied to the entirety of each MATCH column; the basic format is =MATCH(VALUE TO LOOK-UP, RANGE/COLUMN WHERE THE LOOKING-UP OF A VALUE TAKES PLACE, "MATCH TYPE" [0 in this case]). The purpose of these MATCH commands is to compare the two different ID lists (with each other)
*ALL MATCH COMMANDS:
Format - Column Label : MATCH Command (in first "cell")
MATCH:1 to 2 : =MATCH(A2, B$2:B$7122, 0)   
MATCH:2 to 1 : =MATCH(B2, A$2:A$7122, 0)

Note: In the analysis conducted, the IDs from XMLPipeDB Match were placed in column A, and the ones from the Postgres Database were placed in column B. Values of "#N/A" appear in instances where an ID in one was set was not found in another.

  • The Find function was used in Excel (via control + F) and the value #N/A was searched for ("Look In:" was set to Values). 6 instances of #N/A were found, which coheres with the difference of 6 that was found between the Match utility results and those of PSQL.
    • The discrepant IDs are: bca199f, bca5253f, bca636c, bcad837b, bcal0235a, and bcal0239a
  • bca199f, bca5253f, bca636c, and bcad837b were found to be a part of a sequence of letters and numbers under the label of "checksum"; these appeared to have been accidentally captured by the utilized Match command.
  • bcal0235a and bcal0239a follow the previous identified gene name patterns, however, they both show up as database reference IDs (database reference to STRING, which is a database of known and predicted protein interactions; these data will be ignored as they do not fall under an entry that refers to a gene name.
    • At this point, it was also realized that the commands utilized for Match and PSQL could use some adjustments with respect to the desired pattern. p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Z, a-z] was modified to p?BCA[L,M,S]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]; this new pattern was tested with XMLPipe DB match using the command java -jar xmlpipedb-match-1.1.1.jar "p?BCA[L,M,S]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?" < filename.xml. The new pattern resulted in 7126 matches which eliminates one discrepant and incorrect match.

Export of Build 2, a custom version of GenMAPP builder: conducted on 12/1

Creating the database "B.cenocepacia_J2315_20151201_BUILD2_genialomics" in PostgreSQL

  • Steps taken were sourced from the Running GenMAPP Builder page
  • pgAdmin III was launched and a connection to the server was made. "Databases" was right clicked and select "New Database..." was chosen. The database was given a name, B.cenocepacia_J2315_20151201_gmbuilder-genialomics-20151201, and OK was clicked.
  • The new database was selected and the Query Tool was launched. Open File was clicked in the Query Tool and gmbuilder.sql in the gmbuilder-genialomics-12012015-build-2 folder (within the sql folder) was selected. Upon selection of that file, a query was loaded into Query Tool and it was subsequently executed by clicking the green "Execute Query" arrow
  • This query populates the created database with all of its tables. In order to ensure that the query properly worked, it was checked that 167 tables existed in the database

Data Import into gmbuilder-genialomics-12012015-build-2

  • gmbuilder.bat in the gmbuilder-genialomics-12012015-build-2 folder was launched
  • Under file -> configure database, the host was left as localhost, the port number was left as 5432, database name was set to gmbuilder-genialomics-12012015-build-2, Username was set to postgres, Password was set to the password of the PostgreSQL database that was recently created. OK was clicked.
  • File -> Import UniProt XML was selected
    • The UniProt XML file that was previously extracted was chosen, open was clicked. The import process was allowed to proceed uninterrupted.
  • File -> Import GO OBO-XML was selected
    • The GO OBO-XML that was previously extracted was chosen, open was clicked. The import process was allowed to proceed uninterrupted.
  • File -> Import GOA was selected
    • The GOA file that was downloaded previously was chosen, open was clicked, and the import process was allowed to proceed uninterrupted.

Exporting a GenMAPP Gene Database (.gdb file)

  • File -> Export to GenMAPP Gene Database was selected
  • BL was typed into the Owner field. The species of interest was selected for export (B. cenocepacia J2315)
  • Next was clicked, the create GenMAPP database file/location was selected, and the boxes for the exporting of Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms were left checked. The export process was initialized by clicking next; the windows were left open for the program to continue and finish with the export process (was estimated to take somewhere between 1-2 hrs). The database was given the name "Bc-Std_GEN_BL12_20151201.gdb".

Database Testing Report for Build 2

Exports of Build 3 and 4, custom versions of GenMAPP builder: conducted on 12/3

  • Gene database (.gdb) exports were conducted for builds 3 and 4 of the customized GenMAPP builder.
  • Build 3 of GenMAPP Builder: program was modified so that the gene names will be picked up from ORF data rather than ordered locus data.
  • Build 4 of GenMAPP Builder: TallyEngine code was cleaned up and code errors were fixed.

Creating the postgres databases for builds 3 and 4 of customized GenMAPP builder

  • pgAdmin III was launched and a connection to the server was made. "Databases" was right clicked and select "New Database..." was chosen. The database, for Build 3, was given the name B.cenocepacia_J2315_20151203_BUILD3_genialomics, and OK was clicked. The same procedure was repeated for Build 4; the build 4 database was given the name B.cenocepacia_J2315_20151204_BUILD4_genialomics.
  • The new databases were selected and the Query Tool was launched. Open File was clicked in the Query Tool and gmbuilder.sql in the sql folder of the gmbuilder folder of Build 3 and Build 4 of the customized GenMAPP builder was selected. Upon selection of that file, a query was loaded into Query Tool and it was subsequently executed by clicking the green "Execute Query" arrow
  • This query populates the created databases with all of its tables. In order to ensure that the query properly worked, it was checked that 167 tables existed in the databases.

Data Import into B.cenocepacia_J2315_20151203_BUILD3_genialomics and B.cenocepacia_J2315_20151204_BUILD4_genialomics

  • gmbuilder.bat was launched in the respective GenMAPP builder folder for both of the exports involving the two custom builds.
  • Under file -> configure database, the host was left as localhost, the port number was left as 5432, database name was set to B.cenocepacia_J2315_20151203_BUILD3_genialomics, in the case of the third build of GenMAPP builder. The database name was set to B.cenocepacia_J2315_20151204_BUILD4_genialomics in the case of the fourth build.
  • The username was set to postgres, Password was set to the password of the PostgreSQL server that was being utilized. OK was clicked.
  • File -> Import UniProt XML was selected
    • The UniProt XML file that was previously extracted was chosen, open was clicked. The import process was allowed to proceed uninterrupted.
  • File -> Import GO OBO-XML was selected
    • The GO OBO-XML that was previously extracted was chosen, open was clicked. The import process was allowed to proceed uninterrupted.
  • File -> Import GOA was selected
    • The GOA file that was downloaded previously was chosen, open was clicked, and the import process was allowed to proceed uninterrupted.

Exporting a GenMAPP Gene Database (.gdb file)

  • File -> Export to GenMAPP Gene Database was selected
  • BL was typed into the Owner field. The species of interest was selected for export (B. cenocepacia J2315)
  • Next was clicked, the create GenMAPP database file/location was selected, and the boxes for the exporting of Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms were left checked. The export process was initialized by clicking next; the windows were left open for the program to continue and finish with the export process (was estimated to take somewhere between 1-2 hrs). The database for the build 3 export was given the name of "Bc-Std_GEN_Build3_20151203.gdb"; the one for the build 4 export was given the name "Bc-Std_GEN_Build4_20151204.gdb".

Database Testing Report for Builds 3 and 4


Weekly Group Assignments Shared Group Journals Project Links Team Members

Brandon Litvak
BIOL 367, Fall 2015

Weekly Assignments Individual Journal Pages Shared Journal Pages