GÉNialOMICS Gene Database Testing Report (Build 2 Export)

From LMU BioDB 2015

Export Information

Version of GenMAPP Builder: GenMAPP Builder Custom, Build 2

Computer on which the export was run: Home Workstation

Postgres Database name: B.cenocepacia_J2315_20151201_BUILD2_genialomics

UniProt XML filename: uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml

GO OBO-XML filename: go_daily-termdb_GEN_BL12_20151119.obo-xml

  • GO OBO-XML version (derived from the file's own date-modified stamp): Date Modified: 11/19/2015 2:24 AM
  • GO OBO-XML download link: Link from GO website
  • Time taken to import: 5.25 minutes
  • Time taken to process: 3.91 minutes
    • Note: No issues were found with the import of this file.

GOA filename: 31277.B_cepacia_GEN_BL12_20151119.goa

  • GOA version: Date Modified: 11/10/15, 1:47:00 PM (information sourced from FTP site)
  • GOA download link: FTP site file
  • Time taken to import: 0.04 minutes
    • Note: No issues were found with the import of this file.

Name of .gdb file: Bc-Std_Build2_GEN_BL14_20151201.gdb

  • Time taken to export: 4 hours 22 minutes
    • Start time: 10:27 pm
    • End time: 2:49 am
    • Note: The file was exported without any major issues. However, this export took almost 2 hours longer than the initial export, most likely because the computer was being used for other work while the export was running.
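
The reported export duration can be checked from the start and end times. The following is a minimal sketch, assuming only that the export crossed midnight once; the helper name elapsed is illustrative and not part of GenMAPP Builder.

```python
from datetime import datetime, timedelta

def elapsed(start: str, end: str) -> timedelta:
    """Elapsed time between two 12-hour clock readings, allowing one midnight rollover."""
    fmt = "%I:%M %p"
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(end, fmt)
    if t1 < t0:
        # The end time falls on the following day.
        t1 += timedelta(days=1)
    return t1 - t0

# Start and end times as recorded for the Build 2 export.
export_duration = elapsed("10:27 PM", "2:49 AM")
print(export_duration)
```

The result agrees with the 4 hours 22 minutes recorded above.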

Using TallyEngine

  • PostgreSQL was started through pgAdmin III, and the database B.cenocepacia_J2315_20151201_BUILD2_genialomics was left running
  • GenMAPP Builder was launched and "Run XML and Database Tallies for UniProt and GO" was selected under the Tallies menu; the UniProt XML and GO files that had been imported were chosen
  • Results of TallyEngine:
  • Tallyengine results GEN BL14 20151201.png
    • Note: These results are identical to those from the initial, dry export of the gene database. This is not surprising, since the only difference in Build 2 is the customized species profile for J2315. The data that GenMAPP Builder collects from the XML is unchanged because no major coding modifications were made to the program. As outlined in the Week 14 assignment, future versions of GenMAPP Builder should collect the "ORF" gene names rather than the "ordered locus" gene names that are collected by default. Analysis of the XML file and the Match commands show that the XML contains only "ORF" names for most of its genes (see the Week 14 assignment and the initial export testing report).

Using XMLPipeDB match to Validate the XML Results from the TallyEngine

  • The Windows command line was launched (cmd.exe)
  • The following command was entered at the command line in order to use XMLPipeDB match to verify the OrderedLocusNames count:
  • java -jar xmlpipedb-match-1.1.1.jar "p?BCA[L,S,M]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?" < "uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml"
    • NOTE: Before the command was run, a set of cd commands was used in the Windows command line to move into the folder that held the input files and xmlpipedb-match-1.1.1.jar
  • XmlpipedbmatchOUTPUT GEN BL14 20151201.png
  • 7126 unique matches were found through XMLPipeDB match
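
To make the matching step easier to reason about, the sketch below applies the same pattern with Python's re module and tallies the distinct matched strings. It assumes that XMLPipeDB match reports distinct matches in a comparable way; the function name unique_match_counts and the sample XML fragment are hypothetical and are not taken from the tool or from the UniProt file.

```python
import re

# Pattern copied from the command above. Inside [...] the commas and the space
# are literal characters, so [L,S,M] also accepts "," and [A-Z, a-z] also
# accepts "," and " ".
PATTERN = re.compile(r"p?BCA[L,S,M]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?")

def unique_match_counts(text):
    """Map each distinct matched string to the number of times it occurs."""
    counts = {}
    for match in PATTERN.finditer(text):
        counts[match.group(0)] = counts.get(match.group(0), 0) + 1
    return counts

# Hypothetical fragment in the style of UniProt gene-name elements.
sample = (
    '<name type="ORF">BCAL0001</name>'
    '<name type="ORF">BCAL0001</name>'
    '<name type="ORF">pBCA001</name>'
)

counts = unique_match_counts(sample)
print(len(counts), "unique matches")
for name, n in sorted(counts.items()):
    print(name, n)
```

Because the commas inside the bracket expressions are literal, writing the classes as [LSM] and [A-Za-z] would be a stricter alternative; whether that changes the count for the real file has not been checked here.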

Are your results the same as you got for the TallyEngine? Why or why not?

  • These results are very different from the TallyEngine results. As mentioned in the Week 14 assignment, TallyEngine only counts the gene names in the XML that are tagged as "ordered locus", whereas the match command was found to reflect the names under the "ORF" tag. The ORF and ordered locus counts are very different, and this accounts for the difference in gene name counts between TallyEngine and XMLPipeDB match.

Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine

  • pgAdmin III was launched and all of the necessary connections were made
  • The gene/name tags in the XML file end up in the genenametype table (source: the wiki page regarding database quality analysis)
  • In pgAdmin III, the query select count(*) from genenametype where type = 'ordered locus' and value ~ 'BceJ2315_[0-9][0-9][0-9][0-9]'; was issued via the SQL Query menu in order to validate the TallyEngine count for "orderedlocusnames" in the PostgreSQL database.
    • A count of 337 was returned in pgAdmin III (Postgres database results). This lines up with what was found in TallyEngine.
  • Additionally, the query select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Z, a-z]?'; was run in order to verify the ORF counts (compared to the results that were found using XMLPipeDB match; see the Week 14 assignment).
    • A count of 7121 was returned, which lines up with what was found through XMLPipeDB match and through an analysis of the XML file
  • At this point, it was assumed that the data in the genenametype table of the PSQL database is identical to what was within the same table in the initial export PSQL database.
  • Are your results the same as reported by the TallyEngine? Why or why not?
    • The "ordered locus" results are the same as what was reported by TallyEngine since both are focusing on the same set of data.
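
The logic of the two counting queries can be illustrated without a database. The sketch below is a stand-in, not the actual PostgreSQL run: the rows list is hypothetical data in the shape of the genenametype table (type, value), and re.search is used because the PostgreSQL ~ operator is a case-sensitive, unanchored regular-expression match.

```python
import re

# Hypothetical rows standing in for the genenametype table: (type, value).
rows = [
    ("ordered locus", "BceJ2315_0001"),
    ("ORF", "BCAL0001"),
    ("ORF", "BCAM0002"),
    ("ORF", "pBCA001"),
    ("primary", "dnaA"),
]

# Patterns copied from the two queries above.
ORDERED_LOCUS = r"BceJ2315_[0-9][0-9][0-9][0-9]"
ORF = r"p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Z, a-z]?"

def count_rows(table, name_type, pattern):
    """Mimic: select count(*) from genenametype where type = ... and value ~ ...;"""
    regex = re.compile(pattern)
    return sum(1 for t, v in table if t == name_type and regex.search(v))

print("ordered locus:", count_rows(rows, "ordered locus", ORDERED_LOCUS))
print("ORF:", count_rows(rows, "ORF", ORF))
```

Note that count(*) counts rows rather than distinct values, so any duplicated value in the table would be counted more than once.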

OriginalRowCounts Comparison

  • The newly created J2315 .gdb file (Bc-Std_GEN_BL12_20151201.gdb) was opened with a program that can explore .mdb files (such as Microsoft Access); in this case, MDB Viewer Plus was used.
  • The OriginalRowCounts table, which lists each table in the database along with its number of rows, was examined
  • OriginalRowCounts for Build 2 Export of J2315
    • Build2ExportOriginalRowCounts GEN BL14 20151201.png
  • The database created for the initial, dry export of the J2315 gene data was chosen as the reference or "benchmark"; comparing the two should bring to light any differences that result from the Build 2 export.
  • Benchmark .gdb file: compressed Bc-Std_GEN_BL12_20151119.gdb
  • OriginalRowCounts for Initial Export of J2315
    • OriginalRowCounts(initial export) GEN BL12 20151123.png

Note: The OriginalRowCounts table in this export is identical to the one found in the initial export; this is not surprising, since the GenMAPP Builder used here is mostly identical to the one used in the initial export.
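
The comparison was done by eye from the two screenshots. If the two OriginalRowCounts tables were exported to text, the same check could be sketched as follows. The function name diff_row_counts is illustrative, and the dictionaries hold placeholder values: only the 337 rows for OrderedLocusNames comes from this report, and the UniProt figure is invented for the example.

```python
def diff_row_counts(benchmark, candidate):
    """Return {table: (benchmark_rows, candidate_rows)} for every table whose counts differ.

    A table missing from one side is reported with None for that side.
    """
    tables = set(benchmark) | set(candidate)
    return {
        table: (benchmark.get(table), candidate.get(table))
        for table in sorted(tables)
        if benchmark.get(table) != candidate.get(table)
    }

# Placeholder row counts; the UniProt value is hypothetical.
initial_export = {"OrderedLocusNames": 337, "UniProt": 100}
build2_export = {"OrderedLocusNames": 337, "UniProt": 100}

differences = diff_row_counts(initial_export, build2_export)
print(differences if differences else "row counts are identical")
```

An empty result corresponds to the outcome reported in the note above, where the two tables were found to be identical.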

Visual Inspection

Perform visual inspection of individual tables to see if there are any problems.

  • Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
    • Yes, there are dates present for GeneOntology, InterPro, GeneID, RefSeq, UniProt, EMBL, PDB, Pfam, OrderedLocusNames, and EnsemblBacteria.
  • Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
    • In the UniProt table, the ID and EntryName columns appear to use the correct ID form for J2315. The GeneName column, however, is missing most of its entries: no gene names of the form BCA[S,L,M]#### or BceJ2315_##### can be found. The few gene names that are present consist either of four letters with a capitalized final letter or of three lowercase letters.
    • The RefSeq table appears to be in order
    • The OrderedLocusNames table, as suggested by the earlier analysis, contains only 337 rows, with IDs in the format BceJ2315_#####.

Note: It was noticed that the UniProt table only contained gene names that are either of "ordered locus" type or in the format of four letters.

.gdb Use in GenMAPP

  • Some of the protocol from Part 2 of the Vibrio cholerae Microarray Data Analysis was used as a reference for this portion of the assignment
  • Bc-Std_Build2_GEN_BL14_20151201.gdb was placed within the Gene Databases folder of the GenMAPP directory (the folder is within the GenMAPP 2 Data folder)
  • GenMAPP (Version 2.1) was launched
  • The new gene database was loaded by going into Data > Choose Gene Database
  • The tab-delimited, GenMAPP-formatted data sourced from the microarray paper was loaded into GenMAPP through Data > Expression Dataset Manager > Expression Datasets > New Dataset > GenMAPP formatted microarray data_GEN_B14_20151207.txt
  • Note: Both files loaded into GenMAPP without crashes or other obvious problems. However, 7251 errors were detected in the loaded data when this gene database was used. It is suspected that this gene database does not cover the majority of the genes in the microarray data, which is expected since the microarray data uses ORF gene names.

Compare Gene Database to Outside Resource

Outside Resource: Burkholderia Genome DB

  • The strain page for J2315 was looked up: [1]
  • Only 337 OrderedLocusNames IDs were found in the exported database, whereas 7384 annotated genes are present in the MOD
    • J2315MODGENES GEN BL12 20151123.png

Note: Far more OrderedLocusNames IDs should be present in the exported database than were found; the data in the MOD and the executed Match queries help confirm this. The current number of OrderedLocusNames IDs (337) is very far from the numbers seen in the MOD (7384 annotated genes, 7114 of which are protein-coding). The count is so low because GenMAPP Builder is currently programmed to pick up only "ordered locus" data from the XML; most of the gene names are stored as "ORF" data, which explains why most of the data is absent from the export.
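
The size of the gap can be expressed as a coverage fraction. This is a small arithmetic sketch using only the three counts quoted in the note above; the variable and function names are illustrative.

```python
exported_ordered_locus = 337  # OrderedLocusNames IDs in the exported database
mod_annotated_genes = 7384    # annotated genes listed in the MOD
mod_protein_coding = 7114     # annotated genes in the MOD that are protein-coding

def coverage(found, expected):
    """Fraction of the expected genes that are present in the export."""
    return found / expected

print(f"vs. all annotated genes:  {coverage(exported_ordered_locus, mod_annotated_genes):.1%}")
print(f"vs. protein-coding genes: {coverage(exported_ordered_locus, mod_protein_coding):.1%}")
```

Against either denominator, the export covers less than 5% of the genes listed in the MOD.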



Brandon Litvak
BIOL 367, Fall 2015
