GÉNialOMICS Gene Database Testing Report (Build 4 Export)

From LMU BioDB 2015

Export Information

Version of GenMAPP Builder: GenMAPP Builder Custom, Build 4

Computer on which the export was run: Home Workstation

Postgres Database name: B.cenocepacia_J2315_20151204_BUILD4_genialomics

UniProt XML filename: uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml

  • UniProt XML version: UniProt release 2015_11 - November 11, 2015
  • UniProt XML download link: UniProtKB link for the complete proteome of J2315
  • Time taken to import: 3.46 minutes
    • Note: Time taken appears to be slightly shorter than previous exports.

GO OBO-XML filename: go_daily-termdb_GEN_BL12_20151119.obo-xml

  • GO OBO-XML version (derived from the file's modification date): Date Modified: 11/19/2015 2:24 AM
  • GO OBO-XML download link: Link from GO website
  • Time taken to import: 5.05 minutes
  • Time taken to process: 3.75 minutes
    • Note: Time taken appears to be slightly shorter than previous exports.

GOA filename: 31277.B_cepacia_GEN_BL12_20151119.goa

  • GOA version: Date Modified: 11/10/15, 1:47:00 PM (information sourced from FTP site)
  • GOA download link: FTP site file
  • Time taken to import: 0.04 minutes
    • Note: No issues were found with the import of this file.

Name of .gdb file: Bc-Std GEN Build4 20151204.gdb

  • Time taken to export: 11 hours 6 minutes
    • Start time: 7:51 am
    • End time: 6:57 pm
    • Note: The file was exported without any major issues; however, the export took significantly longer than previous exports. The most likely cause is that the workstation entered "sleep" mode for some period of time, which paused the export until the computer was woken.
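The reported export duration can be checked against the recorded start and end times. The snippet below is a minimal sketch of that arithmetic; the helper name `elapsed` is hypothetical, and it assumes both times fall on the same day.

```python
from datetime import datetime

def elapsed(start, end, fmt="%I:%M %p"):
    """Return (hours, minutes) between two clock times on the same day."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    total_minutes = int(delta.total_seconds() // 60)
    return divmod(total_minutes, 60)

hours, minutes = elapsed("7:51 AM", "6:57 PM")
print(f"{hours} hours {minutes} minutes")  # 11 hours 6 minutes
```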

Using TallyEngine

  • PostgreSQL was initialized through pgAdmin III and the database B.cenocepacia_J2315_20151204_BUILD4_genialomics was left running
  • GenMAPP Builder was launched and "Run XML and Database Tallies for UniProt and GO" was selected under the Tallies menu item; the UniProt XML and GO files that had been imported were chosen
  • Results of TallyEngine:
  • Build4tallyengine results GEN BL14 20151204.png
    • Note: These results differ significantly from those of previous exports. The 337 Ordered Locus gene names are now distinct from the 7121 ORF gene names, and TallyEngine reports them separately. All of the counts related to external references (like UniProt) remain the same. The crucial change is the inclusion and representation of the ORF data.

Using XMLPipeDB match to Validate the XML Results from the TallyEngine

  • The Windows command line was launched (cmd.exe)
  • The following command was entered into the command line in order to use XMLPipeDB match to verify the OrderedLocusNames count:
  • java -jar xmlpipedb-match-1.1.1.jar "p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?" < "uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml"
    • NOTE: Before the command was executed, the directory that held the XML file and xmlpipedb-match-1.1.1.jar was entered through the Windows command line (using a series of cd commands). The results were identical to what was found in the Build 2 export.
  • XmlpipedbmatchOUTPUT GEN BL14 20151201.png
  • 7126 unique matches were found through XMLPipeDB match
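As a rough illustration of what the Match step counts, the sketch below applies the same pattern to a small, made-up XML fragment and collects the unique hits. It is not the XMLPipeDB match implementation. The case-insensitive matching and lowercasing are assumptions, suggested only by the lowercase IDs reported in this report, and the sample fragment and the helper name `unique_matches` are hypothetical.

```python
import re

# Same pattern as passed to xmlpipedb-match above.
# Case-insensitive matching is an assumption, not confirmed behavior of the tool.
PATTERN = re.compile(
    r"p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?", re.IGNORECASE
)

def unique_matches(text):
    """Return the set of distinct (lowercased) strings matching PATTERN."""
    return {m.group(0).lower() for m in PATTERN.finditer(text)}

# Hypothetical fragment: two gene names (one repeated) and a checksum
# attribute that happens to contain a pattern-like substring.
sample = (
    '<name type="ORF">BCAL0001</name>'
    '<name type="ORF">BCAM0005</name>'
    '<name type="ORF">BCAL0001</name>'
    '<sequence checksum="1A2BCA199F77"/>'
)

matches = unique_matches(sample)
print(len(matches), sorted(matches))
```

The checksum hit in this toy input shows how a pattern applied to the whole file, rather than to gene name elements only, can pick up strings that are not gene names.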

Are your results the same as you got for the TallyEngine? Why or why not?

  • These results vary slightly from what was found by TallyEngine due to the presence of 5 discrepant IDs
  • The discrepant IDs, previously identified, are: bca199f, bca5253f, bca636c, bcad837b, bcal0235a, and bcal0239a
  • bca199f, bca5253f, bca636c, and bcad837b were found within strings of letters and numbers labeled "checksum"; these appear to have been captured accidentally by the Match pattern.
  • bcal0235a and bcal0239a follow the previously identified gene name pattern; however, they both appear as database reference IDs (references to STRING, a database of known and predicted protein interactions). These data will be ignored, as they do not refer to a UniProt entry.
  • Excluding these 5 accidental matches, the results found using the Match utility are the same as what was found using TallyEngine
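One way to isolate discrepant IDs like those discussed above is a set difference between the IDs matched in the XML file and the IDs stored in the database. The sketch below uses small hypothetical ID lists, not the actual export data; the function name `discrepant_ids` is also hypothetical.

```python
def discrepant_ids(xml_matches, database_ids):
    """Return IDs found by the XML match but absent from the database, sorted."""
    return sorted(set(xml_matches) - set(database_ids))

# Hypothetical inputs for illustration only.
xml_matches = ["bcal0001", "bcam0005", "bca199f", "bcal0001"]
database_ids = ["bcal0001", "bcam0005"]

extra = discrepant_ids(xml_matches, database_ids)
print(extra)  # ['bca199f']
```

Each ID in the result can then be searched for in the XML file to see which element or attribute it came from.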

Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine

  • pgAdmin III was booted and all of the necessary connections were made
  • In pgAdmin III, the query select count(*) from genenametype where type = 'ordered locus' and value ~ 'BceJ2315_[0-9][0-9][0-9][0-9]'; was issued via the SQL Query menu in order to validate the TallyEngine count for "Ordered Locus" in the PostgreSQL database.
    • 337 unique matches were found in pgAdmin III (postgres database results). This lines up with what was found in TallyEngine.
  • Additionally, the query select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?'; was run in order to verify the ORF count
    • A count of 7121 was returned, which is identical to what was found through XMLPipeDB match (ignoring the discrepant IDs) and to what was reported by TallyEngine (for the ORF data).
  • Are your results the same as reported by the TallyEngine? Why or why not?
    • The results are the same as what was reported by TallyEngine; the most recent build incorporated code fixes that allow GenMAPP Builder and TallyEngine to properly include the ORF data in their analysis.
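The two count queries above can be tried out without a PostgreSQL server. The sketch below swaps in an in-memory SQLite database, with a user-defined REGEXP function standing in for PostgreSQL's `~` operator (both perform an unanchored, case-sensitive regular expression search). The table contents are hypothetical rows, not the exported data, and the helper `count_names` is an illustrative name.

```python
import re
import sqlite3

def regexp(pattern, value):
    # SQLite evaluates "value REGEXP pattern" by calling regexp(pattern, value).
    return int(value is not None and re.search(pattern, value) is not None)

conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.execute("CREATE TABLE genenametype (type TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO genenametype VALUES (?, ?)",
    [
        ("ordered locus", "BceJ2315_0001"),  # hypothetical rows
        ("ordered locus", "BceJ2315_0002"),
        ("ORF", "BCAL0001"),
        ("ORF", "BCAM0005"),
        ("ORF", "pBCA001"),
        ("ORF", "not-a-gene"),
    ],
)

def count_names(name_type, pattern):
    """Count rows of the given gene name type whose value matches the pattern."""
    row = conn.execute(
        "SELECT count(*) FROM genenametype WHERE type = ? AND value REGEXP ?",
        (name_type, pattern),
    ).fetchone()
    return row[0]

print(count_names("ordered locus", r"BceJ2315_[0-9][0-9][0-9][0-9]"))
print(count_names("ORF", r"p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?"))
```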

OriginalRowCounts Comparison

  • The newly created J2315 .gdb file was opened with a program that is able to explore a .mdb file (such as Microsoft Access); in this case, MDB Viewer Plus was utilized.
  • Using the program, the OriginalRowCounts table was examined; it summarizes each of the tables within the database (and the number of rows/entries in each table)
  • OriginalRowCounts for Build 4 export of J2315
    • Build4OriginalRowCounts GEN BL14 20151204.png
  • The database created using Build 3 of the customized GenMAPP Builder was chosen as the reference or "benchmark"; comparing the two should bring to light any issues or differences that result from using an updated version of the modified GenMAPP Builder.
  • Benchmark .gdb file: compressed Bc-Std_GEN_Build3_20151203.gdb
  • OriginalRowCounts for the Build 3 export of J2315
    • Build3OriginalRowCounts GEN BL14 20151203.png

Note: The OriginalRowCounts table in this export is identical to the one from the Build 3 export. This seems to suggest that the only fundamental difference between the two builds of GenMAPP Builder lies with TallyEngine (this makes sense, considering that Build 4 focused on fixing problems with TallyEngine and improper code).
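A comparison like this can also be done programmatically once the two OriginalRowCounts tables have been copied out as table-name/row-count pairs. The sketch below is one possible approach; the function name `diff_row_counts`, the table names, and the counts are hypothetical placeholders, not values from either export.

```python
def diff_row_counts(benchmark, candidate):
    """Return {table: (benchmark_count, candidate_count)} for tables that differ.

    A table missing from one side is reported with None for that side.
    """
    tables = sorted(set(benchmark) | set(candidate))
    return {
        t: (benchmark.get(t), candidate.get(t))
        for t in tables
        if benchmark.get(t) != candidate.get(t)
    }

# Placeholder counts for illustration only.
build3 = {"Info": 1, "Systems": 30, "UniProt": 100}
build4 = {"Info": 1, "Systems": 30, "UniProt": 100}

print(diff_row_counts(build3, build4))  # {} means the tables are identical
```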

Visual Inspection

Perform visual inspection of individual tables to see if there are any problems.

  • Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
    • Yes, there are dates present for GeneOntology, InterPro, GeneID, RefSeq, UniProt, EMBL, PDB, Pfam, OrderedLocusNames, and EnsemblBacteria.
  • Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
    • In the UniProt table, as before, only gene names of the type "ordered locus" are represented (there are no gene names that begin with something like "BCA"). The RefSeq table does not appear to have any problems. The OrderedLocusNames table, as in Build 3, only contains gene names of the form p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?; it appears that the "ORF" data replaced the "ordered locus" gene names in this table (these IDs appear to be in the correct and common form).

Note: Visually, no changes seem apparent between the Build 3 and Build 4 export.

.gdb Use in GenMAPP

  • Some of the protocol from Part 2 of the Vibrio cholerae Microarray Data Analysis was used as a reference for this portion of the assignment
  • Bc-Std_GEN_Build4_20151204.gdb was placed within the Gene Databases folder of the GenMAPP directory (the folder is within the GenMAPP 2 Data folder)
  • GenMAPP (Version 2.1) was launched
  • The new gene database was loaded by going into Data > Choose Gene Database
  • The tab-delimited, GenMAPP-formatted data sourced from the microarray paper was loaded into GenMAPP through Data > Expression Dataset Manager > Expression Datasets > New Dataset > GenMAPP formatted microarray data_GEN_B14_20151207.txt

Note: There were no glaring issues with loading the files into GenMAPP (no crashes). However, this gene database led to the detection of 284 errors in the loaded raw data; this error count is identical to what was seen with the build 3 export.

Putting a gene on the MAPP using the GeneFinder window

  • A test expression data-set was created in order to observe the behavior of GenMAPP with the exported database
  • GeneFinder was loaded by placing a blank Gene element on the drafting board of GenMAPP and right-clicking it.
  • The genes BCAL0001, BCAL0002, BCAM0005, and BCAS0105 were searched in the Gene ID box, with the Gene ID System set to OrderedLocusNames
    • All genes were successfully found and reference pages with links successfully appeared

Note: All cross-referenced IDs were present for all of these sample Gene IDs. No crashing or issues at this step.

Creating an Expression Dataset in the Expression Dataset Manager

  • The IDs in the microarray dataset were imported into GenMAPP using the new database; there were 284 exceptions.
  • The EX.txt file was opened in Excel, and the exceptions were found to be identical to those from the Build 3 export.

Exceptions Analysis

  • Note: This analysis is sourced from the Build 3 Export
  • The EX.txt file was opened in Excel, and the error code for all of the exceptions was found to be: Gene not found in OrderedLocusNames or any related system. The Gene IDs were sorted by error and the problematic IDs were analyzed. Using the find function, 101 of the exceptions were found to be due to alterations in the usual formatting of the gene name (these gene names contained underscores, Js, and numbers). The rest of the exceptions, according to UniProt KB searches, represented genes that are not present in the UniProt database. Several exceptions (BCAL2591, BCALr0080, BCAM0787, BCAM1951, BCASr0743a) were checked for their presence in UniProt KB or in the MOD:
    • BCAL2591: No results in UniProt KB. Found in MOD; gene has no product.
    • BCALr0080: No results in UniProt KB. Found in MOD; product: tRNA-Arg.
    • BCAM0787: No results in UniProt KB. No results in MOD.
    • BCAM1951: No results in UniProt KB. Found in MOD; gene has no product.
    • BCASr0743a: No results in UniProt KB. No results in MOD.
  • Note: The exceptions file contained error-inducing genes that either lack a known product (protein/functional RNA), lack a MOD entry, or code for functional RNA (such as tRNA). Some gene names with unusual formatting (BCAL0563_J_0 and BCAL0563_J_1, for example) were found to represent genes that are covered by the MOD/UniProt (these entries were found by removing the unusual underscores/letters and searching for the "fixed" gene names).
  • Excel Workbook utilized in visualizing the GenMAPP exceptions
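The manual sorting of exception IDs described above could be approximated with a small script. The sketch below is an assumption-laden illustration, not the procedure that was used: it treats an ID as "usual" when the whole ID fits the gene name pattern from this report, and as "suffixed" when the part before the first underscore fits. The function name `classify` and the category labels are hypothetical.

```python
import re

# Gene name pattern used elsewhere in this report, applied to the whole ID.
USUAL = re.compile(r"p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?")

def classify(gene_id):
    """Return (category, base_id) for an exception ID.

    'usual'    : the ID already has the usual form
    'suffixed' : the text before the first underscore has the usual form
    'other'    : anything else (for example, IDs with an 'r' before the digits)
    """
    if USUAL.fullmatch(gene_id):
        return ("usual", gene_id)
    base = gene_id.split("_", 1)[0]
    if "_" in gene_id and USUAL.fullmatch(base):
        return ("suffixed", base)
    return ("other", gene_id)

for gene_id in ["BCAL2591", "BCAL0563_J_0", "BCAL0563_J_1", "BCALr0080"]:
    print(gene_id, classify(gene_id))
```

IDs in the "suffixed" group can be searched in UniProt KB or the MOD by their base ID, as was done manually for BCAL0563_J_0 and BCAL0563_J_1.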

Running MAPPFinder

  • Protocol sourced from the week 8 assignment
  • The MAPPFinder program was launched within GenMAPP (Tools > MAPPFinder)
  • "Calculate New Results" was clicked in the window that appeared by launching MAPPFinder
  • For "Find File", the Expression Dataset file (with a .gex extension) was selected, and OK was clicked
  • The "Test" criterion was selected
  • The boxes corresponding to "Gene Ontology" were checked
  • The "Browse" button was clicked to name the output file that will be created
  • "Run MAPPFinder" was clicked and the program was allowed to complete its analysis

Note: MAPPFinder successfully loaded and provided an output with this gene database.

Compare Gene Database to Outside Resource

Outside Resource: Burkholderia Genome DB, UniProt KB

  • The strain page for J2315 was looked up: [1]
  • 7121 OrderedLocusNames were found within the exported gdb file. 6994 entries corresponding to J2315 proteins were found in UniProt KB, and 7114 coding sequences were found in the MOD. The count of 7121 genes in the exported database appears to make sense in light of the MOD and UniProt data. Since UniProt is protein-centric, the count of 6994 corresponds only to proteins; it is likely that some proteins have several related gene names, which would explain why more gene names were found than proteins. The MOD was found, earlier, to be manually curated, and it is possible that the difference between the MOD count of 7114 and the exported count of 7121 is due to the MOD missing a few genes that are present in other databases, like UniProt.
  • Note: The IDs and counts covered by this export appear to be consistent with outside resources.
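The size of the gaps discussed above follows directly from the three reported counts. The snippet below is just that arithmetic; the variable names are illustrative.

```python
gdb_count = 7121      # OrderedLocusNames in the exported .gdb file
uniprot_count = 6994  # J2315 protein entries in UniProt KB
mod_count = 7114      # coding sequences in the MOD

print(gdb_count - uniprot_count)  # 127 more gene names than UniProt entries
print(gdb_count - mod_count)      # 7 more gene names than MOD coding sequences
```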


Brandon Litvak
BIOL 367, Fall 2015
