GÉNialOMICS Gene Database Testing Report (Build 3 Export)

1 Export Information
2 Using TallyEngine
3 Using XMLPipeDB match to Validate the XML Results from the TallyEngine
4 Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine
5 OriginalRowCounts Comparison
6 Visual Inspection
7 .gdb Use in GenMAPP
8 Compare Gene Database to Outside Resource

Export Information

Version of GenMAPP Builder: GenMAPP Builder Custom, Build 3

Computer on which the export was run: Home Workstation

Postgres Database name: B.cenocepacia_J2315_20151203_BUILD3_genialomics

UniProt XML filename: uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml

UniProt XML version: UniProt release 2015_11 - November 11, 2015
UniProt XML download link: UniProtKB link for the complete proteome of J2315
Time taken to import: 3.99 minutes
- Note: No issues were found with the import of this file.

GO OBO-XML filename: go_daily-termdb_GEN_BL12_20151119.obo-xml

GO OBO-XML version (derived from the date modified on the file, itself): Date Modified: 11/19/2015 2:24 AM
GO OBO-XML download link: Link from GO website
Time taken to import: 5.77 minutes
Time taken to process: 4.06 minutes
- Note: No issues were found with the import of this file.

GOA filename: 31277.B_cepacia_GEN_BL12_20151119.goa

GOA version: Date Modified: 11/10/15, 1:47:00 PM (information sourced from FTP site)
GOA download link: FTP site file
Time taken to import: 0.05 minutes
- Note: No issues were found with the import of this file.

Name of .gdb file: Bc-Std_GEN_Build3_20151203.gdb

Time taken to export: 4 hours 37 minutes
- Start time: 7:24 pm
- End time: 12:01 am
- Note: The file was exported without any major issues; however, the export took even longer than the initial export. It ran a little over 2 hours longer than the previous export, and this difference is suspected to be due to work being done on the computer while the export was running.
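
The reported export time can be checked from the start and end times. The following is a minimal sketch, assuming the export ended on the day after it started; the helper name `elapsed` is hypothetical, and the times are written in 24-hour form (7:24 pm = 19:24, 12:01 am = 00:01).

```python
from datetime import datetime, timedelta


def elapsed(start: str, end: str) -> timedelta:
    """Return end - start for two HH:MM clock times.

    Assumes the end falls less than 24 hours after the start,
    so an end time that is not later than the start is taken
    to be on the following day.
    """
    fmt = "%H:%M"
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(end, fmt)
    if t1 <= t0:
        t1 += timedelta(days=1)  # the export ran past midnight
    return t1 - t0


duration = elapsed("19:24", "00:01")
hours, remainder = divmod(int(duration.total_seconds()), 3600)
minutes = remainder // 60
print(f"{hours} hours {minutes} minutes")
```

The result agrees with the export time reported above.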

Using TallyEngine

PostgreSQL was started through pgAdmin III, and the database B.cenocepacia_J2315_20151203_BUILD3_genialomics was left running.
GenMAPP Builder was launched, and "Run XML and Database Tallies for UniProt and GO" was selected under the Tallies menu; the UniProt XML and GO files that had been imported were chosen.
Results of TallyEngine:
- Note: These results are identical to what was found in the initial export and in the export involving the second build of the modified GenMAPP Builder (see the Build 2 testing report). GenMAPP Builder was modified for Build 3 so that gene names would be collected from the ORF data rather than from the ordered locus data. Since the tallies did not change, it appears that there are errors in the program that prevent it from properly collecting and counting the "ORF" data in the XML file.

Using XMLPipeDB match to Validate the XML Results from the TallyEngine

The Windows command line was launched (cmd.exe)
The following command was entered at the command line in order to use XMLPipeDB match to verify the OrderedLocusNames count:
java -jar xmlpipedb-match-1.1.1.jar "p?BCA[L,S,M]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?" < "uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml"
- NOTE: Prior to executing the command, the folder that held the files and xmlpipedb-match-1.1.1.jar was entered through the Windows command line (a set of cd commands was used to reach the correct directory). The results were identical to what was found in the Build 2 export.
7126 unique matches were found through XMLPipeDB match
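
The idea behind this check, counting the distinct strings in the XML that match a gene ID pattern, can be sketched in Python. This is only an illustration, not the XMLPipeDB match implementation. The pattern below is a cleaned-up variant of the one used above (an assumption: the commas and the space inside the bracket expressions are dropped, since inside brackets they would match literal commas and spaces), and the sample text is made up for the example.

```python
import re

# Cleaned-up variant of the command-line pattern (assumption, see lead-in).
ORF_PATTERN = re.compile(r"p?BCA[LSM]?[0-9]{3}[Aa]?[0-9]?[A-Za-z]?")


def count_matches(text):
    """Return (total matches, unique matches) of ORF_PATTERN in text."""
    matches = ORF_PATTERN.findall(text)
    return len(matches), len(set(matches))


# Hypothetical fragment standing in for the UniProt XML file.
sample = (
    '<name type="ORF">BCAL0001</name>'
    '<name type="ORF">BCAL0001</name>'
    '<name type="ORF">pBCA055</name>'
)

total, unique = count_matches(sample)
print(total, unique)
```

On the real file, the text would be read from the XML export instead of the sample string; the unique count is the figure to compare against the TallyEngine result.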

Are your results the same as you got for the TallyEngine? Why or why not?

These results are very different from what was found through TallyEngine. As mentioned in the Week 14 assignment, the TallyEngine results only represent the gene name data in the XML that is tagged as "ordered locus", whereas the match command reflects the data found under the "ORF" tag. The ORF and ordered locus counts are very different, and this accounts for the difference in gene name counts between TallyEngine and XMLPipeDB match.

Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine

pgAdmin III was launched and all of the necessary connections were made.
It was realized that the gene/name tags in the XML file end up in the genenametype table (source: the wiki page regarding database quality analysis).
In pgAdmin III, the query select count(*) from genenametype where type = 'ordered locus' and value ~ 'BceJ2315_[0-9][0-9][0-9][0-9]'; was issued via the SQL Query menu in order to validate the TallyEngine count for "orderedlocusnames" for the PSQL database.
- 337 unique matches were found in pgAdmin III (postgres database results). This lines up with what was found in TallyEngine.
Once again, the query select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Z, a-z]?'; was run via SQL in order to verify the ORF counts (compared to the results that were found using XMLPipeDB match; see the Week 14 assignment).
- 7121 counts were found which lines up with what was found through XMLPipeDB match and through an analysis of the XML file
At this point, it was once again assumed that the data in the genenametype table of the PSQL database is identical to what was within the same table in the initial export PSQL database.
Are your results the same as reported by the TallyEngine? Why or why not?
- The "ordered locus" results are the same as what was reported by TallyEngine, since both focus on the same set of data. TallyEngine was modified to focus on the "ORF" data; however, it appears that there are issues preventing it from doing so.
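
The logic of the two queries above (filter rows by type, then count the values that match a regular expression) can be mirrored in Python for a quick cross-check. This is a sketch under assumptions: the rows below are invented stand-ins for the genenametype table, and `count_where` is a hypothetical helper. PostgreSQL's `~` operator matches anywhere in the value, so `re.search` is used rather than `re.fullmatch`.

```python
import re

# Invented (type, value) rows standing in for the genenametype table.
rows = [
    ("ordered locus", "BceJ2315_0001"),
    ("ORF", "BCAL0001"),
    ("ORF", "pBCA055"),
    ("ORF", "not_a_gene_id"),
]


def count_where(rows, name_type, pattern):
    """Count rows of the given type whose value contains a regex match."""
    regex = re.compile(pattern)
    return sum(1 for t, v in rows if t == name_type and regex.search(v))


ordered_locus_count = count_where(
    rows, "ordered locus", r"BceJ2315_[0-9][0-9][0-9][0-9]"
)
orf_count = count_where(
    rows, "ORF", r"p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Z, a-z]?"
)
print(ordered_locus_count, orf_count)
```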

OriginalRowCounts Comparison

The newly created J2315 .gdb file was opened with a program that is able to explore a .mdb file (such as Microsoft Access); in this case, MDB Viewer Plus was utilized.
Using the program, the OriginalRowCounts table was examined; it summarizes each of the tables within the database (including the number of rows/entries in each table).
OriginalRowCounts for Build 3 export of J2315
It was decided that a good reference or "benchmark" would be the database that was created using Build 2 of the customized GenMAPP Builder; comparing the two should show whether there was any difference in the imported data.
Benchmark .gdb file: compressed Bc-Std_Build2_GEN_BL14_20151201.gdb
OriginalRowCounts for the Build 2 export of J2315

Note: The OriginalRowCounts table in this export is mostly identical to the one from the Build 2 export. However, the OrderedLocusNames table differs between the two exports: the Build 3 export contains 7121 rows in the OrderedLocusNames table (7121 entries, the same as the number of ORF gene names in the XML), while the Build 2 export contained 337 rows. The fact that the Build 3 export now shows 7121 entries in that table indicates that the modified GenMAPP Builder (Build 3) is now using the ORF data; it appears, however, that it is labeling the "ORF" data, rather than the "ordered locus" data, as OrderedLocusNames. This observation does not completely agree with what was found earlier in the PSQL database, where the "ordered locus" gene names from the XML were still stored under the "ordered locus" type (and the "ORF" data are the 7121 gene names of interest). In conclusion, there appear to be issues with TallyEngine and GenMAPP Builder, such as TallyEngine not reporting the ORF data.
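
The comparison of the two OriginalRowCounts tables amounts to a per-table difference. A minimal sketch follows, assuming the row counts have been copied into dictionaries by hand; `diff_row_counts` is a hypothetical helper, and "HypotheticalTable" with its count is a placeholder, not a value from either export. Only the OrderedLocusNames counts come from this report.

```python
def diff_row_counts(benchmark, current):
    """Return {table: (benchmark rows, current rows)} for tables that differ.

    A table missing from one export is reported with None on that side.
    """
    changed = {}
    for table in sorted(set(benchmark) | set(current)):
        before = benchmark.get(table)
        after = current.get(table)
        if before != after:
            changed[table] = (before, after)
    return changed


# OrderedLocusNames counts are from this report; the other entry is a placeholder.
build2 = {"OrderedLocusNames": 337, "HypotheticalTable": 100}
build3 = {"OrderedLocusNames": 7121, "HypotheticalTable": 100}

changes = diff_row_counts(build2, build3)
print(changes)
```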

Visual Inspection

Perform visual inspection of individual tables to see if there are any problems.

Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
- Yes, there are dates present for GeneOntology, InterPro, GeneID, RefSeq, UniProt, EMBL, PDB, Pfam, OrderedLocusNames, and EnsemblBacteria.
Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
In the UniProt table, as before, only gene names of the type "ordered locus" are represented (there are no gene names that begin with something like "BCA"). The RefSeq table does not appear to have any problems. The OrderedLocusNames table now only contains gene names of the form p?BCA[L,M,S]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]; it appears that the "ORF" data replaced the "ordered locus" gene names in this table (these IDs appear to be in the correct and common form).

Note: The modifications to GenMAPP builder appear to have changed some of the data within the tables of the gene database (ORF gene names replacing "ordered locus" gene names, with respect to the OrderedLocusNames table).

.gdb Use in GenMAPP

Some of the protocol from Part 2 of the Vibrio cholerae Microarray Data Analysis was used as a reference for this portion of the assignment
Bc-Std_GEN_Build3_20151203.gdb was placed within the Gene Databases folder of the GenMAPP directory (the folder is within the GenMAPP 2 Data folder)
GenMAPP (Version 2.1) was launched
The new gene database was loaded by going into Data > Choose Gene Database
The tab-delimited, GenMAPP-formatted data sourced from the microarray paper was loaded into GenMAPP through Data > Expression Dataset Manager > Expression Datasets > New Dataset > GenMAPP formatted microarray data_GEN_B14_20151207.txt

Note: There were no glaring issues with loading the files into GenMAPP (no crashes). However, this gene database led to the detection of 284 errors in the loaded raw data; this new error count is significantly smaller than the 7251 errors that were detected using the previously exported database. Since this export incorporates the ORF data, it appears that the majority of the genes present in the microarray dataset are covered.

Putting a gene on the MAPP using the GeneFinder window

A test expression data-set was created in order to observe the behavior of GenMAPP with the exported database
GeneFinder was loaded by placing a blank Gene element on the drafting board of GenMAPP and right-clicking it.
The genes BCAL0001, BCAL0002, BCAM0005, and BCAS0105 were searched in the Gene ID box, with the Gene ID System set to OrderedLocusNames
- All genes were successfully found and reference pages with links successfully appeared

Note: All cross-referenced IDs were present for all of these sample Gene IDs. No crashing or issues at this step.

Creating an Expression Dataset in the Expression Dataset Manager

The IDs in the microarray dataset were imported into GenMAPP using the new database; there existed 284 exceptions.
The EX.txt file was opened in Excel, and the error code for all of the exceptions was found to be: Gene not found in OrderedLocusNames or any related system. The Gene IDs were sorted by error and the problematic IDs were analyzed. Using the find function, it was found that 101 of the exceptions were due to departures from the usual gene name format (these gene names contained underscores, Js, and numbers). The rest of the exceptions, according to UniProt KB searches, represented genes that are not present in the UniProt database. Several exceptions (BCAL2591, BCALr0080, BCAM0787, BCAM1951, BCASr0743a) were checked for their presence in UniProt KB or in the MOD:
- BCAL2591: No results in UniProt KB. Found in MOD; gene has no product.
- BCALr0080: No results in UniProt KB. Found in MOD; product: tRNA-Arg.
- BCAM0787: No results in UniProt KB. No results in MOD.
- BCAM1951: No results in UniProt KB. Found in MOD; gene has no product.
- BCASr0743a: No results in UniProt KB. No results in MOD.
Note: The exceptions file contained error-inducing genes that either lack a known product (protein/functional RNA), lack a MOD entry, or code for functional RNA (such as tRNA). Some gene names with unusual formatting (BCAL0563_J_0 and BCAL0563_J_1, for example) were found to represent genes that are covered by the MOD/UniProt (these entries were found by removing the unusual underscores/letters and searching the "fixed" gene names).
Excel Workbook utilized in visualizing the GenMAPP exceptions
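
The manual triage described above (separating IDs with unusual formatting from the rest, then "fixing" them before searching) could be sketched as follows. This assumes that an underscore is what marks an unusually formatted ID and that the base gene name is the part before the first underscore; both are simplifications of what was done by hand in Excel, and the helper names are hypothetical.

```python
def split_exceptions(gene_ids):
    """Separate IDs with unusual formatting (an underscore) from the rest."""
    unusual = [g for g in gene_ids if "_" in g]
    regular = [g for g in gene_ids if "_" not in g]
    return unusual, regular


def strip_suffix(gene_id):
    """Return the part of the ID before the first underscore."""
    return gene_id.split("_", 1)[0]


# Example IDs taken from the exceptions discussed above.
exceptions = [
    "BCAL2591", "BCALr0080", "BCAM0787", "BCAM1951", "BCASr0743a",
    "BCAL0563_J_0", "BCAL0563_J_1",
]

unusual, regular = split_exceptions(exceptions)
fixed = sorted({strip_suffix(g) for g in unusual})
print(unusual, fixed)
```

The "fixed" names are the ones that would then be looked up in UniProt KB or the MOD.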

Running MAPPFinder

Protocol sourced from the week 8 assignment
The MAPPFinder program was launched within GenMAPP (Tools > MAPPFinder)
"Calculate New Results" was clicked in the window that appeared by launching MAPPFinder
For "Find File", the Expression Dataset file (with a .gex extension) was selected, and OK was clicked
The Test criteria was selected
The boxes corresponding to "Gene Ontology" were checked
"Browse" button was clicked to add a name to the file that will be created
"Run MAPPFinder" was clicked and the program was allowed to complete its analysis

Note: MAPPFinder successfully loaded and provided an output with this gene database.

Compare Gene Database to Outside Resource

Outside Resource: Burkholderia Genome DB, UniProt KB

The strain page for J2315 was looked up: [1]
7121 OrderedLocusNames were found within the exported gdb file. 6,994 entries corresponding to protein-encoding genes were found in UniProt KB, and 7114 coding sequences were found in the MOD (http://beta.burkholderia.com/strain/show/146). The count of 7121 (ORF data) is much closer to what is present in the outside resources than the count of 337 ("ordered locus" data). The differences in count between UniProt and the gdb and MOD could result from UniProt covering only genes that code for protein (some of the coding sequences present in the MOD, or within the gdb, could correspond to functional RNA, which is not covered by UniProt).

Note: The exported database now seems more in line with what is to be expected of the genome of B. cenocepacia; the current OrderedLocusNames counts (which actually represent ORF counts) are very close to the counts reported by the MOD and by UniProt.
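
The gaps between these counts can be made explicit with a short calculation. The sketch below uses only the counts quoted in this section; the dictionary keys are descriptive labels chosen for the example.

```python
# Counts quoted in this section.
gdb_orf_count = 7121            # OrderedLocusNames rows in the Build 3 gdb
gdb_ordered_locus_count = 337   # "ordered locus" count from earlier exports

outside = {
    "UniProt KB protein-encoding entries": 6994,
    "MOD coding sequences": 7114,
}

# Difference between each gdb count and each outside resource.
orf_gaps = {name: gdb_orf_count - n for name, n in outside.items()}
ordered_locus_gaps = {
    name: gdb_ordered_locus_count - n for name, n in outside.items()
}

for name in outside:
    print(name, orf_gaps[name], ordered_locus_gaps[name])
```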


Brandon Litvak
BIOL 367, Fall 2015
