Bklein7 Week 9

Files Asked for in the Gene Database Testing Report

For convenience, all of the files explicitly asked for in the "Gene Database Testing Report" section were compressed together in this file: File:Bklein7 Week9 Files.zip.

Pre-requisites

The following set of software was used in the creation and testing of the Vibrio cholerae gene database:

7-ziptool that for unpacking .gz and .zip files
PostgreSQL on Windows (version 9.4.x)
GenMAPP Builder
Java JDK 1.8 64-bit
GenMAPP 2
XMLPipeDB match utility for counting IDs in XML files
Microsoft Access for reading .mdb files

Gene Database Creation

Downloading Data Source Files and GenMAPP Builder

I download the UniProt XML, GOA, and GO OBO-XML files for Vibrio cholerae along with the GenMAPP Builder program.
- All files were saved to the folder Bklein7_Week9 on my computer's ThawSpace.
- Files that required extraction were unzipped using 7-zip.
- Data files that remained in a folder after unzipping were removed from their folders to facilitate organization and command line processing.

UniProt XML

I went to the UniProt Complete Proteomes page.
- From there, I navigated to the complete proteome download page for Vibrio cholerae serotype O1 (strain ATCC 39315 / El Tor Inaba N16961).
- I clicked on the "Download" button at the top of the page above and selected the following options:
  - "Download all"
  - "XML" from the "Format" drop-down menu
  - "Compressed" format
- I extracted the file using 7-zip.

GOA

UniProt-GOA files can be downloaded from the UniProt-GOA ftp site.
Within the above site, I navigated to the GO annotations for Vibrio cholerae.
- This text file was automatically opened by my browser. Therefore, I had to manually download the file.

GO OBO-XML

I downloaded the GO OBO-XML formatted file from the Gene Ontology legacy download page.
I extracted the file using 7-zip.

Downloaded GenMAPP Builder

I downloaded the GenMAPP Builder zip folder: Download gmbuilder-3.0.0-build-5.zip.
I extracted the GenMAPP Builder folder using 7-zip.

Creating the New Database in PostgreSQL

I launched pgAdmin III and connected to the PostgreSQL 9.4 server (localhost:5432).
- On this server, I created a new database: V.cholerae_20151027_gmb3build5.
- I opened the SQL Editor tab to use an XMLPipeDB query to create the tables in the database.
  - I clicked on the Open File icon and selected the file gmbuilder.sql. This imported a series of SQL commands into the editor tab.
  - I clicked on the Execute Query icon to run this command.
  - In viewing the schema for this database, I confirmed that there were 167 tables after running the above command.

Configuring GenMAPP Builder to Connect to the PostgreSQL Database

To begin, I launched gmbuilder.bat.
I selected the "Configure Database" option and entered the following information into the fields below:
- Host or address: localhost
- Port number: 5432
- Database name: V.cholerae_20151027_gmb3build5
- Username: postgres
- Password: Welcome1

Importing Data into the PostgreSQL Database

The downloaded data files for Vibrio cholerea were specified and imported into the database by clicking on the following buttons:
- Selected File > Import UniProt XML...
- Selected File > Import GO OBO-XML...
- Clicked OK to the message asking to process the GO data.
- Selected File > Import GOA...

Exporting a GenMAPP Gene Database (.gdb)

I selected File > Export to GenMAPP Gene Database... to begin the export process.
I typed my name in the owner field (Brandon Klein).
I selectedVibrio cholerae as the gene database species and then clicked Next.
The database was saved as Vc-Std_BK_20151027.
I checked the boxes for exporting all Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms.
Finally, I clicked the "Next" button to begin the export process.

Gene Database Testing Report

Export Information

Version of GenMAPP Builder: Version 3.0.0 Build 5

Computer on which export was run: Seaver 120- Last computer on the right in the row closest to the front of the room

Postgres Database name: V.cholerae_20151027_gmb3build5

UniProt XML filename: uniprot-organism-243277_BK_20151027.xml

UniProt XML version: UniProt release 2015_10 (version information can be accessed on the UniProt News Page)
UniProt XML download link: http://www.uniprot.org/uniprot/?query=organism:243277
Time taken to import: 2.92 minutes
- Note: The import time was similar to my homework partner's, and no interruptions occurred during this process.

GO OBO-XML filename: go_daily-termdb_BK_20151027.obo-xml

GO OBO-XML version: File last modified on ‎Tuesday, ‎October ‎27, ‎2015, ‏‎2:24:02 AM (accessed from the file properties window)
GO OBO-XML download link: http://geneontology.org/page/download-ontology#Legacy_Downloads
Time taken to import: 6.88 minutes
Time taken to process: 4.49 minutes
- Note: The import and processing times were once again similar to my homework partner's, and no interruptions occurred during these processes.

GOA filename: 46.V_cholerae_ATCC_39315_BK_20151027.goa

GOA version: File last modified on Wednesday, October 14, 2015 (accessed on the UniProt-GOA news page)
GOA download link: http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/46.V_cholerae_ATCC_39315.goa
Time taken to import: 0.06 minutes
- Note: This import process was equally quick for my homework partner and presented no complications.

Name of .gdb file: Vc-Std_BK_20151027.gdb

Time taken to export: 1 hour, 27 minutes
- Start time: 4:44 PM (restarted export after class)
- End time: 6:11 PM
- Note: Upon checking on my export progress after class, I found out that the export had prematurely stopped. Therefore, I restarted the process. When I returned to the S120 computer lab on 10/29, the export had completed successfully and all windows remained open.

TallyEngine

I ran the Tally Engine in GenMAPP Builder to process the number of records for UniProt and GO in the XML data and in the Postgres databases. This was done to verify that the XML data transferred consistently into the PostgreSQL database.
- I selected the menu item Tallies > Run XML and Database Tallies for UniProt and GO... to run the Tally Engine. The results are pictured below:
- - XML and database counts were consistent among all assessed data types.
  - Note: the "Ordered Locus" count was 3831.

Using XMLPipeDB match to Validate the XML Results from the TallyEngine

On the command line (cmd on Windows), I entered the folder Bklein7_Week9 to access the XML data files. The first match I ran was designed to identify matches of any ordered locus name following the pattern VC_#### in the UniProt XML file. The command sequence used is as follows:

java -jar xmlpipedb-match-1.1.1.jar "VC_[0-9][0-9][0-9][0-9]" < "uniprot-taxonomy%3A243277.xml" >       
OrderedLocusNames.txt

The results of this match are pictured below.

The number of unique matches yielded by XMLPipeDB Match, 2738, was significantly lower than the Tally Engine output of 3831 Ordered Locus entries.
I investigated this discrepancy by opening the gene database file Vc-Std_BK_20151027.gdb in Microsoft Access. Within Access, I opened the "OrdereLocusNames" table.
- Quickly skimming through this tablerevealed a large number of Ordered Locus entries that did not follow the pattern VC_####. Instead, these entries followed the pattern VC_A####. Therefore, I edited the pattern I input into XMLPipeDB Match.

The second match I ran accounted for the issues above. The command sequence used is as follows:

java -jar xmlpipedb-match-1.1.1.jar "VC_A?[0-9][0-9][0-9][0-9]" < "uniprot-taxonomy%3A243277.xml"   
> OrderedLocusNames.txt

The results of this match are pictured below.

With this updated pattern, XMLPipeDB yielded 3831 unique matches. This matched the Tally Engine results.
- Upon accounting for the alternate Ordered Locus entry pattern, XMLPipeDB validated the results of Tally Engine by calculating the same number of entries in the source data files used in the SQL database.

Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine

The gene/name tags in the XML file land in the genenametype table, which contains ordered locus values. As in XMLPipeDB Match, I started with a SQL query designed to match the pattern VC_####:

select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_[0-9][0-9][0-9][0-9]';

The results of this query are pictured below.

The number of unique matches yielded by this SQL query, 2737, was significantly lower than the Tally Engine output of 3831 Ordered Locus entries. Further, it had one less entry than the original XMLPipeDB query.
- As with the first XMLPipeDB match, the main issue here was the fact that I did not account for the occasional presence of the letter "A" in the Ordered Locus entries.

I ran a second SQL query to include the pattern VC_A#### in the count output:

select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_A?[0-9][0-9][0-9][0-9]';

The results of this query are pictured below.

With the secondary pattern included, this SQL query yielded 3831 unique matches. This matched the Tally Engine results as well as the updated XMLPipeDB Match results.
- These findings further validated the Tally Engine results.

OriginalRowCounts Comparison

I opened the gene database file Vc-Std_BK_20151027.gdb in Microsoft Access to see if the the expected tables were listed with the expected number of records. This information was accessed in the OriginalRowCounts table.

A listing named "OrderedLocusNames" was present.
The table reported 7664 distinct Ordered Locus Names.
- This value is equal to (3831*2)+1

Benchmark .gdb file: File:Vc-Std External 20101022.gdb

As a sanity check, I compared the information in this table to the OriginalRowCounts table present in the 2010 Vibrio cholerae gene database that we worked with last week.

The "OrderedLocusNames" listing was also present here.
The table reported the same number of distinct Ordered Locus Names: 7664.
Although the above value was the same, row counts for various most of the other tables such as "GeneOntologyTree" dramatically increased between the 2010 and 2015 databases.
Overall, 10 more tables were listed in the 2015 database.

OriginalRowCounts table from the benchmark and new gdb:

Note: The OriginalRowCounts tables were too large to screenshot. To circumvent this problem and facilitate the comparison, I copied the OriginalRowCounts tables from both gene databases into an Excel file and zoomed out. The above screenshot was taken from this Excel file. The "OrderedLocusNames" row counts are highlighted in yellow.

Visual Inspection

I visually inspected individual tables within the gene database using Microsoft Access to see if there were any problems.

First, I looked at the Systems table. This table includes columns detailing 35 different gene ID systems from which data was accessed.
- Of the 35 gene ID systems, only 11 have dates listed in the date column.
  - Those that do have values entered here all share the same date: 10/27/2015. This was the day on which the gene database was created, suggesting that these entries serve as markers for when the raw data was accessed from the gene ID systems' sites.
  - The fact that the other 24 gene ID systems do not have dates listed is problematic, as older versions of the data stored on these systems may have been accessed without this verification stamp.
Next, I opened the UniProt, RefSeq, and OrderedLocusNames tables to see if the listed IDs were listed in the correct forms.
- UniProt
  - The UniProt ID column includes accession numbers from the UniProt gene entries. The different possibilities for UniProt accession numbers are detailed on the UniProt website here.
  - All ID's in the UniProt table conform to the following pattern (originally found on the page linked to above):
  - Thus, there are no issues with these 3789 gene IDs.
- RefSeq
  - The RefSeq ID's are series of numbers with one of three prefixes: "NP_", "YP_", and "WP_". The meanings of these prefixes can be found in the RefSeq documentation found here.
  - "NP_" and "YP_" Prefixes
    - Refer to proteins. There are 3329 ID's with these prefixes (3328 for NP_ and 1 for YP_).
  - "WP_" Prefixes
    - Refer to " autonomous non-redundant proteins that are not yet directly annotated on a genome". There were 3328 ID's with these prefixes.
  - Overall, every entry in the ID column was an expected value.
- OrderedLocusNames
  - 7664 ID entries were present that fell into 4 different patterns:
    - VC_####
    - VC_A####
    - VC####
    - VCA####
  - This ID column was analyzed in Microsoft Excel to investigate why there were 7664 entries present (as opposed to the consistent count of 3831 in SQL and XMLPipeDB Match).
    - The analysis in Excel revealed that the variations of the the gene IDs with and without underscores were replicates of one another. Presumably, these gene IDs were duplicated during the process of exporting the gene database to account for a formatting issue pertaining to underscores. Regardless, this discovery means that only half of the 7664 entries were unique gene IDs. This leaves us with 3832 unique IDs, which is still one more than was expected.
    - Further analysis in Excel revealed that both the XML and SQL gene ID's were off by one due to errors, resulting in the erroneous count of 3831. The XML entries did not include the gene ID VC_A0360.1 (the period presented a formatting issue). The SQL entries did include VC_A0360.1, but an error occurred where two gene IDs were merged into one: VC_1738/VC_1739. The overall number of unique IDs is 3832 when these errors errors are taken into consideration.

Note: The Excel analysis performed above was done following the procedure present on this page- Using Microsoft Excel to Compare ID Lists. The Excel file I used for the analysis itself can be accessed here: File:OrderedLocusNames InvestigationBK1029.xlsx.

Vc-Std_BK_20151027.gdb Use in GenMAPP

Note: The ability of Vc-Std_BK_20151027.gdb to be processed was tested in GenMAPP Version 2.1. Before conducting any of the testing below, Vc-Std_BK_20151027.gdb was selected as the "Gene Database" withing GenMAPP.

Putting a gene on the MAPP using the GeneFinder window

To figure out how to perform this function, I referenced the "Gene Finder" tab within the "GenMAPP Help" window.

The following sample IDs were taken from different gene ID systems and tested:
- UniProt- Q9KQM0
  - The Backpage for this gene included links to the following gene ID systems: UniProt, GeneOntology, OrderedLocusNames, EnsemblBacteria, EMBL, InterPro, GeneID, Pfam, and RefSeq.
  - All expected cross-referenced IDs were present.
- RefSeq- NP_230721
  - The Backpage for this gene included links to the following gene ID systems: UniProt, GeneOntology, OrderedLocusNames, EnsemblBacteria, EMBL, InterPro, GeneID, Pfam, and RefSeq.
  - All expected cross-referenced IDs were present.
- OrderedLocusNames- VC_0274
  - The Backpage for this gene included links to the following gene ID systems: UniProt, GeneOntology, OrderedLocusNames, EnsemblBacteria, EMBL, InterPro, GeneID, Pfam, and RefSeq.
  - All expected cross-referenced IDs were present.
- GeneID- 2612551
  - The Backpage for this gene included links to the following gene ID systems: UniProt, GeneOntology, OrderedLocusNames, EnsemblBacteria, EMBL, InterPro, GeneID, Pfam, and RefSeq.
  - All expected cross-referenced IDs were present.
- EnsemblBacteria- AAF96503
  - The Backpage for this gene included links to the following gene ID systems: UniProt, GeneOntology, OrderedLocusNames, EnsemblBacteria, EMBL, InterPro, GeneID, Pfam, and RefSeq.
  - All expected cross-referenced IDs were present.
Screenshot of all of the sample ID's on a MAPP:

Note: Gene IDs tested from the above gene ID systems all had complete Backpages and were successfully placed on the MAPP.

Creating an Expression Dataset in the Expression Dataset Manager

The file File:Merrell Compiled Raw Data Vibrio BK 20151015- Tab Delimited.txt from my Week 8 assignment page was used to create a new Expression Dataset.

How many of the IDs were imported out of the total IDs in the microarray dataset? How many exceptions were there? Look in the EX.txt file and look at the error codes for the records that were not imported into the Expression Dataset. Do these represent IDs that were present in the UniProt XML, but were somehow not imported? or were they not present in the UniProt XML?

5100 of the 5221 gene IDs from the original text file were imported into the Expression Dataset.
There were 121 exceptions.
Within the EX.txt file, the error message "Gene not found in OrderedLocusNames or any related system." was listed 121 times.
Two examples of a genes that encountered the above error message were the genes with the following OrderedLocusNames (in the EX.txt file): VC2209 and VCA1031.
- When both of these IDs were searched in the Microsoft Access "OrderedLocusNames" table, no results were retrieved.
- When both of these IDs were searched in the XMLPipeDB Match results file created earlier in this protocol, no results were retrieved.
- Consequentially, it does not appear that the genes that retrieved error messages were present in the original UniProt XML file.
  - This finding would be explain the observation that individuals using the 2010 V. cholerae gene database last week, such as my Week 8 partner Veronica, also retrieved 121 errors during this step.

Note: 121 errors were encountered during the creation of this new Expression Dataset. A screenshot of the error message is shown here:

Coloring a MAPP with expression data

I customized the new Expression Dataset by creating a Color Set. The new Color Set was entitled "LogFoldChange".
- First, I created a criterion for this color set to label genes that demonstrated a significant increase in their expression.
  - I specified the Gene value as "Avg_LogFC_all" for the Vibrio dataset.
  - I activated the Criteria Builder by clicking the New button and named the criterion "Increased".
  - I selected the color for this criterion as red using the color box.
  - I stated the criterion as follows and added it to the Criteria List: [Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05
- Second, I created a criterion for this color set to label genes that demonstrated a significant decrease in their expression.
  - I specified the Gene value as "Avg_LogFC_all" for the Vibrio dataset.
  - I activated the Criteria Builder by clicking the New button and named the criterion "Decreased".
  - I selected the color for this criterion as green using the color box.
  - I stated the criterion as follows and added it to the Criteria List: [Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05
Upon entering these color sets, I saved the entire Expression Dataset by selecting Save from the Expression Dataset menu. This effectively updated my .gex file with the new Color Set.

Screenshot of Color Set criteria:

Note: No errors were encountered in the creation of the Color Set.

Running MAPPFinder

I launched the MAPPFinder program from within GenMAPP and ensured that the Vc-Std_BK_20151027.gdb Gene Database was still loaded into GenMAPP.
I clicked on the button "Calculate New Results" followed by "Find File", at which point I the .gex file updated during the creation of the "LogFoldChange" color set.
I chose to apply both the "Increased" and "Decreased" criteria present within the LogFoldChange Color Set to the data.
I checked the boxes next to "Gene Ontology" and "p value", specified the results file, and then clicked "Run MAPPFinder".
- This analysis took several minutes to complete.
MAPPFinder analysis results
- Initial result in MAPPFinder Browser:
- I clicked on the menu item "Show Ranked List" to see a list of the most significant Gene Ontology terms. A screenshot of this output is shown below:
- - As an aside, it is interesting to note that few of the top 10 gene ontology results from this MAPPFinder analysis were conserved from my analysis using the 2009 Vibrio cholerae gene database. This suggests the high degree of change and progress made in genomics in the last six years.

Note: The MAPPFinder analysis took approximately 16 minutes to complete. No errors were encountered in the process. MAPPFinder thus was confirmed to work with the gene database created this week.

Compare Gene Database to Outside Resource

I will get back to this question during the Group Final Project, as was allowed:

This section applies to the Group Final Project and does not need to be completed for the 
Week 9 assignment. —  Kdahlquist (talk) 15:46, 2 November 2015 (PST)

The OrderedLocusNames IDs in the exported Gene Database are derived from the UniProt XML. It is a good idea to check your list of OrderedLocusNames IDs to see how complete it is using the original source of the data (the sequencing organization, the MOD, etc.) Because UniProt is a protein database, it does not reference any non-protein genome features such as genes that code for functional RNAs, centromeres, telomeres, etc.

Links

User Page: Brandon Klein
Team Page: The Class Whoopers

Assignments Pages

Individual Journal Entries

Shared Journal Entries