Kzebrows Week 9

1 Electronic Lab Notebook
- 1.1 Export Information
2 UniProt XML
3 Gene Database Testing Report
4 TallyEngine
5 Using XMLPipeDB Match to Validate the XML Results from the TallyEngine
6 Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine
7 OriginalRowCounts Comparison
8 Visual Inspection
9 .gdb Use in GenMAPP
10 Compare Gene Database to Outside Resource
11 Class Notes for 10/29
12 Assignments
13 Additional Links

Electronic Lab Notebook

The Vibrio cholerae GenMAPP Gene Database was exported following the instructions on the Running GenMAPP Builder page. The rest of the assignment followed the instructions on the How Do I Count Thee? Let Me Count The Ways wiki page, and the report was filled out using the Gene Database Testing Report Sample as a template. Additionally, lists of IDs were compared using the instructions on the Using Microsoft Excel to Compare ID Lists page.

Export Information

First we downloaded the UniProt XML, GOA, and GO OBO-XML files. UniProt (a protein database) is linked to GO OBO-XML (the Gene Ontology) through GOA (Gene Ontology Annotation) by GenMAPP Builder, which is part of XMLPipeDB. GenMAPP Builder loads the data into a PostgreSQL database and then exports it as a GenMAPP-compatible gene database (.gdb). From there, we can analyze microarray data in GenMAPP.

UniProt XML

Went to UniProt Complete Proteomes page and filtered list by clicking on "bacteria" under Superkingdom heading
Filtered results and found V. cholerae
Clicked on the UniProt link for Vibrio cholerae serotype O1 and clicked download all with XML format as a compressed file.
Saved uniprot-organism%3A243277.xml.gz in Computer T drive under new folder Kzebrows

GOA

Went to the UniProt-GOA home page
Clicked on the link to download the proteomes directory
Found V. cholerae and right-clicked to download GO annotations
Downloaded and saved the file and saw that V_cholerae was last modified 13 Oct 2015 at 07:31

GO OBO-XML

Followed link to Legacy Downloads on the Gene Ontology page
Downloaded obo-xml.gz to T drive Kzebrows

NOTE: Both the UniProt XML and GO OBO-XML files were extracted using 7-zip.

Downloading GenMAPP Builder

Downloaded gmbuilder-3.0 using this link
Extracted in T drive using 7-zip

Creating a New Database in PostgreSQL

Launched PgAdmin III.
Double-clicked on PostgreSQL 9.4
Selected "Database" and "New Database" and named it "Vcholerae_20151027_gmb3build5". Copied name to clipboard and pressed OK.
Ran prepackaged query: open file > Thaw space > gmbuilder-3.0.0-build-5 > sql > gmbuilder.sql
- Query returned successfully with no results in 5697 ms
- Closed query window

Configuring GenMAPP Builder to Connect to PostgreSQL Database

Launched gmbuilder.bat
Selected File > Configure Database
Entered the following info and clicked OK
- Host: localhost
- Port number: 5432
- Database name: Vcholerae_20151027_gmb3build5
- Username: postgres
- password
Clicked OK

Importing Data into PostgreSQL Database

Details are found below in the Gene Database Testing Report.

Selected File > Import UniProt XML and found UniProt XML file
Selected File > Import GO OBO-XML file
Selected File > Import GOA

Exporting a GenMAPP Gene Database (.gdb)

Details are found below in the Gene Database Testing Report.

Selected File > Export to database
Typed my name into the Owner Field
Clicked on species V. cholerae
Created the database by saving under T drive
Left the boxes checked for exporting all Molecular Function, Cellular Component, and Biological Process Gene Ontology terms
Clicked next to begin export process
Start time: October 27, 2015 at 3:55 pm

I then attempted to use Microsoft Excel to compare the ID lists. I exported one ID list from XMLPipeDB Match and another from PostgreSQL, writing each file to my T drive. Then, using Microsoft Access, I exported my gene database data using the Excel button. I opened the files in Excel and pasted them into a new Excel worksheet, with each ID list in its own column, and I added a header row labeled "ID List 1" and "ID List 2" at the top.

Next I inserted a column to the right of the two ID columns and labeled it "1 to 2", and another to the right of that one labeled "2 to 1", indicating that I was comparing list 1 to list 2 and then list 2 to list 1. In cell C2 I typed the formula =MATCH(A2,B$2:B$7666,0), using the $ sign to lock the row numbers of the range. I then pasted this formula down the rest of column C. In column D, I typed the formula =MATCH(B2,A$2:A$3832,0) and pasted it down the column in the same way.
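The two-way MATCH comparison can also be sketched outside Excel. The following is a minimal Python sketch, not the method used in this notebook; the function name compare_id_lists and the sample IDs are hypothetical. An ID that MATCH would report as #N/A corresponds to an ID that appears in one list but not the other.

```python
def compare_id_lists(list1, list2):
    """Return (IDs in list1 missing from list2, IDs in list2 missing from list1).

    Mirrors the "1 to 2" and "2 to 1" MATCH columns: an ID lands in a
    result list exactly when MATCH would fail to find it in the other list.
    """
    set1, set2 = set(list1), set(list2)
    only_in_1 = [i for i in list1 if i not in set2]
    only_in_2 = [i for i in list2 if i not in set1]
    return only_in_1, only_in_2


# Hypothetical sample IDs, not taken from the real database export.
id_list_1 = ["VC_0001", "VC_0002", "VC_A0498"]
id_list_2 = ["VC_0001", "VC_A0498", "VC_0003"]

missing_from_2, missing_from_1 = compare_id_lists(id_list_1, id_list_2)
print("In list 1 but not list 2:", missing_from_2)
print("In list 2 but not list 1:", missing_from_1)
```

Using sets for the lookups keeps the comparison fast even for lists with several thousand IDs.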

Compare ID Lists in Excel

Gene Database Testing Report

Version of GenMAPP Builder: gmbuilder-3.0.0-build-5.zip

Computer on which export was run: #7. Computer is located in Seaver computer lab, second row, third from the left if facing front of the room.

Postgres Database name: Vcholerae_20151027_gmb3build5

UniProt XML filename (give filename and upload and link to compressed file): uniprot-organism%3A243277.xml

UniProt XML version (The version information can be found at the UniProt News Page): 10/13/2015 at 07:31
UniProt XML download link
Time taken to import: 3.10 minutes
- Note: Data downloaded slowly.

GO OBO-XML filename (give filename and upload and link to compressed file): go_daily-termdb.obo-xml.gz

GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the GO Download page has been unzipped): 10/09/2015
GO OBO-XML download link
Time taken to import: 7.14 minutes
Time taken to process: 4.64 minutes
- Note: Importing the data took a long time.

GOA filename (give filename and upload and link to compressed file): 46.V_cholerae_ATCC_39315.goa

GOA version (News on this page records past releases; current information can be found in the Last modified field on the FTP site): 10/13/15 at 6:31 am
GOA download link
Time taken to import: 0.06 minutes
- Note: Import was almost immediate.

Name of .gdb file (give filename and upload and link to compressed file): Vc-Std_20151027.gdb

Time taken to export: N/A
- Start time: October 27, 2015 at 3:55 pm
- End time: N/A
- Note: I left a note on top of my computer, but someone came in and closed my windows between Tuesday's class and Thursday's class, so I was unable to record the end time or the time taken to export. The people around me finished at about 5:11 pm on Tuesday on average, so I can infer that my file would have finished exporting around the same time.

TallyEngine

I used TallyEngine to verify that data was transferred consistently into PostgreSQL.

Ran PostgreSQL
Ran GenMAPP Builder and made sure it was connected to the database by clicking File > Configure
Chose Run XML and Database Tallies for UniProt and GO
Chose UniProt and GO files that I imported

The Tally results looked like this:

Using XMLPipeDB Match to Validate the XML Results from the TallyEngine

I downloaded the application from the XMLPipeDB SourceForge site (location ???). I opened the command line (cmd), changed (cd) to the folder containing the file that I wanted to check, and ran the command T:\Kzebrows>java -jar xmlpipedb-match-1.1.1.jar "VC_[0-9][0-9][0-9][0-9]" < "uniprot-organism%3A243277.xml" > OrderedLocusNames. Typing this command into cmd gave me:

Then, to account for VC_A IDs, a known problem brought up in class that affected the results between the different checks, I used the command T:\Kzebrows>java -jar xmlpipedb-match-1.1.1.jar "VC_A?[0-9][0-9][0-9][0-9]" < "uniprot-organism%3A243277.xml" > OrderedLocusNames to achieve this:

Are your results the same as you got for the TallyEngine? Why or why not? The pattern VC_[0-9][0-9][0-9][0-9] does not match VC_A#### entries, so I did not get the same results as with TallyEngine when I searched only for VC_####. After changing the pattern to VC_A?[0-9][0-9][0-9][0-9], which makes the "A" optional and so also matches VC_A#### entries, I got the same number of results (3,831).
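The effect of the optional "A" in the pattern can be illustrated with a short Python sketch. This is not XMLPipeDB Match itself; the function count_unique_matches and the sample text are hypothetical, and counting unique matches is an assumption made for the illustration.

```python
import re


def count_unique_matches(pattern, text):
    """Count the distinct strings in text that match the pattern."""
    return len(set(re.findall(pattern, text)))


# Hypothetical sample text; the real input was the UniProt XML file.
sample = "VC_0001 VC_0002 VC_A0498 VC_0001"

strict = count_unique_matches(r"VC_[0-9][0-9][0-9][0-9]", sample)
with_optional_a = count_unique_matches(r"VC_A?[0-9][0-9][0-9][0-9]", sample)

print("VC_#### only:", strict)
print("VC_#### and VC_A####:", with_optional_a)
```

In the strict pattern the character after the underscore must be a digit, so VC_A0498 is skipped. The "A?" in the second pattern accepts zero or one "A", so both forms are counted.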

Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine

Next I entered the following command in PgAdminIII:

select count (*) from genenametype where type = 'ordered locus' and value ~ 'VC_[0-9][0-9][0-9][0-9]';

and received 2,737 results after clicking the green button.

If I changed the pattern in the command to 'VC_A?[0-9][0-9][0-9][0-9]', which matches IDs beginning with either VC_ or VC_A, I got 3,831 results.

For more information, see this page.

Are your results the same as reported by the TallyEngine? Why or why not? The results were only the same when I added the A? to the pattern. The original pattern does not match VC_A#### IDs, so at first I only had 2,737 results; however, by adding A? I indicated that I wanted to include both VC_#### and VC_A#### IDs, and I received 3,831 results.
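The same comparison of the two patterns can be sketched against a small stand-in table. The sketch below uses Python's built-in sqlite3 module rather than PostgreSQL, so the PostgreSQL ~ operator is replaced by a user-defined REGEXP function; the sample rows (including the "ORF" type) and the helper count_ordered_locus are hypothetical.

```python
import re
import sqlite3


def regexp(pattern, value):
    """Unanchored regular expression test, similar in spirit to PostgreSQL's ~."""
    return value is not None and re.search(pattern, value) is not None


conn = sqlite3.connect(":memory:")
# SQLite evaluates "X REGEXP Y" by calling regexp(Y, X).
conn.create_function("REGEXP", 2, regexp)
conn.execute("CREATE TABLE genenametype (type TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO genenametype VALUES (?, ?)",
    [
        ("ordered locus", "VC_0001"),   # hypothetical rows
        ("ordered locus", "VC_A0498"),
        ("ORF", "VC_0002"),             # excluded by the type filter
    ],
)


def count_ordered_locus(pattern):
    """Count ordered locus names whose value matches the pattern."""
    row = conn.execute(
        "SELECT count(*) FROM genenametype "
        "WHERE type = 'ordered locus' AND value REGEXP ?",
        (pattern,),
    ).fetchone()
    return row[0]


print(count_ordered_locus("VC_[0-9][0-9][0-9][0-9]"))
print(count_ordered_locus("VC_A?[0-9][0-9][0-9][0-9]"))
```

The first count leaves out the VC_A row and the second includes it, which is the same behavior seen in the real database with 2,737 and 3,831 results.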

OriginalRowCounts Comparison

Within the .gdb file, look at the OriginalRowCounts table to see if the database has the expected tables with the expected number of records. Compare the tables and records with a benchmark .gdb file.

I downloaded the 2010 file from SourceForge and opened the .gdb file from Microsoft Access. In comparing the files, I observed the following:

Both have 7,664 entries. This is two more than twice the number found in XMLPipeDB Match and in PostgreSQL (2 x 3,831 = 7,662).
The 2015 database has 52 tables while the 2010 one has only 42.
Generally the 2015 database had far more rows per table, e.g. in 2010 RefSeq-Gene Ontology had 13,332 rows and in 2015 it had 41,064 rows.
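The row count comparison can be sketched in Python as follows. This is only an illustration under assumptions: the dictionaries hold just the figures quoted above, the table name spellings are assumed, and attaching the 7,664 figure to the OrderedLocusNames table is also an assumption. The function compare_row_counts is hypothetical.

```python
def compare_row_counts(benchmark, new):
    """Map each table name to (benchmark rows, new rows); None if absent."""
    tables = sorted(set(benchmark) | set(new))
    return {t: (benchmark.get(t), new.get(t)) for t in tables}


# Only the figures quoted in this notebook; table names are assumed spellings.
counts_2010 = {"OrderedLocusNames": 7664, "RefSeq-GeneOntology": 13332}
counts_2015 = {"OrderedLocusNames": 7664, "RefSeq-GeneOntology": 41064}

report = compare_row_counts(counts_2010, counts_2015)
changed_tables = [t for t, (old, new) in report.items() if old != new]
print(changed_tables)

# Arithmetic check: twice the 3,831 matched IDs, plus two extra entries.
print(2 * 3831 + 2 == 7664)
```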

Benchmark .gdb file: Vc-Std_External_20101022.gdb

The OriginalRowCounts table from 2010 (benchmark .gdb) looked like this:

The OriginalRowCounts table from 2015 (new .gdb) looked like this

Note:

Visual Inspection

Next I performed visual inspection of individual tables to see if there are any problems.

Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database? No. Only 11 out of 35 gene ID systems have a date in the Date field. All of the dates are 10/27/2015.
Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
- UniProt: Yes
- RefSeq: Yes
- OrderedLocusNames: No. Some are VC####, some are VC_####, some are VCA#### and some are VC_A####.
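The four OrderedLocusNames forms seen during visual inspection could also be tallied automatically. The following Python sketch is one possible way to do that; the names FORMS, classify, and tally_forms and the sample IDs are hypothetical, and the sketch assumes each ID ends in exactly four digits.

```python
import re
from collections import Counter

# One pattern per ID form observed in the OrderedLocusNames table.
FORMS = [
    ("VC####", re.compile(r"VC[0-9]{4}")),
    ("VC_####", re.compile(r"VC_[0-9]{4}")),
    ("VCA####", re.compile(r"VCA[0-9]{4}")),
    ("VC_A####", re.compile(r"VC_A[0-9]{4}")),
]


def classify(locus_id):
    """Return the label of the form the whole ID matches, or "other"."""
    for label, pattern in FORMS:
        if pattern.fullmatch(locus_id):
            return label
    return "other"


def tally_forms(ids):
    """Count how many IDs fall under each form."""
    return Counter(classify(i) for i in ids)


# Hypothetical sample IDs covering each form.
sample_ids = ["VC0001", "VC_0001", "VCA0498", "VC_A0498", "VC_A0499"]
print(tally_forms(sample_ids))
```

Because fullmatch requires the whole ID to fit the pattern, an ID with extra characters falls under "other" instead of being counted under a form it only partly resembles.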

Note:

.gdb Use in GenMAPP

Note:

Putting a gene on the MAPP using the GeneFinder window

In the main GenMAPP drafting board window I left-clicked on the "Gene" icon in the upper left hand corner and clicked on the drafting board to place the gene on the MAPP. Next I right-clicked the gene to access the GeneFinder window and pasted the gene ID VC_A0498 (listed first in the OrderedLocusNames table in the database) into the gene ID field. I selected the OrderedLocusNames system from the drop-down menu and clicked Search.

Next I opened the backpage by left-clicking on the gene box. All of the cross-referenced IDs that were supposed to be there were present.

Note:

Creating an Expression Dataset in the Expression Dataset Manager

I then launched the GenMAPP program and selected the Data Menu, from which I chose the Expression Dataset Manager to convert the data.

How many of the IDs were imported out of the total IDs in the microarray dataset? How many exceptions were there? Look in the EX.txt file and look at the error codes for the records that were not imported into the Expression Dataset. Do these represent IDs that were present in the UniProt XML, but were somehow not imported? or were they not present in the UniProt XML?
- When I opened the file Kzebrows_microarrayanalysis20151025.txt I received a message saying that there were 121 errors. Because UniProt XML has been updated in the last month, I think these IDs were present in the UniProt XML data but were not imported because of a problem with the Merrell dataset itself. There didn't appear to be any connection between the error codes for records that weren't imported and IDs in UniProt XML.

Note:

Coloring a MAPP with expression data

I then customized the new Expression Dataset by creating new Color Sets that contained instructions to GenMAPP for displaying data. I created these by filling in different fields in the color set area of the Expression Dataset Manager, as per my assignment page for Week 8. The Gene Value was "Avg_LogFC_all" for the dataset. Two criteria were established: "increased" used the formula [Avg_LogFC_All] > 0.25 AND [Pvalue] < 0.05, and "decreased" used the formula [Avg_LogFC_All] < -0.25 AND [Pvalue] < 0.05. I coded "increased" as green and "decreased" as red. I then saved the Expression Dataset and exited the Expression Dataset Manager.
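The logic of the two color set criteria can be written out as a small Python sketch. This is only an illustration of the thresholds quoted above, not how GenMAPP evaluates them internally; the function name classify_gene and the label for genes meeting neither criterion are hypothetical.

```python
def classify_gene(avg_log_fc, p_value, fc_cutoff=0.25, p_cutoff=0.05):
    """Apply the "increased" and "decreased" criteria to one gene."""
    if avg_log_fc > fc_cutoff and p_value < p_cutoff:
        return "increased"
    if avg_log_fc < -fc_cutoff and p_value < p_cutoff:
        return "decreased"
    return "no criteria met"


# Hypothetical values, not taken from the Merrell dataset.
print(classify_gene(0.80, 0.01))
print(classify_gene(-0.60, 0.03))
print(classify_gene(0.80, 0.20))
```

A gene with a large fold change but a p value of 0.05 or higher meets neither criterion, so it would not be colored by either rule.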

Note:

Running MAPPFinder

I launched MAPPFinder and chose my database for V. cholerae. I then clicked "Calculate New Results" and "Find File" and chose my Expression Dataset (.gex) file. I clicked OK and chose "increased" (arbitrary) and checked "gene ontology" and "p value" boxes. I then clicked Browse and created the filename Kzebrowsvcholerae20151102.

Note: Every time I attempted to run MAPPFinder on my computer the program crashed and I was given the "Not Responding" message. To fix this I had to upload my database to my Google Drive and access it from another computer. Luckily MAPPFinder worked on the other computer, so I chose my .gdb file as the gene database, chose my .gex file, and ran the program. It took about 12 minutes for the program to run.

I then attempted to click on one of the GO terms, purine ribonucleotide biosynthetic process. Instead of listing all of the genes associated with this GO term I received this message:

Compare Gene Database to Outside Resource

The OrderedLocusNames IDs in the exported Gene Database are derived from the UniProt XML. It is a good idea to check your list of OrderedLocusNames IDs to see how complete it is using the original source of the data (the sequencing organization, the MOD, etc.) Because UniProt is a protein database, it does not reference any non-protein genome features such as genes that code for functional RNAs, centromeres, telomeres, etc.

Note:

Class Notes for 10/29

Review of the big picture: Two weeks ago we exported raw data from Merrell into Excel to import into GenMAPP. MAPPFinder's backpage looked genes up in the Gene Database. Now we're working backwards to find where the Gene Database came from. The Gene Database comes from a combination of 3 files, which are combined by running GenMAPP Builder. The product of the builder is the database.

UniProt (xml)
GO (xml)
GOA (tab delimited)

The file in ThawSpace is Vc-Std_20151027.gdb. The import step loaded the data from the files into a PostgreSQL database. The export step takes the data out of PostgreSQL and puts it in the Gene Database, where we can perform microarray analysis again (done 10/27).

Quality Assurance: Need to check that the data traveled correctly to the PostgreSQL database and that it traveled correctly to the final Gene Database. This should be performed every time you export data to a database to verify that everything happened as it should.