Anuvarsh Week 9

From LMU BioDB 2015

***All files can be accessed here***
All procedures below were modified from the following pages:


Pre-requisites

This procedure was done in a Windows environment. While it is possible to run GenMAPP Builder under the Mac or Linux OS, the end product, a GenMAPP-compatible Gene Database (.gdb), can only be used with the GenMAPP program, which can only be run on Windows. This set of software has already been installed on the computers in the Seaver 120 computer lab. Prior to proceeding through this procedure, my machine had the following tools and programs:

  1. 7-zip to extract any zipped files.
  2. PostgreSQL on Windows (http://www.enterprisedb.com/products-services-training/pgdownload)
    • This procedure was written using PostgreSQL 9.4.x.
  3. GenMAPP Builder (https://sourceforge.net/projects/xmlpipedb/files/)
  4. Java JDK 1.8 64-bit
  5. GenMAPP 2 can be downloaded here. The file to download is "GenMAPPv2Setup.exe".
  6. XMLPipeDB match utility (https://sourceforge.net/projects/xmlpipedb/files/) for counting IDs in XML files
  7. Microsoft Access or any other tool that can read .mdb files


Export Information

Version of GenMAPP Builder: gmbuilder-3.0.0-build-5

Computer on which export was run: back row, second from door.

Postgres Database name: vcholera-20151027-gmb3build5-AV

UniProt XML filename (give filename and upload and link to compressed file):

GO OBO-XML filename (give filename and upload and link to compressed file):

  • GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the [GO Download page] has been unzipped): 10/27/2015, 2:24am
  • GO OBO-XML download link: http://geneontology.org/page/download-ontology#Legacy_Downloads
    • Clicked on obo-xml.gz link in far right column of second row of XML format table.
  • Time taken to import: 7.58 minutes
  • Time taken to process: 4.37 minutes
    • Note:

GOA filename (give filename and upload and link to compressed file):

Name of .gdb file (give filename and upload and link to compressed file):

  • Time taken to export: 1 hour 17 minutes 47 seconds
    • Start time: 10/27/2015, 3:52:08PM PDT
    • End time: 10/27/2015, 5:09:55PM PDT
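
As a quick check on the export duration, the start and end timestamps above can be subtracted directly. This is a minimal sketch using Python's standard datetime module; the variable names are my own, and it assumes both timestamps are in the same time zone (PDT), as recorded.

```python
from datetime import datetime

# Start and end timestamps as recorded above (both PDT, same day).
FORMAT = "%m/%d/%Y, %I:%M:%S%p"
start = datetime.strptime("10/27/2015, 3:52:08PM", FORMAT)
end = datetime.strptime("10/27/2015, 5:09:55PM", FORMAT)

# Break the elapsed time into hours, minutes, and seconds.
total_seconds = int((end - start).total_seconds())
hours, remainder = divmod(total_seconds, 3600)
minutes, seconds = divmod(remainder, 60)
print(f"{hours} h {minutes} min {seconds} s")  # prints "1 h 17 min 47 s"
```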

Note:

TallyEngine

  • Ran the TallyEngine in GenMAPP Builder and recorded the number of records for UniProt and GO in the XML data and in the Postgres databases.
    • After starting PostgreSQL and making sure my database was running, I ran GenMAPP Builder and connected it to the database.
    • After performing an import, I chose Run XML and Database Tallies for UniProt and selected the UniProt and GO files that I imported.
    • My Tally results are in the screenshot below:

Week9-tally-results-AV.png

  • The Tally results indicated counts for unique genes (labelled as Ordered Locus)
    • XML: 3831 unique genes
    • Database: 3831 unique genes

Using XMLPipeDB match to Validate the XML Results from the TallyEngine

  • To check the number of genes with XMLPipeDB match, I ran match against the file containing all of the UniProt genes, using a pattern for the gene names.
  • Command Used:
   java -jar xmlpipedb-match-1.1.1.jar "VC_[0-9][0-9][0-9][0-9]" < uniprot-organism%3A243277.xml

Week9-xmlpipedb-results-AV.png

  • XMLPipeDB Match returned 2738 unique genes.
    • This number differs from the numbers the TallyEngine returned.
    • Upon later analysis (documented later in this lab notebook), I realized that the discrepancy arises mostly because the UniProt file presents gene names in several different formats. We had counted only the genes named in the format "VC_####", which disregards the other legitimate naming format.
  • We modified our command to take into consideration both 'VC_####' and 'VC_A####' as legitimate gene names. This resulted in xmlpipedb-match indicating 3831 unique genes.
  • Command Used:
   java -jar xmlpipedb-match-1.1.1.jar "VC_A?[0-9][0-9][0-9][0-9]" < uniprot-organism%3A243277.xml > ordered-locus_match-results_AV.txt

Week9-xmlpipedb-results-2-AV.png

  • These new results match the results of the Tally Engine.
  • This result was sent to the file ordered-locus_match-results_AV.txt
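
The effect of the two patterns can be illustrated on a small made-up sample. This is only a sketch in Python, assuming that counting distinct regular-expression matches is a fair stand-in for what XMLPipeDB match reports; the sample text and the count_unique helper are hypothetical and are not taken from the UniProt XML file.

```python
import re

# Hypothetical snippet standing in for the UniProt XML; not real data.
sample = """
<name type="ordered locus">VC_0014</name>
<name type="ordered locus">VC_0015</name>
<name type="ordered locus">VC_A0001</name>
<name type="ordered locus">VC_A0002</name>
<name type="ORF">VC_0014</name>
"""

def count_unique(pattern, text):
    """Return the number of distinct substrings that match pattern."""
    return len(set(re.findall(pattern, text)))

# The first pattern misses the VC_A#### names; the second accepts both formats.
narrow = count_unique(r"VC_[0-9][0-9][0-9][0-9]", sample)
broad = count_unique(r"VC_A?[0-9][0-9][0-9][0-9]", sample)
print(narrow, broad)  # prints "2 4"
```

The optional "A?" is the only change between the two patterns, which is why the second count picks up the names that the first one skipped.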

Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine

  • On PostgreSQL, I searched through the database using a variation of the select count(*) query.
  • Command used:
   select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_[0-9][0-9][0-9][0-9]';

Week9-postgresql-result-1-AV.png

  • PostgreSQL returned 2737 unique genes.
  • These results are different for the same reason as the xmlpipedb-match results. This search only counted one format of gene names.
  • When we reformatted the query to take into account both 'VC_####' and 'VC_A####', PostgreSQL indicated 3831 unique genes.
  • Command used:
   select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_A?[0-9][0-9][0-9][0-9]';

Week9-postgresql-result-2-AV.png

  • This new count is the same as the XML and database results from the TallyEngine.
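
The two queries can be tried out without a PostgreSQL server. The sketch below is an approximation, not the procedure I actually ran: it uses an in-memory SQLite database from Python's standard library, with a user-defined REGEXP function standing in for PostgreSQL's ~ operator. The table contents are hypothetical.

```python
import re
import sqlite3

def regexp(pattern, value):
    # SQLite calls regexp(pattern, value) for "value REGEXP pattern".
    # re.search matches anywhere in the value, like PostgreSQL's ~ operator.
    if value is None:
        return 0
    return int(re.search(pattern, value) is not None)

conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.execute("CREATE TABLE genenametype (type TEXT, value TEXT)")

# Hypothetical rows for illustration only.
rows = [
    ("ordered locus", "VC_0014"),
    ("ordered locus", "VC_0015"),
    ("ordered locus", "VC_A0001"),
    ("ORF", "VC_0016"),
    ("primary", "dnaA"),
]
conn.executemany("INSERT INTO genenametype VALUES (?, ?)", rows)

QUERY = ("SELECT count(*) FROM genenametype "
         "WHERE type = 'ordered locus' AND value REGEXP ?")
narrow = conn.execute(QUERY, ("VC_[0-9][0-9][0-9][0-9]",)).fetchone()[0]
broad = conn.execute(QUERY, ("VC_A?[0-9][0-9][0-9][0-9]",)).fetchone()[0]
print(narrow, broad)  # prints "2 3"
```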

OriginalRowCounts Comparison

Within the .gdb file, I looked at the OriginalRowCounts table to see if the database has the expected tables with the expected number of records. I compared the tables and records with a benchmark .gdb file. The benchmark .gdb file that I used was the 2010 V. cholerae database from the Week 8 DNA Microarray Analysis Journal.

Benchmark .gdb file: Vc-Std_External_20101022

Copy the OriginalRowCounts table from the benchmark and new gdb and paste them here:

  • New gdb:

Week9-OriginalRowCounts-table-AV.png

  • 2010 gdb:

Week9-OriginalRowCounts-table-2010-AV.png

  • Note:
    • Both files indicated the same number of UniProt-OrderedLocusNames.
    • The new file contains 10 more tables than the previous version.
    • Most counts in the tables of the new gdb are much higher than the corresponding counts in the 2010 gdb.

Visual Inspection

Perform visual inspection of individual tables to see if there are any problems.

  • Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
    • The Systems table does not have gene ID information. Instead, it has a column titled "System Name" that seems to refer to all of the systems this database could be associated with. It does not have dates next to all of the databases mentioned.
  • Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
    • The UniProt table seems to have the correct form for that type of ID. All of the IDs are in the format (where L = Letter and # = any single digit number): L#LL[L or #]#
    • The RefSeq table seems to have the correct form for that type of ID. All of the IDs are in the format (where # = any single digit number): [N or W]P_[###### or #########]
    • The OrderedLocusNames table has IDs in one of the following formats (where # = any single digit number):
      • VC_####
      • VC_A####
      • VC####
      • VCA####
    • This variation between underscore and no underscore is intentional: the programs enter each ID in both forms, which doubles the number of entries in the database. We only want to consider the entries with the underscore, as the others are duplicates. The variation between A and no A reflects two different ID formats, and these differences were considered when counting the number of unique gene IDs.
    • The OrderedLocusNames table indicates 7664 unique gene IDs. Dividing that number by two to account for the duplicates described above gives 3832 unique gene IDs. This is still one more than the 3831 found by all of the other counting methods, which means that 2 IDs in the OrderedLocusNames table (or just 1 once its duplicate is removed) have not been accounted for by any of the other methods.
    • This discrepancy is due to one gene ID, "VC_A0360.1", being in a completely different format from all the others. It contributes 2 extra IDs to the OrderedLocusNames table because both forms of this gene ID (with and without the underscore) exist in the table, as with the duplicates addressed earlier.
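
The reasoning above can be sketched in a few lines of Python. The list of IDs is a hypothetical excerpt, not the real OrderedLocusNames table, and the variable names are my own; only the counts 7664 and 3831 and the ID "VC_A0360.1" come from this notebook.

```python
import re

# Hypothetical excerpt of the OrderedLocusNames ID column; not the real table.
ids = ["VC_0014", "VC0014", "VC_A0001", "VCA0001", "VC_A0360.1", "VCA0360.1"]

# Keep only the underscore forms, since the others are duplicates.
with_underscore = [i for i in ids if "_" in i]

# Split the IDs by whether the whole ID fits the two formats counted earlier.
standard = re.compile(r"VC_A?[0-9]{4}")
regular = [i for i in with_underscore if standard.fullmatch(i)]
irregular = [i for i in with_underscore if not standard.fullmatch(i)]
print(regular, irregular)  # prints "['VC_0014', 'VC_A0001'] ['VC_A0360.1']"

# The same arithmetic with the counts reported above.
table_rows = 7664
deduplicated = table_rows // 2   # 3832 IDs after removing the duplicate forms
extra = deduplicated - 3831      # 1 ID beyond the count from the other methods
print(deduplicated, extra)       # prints "3832 1"
```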

.gdb Use in GenMAPP

Putting a gene on the MAPP using the GeneFinder window

  • In the main GenMAPP Drafting Board window, I left-clicked on the icon for "Gene" in the upper left corner of the window. I clicked on the Drafting Board to place the Gene on the MAPP. Then, I right-clicked on the gene to access the GeneFinder window. I pasted "VC_0014" into the Gene ID field, and selected the OrderedLocusNames ID. Once the ID was found, I clicked on the OK button to return to the Drafting Board window.
  • I opened the Backpage by left-clicking on the gene box on the Drafting Board.
  • I tried "VC_0014" from each of the gene ID systems, then opened the Backpage and saw that all of the cross-referenced IDs that are supposed to be there are there.

Gene-Finder-Results-AV.png

Creating an Expression Dataset in the Expression Dataset Manager

  • I opened the Expression Dataset Manager from the Data menu in the main drafting board window and selected "New Dataset"
  • I selected the tab-delimited text file that I formatted for GenMAPP (.txt) last week. Since the only data included in this file are numerical values, I did not select any boxes in the Data Type Specification window that appeared.
  • The Expression Dataset Manager then converted my data for several seconds.
    • When uploading my tab delimited file to the GenMAPP software for conversion, 121 errors were detected in my raw data.
  • I then customized the new Expression Dataset by creating Color Sets that contain instructions to GenMAPP for displaying data on MAPPs
  • To do this, I created a color set named "LogFoldChange" and selected "Avg_LogFC_all" as the data that should be used as the Gene Value. I activated the criteria builder by clicking the New button.
  • Two criteria were created: one for increased expression and another for decreased expression
    • "Increased" was defined as [Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05
    • "Decreased" was defined as [Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05
  • I saved the entire Expression Dataset by selecting Save from the Expression Dataset menu and then exited from the Expression Dataset Manager to view the Color Sets on a MAPP.
  • How many of the IDs were imported out of the total IDs in the microarray dataset? How many exceptions were there? Look in the EX.txt file and look at the error codes for the records that were not imported into the Expression Dataset. Do these represent IDs that were present in the UniProt XML, but were somehow not imported? or were they not present in the UniProt XML?
    • Uploading the Merrell et al. .txt file modified for GenMAPP from last week using the new database resulted in 121 errors. Surprisingly, this is the same number of errors that appeared when uploading this same data with the 2010 version of the V. cholerae database.
    • 5,221 IDs were imported with the new database. This is the same as the number of IDs imported with the 2010 database.
    • All of the errors in the .EX.txt file said: "Gene not found in OrderedLocusNames or any related system."
    • I opened the .EX.txt file and searched for the genes that were documented with an error in the "Uniprot" table in the database file via Microsoft Access. I spot checked 6 genes ("VC2209", "VCA1031", "VCA0745", "VC1476", "VCA0534", "VCA0276") and noticed that none of these genes appeared in the "Uniprot" table in the database. Because of this, I can conclude that these genes were not present in the UniProt XML.
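
The "Increased" and "Decreased" criteria defined for the LogFoldChange color set earlier in this section can be written out as a small function. This is an illustrative sketch in Python, not how GenMAPP evaluates criteria internally; the classify function and the sample rows are hypothetical.

```python
def classify(avg_log_fc, p_value):
    """Apply the Increased/Decreased criteria to one gene."""
    if avg_log_fc > 0.25 and p_value < 0.05:
        return "Increased"
    if avg_log_fc < -0.25 and p_value < 0.05:
        return "Decreased"
    return "No criteria met"

# Hypothetical (ID, Avg_LogFC_all, Pvalue) rows; not taken from the dataset.
genes = [
    ("VC_0001", 1.10, 0.01),    # large increase, significant
    ("VC_0002", -0.80, 0.03),   # large decrease, significant
    ("VC_0003", 0.90, 0.20),    # large increase, not significant
    ("VC_0004", 0.10, 0.001),   # significant, but change too small
]

labels = {gene_id: classify(fc, p) for gene_id, fc, p in genes}
for gene_id, label in labels.items():
    print(gene_id, label)
```

A gene must pass both the fold-change threshold and the p value threshold to be colored, so the last two sample rows fall through to "No criteria met".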

Coloring a MAPP with expression data

  • I launched the MAPPFinder program from within GenMAPP by selecting MAPPFinder under the Tools menu on GenMAPP
  • After ensuring that the Gene Database for the correct species was loaded, I selected the button "Calculate New Results"
  • I clicked on "Find File" and selected the .gex file that contained my Expression Dataset (This file can be found in the .zip file at the beginning of this journal)
  • I chose the LogFoldChange color set and the Increased criteria in the right-hand box, and selected the boxes for "Gene Ontology" and "p value"
  • I then named my results file "MAPPFinder-Results-20151102_AV", and clicked Run MAPPFinder. MAPPFinder took several seconds to run. After MAPPFinder finished running, a Gene Ontology browser opened showing my results.

Note: When I double click on one of the GO terms to open the data in GenMAPP, I get the following error:
MAPPFinder-error-2015-AV.png
I am not sure what is causing this error, but I am guessing that it is due to a problem with the format of the data in my database. Another explanation is that the new format of this data is not compatible with MAPPFinder or GenMAPP. I am unable to determine which of these explanations is more likely at this point.

Running MAPPFinder

The following are some screenshots from running MAPPFinder using my database.
MAPPFinder-results-2015-AV.png

MAPPFinder-GO-topranked-2015-AV.png

Note: Once again, there are several differences between the top-ranked GO terms from the 2015 database and those found in the 2010 and 2009 versions of the V. cholerae database. These differences, as explained in the week 8 journal, are most likely due to new information made available through UniProt, GO, and GOA.


Other Links

User Page: Anindita Varshneya
Class Page: BIOL/CMSI 367: Biological Databases, Fall 2015
Group Page: GÉNialOMICS
