Gene Database Testing Report - Oregon Trail Survivors

From LMU BioDB 2015
Jump to: navigation, search

Helpful Links

Team Links
Files Team Members Week 11 Assignment Week 12 Assignment Week 14 Assignment Week 15 Assignment
OTS Deliverables Trixie Week 11 Week 12 Week 14 Week 15
Jake Week 11 Week 12 Week 14 Week 15
Gene Database Testing Report Erich Week 11 Week 12 Week 14 Week 15
Kristin Week 11 Week 12 Week 14 Week 15
Gene Database Project Links
Overview Deliverables Reference Format Guilds Project Manager GenMAPP User Quality Assurance Coder
Teams Heavy Metal HaterZ The Class Whoopers GÉNialOMICS Oregon Trail Survivors

Things to note

  • Taxonomy ID: 623
  • UP000001006
  • File management system: Wiki

Initial (Vanilla) Export Information

Version of GenMAPP Builder:

  • gmbuilder-3.0.0-build-5

Computer on which export was run:

  • Front of the room, 3rd computer from the right.

Postgres Database name:

  • Shigella flexneri 20151911

UniProt XML filename (give filename and upload and link to compressed file):

  • UniProt XML version (The version information can be found at the UniProt News Page): UniProt release 2015_11
  • UniProt XML download link: Click here
  • Time taken to import: 4.48 minutes
    • Note:

GO OBO-XML filename (give filename and upload and link to compressed file):

  • GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the GO Download page has been unzipped): Version created on 11/19/2015 (at 2:24 AM)
  • GO OBO-XML download link: Click here to download.
  • Time taken to import: 7.00 minutes
  • Time taken to process: 4.99 minutes
    • Note:

GOA filename (give filename and upload and link to compressed file):

  • GOA version (News on this page records past releases; current information can be found in the Last modified field on the FTP site): Version released on .
  • GOA download link: Click here to download.
  • Time taken to import: 0.06 minutes
    • Note:

Name of .gdb file (give filename and upload and link to compressed file): Sf-Std_20151119_OTS.gdb

  • Time taken to export: 1 Hours, 32 Minutes, 33 Seconds
    • Start time: 4:06:13 PM PM PDT
    • End time: 5:38:46 PM PDT
    • Note:

Export Information for Build with Coder Changes # 1

Build 1

Name of .gdb file: Sf-Std_20151201.gdb

  • Time taken to export:
    • Start time: 4:19:22 PM PDT
    • End time: 8:30:08 PM PDT
    • I have confirmed that the necessary information in the .gdb file exist in the new build (e.g. the URL of the database we are using).

TallyEngine

  • Run the TallyEngine in GenMAPP Builder and record the number of records for UniProt and GO in the XML data and in the Postgres databases.
    • Choose the menu item Tallies > Run XML and Database Tallies for UniProt and GO...
    • Choose the UniProt and GO OBO XML files that was uploaded from the previous sections of this assignment.
    • Here is the screenshot of the tally result:

TallyEngine results OTS 112115.jpg

Using XMLPipeDB match to Validate the XML Results from the TallyEngine

Follow the instructions found on this page to run XMLPipeDB match.

  • In the Thawspace directory, I created a folder called "Shigella_flexneri_BioDB_2015" and created subfolders called "Source" and "Working" to store the source files (i.e., the compressed files) and the working files (i.e., the files I will actually be processing).
  • As a result, I had to cd to these directories first before using the command for using Match.
    • In order to change into the ThawSpace0\Shigella_flexneri_BioDB_2015\Working directory, use the following commands on the command prompt window:
T: && cd "Shigella_flexneri_BioDB_2015\Working"
  • The command I used once inside the directory I want is:
java -jar xmlpipedb-match-1.1.1/xmlpipedb-match-1.1.1.jar "SF[0-9][0-9][0-9][0-9]" < uniprot-proteome%3AUP000001006.xml
  • The results are as follows:

Match results OTS 112115.jpg

These results did not match up with what the TallyEngine gave (TallyEngine: 7567 vs. Match: 4610)

  • As a result, the commands would have to be modified somehow so that the numbers match: (CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?(/|</name>)
  • The overall command to write to a text file is as follows:
java -jar xmlpipedb-match-1.1.1/xmlpipedb-match-1.1.1.jar "(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?(/|</name>)" < uniprot-proteome%3AUP000001006.xml > shigella_flexneri_results.txt
  • Then our results became:

Match results OTS 20151203 more accurate.jpg

  • Observations:
    • In order to lessen the number of matches, we had to add the end tag "</name>" to our regular expression. This brought down the number of matches from over 8000, to just 7517. Since TallyEngine's results were 7567, this means that 150 IDs were not being caught. In order to account for this, we had to add the genes with ID's of the form CP#### (there were 50 instances of these), and those with the form SF####.# or S####.#. This led us to get 7566 gene IDs.
    • When I looked at the IDs in Microsoft Access, the IDs total 7569. In order to account for this last piece of gene formatting, we also had to account for the genes with the form SF?####/SF?####. These 2 extra genes that were not accounted for by TallyEngine is actually not supposed to be separated since the genes are formatted such that it can be interpreted that the IDs are interchangeable. When the gdb file was created, it would seem that these genes have been split down the "/".
    • In other words, there are 3 ordered locus names with formatting that is different from the rest: SF2223/SF2224, S2352/S2353, and S3359/S3360.
    • I wasn't able to exactly hit the number outputted by Tally Engine since there are other genes with the same format that were already caught with the patterns SF#### or S####.
    • Note: It turns out the ShiBASE database only uses the pattern SF#### instead of S#### or CP#### so the regular expression would really have to be just SF?[0-9][0-9][0-9][0-9](\.[0-9])?(/|</name>)


Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine

  • The command used to count the number of IDs is:
select count(*) from genenametype where type = "ordered locus" and value ~ "(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?";

Postgres results OTS 20151203.jpg

  • The result above is exactly twice as much as the number of OrderedLocusNames from TallyEngine: 15134 / 2 = 7567 IDs
  • A quick peek at the results after the command select value from genenametype where type = 'ordered locus' and value ~ '(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?'; is used and the results are exported to Excel reveals that this is because every single entry is entered twice:

Postgres results excel form OTS 20151203.jpg

  • Adding the keyword "distinct" would resolve the double counting:
select distinct value from genenametype where type = 'ordered locus' and value ~ '(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?

Analysis

  • The total number of OrderedLocusNames in TallyEngine is 7567.
  • Using the (best I could) regular expression in Match, the result is 7573. The additional 6 IDs emerged since those are originally captured by the regular expression (CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?</name> and trying to capture the IDs of the form SF?####/SF?#### would duplicate those captured IDs.
  • The total of entries in PostGreSQL is 15134, but this is only because each gene is repeated twice. As a result, diving by 2 would actually yield 7567.
  • Microsoft Access yielded 7569 in the OrderedLocusNames window. The extra 2 genes came from the IDs of the form SF?####/SF?#### since the export broke up the two IDs that represent the same ID.
    • 49 are of the form CP####
    • 3413 are of the form S####
      • 14 are of the form S####.#
    • 4107 are of the form SF####
      • 35 are of the form SF####.#
  • Inspecting the UniProt XML file was necessary in identifying the IDs. Looking through what was inside, I discovered (with help from Dondi) that I had to add the end tag "</name>" in order to narrow down the results in Match

"Export" from Build 2

Name of .gdb file: Sf-Std 20151207.gdb

  • Date: 12/7/15
  • Time taken to export: 4 hours, 24 minutes and 1 second
    • Start time: 9:13:45 PM PDT
    • End time: 1:37:46 AM PDT
    • Note: This export had to be redone since the PSQL database had twice as much entries.

Export Information (Re-imported) Build 2

Version of GenMAPP Builder:

  • gmbuilder-3.0.0-build-5

Computer on which export was run:

  • Front of the room, 3rd computer from the right.

Postgres Database name:

  • Shigella_flexneri_20151208

UniProt XML filename (give filename and upload and link to compressed file):

  • UniProt XML version (The version information can be found at the UniProt News Page): UniProt release 2015_11
  • UniProt XML download link: Click here
  • Time taken to import: 4.43 minutes
    • Note:

GO OBO-XML filename (give filename and upload and link to compressed file):

  • GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the GO Download page has been unzipped): Version created on 11/19/2015 (at 2:24 AM)
  • GO OBO-XML download link: Click here to download.
  • Time taken to import: 6.84 minutes
  • Time taken to process: 5.49 minutes
    • Note:

GOA filename (give filename and upload and link to compressed file):

  • GOA version (News on this page records past releases; current information can be found in the Last modified field on the FTP site): Version released on .
  • GOA download link: Click here to download.
  • Time taken to import: 0.06 minutes
    • Note:

Name of .gdb file (give filename and upload and link to compressed file): Sf-Std 20151208.gdb

  • Time taken to export: 1 hour, 38 minutes, 42 seconds
    • Start time: 4:30:59 PM PDT
    • End time: 6:09:41 PM PDT
    • Note: The reason why I had to re-import everything into a new database is because the one I have been using had some files imported twice. Thus, the reports given by PostGres were all twice as much.

Using TallyEngine

  • The database used is the same one described in the section above: Shigella_flexneri_20151208
  • Notice in the image below that there is an error in the cells. It turns out that we did not even need to add the Ordered Locus since that was the default. We will definitely need to do one last build in order to fix that issue.

Shigella flexneri tallyEngine results build 2.png

Using XMLPipeDB match to Validate the XML Results from the TallyEngine

Regex1 OTS.png

Regex2 OTS.png

  • When added together, the results becomes 7566 + 3 = 7569.

Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine

  • The following command in PostGreSQL resulted in 7567 entries:
select value from genenametype where type = 'ordered locus' and value ~ '(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?';
  • The following command resulted in 214 entries:
select value from genenametype where type = 'ORF' and value ~ '(CP|SF?)(_p)?[0-9][0-9][0-9][0-9](\.[0-9])?';

OriginalRowCounts Comparison

Ms access originalrowcounts.png

  • The OrderedLocusNames row seems to report on the same number of IDs as our previous builds

Visual Inspection

Perform visual inspection of individual tables to see if there are any problems.

  • Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
    • Yes, there are dates present for GeneOntology, InterPro, GeneID, RefSeq, UniProt, EMBL, PDB, Pfam, OrderedLocusNames, and EnsemblBacteria.
  • Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
    • Yes, all of them seem to follow the same format (there ares more or less, 3 variations on the IDs for each of the tables).

Excel Inspection

Observations

  • Through the use of an XML-reader program, called "firstObject XML Editor", it was discovered that some ordered locus IDs that were exported by GenMAPPBuilder were placed in the same tag:

Dual ordered locus names.png

  • These differed from the ones originally captured (7567) since these existed separately in each of the gene/name tags:

Simple ordered locus names.png

  • Additionally, from the IDs reported by the GenMAPP users as missing, it was revealed that these do not exist in the XML file at all, or at least in the format that we wanted. These sets of IDs were actually misnomers since, even though CTRL + F lets us find them, they are not the ordered locus names that we were looking for:
  • Example 1:

Match pic.png

  • Example 2:

Id misnomers2.png

  • However, because of these observations, we have actually discovered ~92 IDs that existed within the XML file, albeit in a different tag than what we were using:

Id misnomers.png

  • All these observations led us to make one final build to capture those ~92 gene IDs.

Export Information (final)

  • Date: 12/14/15

Name of .gdb file (give filename and upload and link to compressed file): Sf-Std 20151208.gdb

  • Time taken to export: 2 hour, 0 minutes, 27 seconds
    • Start time: 9:35:00 PM PDT
    • End time: 11:35:27 PM PDT

Using TallyEngine

  • The results are shown below:

Tally results build2.png

Using Microsoft Access

  • Even though the results from the TallyEngine say different numbers, the OrderedLocusNames that were exported in the .gdb file is the following:

Orderedlocusnames.png

.gdb File

  • The resulting .gdb file can be downloaded here.