Gene Database Testing Report - Oregon Trail Survivors

Helpful Links

Team Links
Files	Team Members	Week 11 Assignment	Week 12 Assignment	Week 14 Assignment	Week 15 Assignment
OTS Deliverables	Trixie	Week 11	Week 12	Week 14	Week 15
OTS Deliverables	Jake	Week 11	Week 12	Week 14	Week 15
Gene Database Testing Report	Erich	Week 11	Week 12	Week 14	Week 15
Gene Database Testing Report	Kristin	Week 11	Week 12	Week 14	Week 15

Gene Database Project Links
Overview	Deliverables	Reference Format	Guilds	Project Manager	GenMAPP User	Quality Assurance	Coder
Overview	Deliverables	Reference Format	Teams	Heavy Metal HaterZ	The Class Whoopers	GÉNialOMICS	Oregon Trail Survivors

Things to note

Taxonomy ID: 623
UP000001006
File management system: Wiki

Initial (Vanilla) Export Information

Version of GenMAPP Builder:

gmbuilder-3.0.0-build-5

Computer on which export was run:

Front of the room, 3rd computer from the right.

Postgres Database name:

Shigella flexneri 20151911

UniProt XML filename (give filename and upload and link to compressed file):

UniProt XML version (The version information can be found at the UniProt News Page): UniProt release 2015_11
UniProt XML download link: Click here
Time taken to import: 4.48 minutes
- Note:

GO OBO-XML filename (give filename and upload and link to compressed file):

GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the GO Download page has been unzipped): Version created on 11/19/2015 (at 2:24 AM)
GO OBO-XML download link: Click here to download.
Time taken to import: 7.00 minutes
Time taken to process: 4.99 minutes
- Note:

GOA filename (give filename and upload and link to compressed file):

GOA version (News on this page records past releases; current information can be found in the Last modified field on the FTP site): Version released on .
GOA download link: Click here to download.
Time taken to import: 0.06 minutes
- Note:

Name of .gdb file (give filename and upload and link to compressed file): Sf-Std_20151119_OTS.gdb

Time taken to export: 1 Hours, 32 Minutes, 33 Seconds
- Start time: 4:06:13 PM PM PDT
- End time: 5:38:46 PM PDT
- Note:

Export Information for Build with Coder Changes # 1

Build 1

Name of .gdb file: Sf-Std_20151201.gdb

Time taken to export:
- Start time: 4:19:22 PM PDT
- End time: 8:30:08 PM PDT
- I have confirmed that the necessary information in the .gdb file exist in the new build (e.g. the URL of the database we are using).

TallyEngine

Run the TallyEngine in GenMAPP Builder and record the number of records for UniProt and GO in the XML data and in the Postgres databases.
- Choose the menu item Tallies > Run XML and Database Tallies for UniProt and GO...
- Choose the UniProt and GO OBO XML files that was uploaded from the previous sections of this assignment.
- Here is the screenshot of the tally result:

Using XMLPipeDB match to Validate the XML Results from the TallyEngine

Follow the instructions found on this page to run XMLPipeDB match.

In the Thawspace directory, I created a folder called "Shigella_flexneri_BioDB_2015" and created subfolders called "Source" and "Working" to store the source files (i.e., the compressed files) and the working files (i.e., the files I will actually be processing).
As a result, I had to cd to these directories first before using the command for using Match.
- In order to change into the ThawSpace0\Shigella_flexneri_BioDB_2015\Working directory, use the following commands on the command prompt window:

T: && cd "Shigella_flexneri_BioDB_2015\Working"

The command I used once inside the directory I want is:

java -jar xmlpipedb-match-1.1.1/xmlpipedb-match-1.1.1.jar "SF[0-9][0-9][0-9][0-9]" < uniprot-proteome%3AUP000001006.xml

The results are as follows:

These results did not match up with what the TallyEngine gave (TallyEngine: 7567 vs. Match: 4610)

As a result, the commands would have to be modified somehow so that the numbers match: (CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?(/|</name>)
The overall command to write to a text file is as follows:

java -jar xmlpipedb-match-1.1.1/xmlpipedb-match-1.1.1.jar "(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?(/|</name>)" < uniprot-proteome%3AUP000001006.xml > shigella_flexneri_results.txt

Then our results became:

Observations:
- In order to lessen the number of matches, we had to add the end tag "</name>" to our regular expression. This brought down the number of matches from over 8000, to just 7517. Since TallyEngine's results were 7567, this means that 150 IDs were not being caught. In order to account for this, we had to add the genes with ID's of the form CP#### (there were 50 instances of these), and those with the form SF####.# or S####.#. This led us to get 7566 gene IDs.
- When I looked at the IDs in Microsoft Access, the IDs total 7569. In order to account for this last piece of gene formatting, we also had to account for the genes with the form SF?####/SF?####. These 2 extra genes that were not accounted for by TallyEngine is actually not supposed to be separated since the genes are formatted such that it can be interpreted that the IDs are interchangeable. When the gdb file was created, it would seem that these genes have been split down the "/".
- In other words, there are 3 ordered locus names with formatting that is different from the rest: SF2223/SF2224, S2352/S2353, and S3359/S3360.
- I wasn't able to exactly hit the number outputted by Tally Engine since there are other genes with the same format that were already caught with the patterns SF#### or S####.
- Note: It turns out the ShiBASE database only uses the pattern SF#### instead of S#### or CP#### so the regular expression would really have to be just SF?[0-9][0-9][0-9][0-9](\.[0-9])?(/|</name>)

Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine

The command used to count the number of IDs is:

select count(*) from genenametype where type = "ordered locus" and value ~ "(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?";

The result above is exactly twice as much as the number of OrderedLocusNames from TallyEngine: 15134 / 2 = 7567 IDs
A quick peek at the results after the command select value from genenametype where type = 'ordered locus' and value ~ '(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?'; is used and the results are exported to Excel reveals that this is because every single entry is entered twice:

Adding the keyword "distinct" would resolve the double counting:

select distinct value from genenametype where type = 'ordered locus' and value ~ '(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?

Analysis

The total number of OrderedLocusNames in TallyEngine is 7567.
Using the (best I could) regular expression in Match, the result is 7573. The additional 6 IDs emerged since those are originally captured by the regular expression (CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?</name> and trying to capture the IDs of the form SF?####/SF?#### would duplicate those captured IDs.
The total of entries in PostGreSQL is 15134, but this is only because each gene is repeated twice. As a result, diving by 2 would actually yield 7567.
Microsoft Access yielded 7569 in the OrderedLocusNames window. The extra 2 genes came from the IDs of the form SF?####/SF?#### since the export broke up the two IDs that represent the same ID.
- 49 are of the form CP####
- 3413 are of the form S####
  - 14 are of the form S####.#
- 4107 are of the form SF####
  - 35 are of the form SF####.#
Inspecting the UniProt XML file was necessary in identifying the IDs. Looking through what was inside, I discovered (with help from Dondi) that I had to add the end tag "</name>" in order to narrow down the results in Match

"Export" from Build 2

Name of .gdb file: Sf-Std 20151207.gdb

Date: 12/7/15
Time taken to export: 4 hours, 24 minutes and 1 second
- Start time: 9:13:45 PM PDT
- End time: 1:37:46 AM PDT
- Note: This export had to be redone since the PSQL database had twice as much entries.

Export Information (Re-imported) Build 2

Version of GenMAPP Builder:

gmbuilder-3.0.0-build-5

Computer on which export was run:

Front of the room, 3rd computer from the right.

Postgres Database name:

Shigella_flexneri_20151208

UniProt XML filename (give filename and upload and link to compressed file):

UniProt XML version (The version information can be found at the UniProt News Page): UniProt release 2015_11
UniProt XML download link: Click here
Time taken to import: 4.43 minutes
- Note:

GO OBO-XML filename (give filename and upload and link to compressed file):

GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the GO Download page has been unzipped): Version created on 11/19/2015 (at 2:24 AM)
GO OBO-XML download link: Click here to download.
Time taken to import: 6.84 minutes
Time taken to process: 5.49 minutes
- Note:

GOA filename (give filename and upload and link to compressed file):

GOA version (News on this page records past releases; current information can be found in the Last modified field on the FTP site): Version released on .
GOA download link: Click here to download.
Time taken to import: 0.06 minutes
- Note:

Name of .gdb file (give filename and upload and link to compressed file): Sf-Std 20151208.gdb

Time taken to export: 1 hour, 38 minutes, 42 seconds
- Start time: 4:30:59 PM PDT
- End time: 6:09:41 PM PDT
- Note: The reason why I had to re-import everything into a new database is because the one I have been using had some files imported twice. Thus, the reports given by PostGres were all twice as much.

Using TallyEngine

The database used is the same one described in the section above: Shigella_flexneri_20151208
Notice in the image below that there is an error in the cells. It turns out that we did not even need to add the Ordered Locus since that was the default. We will definitely need to do one last build in order to fix that issue.

Using XMLPipeDB match to Validate the XML Results from the TallyEngine

When added together, the results becomes 7566 + 3 = 7569.

Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine

The following command in PostGreSQL resulted in 7567 entries:

select value from genenametype where type = 'ordered locus' and value ~ '(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?';

The following command resulted in 214 entries:

select value from genenametype where type = 'ORF' and value ~ '(CP|SF?)(_p)?[0-9][0-9][0-9][0-9](\.[0-9])?';

OriginalRowCounts Comparison

The OrderedLocusNames row seems to report on the same number of IDs as our previous builds

Visual Inspection

Perform visual inspection of individual tables to see if there are any problems.

Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
- Yes, there are dates present for GeneOntology, InterPro, GeneID, RefSeq, UniProt, EMBL, PDB, Pfam, OrderedLocusNames, and EnsemblBacteria.
Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
- Yes, all of them seem to follow the same format (there ares more or less, 3 variations on the IDs for each of the tables).

Excel Inspection

Excel file

Observations

Through the use of an XML-reader program, called "firstObject XML Editor", it was discovered that some ordered locus IDs that were exported by GenMAPPBuilder were placed in the same tag:

These differed from the ones originally captured (7567) since these existed separately in each of the gene/name tags:

Additionally, from the IDs reported by the GenMAPP users as missing, it was revealed that these do not exist in the XML file at all, or at least in the format that we wanted. These sets of IDs were actually misnomers since, even though CTRL + F lets us find them, they are not the ordered locus names that we were looking for:

Example 1:

Example 2:

However, because of these observations, we have actually discovered ~92 IDs that existed within the XML file, albeit in a different tag than what we were using:

All these observations led us to make one final build to capture those ~92 gene IDs.

Export Information (final)

Date: 12/14/15

Name of .gdb file (give filename and upload and link to compressed file): Sf-Std 20151208.gdb

Time taken to export: 2 hour, 0 minutes, 27 seconds
- Start time: 9:35:00 PM PDT
- End time: 11:35:27 PM PDT

Using TallyEngine

The results are shown below:

Using Microsoft Access

Even though the results from the TallyEngine say different numbers, the OrderedLocusNames that were exported in the .gdb file is the following:

.gdb File

The resulting .gdb file can be downloaded here.

Gene Database Testing Report - Oregon Trail Survivors

Contents

Helpful Links

Things to note

Initial (Vanilla) Export Information

Export Information for Build with Coder Changes # 1

Build 1

TallyEngine

Using XMLPipeDB match to Validate the XML Results from the TallyEngine

Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine

Analysis

"Export" from Build 2

Export Information (Re-imported) Build 2

Using TallyEngine

Using XMLPipeDB match to Validate the XML Results from the TallyEngine

Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine

OriginalRowCounts Comparison

Visual Inspection

Excel Inspection

Observations

Export Information (final)

Using TallyEngine

Using Microsoft Access

.gdb File

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools