Ksherbina Week 9

Katrina Sherbina

Class Page User Page

Assignment Description

Week 1

Week 2

Week 3

Week 4

Week 5

Week 6

Week 7

Week 8

Week 9

Week 10

Week 11

Week 12

Week 13

Week 15

Class Journal

Week 1

Week 2

Week 3

Week 4

Week 5

Week 6

Week 7

Week 8

Week 9

Individual Journal

Week 2

Week 3

Week 4

Week 5

Week 6

Week 7

Week 8

Week 9

Week 10

Week 11

Other

Week 5: Database Wiki

Final Project

Team H(oo)KD Project Page

Journal Club Presentation

Project Individual Journal

Gene Database Testing Report

The following records the results of going through the procedure outlined in the Lab Notebook section below.

Export Information

Version of GenMAPP Builder: 2.0b70

Computer on which export was run: Personal computer

Postgres Database name: VC_KS_20131022_gmb2b70

UniProt XML filename: UniProt V cholerae KSTV 20131022.xml

UniProt XML version (The version information can be found at the UniProt News Page): UniProt release 2013_10
Time taken to import: 4.20 minutes

GO OBO-XML filename: go_daily-termdb_KS_20131022.obo-xml

GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the GO Download page has been unzipped):
Time taken to import: 8.81 minutes
Time taken to process: 7.33 minutes

GOA filename: 46.V_cholerae_ATCC_39315_KS_20131022.goa

GOA version (News on this page records past releases; current information can be found in the Last modified field on the FTP site): Latest version released 10/16/2013
Time taken to import: 0.10 minutes

Name of .gdb file: Vc-Std_KS_20131022.gdb

Time taken to export .gdb:

Start Time: 6:50 pm
End Time: 9:22 pm

Link to file: Vc-Std KS 20131022.gdb

Note:

TallyEngine

Run the TallyEngine in GenMAPP Builder and record the number of records for UniProt and GO in the XML data and in the PostgreSQL databases (or you can upload and link to a screenshot of the results):

Using XMLPipeDB match to Validate the XML Results from the TallyEngine

Follow the instructions found on this page to run XMLPipeDB match.

Are your results the same as you got for the TallyEngine? Why or why not?

When searching for gene IDs with the pattern "VC_(A|)[0-9][0-9][0-9][0-9]", the results were the same as those for TallyEngine. In both cases, 3831 matches were found. However, when first searching for the pattern "VC_[0-9][0-9][0-9][0-9]", only 2738 matches were found.

Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine

Follow the instructions on this page to query the PostgreSQL Database.

There were 3831 matches found using an SQL query (the same as with TallyEngine).

OriginalRowCounts Comparison

Within the .gdb file, look at the OriginalRowCounts table to see if the database has the expected tables with the expected number of records. Compare the tables and records with a benchmark .gdb file.

Benchmark .gdb file: (for the Week 9 Assignment, use the "Vc-Std_External_20101022.gdb" as your benchmark, downloadable from here.

Copy the OriginalRowCounts table and paste it here:

Table Rows
EMBL 228
GeneOntology 5556
GeneOntologyCount 3239
GeneOntologyTree 97982
Info 1
InterPro 4349
OrderedLocusNames 3832
OrderedLocusNames-EMBL 4111
OrderedLocusNames-GeneOntology 20613
OrderedLocusNames-InterPro 9723
OrderedLocusNames-PDB 235
OrderedLocusNames-Pfam 4367
Other 0
PDB 223
Pfam 2102
RefSeq 3403
RefSeq-EMBL 3669
RefSeq-GeneOntology 18931
RefSeq-InterPro 9241
RefSeq-OrderedLocusNames 3520
RefSeq-PDB 234
RefSeq-Pfam 4145
Relations 18
Systems 30
UniProt 3784
UniProt-EMBL 5452
UniProt-GeneOntology 20464
UniProt-GOCount 3240
UniProt-InterPro 10393
UniProt-OrderedLocusNames 3832
UniProt-PDB 319
UniProt-Pfam 4648
UniProt-RefSeq 3635

Note:

Visual Inspection

Perform visual inspection of individual tables to see if there are any problems.

Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?

No, not all of the gene ID systems have a data listed in the Date field.

Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?

The IDs in the UniProt and the RefSeq tables look to be in the right form. However, the IDs in the OrderedLocusNames have underscores, which throws off the count for the gene IDs. Access counts 3832 gene IDs while the other methods counted 3831 IDs.

Note:

.gdb Use in GenMAPP

Note: I tried to create an expression dataset using the Merrell et al. data normalized last week and the gene database created this week. As expected, none of the genes in the data matched the genes in the database because the IDs in the latter have underscores while the IDs in the former do not.

Putting a gene on the MAPP using the GeneFinder window

Try a sample ID from each of the gene ID systems. Open the Backpage and see if all of the cross-referenced IDs that are supposed to be there are there.

Note:

Creating an Expression Dataset in the Expression Dataset Manager

How many of the IDs were imported out of the total IDs in the microarray dataset? How many exceptions were there? Look in the EX.txt file and look at the error codes for the records that were not imported into the Expression Dataset. Do these represent IDs that were present in the UniProt XML, but were somehow not imported? or were they not present in the UniProt XML?

Note:

Coloring a MAPP with expression data

Note:

Running MAPPFinder

Note:

Compare Gene Database to Outside Resource

The OrderedLocusNames IDs in the exported Gene Database are derived from the UniProt XML. It is a good idea to check your list of OrderedLocusNames IDs to see how complete it is using the original source of the data (the sequencing organization, the MOD, etc.) Because UniProt is a protein database, it does not reference any non-protein genome features such as genes that code for functional RNAs, centromeres, telomeres, etc.

Note:

Lab Notebook

Download GenMAPP Builder 2.0b70 from http://sourceforge.net/projects/xmlpipedb/files/GenMAPP%20Builder/

When the zipped file has downloaded, unzip the folder with 7-zip.

Download XMLPipeDB Match from http://sourceforge.net/projects/xmlpipedb/files/XMLPipeDB%20Match/

When the zipped file has downloaded, unzip the folder with 7-zip.

Download the UniProt XML file for Vibrio cholerae

Go to the page http://www.uniprot.org/uniprot/?query=organism:243277+keyword:1185
Click the orange Download link in the upper right-hand corner of the page.
Download the XML file to the "Downloads" folder on the computer.

Download the file 46.V_cholerae_ATCC_39315.goa to the "Downloads" folder on the computer.
Download the GO OBO-XML formatted file from the Gene Ontology download page to the "Downloads" folder on the computer. Click on the link for obo-xml.gz.

When the zipped file has downloaded, unzip the folder with 7-zip.

Import Data into the PostgreSQL Database

Launch pgAdmin III.
Double-click on PostgreSQL 9.2 (localhost:5432) on the upper left hand side of the window. Type in the password to connect.
Right click on Databases and select New Database...
Name the new database in the following format and click OK: VC_<your initials>_20131022_gmb2b70 (ex. VC_KS_20131022_gmb2b70).
Click on your new database name in the treeview on the left.
Click on the SQL icon in the toolbar at the top of the window.
Click on the Open File icon in the toolbar (the yellow folder with an arrow).
Navigate to the folder in which you unzipped GenMAPP Builder.
Open the sql folder and open the file gmbuilder.sql. You should see SQL code appear in the SQL Editor tab.
Click the Execute Query icon which looks like a green “Play” triangle button.
Close the query window.

To double check that all is OK, click the + sign for the database, then the + sign for Schemas, then finally the + sign for public. Under the Tables section, you should see a count of 159 in parentheses.

Export a GenMAPP Gene Database Using GenMAPP Builder

Keep pgAdminIII open. Go to the gmbuilder-2.0b70 folder in the Downloads folder and launch gmbuilder-32bit.bat
Select the menu item File > Configure Database...

Under the Database Connections tab,

The Database Driver defaults to PostgreSQL. Enter the following information into the corresponding fields:

Host or address: localhost
Port number: 5432
Database name: VC_KS_20131022_gmb2b70
Username: postgres
Password: <enter the password of the PostgreSQL database you created above>

Click the OK button.
Select File > Import UniProt XML...

Navigate to the UniProt XML file that you extracted previously and click the Open button.

Select File > Import GO OBO-XML...

Navigate to the GO OBO-XML file that you extracted previously. Click the Open button.
Click OK to the message asking you to process the GO data.

Select File > Import GOA...

Navigate to the GOA file that you downloaded previously and click the Import button.

Select File > Export to GenMAPP Gene Database...

Type a name in the Owner field (ex. LMU_Fall2013_BIOL367_KS).
GenMAPP Builder scans your PostgreSQL database to see what species are available. Click on the species that you would like to export, then click Next to continue.

Click on the Save GenMAPP Database File As... button. Make sure that the Downloads folder appears. Modify the default file name to include your initials and then click on Save.

Click the Next button to start the export process.

Check the Quality of the Exported Database

After performing an export, choose Run XML and Database Tallies for UniProt and GO....
Choose the UniProt and GO files that you imported.

Open the Command Prompt on your computer.
Change the directory to where the XMLPipeDB match tool is located.

cd C:\Users\Katrina\Downloads

Use XMLPipeDB Match to count the number of unique gene ideas in the V. cholerae Uniprot XML file.

java -jar xmlpipedb-match-1.1.1.jar "VC_[0-9][0-9][0-9][0-9]" < UniProt_V_cholerae_KSTV_20131022

Since some of the gene IDs are in the format "VC_A####" rather than "VC_####", try the following command to determine the number of unique gene IDs.

java -jar xmlpipedb-match-1.1.1.jar "VC_(A|)[0-9][0-9][0-9][0-9]" < UniProt_V_cholerae_KSTV_20131022

Go back to pgAdminIII (it should still be open).
Click on the button to execute SQL queries.
Count the number of unique gene IDs using the following query:

select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_(A|)[0-9][0-9][0-9][0-9]'

Open Microsoft Access.
Open the .gdb file (Vc-Std_KS_20131022.gdb) that was exported using GenMAPP Builder.
In the Tables window, double-click on the table named OriginalRowCounts and look at the row titled OrderedLocusNames to see how many gene IDs Microsoft Access detected in the file.

Ksherbina Week 9

Contents

Gene Database Testing Report

Export Information

TallyEngine

Using XMLPipeDB match to Validate the XML Results from the TallyEngine

Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine

OriginalRowCounts Comparison

Visual Inspection

.gdb Use in GenMAPP

Putting a gene on the MAPP using the GeneFinder window

Creating an Expression Dataset in the Expression Dataset Manager

Coloring a MAPP with expression data

Running MAPPFinder

Compare Gene Database to Outside Resource

Lab Notebook

Import Data into the PostgreSQL Database

Export a GenMAPP Gene Database Using GenMAPP Builder

Check the Quality of the Exported Database

Personal tools

Namespaces

Variants

Views

Actions

Search

Navigation

Toolbox