Ksherbina Week 9

From LMU BioDB 2013
Jump to: navigation, search
Katrina Sherbina
Class Page    User Page
Assignment Description Week 1 Week 2 Week 3 Week 4 Week 5 Week 6 Week 7 Week 8 Week 9 Week 10 Week 11 Week 12 Week 13 Week 15
Class Journal Week 1 Week 2 Week 3 Week 4 Week 5 Week 6 Week 7 Week 8 Week 9
Individual Journal Week 2 Week 3 Week 4 Week 5 Week 6 Week 7 Week 8 Week 9 Week 10 Week 11
Other Week 5: Database Wiki
Final Project Team H(oo)KD Project Page Journal Club Presentation Project Individual Journal

Contents

Gene Database Testing Report

The following records the results of going through the procedure outlined in the Lab Notebook section below.

Export Information

Version of GenMAPP Builder: 2.0b70

Computer on which export was run: Personal computer

Postgres Database name: VC_KS_20131022_gmb2b70

UniProt XML filename: UniProt V cholerae KSTV 20131022.xml

  • UniProt XML version (The version information can be found at the UniProt News Page): UniProt release 2013_10
  • Time taken to import: 4.20 minutes

GO OBO-XML filename: go_daily-termdb_KS_20131022.obo-xml

  • GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the GO Download page has been unzipped):
  • Time taken to import: 8.81 minutes
  • Time taken to process: 7.33 minutes

GOA filename: 46.V_cholerae_ATCC_39315_KS_20131022.goa

  • GOA version (News on this page records past releases; current information can be found in the Last modified field on the FTP site): Latest version released 10/16/2013
  • Time taken to import: 0.10 minutes

Name of .gdb file: Vc-Std_KS_20131022.gdb

  • Time taken to export .gdb:
  • Start Time: 6:50 pm
  • End Time: 9:22 pm

Note:

TallyEngine

Run the TallyEngine in GenMAPP Builder and record the number of records for UniProt and GO in the XML data and in the PostgreSQL databases (or you can upload and link to a screenshot of the results):

TallyEngine Results KS 20131024.jpg

Using XMLPipeDB match to Validate the XML Results from the TallyEngine

Follow the instructions found on this page to run XMLPipeDB match.

Are your results the same as you got for the TallyEngine? Why or why not?

  • When searching for gene IDs with the pattern "VC_(A|)[0-9][0-9][0-9][0-9]", the results were the same as those for TallyEngine. In both cases, 3831 matches were found. However, when first searching for the pattern "VC_[0-9][0-9][0-9][0-9]", only 2738 matches were found.

Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine

Follow the instructions on this page to query the PostgreSQL Database.

  • There were 3831 matches found using an SQL query (the same as with TallyEngine).

OriginalRowCounts Comparison

Within the .gdb file, look at the OriginalRowCounts table to see if the database has the expected tables with the expected number of records. Compare the tables and records with a benchmark .gdb file.

Benchmark .gdb file: (for the Week 9 Assignment, use the "Vc-Std_External_20101022.gdb" as your benchmark, downloadable from here.

Copy the OriginalRowCounts table and paste it here:

  • Table Rows
  • EMBL 228
  • GeneOntology 5556
  • GeneOntologyCount 3239
  • GeneOntologyTree 97982
  • Info 1
  • InterPro 4349
  • OrderedLocusNames 3832
  • OrderedLocusNames-EMBL 4111
  • OrderedLocusNames-GeneOntology 20613
  • OrderedLocusNames-InterPro 9723
  • OrderedLocusNames-PDB 235
  • OrderedLocusNames-Pfam 4367
  • Other 0
  • PDB 223
  • Pfam 2102
  • RefSeq 3403
  • RefSeq-EMBL 3669
  • RefSeq-GeneOntology 18931
  • RefSeq-InterPro 9241
  • RefSeq-OrderedLocusNames 3520
  • RefSeq-PDB 234
  • RefSeq-Pfam 4145
  • Relations 18
  • Systems 30
  • UniProt 3784
  • UniProt-EMBL 5452
  • UniProt-GeneOntology 20464
  • UniProt-GOCount 3240
  • UniProt-InterPro 10393
  • UniProt-OrderedLocusNames 3832
  • UniProt-PDB 319
  • UniProt-Pfam 4648
  • UniProt-RefSeq 3635

Note:

Visual Inspection

Perform visual inspection of individual tables to see if there are any problems.

  • Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
No, not all of the gene ID systems have a data listed in the Date field.
  • Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
The IDs in the UniProt and the RefSeq tables look to be in the right form. However, the IDs in the OrderedLocusNames have underscores, which throws off the count for the gene IDs. Access counts 3832 gene IDs while the other methods counted 3831 IDs.

Note:

.gdb Use in GenMAPP

Note: I tried to create an expression dataset using the Merrell et al. data normalized last week and the gene database created this week. As expected, none of the genes in the data matched the genes in the database because the IDs in the latter have underscores while the IDs in the former do not.

Putting a gene on the MAPP using the GeneFinder window

  • Try a sample ID from each of the gene ID systems. Open the Backpage and see if all of the cross-referenced IDs that are supposed to be there are there.

Note:

Creating an Expression Dataset in the Expression Dataset Manager

  • How many of the IDs were imported out of the total IDs in the microarray dataset? How many exceptions were there? Look in the EX.txt file and look at the error codes for the records that were not imported into the Expression Dataset. Do these represent IDs that were present in the UniProt XML, but were somehow not imported? or were they not present in the UniProt XML?

Note:

Coloring a MAPP with expression data

Note:

Running MAPPFinder

Note:

Compare Gene Database to Outside Resource

The OrderedLocusNames IDs in the exported Gene Database are derived from the UniProt XML. It is a good idea to check your list of OrderedLocusNames IDs to see how complete it is using the original source of the data (the sequencing organization, the MOD, etc.) Because UniProt is a protein database, it does not reference any non-protein genome features such as genes that code for functional RNAs, centromeres, telomeres, etc.

Note:

Lab Notebook

  • When the zipped file has downloaded, unzip the folder with 7-zip.
  • When the zipped file has downloaded, unzip the folder with 7-zip.
  • Download the UniProt XML file for Vibrio cholerae
  • When the zipped file has downloaded, unzip the folder with 7-zip.

Import Data into the PostgreSQL Database

  • Launch pgAdmin III.
  • Double-click on PostgreSQL 9.2 (localhost:5432) on the upper left hand side of the window. Type in the password to connect.
  • Right click on Databases and select New Database...
  • Name the new database in the following format and click OK: VC_<your initials>_20131022_gmb2b70 (ex. VC_KS_20131022_gmb2b70).
  • Click on your new database name in the treeview on the left.
  • Click on the SQL icon in the toolbar at the top of the window.
  • Click on the Open File icon in the toolbar (the yellow folder with an arrow).
  • Navigate to the folder in which you unzipped GenMAPP Builder.
  • Open the sql folder and open the file gmbuilder.sql. You should see SQL code appear in the SQL Editor tab.
  • Click the Execute Query icon which looks like a green “Play” triangle button.
  • Close the query window.
  • To double check that all is OK, click the + sign for the database, then the + sign for Schemas, then finally the + sign for public. Under the Tables section, you should see a count of 159 in parentheses.

Export a GenMAPP Gene Database Using GenMAPP Builder

  • Keep pgAdminIII open. Go to the gmbuilder-2.0b70 folder in the Downloads folder and launch gmbuilder-32bit.bat
  • Select the menu item File > Configure Database...
Using GenMAPP Builder KS 20131024.jpg
  • Under the Database Connections tab,
  • The Database Driver defaults to PostgreSQL. Enter the following information into the corresponding fields:
  • Host or address: localhost
  • Port number: 5432
  • Database name: VC_KS_20131022_gmb2b70
  • Username: postgres
  • Password: <enter the password of the PostgreSQL database you created above>
  • Click the OK button.
  • Select File > Import UniProt XML...
  • Navigate to the UniProt XML file that you extracted previously and click the Open button.
  • Select File > Import GO OBO-XML...
  • Navigate to the GO OBO-XML file that you extracted previously. Click the Open button.
  • Click OK to the message asking you to process the GO data.
  • Select File > Import GOA...
  • Navigate to the GOA file that you downloaded previously and click the Import button.
  • Select File > Export to GenMAPP Gene Database...
  • Type a name in the Owner field (ex. LMU_Fall2013_BIOL367_KS).
  • GenMAPP Builder scans your PostgreSQL database to see what species are available. Click on the species that you would like to export, then click Next to continue.
Using the Export Wizard in GenMAPP Builder KS 20131024.jpg
  • Click on the Save GenMAPP Database File As... button. Make sure that the Downloads folder appears. Modify the default file name to include your initials and then click on Save.
Using the Export Wizard in GenMAPP Builder Second Window KS 20131024.jpg
  • Click the Next button to start the export process.

Check the Quality of the Exported Database

  • After performing an export, choose Run XML and Database Tallies for UniProt and GO....
  • Choose the UniProt and GO files that you imported.
Use Tally Engine in GenMAPP Builder KS 20131024.jpg
  • Open the Command Prompt on your computer.
  • Change the directory to where the XMLPipeDB match tool is located.
cd C:\Users\Katrina\Downloads
  • Use XMLPipeDB Match to count the number of unique gene ideas in the V. cholerae Uniprot XML file.
java -jar xmlpipedb-match-1.1.1.jar "VC_[0-9][0-9][0-9][0-9]" < UniProt_V_cholerae_KSTV_20131022
  • Since some of the gene IDs are in the format "VC_A####" rather than "VC_####", try the following command to determine the number of unique gene IDs.
java -jar xmlpipedb-match-1.1.1.jar "VC_(A|)[0-9][0-9][0-9][0-9]" < UniProt_V_cholerae_KSTV_20131022
  • Go back to pgAdminIII (it should still be open).
  • Click on the button to execute SQL queries.
  • Count the number of unique gene IDs using the following query:
select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_(A|)[0-9][0-9][0-9][0-9]'
  • Open Microsoft Access.
  • Open the .gdb file (Vc-Std_KS_20131022.gdb) that was exported using GenMAPP Builder.
  • In the Tables window, double-click on the table named OriginalRowCounts and look at the row titled OrderedLocusNames to see how many gene IDs Microsoft Access detected in the file.
Personal tools
Namespaces

Variants
Actions
Navigation
Toolbox