Taur.vil Week 9

From LMU BioDB 2013

Week 9 Individual Journal

Digital Notebook

Downloading, importing, and exporting

  1. Downloaded GenMAPP Builder 2.0b70 and XMLPipeDB-Match-1.1.1
  2. Uniprot file for VC was downloaded and saved as VC_2013_10_22_TVKS.xml
  3. GOA file was saved as 46.V_cholerae_ATCC_39315_TVKS_2013_10_22.goa
    • Direct download from wiki due to network connectivity problems.
  4. Downloaded GO OBO-XML and saved as go_daily-termdb_TVKS_2013_10_22.obo-xml.gz
    • Done using beta page and legacy download
  5. Opened PgAdminIII
  6. Logged into postgres and created new database titled VC_TVKS_2013_10_22_gmb2b70
  7. Used postgres function to open gmbuilder.sql in the GenMAPP Builder folder
  8. Executed the command to create tables in the database
    • Verified that 159 tables were created
  9. Launched gmbuilder-32bit.bat from the GenMAPP Builder download folder
  10. Configured database to connect to postgres on the local computer
  11. Imported UniProt XML, GO OBO-XML, and GOA data files.
    • Processed GO data after it was imported
  12. Exported a GenMAPP database: Vc-Std 20131022 TVKS gmb2b70.gdb


Testing the Data Feed

  1. Ran TallyEngine in GenMAPP Builder to compare the XML file and the database. The two matched (screenshot: Tally ENgine output.PNG).
  2. Ran XMLPipeDB Match in the command prompt
  3. Changed directory (cd) into the databases folder on the desktop
  4. Ran the following two commands to count all ordered loci. The first did not detect those with the optional A after the underscore.
    • "\Program Files (x86)\Java\jre7\bin\java" -jar xmlpipe-db-match-1.1.1.jar "VC_[0-9][0-9][0-9][0-9]" < VC_2013_10_22_TVKS.xml
    • "\Program Files (x86)\Java\jre7\bin\java" -jar xmlpipe-db-match-1.1.1.jar "VC_(A|)[0-9][0-9][0-9][0-9]" < VC_2013_10_22_TVKS.xml
    • Results matched those of the TallyEngine.
  5. Used SQL to search through the filled data tables using the command below. Found the same number as the other two methods.
    • select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_(A|)[0-9][0-9][0-9][0-9]';
  6. Opened the gdb file exported earlier in Microsoft Access
  7. The OrderedLocusNames count was one higher than expected based on the prior data.
    • Explained by GenMAPP Builder splitting apart a conjoined pair of genes
    • VC_1738 and VC_1739 were linked together and not identified as separate genes by the other methods

Export Information

Version of GenMAPP Builder: 2.0b70

Computer on which export was run: BIOL 206, back right, then my personal laptop

Postgres Database name: VC_TVKS_2013_10_22_gmb2b70

UniProt XML filename: VC_2013_10_22_TVKS.xml

  • UniProt XML version (The version information can be found at the UniProt News Page):
  • Time taken to import: 8.31 minutes (4.55 on own computer)

GO OBO-XML filename: go_daily-termdb_TVKS_2013_10_22.obo-xml

  • GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the GO Download page has been unzipped):
  • Time taken to import: 9.11 minutes on own computer
  • Time taken to process: 7.60 min on own computer

GOA filename: 46.V_cholerae_ATCC_39315_TVKS_2013_10_22.goa

  • GOA version (News on this page records past releases; current information can be found in the Last modified field on the FTP site):
  • Time taken to import: 0.08 minutes

Name of .gdb file: Vc-Std 20131022 TVKS gmb2b70.gdb

  • Time taken to export .gdb: ~3 hours (started at 20:15, finished by 23:30)

Note: The export was initially attempted on the lab computer, but it was too slow, so I switched to my own computer that evening. My personal computer was used for the rest of the week's analysis.

TallyEngine

Run the TallyEngine in GenMAPP Builder and record the number of records for UniProt and GO in the XML data and in the PostgreSQL databases (or you can upload and link to a screenshot of the results).

TallyEngine verified the expected results: the XML count matched the database count (screenshot: Tally ENgine output.PNG).

Using XMLPipeDB match to Validate the XML Results from the TallyEngine

Follow the instructions found on this page to run XMLPipeDB match.

Are your results the same as you got for the TallyEngine? Why or why not?

Using XMLPipeDB Match, we initially found 2738 ordered loci using the first command listed below. However, when the command was changed to include the VC_A loci (second command), the actual result matched the expected count of 3831.

"\Program Files (x86)\Java\jre7\bin\java" -jar xmlpipe-db-match-1.1.1.jar "VC_[0-9][0-9][0-9][0-9]" < VC_2013_10_22_TVKS.xml
"\Program Files (x86)\Java\jre7\bin\java" -jar xmlpipe-db-match-1.1.1.jar "VC_(A|)[0-9][0-9][0-9][0-9]" < VC_2013_10_22_TVKS.xml

Note: The full path to java had to be included due to technicalities in the Win8 system.
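
The difference between the two patterns can be illustrated with a small Python sketch. The sample text is made up and only stands in for the UniProt XML, and counting distinct IDs is an assumption about what matters here, not a description of how XMLPipeDB Match works internally.

```python
import re

# Made-up sample standing in for the UniProt XML; the real file is far larger.
sample_xml = """
<name type="ordered locus">VC_0001</name>
<name type="ordered locus">VC_0002</name>
<name type="ordered locus">VC_A0001</name>
<name type="ordered locus">VC_A0002</name>
<name type="ordered locus">VC_0001</name>
"""


def count_unique_matches(pattern: str, text: str) -> int:
    """Count the distinct substrings of text that match pattern."""
    return len({m.group(0) for m in re.finditer(pattern, text)})


# First pattern: a digit must follow the underscore, so VC_A IDs are skipped.
without_a = count_unique_matches(r"VC_[0-9][0-9][0-9][0-9]", sample_xml)

# Second pattern: (A|) allows either an "A" or nothing after the underscore.
with_optional_a = count_unique_matches(r"VC_(A|)[0-9][0-9][0-9][0-9]", sample_xml)

print(without_a)        # 2
print(with_optional_a)  # 4
```

On this toy input the first pattern misses both VC_A IDs, which is the same kind of shortfall seen between 2738 and 3831 in the real file.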

Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine

Follow the instructions on this page to query the PostgreSQL Database.

Our SQL query (below) found the expected 3831 ordered loci.

select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_(A|)[0-9][0-9][0-9][0-9]';
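
As a rough illustration of what this kind of query counts, here is a sketch using Python's built-in sqlite3 module in place of PostgreSQL. The table name `names` and all of its rows are hypothetical, and the `REGEXP` function is registered by hand to imitate the unanchored, case-sensitive behaviour of PostgreSQL's `~` operator.

```python
import re
import sqlite3


def regexp(pattern, value):
    # Unanchored, case-sensitive search, imitating PostgreSQL's ~ operator.
    return re.search(pattern, value) is not None


conn = sqlite3.connect(":memory:")
# SQLite evaluates "X REGEXP Y" by calling regexp(Y, X).
conn.create_function("REGEXP", 2, regexp)

# Hypothetical table and rows, for illustration only.
conn.execute("CREATE TABLE names (type TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO names VALUES (?, ?)",
    [
        ("ordered locus", "VC_0001"),
        ("ordered locus", "VC_A0001"),
        ("ordered locus", "VC_1738/VC_1739"),  # compound entry kept as one row
        ("ORF", "VC_0002"),                    # wrong type, excluded
        ("ordered locus", "b0001"),            # wrong form, excluded
    ],
)

(row_count,) = conn.execute(
    "SELECT count(*) FROM names "
    "WHERE type = 'ordered locus' AND value REGEXP ?",
    ("VC_(A|)[0-9][0-9][0-9][0-9]",),
).fetchone()

print(row_count)  # 3
```

Because the query counts rows, a compound entry such as VC_1738/VC_1739 contributes one to the total even though it names two genes.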

OriginalRowCounts Comparison

Within the .gdb file, look at the OriginalRowCounts table to see if the database has the expected tables with the expected number of records. Compare the tables and records with a benchmark .gdb file.

Benchmark .gdb file: (for the Week 9 Assignment, use the "Vc-Std_External_20101022.gdb" as your benchmark, downloadable from here).

Copy the OriginalRowCounts table and paste it here:

My newly formed gdb:

  • Table Rows
  1. Info 1
  2. Systems 30
  3. Relations 18
  4. Other 0
  5. GeneOntologyTree 97982
  6. GeneOntology 5556
  7. UniProt-GOCount 3240
  8. GeneOntologyCount 3239
  9. UniProt-GeneOntology 20464
  10. UniProt 3784
  11. Pfam 2102
  12. RefSeq 3403
  13. PDB 223
  14. InterPro 4349
  15. OrderedLocusNames 3832
  16. EMBL 228
  17. UniProt-EMBL 5452
  18. UniProt-OrderedLocusNames 3832
  19. UniProt-PDB 319
  20. UniProt-InterPro 10393
  21. UniProt-RefSeq 3635
  22. UniProt-Pfam 4648
  23. RefSeq-Pfam 4145
  24. RefSeq-InterPro 9241
  25. RefSeq-PDB 234
  26. RefSeq-OrderedLocusNames 3520
  27. RefSeq-EMBL 3669
  28. OrderedLocusNames-Pfam 4367
  29. OrderedLocusNames-InterPro 9723
  30. OrderedLocusNames-PDB 235
  31. OrderedLocusNames-EMBL 4111
  32. RefSeq-GeneOntology 18931
  33. OrderedLocusNames-GeneOntology 20613

Vc_External: Downloaded gdb

  • Table Rows
  1. Info 1
  2. Systems 30
  3. Relations 26
  4. Other 0
  5. GeneOntologyTree 35314
  6. GeneOntology 3829
  7. UniProt-GOCount 2467
  8. GeneOntologyCount 2466
  9. UniProt-GeneOntology 13289
  10. UniProt 3784
  11. Pfam 1955
  12. RefSeq 3827
  13. GeneId 3827
  14. PDB 157
  15. InterPro 3942
  16. OrderedLocusNames 7664
  17. EMBL 293
  18. UniProt-EMBL 5742
  19. UniProt-OrderedLocusNames 7664
  20. UniProt-PDB 243
  21. UniProt-InterPro 9565
  22. UniProt-GeneId 4125
  23. UniProt-RefSeq 4125
  24. UniProt-Pfam 4601
  25. RefSeq-Pfam 4263
  26. RefSeq-GeneId 3971
  27. RefSeq-InterPro 8840
  28. RefSeq-PDB 169
  29. RefSeq-OrderedLocusNames 7942
  30. RefSeq-EMBL 4260
  31. GeneId-Pfam 4263
  32. GeneId-InterPro 8840
  33. GeneId-PDB 169
  34. GeneId-OrderedLocusNames 7942
  35. GeneId-EMBL 4260
  36. OrderedLocusNames-Pfam 8538
  37. OrderedLocusNames-InterPro 17712
  38. OrderedLocusNames-PDB 338
  39. OrderedLocusNames-EMBL 8540
  40. GeneId-GeneOntology 13332
  41. RefSeq-GeneOntology 13332
  42. OrderedLocusNames-GeneOntology 26702
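
A comparison like this can also be done programmatically. The sketch below is one possible approach in Python, using only a handful of the row counts copied from the two listings above. The variable names are made up for illustration.

```python
# Subset of row counts copied from the two OriginalRowCounts listings.
new_gdb = {
    "UniProt": 3784,
    "RefSeq": 3403,
    "OrderedLocusNames": 3832,
    "GeneOntology": 5556,
    "Pfam": 2102,
}
benchmark_gdb = {
    "UniProt": 3784,
    "RefSeq": 3827,
    "GeneId": 3827,
    "OrderedLocusNames": 7664,
    "GeneOntology": 3829,
    "Pfam": 1955,
}

# Tables present in one database but not in the other.
only_in_benchmark = sorted(set(benchmark_gdb) - set(new_gdb))
only_in_new = sorted(set(new_gdb) - set(benchmark_gdb))

# For shared tables with unequal counts: new count minus benchmark count.
differences = {
    table: new_gdb[table] - benchmark_gdb[table]
    for table in new_gdb
    if table in benchmark_gdb and new_gdb[table] != benchmark_gdb[table]
}

print("Only in benchmark:", only_in_benchmark)
print("Only in new gdb:", only_in_new)
for table, diff in differences.items():
    print(f"{table}: {diff:+d}")
```

A positive difference means the new gdb has more rows for that table, and a negative one means the benchmark has more.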

Note: The downloaded database had more tables than my database. In almost all cases, the row counts of the two databases were unequal, generally with the newer dataset (the one I made) having more rows. Interestingly, 3832 ordered locus names were found in this analysis instead of 3831. This is because GenMAPP Builder has code to split compound names such as VC_1738/VC_1739, which were combined in the original data. This explains why the row count in the gdb differs from the counts found by the other methods.

(Second note: an SQL or XMLPipeDB Match search for VC_(A|)[0-9][0-9][0-9][0-9]/VC_(A|)[0-9][0-9][0-9][0-9] finds one match, the linked pair of genes that GenMAPP Builder splits apart.)
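
The effect of splitting on the count can be shown with a short Python sketch. This is not GenMAPP Builder's actual code. The function name, the "/" separator handling, and the neighbouring sample IDs are assumptions made for illustration.

```python
def split_compound_ids(values, separator="/"):
    """Return one entry per ID, splitting any value that holds several IDs."""
    ids = []
    for value in values:
        ids.extend(part.strip() for part in value.split(separator))
    return ids


# Made-up sample; only the compound entry comes from the notes above.
raw_values = ["VC_1737", "VC_1738/VC_1739", "VC_A0001"]
split_values = split_compound_ids(raw_values)

print(len(raw_values))    # 3 entries before splitting
print(len(split_values))  # 4 IDs after splitting
print(split_values)
```

One compound entry becomes two IDs, so the total rises by exactly one, matching the step from 3831 to 3832.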

Visual Inspection

Perform visual inspection of individual tables to see if there are any problems.

  1. Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
    • In the Systems table, not every gene ID system in the database has a date in the Date field.
  2. Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
    • The IDs have some minor differences between them. All the IDs produced by GenMAPP have an underscore, but that is not present in some of the other formats such as UniProt.

Compare Gene Database to Outside Resource

The OrderedLocusNames IDs in the exported Gene Database are derived from the UniProt XML. It is a good idea to check your list of OrderedLocusNames IDs to see how complete it is using the original source of the data (the sequencing organization, the MOD, etc.) Because UniProt is a protein database, it does not reference any non-protein genome features such as genes that code for functional RNAs, centromeres, telomeres, etc.

Note: The ordered locus names seem to make general sense. I am a bit confused about what some parts of the procedure did, but I feel I can work through it again and learn more about it during the group projects, when the example is not spoon-fed to us in class.

Personal Template

By Tauras Vilgalys

As part of Biological Databases


Please Remember the Harassing of Deities is Strictly Prohibited

Never Forget Samson
