Vkuehn Week 9

1 ELECTRONIC LAB NOTEBOOK
2 TallyEngine
3 Using XMLPipeDB match to Validate the XML Results from the TallyEngine
4 Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine
5 OriginalRowCounts Comparison
6 Visual Inspection
7 .gdb Use in GenMAPP
8 Compare Gene Database to Outside Resource

ELECTRONIC LAB NOTEBOOK

UniProt XML

Go to the UniProt Complete Proteomes page.
Browse to the complete proteome download page for your species of interest. For example, to get to the Vibrio cholerae page:
Click through the results until you get to this page.
Click on the link for “complete proteome set” or “complete reference set” for the organism of interest
Click the orange Download link in the upper right-hand corner of the page.
Click to download the complete proteome set in [http://www.uniprot.org/uniprot/?

GOA

Go to the UniProt-GOA Downloads page.
The current and previous UniProt-GOA files can be downloaded from the UniProt-GOA ftp site.
In the directory that appears, click the link to the “proteomes” directory.
Find your organism of interest and right-click on the link to download the GO annotations and select “Save target as” or “Save link as” and save the GOA file. For example, this is the link for Vibrio cholerae.
- Note: Since the GOA file is a text file, your browser will not automatically download it when you left-click on the link. Instead, it will try to open the file in your browser window. Because it is a large file, this could take a long time if your internet connection is slow.
- The version information is displayed in the FTP file directory under the “Last modified” column.

Create New Database in PostgreSQL

Launch pgAdmin III.
Double-click on PostgreSQL 9.2 (localhost:5432) on the upper left hand side of the window.
Right click on Databases and select New Database... and name it
Click on your new database name in the treeview on the left.
Click on the SQL icon in the toolbar at the top of the window.
Click on the Open File icon in the toolbar (the yellow folder with an arrow).
Navigate to the folder in which you unzipped GenMAPP Builder.
Open the sql folder and open the file gmbuilder.sql. You should see SQL code appear in the SQL Editor tab.
Click the Execute Query icon which looks like a green “Play” triangle button.
This query creates all the tables in the database (although there is still no data in them).
Close the query window.

Exporting Information

Version of GenMAPP Builder: 2.0b70

Computer on which export was run: Front row left computer

Postgres Database name: vc_VK_2013.10.22gmb2b70

UniProt XML filename: Uniprot_Vibrio_choleraVK.xml

UniProt XML version (The version information can be found at the UniProt News Page):
Time taken to import: 2.79 minutes

GO OBO-XML filename: go_daily-termdbVK2013.10.22.obo-xml

GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the GO Download page has been unzipped):
Time taken to import: 5.98 minutes
Time taken to process: 4.35 minutes

GOA filename: 46.V_cholerae_ATCC_39315VK10.21.2013.goa

GOA version (News on this page records past releases; current information can be found in the Last modified field on the FTP site):
Time taken to import: 0.06 minutes
Updated the filename and had to upload directly from wiki

Name of .gdb file: Vc-Std_VK20131022.gdb

Time taken to export .gdb: 1 h 39 min
Upload your file and link to it here. File:Vc-Std VK20131022.gdb

Note: Export started at 10:33 am

Export ended at 12:12:09 pm ---Dondi (talk) 16:03, 22 October 2013 (PDT)
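
As a quick check on the export time recorded above, the elapsed time can be recomputed from the start and end times. This is a minimal sketch; it assumes the export started at exactly 10:33:00, since no seconds were recorded for the start.

```python
from datetime import datetime

# Start and end times as recorded in this notebook entry (22 October 2013).
# Assumption: the start time had zero seconds, since only 10:33 am was noted.
start = datetime(2013, 10, 22, 10, 33, 0)
end = datetime(2013, 10, 22, 12, 12, 9)

elapsed = end - start
hours, remainder = divmod(int(elapsed.total_seconds()), 3600)
minutes, seconds = divmod(remainder, 60)
print(f"{hours} h {minutes} min {seconds} s")
```

Rounded down to the minute, this agrees with the 1 h 39 min recorded for the .gdb export.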

TallyEngine

Run the TallyEngine in GenMAPP Builder and record the number of records for UniProt and GO in the XML data and in the PostgreSQL databases (or you can upload and link to a screenshot of the results).

Using XMLPipeDB match to Validate the XML Results from the TallyEngine

Follow the instructions found on this page to run XMLPipeDB match.

Are your results the same as you got for the TallyEngine? Why or why not?

At first, the Match results differed from the TallyEngine results: Match returned 2738 counts instead of 3831. Some of the tags in the data had an A before the number, and once this was accounted for, the two counts matched.
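
The effect of the optional A can be illustrated with a small Python sketch. This is not XMLPipeDB match itself. The sample text is hypothetical, and the assumption that the IDs look like VC_#### and VC_A#### is inferred from the ID forms mentioned in this notebook. Counting distinct matches is also an assumption about how the counts were compared.

```python
import re

# Hypothetical sample text; these IDs are not taken from the real UniProt XML file.
sample = "VC_0001 VC_0002 VC_A0001 VC_A0002 VC_0002"

def count_unique(pattern, text):
    """Count the distinct substrings of text that match pattern."""
    return len(set(re.findall(pattern, text)))

# A pattern with no allowance for the A misses the VC_A#### IDs.
without_a = count_unique(r"VC_[0-9]{4}", sample)
# Making the A optional picks up both forms.
with_a = count_unique(r"VC_A?[0-9]{4}", sample)
print(without_a, with_a)
```

On this toy input the first pattern finds fewer distinct IDs than the second, which is the same kind of gap seen between the initial Match count and the TallyEngine count.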

Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine

Follow the instructions on this page to query the PostgreSQL Database. The row counts from the SQL queries matched the earlier counts.
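
The kind of row-count query involved can be sketched as follows. This is only an illustration. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs without a server, and the table name, column name, and rows are all hypothetical rather than the actual GenMAPP Builder schema or data.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for the PostgreSQL database.
conn = sqlite3.connect(":memory:")
# Hypothetical table and column names, for illustration only.
conn.execute("CREATE TABLE locus_names (value TEXT)")
rows = [("VC_0001",), ("VC_0002",), ("VC_A0001",), ("VC0003",)]
conn.executemany("INSERT INTO locus_names (value) VALUES (?)", rows)

# GLOB is SQLite's pattern operator; in PostgreSQL the ~ operator with a
# regular expression would play this role.
query = """
    SELECT COUNT(*) FROM locus_names
    WHERE value GLOB 'VC_[0-9][0-9][0-9][0-9]'
       OR value GLOB 'VC_A[0-9][0-9][0-9][0-9]'
"""
(row_count,) = conn.execute(query).fetchone()
print(row_count)
conn.close()
```

The count returned by a query like this is the number that gets compared against the TallyEngine and Match results.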

OriginalRowCounts Comparison

Within the .gdb file, look at the OriginalRowCounts table to see if the database has the expected tables with the expected number of records. Compare the tables and records with a benchmark .gdb file.

Benchmark .gdb file: (for the Week 9 Assignment, use the "Vc-Std_External_20101022.gdb" as your benchmark, downloadable from here.)
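
Once both OriginalRowCounts tables have been copied out, the comparison can be done table by table. The sketch below shows one way to do that in Python. The counts are hypothetical placeholders and not the values from either .gdb file. Only the table names come from this notebook.

```python
# Hypothetical row counts; the real values come from each OriginalRowCounts table.
benchmark = {"UniProt": 100, "RefSeq": 90, "OrderedLocusNames": 95}
exported = {"UniProt": 100, "RefSeq": 90, "OrderedLocusNames": 96, "Systems": 10}

def compare_counts(old, new):
    """Return {table: (old_count, new_count)} for every table whose counts differ."""
    diffs = {}
    for table in sorted(set(old) | set(new)):
        if old.get(table) != new.get(table):
            # A count of None means the table is absent from that file.
            diffs[table] = (old.get(table), new.get(table))
    return diffs

differences = compare_counts(benchmark, exported)
for table, (old_count, new_count) in differences.items():
    print(f"{table}: benchmark={old_count} exported={new_count}")
```

Tables that appear in only one of the two files show up with a count of None on the other side.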

Copy the OriginalRowCounts table and paste it here:

Note:

Visual Inspection

Perform visual inspection of individual tables to see if there are any problems.

Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
Most of the data seems the same. The one table that differed was OrderedLocusNames, which had one ID too many. Looking at the specific ID that differed showed that one entry named VC####/VC#### was considered a single gene, but in Access it was separated into two.
We were unable to move past this step because the data we analyzed does not contain all of the gene IDs. Some IDs appeared as VC_#### and others as VC####. Our results only accounted for the form with the underscore, so we cannot upload this data until the problem is fixed.
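
The ID forms described above (VC_####, VC####, and the combined VC####/VC####) can be told apart with a short Python sketch. The IDs below are hypothetical examples, and the patterns assume that the optional A seen in the Match step can also occur in the form without the underscore.

```python
import re

# Hypothetical IDs illustrating the forms described above.
ids = ["VC_0001", "VC0002", "VC_A0003", "VC0004/VC0005"]

def classify(locus_id):
    """Label an ID by its form: underscore, no underscore, or combined."""
    if "/" in locus_id:
        return "combined"
    if re.fullmatch(r"VC_A?[0-9]{4}", locus_id):
        return "underscore"
    if re.fullmatch(r"VCA?[0-9]{4}", locus_id):
        return "no underscore"
    return "other"

labels = {locus_id: classify(locus_id) for locus_id in ids}
# Splitting a combined entry on "/" gives two IDs, which would explain one extra row.
parts = "VC0004/VC0005".split("/")
print(labels)
print(parts)
```

Tallying the labels over the full OrderedLocusNames table would show how many IDs a count based only on the underscore form leaves out.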

.gdb Use in GenMAPP

Note:

Putting a gene on the MAPP using the GeneFinder window

Try a sample ID from each of the gene ID systems. Open the Backpage and see if all of the cross-referenced IDs that are supposed to be there are there.

Note:

Creating an Expression Dataset in the Expression Dataset Manager

How many of the IDs were imported out of the total IDs in the microarray dataset? How many exceptions were there? Look in the EX.txt file and look at the error codes for the records that were not imported into the Expression Dataset. Do these represent IDs that were present in the UniProt XML, but were somehow not imported? Or were they not present in the UniProt XML?

Note:

Coloring a MAPP with expression data

Note:

Running MAPPFinder

Note:

Compare Gene Database to Outside Resource

The OrderedLocusNames IDs in the exported Gene Database are derived from the UniProt XML. It is a good idea to check how complete your list of OrderedLocusNames IDs is by comparing it with the original source of the data (the sequencing organization, the MOD, etc.). Because UniProt is a protein database, it does not reference any non-protein genome features such as genes that code for functional RNAs, centromeres, telomeres, etc.
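
One way to carry out this completeness check is to compare the two ID lists as sets. The sketch below uses hypothetical IDs. In practice one set would come from the OrderedLocusNames table and the other from the outside resource.

```python
# Hypothetical ID lists, for illustration only.
gene_database_ids = {"VC_0001", "VC_0002", "VC_0003"}
outside_resource_ids = {"VC_0001", "VC_0002", "VC_0004"}

# IDs the outside resource has that the Gene Database lacks, and the reverse.
missing_from_gdb = sorted(outside_resource_ids - gene_database_ids)
only_in_gdb = sorted(gene_database_ids - outside_resource_ids)
shared = len(gene_database_ids & outside_resource_ids)
print(missing_from_gdb, only_in_gdb, shared)
```

IDs missing from the Gene Database would be the candidates to check for non-protein features such as functional RNA genes.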

Note:
