How Do I Count Thee? Let Me Count The Ways

From LMU BioDB 2013
Jump to: navigation, search

The overall task of creating a GenMAPP Gene Database from XML sources is essentially a two-step data conversion operation. The first step converts the original XML data (UniProt, Gene Ontology) into relational database tables stored in PostgreSQL. The second step then extracts information from these relational tables, combines them with the Gene Ontology associations file (.goa) that links gene IDs to GO terms, and produces the GenMAPP Gene Database, which, as you have seen by now, is really a Microsoft Access database file.

Ideally, these steps take place with absolute fidelity, with everything transferring flawlessly, for every species that we try. Unfortunately, this is not always the case. In some situations, issues are actually unavoidable, as data that we expect in one place might actually reside in another, or the data items themselves are inconsistently entered (e.g., capitalization is different, underscores may be used instead of spaces, etc.).

It is thus important to have a certain tool set at our disposal for checking, in an automated fashion, how well our data have transferred from one source to another. That is the focus of this document.

Contents

Tally Engine

The first tool, called the Tally Engine, can be used for verifying that certain data from the XML file transferred consistently into the PostgreSQL database upon import. The Tally Engine can be found in GenMAPP Builder itself.

  1. Run PostgreSQL (via pgAdmin III on Windows) and make sure that your database is up and running.
  2. Run GenMAPP Builder and make sure that it is connected to the database (via Configure Database...).
  3. After performing an import, choose Run XML and Database Tallies for UniProt and GO....
  4. Choose the UniProt and GO files that you imported.
  5. You should see a table for selected data items, and how many of each were found.

Tally-results.png

Under the hood, the Tally Engine bases its XML counts on certain XML tags, and bases its database counts on SQL queries using count. This tool is thus primarily useful for making sure that the “raw” import worked without any errors or glitches.

XMLPipeDB Match

XMLPipeDB Match is useful for counting data in files. Thus, in our context, you would use XMLPipeDB Match to tally stuff in XML files, with greater flexibility than with Tally Engine.

You will have to use XMLPipeDB Match from the command line. In addition, you can use this in any platform (as you have seen). Download the application from the XMLPipeDB SourceForge site and take note of the location of the xmlpipedb-match-1.1.1.jar. Then, on the command line (Terminal for Linux and Mac OS X, cmd on Windows), cd to the folder containing the XML file that you would like to check. Use XMLPipeDB Match as follows, with the parts in parentheses varying depending on your specific setup, desired pattern, and file being scanned:

java -jar (location-of-jar) "(pattern)" < (XML file)

On a Windows machine, with XMLPipeDB Match and a Vibrio cholerae XML file located on the Desktop, scanning for IDs of the form VC_####, where # represents a digit from 0 to 9, one would type, after cd-ing to the Desktop:

java -jar xmlpipedb-match-1.1.1.jar "VC_[0-9][0-9][0-9][0-9]" < uniprot-46.V_cholerae_ATCC_39315.xml

As you have seen before, this will give you a list of unique matches, with a total number at the bottom.

The trick with XMLPipeDB Match is to use the patterns well: with the database project, you will mainly be matching IDs. A desired count is an XMLPipeDB Match result whose matched ID pattern corresponds to the number of IDs found by the Tally Engine.

SQL

You can also look for counts at the SQL level, using some variation of a select count(*) query. This requires some knowledge of which table received what data. Here’s an initial tip: the gene/name tags in the XML file land in the genenametype table. A query on this table counting values from this table that were marked as ordered locus in the XML file matching the pattern VC_[0-9][0-9][0-9][0-9] would look like this:

select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_[0-9][0-9][0-9][0-9]';

In pgAdmin III, you can issue these queries by clicking on the pencil/SQL icon in the toolbar, typing the query into the SQL Editor tab, then clicking on the green triangular Play button to run.

Pgadminiii-query.png

Microsoft Access

For the GenMAPP Gene Database, you can open the .gdb in Microsoft Access and navigate its tables to find counts for various IDs. Opening the table, noting its size, and doing some sorting may help. You can also look at the OriginalRowCounts table for a summary of totals.

Again, the ideal situation is a correspondence in these numbers with what you found in XML and the relational database.

Back to the Command Line

Amidst all this, you can still use grep and wc on the command line for some basic counting. Just remember that these tools work on a line-by-line basis; useful in some cases, but not useful in others.

Personal tools
Namespaces

Variants
Actions
Navigation
Toolbox