Quality Assurance

From LMU BioDB 2015
Gene Database Project Links
Overview Deliverables Reference Format Guilds Project Manager GenMAPP User Quality Assurance Coder
Teams Heavy Metal HaterZ The Class Whoopers GÉNialOMICS Oregon Trail Survivors

The Quality Assurance team member is the resident expert on species ID systems and formats. He or she should be proficient with XMLPipeDB Match, SQL queries in PostgreSQL, Microsoft Excel, and Microsoft Access in order to navigate through the data, find missing IDs and discrepancies, perform sanity checks, etc.

Guild Members

Milestones

Milestone 1: Initial Database Export

  1. (with Coders) Get a full import-export cycle done.
  2. (with Coders) Decide on a file/version management scheme/system.
  3. Learn the ID systems:
    • Systems that are the same for each species (hint: guild members help each other out by posting the relevant information on this page)
      • UniProt
      • RefSeq
      • GeneID (EntrezGene from NCBI)
      • GO
    • The OrderedLocusNames for your species

Milestone 2: ID Pattern Definition and Verification

  1. Characterize regular expression patterns to detect the IDs (for filtering then counting).
    • XMLPipeDB Match utility
    • Direct SQL queries in PostgreSQL
    • For example, the Vibrio IDs were of the form VC#### or VC_####; how would you express that in Match or as an SQL query?
    • Table inspection/filtering/sorting in Microsoft Access
    • If needed, side-by-side sorted comparisons in Microsoft Excel (as described here)
  2. Document/log all work done, problems encountered, and how they were resolved.
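
As one way to approach the Vibrio question above, here is a minimal sketch in Python of a pattern that accepts both forms, followed by the "filter, then count" step. It assumes that #### means exactly four digits, and the sample IDs are made up for illustration rather than taken from a real export. The same regular expression can be used with PostgreSQL's `~` operator in a `where` clause; which table and column to match against depends on your schema.

```python
import re

# Assumption: "####" means exactly four digits; the underscore is optional.
VIBRIO_ID = re.compile(r"^VC_?[0-9]{4}$")

def count_matching(ids, pattern=VIBRIO_ID):
    """Filter a list of ID strings by the pattern, then count the matches."""
    matches = [i for i in ids if pattern.match(i)]
    return len(matches), matches

# Hypothetical sample values, not taken from a real export.
sample_ids = ["VC0001", "VC_0002", "VCA0003", "vc0004", "VC_12345", "P0A7G6"]
count, matched = count_matching(sample_ids)
print(count, matched)
```

IDs that fail the pattern (here, the ones with an extra letter, lowercase prefix, or extra digit) are worth logging, since they are often the discrepancies you are looking for.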

Milestone 3: Tally Engine Configuration

Along with your Coder, customize the Tally Engine setup for your species by following the steps under "Customizing the IDs that the Tally Engine Counts" below. You will want to add, at the very least, the ordered locus IDs for your species.

Milestone 4: Final Documentation

  1. Document the relational database schema for the gene database.
  2. Create the ReadMe with comparisons to the model organism database (MOD) for your species.

Customizing the IDs that the Tally Engine Counts

  1. First, determine which IDs (outside of the defaults that the Tally Engine already counts) you would like to count. At a minimum, this includes the ordered locus IDs from the gene/name tag in the UniProt XML file. There may be more; the QA team member should be the authority on this.
  2. For each of these IDs, determine the following:
    • Where in the XML file they can be found, in terms of which XML tags
    • Where in the relational database they can be found, in terms of which relational tables
  3. Work with the Coder to open the Eclipse project for your team’s branch of GenMAPP Builder.
  4. Under edu.lmu.xmlpipedb.gmbuilder.resource.properties, open gmbuilder.properties.
  5. Locate the block of text below (it’s near the bottom). You will insert the customizations described in the following steps immediately above this block.
#
# wizard.properties
#
  • First, mark out the section that denotes the customization for your species:
# Species name
  • Next, rewrite your species name without spaces and all lowercase (e.g., Plasmodium falciparum becomes plasmodiumfalciparum). Specify the number of additional custom IDs to count as follows, where speciesname is your no-space, all-lowercase species name, and # represents the actual number of IDs:
speciesname_level_amount=#
  • Now, for each custom ID, you need to specify three things: an element, a query, and a name. Each of these items is numbered, starting from 0. Each item number is called a level.
    1. The element states where you expect an ID to be found in the UniProt XML file. It starts with uniprot/entry, then continues with additional tags as needed. After the last tag, you may append, separated by ampersands (&), an attribute and the value that it must have (e.g., name&type&ordered locus).
    2. The query states the SQL query that you would use to count the IDs in the relational database. Write the query exactly as you would type it if you were entering it directly into the relational database.
    3. The name is a simple label: this is how you would like to identify this ID in the final Tally Engine table.
  • You can write these in any order, though existing customizations group them by element, query, and name. For example, if your species is speciesname and you only need to count ordered locus IDs, you would add:
# Species name
speciesname_level_amount=1

speciesname_element_level0=uniprot/entry/gene/name&type&ordered locus

speciesname_query_level0=select count(*) from genenametype where type = 'ordered locus';
speciesname_name_level0=Ordered Locus
  • Note how the element ends with name&type&ordered locus, because the name tag in the UniProt XML file will have different types (e.g., “primary”, “ORF”, “synonym”, “ordered locus”, etc.). For ordered locus IDs, we only want to count the name IDs whose type is “ordered locus”.
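
To make the naming rules above concrete, here is a minimal Python sketch that builds the properties lines for a species from a list of (element, query, name) levels. The function names are hypothetical helpers, not part of GenMAPP Builder, and the `_name_level#` key is inferred from the element/query/name description above; check it against an existing species entry in gmbuilder.properties before relying on it.

```python
def species_key(species_name):
    """Lowercase the species name and remove all whitespace."""
    return "".join(species_name.split()).lower()

def tally_properties(species_name, levels):
    """Build properties lines from a list of (element, query, name) tuples."""
    key = species_key(species_name)
    lines = [f"# {species_name}", f"{key}_level_amount={len(levels)}"]
    # Group by element, then query, then name, like the existing customizations.
    for index, field in enumerate(("element", "query", "name")):
        for level, item in enumerate(levels):
            lines.append(f"{key}_{field}_level{level}={item[index]}")
    return "\n".join(lines)

# One custom ID (level 0): the ordered locus IDs from the example above.
levels = [(
    "uniprot/entry/gene/name&type&ordered locus",
    "select count(*) from genenametype where type = 'ordered locus';",
    "Ordered Locus",
)]
output = tally_properties("Plasmodium falciparum", levels)
print(output)
```

Generating the lines this way keeps the level numbers and the level amount in step with each other, which is the easiest thing to get wrong when editing the file by hand.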

Once you are done with these customizations, you can test your work by building a new version of GenMAPP Builder, connecting to a relational database that already has imported data (or importing data first if needed), then running the Tally Engine. The resulting table should include, in addition to the defaults that you have seen before, the new IDs that you have added.
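
As an independent sanity check on the XML side of the tally, the following Python sketch counts the name tags whose type is "ordered locus". It is not how the Tally Engine itself counts; it assumes the file uses the standard UniProt XML namespace (http://uniprot.org/uniprot), and the sample XML, gene names, and function name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the UniProt XML file declares this default namespace.
NS = {"up": "http://uniprot.org/uniprot"}

# Tiny made-up fragment standing in for a real UniProt XML export.
SAMPLE = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <gene>
      <name type="primary">geneA</name>
      <name type="ordered locus">VC_0001</name>
    </gene>
  </entry>
  <entry>
    <gene>
      <name type="ordered locus">VC_0002</name>
      <name type="ORF">orfB</name>
    </gene>
  </entry>
</uniprot>"""

def ordered_locus_names(root):
    """Return the text of every entry/gene/name tag with type 'ordered locus'."""
    path = "up:entry/up:gene/up:name[@type='ordered locus']"
    return [name.text for name in root.findall(path, NS)]

root = ET.fromstring(SAMPLE)  # for a real file: ET.parse(filename).getroot()
names = ordered_locus_names(root)
print(len(names), names)
```

If this count disagrees with the XML count that the Tally Engine reports for the same ID, the element line in gmbuilder.properties is the first place to look.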
