Msaeedi23 Week 15

From LMU BioDB 2015

TallyEngine Customization (cw20151203)

In GenMAPP Builder version 3.0.0 Build 5 - cw20151203, the Bordetella pertussis species profile was customized to import 11 ORF gene IDs that were not exported in previous versions. To account for this change, the TallyEngine was customized for Bordetella pertussis to count "ORF" gene listings separately from "Ordered Locus Names". To do this, I followed the procedure documented below:

  • First, we decided to count the "ordered locus" IDs and "ORF" IDs from the gene/name tag in the UniProt XML file.
    • In the relational database bpertussis_cw20151203_gmb3build5, gene IDs were defined by the type "ordered locus" or "ORF" in the table "genenametype".
  • Next, Brandon opened our team's branch of GenMAPP Builder in Eclipse.
  • Under edu.lmu.xmlpipedb.gmbuilder.resource.properties, he opened gmbuilder.properties.
  • I located the block of text below (it was near the bottom).
#
# wizard.properties
#
  • Brandon added the necessary customizations above this block of text. The resulting code was as follows:
# Bordetella pertussis
bordetellapertussis_level_amount=1

bordetellapertussis_element_level0=uniprot/entry/gene/name&type&ORF

bordetellapertussis_query_level0=select count(*) from genenametype where type = 'ORF'; 

bordetellapertussis_table_name_level0=ORF

#
# wizard.properties
#
  • Brandon then committed and pushed the changes in the code to Github and created a new distribution of GenMAPP Builder.
  • Using the updated build of GenMAPP Builder present in the distribution folder, we connected to the relational database bpertussis_cw20151203_gmb3build5 and ran TallyEngine. The results are pictured below:
    • Tallyenginecustomization cw20151203.png
      • The TallyEngine results successfully reflected the customizations we made to the TallyEngine, listing all 11 ORF genes in addition to the 3435 "Ordered Locus Names" gene IDs present in the Bordetella pertussis gene database.
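
As a rough illustration of what the customized tally does, the sketch below runs the same kind of count query against a toy table. It is not part of GenMAPP Builder: it uses an in-memory SQLite database instead of our PostgreSQL database, the rows are invented, and the "value" column name is an assumption (only the "type" column is referenced in the query above).

```python
import sqlite3

# Toy stand-in for the "genenametype" table. All rows are invented for
# illustration, and the "value" column name is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genenametype (value TEXT, type TEXT)")
conn.executemany(
    "INSERT INTO genenametype (value, type) VALUES (?, ?)",
    [
        ("BP0001", "ordered locus"),
        ("BP0002", "ordered locus"),
        ("BP0003", "ordered locus"),
        ("BP0001A", "ORF"),
        ("dnaA", "primary"),
    ],
)


def tally(connection, name_type):
    """Count rows of one gene name type, like the *_query_level0 statement."""
    row = connection.execute(
        "SELECT count(*) FROM genenametype WHERE type = ?", (name_type,)
    ).fetchone()
    return row[0]


orf_count = tally(conn, "ORF")
locus_count = tally(conn, "ordered locus")
total = orf_count + locus_count  # expected gene ID count for the export
print(orf_count, locus_count, total)
```

With our real database, the same arithmetic gives 11 ORF IDs plus 3435 ordered locus IDs, for an expected total of 3446.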

Testing the Bordetella pertussis Gene Database (cw20151203)

The full Gene Database Testing Report for the .gdb file tagged cw20151203 can be found here: Gene Database Testing Report- cw20151203. In assessing this gene database with Brandon, we found one gene ID that was not successfully exported into the .gdb file. A summary of this issue and the steps taken to investigate it is presented below:

  • TallyEngine Count
    • As described in the "TallyEngine Customization (cw20151203)" section of this page, the expected gene ID count including "Ordered Locus Names" and "ORF" listings was 3446.
    • This count was confirmed using the customized TallyEngine.
  • XMLPipeDB Match Count
    • With the help of Dr. Dionisio, a new regex was crafted to retrieve all possible "ordered locus" and "ORF" gene ID patterns that we identified. The XMLPipeDB Match query and result are pictured below:
      • Xmlpipedbmatch cw20151203.png
        • To our surprise, XMLPipeDB Match returned a result of 3447 gene IDs that matched our updated regex.
        • This revealed that one ID matching our regex was not successfully exported to the cw20151203 .gdb file. Further investigation was necessary.
  • XMLPipeDB Match vs. "Ordered Locus Names" from File:Bpertussis-std cw20151203.zip
    • In order to identify the missing gene ID, we compared the XMLPipeDB Match output to the gene IDs listed in the "Ordered Locus Names" table of the file File:Bpertussis-std cw20151203.zip (retrieved using Microsoft Access).
    • In Excel, the missing gene ID was identified to be BP3167A:
      • Xmlpipedbmatch vs gdb cw20151203.PNG
        • Interestingly, this gene ID had another unusual variant that we previously documented: BP3167.1.
        • Although this gene ID's pattern (BP####A) matched that of the ORF values, it was not present in the list of ORF genes retrieved in PostgreSQL (see Bklein7_Week_14).
  • Identifying "BP3167A" in the Original XML File
    • Based on our TallyEngine and PostgreSQL results, it appeared as though the gene ID "BP3167A" was not listed under the type "ordered locus" or "ORF". To determine its gene type, we opened the original XML file (File:Uniprot-proteome-UP000002676 cw20151201.zip) and searched for "BP3167A":
      • MissedID cw20151203.png
        • In the XML file, "BP3167A" was listed with the general gene type "gene ID". This specific designation had not been observed as a standalone gene type before.
        • Nevertheless, the manner in which "BP3167A" was listed in the XML file indicated that it was in fact a proper gene ID and not an artifactual finding. This necessitated further research.
  • Researching the Different Forms of "BP3167"
    • UniProt
      • Searching for "BP3167", "BP3167.1", or "BP3167A" all linked to the following gene page: http://www.uniprot.org/uniprot/Q7VUD4
        • The above page specifies that the gene ID "BP3167.1" refers to the gene ureE that codes for Urease accessory protein UreE.
    • EnsemblBacteria
      • Searching for "BP3167" and "BP3167A" retrieves two different results:
        • [BP3167]: gene ureF, which is a pseudogene.
        • [BP3167A]: gene ureE, which codes for urease accessory protein (as in UniProt).
      • Therefore, the gene ID "BP3167A" is a valid ID that refers to the same gene (ureE) as "BP3167.1" in the UniProt database.
    • Conclusion: "BP3167A" is a reference ID from EnsemblBacteria that is valid and must be exported.
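
The comparison steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regex below is a hypothetical pattern, not the one we crafted with Dr. Dionisio (that one appears only in the screenshot), the XML fragment is a simplified, invented stand-in for the UniProt file (no namespace, three entries), and the layout of the element carrying "gene ID" is assumed. The sketch collects unique regex matches, subtracts the exported IDs to find what is missing, and then reports which element holds the missing ID.

```python
import re
import xml.etree.ElementTree as ET

# Simplified, invented stand-in for the UniProt XML (assumed layout).
XML_TEXT = """<uniprot>
  <entry>
    <gene><name type="ordered locus">BP0001</name></gene>
  </entry>
  <entry>
    <gene><name type="ORF">BP0002A</name></gene>
  </entry>
  <entry>
    <gene><name type="ordered locus">BP3167.1</name></gene>
    <dbReference type="EnsemblBacteria">
      <property type="gene ID" value="BP3167A"/>
    </dbReference>
  </entry>
</uniprot>"""

# Hypothetical pattern: "BP", four digits, optional "A" or ".digit" suffix.
ID_PATTERN = re.compile(r"\bBP\d{4}(?:A|\.\d)?")

# Unique IDs found anywhere in the XML text.
matched_ids = set(ID_PATTERN.findall(XML_TEXT))

# Hypothetical list of IDs that made it into the exported gene database.
exported_ids = {"BP0001", "BP0002A", "BP3167.1"}

# IDs that match the pattern but were not exported.
missing = sorted(matched_ids - exported_ids)


def locate(xml_text, gene_id):
    """Return (tag, type attribute) for each element holding gene_id."""
    root = ET.fromstring(xml_text)
    hits = []
    for elem in root.iter():
        in_text = (elem.text or "").strip() == gene_id
        in_attr = gene_id in elem.attrib.values()
        if in_text or in_attr:
            hits.append((elem.tag, elem.get("type")))
    return hits


print(missing)
print(locate(XML_TEXT, "BP3167A"))
```

In this toy data the missing ID sits in an element typed "gene ID" rather than in a gene/name tag typed "ordered locus" or "ORF", which mirrors why our export and TallyEngine counts did not include "BP3167A".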

Gene Database Testing Report 12/10

The Gene Database Testing Report for this new gene database can be found here: Gene Database Testing Report- cw20151210.

Mahrad Saeedi

Class Whoopers Team Page