Bklein7 Week 15

From LMU BioDB 2015

TallyEngine Customization (cw20151203)

In GenMAPP Builder version 3.0.0 Build 5 - cw20151203, the Bordetella pertussis species profile was customized to import 11 ORF gene IDs that were not exported in previous versions. To account for this change, the TallyEngine was customized for Bordetella pertussis to count "ORF" gene listings separately from "Ordered Locus Names". To do this, I followed the procedure documented below:

  • First, I determined that I wanted to count the "ordered locus" IDs and "ORF" IDs from the gene/name tag in the UniProt XML file.
    • In the relational database bpertussis_cw20151203_gmb3build5, gene IDs were defined by the type "ordered locus" or "ORF" in the table "genenametype".
  • Next, I opened our team's branch of GenMAPP Builder in Eclipse.
  • Under edu.lmu.xmlpipedb.gmbuilder.resource.properties, I opened gmbuilder.properties.
  • I located the block of text below (it was near the bottom).
#
# wizard.properties
#
  • I added the necessary customizations above this block of text. The resulting code was as follows:
# Bordetella pertussis
bordetellapertussis_level_amount=1

bordetellapertussis_element_level0=uniprot/entry/gene/name&type&ORF

bordetellapertussis_query_level0=select count(*) from genenametype where type = 'ORF'; 

bordetellapertussis_table_name_level0=ORF

#
# wizard.properties
#
  • I committed and pushed the code changes to GitHub and then created a new distribution of GenMAPP Builder. This distribution was uploaded as an updated version of the following file: File:Dist cw20151203.zip.
  • Using the updated build of GenMAPP Builder present in the distribution folder, I connected to the relational database bpertussis_cw20151203_gmb3build5 and ran the TallyEngine. The results are pictured below:
    • Tallyenginecustomization cw20151203.png
      • The TallyEngine results reflected the customization, listing the 11 ORF gene IDs separately from the 3435 "Ordered Locus Names" gene IDs present in the Bordetella pertussis gene database.
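
To make the structure of the four custom keys easier to see, here is a minimal Java sketch that loads them with java.util.Properties and splits the element entry on "&". The class and method names are hypothetical, and the reading of the element entry as "XML path & attribute & attribute value" is my interpretation of the keys above, not a description of how TallyEngine parses them internally.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Illustrative only: not part of GenMAPP Builder. */
final class TallyPropertiesSketch {

    // The same four keys that were added to gmbuilder.properties.
    static final String CUSTOM_KEYS = String.join("\n",
        "bordetellapertussis_level_amount=1",
        "bordetellapertussis_element_level0=uniprot/entry/gene/name&type&ORF",
        "bordetellapertussis_query_level0=select count(*) from genenametype where type = 'ORF';",
        "bordetellapertussis_table_name_level0=ORF");

    /** Parses properties text held in memory. */
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Assumed layout of an element entry: path & attribute & value. */
    static String[] splitElementSpec(String spec) {
        return spec.split("&");
    }

    public static void main(String[] args) {
        Properties props = load(CUSTOM_KEYS);
        int levels = Integer.parseInt(props.getProperty("bordetellapertussis_level_amount"));
        for (int i = 0; i < levels; i++) {
            String[] parts = splitElementSpec(
                props.getProperty("bordetellapertussis_element_level" + i));
            System.out.println("XML path:        " + parts[0]);
            System.out.println("Attribute:       " + parts[1]);
            System.out.println("Attribute value: " + parts[2]);
            System.out.println("Database query:  "
                + props.getProperty("bordetellapertussis_query_level" + i));
            System.out.println("Row label:       "
                + props.getProperty("bordetellapertussis_table_name_level" + i));
        }
    }
}
```

Read this way, the XML side of the tally counts gene/name elements whose type attribute is "ORF", and the database side counts the matching rows in genenametype, so the two numbers can be compared directly.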

Testing the Bordetella pertussis Gene Database (cw20151203)

The full Gene Database Testing Report for the .gdb file tagged cw20151203 can be found here: Gene Database Testing Report- cw20151203. In assessing this gene database with Mahrad, we found one gene ID that was not successfully exported into the .gdb file. A summary of this issue and the steps taken to investigate it is presented below:

  • TallyEngine Count
    • As described in the "TallyEngine Customization (cw20151203)" section of this page, the expected gene ID count, including both "Ordered Locus Names" and "ORF" listings, was 3446 (3435 + 11).
    • This count was confirmed using the customized TallyEngine.
  • XMLPipeDB Match Count
    • With the help of Dr. Dionisio, a new regex was crafted to retrieve all possible "ordered locus" and "ORF" gene ID patterns that we identified. The XMLPipeDB Match query and result are pictured below:
      • Xmlpipedbmatch cw20151203.png
        • To our surprise, XMLPipeDB Match returned 3447 gene IDs that matched our updated regex.
        • This revealed that one ID matching our regex was not successfully exported to the cw20151203 .gdb file. Further investigation was necessary.
  • XMLPipeDB Match vs. "Ordered Locus Names" from File:Bpertussis-std cw20151203.zip
    • In order to identify the missing gene ID, we compared the XMLPipeDB Match output to the gene IDs listed in the "Ordered Locus Names" table of the file File:Bpertussis-std cw20151203.zip (retrieved using Microsoft Access).
    • In Excel, the missing gene ID was identified to be BP3167A:
      • Xmlpipedbmatch vs gdb cw20151203.PNG
        • Interestingly, this gene ID had another unusual variant that we previously documented: BP3167.1.
        • Although this gene ID's pattern (BP####A) matched that of the ORF values, it was not present in the list of ORF genes retrieved in PostgreSQL (see Bklein7_Week_14).
  • Identifying "BP3167A" in the Original XML File
    • Based on our TallyEngine and PostgreSQL results, it appeared as though the gene ID "BP3167A" was not listed under the type "ordered locus" or "ORF". To determine its gene type, we opened the original XML file (File:Uniprot-proteome-UP000002676 cw20151201.zip) and searched for "BP3167A":
      • MissedID cw20151203.png
        • In the XML file, "BP3167A" was listed with the general gene type "gene ID". This specific designation had not been observed as a standalone gene type before.
        • Nevertheless, the manner in which "BP3167A" was listed in the XML file indicated that it was in fact a proper gene ID and not an artifactual finding. This necessitated further research.
  • Researching the Different Forms of "BP3167"
    • UniProt
      • Searching for "BP3167", "BP3167.1", or "BP3167A" led in every case to the following gene page: http://www.uniprot.org/uniprot/Q7VUD4
        • The above page specifies that the gene ID "BP3167.1" refers to the gene ureE that codes for Urease accessory protein UreE.
    • EnsemblBacteria
      • Searching for "BP3167" and "BP3167A" retrieves two different results:
        • [BP3167]: gene ureF, which is a pseudogene.
        • [BP3167A]: gene ureE, which codes for urease accessory protein (as in UniProt).
      • Therefore, the gene ID "BP3167A" is a valid ID that refers to the same gene (ureE) as "BP3167.1" in the UniProt database.
    • Conclusion: "BP3167A" is a reference ID from EnsemblBacteria that is valid and must be exported.
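
The comparison we carried out in Excel amounts to a set difference: every ID in the XMLPipeDB Match output that does not appear in the exported table. A minimal Java sketch of that check follows. The class name is hypothetical, and the ID lists are small stand-ins rather than the real 3447 and 3446 IDs ("BP0001" is a placeholder).

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: finds IDs that were matched in the XML but never exported. */
final class MissingIdSketch {

    /** Returns the IDs in `matched` that are absent from `exported`, in sorted order. */
    static Set<String> missingFromExport(Collection<String> matched,
                                         Collection<String> exported) {
        Set<String> missing = new TreeSet<>(matched);
        missing.removeAll(exported);
        return missing;
    }

    public static void main(String[] args) {
        // Counts reported on this page.
        int expected = 3435 + 11;        // ordered locus + ORF = 3446
        int matchedCount = 3447;         // XMLPipeDB Match result
        System.out.println("Unaccounted IDs: " + (matchedCount - expected)); // 1

        // Stand-in lists; the real ones came from XMLPipeDB Match and the .gdb file.
        List<String> matched = Arrays.asList("BP0001", "BP3167", "BP3167A");
        List<String> exported = Arrays.asList("BP0001", "BP3167");
        System.out.println("Missing from export: " + missingFromExport(matched, exported));
    }
}
```

With the stand-in lists, the only ID reported as missing is "BP3167A", which mirrors the result we obtained from the full lists.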

Editing the Bordetella pertussis Species Profile

I presented our findings regarding "BP3167A" to Drs. Dionisio & Dahlquist. Dr. Dionisio helped me locate "BP3167A" in the .gdb file and code the changes to my branch of GenMAPP Builder needed to import this ID. In doing so, we found that "BP3167A" was listed in a table of reference gene IDs from gene databases other than UniProt (such as EnsemblBacteria). Even when we restricted the list to EnsemblBacteria reference IDs, it still contained over 3,300 IDs, most of which duplicated the UniProt IDs that we already exported. This made it difficult to add just "BP3167A" to the export. However, we came up with a solution that took advantage of its relatively rare pattern, BP####A. This pattern otherwise corresponded only to 10 ORF gene IDs that we had previously expanded the code to import (see Bklein7_Week_14). Therefore, if we deleted the method block that imported the ORF values and introduced a new method block that imported all 11 ORF gene IDs plus "BP3167A" based on their distinctive patterns (BP####A & BP####B), we would add "BP3167A" to our export without introducing any duplicates. This was somewhat of a lucky fix, but Dr. Dionisio will work on a more fundamental fix for similar issues in the near future. Our findings in PostgreSQL and the changes we made to the code are presented below:

  • PostgreSQL Query to Retrieve All 11 ORF gene IDs + "BP3167A" (created by Dr. Dionisio):
select propertytype.value from propertytype inner join dbreferencetype on
   (propertytype.dbreferencetype_property_hjid = dbreferencetype.hjid)
   where dbreferencetype.type = 'EnsemblBacteria' and propertytype.type = 'gene ID'
   and propertytype.value ~ 'BP[0-9][0-9][0-9][0-9](A|B)' order by propertytype.value;
  • Old ORF Method Block that was Deleted:
@Override
public TableManager getSystemsTableManagerCustomizations(TableManager tableManager, DatabaseProfile dbProfile) {
    super.getSystemsTableManagerCustomizations(tableManager, dbProfile);
    tableManager.submit("Systems", QueryType.update, new String[][] {
        { "SystemCode", "N" },
        { "Species", "|" + getSpeciesName() + "|" }
    });

    tableManager.submit("Systems", QueryType.update, new String[][] {
        { "SystemCode", "N" },
        { "Link", "http://www.genedb.org/gene/~;jsessionid=A06A0EFE93C64E476380393D4CBEFA69?actionName=%2FQuery%2FquickSearch&resultsSize=1&taxonNodeName=Bpertussis" }
    });

    return tableManager;
}
  • New Method Block that Imports the ORF Gene IDs and "BP3167A" by Regex:
    @Override
    public TableManager getSystemTableManagerCustomizations(TableManager tableManager,
            TableManager primarySystemTableManager, Date version) throws SQLException, InvalidParameterException {
        // Start with the default OrderedLocusNames behavior.
        TableManager result = super.getSystemTableManagerCustomizations(tableManager, primarySystemTableManager,
                version);

        String sqlQuery = "select dbreferencetype.entrytype_dbreference_hjid as hjid, propertytype.value from propertytype inner join dbreferencetype on " +
                "(propertytype.dbreferencetype_property_hjid = dbreferencetype.hjid) " +
                "where dbreferencetype.type = 'EnsemblBacteria' and propertytype.type = 'gene ID' " +
                "and propertytype.value ~ 'BP[0-9][0-9][0-9][0-9](A|B)' order by propertytype.value";

        Connection c = ConnectionManager.getRelationalDBConnection();
        PreparedStatement ps;
        ResultSet rs;
        try {
            // Query, iterate, add to table manager.
            ps = c.prepareStatement(sqlQuery);
            rs = ps.executeQuery();
            while (rs.next()) {
                String hjid = Long.valueOf(rs.getLong("hjid")).toString();
                String id = rs.getString("value");
                result.submit("OrderedLocusNames", QueryType.insert, new Object[][] {
                    { "ID", id },
                    { "Species", "|" + getSpeciesName() + "|" },
                    { "Date", version },
                    { "UID", hjid }
                });
            }
        } catch(SQLException sqlexc) {
            logSQLException(sqlexc, sqlQuery);
        }

        return result;
    }

    private void logSQLException(SQLException sqlexc, String sqlQuery) {
        LOG.error("Exception trying to execute query: " + sqlQuery);
        while (sqlexc != null) {
            LOG.error("Error code: [" + sqlexc.getErrorCode() + "]");
            LOG.error("Error message: [" + sqlexc.getMessage() + "]");
            LOG.error("Error SQL State: [" + sqlexc.getSQLState() + "]");
            sqlexc = sqlexc.getNextException();
        }
    }

    private static final Log LOG = LogFactory.getLog(BordetellaPertussisUniProtSpeciesProfile.class);

}
  • Version Update
    • After implementing the above change to the code, the GenMAPP Builder version was changed to "3.0.0-build-5-cw20151210".
  • GitHub Update
    • The updated code was committed and pushed to GitHub using the procedure outlined in Bklein7_Week_14.
  • Creating & Exporting a New Distribution of GenMAPP Builder
    • A new distribution of version "3.0.0-build-5-cw20151210" of GenMAPP Builder was created using the procedure outlined in Bklein7_Week_14.
    • The new distribution was exported and uploaded to the wiki: File:Dist cw20151210.zip
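
To check which of the IDs discussed on this page the pattern in the PostgreSQL query picks up, the same expression can be tried in Java. This is only a sketch with a hypothetical class name. It uses find() because PostgreSQL's ~ operator matches the pattern anywhere in the value rather than requiring the whole value to match.

```java
import java.util.regex.Pattern;

/** Illustrative only: applies the query's regex to a few IDs from this page. */
final class OrfPatternSketch {

    // Same expression as in the PostgreSQL query: "BP", four digits, then A or B.
    static final Pattern ORF_LIKE = Pattern.compile("BP[0-9][0-9][0-9][0-9](A|B)");

    /** True if the pattern occurs anywhere in the ID (unanchored, like ~). */
    static boolean matches(String id) {
        return ORF_LIKE.matcher(id).find();
    }

    public static void main(String[] args) {
        String[] ids = { "BP3167", "BP3167.1", "BP3167A" };
        for (String id : ids) {
            System.out.println(id + " -> " + matches(id));
        }
    }
}
```

Only "BP3167A" matches. Plain ordered locus IDs such as "BP3167" and the variant "BP3167.1" do not, which is why filtering the EnsemblBacteria reference IDs by this pattern does not reintroduce the duplicates described above.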

Running and Testing a New Gene Database Export (cw20151210)

Using version "3.0.0-build-5-cw20151210" of GenMAPP Builder (File:Dist cw20151210.zip), I conducted a new import-export cycle to create an updated gene database for Bordetella pertussis. The Gene Database Testing Report for this new gene database can be found here: Gene Database Testing Report- cw20151210. I wrote sections 1-5.2 of the testing report.

Final Project Deliverables

After concluding the Gene Database Testing Report for the latest export (cw20151210), I began working on our final project deliverables:

  • Began editing the ReadMe file template, customizing it for Bordetella pertussis and the cw20151210 .gdb file.
  • Downloaded and edited the posted schema.
  • Worked with Lena to create and name all of the GenMAPP-oriented deliverables during our 12/13 meeting.
