Blitvak Week 15

From LMU BioDB 2015

Work conducted on 12/8

  • The last gene database testing report was reviewed
  • It was noticed, once again, that UniProtKB reported 6994 distinct entries, whereas 7121 gene names were found in the XML/final database
    • This represents a discrepancy of 127

PSQL investigation of the 127 count discrepancy

  • This discrepancy was investigated through PSQL queries on the initial export Postgres database, which contained all 7121 ORF entries of interest (thank you, Dondi, for the commands and investigation!)
  • The initial export PSQL database, B.cenocepacia_J2315_20151119_gmb3build5, was booted up
  • The previously used command select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[L,M,S]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]'; was condensed to select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?'; and the modified command was executed to confirm the count of 7121
  • select genetype_name_hjid, count(value) from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?' group by genetype_name_hjid; was executed to count the UniProt protein entries covered by the ORF names; it returned 6993 entries, one fewer than the 6994 reported by UniProt.
    • To find the missing entry, UniProtKB was searched for entries lacking a gene name of the form BCA* or pBCA*. One UniProt entry lacked these usual gene name IDs
    • This entry is described as a Proteolysis tag peptide encoded by tmRNA Burkh_cenoc_J2315; it appears to be encoded by a transfer-messenger RNA (tmRNA) gene and therefore lacks a proper gene ID. This peptide will be ignored since it is associated with a functional RNA gene.
  • select genetype_name_hjid, count(value) from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?' group by genetype_name_hjid having count(value) > 1; was executed to see whether any UniProt protein entries were represented by multiple gene names; numerous entries had more than one corresponding gene name.
  • The following command was executed to find the total number of gene names belonging to protein entries that have more than one gene name:
select sum(dupe_count) from (select genetype_name_hjid, count(value) as dupe_count
    from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?'
    group by genetype_name_hjid having count(value) > 1 order by count(value) desc) as dupe_tally;
  • This returned a sum of 205, which includes the first gene name of each such entry, not just the "extras"
  • The following command was executed to count the protein entries that have more than one gene name; since each of these entries contributes exactly one first (non-"extra") gene name, this is also the number of non-"extra" names included in the 205:
select count(genetype_name_hjid) from (select genetype_name_hjid, count(value) as dupe_count
    from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?'
    group by genetype_name_hjid having count(value) > 1 order by count(value) desc) as dupe_tally;
  • This command returned a count of 77 records
    • Given that 6993 entries are represented in the PSQL database, there is a difference of 128 with respect to the 7121 gene name count. The sum query returned 205 gene names across the entries that have more than one name; subtracting the 77 first (non-"extra") names, one per such entry, leaves 128 "extra" names, which matches that difference. The original discrepancy of 127 is one lower because the 6994 entries reported by UniProt include the tmRNA-encoded peptide, which has no matching gene name. It is now apparent that the difference between the number of UniProt entries and the number of gene name IDs is due to the fact that some proteins are covered by several different gene name IDs.
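
The reconciliation above can be sketched in Python on a small, made-up set of rows. The (entry id, ORF name) pairs below are hypothetical stand-ins for genenametype rows, not data from the database, and the function name reconcile is an assumption made for illustration. The last assertion only rechecks the counts reported in this journal.

```python
from collections import Counter

# Hypothetical toy rows: (entry id, ORF name) pairs standing in for
# genenametype rows. These are NOT taken from the real database.
ROWS = [
    (1, "BCAL0001"),
    (2, "BCAL0002"),
    (2, "BCAL0003"),
    (3, "BCAM0004"),
    (3, "BCAM0005"),
    (3, "BCAM0006"),
    (4, "pBCA001"),
]


def reconcile(rows):
    """Mirror the journal's four counts for a list of (entry, name) rows."""
    names_per_entry = Counter(entry for entry, _ in rows)
    # Entries with more than one name, like "having count(value) > 1".
    multi = {e: n for e, n in names_per_entry.items() if n > 1}
    return {
        "total_names": len(rows),                  # select count(*)
        "distinct_entries": len(names_per_entry),  # rows returned by group by
        "dupe_sum": sum(multi.values()),           # select sum(dupe_count)
        "dupe_entries": len(multi),                # select count(genetype_name_hjid)
    }


counts = reconcile(ROWS)
extras = counts["dupe_sum"] - counts["dupe_entries"]

# Every "extra" name is one name beyond the first for some entry, so the
# extras must equal total names minus distinct entries.
assert counts["total_names"] - counts["distinct_entries"] == extras

# The same identity with the numbers reported in the journal.
assert 7121 - 6993 == 205 - 77 == 128
print(counts, "extras:", extras)
```

The identity holds for any input, since entries with a single name contribute nothing to either side.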

Work conducted on 12/10

  • Work was done on the project presentation
  • I checked in with Anu, Kevin, and Veronica regarding the upcoming deliverables
  • I helped prepare the readme file deliverable

Work conducted on 12/11

  • The final testing report was reviewed again, cleaned up, and slightly modified with the findings on 12/8 in mind.
  • Further work was done on the presentation slides
  • The final commands to be used in the presentation were settled on:
    • java -jar xmlpipedb-match-1.1.1.jar "p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?" < "uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml"
    • select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?';
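
As a rough illustration of what the final pattern accepts, here is a minimal Python sketch. It assumes that Python's re.search is a fair approximation of the unanchored match done by PostgreSQL's ~ operator for this simple pattern. The sample values and the helper name matches are hypothetical. The sketch also shows that the comma inside [A-Z,a-z] is a literal character of the bracket expression, so a trailing comma would be accepted as the optional last character.

```python
import re

# Pattern copied verbatim from the journal's final commands.
PATTERN = re.compile(r"p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?")

# Hypothetical stricter variant without the literal comma, for comparison only.
STRICT = re.compile(r"p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Za-z]?")


def matches(value: str) -> bool:
    """Unanchored match: True if the pattern occurs anywhere in value."""
    return PATTERN.search(value) is not None


# Hypothetical sample values, not taken from the database.
for sample in ["BCAL0001", "BCAM1234A", "pBCA055", "BCA12", "ssrA"]:
    print(sample, matches(sample))

# Inside a bracket expression the comma is an ordinary character, so the
# journal's pattern can consume a trailing comma while the variant cannot.
print(PATTERN.fullmatch("BCAL000,") is not None)
print(STRICT.fullmatch("BCAL000,") is not None)
```

Whether the comma matters in practice depends on the data; if no gene name contains a comma, the two patterns select the same rows.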

Work conducted on 12/14

  • Met with the group and reviewed the presentation slides (made some minor alterations) and practiced the presentation
  • Checked the final readme and finalized it

Final Presentation Slides



Brandon Litvak
BIOL 367, Fall 2015