Troque Week 14

From LMU BioDB 2015

Running New Builds

Build 1

Name of .gdb file (give filename and upload and link to compressed file): Sf-Std_20151201.gdb

  • Time taken to export: 4 hours, 10 minutes, 46 seconds
    • Start time: 4:19:22 PM PST
    • End time: 8:30:08 PM PST
    • Note:

Build 2

Name of .gdb file: Sf-Std 20151207.gdb

  • Date: 12/7/15
  • Time taken to export: 4 hours, 24 minutes, 1 second
    • Start time: 9:13:45 PM PST
    • End time: 1:37:46 AM PST
    • Note:
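
The export times for both builds can be double-checked from the recorded start and end times. The following is a minimal Python sketch; the helper name export_duration is hypothetical, and it assumes an end time earlier than the start time means the export ran past midnight (as in Build 2).

```python
from datetime import datetime, timedelta

def export_duration(start, end):
    """Return the elapsed time between two 12-hour clock times.

    Assumption: if the end time is earlier than the start time,
    the export finished on the following day.
    """
    fmt = "%I:%M:%S %p"
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(end, fmt)
    if t1 < t0:
        t1 += timedelta(days=1)
    return t1 - t0

# Times copied from the Build 1 and Build 2 entries above.
print(export_duration("4:19:22 PM", "8:30:08 PM"))  # 4:10:46
print(export_duration("9:13:45 PM", "1:37:46 AM"))  # 4:24:01
```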

Important Files

Identifying the Gene IDs

  • Regular expression: (CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?(/|</name>)
  • Observations:
    • To reduce the number of matches, we added the end tag "</name>" to our regular expression, which brought the count down from over 8000 to 7517. Since TallyEngine reported 7567, this meant that 150 IDs were not being caught. To account for them, we added the genes with IDs of the form CP#### (there were 50 instances of these) and those of the form SF####.# or S####.#. This gave us 7566 gene IDs.
    • When I looked at the IDs in Microsoft Access, they totaled 7569. To account for this last piece of gene formatting, we also had to include the genes of the form SF?####/SF?####. The 2 extra genes that TallyEngine did not count are actually not supposed to be separated, since the formatting suggests that the two IDs are interchangeable. When the .gdb file was created, it appears that these genes were split at the "/".
    • In other words, there are 3 ordered locus names with formatting that is different from the rest: SF2223/SF2224, S2352/S2353, and S3359/S3360.
    • I wasn't able to match the number output by TallyEngine exactly, since there are other genes with the same format that were already caught by the patterns SF#### or S####.
    • Note: It turns out that the ShiBASE database only uses the patterns SF#### and CP####, not S####, so the regular expression would really have to be just SF?[0-9][0-9][0-9][0-9](\.[0-9])?(/|</name>)
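
To illustrate how the regular expression above behaves on the different ID formats, here is a minimal Python sketch. The pattern is copied from the "Regular expression" bullet; the sample XML string, its name tags, and the helper find_ids are made up for illustration and are not taken from the actual UniProt XML file.

```python
import re

# Pattern copied verbatim from the "Regular expression" bullet above.
GENE_ID = re.compile(r"(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?(/|</name>)")

# Hypothetical sample text; the real UniProt XML file is much larger.
SAMPLE = (
    '<name type="ordered locus">SF0001</name>'
    '<name type="ordered locus">S0002</name>'
    '<name type="ordered locus">CP0010</name>'
    '<name type="ordered locus">SF1234.1</name>'
    '<name type="ordered locus">SF2223/SF2224</name>'
    '<name type="primary">thrA</name>'
)

def find_ids(text):
    """Return every matched ID with the trailing "/" or "</name>" removed."""
    ids = []
    for match in GENE_ID.finditer(text):
        terminator = match.group(3)  # either "/" or "</name>"
        ids.append(match.group(0)[: -len(terminator)])
    return ids

ids = find_ids(SAMPLE)
print(ids)
print(len(ids))
```

In this sketch a slash-separated entry such as SF2223/SF2224 is counted as two IDs, which mirrors the splitting at the "/" described above.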

FOR THE FULL REPORT ON IDENTIFYING THE ID, VISIT THE GENE DATABASE TESTING REPORT PAGE.

Reflection

  1. What worked?
    • What worked in identifying the gene IDs was exporting the .gdb file into Excel and comparing it with the OrderedLocusNames table (from Microsoft Access). This made it easier to find which genes were missing from the .gdb file and to look them up in the UniProt XML file. By comparing the lists of gene IDs in Excel and using the CTRL+F shortcut, I was also able to work out which tags to include in the new builds of the database. This let me confirm that some genes do not exist in the XML file at all, while only a couple appear within the "dbReference" tag.
  2. What didn't work?
    • What didn't work was running Match repeatedly without thinking it through. Even when I was trying to match the number of gene IDs to what TallyEngine gave me, Match didn't really help me identify where to find the genes in the XML file.
  3. What will I do next to fix what didn't work?
    • What I would do next is to actually use Match in conjunction with the XML file, or just use the Excel method alone, since it was more helpful than Match in finding the necessary tags.
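
The Excel comparison described in this reflection amounts to a set difference between two lists of IDs. Below is a minimal Python sketch of that idea, assuming both lists have already been exported as plain lists of strings. The function name compare_ids and the example IDs are hypothetical and are not taken from the actual .gdb file or the OrderedLocusNames table.

```python
def compare_ids(gdb_ids, access_ids):
    """Report which IDs appear in only one of the two lists."""
    gdb, access = set(gdb_ids), set(access_ids)
    return {
        "missing_from_gdb": sorted(access - gdb),
        "only_in_gdb": sorted(gdb - access),
    }

# Made-up example lists, for illustration only.
gdb_ids = ["SF0001", "SF0002", "CP0010"]
access_ids = ["SF0001", "SF0002", "SF0003", "CP0010"]

print(compare_ids(gdb_ids, access_ids))
```

IDs listed under missing_from_gdb would be the ones to search for in the XML file.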

Assignment Links

Weekly Assignments

Individual Journal Entries

Shared Journal Entries