Malverso Week 15

Electronic Journal

Josh and I ran some of the statistical analysis data from the GenMAPP users through the process of creating a new Expression Dataset. This generated an exception file that revealed some issues with our database.
At first there was an exception for every single gene, because we did not compensate for the underscore in the ID. After manually inserting the underscore after the 'SO', we were able to find the actual errors.
We noted that there are 5408 genes listed in their data, but only 4196 genes in our database.
There are also 760 gene IDs that are in the form SO_####F, which are genes that don't exist in our database.
There are 681 gene IDs that are in a 'normal' form (either SO_#### or SO_A####) but do not exist in our database.
For some of the gene IDs that have 'F's, there are multiple genes of the same ID.
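The tally above could be reproduced with something like the following sketch. It assumes the exception IDs have already been read into a list of strings, and that `####` means exactly four digits; the class name, method names, and regular expressions are hypothetical and are not part of GenMAPP or GenMAPP Builder.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class ExceptionIdTally {
    // Assumed ID shapes, read off the forms described in this journal.
    private static final Pattern F_FORM = Pattern.compile("SO_\\d{4}F");
    private static final Pattern NORMAL_FORM = Pattern.compile("SO_A?\\d{4}");

    // Labels one ID as "F", "normal", or "other".
    static String classify(String id) {
        if (F_FORM.matcher(id).matches()) {
            return "F";
        }
        if (NORMAL_FORM.matcher(id).matches()) {
            return "normal";
        }
        return "other";
    }

    // Counts how many IDs fall into each label; repeated IDs are counted each time.
    static Map<String, Integer> tally(List<String> ids) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String id : ids) {
            counts.merge(classify(id), 1, Integer::sum);
        }
        return counts;
    }
}
```

Run over our exception file, the "F" and "normal" counts would be expected to match the 760 and 681 reported above, which together make up the 1441 missing IDs.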
We could not find a way to search UniProt for all 1441 missing IDs at once, so we decided to do a spot check, searching UniProt for about 1 out of every 100 gene IDs. None of these searches returned any results.
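Picking roughly 1 out of every 100 IDs can be sketched as below. This is only an illustration of the sampling step, assuming the missing IDs are held in a list; we did the actual UniProt searches by hand, and the class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SpotCheck {
    // Returns every stride-th ID, starting with the first one in the list.
    static List<String> everyNth(List<String> ids, int stride) {
        if (stride <= 0) {
            throw new IllegalArgumentException("stride must be positive");
        }
        List<String> sample = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += stride) {
            sample.add(ids.get(i));
        }
        return sample;
    }
}
```

With 1441 IDs and a stride of 100, this would select 15 IDs to look up.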
We also searched for some of the 'F' IDs in our MOD, and none of them returned results either.
We concluded that even though 1441 is a large number, the exceptions could be safely ignored.
I did, however, have to make another customization to GenMAPP to be able to handle the underscores.
In class on 12/10/15, we worked on figuring out the corrections for our GenMAPP code and made a new dataset using the finished data from the GenMAPP users.
Using the manually underscored data, we were able to run MAPPFinder successfully and generate all of the necessary files for our deliverables.
I wrote the customization by copying and pasting the code from the Vibrio cholerae customization and then modifying it to add an underscore to the data.
I ran a new export according to https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/Running_GenMAPP_Builder with the newly customized code. (The power outage erased my records of how long the various exports took, but my .gdb file was saved in the ThawSpace, which is a plus.)
I tested the new .gdb file with the unmodified data IDs and found that every ID still resulted in an error.
To figure out my problem, I opened the .gdb file with Microsoft Access and looked in the ordered Locus names to see that there was a copy of each gene ID (which was expected), but also that the copies of the IDs had two underscores in them instead of none.
I redid the code by adding this line:

    newId = "SO" + substrings[i].substring(3, substrings[i].length());
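To show what that line does to a single ID, here is a minimal standalone sketch. It assumes each ID starts with the three characters `SO_`; the class and method names are hypothetical and are not the actual GenMAPP Builder customization code.

```java
class LocusIdVariants {
    // Hypothetical helper: builds the underscore-free copy of an ID,
    // e.g. "SO_1234" becomes "SO1234".
    static String withoutUnderscore(String id) {
        if (id.startsWith("SO_")) {
            // substring(3) skips "SO_" and keeps the rest of the ID;
            // it is equivalent to substring(3, id.length()).
            return "SO" + id.substring(3);
        }
        // Leave anything that does not match the assumed prefix untouched.
        return id;
    }
}
```

Because the IDs in our database already contain the underscore, the copy has to remove it rather than add one, which is why adapting the Vibrio cholerae code unchanged produced two underscores.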

Then I re-exported the gene database. The links can be found on our gene database testing report page.
- Uniprot took 3.13 minutes.
- OBO.XML took 7.05 minutes.
- Processing the GO terms took 4.76 minutes.
- GOA took 0.06 minutes.
- The export took 1 hour and 35 minutes.
The new .gdb file can be located on the files and deliverables page.

Team Page

Heavy Metal HaterZ

Assignments

Individual Journal Entries

Shared Journal Entries
