Anuvarsh Week 8

From LMU BioDB 2015
Jump to: navigation, search

Statistical Analysis of Vibrio cholerae Microarray Data (Part 1)

  • First, I downloaded the Merrell_Compiled_Raw_Data_Vibrio.xls|Merrell_Compiled_Raw_Data_Vibrio.xls file to my Desktop and saved a copy with my initials and the current date. (This file can be accessed on the page containing the procedure for part 1 or in the .zip file uploaded at the top of this lab assignment.

Normalize the log ratios for the set of slides in the experiment

  • I created a new Worksheet in my Excel file and named it "scaled_centered"
  • I transferred all of the data in the "complied_raw_data" worksheet into the "scaled_centered" worksheet
  • I then created two new rows and titled them "Average" and "StdDev"
  • I computed the Average log ratio for each column of data (where each column represents one chip) using =AVERAGE(B4:B5224), and copied the formula into the rest of the row
  • Then, I computed the Standard Deviation of the log ratios using the equation =STDEV(B4:B5224), and then copied this formula into the rest of the row
  • I then went through and added names to the columns to the right of the raw data with following the format "A1_scaled_centered","A2_scaled_centered_, etc.
  • In cell N4, I typed the following equation: =(B4-B$2)/B$3 and copied this formula into the rest of the column
    • This step is important to do because it normalizes the averages of the data at 0.
  • I repeated this step for all of the "_scaled_centered" columns

Perform statistical analysis on the ratios

  • After creating a new worksheet called "statistics", I copied and pasted the "ID" column from the "scaled_centered" worksheet
  • I then copied over all of the "_scaled_centered" columns from the "scaled_centered" sheet into the "statistics" sheet using Past Special>Values, and deleted the rows titled "Average" and "StDev"
  • I then titled three new columns "Avg_LogFC_A", "Avg_LogFC_B", and "Avg_LogFC_C"
  • Under each of these columns, I calculated the average log fold change with the equation =AVERAGE(B2:E2) where the designated cells selected correspond to the scaled and centered data for that patient ("Avg_LogFC_A" has cells from all of the "A#_scaled_centered" columns)
  • The previous step was modified and repeated for each of the other "Avg_LogFC" columns
  • I then computed the averages of the averages in the column titled "Avg_LogFC_all"
  • I created a new header for the column next to "Avg_LogFC_all" and computed the Tstat with the formula =AVERAGE(N2:P2)/(STDEV(N2:P2)/SQRT(number of replicates))
  • I then created another header for the column adjacent to the "Tstat" column and named it "Pval"
  • In this column, I calculated the P value using the equation: =TDIST(ABS(R2),degrees of freedom,2)

Calculate the Bonferroni p value correction

  • I created two columns to the right of the "Pval" column and named both of them "Bonferroni_Pvalue"
  • In the first column, I wrote the equation =S2*5221 in the first row and then copied the formula throughout the column
  • In the second column, I created a condition that replaced any p value greater than 1 with a 1 using the equation =IF(T2>1,1,T2)

Calculate the Benjamini & Hochberg p value Correction

  • I created a new worksheet called "B-H_Pvaue"
    • This is important because we will be reordering the data and we want to prevent the rest of the sheet from being affected by the calculations required for the B-H p value
  • I copied the ID column from the previous worksheets into this sheet, and created a "MasterIndex" sheet to the left of the ID column
  • In the MasterIndex column, I typed 1 and 2 into cells A2 and A3, selected these two cells, and then double clicked on the black + sign on the corner of the cell to fill every cell in this cell with corresponding data in the "ID" column
  • I then copied the p values column from the previous sheet into Column C using Paste Special>Values
  • I selected all three columns with data in them and sorted them by ascending values in column C.
  • I then created a new column titled "Rank" and numbered each of the rows in that column using the same strategy as the "MasterIndex" column
  • Two new columns were titled "B-H_Pvalue"
  • In the first "B-H_Pvalue" column, I entered the equation =(C2*5221)/D2 and copied that formula into the rest of the column
  • In the second "B-H_Pvalue" column I created a condition where any B-H p value that is greater than 1 is replaced with the number 1 using the formula =IF(E2>1,1,E2), and copied the formula to the rest of the column
  • I then selected columns A through F and reordered them in ascending order by the Master Index and copies the last "B-H_Pvalue" column into my "statistics" sheet

Prepare file for GenMAPP

  • A new sheet was created and named "forGenMAPP"
  • Everything was copied from the statistics sheet into this sheet using Paste Special>Values
  • All of the values in the fold change columns were modified to only show 2 decimal places
  • All of the columns containing p values were modified to show only 4 decimal places
  • The left-most Bonferroni p value was removed, leaving behind only the column containing the "if" statement
  • A new column was added to the left of the "ID" column was created with the title "SystemCode" and was populated by the value "N" in every row
  • This worksheet was saved as a tab delimited *.txt file

Sanity Check: Number of genes significantly change

  • To determine how many genes fall under each pvalue category, I added a filter to all rows of data and filtered the data according to the values indicated.
    • 948 (18.2%) genes have a p value < 0.05
    • 235 (4.5%) genes have a p value < 0.01
    • 21 (0.4%) genes have a p value < 0.001
    • 0 (0%) genes have a p value < 0.0001
  • Two methods were applies in order to correct the p value and adjust it for the number of Ttests I ran. To determine how many genes fell under p value < 0.05, I filtered the data according to the values indicated in the columns indicated.
    • 0 (0%) genes have a Bonferroni-corrected p value < 0.05
    • 0 (0%) genes have a Benjamini and Hochberg-corrected p value < 0.05
  • To determine which genes had a positive or negative fold change, I filtered all of the genes from the "Avg_LogFC_all" column with the indicated conditions.
    • 352 (6.7%) genes have a positive average log fold change in the "Avg_LogFC_all" column and a p value < 0.05
    • 596 (11.4%) genes have a negative average log fold change in the "Avg_LogFC_all" column and a p value < 0.05
    • 339 (6.4%) genes have an significant positive average log fold change (> 0.25) in the "Avg_LogFC_all" column and a p value < 0.05
    • 578 (11.1%) genes have an significant negative average log fold change (> 0.25) in the "Avg_LogFC_all" column and a p value < 0.05
  • Question: What criteria did Merrell et al. (2002) use to determine a significant gene expression change? How does it compare to our method?
    • Merrell et al. determined statistically significant differences in gene expression using the Statistical Analysis for Microarrays (SAM) program. According to the article, genes that were determined to have statistically significant change were genes that had at least a two fold change in expression. Our method used p values to test for significant differences between experimental and control data.

Sanity Check: Compare individual genes with known data

Merrell et al. (2002) report that genes with IDs: VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583 were all significantly changed in their data. However, in our data...

  • (All genes were checked by searching the gene name in the "ID" column of the Excel spreadsheet.)
  • VC0028 (two entries exist)
    • Fold Change: 1.67, 1.27
    • P value: 0.0474, 0.0692
    • Significantly changed? The first entry of this gene was significantly changed, but not the second.
  • VC0941 (two entries exist)
    • Fold Change: 0.09, -0.28
    • P value: 0.6759, 0.1636
    • Significantly changed? Neither entry was significantly changed.
  • VC0869 (five entries exist)
    • Fold Change: 1.59, 1.95, 2.2, 1.5, 2.12
    • P value: 0.0463, 0.0227, 0.002, 0.0174, 0.02
    • Significantly changed? All entries were significantly changed.
  • VC0051 (two entries exist)
    • Fold Change: 1.92, 1.89
    • P value: 0.0139, 0.016
    • Significantly changed? Both entries were significantly changed.
  • VC0647 (three entries exist)
    • Fold Change: -1.11, -0.94, -1.05
    • P value: 0.0003, 0.0125, 0.0051
    • Significantly changed? All entries were significantly changed.
  • VC0468 (one entry exists)
    • Fold Change: -0.17
    • P value: 0.335
    • Significantly changed? Not significantly changed.
  • VC2350 (one entry exists)
    • Fold Change: -2.4
    • P value: 0.013
    • Significantly changed? Significantly changed.
  • VCA0583 (one entry exists)
    • Fold Change: 1.06
    • P value: 0.1011
    • Significantly changed? Not significantly changed.

MAPPFinder Analysis of Vibrio cholerae Microarray Data (Part 2)

Map Onto Biological Pathways (GenMAPP & MAPPFinder

  • I launched the GenMAPP software as it was installed on the lab computers in Seaver 120 as of October 20, 2015.
  • I downloaded the 2010 Vibrio cholerae Gene Database using the link on the OpenWetWare page, which is also available in the .zip file uploaded at the top of this journal
    • My partner, Brandon L, used the 2009 version of this species' gene database
  • After uploading the 2010 VC Gene Database onto GenMAPP, I followed the following procedure to upload data from the previously made Tab delimited .txt file for GenMAPP

GenMAPP Expression Dataset Manager Procedure

  • I opened the Expression Dataset Manager from the Data menu in the main drafting board window and selected "New Dataset"
  • I selected the tab-delimited text file that I formatted for GenMAPP (.txt). Since the only data included in this file are numerical values, I did not select any boxes in the Data Type Specification window that appeared.
  • The Expression Dataset Manger then converted my data for several seconds.
    • When uploading my tab delimited file to the GenMAPP software for conversion, 121 errors were detected in my raw data.
    • My partner, Brandon L., had 772 errors detected in his raw data conversion while using the 2009 Vibrio Cholera database.
      • This difference in errors is most likely due to the extensiveness of our databases. Because Brandon was using an older version of the database while I was using a newer version, it is possible that his version had fewer genes than mine.
  • I then customized the new Expression Dataset by creating Color Sets that contain instructions to GenMAPP for displaying data on MAPPs
  • To do this, I created a color set named "LogFoldChange" and selected "Avg_LogFC_all" as the data that should be used as the Gene Value. I activated the criteria builder by clicking the New button
  • Two criterion were created: one for increased expression of data and another for decreased expression of data
    • "Increased" was defined as [Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05
    • "Decreased was defined as [Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05
  • I saved the entire Expression Dataset by selected Save from the Expression Dataset menu and then exited from the Expression Dataset Manager to view the Color Sets on a MAPP

MAPPFinder Procedure

  • I launched the MAPPFinder program from within GenMAPP by selecting MAPPFinder under the Tools menu on GenMAPP
  • After ensuring that the Gene Database for the correct species was loaded, I selected the button "Calculate New Results"
  • I clicked on "Find File" and selected the .gex file that contained my Expression Dataset (This file can be found in the .zip file at the beginning of this journal)
  • I chose the LogFoldChange color set and the Increased criteria in the right-hand box, and selected the boxes for "Gene Ontology" and "p value"
  • I then created a meaningful name for my results file, and clicked Run MAPPFinder. MAPPFinder took several seconds to run. After MAPPFinder finishes running, a Gene Ontology browser opened showing my results.
    • The first four times that I tried to run MAPPFinder, the Gene Ontology browser did not appear. I initially thought it was a problem with the program itself, so I backtracked and reuploaded my data to GenMAPP and tried again. After this did not work, I changed the file name for my results to "mappfinder results" and GenMAPP worked. Prior to this, I attempted to save my file as "Merrell_Compiled_Raw_Data_Vibrio_MAPPFinder_AV_20151015" which appeared to not follow the GenMAPP guidelines for file names.
  • I then selected "Show Ranked List" to see the most significant Gene Ontology terms for my data. The top 10 Gene Ontology results were as follows:
    1. branched chain family amino acid metabolic process
    2. branched chain family amino acid biosynthetic process
    3. IMP metabolic process
    4. IMP biosyntehtic process
    5. purine ribonucleoside monophosphate metabolic process
    6. purine nucleoside monophosphate metabolic process
    7. purine nucleoside monophosphate biosyntehtic process
    8. purine ribonucleoside monophosphate biosyntehtic process
    9. arginine metabolic process
    10. cellular nitrogen compound biosynthetic process
    • Top 10 Ranked GO Terms, found using the 2009 Database, as reported by Brandon L.:
      1. biopolymer biosynthetic process
      2. macromolecule biosynthetic process
      3. macromolecule metabolic process
      4. localization
      5. transporter activity
      6. cellular biopolymer biosynthetic process
      7. cellular macromolecule metabolic process
      8. transport
      9. establishment of localization
      10. biopolymer metabolic process
    • Between both the GO results, there seem to be a theme of metabolic and biosynthetic processes. However, the 2010 database returned GO results that appear to be more specific than the 2009 database by including specific compound processes, while the 2009 database returned GO results that are more vague such as "transport" and "localization". These differences in GO terms are most likely due to the difference of information available between the 2009 and 2010 version in regards to which genes are associated with which GO terms. If in the 2009 database, genes were only assigned to more vague GO terms, it makes sense that the top lists are more vague Another possibility is that because so many more genes in the 2009 database returned errors upon uploading the information on GenMAPP, it is possible that many of the genes were not taken into account by GenMAPP, and were therefore not taken into consideration in regards to what processes they affect and the GO terms associated with them.
  • I then searched all of the genes mentioned by Merrell et al. by typing the identifier for these genes into the MAPPFinder browser gene ID search field and selected "OrderedLocusNames" in the drop-down menu to the right of the search field. The GO terms associated with each of those genes are listed below:
    • VC0028: metal ion binding, iron-sulfur cluster binding, 4 iron 4 sulfur cluster binding, catalytic activity, lyase activity
    • VC0941: pyridoxal phosphate binding, catalytic activity, transferase activity, glycine hydroxymethyltransferase activity
    • VC0869: nucleotide binding, ATP binding, catalytic activity, lyase activity
    • VC0051: nucleotide binding, ATP binding, catalytic activity, lyase activity, carboxy-lyase activity
    • VC0647: 3'-5'-exoribonuclease activity, transferase activity, nucleotidyltransferase activity, polyribonucleotide nucleotidyltransferase activity
    • VC0468: metal ion binding, nucleotide binding, ATP binding, catalytic activity, ligase activity, glytathione synthase activity
    • VC2350: catalytic activity, lyase activity, deoxyribose-phosphate aldolase activity
    • VCA0583: outer membrane-bounded periplasmic space
    • My partner did not have the same results. In fact, my partner only found GO terms for two of the mentioned genes. This is mostly likely because he is using the 2009 gene database that does not have information regarding the mentioned genes. This becomes more clear when thinking about the number of genes that returned errors when uploading the data to GenMAPP.
  • I then clicked on the GO term "outer membrane-bounded periplasmic space" and a MAPP open opened that listed all of the genes associated with that GO term. I found VCA0583 in that list by using the Uniprot site to find the identifier for that gene, Q9KM06.
    • This gene did not have a significant change in expression
  • I then double clicked on this gene in order to find out the functions of this gene
    • This gene is responsible for transporter activity, transport, and outer membrane-bounded periplasmic space
  • I then created a copy of the results file (5-Criterion0-GO.txt) and renamed it to identify it as the MAPPFinder results sheet (This file can be found in the .zip file mentioned at the top of this journal)
  • I opened this file in Microsoft Excel.
  • Below is an image of the top of my Criterion file:
  • Top-of-criterion-sheet-AV-10162015.png
  • Compare this information with your partner who used a different version of the Vibrio Gene Database. Which numbers are different? Why are they different? Record this information in your individual journal entry.
    • The numbers that were different between me and my partner were the numbers associated with:
      • probes meeting the filter linked to a UniProt ID
      • genes meeting the criterion linked to a GO term
      • probes linked to a UniProt ID
      • genes linked to a GO term
      • N and R distinct genes in the GO
    • In all instances, the numbers that appeared on my criterion sheet were higher than those on Brandon's. This is most likely because the older version of the database that Brandon was using was not as extensive in its linking of probes to UniProt ID's and genes to GO terms. Therefore, the number of probes and genes linked overall and within the filter were fewer in his results than mine. This would also affect the N and R values that are used to determine the z score.
  • I then filtered the data to show the top GO terms represented in my data for both the "increased" and "decreased" criteria. The following are the filters that I applied:
    • Z Score (in column N) greater than 2
    • PermuteP (in column O) less than 0.05
    • Number Changed (in column I) greater than or equal to 5 AND less than 100
    • Percent Changed (in column L) greater than or equal to 26%
    • These filters returned the top 20 GO terms
  • The GO terms that are closely related to each other were highlighted in similar colors in the Excel spreadsheet (This file can be found in the .zip file at the top of this journal)
  • My GO terms are all related to reactions and pathways either involving or forming branched-chain amino acids, IMP, nibonucleosides, nucleosides, arginine, or cellular nitrogen. Merrell et al. (2002) indicated that something about stomach acid buffering is necessary for spreading V. cholerae. My data signifies that the processes involving the previously mentioned compounds are most significant among all of the genes in this study (at least 3 genes are with a p value < 0.05 are associated this GO term). This means that the changes in expression of these genes is necessary for V. cholerae to digest and remain infectious, which means that branched-chain amino acids, IMP, ribonucleosides, nucleosides, arginine, and cellular nitrogen are all key factors in the hyper infectious bacterial state of V. cholerae after dissemination in humans.

Conclusion

After reading the article presented by Merrell et al. (2002) regarding the infectious nature of V. cholerae after dissemination, and understanding that this infectious nature is somehow related to stomach acid buffering in humans, we took their data (after they already completed their log fold changes) and statistically analyzed it to determine which genes experienced a statistically significant change between laboratory grown V. cholerae (which did not interact with stomach acid) and the V. cholerae bacteria that was excised in human stool. We did so by first normalizing the data, and then conducting a series of calculations to determine each genes p-value that signified whether or not there was a significant difference between the two sets of bacteria. We also applued a Bonferroni and Benjimini-Hochberg correction to the p values to adjust for the volume of t-tests that we ran, and calculate more stringent p values. We then took this data and uploaded it to GenMAPP, where we colored the data points according to their significant p value and their increased or decreased expression. We then used MAPPFinder to take the significant increased (though some groups did decreased) genes and determine which gene ontology terms they most closely relate to. Using these terms, we were able to understand that branched-chain amino acids, IMP, ribonucleosides, nucleosides, arginine, and cellular nitrogen are key compounds that are either involved in the processes or are formed by a majority of the genes with significant increased expression. This indicated that these compounds are vital to the stomach acid buffering that occurs in vitro in comparison with laboratory grown V. cholerae.

Another aspect to this data analysis was the use of a 2009 vs. 2010 V. cholerae database. While I used the 2010 version of the database, my partner, Brandon Litvik, used the 2009 version but followed the same procedure. The results between the two databases were similar in regards to the top GO terms, so we had similar concepts to work with in the construction of our conclusion. Throughout the calculation and GenMAPP processes, however, we were receiving different results which indicated that the 2009 version of the database lacked the extensiveness of the 2010 database in regards to each gene being associated with UniProt IDs and GO terms. The numerical differences implied that around 600 genes lacked information in 2009 that was provided in the 2010 database. This speaks volumes regarding the speed by which science progresses, and the importance of updated databases in scientific studies.

Other Links

User Page: Anindita Varshneya
Class Page: BIOL/CMSI 367: Biological Databases, Fall 2015
Group Page: GÉNialOMICS

Assignment Pages

Week 1 Assignment
Week 2 Assignment
Week 3 Assignment
Week 4 Assignment
Week 5 Assignment
Week 6 Assignment
Week 7 Assignment
Week 8 Assignment
Week 9 Assignment
Week 10 Assignment
Week 11 Assignment
Week 12 Assignment
No Week 13 Assignment
Week 14 Assignment
Week 15 Assignment

Individual Journals

Individual Journal Week 2
Individual Journal Week 3
Individual Journal Week 4
Individual Journal Week 5
Individual Journal Week 6
Individual Journal Week 7
Individual Journal Week 8
Individual Journal Week 9
Individual Journal Week 10
Individual Journal Week 11
Individual Journal Week 12
Individual Journal Week 14
Individual Journal Week 15

Shared Journals

Class Journal Week 1
Class Journal Week 2
Class Journal Week 3
Class Journal Week 4
Class Journal Week 5
Class Journal Week 6
Class Journal Week 7
Class Journal Week 8
Class Journal Week 9
GÉNialOMICS Journal Week 10
GÉNialOMICS Journal Week 11
GÉNialOMICS Journal Week 12
GÉNialOMICS Journal Week 14
GÉNialOMICS Journal Week 15