Mbalducc Week 8

From LMU BioDB 2017

Microarray Data Analysis Lab Notebook

I analyzed the strain dCIN5; there were 4 replicates per data point.

5871 replacements were made when I deleted the NAs from the spreadsheet.

Excel file for the ANOVA of dCIN5

Table for Comparing dCIN5 with WT

Statistical Data Analysis Part 1

  1. Created a new worksheet, naming it "dCIN5_ANOVA".
  2. Copied the first three columns containing the "MasterIndex", "ID", and "Standard Name" from the "Master_Sheet" worksheet for the strain dCIN5 and pasted them into my new worksheet. Copied the columns containing the data for dCIN5 and pasted them into my new worksheet.
  3. At the top of the first column to the right of my data, created five column headers of the form dCIN5_AvgLogFC_(TIME) where (TIME) is 15, 30, 60, 90, and 120.
  4. In the cell below the dCIN5_AvgLogFC_t15 header, typed =AVERAGE(
  5. Then highlighted all the data in row 2 associated with dCIN5 and t15 (cells D2 through G2), pressed the closing paren key (shift 0), and pressed the "enter" key.
  6. This cell now contained the average of the log fold change data from the first gene at t=15 minutes.
  7. Clicked on this cell and positioned my cursor at the bottom right corner. I saw the cursor change to a thin black plus sign (not a chubby white one). When it did, double clicked, and the formula was copied to the entire column of 6188 other genes.
  8. Repeated steps (4) through (7) with the t30, t60, t90, and the t120 data.
  9. In the first empty column to the right of the dCIN5_AvgLogFC_t120 calculation, created the column header dCIN5_ss_HO.
  10. In the first cell below this header, typed =SUMSQ(
  11. Highlighted all the LogFC data in row 2 for dCIN5 (cells D2 through W2, but not the AvgLogFC), pressed the closing paren key (shift 0), and pressed the "enter" key.
  12. In the next empty column to the right of dCIN5_ss_HO, created the column headers dCIN5_ss_(TIME) as in (3).
  13. Made a note of how many data points I had at each time point for my strain. For dCIN5, there are 4 data points for each time point. There are 20 total data points.
  14. In the first cell below the header dCIN5_ss_t15, typed =SUMSQ(D2:G2)-COUNTA(D2:G2)*X2^2 and hit enter.
    • The COUNTA function counts the number of cells in the specified range that have data in them (i.e., does not count cells with missing values).
    • Upon completion of this single computation, used the Step (7) trick to copy the formula throughout the column.
  15. Repeated this computation for the t30 through t120 data points.
  16. In the first column to the right of dCIN5_ss_t120, created the column header dCIN5_SS_full.
  17. In the first row below this header, typed =SUM(AD2:AH2) and hit enter.
  18. In the next two columns to the right, created the headers dCIN5_Fstat and dCIN5_p-value.
  19. Recalled the number of data points from (13): 20.
  20. In the first cell of the dCIN5_Fstat column, typed =((20-5)/5)*(AC2-AI2)/AI2 and hit enter.
    • Copied to the whole column.
  21. In the first cell below the dCIN5_p-value header, typed =FDIST(AJ2,5,20-5). Copied to the whole column.
  22. Performed a quick sanity check to see if I did all of these computations correctly.
    • Clicked on cell A1 and clicked on the Data tab. Selected the Filter icon (looks like a funnel). Little drop-down arrows appeared at the top of each column. This enabled me to filter the data according to criteria I set.
    • Clicked on the drop-down arrow on my dCIN5_p-value column. Selected "Number Filters". In the window that appeared, set a criterion that filtered my data so that the p value had to be less than 0.05.
    • Excel now only displayed the rows that corresponded to data meeting that filtering criterion. A number appeared in the lower left hand corner of the window giving me the number of rows that meet that criterion.
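The spreadsheet computation above can be cross-checked in Python. This is a sketch using hypothetical replicate values and scipy (the column letters in the comments refer to the worksheet layout described above); it reproduces the SUMSQ/COUNTA formulas and the F statistic and p value exactly as they were entered into Excel:

```python
import numpy as np
from scipy import stats

# Hypothetical log fold change data for ONE gene: 4 replicates at
# each of the 5 time points (t15, t30, t60, t90, t120).
rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=0.5, size=(5, 4))  # rows = time points

n = data.size            # 20 data points total, as counted in step (13)
k = data.shape[0]        # 5 time points

# dCIN5_ss_HO: =SUMSQ(D2:W2) over all 20 values
ss_ho = np.sum(data ** 2)

# dCIN5_ss_(TIME): =SUMSQ(D2:G2) - COUNTA(D2:G2)*X2^2 per time point
ss_time = np.array([np.sum(row ** 2) - row.size * row.mean() ** 2
                    for row in data])

# dCIN5_SS_full: sum of the per-time-point sums of squares
ss_full = ss_time.sum()

# dCIN5_Fstat: =((20-5)/5)*(AC2-AI2)/AI2
f_stat = ((n - k) / k) * (ss_ho - ss_full) / ss_full

# dCIN5_p-value: Excel's =FDIST(F, 5, 20-5) is the upper tail of the
# F distribution with (5, 15) degrees of freedom
p_value = stats.f.sf(f_stat, k, n - k)
```

Applying this row by row to all 6189 genes would reproduce the dCIN5_Fstat and dCIN5_p-value columns of the worksheet.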

Calculate the Bonferroni p value Correction

  1. I performed adjustments to the p value to correct for the multiple testing problem. Labelled the next two columns to the right with the same label, dCIN5_Bonferroni_p-value.
  2. Typed the equation =AK2*6189 into the first cell below the first dCIN5_Bonferroni_p-value header. Upon completion of this single computation, used the Step (7) trick to copy the formula throughout the column.
  3. Replaced any corrected p value that is greater than 1 by the number 1 by typing the following formula into the first cell below the second dCIN5_Bonferroni_p-value header: =IF(AL2>1,1,AL2). Used the Step (7) trick to copy the formula throughout the column.
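The two Bonferroni steps above amount to "multiply by the number of tests, then cap at 1". A minimal sketch with hypothetical p values:

```python
import numpy as np

# Hypothetical unadjusted ANOVA p values for a few genes.
p_values = np.array([6.4e-8, 0.0046, 0.03, 0.5])

n_genes = 6189  # number of genes tested in the worksheet

# =AK2*6189 : multiply each p value by the number of tests...
bonferroni = p_values * n_genes

# =IF(AL2>1,1,AL2) : ...and replace anything above 1 with exactly 1.
bonferroni = np.minimum(bonferroni, 1.0)
```

Note how quickly the correction saturates: any unadjusted p value above 1/6189 ≈ 0.00016 is pushed past 0.05, which is why so few genes survive this cut-off.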

Calculate the Benjamini & Hochberg p value Correction

  1. Inserted a new worksheet named "dCIN5_ANOVA_B-H".
  2. Copied and pasted the "MasterIndex", "ID", and "Standard Name" columns from my previous worksheet into the first three columns of the new worksheet.
  3. Copied my unadjusted p values from my ANOVA worksheet and pasted them into Column D using Paste special > Paste values.
  4. Selected all of columns A, B, C, and D. Clicked the A-to-Z sort button on the toolbar; in the window that appeared, sorted by Column D, smallest to largest.
  5. Typed the header "Rank" in cell E1. Created a series of numbers in ascending order from 1 to 6189 in this column. This was the p value rank, smallest to largest. Typed "1" into cell E2 and "2" into cell E3. Selected both cells E2 and E3. Double-clicked on the plus sign on the lower right-hand corner of my selection to fill the column with a series of numbers from 1 to 6189.
  6. Typed dCIN5_B-H_p-value in cell F1. Typed the following formula in cell F2: =(D2*6189)/E2 and pressed enter. Copied that equation to the entire column.
  7. Typed "dCIN5_B-H_p-value" into cell G1.
  8. Typed the following formula into cell G2: =IF(F2>1,1,F2) and pressed enter. Copied that equation to the entire column.
  9. Selected columns A through G. Sorted them by my MasterIndex in Column A in ascending order.
  10. Copied column G and used Paste special > Paste values to paste it into the next column on the right of my ANOVA sheet.
  • Zipped and uploaded the .xlsx file that I created to the wiki.
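The Benjamini & Hochberg steps above can be sketched without physically sorting the rows, by working with the rank of each p value instead. This is a sketch with hypothetical p values; note that it mirrors the spreadsheet recipe exactly (scale by n/rank, cap at 1) and, like the spreadsheet, does not include the extra monotonicity-enforcement step of the full B-H procedure:

```python
import numpy as np

p_values = np.array([0.001, 0.04, 0.0005, 0.2, 0.03])  # hypothetical
n = p_values.size  # the worksheet uses 6189

# Steps 4-5: rank the p values in ascending order, 1..n
order = np.argsort(p_values)
ranks = np.empty(n, dtype=int)
ranks[order] = np.arange(1, n + 1)

# Step 6: =(D2*6189)/E2 — scale each p value by n / rank
bh = p_values * n / ranks

# Step 8: =IF(F2>1,1,F2) — cap the corrected values at 1
bh = np.minimum(bh, 1.0)

# Step 9 (re-sorting by MasterIndex) is unnecessary here because we
# indexed by rank rather than reordering the rows.
```

For the example above, the smallest p value (0.0005, rank 1) becomes 0.0025, while the largest (0.2, rank 5) is unchanged, which is the sense in which B-H is less punishing than Bonferroni for all but the smallest p value.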

Sanity Check: Number of genes significantly changed

  • Opened my dCIN5_ANOVA worksheet.
  • Selected row 1 and selected the menu item Data > Filter > Autofilter. This enabled me to filter the data according to criteria I set.
  • Clicked on the drop-down arrow for the unadjusted p value. Set a criterion to filter my data so that the p value has to be less than 0.05.
    • How many genes have p < 0.05? and what is the percentage (out of 6189)?
      • 2290 genes have p < 0.05. The percentage out of 6189 is 37%.
    • How many genes have p < 0.01? and what is the percentage (out of 6189)?
      • 1380 genes have p < 0.01. The percentage is 22%.
    • How many genes have p < 0.001? and what is the percentage (out of 6189)?
      • 691 genes have p < 0.001. The percentage is 11%.
    • How many genes have p < 0.0001? and what is the percentage (out of 6189)?
      • 358 genes have p < 0.0001. The percentage is 6%.
  • I have just performed 6189 hypothesis tests. Another way to state what we are seeing with p < 0.05 is that we would expect to see a gene expression change for at least one of the timepoints by chance in about 5% of our tests, or about 309 times. Since we have more than 309 genes that pass this cut-off, we know that some genes are significantly changed; however, we don't know which ones. To apply a more stringent criterion to our p values, we performed the Bonferroni and Benjamini & Hochberg corrections on the unadjusted p values. The Bonferroni correction is very stringent; the Benjamini-Hochberg correction is less so. To see this relationship, I filtered my data to determine the following:
    • How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 6189)?
      • 151 genes are p < 0.05. The percentage is 2%.
    • How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 6189)?
      • 1453 genes are p < 0.05. The percentage is 23%.
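The expected-by-chance arithmetic above can be checked directly (the counts are the ones reported for the dCIN5 worksheet):

```python
# If no gene truly changed expression, a p < 0.05 cut-off alone would
# still flag about 5% of the 6189 tests as "significant" by chance.
n_tests = 6189
alpha = 0.05
expected_by_chance = alpha * n_tests  # about 309 genes

# Observed counts from the dCIN5 worksheet at p < 0.05:
observed_unadjusted = 2290
observed_bonferroni = 151
observed_bh = 1453
```

The unadjusted count exceeds the chance expectation by a factor of about seven, which is the evidence that real expression changes are present; the two corrections then trade off how aggressively that excess is trimmed.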
  • In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.
  • We will compare the numbers we get between the wild type strain and the other strains studied, organized as a table. I used the sample PowerPoint slide to see how the table should be formatted and uploaded my slide to the wiki.
    • Note that since the wild type data is being analyzed by one of the groups in the class, it will be sufficient for this week to supply just the data for your strain. We will do the comparison with wild type at a later date.
  • Comparing results with known data: the expression of the gene NSR1 (ID: YGR159C) is known to be induced by cold shock. Find NSR1 in your dataset. What is its unadjusted, Bonferroni-corrected, and B-H-corrected p values? What is its average Log fold change at each of the time points in the experiment? Does NSR1 change expression due to cold shock in this experiment?
    • Unadjusted p-value: 6.37596E-08
    • Bonferroni-corrected: 0.000394608
    • B-H corrected: 2.19227E-05
    • AvgLogFC t15: 4.070048368
    • AvgLogFC t30: 3.611460213
    • AvgLogFC t60: 4.298496857
    • AvgLogFC t90: -2.900930452
    • AvgLogFC t120: -0.931494963
    • The NSR1 gene does change expression due to cold shock in the dCIN5 experiment.
  • For fun, find "your favorite gene" (from your web page) in the dataset. What is its unadjusted, Bonferroni-corrected, and B-H-corrected p values? What is its average Log fold change at each of the timepoints in the experiment? Does your favorite gene change expression due to cold shock in this experiment?
    • My favorite gene is HSF1
    • Unadjusted p-value: 0.004551209
    • Bonferroni-corrected: 1
    • B-H corrected: 0.025723684
    • AvgLogFC t15: -1.230282375
    • AvgLogFC t30: -1.272313728
    • AvgLogFC t60: -1.500144335
    • AvgLogFC t90: -0.075703512
    • AvgLogFC t120: 0.422687452
    • HSF1 does appear to change expression due to cold shock in the dCIN5 experiment.

Summary Paragraph

In this experiment, data for the expression of yeast genes in cold shock with the deletion of CIN5 were analyzed. There were 5 time points, spanning the cold shock and the subsequent recovery at a different temperature. The data show the log fold change in the expression of each gene at each time point. From these data, a p-value was computed for each gene, followed by the Bonferroni-corrected and B-H-corrected p-values. The Bonferroni correction gave the strictest results, with only 151 genes (2% of the total) having a p-value of less than 0.05, compared with 1453 genes (23%) for the B-H correction and 2290 genes (37%) for the unadjusted p-values. The data were compared to a gene known to change expression during cold shock, NSR1, which still responded to cold shock in our strain, dCIN5. The gene HSF1 (my favorite gene) was also examined; it appeared to change expression in response to cold shock by the B-H criterion (corrected p < 0.05), although it did not pass the much stricter Bonferroni cut-off.

Acknowledgements

I worked with my homework partner, Simon Wroblewski, on this assignment. We worked together in class and compared results for our formulas in Excel. We also spoke outside of class to compare what we were doing to make sure we were both getting the same results for the same strain, dCIN5. I also used the instructions from the Week 8 page in my journal, edited to show that I did them in the past.

While I worked with the people noted above, this individual journal entry was completed by me and not copied from another source.

Mbalducc (talk) 15:32, 22 October 2017 (PDT)

References

LMU BioDB 2017. (2017). Week 8. Retrieved October 19, 2017, from https://xmlpipedb.cs.lmu.edu/biodb/fall2017/index.php/Week_8
