Emmatyrnauer Week 8

From LMU BioDB 2017

Electronic Notebook

Microarray Data Analysis

  1. I downloaded the starting Excel spreadsheet from Box.
  2. I changed the filename so that it contained my initials.
  3. I began by recording in my wiki the strain comparison and individual dataset that I analyzed, the filename, and the number of replicates for each strain and each time point in my data: this week I analyzed the wild type dataset (file name: wildtype_analysis_ET.xlsx). This dataset has 5 time points: 15 min (4 replicates), 30 min (5 replicates), 60 min (4 replicates), 90 min (5 replicates), and 120 min (5 replicates).
  4. I deleted the data columns of the strains that I was not analyzing (all except wild type) so that my file contained only the strain's data that I was working with.
  5. I replaced cells that contained "NA" (which indicates missing data) with empty cells by using the keyboard shortcut Control+F to open the "Find" dialog box and selecting the "Replace" tab.
  6. I typed "NA" in the search field, left the "Replace" field blank, and clicked the button "Replace all". 4109 replacements for NA were made.
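The same NA-to-blank replacement can be sketched in Python with pandas. This is a minimal sketch with a made-up stand-in for the Master_Sheet, not the actual workbook:

```python
import pandas as pd

# Small stand-in for the Master_Sheet; in the real workbook the string "NA"
# marks missing log fold change measurements.
df = pd.DataFrame({
    "wt_LogFC_t15-1": [1.2, "NA", -0.4],
    "wt_LogFC_t15-2": ["NA", 0.7, 0.3],
})

# Replace the literal string "NA" with a true missing value (NaN), so that
# functions like mean() and count() skip it, just as AVERAGE and COUNTA do
# in Excel once the cell is blank.
df = df.replace("NA", pd.NA)
n_replaced = int(df.isna().sum().sum())
print(n_replaced)  # number of "NA" cells converted
```

In Excel the Find-and-Replace dialog does the same conversion in place; in pandas the blank cells become NaN, which all downstream aggregation functions ignore.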

Part 1: Statistical Analysis

  1. I created a new worksheet, naming it "wt_ANOVA"
  2. I copied the first three columns containing the "MasterIndex", "ID", and "Standard Name" from the "Master_Sheet" worksheet for wt and pasted it into my new worksheet. I then copied the columns containing the data for my strain and pasted it into my new worksheet.
  3. At the top of the first column to the right of my data, I created five column headers of the form wt_AvgLogFC_(TIME) where (TIME) is 15, 30, etc.
  4. In the cell below the wt_AvgLogFC_t15 header, I typed =AVERAGE(
  5. I then highlighted all the data in row 2 associated with wt and t15, pressed the closing paren key (shift 0), and pressed the "enter" key: =AVERAGE(D2:G2)
  6. This cell now contained the average of the log fold change data from the first gene at t=15 minutes.
  7. I clicked on this cell and positioned my cursor at the bottom right corner. I saw my cursor change to a thin black plus sign (not a chubby white one). When it did, I double-clicked, and the formula was copied to the entire column of 6188 other genes.
  8. I repeated steps (4) through (7) with the t30, t60, t90, and the t120 data.
  9. Now in the first empty column to the right of the wt_AvgLogFC_t120 calculation, I created the column header wt_ss_HO.
  10. In the first cell below this header, I typed =SUMSQ(
  11. I highlighted all the LogFC data in row 2 for wt (but not the AvgLogFC), pressed the closing parenthesis key (shift 0), and pressed the "enter" key: =SUMSQ(D2:Z2)
  12. In the next empty column to the right of wt_ss_HO, I created the column headers wt_ss_(TIME) as in (3).
  13. I made a note of how many data points I had at each time point for my strain. For wild type it was "4" or "5". Total number of data points for wt was 23.
  14. In the first cell below the header wt_ss_t15, I typed =SUMSQ(<range of cells for logFC_t15>)-COUNTA(<range of cells for logFC_t15>)*<AvgLogFC_t15>^2 and hit enter: =SUMSQ(D2:G2)-COUNTA(D2:G2)*AA2^2
    • The COUNTA function counted the number of cells in the specified range that had data in them (i.e., did not count cells with missing values).
    • The phrase <range of cells for logFC_t15> was replaced by the data range associated with t15 (D2:G2).
    • The phrase <AvgLogFC_t15> was replaced by the cell number in which I computed the AvgLogFC for t15, and the "^2" squares that value.
    • Upon completion of this single computation, I used the Step (7) trick to copy the formula throughout the column.
  15. I repeated this computation for the t30 through t120 data points, making sure to use the data for each time point, count the right number of data points, and get the average from the appropriate cell for each time point. I copied the formula to the whole column for each computation.
  16. In the first column to the right of wt_ss_t120, I created the column header wt_SS_full.
  17. In the first row below this header, I typed =sum(<range of cells containing "ss" for each timepoint>) and hit enter: =sum(AG2:AK2)
  18. In the next two columns to the right, I created the headers wt_Fstat and wt_p-value.
  19. I recalled the number of data points from (13) and called that total n. (n=23 for wt)
  20. In the first cell of the wt_Fstat column, I typed =((n-5)/5)*(<wt_ss_HO>-<wt_SS_full>)/<wt_SS_full> and hit enter. =((23-5)/5)*(AF2-AL2)/AL2
    • I didn't actually type the n but instead used the number from (13). That is, 23.
    • I replaced the phrase wt_ss_HO with the cell designation.
    • I replaced the phrase <wt_SS_full> with the cell designation.
    • I copied to the whole column.
  21. In the first cell below the wt_p-value header, I typed =FDIST(<wt_Fstat>,5,23-5) and copied to the whole column. =FDIST(AM2,5,23-5)
  22. Before I moved on to the next step, I performed a quick sanity check to see if I did all of these computations correctly.
    • I clicked on cell A1 and clicked on the Data tab. I selected the filter icon (looked like a funnel). Little drop-down arrows appeared at the top of each column. This enabled me to filter the data according to criteria I set.
    • I clicked on the drop-down arrow on my wt_p-value column. I selected "Number Filters". In the window that appeared, I set a criterion that filtered my data so that the p value had to be less than 0.05.
    • Excel now only displayed the rows that corresponded to data meeting that filtering criterion. A number appeared in the lower left hand corner of the window giving me the number of rows that met that criterion.
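The per-gene computation from steps 4 through 21 can be sketched in Python with numpy and scipy. This is a minimal sketch: the replicate values below are made up for illustration, but the replicate structure (4, 5, 4, 5, 5; n = 23) matches the wild type dataset, and each line mirrors one of the Excel formulas above:

```python
import numpy as np
from scipy import stats

# Made-up log fold change replicates for one hypothetical gene at the five
# time points (4, 5, 4, 5, 5 replicates; n = 23, as in the wild type data).
data = {
    "t15":  [2.1, 1.8, 2.4, 2.0],
    "t30":  [2.9, 3.1, 2.7, 3.0, 2.8],
    "t60":  [2.5, 2.2, 2.6, 2.4],
    "t90":  [-1.1, -0.9, -1.3, -1.0, -1.2],
    "t120": [-0.3, -0.5, -0.2, -0.4, -0.1],
}
all_vals = np.concatenate([np.array(v) for v in data.values()])
n, m = len(all_vals), len(data)   # n = 23 observations, m = 5 time points

# ss_H0: sum of squares under the null hypothesis that the mean log FC is
# zero at every time point -- Excel's =SUMSQ(D2:Z2).
ss_H0 = float(np.sum(all_vals ** 2))

# ss_full: residual sum of squares when each time point has its own mean.
# Per time point this equals =SUMSQ(range)-COUNTA(range)*avg^2; summing the
# five pieces matches =sum(AG2:AK2).
ss_full = sum(float(np.sum((np.array(v) - np.mean(v)) ** 2))
              for v in data.values())

# F statistic and p value, matching =((n-5)/5)*(ss_H0-ss_full)/ss_full and
# =FDIST(F, 5, n-5) (the right tail of the F distribution).
F = ((n - m) / m) * (ss_H0 - ss_full) / ss_full
p = stats.f.sf(F, m, n - m)
print(F, p)
```

The SUMSQ-minus-COUNTA-times-average form and the sum-of-squared-deviations form are algebraically identical, which is why the Excel shortcut works.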

Calculating the Bonferroni p value Correction

  1. Now I performed adjustments to the p value to correct for the multiple testing problem. I labeled the next two columns to the right with the same label, wt_Bonferroni_p-value.
  2. I typed the equation =<wt_p-value>*6189. Upon completion of this single computation, I used the fill-down trick from Step (7) of Part 1 to copy the formula throughout the column: =AN2*6189
  3. I replaced any corrected p value that was greater than 1 with the number 1 by typing the following formula into the first cell below the second wt_Bonferroni_p-value header: =IF(wt_Bonferroni_p-value>1,1,wt_Bonferroni_p-value), where "wt_Bonferroni_p-value" referred to the cell in which the first Bonferroni p value computation was made. I used the same fill-down trick to copy the formula throughout the column: =IF(AO2>1,1,AO2)
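The two Bonferroni steps above (multiply by the number of tests, then cap at 1) can be sketched in Python. The p values in this sketch are made up, except the first, which is NSR1's unadjusted p value from the sanity check below:

```python
# Bonferroni correction: multiply each unadjusted p value by the number of
# tests (6189 genes) and cap the result at 1, matching the two Excel columns
# =AN2*6189 followed by =IF(AO2>1,1,AO2).
p_values = [2.86901e-10, 0.008083, 0.2, 0.9]  # mostly made-up p values
n_tests = 6189
bonferroni = [min(p * n_tests, 1.0) for p in p_values]
print(bonferroni)
```

Note how harsh the correction is: an unadjusted p value of ~0.008 becomes 1 after correction, so only extremely small p values survive.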

Calculating the Benjamini & Hochberg p value Correction

  1. I inserted a new worksheet named "wt_ANOVA_B-H".
  2. I copied and pasted the "MasterIndex", "ID", and "Standard Name" columns from my previous worksheet into the first three columns of the new worksheet.
  3. I copied my unadjusted p values from my ANOVA worksheet and pasted them into Column D using Paste special > Paste values.
  4. I selected all of columns A, B, C, and D and sorted by ascending values on Column D: I clicked the A-to-Z sort button on the toolbar, and in the window that appeared, sorted by column D, smallest to largest.
  5. I typed the header "Rank" in cell E1. I created a series of numbers in ascending order from 1 to 6189 in this column. This was the p value rank, smallest to largest. I typed "1" into cell E2 and "2" into cell E3. I selected both cells E2 and E3. I then double-clicked on the plus sign on the lower right-hand corner of my selection to fill the column with a series of numbers from 1 to 6189.
  6. I then calculated the Benjamini and Hochberg p value correction. I typed wt_B-H_p-value in cell F1. I then typed the following formula in cell F2: =(D2*6189)/E2 and pressed enter. I copied that equation to the entire column.
  7. I typed "wt_B-H_p-value" into cell G1.
  8. I typed the following formula into cell G2: =IF(F2>1,1,F2) and pressed enter. I copied that equation to the entire column.
  9. I selected columns A through G. I then sorted them by my MasterIndex in Column A in ascending order.
  10. I copied column G and used Paste special > Paste values to paste it into the next column on the right of my ANOVA sheet.
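The sort / rank / compute / re-sort sequence in steps 4 through 9 can be sketched in Python. The p values here are made up; the formula per gene is the same =(D2*6189)/E2 capped at 1 that the worksheet uses:

```python
# Benjamini-Hochberg correction as computed in the worksheet: rank the
# unadjusted p values smallest to largest, multiply each by n_tests/rank,
# cap at 1, and store the result back in the original (MasterIndex) order.
p_values = [0.008083, 2.86901e-10, 0.04, 0.9]  # made-up unadjusted p values
n_tests = len(p_values)                        # 6189 in the real dataset

# Indices of the p values in ascending order (the worksheet's sort step).
order = sorted(range(n_tests), key=lambda i: p_values[i])

bh = [0.0] * n_tests
for rank, i in enumerate(order, start=1):      # rank column E, starting at 1
    bh[i] = min(p_values[i] * n_tests / rank, 1.0)
print(bh)
```

Because the multiplier n_tests/rank shrinks as the rank grows, B-H penalizes the smallest p values far less than Bonferroni's flat factor of n_tests, which is why it retains many more significant genes.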

File

Media:Wt_Microarraydata_ET.zip

Sanity Check: Number of genes significantly changed

Before moving on to further analysis of the data, I wanted to perform a more extensive sanity check to make sure that I performed my data analysis correctly. I found out the number of genes that were significantly changed at various p value cut-offs.

  • I went to wt_ANOVA worksheet.
  • I selected row 1 (the row with my column headers) and selected the menu item Data > Filter > Autofilter (the funnel icon on the Data tab). Little drop-down arrows appeared at the top of each column. This enabled me to filter the data according to criteria I set.
  • I clicked on the drop-down arrow for the unadjusted p value. I set a criterion that filtered my data so that the p value had to be less than 0.05.
    • How many genes have p < 0.05? and what is the percentage (out of 6189)? 2528; 40.85%
    • How many genes have p < 0.01? and what is the percentage (out of 6189)? 1652; 26.70%
    • How many genes have p < 0.001? and what is the percentage (out of 6189)? 919; 14.85%
    • How many genes have p < 0.0001? and what is the percentage (out of 6189)? 496; 8.01%
  • When we use a p value cut-off of p < 0.05, what we are saying is that we would have seen a gene expression change that deviates this far from zero by chance less than 5% of the time.
  • I just performed 6189 hypothesis tests. Another way to state what we are seeing with p < 0.05 is that we would expect to see such a gene expression change for at least one of the timepoints by chance in about 5% of our tests, or about 309 times. Since we have more than 309 genes that pass this cut-off, we know that some genes are significantly changed; however, we don't know which ones. To apply a more stringent criterion to our p values, we performed the Bonferroni and Benjamini and Hochberg corrections to these unadjusted p values. The Bonferroni correction is very stringent; the Benjamini-Hochberg correction is less stringent. To see this relationship, I filtered my data to determine the following:
    • How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 6189)? 248; 4.01%
    • How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 6189)? 1822; 29.44%
  • In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.
  • We will compare the numbers we get between the wild type strain and the other strains studied, organized as a table. I used the sample PowerPoint slide to see how the table should be formatted and uploaded my slide to the wiki.
    • Note that since the wild type data is being analyzed by one of the groups in the class, it will be sufficient for this week to supply just the data for your strain. We will do the comparison with wild type at a later date.
  • Comparing results with known data: the expression of the gene NSR1 (ID: YGR159C) is known to be induced by cold shock. I found NSR1 in my dataset. Unadjusted p-value: 2.86901E-10; Bonferroni-corrected p-value: 1.77563E-06; B-H-corrected p-value: 7.52066E-10. Average log fold change: t15=3.279, t30=3.621, t60=3.527, t90=-2.050, t120=-0.6062. Note that the average log fold change is what we called "(STRAIN)_AvgLogFC_(TIME)" in step 3 of the ANOVA analysis. Does NSR1 change expression due to cold shock in this experiment? Since the log fold change is a log2 value, a log fold change greater than or equal to 1, or less than or equal to -1, corresponds to at least a 2-fold difference in expression (meaning the gene is changing expression). Because the log fold change meets this criterion at t15 (3.279), t30 (3.621), t60 (3.527), and t90 (-2.050), the gene definitely changes expression. Because NSR1 has a Bonferroni-corrected p value < 0.05, it is among the most significantly changed genes in the dataset, since this is the most selective criterion (therefore, we have a lot of confidence that expression of this gene is changing).
  • For fun, I found my favorite gene, ADA2 (from my web page), in the dataset. Unadjusted p-value: 0.008083323; Bonferroni-corrected p-value: 1; B-H-corrected p-value: 1. Average log fold change: t15=-1.063078929, t30=-0.44188769, t60=-0.999519175, t90=-1.265472205, t120=-0.479548402. My favorite gene changes expression due to cold shock because t15=-1.063078929 and t90=-1.265472205 are both less than -1. Unlike NSR1, which has a Bonferroni-corrected p value < 0.05, ADA2's unadjusted p-value of ~0.008 becomes greater than 0.05 under both the Bonferroni and B-H corrections (so we have less confidence that expression of this gene is changing). Comparing NSR1 to ADA2, NSR1 shows more significant changes in expression.
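The 2-fold change criterion applied above can be checked with a short Python sketch; the log fold change values are NSR1's averages as reported in the sanity check:

```python
# NSR1's average log2 fold changes from this dataset, by time point.
nsr1 = {"t15": 3.279, "t30": 3.621, "t60": 3.527, "t90": -2.050, "t120": -0.6062}

for t, log_fc in nsr1.items():
    fold = 2 ** log_fc            # back-transform the log2 value to a ratio
    changed = abs(log_fc) >= 1    # the "at least 2-fold change" criterion
    print(f"{t}: {fold:.2f}-fold ratio, changed={changed}")
```

Four of the five time points (all but t120) pass the |log2 FC| >= 1 threshold, matching the conclusion drawn above.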


Powerpoint Slide

Media:WT_p-value_slide_ET.pptx

Summary Paragraph

In this week's assignment, microarray data for wild type Saccharomyces (with ratios that had previously been normalized using a statistics package) were analyzed. The data included the log fold change of the red/green ratio at different time points: t15, t30, and t60 (cold shock at 13°C), and t90 and t120 (cold shock at 13°C followed by 30 or 60 minutes of recovery at 30°C). Multiple measurements were taken at each time point, and statistical analysis was performed on the data to determine the effects of cold shock on the alteration of gene expression. Data analysis included calculations of the unadjusted, Bonferroni-corrected, and B-H-corrected p-values. Changes in gene expression in response to cold shock were recorded for 40.9%, 26.7%, 14.9%, and 8.0% of genes at cut-offs of p < 0.05, p < 0.01, p < 0.001, and p < 0.0001, respectively. Based on these data, less than half of the genes examined are changing expression (specifically, only 40.9% of the genes, and that is when we designate changes in expression at p < 0.05). Even fewer are said to be changing expression at stricter cut-offs, with only 8.0% of genes changing expression if we set p < 0.0001. Bonferroni-corrected (most stringent) and B-H-corrected (less stringent) p-values were calculated to determine how significantly some genes were affected over others (4.01% and 29.44% of genes, respectively, displaying a change in gene expression at p < 0.05). These p-values help measure the confidence level of changes in gene expression: a smaller p value cut-off corresponds to higher confidence, while a larger cut-off corresponds to lower confidence. When comparing NSR1 data to ADA2 (my favorite gene), I concluded that ADA2 was not affected much by cold shock because its p values lost significance after correction, even though its log fold change remained negative throughout.

Acknowledgements

  1. I worked with my homework partner Eddie Azinge outside of class. We met face-to-face to compare our data.
  2. Dr. Dahlquist for teaching and assisting us with the data analysis
  3. Microsoft Excel was used to carry out the statistical analysis
  4. I copied and modified the instructions from the Week 8 assignment page.
  5. I analyzed Dr. Dahlquist's microarray data. The excel file was downloaded from the Week 8 assignment page.
  6. While I worked with the people noted above, this individual journal entry was completed by me and not copied from another source.

Emmatyrnauer (talk) 15:52, 23 October 2017 (PDT)

References

LMU BioDB 2017. (2017). Week 8. Retrieved October 23, 2017, from https://xmlpipedb.cs.lmu.edu/biodb/fall2017/index.php/Week_8

Links

  1. My User Page
  2. List of Assignments
  3. List of Journal Entries
  4. List of Shared Journal Entries