QLanners Week 8

From LMU BioDB 2017
Jump to: navigation, search

General Information and Metadata

  • Strain analyzed: dASH1
  • Filename: DASH1_analysis QL.xlsx
  • Time points: 4
    • 15 min
    • 30 min
    • 60 min
    • 90 min
  • Number of replicates: 15 total
    • 15 min: 4
    • 30 min: 4
    • 60 min: 4
    • 90 min: 3
  • Number of NAs in data: 10,854


Electronic Journal

Part 1: Statistical Analysis Part 1

The purpose of the witin-stain ANOVA test is to determine if any genes had a gene expression change that was significantly different than zero at any timepoint.

  1. Download the excel spreadsheet "BIOL367_Fall2017_Dahlquist-microarray-data-master_20171017.xlsx" from the BIOL367_Fall2017 DropBox.
  2. Before beginning any analysis, immediately change the filename to the format dASH1_analysis_[Your initials]
  3. Create a new worksheet, naming it "dASH1_ANOVA".
  4. Copy the first three columns containing the "MasterIndex", "ID", and "Standard Name" from the "Master_Sheet" worksheet and paste it into your new worksheet. Copy the columns containing the data for dASH1 and paste them into your new worksheet to the right of the first three columns.
  5. Next you will replace cells that have "NA" in them (which indicates missing data) with an empty cell.
    • Use the keyboard shortcut Control+F to open the "Find" dialog box and select the "Replace" tab.
    • Type "NA" in the Search field and don't type anything in the "Replace" field.
    • Click the button "Replace all". There should be 10,854 cells of NAs.
  6. At the top of the first column to the right of your data, create five column headers of the form dASH1_AvgLogFC_(TIME) where (TIME) is 15, 30, 60, and 90. The dASH1 sample data has no 120 min data.
  7. In the cell below the dASH1_AvgLogFC_t15 header, type =AVERAGE(
  8. Then highlight all the data in row 2 associated with dASH1 and t15, press the closing paren key (shift 0),and press the "enter" key.
  9. This cell now contains the average of the log fold change data from the first gene at t=15 minutes.
  10. Click on this cell and position your cursor at the bottom right corner. You should see your cursor change to a thin black plus sign (not a chubby white one). When it does, double click, and the formula will magically be copied to the entire column of 6188 other genes.
  11. Repeat steps (7) through (11) with the t30, t60, and t90 data.
  12. Now in the first empty column to the right of the dASH1_AvgLogFC_t120 calculation, create the column header dASH1_ss_HO.
  13. In the first cell below this header, type =SUMSQ(
  14. Highlight all the LogFC data in row 2 for dASH1 (but not the AvgLogFC), press the closing paren key (shift 0),and press the "enter" key.
  15. In the next empty column to the right of dASH1_ss_HO, create the column headers dASH1_ss_(TIME) as in (6).
  16. In the first cell below the header dASH1_ss_t15, type =SUMSQ(<range of cells for logFC_t15>)-COUNTA(<range of cells for logFC_t15>)*<AvgLogFC_t15>^2 and hit enter.
    • The COUNTA function counts the number of cells in the specified range that have data in them (i.e., does not count cells with missing values).
    • The phrase <range of cells for logFC_t15> should be replaced by the data range associated with t15.
    • The phrase <AvgLogFC_t15> should be replaced by the cell number in which you computed the AvgLogFC for t15, and the "^2" squares that value.
    • Upon completion of this single computation, use the Step (10) trick to copy the formula throughout the column.
  17. Repeat this computation for the t30 through t90 data points.
  18. In the first column to the right of dASH1_ss_t120, create the column header dASH1_SS_full.
  19. In the first row below this header, type =sum(<range of cells containing "ss" for each timepoint>) and hit enter.
  20. In the next two columns to the right, create the headers dASH1_Fstat and dASH1_p-value.
  21. In the first cell of the dASH1_Fstat column, type =((15-4)/4)*(<dASH1_ss_HO>-<dASH1_SS_full>)/<dASH1_SS_full> and hit enter. Where the 15 is the total number of replicates and the 4 comes from the number of distinct time intervals measured.
    • Replace the phrase dASH1_ss_HO with the cell designation.
    • Replace the phrase <dASH1_SS_full> with the cell designation.
    • Copy to the whole column.
  22. In the first cell below the dASH1_p-value header, type =FDIST(<dASH1_Fstat>,4,15-4) replacing the phrase <dASH1_Fstat> with the value from the correct cell.
  23. Copy to the whole column.
  24. Click on cell A1 and click on the Data tab. Select the Filter icon (looks like a funnel). Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
  25. Click on the drop-down arrow on dASH1_p-value column. Select "Number Filters". In the window that appears, set a criterion that will filter your data so that the p value has to be less than 0.05.
  26. Excel will now only display the rows that correspond to data meeting that filtering criterion. A number will appear in the lower left-hand corner of the window giving you the number of rows that meet that criterion. This number should be 1630 of the 6189 total rows.

Calculate the Bonferroni and p value Correction

  1. Now we will perform adjustments to the p value to correct for the multiple testing problem. Label the next two columns to the right with the same label, dASH1_Bonferroni_p-value.
  2. Type the equation =<dASH1_p-value>*6189, Upon completion of this single computation, use the Step (10) trick to copy the formula throughout the column.
  3. Replace any corrected p value that is greater than 1 by the number 1 by typing the following formula into the first cell below the second dASH1_Bonferroni_p-value header: =IF(dASH1_Bonferroni_p-value>1,1,dASH1_Bonferroni_p-value), where "dASH1_Bonferroni_p-value" refers to the cell in which the first Bonferroni p value computation was made. Use the Step (10) trick to copy the formula throughout the column.

Calculate the Benjamini & Hochberg p value Correction

  1. Insert a new worksheet named "dASH1_ANOVA_B-H".
  2. Copy and paste the "MasterIndex", "ID", and "Standard Name" columns from the previous worksheet into the first two columns of the new worksheet.
  3. For the following, use Paste special > Paste values. Copy your unadjusted p values from your ANOVA worksheet and paste it into Column D.
  4. Select all of columns A, B, C, and D. Sort by ascending values on Column D. Click the sort button from A to Z on the toolbar, in the window that appears, sort by column D, smallest to largest.
  5. Type the header "Rank" in cell E1. We will create a series of numbers in ascending order from 1 to 6189 in this column. This is the p value rank, smallest to largest. Type "1" into cell E2 and "2" into cell E3. Select both cells E2 and E3. Double-click on the plus sign on the lower right-hand corner of your selection to fill the column with a series of numbers from 1 to 6189.
  6. Now you can calculate the Benjamini and Hochberg p value correction. Type dASH1_B-H_p-value in cell F1. Type the following formula in cell F2: =(D2*6189)/E2 and press enter. Copy that equation to the entire column.
  7. Type "dASH1_B-H_p-value" into cell G1.
  8. Type the following formula into cell G2: =IF(F2>1,1,F2) and press enter. Copy that equation to the entire column.
  9. Select columns A through G. Now sort them by your MasterIndex in Column A in ascending order.
  10. Copy column G and use Paste special > Paste values to paste it into the next column on the right of your ANOVA sheet.

Sanity Check: Number of genes significantly changed

  • Go to your dASH1_ANOVA worksheet.
  • Select row 1 (the row with your column headers) and select the menu item Data > Filter > Autofilter (The funnel icon on the Data tab). Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
  • Using the filter option on the various p value columns, the number of rows corresponding to the following p values should be:
    • Unadjusted p value
      • Less than 0.05
        • 1630
        • Percentage= 26.34%
      • Less than 0.01
        • 880
        • Percentage= 14.22%
      • Less than 0.001
        • 356
        • Percentage= 5.75%
      • Less than 0.0001
        • 142
        • Percentage= 2.29%
    • Bonferroni-corrected p value
      • Less than 0.05
        • 53
        • Percentage= 0.856%
    • Benjamin and Hochberg-corrected p value
      • Less than 0.05
        • 730
        • Percentage= 11.80%


Gene Analysis

NSR1 gene

  • p values
    • unadjusted: 2.92024E-06
    • Bonferroni-corrected: 0.018073394
    • Benjamin and Hochberg-corrected: 0.000430319
  • Average Log fold changes
    • At 15 min: 1.886696138
    • At 30 min: 2.611946831
    • At 60 min: 1.394879511
    • At 90 min: -2.673801047

The NSR1 gene most certainly changes expression due to the cold shock in this experiment. This is apparent from the large values for all of the Log fold changes at every time interval (as a 0 Log value would mean no change in expression and thus a large non-zero Log value means a large change in expression). And the very small p-values suggest that these change in expressions are likely not just due to chance.


ADA2 gene

  • p values
    • unadjusted: 0.78502135
    • Bonferroni-corrected: 1
    • Benjamin and Hochberg-corrected: 0.945600843
  • Average Log fold changes
    • At 15 min: -0.021259798
    • At 30 min: -0.172685741
    • At 60 min: -0.622498949
    • At 90 min: -0.347491051

The ADA2 gene did not change expression due to the cold shock in this experiment. This is evident from the smaller average log change values and the larger p values.


Summary Paragraph

In this experiment, the effect of cold shock on the expression of all of the genes in the dASH1 strain of yeast was measured. Data was collected at four separate time intervals (15 min, 30 min, 60 min, and 90 min). The first sixty minutes where under cold shock, while the last thirty were recovering from the cold shock. At each time interval, 4 measurements were taken (except for 3 at 90 min) for a total of 15 measurements for each gene in the dASH1 yeast strain. In this week's assignment, the data from this experiment were analyzed. The average Log fold change at each time interval was analyzed for each gene, and different p values were calculated to determine the number of genes whose expression was affected by the cold shock. At an unadjusted p-value of less than 0.05, 26.34% of the genes displayed changed expression in cold shock. At a more confident unadjusted p value of less than 0.01, 14.22% of the genes displayed changed expression in cold shock. At an even more confident unadjusted p value of 0.001, 5.75% of the genes displayed changed expression in cold shock. And at the highest unadjusted p value confidence of 0.0001, 2.29% of the genes displayed changed expression in cold shock.
However, due to the high number of genes, a way to correct the multiple testing problem was needed. Thus the very strict Bonferroni-corrected p value and the slightly more lenient Benjamin and Hochberg-corrected p value were used to account for this. At a Bonferroni-corrected p value of less than 0.05, 0.856% of the genes displayed changed expression in cold shock. And at a Benjamin and Hochberg-corrected p value of less than 0.05, 11.80% of the genes displayed changed expression in cold shock.
Finally, the data on the gene NSR1 suggested that this gene was strongly affected by the cold shock, while my favorite gene, ADA2, was not affected by the cold shock (these findings are outlined in the section above).

Deliverables

Media:DASH1 analysis QL.zip
Media:DASH1 Table QL.pptx


Acknowledgements

  • I worked in class with my homework partner Eddie Bachoura to assure that we were performing the correct data manipulations in excel. I also met with Eddie once outside of class to ensure that we had both performed the data anslysis correctly and that we had finished all of the deliverables.
  • I copied the template of the procedure from the Week 8 assignment page to use as the template for my electronic journal. I modified this template to be specific for my particular strand dASH1.
  • Dr. Dahlquist for providing guidance on how to perform data analysis in Excel
  • Microsoft Excel for providing the software with which to perform this data analysis.

Qlanners (talk) 21:40, 19 October 2017 (PDT)

References

LMU BioDB 2017. (2017). Week 8. Retrieved October 19, 2017, from https://xmlpipedb.cs.lmu.edu/biodb/fall2017/index.php/Week_8

Links

Main Page
User Page
Assignment Pages: Week 1 | Week 2 | Week 3 | Week 4 | Week 5 | Week 6 | Week 7 | Week 8 | Week 9 | Week 10 | Week 11 | Week 12 | Week 14 | Week 15
Journal Entry Pages: Week 1 | Week 2 | Week 3 | Week 4 | Week 5 | Week 6 | Week 7 | Week 8 | Week 9 | Week 10 | Week 11 | Week 12 | Week 14 | Week 15
Shared Journal Pages: Week 1 | Week 2 | Week 3 | Week 4 | Week 5 | Week 6 | Week 7 | Week 8 | Week 9 | Week 10
Group Project Page: JASPAR the Friendly Ghost