Kevin Wyllie Week 14

From LMU BioDB 2015
Jump to: navigation, search

Electronic Lab Notebook

Statistical Analysis and Formatting

We looked at Dr. Dahlquist's comments on our prior worksheet to make a second attempt at processing and formatting for GenMAPP. Our biggest error in the the previous attempt was forgetting to account for the fact that each gene was spotted in (technical) quadruplicates.

  1. Dr. Dahlquist prepared a sheet for us to start with, which separated each quadruplicate of the genome across columns. So the ID column for the first technical replicate was in column A, followed by fold change data for one of the four technical replicates within each of the biological replicates, for the (untreated) biofilm and tobramycin-treated biofilm samples. Next was the second ID column (identical to the first) in column J, followed by the second set of technical replicates within each biological replicate. These sheet was named "quadruplicated_spots_separated".
  2. After all four sets of technical replicates, yet another ID column (column AK) was added, following by the averaged fold change values across each technical replicate (for each biological replicate - columns AL through AS).
    • For these averages the =AVERAGE(*four technical replicates*) command was used and pasted through the entirety of each column. The *four technical replicates* is a placeholder for the cell coordinates corresponding to each technical replicate within a biological replicate.
    • Because all five untreated and all three treated samples will be averaged later in the protocol, the word "consolidated" (instead of "average") was used in the headings for these columns. For example, the column heading for the first biological untreated sample was "consolidated_Biofilm_1_scaled_centered_4".
  3. Next, a new sheet was created, named "statistics". Pasted into this sheet (using the "Paste Values" function) were the "consolidated" columns (preceded by the ID column). Thus, the ID's were in column A while the consolidated fold changes for each biological replicate were in columns B through J. The average across the five untreated samples were calculated in column K (using the previously mentioned command), with the header "AVG_Biofilm_scaled_centered". In column L, the same was done using the three treated samples.
  4. In column M, the ratios between the averages for treated and untreated samples were calculated. However, because these fold changes are in log space the fold changes for the treated samples were subtracted from that of the untreated samples. The header for this column was "Biofilm_Tobramycin_ratio".
    • An example command for the gene in 2, is =L2-K2.
  5. Next, the p-value's for each fold change ratio was calculated in column M. A type-3, 2-tailed T-test was used.
    • The command for the first gene (row 2) is =TTEST(C2:G2,H2:J2,2,3). As before, this was pasted through all genes.
  6. The next sheet was named "bonferroni_pval". Pasted into this sheet were the gene ID's (column A), the biofilm-tobramycin fold change ratios (column B) and the previously calculated P-values (column C).
  7. The header for column D was "bonferroni_p_value," and in this column were the Bonferroni adjusted P-values. To calculate this value, each previous P-value was multiplied by 7,251 (the number of genes on the sheet).
    • For example, the command for the first gene (cell D2) was =C2*7251.
  8. Because this yields many P-values which are over 1 (which statistically speaking, should be impossible), a second "bonferroni_p_value" column was added next to the previous one (column E). In this column, P-values over 1 were replaced with a 1.
    • The command for cell E2 was =IF(D2>1,1,D2).
  9. The next sheet was named "BH_pval".
  10. First, the same information pasted into the "bonferroni_p_value" sheet was pasted into this new sheet.
  11. A column was then inserted between columns A and B. The header for this new column (now column B), was "Master Index." With the gene ID's in numerical order, an ascending count was added to the Master Index column. This was done by adding values 1 and 2 to cells B2 and B3 respectively, highlighting these two cells, and right-clicking on the black square at the bottom-right of the highlighted region. This will automatically fill in the ascending count.
  12. A filter was then added to the P-value column (these are the unadjusted P-values, not the previously calculated Bonferroni-adjusted ones), and was used to sort this column in ascending order.
  13. After the genes (rows) were ordered by ascending P-value, they were ranked in column D (with the header "Rank"). This was done by adding values 1 and 2 to cells D2 and D3 respectively, highlighting these two cells, and right-clicking on the black square at the bottom-right of the highlighted region. This will automatically fill in the rank (an ascending count).
  14. In column E (with the header "BH_p_value") the Benjamini-Hochberg adjusted P-values were calculated. This was done by multiplying each unadjusted P-value by its own rank.\
    • The command for row 2 was =C2*D2.
  15. As done for the Bonferroni adjustment, BH P-values which were over 1 were replaced with a 1 in column F (also with the header "BH_p_value").
    • The command for row 2 was =IF(E2>1,1,E2).
  16. The final sheet was named "forGenMAPP". This included columns A-M of the "statistics" sheet (gene ID's, consolidated fold change values, averages for treated and untreated samples, fold change ratios between treated and untreated samples, and corresponding P-values), pasted into those same columns. These were followed by the calculated Bonferroni and BH P-values in columns O and P respectively.
  17. All fold change values in the "forGenMAPP" sheet were formatted to include two decimals, while P-values were formatted to include four.

The final spreadsheet is available here.

Sanity Check

KWVP Sanitycheck.png
  • The criteria we chose for significance was BH P-value > 0.05. This results in 609 genes being considered significant, or 8.4% of the total genes. Obviously, most of Van Acker et al's reportedly-significant genes (by unadjusted P-value > 0.05) were considered unsignificant by our criteria. However, we are confident in using the BH-adjusted P-value, as Van Acker et al's unadjusted criteria considers 2688 genes (~31%) to be significant, and this is number is quite high.
  • The difference in actual log fold change values may be alarming as well. However, the main point in our data processing protocol at which this log fold changes decreased so dramatically was in finding the ratios between fold changes for treated and untreated samples (which were actually subtracted, as the values are in log space), and we are confident that this step is necessary, as both treated and untreated samples were hybridized to the microarray slide against genomic DNA.



There are 7251 genes in the sheet.

  • How many genes have p value < 0.05? And what is the percentage?
    • 4318, ~60%
  • What about p < 0.01?
    • 2971, ~41%
  • What about p < 0.001?
    • 1460, ~20%
  • What about p < 0.0001?
    • 645, ~9%
  • How many genes are p < 0.05 for the Bonferroni-corrected p value?
    • 179, ~2.4%
  • How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value?
    • 609, ~8.4%
  • "Pvalue" < 0.05, and "Biofilm_Tobramycin_ratio" > 0.
    • 2169, ~30%
  • "Pvalue" < 0.05, and "Biofilm_Tobramycin_ratio" < 0.
    • 2149, ~30%

This is a more realistic values for the fold change cut-offs because it represents about a 20% fold change which is about the level of detection of this technology:

  • "Biofilm_Tobramycin_ratio" > 0.25 and "Pvalue" < 0.05.
    • 1488, ~21%
  • "Biofilm_tobramycin_ratio" < -0.25 and "Pvalue" < 0.05.
    • 2830, ~40%

GenMAPP Import

  1. With the "forGenMAPP" sheet open, the excel file was then saved as a Text Tab Delimited file.
  2. GenMAPP 2.1 was opened.
  3. The newly-created database was loaded into the program with "Data" > "Choose Gene Database". This database is available here.
  4. Next "Data" > "Expression Dataset Manager" was selected.
  5. In the Expression Dataset Manager window, "Expression Datasets" > "New Dataset" was selected, and the txt file from the "forGenMAPP" sheet was selected.
  6. Two criteria were created in the Expression Dataset Manager window for increases and decreases in expression.
    • Note: these criteria may have not been done correctly. Further analysis in GenMAPP will be conducted next week, and these criteria will be more closely examined. The main goal for this week is to generate the exceptions file (see below).
  7. GenMAPP processed the data and generated the following exceptions file, citing 284 errors.
    • The exceptions file can be found here.





Links

Weekly Group Assignments Shared Group Journals Project Links Team Members

User:kwyllie