Vpachec3 Week 14

From LMU BioDB 2015
Jump to: navigation, search

Thursday,December 3

  • Kevin and I looked at Dr.Dahlquist's feedback on our spreadsheet so far.

Based on what was said we used the file Raw_compiled_data_KD20151124.xls to continue the process.This was because Dr. Dahlquist made the point that our chip has each gene spotted in quadruplicate. These are considered technical replicates and they should be averaged before doing any further analysis. Thus, we took the average of the 4 technical replicates with the average function on excel.We selected the gene for each replicate individually. The function looked like

=AVERAGE(C2,M2,W2,AG2)

We then clicked on the small black square at the bottom of the cell to have the function repeat and adjust for the remaining cells. We named each biofilm column "consolidated_Biofilm_#_scaled_centered_4. There are 5 columns for biofilms. We named each tobramycin column "consolidated-Tobramycin_#_scaled and centered_4. There are 3 columns for tobramycin.

This would all be on the first sheet named "quadruplicate_spots_separated"

  • We then started a new worksheet and called it "statistics"

On this page we have the following information from the previous worksheet: ID column, MasterIndex Column, the consolidated biofilms columns and the consolidated tobramycin columns. The additional columns added to the statistics worksheet are the following:


Average of the Consolidated biofilm columns

Formula:=AVERAGE(C2:G2)


Average of the Consolidated tobramycin columns

Formula:=AVERAGE(H2:J2)


Biofilm_Tobramycin_ratio column

Formula: =L2-K2


NOTE: Our reference sample is genomic DNA which means that we need the ratio of the averages for the biofilm and tobramycin samples to get the ratio of tobramycin to control (tobramycin over biofilm). Subtracting the biofilm average from the tobramycin average works because the numbers are in log space.


Pvalue column

Formula:=TTEST(C2:G2,H2:J2,2,3)


NOTE: The function used is a two-sample t test comparing the 5 biofilm samples to the 3 tobramycin samples.



  • We then started a new worksheet and called it "boneferroni_pval". This is where we are going to see what the values using the Bonferroni adjustments.

Columns copied and pasted from previous sheet: ID, MasterIndex, Biofilm_Tobramycin_ration, and p_value.

The following are the new columns added onto this sheet and their respective formulas:

boneferroni_p_value

Formula: =D2*7252

NOTE: This formula was applied only to the first cell in the column and we used the small black square at the bottom of the cell to have it apply accordingly to the rest of the cells in the column.

bonferroni_p_value

Formula:=IF(E2>1,1,E2)

NOTE: This column is different from the previous column (although named the same way) because the function takes the values form the previous boneferroni_p_value and it makes all the values greater than 1 is marked "1" in this new column and values less than one can stay in their original value in the new column.


  • Added another worksheet and named it "BH_pval". This worksheet we will have the Benjamin and Hochberg p value calculations.

Columns copied and pasted from the previous worksheet: ID, MasterIndex and p_value

Before adding any new columns, we organized the p_value column in ascending order.

We then added a "Rank" column and put a 1 in the first cell and a 2 in the second cell. The numbering was applied all the way down for the rest of the cells using the black box short cut. This is the start of the new columns to this worksheet, the rest of the news ones are as follows:

BH_p_value

Formula:=C2*D2

NOTE: This formal shows that this column is the value of the result the p_value multiplied by the rank. Again, it is important o t note that this is in ascending order so the smallest p value is multiplied by the smallest rank.


BH_Pvalue

Formula:=IF(E2>1,1,E2)



  • The last worksheet we added was named "forGenMAPP". This was specifically named so that the program GenMapp can recognize and pull the data from this specific worksheet.


Columns copied and pasted from previous worksheets:ID,all consolidated columns for the biofilms and tobramycin, Avg_Biofilm_scaled_centered,Avg_Tobramycin_scaled_centered, Biofilm_Tobramycin_ratio, Pvalue, Bonferroni_Pvalue and BH_Pvalue.


NOTE: The bonferroni and BH columns copied and pasted were the plain formula columns, not the IF statement columns.


There was only one new column added. In between the ID column and the first consolidate biofilm column, we added the column called "SystemCode". This was for formatting purposes for GenMAPP. Every cell in that column was designated a N.

Sanity Check: Number of genes significantly changed

Before we move on to the GenMAPP/MAPPFinder analysis, we want to perform a sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs and also compare our data analysis with the published results.

  • Open your spreadsheet and go to the "forGenMAPP" tab.
  • Click on cell A1 and select the menu item Data > Filter > Autofilter. Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
  • Click on the drop-down arrow on your "Pvalue" column. Select "Custom". In the window that appears, set a criterion that will filter your data so that the Pvalue has to be less than 0.05.
    • How many genes have p value < 0.05? and what is the percentage (out of 7251)?
      • 4318 genes which is 60%
    • What about p < 0.01? and what is the percentage (out of 7251)?
      • 2971 genes which is 41%
    • What about p < 0.001? and what is the percentage (out of 7251)?
      • 1460 genes which is 20%
    • What about p < 0.0001? and what is the percentage (out of 7251)?
      • 645 genes which is 9%


  • When we use a p value cut-off of p < 0.05, what we are saying is that you would have seen a gene expression change that deviates this far from zero less than 5% of the time.
  • We have just performed 5221 T tests for significance. Another way to state what we are seeing with p < 0.05 is that we would expect to see this magnitude of a gene expression change in about 5% of our T tests, or 261 times. (Test your understanding: http://xkcd.com/882/.) Since we have more than 261 genes that pass this cut off, we know that some genes are significantly changed. However, we don't know which ones. To apply a more stringent criterion to our p values, we performed the Bonferroni and Benjamini and Hochberg corrections to these unadjusted p values. The Bonferroni correction is very stringent. The Benjamini-Hochberg correction is less stringent. To see this relationship, filter your data to determine the following:
    • How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 7251)?
      • 179 genes which is 2.4%
    • How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 7251)?
      • 605 genes which is 8.3%


  • In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.
  • The "biofilm_tobramycin_ratio" tells us the size of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
    • Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change greater than zero.

How many are there? (and %)

  • 3279 genes which is 45%


Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change less than zero. How many are there? (and %)

  • 3127 genes which is 43%

What about an average log fold change of > 0.25 and p < 0.05? (and %)

  • 1613 genes which is 22%


Or an average log fold change of < -0.25 and p < 0.05? (and %) (These are more realistic values for the fold change cut-offs because it represents about a 20% fold change which is about the level of detection of this technology.)

  • 1519 genes which is 21%

Saturday, December 5

Kevin and I met up to run our data through genMAPP. We took the .gdb file uploaded by Brandon, our QA, and followed the protocol [1] used in the Vibrio run we did in week 8.


  • NOTE: I ran the data looking at the tree through the DECREASED perspective while Kevin was through the INCREASED perspective.


Screenshoterrorwindow12052015.png

Caption: This screen shot is the window of the errors produced once the data was run through GennMAPP.

Expandedtree20151205.png

Caption: This is the initial tree that comes up after putting in the criterion in GennMapp.

Rankedgeneontologyresults20151205.png

Caption: After collapsing the expanded tree, we pushed the "Show Ranked List" button to get the Gene Ontology Results as seen in this screenshot.

Sunday, December 6

Kevin and I met up to do our second sanity check. This is where we would compare the numbers we got from our data to the data reported in the paper.


We took the data from Van Acker et. al paper which is represented in Table 5[[2]].

We transferred the table onto Excel.

We inserted a column between the column "Van Acker et al" and "qPCR" column and we called it "US". NOTE: This spreadsheet is still being worked on thus the headers are not set to these names.

In the "Us" column we inserted the " Biofilm_Tobramycin_ratio" values from the forGenMAPP worksheet for the corresponding gene.

The values we inserted were done at random and not yet complete. We stopped because Kevin and I noticed that there are large discrepancies between the values we have transformed and the ones reported. Although, the direction seems to be almost complimentary, the values themselves are very different.

At this point, we emailed Dr. Dahlquist and Dr.Dionisio for further instruction for our concern.


UPDATE: After we received a response to our conundrum, the sanity check is now complete thanks to the diligent work of Kevin Wyllie


Sanity Check: Compare individual genes with known data

OLD VERSION: File:KWVP sanitycheck 20151206.xlsx

UPDATED VERSION:
KWVP Sanitycheck.png


In the spreadsheet, the values considered significant are marked in red.

Links

Vpachec3 User Page

GÉNialOMICS Links

Weekly Group Assignments Shared Group Journals Project Links Team Members