Lenaolufson Week 14

From LMU BioDB 2015
Jump to: navigation, search

Lena Olufson

12/1/15

  • I followed the protocol given by Dr. Dahlquist left on my talk page to perform the correct steps for editing the excel data sheet. The protocol is as follows:
    1. I renamed Sheet1 to "CompiledRawData".
    2. I renamed my column heading as follows:
      • I called my leftmost column "ID" instead of Code.
      • For the data columns, I got rid of the "(635/532)" from each header. I named them like this as an example: "LogRatio_SampleA_Cy3-Cy5".
    3. Once I renamed the columns, I did all further manipulations in a different sheet. I copied and pasted all of the data into Sheet2 which I renamed to "DyeSwap".
    4. I created a "MasterIndex" column as follows. I inserted a new column to the right of the "ID" column and named it "MasterIndex". In this column I created a numerical index of genes so that I can always sort them back into the same order that they started out in.
      • I typed a "1" in cell B2 and a "2" in cell B3.
      • I selected both cells. I hovered my mouse over the bottom-right corner of the selection until it made a thin black + sign. I double-clicked on the + sign to fill the entire column with a series of numbers from 1 to 8448 (the number of spots on the microarray).
    5. Then, I selected all of the data and sorted it A-->Z on the "ID" column.
    6. I deleted all of the rows that had an ID of "_". The number of records after deleting the "_" columns: 7104.
    7. Then I swapped the dye orientation so that all of the samples were Cy5/Cy3.
      • I inserted a column to the right of the columns that needed to be swapped. I named the new column the same as I did before, but added "_swapped" to the header to designate that I swapped the samples.
      • Then, I typed a formula in the column: =C2*(-1). I copied and pasted the formula to the entire column.
    8. I created a new worksheet that I named "MasterSheet". I copied and Pasted special > Paste values the ID, MasterIndex, and data columns that were all in the orientation of Cy5/Cy3 (the original ones and the ones I just swapped).
    9. This was then the starting point for the normalization and statistics. I copied and pasted the data from this sheet into a new worksheet, which I renamed "ScalingCentering".
    10. In this new sheet, I performed the scaling and centering according to the Vibrio cholerae instructions found here.]
      • When I computed the average and standard deviation calculations for the log ratios, all of the values that came out were much too high to make sense with the data. Upon looking at the data and consulting with Dr. Dahlquist, we found that some of the values from the raw data were extremely large such as 100000.
    11. At this point, I posted my spreadsheet and e-mailed Dr. Dahlquist the link to it.

12/3/15

  • I followed the protocol given by Dr. Dahlquist after she reviewed the most updated version of the excel spreadsheet:
  • First I fixed my MasterSheet by getting rid of the columns I didn't need.
  • What did next was not to delete the genes, but instead I did a search and replace. I searched for "100000" and "-1000000" and replaced it with an empty cell. In the find/replace window, I typed 100000 or -100000 in the find field and nothing at all in the replace field and then I clicked on the replace all button.
    • replaced: 14 "100000" entries with nothing
  • Then I copied and paste special > paste values into my scaling and centering sheet and tried again. Things are OK because my Average for each column is a very small number near zero and my standard deviation is near 1.
  • Following the instructions from the Vibrio cholerae instructions found here.]
  • To scale and center the data (between chip normalization) I performed the following operations:
  • I inserted a new Worksheet into my Excel file, and named it "scaled_centered".
  • I went back to the "compiled_raw_data" worksheet, Select All and Copy. I went to my new "scaled_centered" worksheet, clicked on the upper, left-hand cell (cell A1) and Pasted.
  • I inserted two rows in between the top row of headers and the first data row.
  • In cell A2, I typed "Average" and in cell A3, I typed "StdDev".
  • I then computed the Average log ratio for each chip (each column of data). In cell B2, I typed the following equation:
    • =AVERAGE(B4:B7107)
  • and pressed "Enter". Excel computed the average value of the cells specified in the range given inside the parentheses.
  • I then computed the Standard Deviation of the log ratios on each chip (each column of data). In cell B3, I typed the following equation:
    • =STDEV(B4:B7107)
  • and pressed "Enter".
  • I copied these two equations (cells B2 and B3) and pasted them into the empty cells in the rest of the columns. Excel automatically changed the equation to match the cell designations for those columns.
  • I copied the column headings for all of my data columns and then pasted them to the right of the last data column so that I had a second set of headers above blank colums of cells. I edited the names of the columns so that they now read: LogRatio_SampleA_Cy5-Cy3_scaled_centered, etc.
  • In cell I4, I typed the following equation:
    • =(C4-C$2)/C$3
  • In this case, I wanted the data in cell C4 to have the average subtracted from it (cell C2) and be divided by the standard deviation (cell C3). I used the dollar sign symbols in front of the "2" and "3" to tell Excel to always reference that row in the equation.
  • I copied and pasted this equation into the entire column.
  • I copied and pasted the scaling and centering equation for each of the columns of data with the "_scaled_centered" column header, altering the formula for each column so that the correct cells were being used in the calculations.
  • Here is the most updated excel data sheet at this point: File:Bpertussis CompiledRawData MS2015.xlsx

12/7/15

  • Today I used the protocol from the Vibrio cholerae instructions found here.] but I altered it to fit my specific data.
  • I inserted a new worksheet and named it "statistics".
  • I went back to the "scaling_centering" worksheet and copied the first column ("ID").
  • I pasted the data into the first column of my new "statistics" worksheet.
  • I went back to the "scaling_centering" worksheet and copied the columns that were designated "_scaled_centered".
  • I went to my new worksheet and clicked on the B1 cell. I selected "Paste Special" from the Edit menu. A window opened: I clicked on the radio button for "Values" and clicked OK. This pasted the numerical result into my new worksheet instead of the equation which must have made calculations on the fly.
  • I deleted Rows 2 and 3 where it said "Average" and "StDev" so that my data rows with gene IDs were immediately below the header row 1.
  • I went to a new column on the right of my worksheet. I typed the header "Avg_SampleA", "Avg_SampleB", and "Avg_SampleC", etc. into the top cell of the next three columns.
  • I computed the average log fold change for the replicates for each patient by typing the equation:
=AVERAGE(Q2,R2)
into cell X2. I copied this equation and pasted it into the rest of the column.
  • I created the equation for samples B and C and pasted it into their respective columns.
  • I then computed the average of the averages. I typed the header "Avg_ABC_Samples" into the first cell in the next empty column. I created the equation that computed the average of the three previous averages I calculated and pasted it into this entire column.
  • I inserted a new column next to the "Avg_ABC_Samples" column that I computed in the previous step. I labeled the column "Tstat". This computed a T statistic that told me whether the scaled and centered average log ratio was significantly different than 0 (no change). I entered the equation:
=AVERAGE(X2:Z2)/(STDEV(X2:Z2)/SQRT(3))
  • I labeled the top cell in the next column "Pvalue". In the cell below the label, I entered the equation:

=TDIST(ABS(AB2),2,2) The number of degrees of freedom was the number of replicates minus one, so in my case there were 2 degrees of freedom. I copied the equation and pasted it into all rows in that column.

Calculate the Bonferroni p value Correction

  • Then I performed adjustments to the p value to correct for the multiple testing problem. I labeled the next two columns to the right with the same label, Bonferroni_Pvalue.
  • I typed the equation =AC2*3552, Upon completion of this single computation, I used the trick to copy the formula throughout the column.
  • I replaced any corrected p value that was greater than 1 by the number 1 by typing the following formula into the first cell below the second Bonferroni_Pvalue header: =IF(AD2>1,1,AD2). I used the trick to copy the formula throughout the column.

Calculate the Benjamini & Hochberg p value Correction

  • I inserted a new worksheet named "B-H_Pvalue".
  • I copied and pasted the "ID" column from my previous worksheet into the first column of the new worksheet.
  • I inserted a new column on the very left and named it "MasterIndex". I created a numerical index of genes so that I could always sort them back into the same order.
    • I typed a "1" in cell A2 and a "2" in cell A3.
    • I selected both cells. I hovered my mouse over the bottom-right corner of the selection until it made a thin black + sign. I double-clicked on the + sign to fill the entire column with a series of numbers from 1 to 3552 (the number of genes on the microarray).
  • For the following, I used Paste special > Paste values. I copied my unadjusted p values from my previous worksheet and pasted it into Column C.
  • I selected all of columns A, B, and C. I sorted by ascending values on Column C. I clicked the sort button from A to Z on the toolbar, in the window that appeared, sorted by column C, smallest to largest.
  • I typed the header "Rank" in cell D1. I created a series of numbers in ascending order from 1 to 3552 in this column. This was the p value rank, smallest to largest. I typed "1" into cell D2 and "2" into cell D3. I selected both cells D2 and D3. I double-clicked on the plus sign on the lower right-hand corner of my selection to fill the column with a series of numbers from 1 to 3552.
  • Then I calculated the Benjamini and Hochberg p value correction. I typed B-H_Pvalue in cell E1. I typed the following formula in cell E2: =(C2*3552/D2) and pressed enter. I copied that equation to the entire column.
  • I typed "B-H_Pvalue" into cell F1.
  • I typed the following formula into cell F2: =IF(E2>1,1,E2) and pressed enter. I copied that equation to the entire column.
  • I selected columns A through F. I then sorted them by your MasterIndex in Column A in ascending order.
  • I copied column F and used Paste special > Paste values to paste it into the next column on the right of my "statistics" sheet.

Here is the most updated excel data sheet at this point in time: File:Bpertussis CompiledRawData MS2015-3.xlsx

Loyola Marymount University: website


Weekly Assignments Individual Journal Pages Shared Journal Pages