Kevin Wyllie Week 12

From LMU BioDB 2015
Jump to: navigation, search

Electronic Lab Notebook

Sample-Data Relationship:

KWVPMethoddiagram.jpg

Compiling Data

  1. From each raw individual data file, columns M ("GeneName") and R ("LogRatio") were copied and pasted into a new Excel file.
    • All of the GeneName columns were pasted adjacently (columns A-H).
    • Similarly, all LogRatio columns were pasted adjacently (columns I-P).
    • To rule out the possibility of confusing any two files with each other, a header was added at the top of each column, with each corresponding file name for example (125_2, for example).
  2. The GeneName columns were scanned for any discrepancies n terms of amount of rows or ordering of gene ID's. The LogRatio columns were scanned for discrepancies in amount of rows.
  3. This sheet was named "allgenename_logratio".
  4. A new sheet was created, named "genename_logratio"
    • Only one GeneName column was used for this sheet (column A), as all of the columns had been confirmed to be identical between files.
    • The LogRatio data for each file was pasted into columns B through I, maintaining the previously mentioned file name headers. The data was pasted in the order of the file name numbers, from smallest to largest.
    • Note: Rows not containing genes existing in the B. cenocepacia genome must be removed. In theory, this is very easy to do. Search results for examples of the several gene ID formats in burkholderia.com (set to Burkholderia cenocepacia J2314) suggest that the formats used are those that start with "BCAS," "BCAM," "BCAL" and "pBCA." However, we were not sure, logistically, how to apply a filter in Excel which would select for these gene ID's, primarily because you cannot apply two "Begins with" filters simultaneously (required to include the "pBCA" genes). However, filtering for both groups separately yielded the same combined amount of rows as filtering for "Contains: BC" (29004 rows - there appear to be quadruplets for every gene). So we will use this filter, for now, although this may not be considered optimal.
  5. A new sheet was created, named "compiled_raw_data".
    • All content from the "genename_logratio" sheet was pasted into the "compiled_raw_data" sheet.
    • Finally, a new row was inserted under the header row. These row was titled "ExpName". The purpose of this row is to indicate what kind of cells were used for the corresponding experiment. Cells B2-F2 contain "Biofilm" as these columns correspond to experiments using biofilm cells that were not treated with tobramycin (125_1, 125_2, 125_3, 125_4, 126_1). Cells G2-I2 contained "Tobramycin" as these columns correspond to experiments using biofilm cells that were treated with tobramycin (126_2, 126_3, 126_4).

Normalization

  1. A new sheet was created, named "scaled_centered". All data from "compiled_raw_data" was pasted into these new sheet.
  2. Above the gene names, two new rows were inserted. Cell A4 read "Average" and A5 read "StdDev". In cell B4, the following command was entered: =AVERAGE(B6:B29009). In cell B5, the command was =STDEV(B6:B29009). These codes were then pasted into columns C-I using the drag feature (which adjusts for the column change), so that the averages and standard deviations of log ratios for each sample were calculated.
  3. All headers were then repeated, in the same order, to the right of the existing headers. Each ExpName was edited, adding "_scaled_centered" to the end of each heading.
  4. The scaled/centered log ratios were then computed.
    • The following command was entered into cell J6: =(B6-B$4)/B$5. This takes the first log ratio for sample Biofilm_1, subtracts the average log ratio, and divides by the standard deviation of the log ratios. This function was then pasted for the remaining data, using the drag feature. The placement of the $'s ensures that the cell references for the averages and standard deviations will change horizontally (to account for different averages/stdev's between samples), but will not change vertically.

Statistical Analysis

  1. A new sheet was created, named "statistics."
  2. The gene ID columnn (column A) along with all previously created "_scaled_centered" data were pasted into the "statistics" sheet.
  3. The "Average" and "StdDev" rows were deleted.
  4. Also, the GeneName row (not column) and FileName row (not column) were deleted. These rows are no longer necessary. "GeneName" was entered into cell A1.
  5. In cell J1, "Avg_LogFC_Biofilm" was entered, and in cell K1, "Avg_LogFC_Tobramycin" was entered.
  6. In cell J2, the following command was entered: =AVERAGE(B2:F2). This takes the average of the centered log ratios for the gene corresponding to that row, across all five biofilm samples. This formula was pasted down the entire column.
  7. Similarly, in cell K2, =AVERAGE(G2:I2). This finds the same average, but across all three tobramycin treated biofilm samples. This formula was pasted down the entire column.
  8. The header "Avg_LogFC_All" was added to cell L1. In cell L2, the average of the two previously computed averages was found using =AVERAGE(J2:K2). Again, this was pasted down the column.
  9. The next column (M) was named "Tstat".
  10. Note: We are not sure how to move on, as computing the T statistic calls for the number of replicates, however, the number of replicates between treated and untreated samples are different (3 and 5, respectively). We will stop here and consult professors regarding how to continue with statistical analysis.

Links

Weekly Group Assignments Shared Group Journals Project Links Team Members

User:kwyllie