Vpachec3 Week 12

From LMU BioDB 2015
Jump to: navigation, search

Thursday, November 19

  • Met with Dr. Dahlquist before class to talk about the paper
    • Took note of the microarray data versus the qPCR results. Some results were off and will keep that in mind during our analysis
  • GenMapp users with Dr.Dahlquist during class to go over progress
    • Rough summary of experiment design:
    • 5 untreated biofilms vs. genomic DNA
    • 3 tobramycin-treated biofilm vs. genomic DNA
  • Noticed that our sdrf file was imputed wrong
  • Need to ask authors what they did with the dye swaps
  • Started computing master raw data
    • For each text file(8), we are copying rows M(gene name) and R (log ratio) into one data sheet
      • Decided to make names side by side for our own organization style
      • Made three sheets in the file:allgenename_logratio, genename_logratio, clean_genename_logratio

Monday, November 23

  • Kevin and I met in the computer lab at 6:00pm.
  • We worked on our presentation for tomorrow and we also worked on the master data sheet.
    • We had some decent headway on Thursday so we picked it up from there.
  • Credit is due to [Kevin] for helping going through the process and writing the procedure.

Compiling Data

  1. Columns M ("GeneName") and R ("LogRatio") were copied and pasted into a new Excel file as stated in the Thursday section.
    • GeneName columns were pasted next to one another (columns A-H).
    • All LogRatio columns were pasted after the GeneName columns (columns I-P).
    • A header was added at the top of each column, with each corresponding file name for example (125_2, for example) four our own organization.
  2. The GeneName columns were scanned for any discrepancies n terms of amount of rows or ordering of gene ID's. The LogRatio columns were scanned for discrepancies in amount of rows.
  3. This sheet was named "allgenename_logratio".
  4. New sheet called "genename_logratio"
    • Only one GeneName column was used for this sheet (column A), as all of the columns had been confirmed to be identical between files.
    • LogRatio data for each file was pasted into columns B-I, maintaining the previously mentioned file name headers. The data was pasted in the order of the file name numbers, from smallest to largest.
    • Note: Rows not containing genes existing in the B. cenocepacia genome must be removed. In theory, this is very easy to do. Search results for examples of the several gene ID formats in burkholderia.com (set to Burkholderia cenocepacia J2314) suggest that the formats used are those that start with "BCAS," "BCAM," "BCAL" and "pBCA." However, we were not sure, logistically, how to apply a filter in Excel which would select for these gene ID's, primarily because you cannot apply two "Begins with" filters simultaneously (required to include the "pBCA" genes). However, filtering for both groups separately yielded the same combined amount of rows as filtering for "Contains: BC" (29004 rows - there appear to be quadruplets for every gene). So we will use this filter, for now, although this may not be considered optimal.
  5. New sheet called "compiled_raw_data".
    • All content from the "genename_logratio" sheet was pasted into the "compiled_raw_data" sheet.
    • Finally, a new row was inserted under the header row. These row was titled "ExpName". The purpose of this row is to indicate what kind of cells were used for the corresponding experiment. Cells B2-F2 contain "Biofilm" as these columns correspond to experiments using biofilm cells that were not treated with tobramycin (125_1, 125_2, 125_3, 125_4, 126_1). Cells G2-I2 contained "Tobramycin" as these columns correspond to experiments using biofilm cells that were treated with tobramycin (126_2, 126_3, 126_4).

Normalization

  1. New Sheet called "scaled_centered". All data from "compiled_raw_data" was pasted into these new sheet.
  2. Above the gene names, two new rows were inserted. Cell A4 read "Average" and A5 read "StdDev". In cell B4, the following command was entered: =AVERAGE(B6:B29009). In cell B5, the command was =STDEV(B6:B29009). These codes were then pasted into columns C-I using the drag feature (which adjusts for the column change), so that the averages and standard deviations of log ratios for each sample were calculated.
  3. All headers were then repeated, in the same order, to the right of the existing headers. Each ExpName was edited, adding "_scaled_centered" to the end of each heading.
  4. The scaled/centered log ratios were then computed.
    • The following command was entered into cell J6: =(B6-B$4)/B$5. This takes the first log ratio for sample Biofilm_1, subtracts the average log ratio, and divides by the standard deviation of the log ratios. This function was then pasted for the remaining data, using the drag feature. The placement of the $'s ensures that the cell references for the averages and standard deviations will change horizontally (to account for different averages/stdev's between samples), but will not change vertically.

Statistical Analysis

  1. A new sheet was created, named "statistics."
  2. The gene ID columnn (column A) along with all previously created "_scaled_centered" data were pasted into the "statistics" sheet.
  3. The "Average" and "StdDev" rows were deleted.
  4. Also, the GeneName row (not column) and FileName row (not column) were deleted. These rows are no longer necessary. "GeneName" was entered into cell A1.
  5. In cell J1, "Avg_LogFC_Biofilm" was entered, and in cell K1, "Avg_LogFC_Tobramycin" was entered.
  6. In cell J2, the following command was entered: =AVERAGE(B2:F2). This takes the average of the centered log ratios for the gene corresponding to that row, across all five biofilm samples. This formula was pasted down the entire column.
  7. Similarly, in cell K2, =AVERAGE(G2:I2). This finds the same average, but across all three tobramycin treated biofilm samples. This formula was pasted down the entire column.
  8. The header "Avg_LogFC_All" was added to cell L1. In cell L2, the average of the two previously computed averages was found using =AVERAGE(J2:K2). Again, this was pasted down the column.
  9. The next column (M) was named "Tstat".
  10. Note: We are not sure how to move on, as computing the T statistic calls for the number of replicates, however, the number of replicates between treated and untreated samples are different (3 and 5, respectively). We will stop here and consult professors regarding how to continue with statistical analysis.

Links

Vpachec3 User Page

GÉNialOMICS Links

Weekly Group Assignments Shared Group Journals Project Links Team Members