Kzebrows Week 14

Electronic Lab Notebook

Corrections from talk page

Downloaded the most recent version of the file from the OTS Files page
Renamed all columns by replacing LR with LogFC
Re-named Sheet 1 "CompiledRawData" and copied all of the data from it and pasted it into Sheet 2, which I re-named "MasterSheet":
Deleted all ID columns except for column A, which I re-named "ID"
Inserted a column to the left of column B and re-named it "MasterIndex"
Typed "1" and cell B2 and 2 and Cell B3 and selected both cells. I double-clicked on the + sign to fill the entire column with numbers 1-4208.
Selected the data and sorted A-->Z on the "ID" column
Deleted rows that have "Empty" or "Blank####" ID. This left me with 3,926 files (3,927 minus 1 header row).
Sorted by MasterIndex column to put IDs back in original order, smallest to largest.
Replaced "Error" with nothing and got 585 replacements.
Copy and pasted data into Sheet 3, which I re-named "ScalingCentering".

The next portion of the assignment was done by following the instructions in Sample Microarray Analysis Vibrio cholerae.

Normalizing the log ratios for the set of slides in the experiment

I inserted a new worksheet and named it Scaled_Centered
I copied all data from the MasterSheet and pasted it into cell A1 of the Scaled_Centered sheet
I inserted two rows in between the top row of headers and first data row. I named cell A2 "Average" and named cell A3 "StdDev".
In cell C2 I typed =AVERAGE(C4:C3929) and in cell C3 I typed =STDEV(C4:C3929). I pressed enter and copied this equation across the rest of the columns through column AL.
I then copied the column headings for all data columns and pasted them to the right of the last column. Using the copy/paste tool I renamed each column with "_Scaled_Centered" at the end.
In cell AM4 I typed =(C4-C$2)/C$3 indicating that I wanted data in cell C4 to have the average subtracted from it and then to divide it by the standard deviation. I used the "$" sign to indicate that I did not want the average and standard deviation values to change even when the equation was pasted for the entire column of genes.
I then copy and pasted that equation across the entire column by clicking on the original cell and double-clicking on the black plus sign. I copy and pasted this equation for each column of the data.

Perform statistical analysis on the ratios

I inserted a new worksheet and named it "Statistics"
I copied and pasted the ID column from the ScalingCentering worksheet into the first column of the new worksheet
I copied all Scaled_Centered columns from the ScalingCentering worksheet and pasted the values into column B1 of the new sheet
Deleted "Average" and "StDev" columns
Decided to divide and conquer with the rest of the statistical testing. I did RX and Erich did RP. I deleted Erich's RP columns to make my calculations easier.
Inserted column to the right and typed headers Avg_LogFC RX-0.5-10, Avg_LogFC RX-0.5-30, Avg_LogFC RX-0.5-60, Avg_LogFC RX-1-10, Avg_LogFC RX-1-30, and Avg_LogFC RX-1-60 into the top cells of the next 6 columns.
Computed the average log fold change per treatment and time period by typing the following equations: =AVERAGE(B2:D2), =AVERAGE(E2:G2), =AVERAGE(H2:J2), =AVERAGE(K2:M2), =AVERAGE(N2:P2), and =AVERAGE(Q2:S2) below each corresponding Avg_LogFC column. I copied and pasted these equations for the whole column.
I then inserted a new column to the right and named it "Avg_LogFC_all" and typed the equation =AVERAGE(T2:Y2).
I inserted a new column next to Avg_LogFC_all and labeled it "Tstat". Into column AA2 I typed =AVERAGE(T2:Y2)/(STDEV(T2:Y2)/SQRT(6)), indicating 6 replicates. I copied this equation into the whole column.
I inserted a new column and called it "Pvalue". I entered the equation =TDIST(ABS(AA2),5,2) for AA2, the T-stat column and 5 degrees of freedom.

Calculate Bonferroni p value correction

I inserted a new column to the right with the label Bonferroni_Pvalue (did this twice)
In the first Bonferroni_Pvalue column I typed =AB2*3926 and copied this equation for the entire column.
In order to replace any corrected P value that was greater than 1 with the number 1, I typed the equation =IF(AC2>1,1,AC2) and pasted it through the entire second Bonferroni_Pvalue column.
I then saved the most updated version of the file to my ThawSpace to resume working on it on Thursday.

Analysis corrections

After uploading my 12/1 file to the OTS Files page and consulting with Dr. Dahlquist, I realized that I needed to make several corrections. Originally, I calculated the T-stat, Pvalue, Bonferroni P value, and adjusted Bonferroni P value using a LogFC_all column, which was incorrect. Because this experiment involved multiple treatments (of which I was analyzing 6 for the RX samples) I needed to calculate those four things for all 6 treatments (RX-0.5-10, RX-0.5-30, RX-0.5-60, RX-1-10, RX-1-30, and RX-1-60). I inserted columns for all four items for all 6 treatments I was looking at, and following the same procedure as before, I calculated the T-stat, P value, Bonferroni P value, and adjusted Bonferroni P value for each treatment. Here are some sample equations for the RX-0.5-10 treatment:

Column Z named Tstat_LogFC RX-0.5-10: =AVERAGE(B2:D2)/(STDEV(B2:D2)/SQRT(3)), denoting that I want a T statistic representative of all three RX-0.5-10 replicates.
Column AA named Pvalue_LogFC RX-0.5-10: =TDIST(ABS(Z2),2,2), indicating that i wanted the P value calculated from the T statistic column.
Column AB named Bonferroni_Pvalue_LogFC RX-0.5-10: =AA2*3926
Column AC named Bonferroni_Pvalue_LogFC RX-0.5-10: =IF(AB2>1,1,AB2) in reference to the first Bonferroni P value column.

Dr Dahlquist also made a change to the file where she replaced all of the nothing boxes with a space/character/space. The amount of replacements was still the same as before but this was done to insure that the analysis was as accurate as possible. I proceeded to do a sanity check before moving on to the Benjamini-Hochberg corrections.

Prepare file for GenMAPP

I inserted a new worksheet and named it "forGenMAPP."
I selected all from the Statistics worksheet and pasted the values on the new sheet.
I selected all fold changes and changed them to 2 decimal places by selecting Format > Cells.
I selected all columns with p values including the T stat column and changed them to 4 decimal places by selecting Format > Cells.
I deleted the left-most Bonferroni p value columns (the ones without the "IF" statements) from each treatment.
I inserted a column to the right of the ID column and named it "SystemCode". I filled the whole column with the letter N by typing N in the first cell after the heading and double-clicking on the little black cross.
I selected File > Save As "Text (Tab-delimited) (*.txt). I clicked through the different warnings and uploaded both files to the team's Wiki page.

Sanity Check

Next I performed a sanity check to ensure that I performed the data analysis for RX-treated samples correctly. I opened the "forGenMAPP" tab and selected "Custom Filter" from the filter options. I then performed the following procedure for all 6 RX treatments.

Filtered P value column by less than 0.05, 0.01, and 0.001. Recorded results and percentages out of 3,926.
Filtered Bonferroni p value column by less than 0.05 and recorded results and percentages out of 3,926. The table can be located on my updated excel file from 12/3 found on the team Wiki.

From the check I found that RX-1-30 had the most significantly changed genes at 66%, which was at least 7% higher than the treatment with the next most significantly changed genes (59%). I chose this set of p values for the Benjamini & Hochberg p value correction.

Benjamini & Hochberg p value correction

I inserted a new worksheet named "B-H_Pvalue"
Copy and pasted the ID column from the previous worksheet into the first column of the B-H_Pvalue worksheet.
Inserted a column on the far left and named it "MasterIndex". I typed 1 into Cell A2 and 2 into Cell A3 and selected both, filling the entire column with a series of numbers from 1 to 3926.
I pasted the unadjusted p values from column AN on the Statistics worksheet for Pvalue_LogFC RX-1-30.
I selected columns A, B, and C and sorted by ascending values on Column C (Sort & Filter > Custom Sort).
Typed "Rank" into cell D1 and typed 1 into cell D2 and 2 into cell D3. I selected both and filled the column with numbers 1 to 3926.
Typed B-H_Pvalue in cell E1 and typed the formula =(C2*3926)/D2 into cell E2. I copied the equation for the whole column.
Typed B-H_Pvalue into cell F1 and typed =IF(E2>1,1,E2) into cell F2. Copied to entire column.
I selected columns A through F and sorted them by MasterIndex Column A in ascending order.
I copied column F and pasted values into the right column of the Stats sheet.

RX Compiled Raw Data 12/3

RX .txt format GenMAPP ready 12/3

Running GenMAPP and MAPPFinder

The following portion of the assignment was based off instructions found at GenMAPP and MAPPFinder Protocols.

I downloaded the .gdb file from Trixie's page and opened GenMAPP. I selected my Thaw Space as the root folder and went to Choose Gene Database > Sf-Std_20151201.gdb. Then, I went to the Expression Dataset Manager and attempted to choose my dataset, the CompiledRawData .txt file from 12/3. I received the following error message:

At that point I went to speak to Dr. Dahlquist. She recommended three things:

Replace the periods in the headers with a word
Replace all #DIV/0! with a space character
Communicate all gene IDs from the exceptions file listed as "not found" with the QA and Coder

I then went back to my Compiled Raw Data file to make the necessary changes and perform a re-export for GenMAPP. In the "ForGenMAPP" file I replaced all periods in the headings with the word "pt" for 21 total replacements. Then, I replaced all "#DIV/0!" with a space character which yielded 187 replacements. I saved this file as an Excel file and then also as a tab-delimited (.txt) file. I chose this new dataset in the Expression Dataset Manager and clicked "OK" to convert the raw expression data and generate the exceptions file. 416 errors were detected in the raw data.

RX Compiled Raw Data as of 12/6

RX .txt format updated as of 12/6

RX Exceptions file in .txt format

RX .gex file

RX Exceptions file in Excel format (filtered)

I then clicked on "Expression Dataset Manager" and named the file "RX-0pt5-10". For Gene Value I chose Avg_LogFC RX-0pt5-10. Then I created a new criteria with the label "Increased", color green, and criterion "[Avg_LogFC RX-0pt5-10] > 0.25 and [Pvalue_LogFC RX-0pt5-10] < 0.05." I created another criteria for RX-0pt5-10 with the label "Decreased", color red, and criterion "[Avg_LogFC RX-0pt5-10] < -0.25 and [Pvalue_LogFC RX-0pt5-10] < 0.05." This gave me the following:

I then tried to run MappFINDER. I went to Tools > Run MAPPFinder and loaded the file. I selected the Increased criteria to filter by and checked the boxes for Gene Ontology and p value. I then tried to save the file as RX-0pt5-10 Increased and clicked Run MAPPFinder. After 10 minutes of the spinning wheel I received a "Program Not Responding" message. I tried to run this on two more computers and after about 10 minutes I received the same message indicating the program had frozen. I then tried it again on my initial computer and left it running even after the "Program Not Responding" message popped up. Eventually the browser appeared. It looked like this:

Upon clicking on the first ontology for biological_process, however, I received this message:

Communicating with QA and Coder

Regarding our 416 errors, Erich and I both sent our files to the QA and Coder for analysis for troubleshooting. It was found that there were IDs that didn't follow the SF#### format and instead were in the format of b#### and CP####, which was resulting in many "Gene ID Not Found" errors. Trixie then collaborated with Jake to identify and add the location of the necessary gene IDs that were listed as not found in the XML file. Then they re-exported the database. Unfortunately, none of the b####s exist in the XML file and only 49 CP#### IDs existed. Next, Erich and I will be running our dataset through the new database.

Instructions from Dr. Dahlquist

Increased LogFC>0.25 and p<0.05
Decreased LogFC<-0.25 and p<0.05
Do Benjamini-Hochberg correction for the treatment with the most significant genes out of all 6