Kzebrows Week 8

Electronic Lab Notebook

Statistical Analysis of Vibrio cholerae Microarray Data

The instructions below are adapted from the Sample Microarray Analysis page and the Protocols page. This page is hosted by OpenWetWare.org.

Normalize the log ratios for the set of slides in the experiment

First, I scaled and centered the data. I inserted a new Worksheet into my excel file and named it "scaled_centered". I went back to the "compiled_raw_data" worksheet, selected All and Copy, and went to the new "scaled_centered" worksheet. I copy and pasted by clicking on the upper hand left cell (A1) and pasting. I then inserted two rows in between the top row of headers and the first data row and typed "Average" in cell A2 and "StdDev" in type A3. By using the equation =AVERAGE(B4:B5224) and pressing enter in cell B2 and typing STDEV(B4:B5224) in cell B3, I calculated the average and standard deviation of the log ratios for each chip. I then copy and pasted all columns to the right of the last column and renamed them all AI_scaled_centered, A2_scaled_centered, etc. In cell N4 I typed the equation =(B4-B$2)/(B$3) which is the log change minus the mean divided by the st. dev. The mean and standard deviation values both have dollar signs because we did not want to change the reference mean and standard deviation. I copy and pasted this into the entire column using ctrl + copy and double clicking on the corner to copy the formula across the column, a trick that Dr. Dahlquist taught us that I ended up using often in this assignment. I then used the same scaling and centering equation in all "scaled_centered" columns.

Perform statistical analysis on the ratios

I inserted a new worksheet and named it "statistics" and copied the first ID column from the scaling_centering worksheet. I pasted the data into the new Statistics worksheet and then went back and copied all columns designated "scaled_centered" and pasted them as values into the new worksheet (starting in cell B1). I deleted rows 1 and 2 (average and standard deviation) and added a new column on the right with the headers "Avg_LogFC_A" and then two more columns with the same formula for B and C. I computed the average log fold change for each patient in each of these columns. Then, I computed an average of the patients' averages in the next column called "Avg_LogFC_all" all. I then inserted a new column next to that column and called it Tstat, into which I entered the equation =Average(N2:P2)/(STDEV(N2:P2)/SQRT(3)) and I copied and pasted this into all rows of the column. In this case, the number of replicates was 3. The next column I labeled "Pvalue". I entered the equation <code>=TDIST(ABS(R2),2,2) where the middle 2 is degrees of freedom. I copied and pasted it into all rows of the column.

Calculating the Bonferroni p value Correction

Next, to calculate the Bonferroni p value, I labeled the next two right columns Bonferroni and Bonferroni_Pvalue. I then typed the equation =S2 x 5221 into the column and replaced any corrected p value greater than 1 by the number 1 by using the formula =IF(T2>1,1,T2) in the Bonferroni_Pvalue column.

Calculate the Benjamini & Hochberg p value Correction

I inserted a new worksheet named B-H_Pvalue and copy and pasted the ID column from the statistics worksheet into this one. I inserted a new column on the left and named it MasterIndex. I then typed a 1 in cell A2 and a 2 in cell A3 and selected both cells. By double-clicking the + sign at the bottom right I filled the entire column with a series of numbers. I copied the unadjusted Pvalues from the previous worksheet and pasted them into Column C. I selected columns A, B, and C, and sorted them by ascending value. I ten typed Rank into cell D1 and ranked them from 1 to 5221 into this new column. To calculate the B-H p value I typed the equation =(C2*5221)/D2 into cell E1 and named the column B-H_Pvalue. In cell F2, I typed the formula =IF(E2>1,1,E2) and copied the equation in all of column F under the column heading B-H_Pvalue. I then sorted columns A through F in ascending order. I copied only column F, the B-H P value, and pasted the values into the next column on the right of the statistics worksheet.

Prepare File for GenMAPP

I inserted a new worksheet and named it forGenMAPP. I selected all from the statistics worksheet and copied the values and pasted them into the new worksheet. I selected columns B through Q and formatted them to 2 decimal places. I then selected all p value columns and formatted them to 4 decimal places. I deleted the left-most Bonferroni p value column and inserted a column to the right of the ID column, naming it SystemCode, and filled each column with the letter N.

I saved this as a txt file and as an Excel file in class. NOTE: Unfortunately, only my text file saved accurately. Because I used the txt file for GenMAPP, i was able to carry on for Part 2 of this assignment; however, I had to re-do parts of my Excel file and re-upload it. Below are the text file that I used and the original Excel file, which did not save correctly, as well as the re-done Excel spreadsheet.

File:Kzebrows microarray20151020.EX.txt

File:Kzebrows microarray20151020.xls

File:Kzebrows microarrayanalysis20151025.xls

Sanity Check: Number of genes significantly changed

In the p value column I filtered the data so the P value is less than 0.05

How many genes have p value <0.05? What is the percentage? 948 genes or 18.16%
What about p<0.01? What is the percentage? 235 or 4.5%
What about p<0.001? What is the percentage? 24 or 0.46%
What about p<0.0001? What is the percentage? 2 or 0.04%

Then I filtered the data to determine the following:

How many genes are p<0.05 for the Bonferroni-corrected p-value? What is the percentage? 0, or 0%
How many genes are p<0.05 for the Benjamini and Hochberg-corrected p value? What is the percentage? 0, or 0%

I experimented with different filters for the Avg_LogFC_all column.

Keeping the unadjusted Pvalue filter at p<0.05, filter the "Avg_LogFC_all" column to show all genes with an average log fold change greater than zero. How many are there? 352, or 6.7%
Keeping the unadjusted Pvalue filter at p<0.05, filter the "Avg_LogFC_all" column to show all genes with an average log fold change greater than zero. How many are there? 596, or 11.4%
What about an average log fold change of >0.25 and p<0.05? 339, or 6.5%
Or an average log fold change of <-0.25 and p<0.05? 579, or 11.01%

What criteria did Merrell et al. (2002) use to determine a significant gene expression change? How does it compare to our method? Merrell et al. used a two-class SAM analysis (Statistical Analysis for Microarrays). Genes were considered significantly changed if there was at least a twofold change in levels of expression. We used p-values to determine significant gene expression changes, or the probability that the changes in expression levels could be due to chance, while they used the actual levels of change to determine significance. We used the fold change cut-off with a range of 0.5 (-0.25 to +0.25) to include more genes in our analysis (918 total) while Merrell's analysis was more stringent (they only found 237 genes that were statistically significant, indicating differential regulation).

Sanity Check: Compare individual genes with known data

In the next part of the sanity check, I looked up the following genes in my spreadsheet to compare them to the analysis done by Merrell et al. (2002), with the objective of finding their fold changes and p values and finding if they were significantly changed in my analysis. Fold change is column R and p value is column T.

VC0028: Two entries
- Fold change: 1.27 for the first, 1.65 for the second
- P value: 0.0692 for the first, 0.0474 for the second
- Not significantly changed for the first, significant for the second.
VC0941: Two entries
- Fold change: 1 -0.28 and 0.09
- P value: 0.1636 and 0.6759
- Not significantly changed
VC0869: Five entries
- Fold changes: 2.12, 1.50, 1.59, 1.95, 2.20
- P value: 0.02, 0.0174, 0.0463, 0.0227, 0.002
- All are significantly changed
VC0051
- Fold change: 1.89, 1.92
- P value: 0.016, 0.0139
- Both are significantly changed
VC0647: I have 3 entries for this ID.
- Fold change: -1.11, -0.94, -1.05
- P value: 0.0003, 0.0125, 0.0051
- All significantly changed
VC0468
- Fold change:-0.17
- P value: 0.3350
- Not significantly changed
VC2350
- Fold change: -2.40
- P value: 0.0130
- This is significantly changed.
VCA0583
- Fold change: 1.06
- P value: 0.1011
- Not significantly changed

Part 2

Following instructions found on GenMAPP and MAPPFinder protocols I downloaded the gene database for V. cholerae for 2009. My partner Mary worked with the 2010 file. We met on Sunday and Monday to work on this assignment.

I saved the file into the folder C:\GenMAPP 2 Data\Gene Databases and unzipped it to extract the files. Next I launched the GenMAPP program and selected the Data Menu, from which I chose Expression Dataset Manager. I selected the tab-delimited file that I formatted for GenMAPP and allowed the Expression Dataset Manager to convert the data. A messaged appeared letting me know how many errors occurred, meaning that the manager could not convert those lines of data. I got 772 errors, while my partner Mary got 121 errors for the 2010 file. In the text file, the errors appeared as "Gene not found in OrderedLocusNames or any related system." It's possible that because my database was older than Mary's I would have more errors. Her 2010 database contains more genes and less errors. The 2010 file was updated more recently and so names could have been changed, information added, or faulty information deleted since my file had been uploaded.

Next I customized the new Expression Dataset by creating new Color Sets that contained instructions to GenMAPP for displaying data. I created this by filling in different fields in the color set area of the Data Set manager. I named the color set "LogFoldChange" and created two criteria, increased and decreased. The Gene Value was "Avg_LogFC_all" for the dataset. The criteria were established using the formula [Avg_LogFC_All] > 0.25 AND [Pvalue] < 0.05 and [Avg_LogFC_All] < -0.25 AND [Pvalue] < 0.05. I coded "increased" as pink and "decreased" as purple. I then saved the data set and exited the Expression Dataset.

File:Kzebrows microarray20151020.gex (Expression Dataset file)

MappFinder Procedure

Mary and I signed up for the "increased" criterion.

I launched the MAPPFinder program and checked that the right name of the Gene Database was appearing at the bottom of the window. I clicked "Calculate New Results" and "Find File." I chose my Expression Dataset file and clicked OK. I chose the "increased" criteria and checked the "gene ontology" and "p value" boxes and clicked Browse to create a filename for my results.

I then ran MAPPFinder, which led me to a browser showing me all of my results. I clicked "Show Ranked List" to see the most significant Gene Ontology terms. Here were the top 10:

Cellular macromolecule metabolic process
Macromolecule metabolic process
Localization
Macromolecule biosynthetic process
Transporter activity
Biopolymer metabolic process
Cell projection organization
Cellular biopolymer metabolic process
Cellular biopolymer biosynthetic process
Cellular macromolecule biosynthetic process

I compared these terms with Mary's terms from the 2010 database. Our databases were completely different with no overlap. There may have been significant discoveries in 2009 that resulted in significant gene changes for the 2010 upload.

Next I needed to find the Gene Ontology terms with which a certain gene was associated. I clicked "Collapse the Tree" and searched for the genes mentioned by Merrell et. al (2002), VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583. I typed the identifier for the first gene, VC0028, and came up with no entries for that gene. I went through the list until I found one with entries, VC0647.

Here are the GO terms associated with each of the following entries:

VC0028: No entries
VC0941: No entries
VC0869: No entries
VC0051: No entries
VC0647: mRNA catabolic process, RNA processing, cytoplasm, RNA binding, 3'-5' exoribonuclease activity, transferase activity, nucleotidyltransferase activity
VC0468: No entries
VC2350: No entries
VCA0583: Transport, outer membrane-bounded periplasmic space, transporter activity

Mary got entries for everything using the 2010 file while I only had entries for two out of the eight genes. The only two I had entries for, VC0647 and VCA0583, were similar, though. She had many of the same entries for VC0647 plus a few more and then the same entries for VCA0583. Since the 2010 file is newer, it is much bigger--genes were added between 2009 and 2010, including genes for VC0028, VC0941, VC0869, VC0051, VC0468, and VC2350, as well as some additional terms with which VC0647 could be associated.

I clicked on 3'-5' exoribonuclease activity for VC0647 as my gene of choice. From the UniProt site I found that this protein is known as polyribonucleotide nucleotidyltransferase (PNP_VIBCH) and it is involved in degrading mRNA. To do this, it catalyzes the phosphorolysis of polyribonucleotides that are single-stranded from 3' to 5'. Expression of this gene decreased significantly, as indicated by the purple color.

I then uploaded the MAPP file and CriterionX-GO.txt file.

File:Kzebrows3’-5’-exoribonuclease activity.mapp

File:Kzebrows microarray20151020.gmf

NOTE: With the .gmf file, I attempted to open it on the lab computer and it wouldn't open, so I tried to open it with Microsoft Word. This converted the file into a weird version with symbols, but it is the actual file.

I then launched Microsoft Excel and opened the .txt. file. From there I looked at the top of the spreadsheet at the background information and compared it with Mary's spreadsheet.

*I had 291 probes meeting the filter linked to UniProt and Mary had 338.

I had 184 genes meeting the criterion linked to a GO term and Mary had 219.
I had 449 probes linked to a UniProt ID and Mary had 5100
I had 1990 genes linked to a GO term and Mary had 2475.

Everything that's linked is different between our spreadsheets. This is likely because Mary's database is more recently updated, so the links between genes changed.

Then I needed to filter the list to show the top GO terms represented in the data for the criteria. I set the following filters on the spreadsheet for a total of 22 results:

Z score greater than 2
PermuteP less than 0.05
Number changed greater than or equal to 5 and less than 100
Percent change greater than or equal to 21

By comparing the filtered GO terms with the MAPPFinder browser I was able to determine that some of the files were closely related. Files that are related I highlighted in the Excel spreadsheet below. Each grouping is a different color.

File:Kzebrows microarray20151020-Criterion0-GO.txt (original file)

File:Kzebrows microarray20151020-Criterion0-GO - Copy.xlsx (filtered and highlighted)

Next I interpreted my results. I understood what most of the terms meant (being a biology major was helpful here) but I did have to look up a few to see because I was unsure how they would be related.

magnesium ion binding: interacting selectively and non-covalently with Mg ions.
hydrolase activity acting on carbon-nitrogen (but not peptide) bonds: catalysis of the hydrolysis of any carbon-nitrogen bond, C-N, with the exception of peptide bonds.

When I searched terms on the Gene Ontology website I didn't find it very helpful. It pretty much told me what I already knew, e.g. that Mg ion binding was interacting with Mg ions. However, I was still able to put together some idea of what the terms may mean as a whole. Looking at the spreadsheet, I found that the main theme for the 22 results was increased expression of the genes for development and organization (and morphogenesis) of cellular parts, particularly flagellum. Looking at my highlighted Excel file, the orange category involves such organization. From my Microbiology class I know that flagellum are crucial for motility and can be a crucial virulence factor in bacteria. Flagellum are motor appendages that the bacteria use to swim towards or away from a stimulus (chemotaxis) and contribute extensively to pathogenecity by allowing them to move towards cells that they could potentially infect.

The green category, concerning metabolic processes, and the blue category concerning transport of acids (organic, carboxylic, and amino acid) could contribute to pathogenecity simply because these are processes that need to function properly in order for bacteria to be in their prime, so to speak. Metabolic processes are a way of converting something into usable energy and bacteria, like all life, needs an energy source to survive and function. Other things such as hydrolase activity (hydrolase is an enzyme that hydrolyzes chemical bonds between two molecules, in this case carbon-nitrogen bonds) can also contribute to the bacteria's metabolic function and therefore to its pathogenecity.

Conclusion

We took the data uploaded by Merrell et al. (2002) at the Stanford Microarray Database concerning an experiment that measured gene expression in Vibrio cholerae from stool samples of three separate patients to a lab sample. With this data, we performed our own statistical analysis using t-tests in order to determine the number of genes that were statistically changed in terms of expression levels over the course of the experiment. We adjusted the data using the Bonferroni p value correction and the Benjamini & Hochberg p value correction due to the problem of multiple testing. We found, however, that our tests were less stringent. Merrell et al. used a two-class SAM analysis and considered genes significantly changed if expression levels increased twofold, while we used the fold change cut-off to include more genes in our analysis. In week 2, we launched GenMAPP to analyze two gene databases. I analyzed the 2009 Gene Database and my homework partner Mary used the 2010 version following the same protocol. We found some similarities but overall the 2010 database contained fewer errors than the 2009 database as it had been updated more recently and contained a great deal more information. I found that the production of flagellum are likely very important contributors to pathogenecity as they allow the bacteria to be more motile and infect more cells.

Overall, this assignment provided an insight into DNA microarray data and how we can use statistical testing and programs like GenMAPP and MAPPFinder to analyze it. It also showed me how crucial it is to be updating biological databases regularly--just one year made an enormous difference in the size of the database and its accuracy.