Blitvak Week 8

Steps taken for Parts 1-2 of the Microarray Data Analysis were edited from:

Statistical Analysis of Vibrio cholerae Microarray Data (Part 1)

Merrell_Compiled_Raw_Data_Vibrio.xls was downloaded, saved to the desktop, and renamed with some additional information (initials and the date)

Normalizing the log ratios for the set of slides in the experiment

The following operations were performed in order to scale and center the microarray data:

The renamed Excel file was opened and a new Worksheet was inserted with the name scaled_centered
Everything on the compiled_raw_data worksheet was selected and copied over to scaled_centered (formatting was the same, starting from the left-hand cell, A1)
Two new rows were inserted between the top row of headers and the first data row in scaled_centered
In cell A2, Average was typed in; in A3, StdDev was typed in
The Average log ratio for each chip was computed by typing =AVERAGE(B4:B5224) into cell B2 and pressing enter
The Standard Deviation of the log ratios on each chip was computed by typing =STDEV(B4:B5224) into cell B3 and pressing enter
The equations in B2 and B3 were copied and pasted into the empty cells in the rest of the data columns (chips A2 to C4, i.e. columns C through M)
The column headings for all of the data columns were copied and pasted to the right of the last data column; this new set of headers was edited so that they read: A1_scaled_centered, A2_scaled_centered, etc.
The equation =(B4-B$2)/B$3 was typed into cell N4; the dollar sign symbols were used in front of the "2" and "3" in order to ensure that Excel will not change the reference to that row when that same equation is pasted down the entire column of 5221 genes (this is important because the average and standard deviation are the same for the entire column, and therefore, the reference must stay the same). This equation is the scaling and centering equation.
The scaling and centering equation was copied and pasted down the entire A1_scaled_centered by clicking the original cell with the equation and double-clicking the bottom right corner of the cell (cursor should change to a black plus sign prior to double-clicking)
The scaling and centering equation was put in each of the data columns with the _scaled_centered header (was copied and pasted down the entire columns)
The equation for each column was checked to ensure that it was correct (ex. for A2_scaled_centered, the equation should be =(C4-C$2)/C$3)
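The scaling and centering step can also be sketched outside of Excel. The following is a minimal Python sketch, not the procedure that was actually run; the function name and the example log ratios are hypothetical and only illustrate what =(B4-B$2)/B$3 does to one chip (column).

```python
from statistics import mean, stdev


def scale_and_center(log_ratios):
    """Return (x - average) / standard deviation for every value on one chip."""
    avg = mean(log_ratios)   # counterpart of =AVERAGE(B4:B5224)
    sd = stdev(log_ratios)   # sample standard deviation, like Excel's STDEV
    return [(x - avg) / sd for x in log_ratios]


# Hypothetical log ratios for a single chip (not taken from the Merrell data)
chip = [0.5, -1.2, 0.3, 2.0, -0.6]
scaled = scale_and_center(chip)
```

After this step each chip has an average of 0 and a standard deviation of 1, which is what makes the chips comparable to each other.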

Performing statistical analysis on the ratios

Initial statistical analysis

A new worksheet was inserted with the name statistics
The first ID column in the scaled_centered worksheet was copied and pasted into the first column of statistics
The columns that are designated with _scaled_centered were copied and pasted into the new worksheet (starting from B1); for the pasting, "Paste Special" was required in order to just paste the numerical results into the new worksheet ("Values" was selected from the "Paste Special" window)
Rows 2 and 3 (corresponding to Average and StDev) were deleted
The headers Avg_LogFC_A, Avg_LogFC_B, and Avg_LogFC_C were typed into the next three empty columns to the right (immediately adjacent to the last _scaled_centered column)
The average log fold change for the replicates for each patient was computed by typing =AVERAGE(B2:E2) into cell N2. The equation was copied and pasted down the entire column.
The equation for the average log fold change was created for patients B and C; this equation was copied and pasted down their respective columns (for patient B the equation was =AVERAGE(F2:I2), for patient C the equation was =AVERAGE(J2:M2))
In the first cell that corresponds to the next empty column, the header Avg_LogFC_all was typed
An equation that can compute the average of the three previously calculated averages was created (=AVERAGE(N2:P2)); this equation was pasted into this entire column (Avg_LogFC_all)
A new column was inserted next to Avg_LogFC_all. This column was given the label/header of Tstat (the purpose of this column is to compute a T statistic that will inform whether the average log fold change is significantly different from zero)
The equation =AVERAGE(N2:P2)/(STDEV(N2:P2)/SQRT(3)) was entered into Tstat and pasted into all of the rows within that column
The top cell in the next column was labeled with Pvalue; the equation =TDIST(ABS(R2),2,2) was entered in the cell below the label and copied and pasted into all of the rows in that column. The first "2" in that equation is the degrees of freedom (there are 2 degrees of freedom since the number of averaged patients, which is 3, minus 1 is 2); the second "2" specifies a two-tailed test
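As a cross-check on the Tstat and Pvalue columns, the same calculation can be sketched in Python. This is a sketch under stated assumptions: the function names and the three example averages are hypothetical, and the p-value uses the closed form of the t distribution that holds only for exactly 2 degrees of freedom (a swap-in for Excel's TDIST, not a general replacement for it).

```python
from math import sqrt
from statistics import mean, stdev


def t_statistic(patient_averages):
    """Counterpart of =AVERAGE(N2:P2)/(STDEV(N2:P2)/SQRT(3)) for one gene."""
    n = len(patient_averages)
    return mean(patient_averages) / (stdev(patient_averages) / sqrt(n))


def two_tailed_p_df2(t):
    """Two-tailed p-value for a t distribution with exactly 2 degrees of freedom.

    For 2 degrees of freedom the CDF is 1/2 + t / (2 * sqrt(t^2 + 2)),
    so the two-tailed p-value reduces to 1 - |t| / sqrt(t^2 + 2).
    """
    return 1 - abs(t) / sqrt(t * t + 2)


# Hypothetical per-patient averages (Avg_LogFC_A, Avg_LogFC_B, Avg_LogFC_C)
averages = [1.0, 2.0, 3.0]
t = t_statistic(averages)
p = two_tailed_p_df2(t)
```

With only three values per gene, the test has very little power; a t statistic of about 4.30 is needed to reach p = 0.05.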

Calculating the Bonferroni p-value Correction

Adjustments to the p-value were performed with the purpose of correcting for the multiple testing problem. The next two columns, to the right, in statistics were both labeled with Bonferroni_Pvalue
The equation =S2*5221 was typed into the first cell under the first Bonferroni_Pvalue header; the formula was copied down the entire column
Any corrected p-value that is greater than 1 was replaced with the number 1 by typing =IF(T2>1,1,T2) into the first cell below the second Bonferroni_Pvalue header; the formula was copied throughout the entire column
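The two Bonferroni columns amount to multiplying each p-value by the number of tests and capping the result at 1. A minimal Python sketch of that idea follows; the function name and the example p-values are hypothetical.

```python
def bonferroni(p_values, n_tests):
    """Multiply each p-value by the number of tests and cap the result at 1."""
    return [min(1.0, p * n_tests) for p in p_values]


# Hypothetical unadjusted p-values, corrected for 5221 genes as in the spreadsheet
corrected = bonferroni([0.00001, 0.0005, 0.3], n_tests=5221)
```

Only a raw p-value below 0.05/5221 (roughly 0.0000096) stays under 0.05 after this correction, which is why so few genes survive it.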

Calculating the Benjamini & Hochberg p-value Correction

A new worksheet named B-H_Pvalue was inserted
The ID column from the previous worksheet was copied and pasted into the first column of the new worksheet
A new column was inserted to the very left and labeled as MasterIndex (purpose is to create a numerical index of genes)
- In the MasterIndex column, a "1" was typed into cell A2 and a "2" was typed into cell A3
- Both cells were selected and the bottom-right corner (where the cursor becomes a thin black plus sign) was double-clicked. This filled the entire column with the numbers 1 to 5221 (# of genes)
Using Paste special > Paste values, the unadjusted p-values from the previous worksheet were copied and pasted into column C of this worksheet
Columns A, B, and C were all selected and sort by ascending values was performed on Column C (sort button on toolbar -> custom sort -> Sort by Pvalue, Sort on Values, Order Smallest to Largest)
The header Rank was typed into cell D1. "1" was typed into cell D2 and "2" was typed into cell D3; both cells were selected and the double-clicking of the lower right corner was employed in order to fill the column with a series of numbers from 1 to 5221.
B-H_Pvalue was typed into cell E1 and the formula =(C2*5221)/D2 was typed into cell E2; the equation was copied down the entire column
B-H_Pvalue was typed into cell F1 and the formula =IF(E2>1,1,E2) was typed into cell F2; the equation was copied down the entire column
Columns A through F were selected and the columns were sorted by the MasterIndex in column A in ascending order
Column F was copied and the values were pasted via Paste special > Paste values into the next empty column on the right of the statistics worksheet
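The sort, rank, adjust, and re-sort steps above can be sketched in Python as follows. This is a sketch that mirrors the spreadsheet steps only (p * number of genes / rank, capped at 1); the function name and example p-values are hypothetical. It deliberately leaves out the extra step that standard Benjamini & Hochberg implementations apply to keep the adjusted values from decreasing as the rank increases.

```python
def bh_spreadsheet(p_values):
    """Rank-based adjustment as done in the B-H_Pvalue worksheet."""
    m = len(p_values)
    # Positions of the genes sorted by ascending p-value (the "sort on Pvalue" step)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    for rank, idx in enumerate(order, start=1):
        # Counterpart of =(C2*5221)/D2 followed by =IF(E2>1,1,E2)
        adjusted[idx] = min(1.0, p_values[idx] * m / rank)
    # Writing by original position plays the role of re-sorting on MasterIndex
    return adjusted


# Hypothetical unadjusted p-values for four genes
example = bh_spreadsheet([0.04, 0.001, 0.03, 0.5])
```

In this example the gene ranked second ends up with a larger adjusted value (0.06) than the gene ranked third (about 0.053), which is the kind of ordering that the omitted step would smooth out.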

Preparing file for GenMAPP

A new worksheet was inserted with the name forGenMAPP
Everything in the statistics worksheet was selected and copied over to forGenMAPP via Paste special > values
Columns B through Q were selected and the number of decimal places was set to 2 via Format > Cells > number tab, set to 2 decimal places
All of the columns containing p-values were selected and the number of decimal places was set to 4
The left-most Bonferroni p-value column was deleted (the one with an "if" statement was kept)
A column to the right of ID column was inserted and given the header SystemCode. The entire column was filled with the letter "N".
While on the forGenMAPP worksheet, the file was saved as "Text (Tab-delimited) (*.txt)"
The resulting file was checked and opened via notepad
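For reference, the shape of the tab-delimited file can be sketched in Python. This is only an illustration under assumptions: the function name is hypothetical, and the sketch writes just four of the columns (the real forGenMAPP worksheet carries all of the statistics columns). The example row uses the VC2350 values recorded further down this page.

```python
import csv
import io


def to_genmapp_text(rows):
    """Build tab-delimited text with a SystemCode column filled with "N"."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["ID", "SystemCode", "Avg_LogFC_all", "Pvalue"])
    for gene_id, avg_log_fc, p_value in rows:
        # 2 decimal places for fold changes, 4 for p-values, as in the worksheet
        writer.writerow([gene_id, "N", f"{avg_log_fc:.2f}", f"{p_value:.4f}"])
    return buffer.getvalue()


text = to_genmapp_text([("VC2350", -2.40, 0.0130)])
```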

Sanity Check: Number of genes significantly changed

The spreadsheet was opened and the forGenMAPP worksheet was selected
Cell A1 was clicked and the autofilter was turned on via Data > Filter > Autofilter
The drop-down arrow on the Pvalue column was clicked. "Number filters" was selected, then "Less than...", and then "0.05" was typed into the window that appeared, in order to filter the "Pvalue" column so that only p-values that are less than 0.05 appear
- 948 genes out of 5221 were found to have a p-value < 0.05, which is 18.16% of genes
The Pvalue column was then filtered so that only p-values that are less than 0.01 appear
- 235 genes out of 5221 were found to have a p-value < 0.01, which is 4.50% of genes
The Pvalue column was then filtered so that only p-values that are less than 0.001 appear
- 24 genes out of 5221 were found to have a p-value < 0.001, which is 0.46% of genes
The Pvalue column was then filtered so that only p-values that are less than 0.0001 appear
- 2 genes out of 5221 were found to have a p-value < 0.0001, which is 0.038% of genes
Bonferroni_Pvalue was then filtered in order to determine the genes that are p < 0.05 for the Bonferroni-corrected p-value
- 6 genes out of 5221 were found to have a p < 0.05 for the Bonferroni-corrected p-value, which is 0.115% of genes
B-H_Pvalue was then filtered in order to determine the genes that are p < 0.05 for the Benjamini and Hochberg-corrected p value
- 0 genes out of 5221 were found to have a p < 0.05 for the Benjamini and Hochberg-corrected p value, which is 0% of genes
Avg_LogFC_all was then filtered in order to only show the genes with an average log fold change greater than zero (while keeping the p-value filter at less than 0.05)
- 352 genes were found to have an average log fold change greater than zero, which is 6.74% of genes
Avg_LogFC_all was then filtered in order to only show the genes with an average log fold change less than zero (while keeping the p-value filter at less than 0.05)
- 596 genes were found to have an average log fold change less than zero, which is 11.42% of genes
Avg_LogFC_all was then filtered in order to only show the genes with an average log fold change > 0.25 (while keeping the p-value filter at less than 0.05)
- 339 genes were found to have an average log fold change > 0.25, which is 6.49% of genes
Avg_LogFC_all was then filtered in order to only show the genes with an average log fold change < -0.25 (while keeping the p-value filter at less than 0.05)
- 579 genes were found to have an average log fold change < -0.25, which is 11.09% of genes
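The filter counts and percentages above can be reproduced with a short Python sketch. The function names and the short list of p-values are hypothetical; the percentage helper simply divides a count by the 5221 genes in the data set.

```python
def count_below(p_values, cutoff):
    """Count p-values strictly below a cutoff (the "Less than..." number filter)."""
    return sum(1 for p in p_values if p < cutoff)


def percent_of_genes(count, total=5221):
    """Express a gene count as a percentage of all genes, rounded to 2 decimals."""
    return round(100 * count / total, 2)


# Hypothetical p-values, only to show the counting step
n_significant = count_below([0.2, 0.04, 0.009, 0.05], 0.05)

# Percentages for two of the counts reported above
pct_p05 = percent_of_genes(948)
pct_p01 = percent_of_genes(235)
```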
What criteria did Merrell et al. (2002) use to determine a significant gene expression change? How does it compare to our method?
- Merrell et al. employed the Statistical Analysis for Microarrays (SAM) program with the intensity ratios in order to "identify significant differences in gene expression"; they conducted a two-class SAM analysis, with the in vitro strain being class I and the individual stool samples being class II. Merrell et al. selected genes with statistically significant changes in expression (of at least two-fold) in each patient sample, and this individual stool sample data was used to identify genes that were significantly changed (in expression) in all three samples. The method used by Merrell et al. is somewhat similar to the method used in this investigation, but it involved the use of another computer program (SAM) and the selection of genes that had at least two-fold changes in expression. The method used in this investigation primarily involved the use of p-values (less than 0.05) in order to identify significant changes in gene expression; similar to what was done by Merrell et al., this method also involved the use of the data from all three patients in order to find changes in gene expression that are significant among all three samples.

Sanity Check: Compare individual genes with known data

Merrell et al. (2002) report that the genes with IDs VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583 were all significantly changed in their data. These genes were looked up in the spreadsheet and their fold changes, p-values, and significance were noted:

VC0028 (2 entries were found)
- Fold Change: first entry = 1.65, second entry = 1.27
- P-Value: first entry = 0.0474, second entry = 0.0692
- Significance: first entry = statistically significant, second entry = not statistically significant
VC0941 (2 entries were found)
- Fold Change: first entry = 0.09, second entry = -0.28
- P-Value: first entry = 0.6759, second entry = 0.1636
- Significance: first entry = not statistically significant, second entry = not statistically significant
VC0869 (5 entries were found)
- Fold Change (nth entry): 1 = 1.59, 2 = 1.95, 3 = 2.20, 4 = 1.50, 5 = 2.12
- P-Value (nth entry): 1 = 0.0463, 2 = 0.0227, 3 = 0.0020, 4 = 0.0174, 5 = 0.0200
- Significance (nth entry): 1 = significant, 2 = significant, 3 = significant, 4 = significant, 5 = significant
VC0051 (2 entries were found)
- Fold Change: first entry = 1.92, second entry = 1.89
- P-Value: first entry = 0.0139, second entry = 0.0160
- Significance: first entry = statistically significant, second entry = statistically significant
VC0468
- Fold Change: -0.17
- P-Value: 0.3350
- Significance: not statistically significant
VC2350
- Fold Change: -2.40
- P-Value: 0.0130
- Significance: statistically significant
VCA0583
- Fold Change: 1.06
- P-Value: 0.1011
- Significance: not statistically significant

Statistical Analysis of Vibrio cholerae Microarray Data (Part 2)

GenMAPP was launched and the 2009 Vibrio cholerae database was downloaded and loaded into the program (placed into C:\GenMAPP 2 Data\Gene Databases)
The Data menu from the main Drafting Board window was selected and then Expression Dataset Manager was chosen. In the Expression Dataset Manager window, New Dataset was selected and then the tab-delimited text file that was formatted for GenMAPP (.txt) was chosen
Expression Dataset Manager was allowed to convert the data and create a new converted dataset

Error Analysis

After conversion, it was found that 772 errors were detected in the raw data by GenMAPP using the 2009 database
My partner, Anindita V., found that 121 errors were detected in the raw data by GenMAPP using the 2010 database
The .EX.txt file generated by GenMAPP (which contains error messages, along with the raw data) was opened and analyzed
- It was found that the error message for the unprocessed genes was "Gene not found in OrderedLocusNames or any related system."
- Compared to my partner's results (121 errors with the 2010 database), I had many more errors (772) using the 2009 database. Given that the error message for the 772 errors was "Gene not found in OrderedLocusNames or any related system.", it appears that the old database covers fewer genes than the new one (GenMAPP appears to be giving an error because some of the genes that are present in the data set are not present in the 2009 database)

Creating Color Sets

Increased and Decreased LogFoldChange color sets were created in GenMAPP by going to the Expression Dataset Manager and filling in these fields in the Color Set area: name for the Color Set, the gene value, and the criteria that determines how a gene object is colored (on the MAPP)
- The name of the color set was set as LogFoldChange and Avg_LogFC_all was used as the Gene Value
- For the Increased criterion (increased LogFoldChange), the name was set as Increased, red was used as the color, and the criterion was [Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05
- For the decreased criterion (decreased LogFoldChange), the name was set as Decreased, green was used as the color, and the criterion was [Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05
- The whole Expression Dataset was saved and the Expression Dataset Manager was exited
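The two color set criteria can be written out as a small Python sketch to make the logic explicit. The function name is hypothetical, and this only mimics the criteria strings entered in GenMAPP; it is not how GenMAPP evaluates them internally. The example genes and values are the ones recorded in the sanity check above.

```python
def color_set_label(avg_log_fc, p_value):
    """Return the criterion a gene would meet, or None if it meets neither."""
    # [Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05 (colored red)
    if avg_log_fc > 0.25 and p_value < 0.05:
        return "Increased"
    # [Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05 (colored green)
    if avg_log_fc < -0.25 and p_value < 0.05:
        return "Decreased"
    return None


labels = {
    "VC0051": color_set_label(1.92, 0.0139),
    "VC2350": color_set_label(-2.40, 0.0130),
    "VC0468": color_set_label(-0.17, 0.3350),
}
```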

MAPPFinder Procedure

Assigned criterion: Increased, using 2009 Database
The MAPPFinder program was launched within GenMAPP (Tools > MAPPFinder)
"Calculate New Results" was clicked in the window that appeared by launching MAPPFinder
For "Find File", the Expression Dataset file (with a .gex extension) was selected, and OK was clicked
The LogFoldChange color set was selected and the Increased criterion was selected (to filter the data)
The boxes corresponding to "Gene Ontology" and "p value" were checked
"Browse" button was clicked to add a name to the file that will be created
"Run MAPPFinder" was clicked and the program was allowed to complete its analysis
"Show Ranked List" was clicked to see a list of the most significant Gene Ontology terms
Top 10 Ranked GO Terms, found using the 2009 Database

biopolymer biosynthetic process
macromolecule biosynthetic process
macromolecule metabolic process
localization
transporter activity
cellular biopolymer biosynthetic process
cellular macromolecule metabolic process
transport
establishment of localization
biopolymer metabolic process

The top 10 GO terms, found by my partner Anu using the 2010 Database, are:
1. branched chain family amino acid metabolic process
2. branched chain family amino acid biosynthetic process
3. IMP metabolic process
4. IMP biosynthetic process
5. purine ribonucleoside monophosphate metabolic process
6. purine nucleoside monophosphate metabolic process
7. purine nucleoside monophosphate biosynthetic process
8. purine ribonucleoside monophosphate biosynthetic process
9. arginine metabolic process
10. cellular nitrogen compound biosynthetic process
- The top 10 GO terms, compared between the 2009 and 2010 database, appear to mostly involve metabolic/biosynthetic processes. The 2010 database seems to have top GO terms that are more specific (than the 2009 database) and most of these terms involve purine metabolism and metabolic pathways tied to amino acids (and other compounds). Unlike the 2010 database, the 2009 database had top GO terms that involved transport (transporter activity) and movement (localization). It is suspected that the variation for the top GO terms is due to differences between the 2009 and 2010 databases with respect to the number of genes covered/represented. The 2009 database had many more errors, and thus, many more genes that were not taken into account by GenMAPP; it is possible that the genes represented by the GO terms related to the amino acid/purine metabolic pathways are not present in the GenMAPP analysis with the 2009 database (potentially, these genes are the ones that are tied to errors).
In the main MAPPFinder Browser window, "Collapse the Tree" was clicked on and these genes were searched for (one by one): VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583. The ID for the genes was put in the gene ID search field and "OrderedLocusNames" was selected from the drop-down menu to the right of the search field. GeneID search button was clicked in order to commence the search for the gene. These are the genes mentioned by Merrell et al. in their paper.
VC0028: "No MAPPs or GO terms could be found for that OrderedLocusNames ID."
VC0941: "No MAPPs or GO terms could be found for that OrderedLocusNames ID."
VC0869: "No MAPPs or GO terms could be found for that OrderedLocusNames ID."
VC0051: "No MAPPs or GO terms could be found for that OrderedLocusNames ID."
VC0647: 3'-5'-exoribonuclease activity, transferase activity, nucleotidyltransferase activity, polyribonucleotide nucleotidyltransferase activity
VC0468: "No MAPPs or GO terms could be found for that OrderedLocusNames ID."
VC2350: "No MAPPs or GO terms could be found for that OrderedLocusNames ID."
VCA0583: transport, outer membrane-bounded periplasmic space, transporter activity
- My partner, Anu, had very different results for these genes. The results were exactly the same for VC0647 and very similar for VCA0583; it is very likely that these genes are both covered in the same way or a similar way in the two databases. Anu found GO terms associated with the rest of the genes (using the 2010 database); using the 2009 database, however, MAPPFinder returned the prompt "No MAPPs or GO terms could be found for that OrderedLocusNames ID." The prompt for most of the genes, along with the high error count, suggests that the 2009 database did not include these genes (that were mentioned by Merrell et al.). Judging by the search results, it appears that the 2010 database includes all of these genes while the 2009 database only includes a few.

GO Term Investigation for VCA0583, with UniProt

outer membrane-bounded periplasmic space, corresponding to VCA0583, was clicked and a MAPP listing all of the genes associated with that GO term opened up.
In the window that appeared, Q9KP40_VIBCH was selected
Q9KP40_VIBCH had a significant increase of expression (1.67 LogFoldChange)
Using the UniProt database, details regarding Q9KP40_VIBCH were found
- According to UniProt, this gene codes for Thiamin ABC transporter/periplasmic thiamin-binding protein
- Periplasm was searched up; it was found to be a gel-like matrix existing between the inner cytoplasmic membrane and the outer membrane of a bacterium
- ABC transporter was searched up; it was found to be a class of transmembrane proteins that use ATP (via hydrolysis of the phosphoanhydride bonds) in order to carry out certain processes/functions (such as the movement of substrates across membranes)
- Thiamin was found to be a vitamin (vitamin B1), that is utilized by many organisms, including gram-negative bacteria like Vibrio cholerae
- Given this UniProt result, it appears that the gene codes for a protein that binds to thiamin and employs ATP to move it across membranes (is a transmembrane protein that works via active-transport)

Working with the Results File

The results file, in the format "XXX-CriterionX-GO.txt" was copied and pasted in the folder housing all of the files related to this investigation (XXX = name given to file, CriterionX refers to the Criterion #; Criterion0 refers to the Increased criterion, Criterion1 refers to the Decreased criterion, if the Increased criterion was done first)
The "CriterionGO" file was opened via Excel in order to show the results in a tabular format
The top rows of the opened file (in Excel) were examined; these are rows of information that give background information as to how MAPPFinder made the calculations:
Comparing the top rows from the 2009 Database to those found by my partner, Anu, using the 2010 Database, it was noticed that:
- The numbers for probes meeting the filter linked to a UniProt ID, genes meeting the criterion linked to a GO term, probes linked to a UniProt ID, genes linked to a GO term, and the values in "The z score is based on an N of... and a R of..." were higher in what was found by Anu (2010 database) than in what was found using the older 2009 database. This difference is likely due to the fact that the 2010 database covers more genes (fewer errors) and, thus, would lead to more genes being linked to GO terms and UniProt IDs. The smaller number of linked probes/genes with the 2009 database was likely responsible for the smaller numbers (last row) that are involved in the z score calculations by MAPPFinder.
The list in the file was filtered in order to show the top GO terms represented in the data for the "Increased" criterion
A cell was clicked in the row of headers and the autofilter was turned on (Data menu -> Filter > Autofilter)
The following filters were applied in order to filter the results down to 20 entries:

Z Score (in column N) greater than 2
PermuteP (in column O) less than 0.05
Percent Changed (in column L) greater than or equal to 26.5
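The three filters can be combined in a short Python sketch. The function name, the "GO Name" key, and the example rows are hypothetical; only the column names Z Score, PermuteP, and Percent Changed and the cutoffs come from the steps above.

```python
def filter_go_terms(rows, min_z=2.0, max_permute_p=0.05, min_percent_changed=26.5):
    """Keep rows that pass all three filters applied to the results file."""
    return [
        row for row in rows
        if row["Z Score"] > min_z
        and row["PermuteP"] < max_permute_p
        and row["Percent Changed"] >= min_percent_changed
    ]


# Hypothetical rows standing in for lines of the results file
rows = [
    {"GO Name": "term_a", "Z Score": 3.1, "PermuteP": 0.01, "Percent Changed": 40.0},
    {"GO Name": "term_b", "Z Score": 1.5, "PermuteP": 0.01, "Percent Changed": 60.0},
    {"GO Name": "term_c", "Z Score": 2.5, "PermuteP": 0.04, "Percent Changed": 26.5},
    {"GO Name": "term_d", "Z Score": 2.5, "PermuteP": 0.20, "Percent Changed": 50.0},
]
kept = [row["GO Name"] for row in filter_go_terms(rows)]
```

The Percent Changed cutoff is the only one that includes its boundary value, matching the "greater than or equal to" filter.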

The top 20 GO terms in the file were searched in MAPPFinder in order to find if any relationships existed between the terms. Terms that exhibited a relationship with another term were highlighted (terms that shared a relationship were highlighted with the same color). Terms that did not have any relationships with the other top 20 terms were given bold borders

GO Term Definitions/Grouping

The top 20 GO terms were grouped by color (groups with a lower number had a color that first appeared in an earlier row in the Excel file)
Unfamiliar GO terms were searched and defined using http://geneontology.org/ (definitions are sourced from this resource)
Group 1 Unfamiliar Terms (Highlighted Yellow)
cell projection organization: "A process that is carried out at the cellular level which results in the assembly, arrangement of constituent parts, or disassembly of a prolongation or process extending from a cell..."
cell projection assembly: "Formation of a prolongation or process extending from a cell, e.g. a flagellum or axon."
Group 2 Unfamiliar Terms (Highlighted Light Brown)
branched chain family amino acid biosynthetic process: "The chemical reactions and pathways resulting in the formation of amino acids containing a branched carbon skeleton, comprising isoleucine, leucine and valine"
Group 3 Unfamiliar Terms (Highlighted Light Blue)
nucleobase metabolic process: "The chemical reactions and pathways involving a nucleobase, a nitrogenous base that is a constituent of a nucleic acid..."
Group 4 Unfamiliar Terms (Highlighted Dark Blue)
thiamin pyrophosphate binding: "Interacting selectively and non-covalently with thiamine pyrophosphate, the diphosphoric ester of thiamine. Acts as a coenzyme of several (de)carboxylases, transketolases, and alpha-oxoacid dehydrogenases."
acetolactate synthase activity: "Catalysis of the reaction: 2 pyruvate = 2-acetolactate + CO2."
Group 5 Unfamiliar Terms (Highlighted Salmon)
flagellin-based flagellum basal body, distal rod: "The portion of the central rod of the flagellar basal body that is distal to the cell membrane; spans most of the distance between the inner and outer membranes."
Group 7 Unfamiliar Terms (Highlighted Grey)
hydrolase activity, acting on carbon-nitrogen (but not peptide) bonds, in linear amidines: "Catalysis of the hydrolysis of any non-peptide carbon-nitrogen bond in a linear amidine, a compound of the form R-C(=NH)-NH2."
Group 8 Unfamiliar Terms (Bold Black Borders)
monocarboxylic acid transport: "The directed movement of monocarboxylic acids into, out of or within a cell, or between cells, by means of some agent such as a transporter or pore."
monocarboxylic acid transmembrane transporter activity: "Enables the transfer of monocarboxylic acids from one side of the membrane to the other. A monocarboxylic acid is an organic acid with one COOH group."

Group Summaries

Group 1 (Yellow): Related to the processes behind the assembly/formation of cell projections (like flagella)
Group 2 (Light Brown): Related to the biosynthetic and metabolic processes that involve branched chain family amino acids
Group 3 (Light Blue): Involves the synthesis of nitrogenous bases (purines), and related metabolic processes
Group 4 (Dark Blue): Involves processes/reactions that incorporate ligand binding. One term involves interactions with thiamine pyrophosphate, which is a derivative of thiamine (an essential vitamin); thiamine pyrophosphate is a catalyst of many important biochemical reactions. The other term involves the synthesis of acetolactate from 2 pyruvates via an enzyme (acetolactate is involved in the biosynthesis of several branched chain amino acids).
Group 5 (Salmon): Involves the development of certain structures in the flagellum (distal rod of the "flagellar basal body"); the distal rod is an important flagellar structure that is transmembrane and it is connected to the other parts of the flagellum (like the filament/cap)
Group 6 (Green): Involves the biosynthetic/metabolic processes related to pigments (pigment production)
Group 7 (Grey): Involved with the hydrolysis (catalysis of the process) of the carbon-nitrogen bonds (non-peptide) in linear amidine and amide compounds
Group 8 (Bold Black Borders): This group is composed of somewhat miscellaneous terms that did not have clear relationships with each other and with other terms (in MAPPfinder). Two of the three terms are involved with monocarboxylic acids (transport and transport activity across membranes). The final term is extracellular region and is related to the space outside of the bacterium in question.

Result Interpretation

The work by Merrell et al. indicates that the colonization of V. cholerae in human hosts results in the creation of a bacterial state that is very infectious/pathogenic; this induced bacterial state, compared to the control lab-derived strains, was found to have increased levels of expression of genes involved with nutrient acquisition and motility (movement). The bacteria derived from human hosts, according to Merrell et al., had a "unique physiological and behavioural state" (Merrell et al. 2002). Using the top 20 GO terms, derived from this investigation, it was found that many genes with increased expression involved the development/maintenance of the flagella (motility) and various processes that assist the bacteria in producing and acquiring certain nutrients/compounds; these observations are in line with what was observed by Merrell et al. At first, it was not clear exactly how the development of flagella is tied to pathogenicity, but it was suspected that the increased motility permitted the bacterium to adapt better to the host environment (which, in turn, led to a higher degree of pathogenicity). A paper by Haiko and Westerlund-Wikström was found, which helped explain how the development of flagella leads to pathogenic character/behavior in bacteria. According to Haiko and Westerlund-Wikström, the motility associated with flagella assists bacteria in cell invasion, adhesion, and colonization; it seems apparent that an increased level of movement helps bacteria in colonizing and staying within a human host. With respect to the GO terms related to compound production (those related to the metabolic pathways involving branched chain family amino acids and nucleobases), it is possible that the pathogenic state is more demanding and more rapidly dividing than the "normal" state.
The increased expression in the genes involving amino acid synthesis could indicate that the pathogenic bacteria exhibit increased levels of translation (possibly of proteins that are related to the bacteria's pathogenicity; virulence proteins that assist the bacteria in colonizing a host, in invading cells, and in adhesion to cell surfaces/environment). The GO terms involving the hydrolysis of carbon-nitrogen bonds in amidine and amide compounds (which indicate increased levels of expression of genes related to those processes) suggest that the bacteria will have to process more compounds with amidine/amide groups, which are likely found in the host environment or in host cells. An enhanced ability to process these kinds of compounds could potentially allow the pathogenic bacteria to better acquire certain nutrients for their own use (like nitrogen). The GO terms related to monocarboxylic acid transport could also be related to the host body environment (increased presence of these types of compounds). A better ability to transport and process monocarboxylic compounds can enhance the nutrient acquisition ability of the pathogenic bacteria. Finally, with respect to the GO terms related to pigment synthesis, it is possible that these produced pigment compounds are toxic or exhibit properties that interfere with the immune system/with host cells; the GO term extracellular region could also refer to the human host environment that the bacteria interact with (genes related to it could enhance the pathogenic character of the bacteria by allowing them to interact with the host and its cells in a certain kind of way, such as the secretion of certain compounds, like toxic pigments, into the extracellular space).

Conclusion

Through this assignment, the work of Merrell et al. regarding the gene expression differences between human host V. cholerae and a control sample (no host) was reviewed and the microarray data that was used in that work was processed, statistically analyzed, and biologically investigated. Specifically, the data worked with were the log fold changes for the genes (expression differences between the two groups, experimental/human-host and control). The data set from Merrell et al. covers three subjects (patients A, B, and C), with four replicate measurements of V. cholerae per subject in the compiled spreadsheet. In our investigation, similarly to what was done by Merrell et al., the log fold changes for each subject's four replicates were averaged (resulting in three averages, one for each subject) and, using these three averages, an average log fold change covering all subjects/samples was found; the "average log fold change, all" value was the focus of this investigation into microarray data and its analysis. This data was taken, normalized/prepared, and statistically analyzed through the calculation of p-values (which were adjusted via the Bonferroni and Benjamini-Hochberg corrections). The resulting modified data was formatted for use in GenMAPP and opened in the program, which allowed a sophisticated analysis of the genes (and their expression) tied to the microarray data. In doing work with GenMAPP and MAPPFinder (which helped visualize the relationships between genes and the GO terms related to each gene), the use of different gene databases was explored (a 2009 versus a 2010 gene database for V. cholerae); the importance of a well updated database was clearly seen since the older database resulted in more errors/issues with the program when compared to the newer database. Through the work with GenMAPP and MAPPFinder, the top 20 GO terms related to genes that were significantly increased in expression in the human-host samples (pathogenic bacteria) were determined.
Through these GO terms, it was noticed that the results were very similar to those found by Merrell et al.; the paper suggests that human colonization is tied to an induction of a pathogenic state in the bacteria, which exhibits an increased expression of genes related to bacterial motility and nutrient acquisition. Many of the GO terms involved the flagella and processes that assist the bacteria in acquiring nutrients in the host environment, which appears to be very similar to what was found by Merrell et al. using the same data. This investigation, I feel, built a lot of skill with preparing, statistically analyzing, and interpreting microarray data. It also felt valuable in the way that it pushed the investigator to draw their own conclusions regarding the final results.

Uploaded Files

For Part I

For Part II

Zip containing the .EX.txt, .gex, XXX-CriterionX-GO.txt, .xls, .mapp, and .gmf files for Part II

Brandon Litvak
BIOL 367, Fall 2015
