Ksherbina Week 8

Katrina Sherbina

1 Lab Journal: Working with the Vibrio cholerae Microarray Data
2 Sanity Check: Number of genes significantly changed
3 Sanity Check: Compare individual genes with known data
4 Lab Journal: Analysis of Vibrio cholerae microarray data using GenMAPP and MAPPFinder
- 4.1 Convert normalized microarray data using the GenMAPP Expression Dataset Manager
- 4.2 MAPPFinder Procedure
  - 4.2.1 Interpretation of GO Results
5 Files generated in the above procedures

Lab Journal: Working with the Vibrio cholerae Microarray Data

Download the file Merrell_Compiled_Raw_Data_Vibrio.xls from the Sample Microarray Analysis for Vibrio cholerae page.
- Save the file with the following format for the filename: Merrell_Compiled_Raw_Data_Vibrio_<Initials>_<Date>.xls. In my case, the filename is "Merrell_Compiled_Raw_Data_Vibrio_KS_20131010.xls".
This file contains the Log₂ of Red Dye/Green Dye Normalized Ratio (Median) organized in the following manner:
- Patient A
  - Sample 1: 24047.xls (A1)
  - Sample 2: 24048.xls (A2)
  - Sample 3: 24213.xls (A3)
  - Sample 4: 24202.xls (A4)
- Patient B
  - Sample 5: 24049.xls (B1)
  - Sample 6: 24050.xls (B2)
  - Sample 7: 24203.xls (B3)
  - Sample 8: 24204.xls (B4)
- Patient C
  - Sample 9: 24053.xls (C1)
  - Sample 10: 24054.xls (C2)
  - Sample 11: 24205.xls (C3)
  - Sample 12: 24206.xls (C4)

Normalize the set of microarray chips in the experiment

Open the file. Insert a new worksheet and name it "scaled_centered".
Go back to the "compiled_raw_data" worksheet. Select All and Copy. Go to your new "scaled_centered" worksheet, click on the upper, left-hand cell (cell A1) and Paste.
Insert two rows in between the top row of the column headers and the first row of data.
In cell A2, type "Average" and in cell A3, type "StdDev".
Compute the average log fold change for each chip, which corresponds to each column of the data. In cell B2, type the following equation:

=AVERAGE(B4:B5224)

and hit Enter.

Alternatively, to select the range within the parentheses of the AVERAGE formula, click on the first cell of the range for which the computation will be performed, scroll to the bottom of the worksheet, and Shift+click on the last cell of the range.
Click and hold the lower right hand corner of the cell B2. Then, drag the cursor to the last column for which you would like to compute the average log fold change.

Compute the standard deviation of the log fold change ratios for each chip, which corresponds to each column of the data. In cell B3, type the following equation:

=STDEV(B4:B5224)

and hit Enter.

Click and hold the lower right hand corner of the cell B3. Then, drag the cursor to the last column for which you would like to compute the standard deviation of the log fold change ratios.

Copy and Paste the first column (column A) to the first blank column after the data (column N).
Now, the data can be scaled and centered. After column N, label a new column for each chip that will be scaled and centered: A1_scaled_centered, A2_scaled_centered, etc.
In cell O4, type the following equation:

=(B4-B$2)/B$3

In this case, the average of the first chip A1 (in cell B2) is subtracted from the data in cell B4. The difference is then divided by the standard deviation of the first chip A1 (in cell B3).

Double click on the lower right hand corner of cell O4 to perform scaling and centering for the rest of the chip (the column of the data).

Repeat the scaling and centering procedure for the rest of the chips changing the cells in the formula corresponding to the average and the standard deviation for each chip (column of the data) that you scale and center.
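
The scaling and centering formula can also be sketched outside Excel. The following is a minimal Python illustration, assuming a chip's column holds no empty cells; the function name and the example values are hypothetical and are not taken from the Merrell data. It uses the sample standard deviation, which is what Excel's STDEV computes.

```python
from statistics import mean, stdev


def scale_and_center(column):
    """Apply (value - average) / standard deviation to one chip's column.

    Mirrors the spreadsheet formula =(B4-B$2)/B$3, where B2 holds the
    column average and B3 holds the column standard deviation.
    """
    avg = mean(column)
    sd = stdev(column)  # sample standard deviation (n - 1), like Excel's STDEV
    return [(value - avg) / sd for value in column]


# Hypothetical log2 ratios for one chip (illustration only).
chip_a1 = [0.5, -1.0, 2.0, 0.0, -1.5]
scaled = scale_and_center(chip_a1)
```

After this step the column has an average of 0 and a standard deviation of 1, which is a quick way to check that the formula was filled down correctly.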

Perform statistical analysis of the normalized microarray data

Insert a new worksheet and name it "statistics".
Go back to the "scaled_centered" worksheet and copy the first column ("ID").
Paste the data into the first column of your new "statistics" worksheet.
Go back to the "scaled_centered" worksheet and click on the cell "A1_scaled_centered". Then, hold the Shift and Ctrl keys and hit the Right Arrow key to select all of the cells that have the column names for the scaled and centered data. Copy the selection of cells.
Go to the "statistics" worksheet and Paste the column names into the columns in first row after the "ID" column.
Go back to the "scaled_centered" worksheet and Select All and Copy the scaled and centered data.

To do so, click on the first cell of the data (cell O4). Then hold the Shift and Ctrl keys, hit the Right Arrow key, and then hit the Down Arrow key (making sure that you are still holding down the Shift and Ctrl keys). Then, Copy the selection.

Go to the "statistics" worksheet and right click on cell B2. Highlight the "Paste Special..." option and then click on "Paste Special...". A window will open: click on the radio button for "Values" and click OK. This pastes the data as numerical values rather than equations.
To the right of the data you just pasted into the worksheet, type the following headers into the first cell of the next three columns: "Avg_LogFC_A", "Avg_LogFC_B", and "Avg_LogFC_C".
Compute the average log fold change for the replicates for Patient A by typing the following equation:

=AVERAGE(B2:E2)

into cell N2 and hit Enter.

Double click on the lower right hand corner of cell N2 to compute the average of the replicates for Patient A for the remainder of the genes.

Repeat the calculation for Patients B and C in their respective columns.
Type the header "Avg_LogFC_all" into the first cell in the next empty column (column Q). Compute the average of the averages by typing the following equation into cell Q2:

=AVERAGE(N2:P2)

and hit Enter.

Double click on the lower right hand corner of cell Q2 to compute the average of the averages for the rest of the genes.

Now, compute a T statistic to determine how much the average log fold change of all the patients deviates from 0, which corresponds to no change. Type the header "Tstat" into the first cell in the next empty column (column R). Type the following equation into cell R2:

=Q2/(STDEV(N2:P2)/SQRT(COUNT(N2:P2)))

and hit Enter. (The command COUNT() counts the number of patients in the experiment.)

Double click on the lower right hand corner of cell R2 to compute the T statistic for the remainder of the genes.

Now, compute the P value to determine how significant is the deviation of the average log fold change of all the patients from 0. Type the header "Pvalue" into the first cell in the next empty column (column S). Type the following equation into cell S2:

=TDIST(ABS(R2),COUNT(N2:P2)-1,2)

and hit Enter. Here, the command COUNT(N2:P2)-1 computes the degrees of freedom, which is one less than the number of replicates (here, the three patients). The "2" specifies that a two-tailed distribution is used to compute the p value.

Double click on the lower right hand corner of cell S2 to compute the p value for the remainder of the genes.
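
As a cross-check on the Tstat and Pvalue columns, the same arithmetic can be sketched in Python. This is only an illustration under a stated assumption: with three patient averages there are exactly 2 degrees of freedom, and for that case the two-tailed p value of the t distribution has the closed form 1 - |t| / sqrt(2 + t^2). The function names and the example averages are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def t_statistic(patient_averages):
    """Mean divided by its standard error, as in the Tstat column."""
    n = len(patient_averages)
    return mean(patient_averages) / (stdev(patient_averages) / sqrt(n))


def two_tailed_p_df2(t):
    """Two-tailed p value for a t distribution with exactly 2 degrees of freedom.

    Valid only for df = 2 (three replicates); other df need a general
    t distribution function such as Excel's TDIST.
    """
    return 1 - abs(t) / sqrt(2 + t * t)


# Hypothetical per-patient average log fold changes for one gene.
averages = [1.0, 2.0, 3.0]
t = t_statistic(averages)
p = two_tailed_p_df2(t)
```

The closed form is used here only to keep the sketch free of external libraries; it should agree with TDIST(ABS(t), 2, 2) for the same t.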

Format the data for GenMAPP

Insert a new worksheet and name it "forGenMAPP".
Go back to the "statistics" worksheet and Select All and Copy.
Go to the "forGenMAPP" worksheet and right click on cell A1. Highlight the "Paste Special..." option and then click on "Paste Special...". A window will open: click on the radio button for "Values" and click OK. This pastes the data as numerical values rather than equations.
Insert a column to the left of column B. Label this column (type into the first cell of the column) "SystemCode".
In cell B2, type "N". Double click on the lower right hand corner of cell B2 to fill the rest of the column with "N".
Save this worksheet as a tab-delimited file.

Go to File > Save As.
In the drop-down menu next to "Save as type:" select "Text(Tab Delimited)". Click "Save".
Select "OK" or "Yes" for any error messages that may pop up.
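
For readers who prepare the file outside Excel, the layout described above (an ID column, then a SystemCode column filled with "N", then the data columns, all tab-delimited) can be sketched as follows. The function name, gene IDs, and values are hypothetical placeholders, and this is not a statement of everything GenMAPP requires of its input.

```python
import csv
import io


def write_genmapp_tsv(header, rows, out):
    """Write tab-delimited text with a SystemCode column inserted after the ID."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([header[0], "SystemCode"] + list(header[1:]))
    for row in rows:
        # "N" is the system code used in the procedure above.
        writer.writerow([row[0], "N"] + list(row[1:]))


# Hypothetical two-gene example (placeholder IDs and values).
header = ["ID", "Avg_LogFC_all", "Pvalue"]
rows = [["gene1", 0.12, 0.4], ["gene2", -0.8, 0.01]]
buffer = io.StringIO()
write_genmapp_tsv(header, rows, buffer)
text = buffer.getvalue()
```

Passing an open file handle instead of the `io.StringIO` buffer would write the same text to disk.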

Sanity Check: Number of genes significantly changed

The number of genes with a

p value < 0.05 is 948.
p value < 0.01 is 235.
p value < 0.001 is 24.
p value < 0.0001 is 2.

Keeping the filter p value < 0.05:

There are 352 genes with an average log fold change for all patients that is greater than 0.
There are 596 genes with an average log fold change for all patients that is less than 0.
There are 339 genes with an average log fold change for all patients that is greater than 0.25.
There are 579 genes with an average log fold change for all patients that is less than -0.25.
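
The counts above were obtained with Excel filters. A rough sketch of the same kind of tally in Python is shown below, assuming each gene is represented as a pair of (average log fold change, p value). The function name and the five example genes are hypothetical, so the numbers it produces are not the counts reported above.

```python
def count_genes(genes, p_cutoff, above=None, below=None):
    """Count genes with p value < p_cutoff, optionally restricted by fold change.

    above: keep only genes whose average log fold change is greater than this.
    below: keep only genes whose average log fold change is less than this.
    """
    count = 0
    for avg_logfc, p_value in genes:
        if p_value >= p_cutoff:
            continue
        if above is not None and not avg_logfc > above:
            continue
        if below is not None and not avg_logfc < below:
            continue
        count += 1
    return count


# Hypothetical (Avg_LogFC_all, Pvalue) pairs for illustration only.
genes = [(1.2, 0.03), (-0.4, 0.008), (0.1, 0.2), (0.3, 0.04), (-0.1, 0.049)]
n_sig = count_genes(genes, 0.05)
n_up = count_genes(genes, 0.05, above=0.25)
n_down = count_genes(genes, 0.05, below=-0.25)
```

One useful consistency check is that the counts above 0 and below 0 add up to the total at the same p value cutoff, as 352 and 596 do for 948.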

To determine significant gene expression changes, Merrell et al. (2002) used the Significance Analysis of Microarrays (SAM) program to determine which genes had at least a twofold change in expression from the control.

Sanity Check: Compare individual genes with known data

In the data that I normalized,

VC0028 has an average log fold change for all patients of 1.6526 and a p value of 0.0474.
VC0941 has an average log fold change for all patients of 0.0934 and a p value of 0.6759.
VC0869 has an average log fold change for all patients of 1.4990 and a p value of 0.0174.
VC0051 has an average log fold change for all patients of 1.9218 and a p value of 0.0139.
VC0647 has an average log fold change for all patients of -1.1126 and a p value of 0.0003.
VC0468 has an average log fold change for all patients of -0.1686 and a p value of 0.3350.
VC2350 has an average log fold change for all patients of -2.4029 and a p value of 0.0130.
VCA0583 has an average log fold change for all patients of 1.0628 and a p value of 0.1011.

Looking at the p values, VC0028, VC0869, VC0051, VC0647, and VC2350 are significantly changed in my analysis.

Lab Journal: Analysis of Vibrio cholerae microarray data using GenMAPP and MAPPFinder

Installed GenMAPP Classic from this page onto my computer.
Download the 2009 Gene Database for Vibrio cholerae Vc-Std_External_20090622.gdb

Download the file to the folder C:\GenMAPP 2 Data\Gene Databases.

Convert normalized microarray data using the GenMAPP Expression Dataset Manager

Launch GenMAPP 2.

Look at the lower-left hand corner to see what gene database is loaded. For this assignment, the gene database "Vc-Std_External_20090622.gdb" should appear in the corner.
If another database appears or if there is "No Gene Database", go to Data > Choose Gene Database and find the database you need to use.

Go to Data > Expression Dataset Manager.
In the window that pops up, go to Expression Datasets > New Dataset and open the tab-delimited file you created for GenMAPP.

In the "Data Type Specification" window that pops up, only check the box next to a column header if that column has character data. For the Merrell data set, do not check any boxes because all the data is numerical.

Give the Expression Dataset Manager time to convert your data into a GEX file.

An error message may appear that states that the Expression Dataset Manager was unable to convert some of the lines of the data. These lines of data are not incorporated into the Expression Dataset but rather recorded in an exception file that contains all of your raw data and an additional column called ~Error~.

The exception file is a tab-delimited file with the suffix .EX appended to the name of the raw data file you loaded into the Expression Dataset Manager.

Open the exception file in Excel.
Go to Data > Filter.
To determine what the errors were for the rows that were not converted, locate the ~Error~ column, click on the down arrow in the cell, and select the "Sort Z to A" option.

Using the 2009 Gene Database, there were 772 errors, each of which was "Gene not found in OrderedLocusNames or any related system."
My buddy will likely have a different number of errors because she is using a newer gene database for Vibrio cholerae than I am. The newer database may include genes that are part of the expression data but were not included in the older database.
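
The error tally can also be done without sorting in Excel. The sketch below assumes the exception file is tab-delimited with a header row and an ~Error~ column, as described above, and that rows converted without problems have an empty ~Error~ cell (an assumption). The function name and the three-row sample are hypothetical.

```python
import csv
import io
from collections import Counter


def tally_errors(handle, error_column="~Error~"):
    """Count how often each non-empty error message occurs in an exception file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row[error_column] for row in reader if row.get(error_column))


# Hypothetical miniature exception file (placeholder IDs and values).
message = "Gene not found in OrderedLocusNames or any related system."
sample = (
    "ID\tSystemCode\tPvalue\t~Error~\n"
    "gene1\tN\t0.4\t\n"
    "gene2\tN\t0.01\t" + message + "\n"
    "gene3\tN\t0.2\t" + message + "\n"
)
errors = tally_errors(io.StringIO(sample))
```

With a real exception file, an open file handle would be passed in place of the `io.StringIO` sample.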

Customize the new Expression Dataset by creating Color Sets, which contain the instructions to GenMAPP for displaying data on MAPPs.

In the "Color Sets" section, type in "Pathogenic v lab" in the "Name" field.
To specify what value appears next to each gene on a MAPP, select "Avg_LogFC_all" in the drop down menu in the "Gene Value" field.
In the "Criteria Builder" section, click on the "New" button. Now, we will construct the criterion to query the data.
We will set the criterion to query for all the genes that have a significant (i.e. [Pvalue] < 0.05) decrease in the average log fold change (i.e. [Avg_LogFC_all] < -0.25).

In the menu under "Columns" in the "Criteria Builder" section, select "Avg_LogFC_all", which will then appear in the "Criterion" field.
Under "Ops", click on the "<" operator. Then, type -0.25 (this will appear in the "Criterion" field).
Under "Ops", click on the "AND" operator.
In the menu under "Columns" in the "Criteria Builder" section, select "Pvalue".
Under "Ops", click on the "<" operator. Then, type 0.05 (this will appear in the "Criterion" field).
Enter a name for the criterion in the "Label in Legend" field (ex. "Decreased").
Choose a color for the criterion by left-clicking on the box next to "Color". Choose a color from the Color window that appears and click OK.

Once done specifying the criterion (look at the screenshot below to see an example of what is in each of the fields when specifying the criterion), click on the "Add" button.

To add more criteria, repeat the steps mentioned above to specify a new criterion.

To set a criterion to query for all the genes that have a significant increase in the average log fold change, the "Criterion" field should look like

[Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05
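
The two criteria can be read as a simple decision rule. The sketch below expresses them in Python, assuming criteria are checked in the order they were added; the function name, the "Increased" label, and the "No criteria met" fallback label are illustrative assumptions (only "Decreased" is given as an example label above). The example values are the averages and p values reported in the sanity check section.

```python
def color_set_label(avg_logfc_all, pvalue):
    """Return the legend label for the first criterion a gene satisfies."""
    if avg_logfc_all < -0.25 and pvalue < 0.05:
        return "Decreased"
    if avg_logfc_all > 0.25 and pvalue < 0.05:
        return "Increased"  # assumed label for the second criterion
    return "No criteria met"  # assumed fallback label


# Values from the sanity check section of this journal.
label_vc0647 = color_set_label(-1.1126, 0.0003)
label_vc0051 = color_set_label(1.9218, 0.0139)
label_vc0941 = color_set_label(0.0934, 0.6759)
```

Under this rule VC0647 falls under "Decreased", VC0051 under "Increased", and VC0941 under neither criterion.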

Save the entire Expression Dataset by going to Expression Datasets > Save.
Exit the Expression Dataset to view the Color Sets on a MAPP.

MAPPFinder Procedure

Launch MAPPFinder from within GenMAPP by selecting Tools > MAPPFinder.

Click on the button "Calculate New Results".

Click on "Find File", choose the GEX file you created of your Expression Dataset, and click OK.
Choose the Color Set and Criteria with which to filter the data. Click on the "Decreased" criterion in the right-hand box.
Check the boxes next to "Gene Ontology" and "p value".
Click the "Browse" button and create a meaningful filename for your results (ex. "Merrell_Vibrio_Data_MAPPFinder_Analysis_Decreased_KS_20131017").
Click "Run MAPPFinder".

When the results have been calculated, a Gene Ontology browser will open showing your results.

To see a list of the most significant Gene Ontology terms, click on the menu item "Show Ranked List".

The screenshot below shows the top 10 Gene Ontology terms from the results:

No, my buddy and I did not have the same top 10 GO terms. From comparing the GO result text files (the process to obtain them is described in later steps) between the two of us, I believe that this is a result of a discrepancy in setting the criterion for decreased expression in the Expression Dataset Manager.

In the main MAPPFinder Browser window, click on the button "Collapse the Tree". Then, you can search for the genes that were mentioned by Merrell et al. (2002), VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583.

Type the identifier of one of the genes into the MAPPFinder browser gene ID search field.
Choose "OrderedLocusNames" from the drop-down menu to the right of the search field.
Click on the GeneID Search button. The GO term(s) that are associated with that gene will be highlighted in blue.
Below are the genes that were found and the GO terms associated with them:

VC0647: mRNA catabolic process, RNA processing, cytoplasm, RNA binding, 3'-5' exonuclease activity, transferase activity, nucleotidyltransferase activity, polyribonucleotide nucleotidyltransferase activity
VCA0583: transport, outer membrane-bounded periplasmic space, transporter activity

Click on the RNA processing GO term, which is associated with the gene VC0647, the expression of which did change significantly in the experiment (refer to the section Sanity Check: Compare individual genes with known data for the p value). A MAPP will open listing all of the genes (as boxes) associated with that GO term.

To match the gene of interest to its identifier, go to the UniProt site and type the ID for your gene into the search bar.
In the MAPP, double click on the box PNP_VIBCH.
An Internet Explorer window will pop up that has links to different pages for the gene in public databases.

The VC0647 gene is involved in mRNA degradation. Its product catalyzes the phosphorolysis of single-stranded polyribonucleotides in the 3'-5' direction. (As described in the gene entry in UniProt.)

In Windows, make a copy of the results (i.e. Merrell_Vibrio_Data_MAPPFinder_Analysis_Decreased_KS_20131017-Criterion0-GO.txt) file.
Open the copy of the results file in Excel.

Comparing the results files from my buddy and me, there seems to be a discrepancy in the criterion for Avg_LogFC_all. This discrepancy resulted in different numbers of probes satisfying the other criteria listed under the "Calculation Summary".

Click on a cell in the row of headers. On the tool bar, select Sort & Filter > Filter. Set the following filters:

Z Score (in column N) greater than 2
PermuteP (in column O) less than 0.05
Number Changed (in column I) greater than or equal to 5 and less than 100
Percent Changed (in column L) greater than or equal to 25
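
The four filters combine as a single AND condition. A minimal Python sketch of that condition follows, assuming each row of the results file has been read into a dictionary keyed by the column headers named above; the exact header spellings in the real file, the "GO Name" key, and the example rows are assumptions made for illustration.

```python
def passes_filters(row):
    """True when a GO term row satisfies all four filters listed above."""
    return (
        float(row["Z Score"]) > 2
        and float(row["PermuteP"]) < 0.05
        and 5 <= float(row["Number Changed"]) < 100
        and float(row["Percent Changed"]) >= 25
    )


# Hypothetical rows (not taken from the actual MAPPFinder results).
rows = [
    {"GO Name": "term A", "Number Changed": "12", "Percent Changed": "40.0",
     "Z Score": "3.1", "PermuteP": "0.002"},
    {"GO Name": "term B", "Number Changed": "3", "Percent Changed": "40.0",
     "Z Score": "4.0", "PermuteP": "0.001"},
    {"GO Name": "term C", "Number Changed": "150", "Percent Changed": "30.0",
     "Z Score": "2.5", "PermuteP": "0.01"},
    {"GO Name": "term D", "Number Changed": "20", "Percent Changed": "25.0",
     "Z Score": "2.0", "PermuteP": "0.01"},
]
kept = [row["GO Name"] for row in rows if passes_filters(row)]
```

In this example only "term A" is kept: "term B" has too few changed genes, "term C" has too many, and "term D" has a Z score that is not strictly greater than 2.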

Save the file as a different Excel spreadsheet named, for example "Merrell_Vibrio_Data_MAPPFinder_Analysis_Decreased_KS_20131017-Criterion0-GO_Filtered", by selecting File > Save As and select Excel workbook (.xls) from the drop-down menu.
Use the MAPPFinder browser to determine which GO terms in the spreadsheet are closely related.

Interpretation of GO Results

It would make sense that GO terms for intracellular transport are enriched for the pathogenic V. cholerae strain because its ability to infect a host cell is dependent upon its ability to transfer proteins to the host, such as transcriptional regulators that would influence the gene expression within the host. In addition, there are several GO terms relating to metabolism, such as glucose metabolic process, that were enriched in the pathogenic strain. This is expected because the pathogenic strain will not be able to live, reproduce, or respond to environmental stimuli without performing vital metabolic processes. Also, there are several GO terms relating to gene expression, specifically translation, that were enriched in the pathogenic strain. This also makes sense because the pathogenic strain must be able to translate all of the proteins that are necessary for it to live and to reproduce so that it can infect a host cell successfully.

Files generated in the above procedures

From the way I performed the analysis, Criterion0 in this case refers to the "Decreased" criterion.
