Laurmagee: Week 8

Part 1

First download the following data file: File:Merrell Compiled Raw Data Vibrio.xls
Insert a new spreadsheet in the label it "scaled_centered".
Select all the data in your previous worksheet and copy it into the new spreadsheet
Insert two rows in between the top row of headers and the first data row.
In cell A2, type "Average" and in cell A3, type "StdDev".
Compute the Average log ratio for each chip (each column of data). In cell B2, type the following equation: =AVERAGE(B4:B5224)
Compute the Standard Deviation of the log ratios on each chip (each column of data). In cell B3, type the following equation: =STDEV(B4:B5224)
Drag these lines into the rest of the columns, because Excel will automatically change the equation to match your specific column.
Copy the column headings for all of your data columns and then paste them to the right of the last data column so that you have a second set of headers above blank colums of cells. Edit the names of the columns so that they now read: A1_scaled_centered, A2_scaled_centered, etc.
In cell N4, type the following equation: =(B4-B$2)/B$3 The dollar signs are important because we will be using the same mean (B2) and standard deviation (B3) in all the succeeding columns values as well.
Copy and paste the scaling and centering equation for each of the columns of data with the "_scaled_centered" column header. Be sure that your equation is correct for the column you are calculating (i.e. change the values to match the cells with the correct average and standard deviation).
Insert a new worksheet and name it "statistics".
Go back to the "scaling_centering" worksheet and copy the first column ("ID"). And paste these values into the first column of your new spreadsheet.
Go back to the "scaling_centering" worksheet and copy the columns that are designated "_scaled_centered". Click on the B1 cell and select "Paste Special" from the Edit menu. A window will open: click on the radio button for "Values" and click OK.
Go to a new column on the right of your worksheet. Type the header "Avg_LogFC_A", "Avg_LogFC_B", and "Avg_LogFC_C" into the top cell of the next three columns.
Compute the average log fold change for the replicates for each patient by typing the equation: =AVERAGE(B2:E2) into cell N2. Copy this equation and paste it into the rest of the column.
Create the equation for patients B and C and paste it into their respective columns.
Now you will compute the average of the averages. Type the header "Avg_LogFC_all" into the first cell in the next empty column.
Create the equation that will compute the average of the three previous averages you calculated and paste it into this entire column.
Insert a new column labeled "Tstat".
Enter the equation: =AVERAGE(N2:P2)/(STDEV(N2:P2)/SQRT(3)) and paste it into all rows in that column.
Label the top cell in the next column "Pvalue" and enter the equation: =TDIST(ABS(R2),2,2) paste it into all rows in that column.
Insert a new worksheet and name it "forGenMAPP".
Go back to the "statistics" worksheet and Select All and Copy.
Go to your new sheet and click on cell A1 and select Paste Special, click on the Values radio button, and click OK.
Select Columns B through Q (all the fold changes). Select the menu item Format > Cells. Under the number tab, select 2 decimal places. Click OK.
Select Columns R and S. Select the menu item Format > Cells. Under the number tab, select 4 decimal places. Click OK.
Select Columns N through S and Cut. Select Column B by left-clicking on the "B" at the top of the column. Then right-click on the Column B header and select "Insert Cut Cells". This will insert the data without writing over your existing columns.
Insert a column to the right of the "ID" column. Type the header "SystemCode" into the top cell of this column. Fill the entire column (each cell) with the letter "N".
Select the menu item File > Save As, and choose "Text (Tab-delimited) (*.txt)" from the file type drop-down menu.

To make sure that we performed our data analysis correctly, first open your spreadsheet and go to the "forGenMAPP" tab.
Click on cell A1 and select the menu item Data > Filter > Autofilter.
Click on the drop-down arrow on your "Pvalue" column. Select "Custom". In the window that appears, set a criterion that will filter your data so that the Pvalue has to be less than 0.05.
How many genes have p value < 0.05?
- 5221 genes.
What about p < 0.01?
- 169 genes.
What about p < 0.001?
- 19 genes.
What about p < 0.0001?
- 1 gene.
Keeping the "Pvalue" filter at p < 0.05, filter the "Avg_LogFC_all" column to show all genes with an average log fold change greater than zero. How many are there?
- 300 genes
Keeping the "Pvalue" filter at p < 0.05, filter the "Avg_LogFC_all" column to show all genes with an average log fold change less than zero. How many are there?
- 454 genes
What about an average log fold change of > 0.25 and p < 0.05?
- 281 genes
Or an average log fold change of < -0.25 and p < 0.05? (These are more realistic values for the fold change cut-offs because it represents about a 20% fold change which is about the level of detection of this technology.)
- 442 genes
What criteria did Merrell et al. (2002) use to determine a significant gene expression change? How does it compare to our method?
- The Statistical Analysis for Microarrays (SAM) program was used to determine a significant gene expression change, whereas we used Excel.
Merrell et al. (2002) report that genes with IDs: VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583 were all significantly changed in their data. Look these genes up in your spreadsheet. What are their fold changes and p values? Are they significantly changed in our analysis?
1. VC0028
- Fold Change: 1.37
- P-value: 0.0466
1. VC0941
- Fold Change: -.21
- P-value: 0.2467
1. VC0869
- Fold Change: 1.63
- P-value: 0.0464
1. VC0051
- Fold Change: 1.79
- P-value: 0.0152
1. VC0647
- Fold Change: -.78
- P-value: 0.0776
1. VC0468
- Fold Change: -.01
- P-value: 0.9443
1. VC235
- Fold Change: -1.12
- P-value: 0.2265
1. VCA0583
- Fold Change: .50
- P-value: 0.2768
These values were not significantly changed in our analysis.

Part 2

Launch GenMapp and note that the lower left-hand corner should display the gene database you would like to select.
To find the gene database you want to use, select Data > Choose Gene Database. I used the gene database Vc-Std_External_20101022.gdb that can be downloaded here: [1].
Select the Data menu from the main Drafting Board window and choose Expression Dataset Manager from the drop-down list. The Expression Dataset Manager window will open.
Select New Dataset from the Expression Datasets menu. Select the tab-delimited text file that you formatted for GenMAPP (.txt) in Part 1.
The Data Type Specification window will appear. GenMAPP is expecting that you are providing numerical data. If any of your columns has text (character) data, you would check the box next to the field (column) name. In this case there are not any text data, so the box does not need to be checked.
Once the data has been converted, save your new .gex file:
File:Merrell Compiled Raw Data Vibrio LM 20131010.gex
A message may appear saying that the Expression Dataset Manager could not convert one or more lines of data. Lines that generate an error during the conversion of a raw data file are not added to the Expression Dataset. Instead, an exception file is created. The exception file is given the same name as your raw data file with .EX before the extension. The exception file will contain all of your raw data, with the addition of a column named “Error”. This column contains either error messages or, if the program finds no errors, a single-space character.
File:Merrell Compiled Raw Data Vibrio LM 20131010.EX.txt
The above file notes 121 errors all due to an error message of "Gene not found in OrderedLocusNames or any related system". Since I am working with the 2010 database, I have less errors than my partner. This is most likely due to the fact that more genes have been added to the system, so many are now identifiable with the 2010 update.
Customize the new Expression Dataset by creating new Color Sets. This is done by filling in the following different fields in the Color Set area of the Expression Dataset Manager: a name for the Color Set, the gene value, and the criteria that determine how a gene object is colored on the MAPP.
Select the column of data to be used as the Gene Value from the drop down list or select [none]. We will use "Avg_LogFC_all" for the Vibrio dataset you just created.
Note the criteria as follows: with the column name always enclosed in brackets and character values enclosed in single quotes. The Ops (operators) list contains the relational operators that may be used in the criteria: equals ( = ) greater than ( > ), less than ( < ), greater than or equal to ( >= ), less than or equal to ( <= ), is not equal to ( <> ).
In this example, you will create two criterion. "Increased" will be [Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05 and "Decreased will be [Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05.
Save the entire Expression Dataset by selecting Save from the Expression Dataset menu and exit the window to view the Color Sets on a MAPP.
Launch the MAPPFinder program
Make sure that the Gene Database for the correct species is loaded.
Click on the button "Calculate New Results" and then "Find File" and choose your Expression Dataset file.
Choose the Color Set and Criteria with which to filter the data. Click on the "Decreased" criteria in the right-hand box.
Check the boxes next to "Gene Ontology" and "p value". Then click the "Browse" button and create a meaningful filename for your results.
Click "Run MAPPFinder". When the results have been calculated, a Gene Ontology browser will open showing your results. To see a list of the most significant Gene Ontology terms, click on the menu item "Show Ranked List".
The top ten gene ontology terms in my list are the following:
- Glucose Catabolic Process
- Hexose Catabolic Process
- Glycolysis
- Monosaccharide Catabolic Process
- Cytoplasm
- Alcohol CCtabolic Process
- Cellular Carbohydrate Catabolic Process
- Glucose Metabolic Process
- Protein Folding
- Hexose Metabolic Process
These top ten terms are different from that of those shown in an older version of the database, most likely due to the fact that more research was added to the database that allowed for a different, and probably more accurate, top ten.
One of the things you can do in MAPPFinder is to find the Gene Ontology term(s) with which a particular gene is associated. To use this feature, first, in the main MAPPFinder Browser window, click on the button "Collapse the Tree". Then, you can search for desired genes.
For example, to search for the genes mentioned by Merrell et al. (2002), VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583. Type the identifier for one of these genes into the MAPPFinder browser gene ID search field. *Choose "OrderedLocusNames" from the drop-down menu to the right of the search field and click on the GeneID Search button.
The GO term(s) that are associated with that gene will be highlighted in blue and appear as follows:
1. VC0028:
  - Branched chain family amino acid biosynthetic process
  - Cellular Amino Acid Biosynthetic process
  - Metabolic Process
  - Metal Ion Binding
  - Iron-Sulfur Cluster Binding
  - 4 Iron, 4 Sulfur Cluster Binding
  - Catalytic Activity
  - Lyase Activity
  - Dihydroxy-acid Dehydratase Activity
2. VC0941:
  - Glycine Metabolic Process
  - L-serine Metabolic Process
  - One-Carbon Metabolic Process
  - Cytoplasm
  - Pyridoxal Phosphate Binding
  - Catalytic Activity
  - Transferase Activity
  - Glycine Hydroxymethyltransferase Activity
3. VC0869
  - Glutamine Metabolic Process
  - Purine Nucleotide Biosynthetic Process
  - 'de novo' IMP Biosynthetic Process
  - Cytoplasm
  - Nucleotide Binding
  - ATP binding
  - Catalytic Activity
  - Ligase Activity
  - Phosphoribosylformyglycinamidine Synthase Activity
4. VC0051:
  - Purine Nucleotide Biosynthetic Process
  - 'de novo' IMP Bisynthetic Process
  - Nucleotide Binding
  - ATP Binding
  - Catalytic Activity
  - Lyase Activity
  - Carboxy-lyase Activity
  - Phosphoribosylaminoimidazole
5. VC0647:
  - mRNA Catabolic Process
  - RNA Processing
  - Cytoplasm
  - Mitochondrion
  - RNA Binding
  - 3'-5'-exoribonuclease Activity
  - Transferase Activity
  - Nucleotidyltransferase Activity
  - Polyribonucleotide Nucleotidyltransferase Activity
6. VC0468:
  - Glutathione Biosynthetic Process
  - Metal Ion Binding
  - Nucleotide Binding
  - ATP Binding
  - Catalytic Activity
  - Ligase Activity
  - Glutathione Synthase Activity
7. VC2350:
  - Deoxyribonucleotide Catabolic Process
  - Metabolic Process
  - Cytoplasm
  - Catalytic Activity
  - Lyase Activity
  - Deoxyribose-phosphate aldolase Activity
8. VCA0583:
  - Transport
  - Outer Membrane-Bounded Periplasmic Space
  - Transporter Activity
These genes are very similar to what show in the earlier version of this database, but there is more genes and information offered in the 2010 database due to research updates.
You can click on specific GO terms for example I chose “RNA Binding” for gene #VC0647 and found that it had not changed significantly in an experiment outlined on the UniProt identification system.
File:New Data File -Decreased-Criterion1-GO.txt
File:New Data File -Decreased-Criterion1-GO.xls
Open the copies of the .txt files in Excel and look at the top of the spreadsheet. There are rows of information that give you the background information on how MAPPFinder made the calculations.

Compare this information with your buddy who used a different version of the Vibrio Gene Database. Which numbers are different? Why are they different? Record this information in your individual journal entry.

Click on a cell in the row of headers for the data. Then go to the Data menu and click "Filter > Autofilter".
Drop-down arrows will appear in the row of headers. You can now choose to filter the data. Click on the drop-down arrow for the column you wish to filter and choose "(Custom…)".
You must set these two filters:
- Z Score (in column N) greater than 2
- PermuteP (in column O) less than 0.05
- Number Changed (in column I) greater than or equal to 5 AND less than 100
- Percent Changed (in column L) greater than or equal to 38%
Save your changes to an Excel spreadsheet. Select File > Save As and select Excel workbook (.xls) from the drop-down menu.
File:New Data File-Decreased-Criterion1-Highlighted-GO.xls
File:Merrell Compiled Raw Data Vibrio LM 20131010.gmf

Laurmagee (talk) 22:56, 17 October 2013 (PDT)

Laurmagee: Week 8

Part 1

Part 2

Personal tools

Namespaces

Variants

Views

Actions

Search

Navigation

Toolbox