Lenaolufson Week 8

From LMU BioDB 2015

Lena Olufson

List of uploaded files:

  1. Media:Merrell_Compiled_Raw_Data_Vibrio_LO_20151015_(1_corrected.EX.txt , Media:Merrell_Compiled_Raw_Data_Vibrio_LO_20151015_(1).txt
  2. Media:Merrell Compiled Raw Data Vibrio LO 20151015 (1 corrected.gex , Media:Merrell_Compiled_Raw_Data_Vibrio_LO_20151015_(1_corrected_(1).gex
  3. Media:Merrell_Compiled_Raw_Data_Vibrio_LO_20151015_(1)-Criterion0-GO.txt , Media:Merrell_Compiled_Raw_Data_Vibrio_LO_20151015_(1)-Criterion0-GO_-_Copy.txt
  4. Media:Merrell_Compiled_Raw_Data_Vibrio_LO_20151015_(1).xls
  5. Media:Nuclease_activity.mapp
  6. Media:Merrell Compiled Raw Data Vibrio LO 20151015 (1 corrected (1).gmf

10/15/15 Protocol

  • I went to the OpenWetWare website and created an account so that I could edit the pages. After obtaining access, I copied the text of the part 1 page onto my own page and then edited it.

Before we begin...

  • The data from the Merrell et al. (2002) paper was accessed from this page at the Stanford Microarray Database.
  • The Log2 of R/G Normalized Ratio (Median) has been copied from the raw data files downloaded from the Stanford Microarray Database.
    • Patient A
      • Sample 1: 24047.xls (A1)
      • Sample 2: 24048.xls (A2)
      • Sample 3: 24213.xls (A3)
      • Sample 4: 24202.xls (A4)
    • Patient B
      • Sample 5: 24049.xls (B1)
      • Sample 6: 24050.xls (B2)
      • Sample 7: 24203.xls (B3)
      • Sample 8: 24204.xls (B4)
    • Patient C
      • Sample 9: 24053.xls (C1)
      • Sample 10: 24054.xls (C2)
      • Sample 11: 24205.xls (C3)
      • Sample 12: 24206.xls (C4)
    • Stationary Samples (We will not be using these, they are listed here for completeness, but do not appear in your compiled raw data file.)
      • Sample 13: 24059.xls (Stationary-1)
      • Sample 14: 24060.xls (Stationary-2)
      • Sample 15: 24211.xls (Stationary-3)
      • Sample 16: 24212.xls (Stationary-4)
  • I downloaded the Merrell_Compiled_Raw_Data_Vibrio.xls file to my Desktop.

Normalize the log ratios for the set of slides in the experiment

I scaled and centered the data (between chip normalization) by performing the following operations:

  • Inserted a new Worksheet into my Excel file, and named it "scaled_centered".
  • Went back to the "compiled_raw_data" worksheet, Selected All and Copy. Went to my new "scaled_centered" worksheet, clicked on the upper, left-hand cell (cell A1) and Pasted.
  • Inserted two rows in between the top row of headers and the first data row.
  • In cell A2, I typed "Average" and in cell A3, typed "StdDev".
  • I went to compute the Average log ratio for each chip (each column of data). In cell B2, I typed the following equation:
=AVERAGE(B4:B5224)
and pressed "Enter". Excel computed the average value of the cells in the range given inside the parentheses. Instead of typing the cell designations, I clicked on the beginning cell, scrolled down to the bottom of the worksheet, and shift-clicked on the ending cell.
  • I then computed the Standard Deviation of the log ratios on each chip (each column of data). In cell B3, I typed the following equation:
=STDEV(B4:B5224)
and pressed "Enter".
  • Excel did some work for me. I copied these two equations (cells B2 and B3) and pasted them into the empty cells in the rest of the columns. Excel automatically changed the equation to match the cell designations for those columns.
  • I had now computed the average and standard deviation of the log ratios for each chip. Then I actually did the scaling and centering based on these values.
  • I copied the column headings for all of my data columns and then pasted them to the right of the last data column so that I had a second set of headers above blank columns of cells. I edited the names of the columns so that they read: A1_scaled_centered, A2_scaled_centered, etc.
  • In cell N4, I typed the following equation:
=(B4-B$2)/B$3
In this case, I wanted the data in cell B4 to have the average subtracted from it (cell B2) and be divided by the standard deviation (cell B3). I used the dollar sign symbols in front of the "2" and "3" to tell Excel to always reference that row in the equation, even though I pasted it for the entire column of 5221 genes. This matters because without the dollar signs Excel would shift the row references as the formula was pasted down, so every gene after the first would be scaled by the wrong cells rather than by the chip's average and standard deviation.
  • I copied and pasted this equation into the entire column. One easy way to do this was to click on the original cell with my equation and position my cursor at the bottom right corner. When my cursor changed to a thin black plus sign (not a chubby white one), I double-clicked, and the formula was magically copied to the entire column of genes.
  • I copied and pasted the scaling and centering equation for each of the columns of data with the "_scaled_centered" column header. I made sure that my equation was correct for the column I was calculating.
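
The scaling and centering steps above can also be sketched outside of Excel. This is a minimal Python sketch, not part of the original protocol; the function name and the example values are hypothetical. It assumes the sample standard deviation, which is what Excel's STDEV computes.

```python
from statistics import mean, stdev

def scale_and_center(column):
    """Subtract the column average from each value and divide by the column standard deviation."""
    avg = mean(column)   # plays the role of row 2 ("Average")
    sd = stdev(column)   # sample standard deviation, plays the role of row 3 ("StdDev")
    # avg and sd are computed once per chip and reused for every gene,
    # which is what the B$2 and B$3 references do in the spreadsheet formula.
    return [(x - avg) / sd for x in column]

# Hypothetical log ratios for one chip (not real data from the experiment)
example_chip = [0.5, -1.0, 0.25, 1.25, -0.5]
scaled = scale_and_center(example_chip)
```

After this transformation each chip's column has an average of 0 and a standard deviation of 1, so the chips can be compared on the same scale.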

Perform statistical analysis on the ratios

I performed this step on the scaled and centered data I produced in the previous step.

  • I inserted a new worksheet and named it "statistics".
  • I went back to the "scaled_centered" worksheet and copied the first column ("ID").
  • I pasted the data into the first column of my new "statistics" worksheet.
  • I went back to the "scaled_centered" worksheet and copied the columns that were designated "_scaled_centered".
  • I went to my new worksheet and clicked on the B1 cell. I selected "Paste Special" from the Edit menu. A window opened: I clicked on the radio button for "Values" and clicked OK. This pasted the numerical results into my new worksheet instead of the equations, which would otherwise have been recalculated on the fly.
  • I deleted Rows 2 and 3 where it said "Average" and "StdDev" so that my data rows with gene IDs were immediately below the header row 1.
  • I went to a new column on the right of my worksheet. I typed the header "Avg_LogFC_A", "Avg_LogFC_B", and "Avg_LogFC_C" into the top cell of the next three columns.
  • I computed the average log fold change for the replicates for each patient by typing the equation:
=AVERAGE(B2:E2)
into cell N2. I copied this equation and pasted it into the rest of the column.
  • I created the equation for patients B and C and pasted it into their respective columns.
  • I then computed the average of the averages. I typed the header "Avg_LogFC_all" into the first cell in the next empty column. I created the equation that computed the average of the three previous averages I calculated and pasted it into this entire column.
  • I inserted a new column next to the "Avg_LogFC_all" column that I computed in the previous step. I labeled the column "Tstat". This column held a T statistic that told me whether the scaled and centered average log ratio was significantly different from 0 (no change). I entered the equation:
=AVERAGE(N2:P2)/(STDEV(N2:P2)/SQRT(number of replicates))
(NOTE: in this case the number of replicates was 3. I was careful that I was using the correct number of parentheses.) I copied the equation and pasted it into all rows in that column.
  • I labeled the top cell in the next column "Pvalue". In the cell below the label, I entered the equation:
=TDIST(ABS(R2),degrees of freedom,2)

The number of degrees of freedom was the number of replicates minus one, so in my case there were 2 degrees of freedom. I copied the equation and pasted it into all rows in that column.
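
The T statistic and two-tailed p value above can be checked with a short Python sketch. This is an illustration, not the method used in the protocol; the function names are hypothetical. It relies on an assumption that holds only for this design: with exactly 2 degrees of freedom, the two-tailed p value of Student's t has the closed form 1 - |t| / sqrt(t^2 + 2), so no statistics library is needed.

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(values):
    """One-sample T statistic against 0: average / (sample stdev / sqrt(n))."""
    n = len(values)
    return mean(values) / (stdev(values) / sqrt(n))

def two_tailed_p_df2(t):
    """Two-tailed p value for Student's t with exactly 2 degrees of freedom.

    Valid only for 2 degrees of freedom (3 replicates); other designs need
    the general t distribution.
    """
    t = abs(t)
    return 1 - t / sqrt(t * t + 2)

# Hypothetical patient averages for one gene (not real data)
t = t_statistic([1.0, 2.0, 3.0])
p = two_tailed_p_df2(t)
```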

Calculate the Bonferroni p value Correction

  • Then I performed adjustments to the p value to correct for the multiple testing problem. I labeled the next two columns to the right with the same label, Bonferroni_Pvalue.
  • I typed the equation =S2*5221 into cell T2, the first cell below the first Bonferroni_Pvalue header. After completing this single computation, I used the fill-handle trick to copy the formula throughout the column.
  • I replaced any corrected p value that was greater than 1 by the number 1 by typing the following formula into the first cell below the second Bonferroni_Pvalue header: =IF(T2>1,1,T2). I used the trick to copy the formula throughout the column.
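
The two Bonferroni columns above amount to multiplying each p value by the number of tests and capping the result at 1. A minimal Python sketch of that arithmetic follows; the function name and the example p values are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each p value by the number of tests and cap the result at 1."""
    n = len(p_values)  # 5221 in the spreadsheet
    return [min(p * n, 1.0) for p in p_values]

# Hypothetical unadjusted p values (not real data)
corrected = bonferroni([0.01, 0.5, 0.2])
```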

10/20/15 Protocol

  • I continued with the BIOL398-01/S10:Sample Microarray Analysis Vibrio cholerae page to finish the list of actions to perform.

Calculate the Benjamini & Hochberg p value Correction

  • I inserted a new worksheet named "B-H_Pvalue".
  • I copied and pasted the "ID" column from my previous worksheet into the first column of the new worksheet.
  • I inserted a new column on the very left and named it "MasterIndex". I created a numerical index of genes so that I could always sort them back into the same order.
    • I typed a "1" in cell A2 and a "2" in cell A3.
    • I selected both cells. I hovered my mouse over the bottom-right corner of the selection until it made a thin black + sign. I double-clicked on the + sign to fill the entire column with a series of numbers from 1 to 5221 (the number of genes on the microarray).
  • For the following, I used Paste special > Paste values. I copied my unadjusted p values from my previous worksheet and pasted it into Column C.
  • I selected all of columns A, B, and C. I clicked the A-to-Z sort button on the toolbar and, in the window that appeared, sorted by Column C from smallest to largest.
  • I typed the header "Rank" in cell D1. I created a series of numbers in ascending order from 1 to 5221 in this column. This was the p value rank, smallest to largest. I typed "1" into cell D2 and "2" into cell D3. I selected both cells D2 and D3. I double-clicked on the plus sign on the lower right-hand corner of my selection to fill the column with a series of numbers from 1 to 5221.
  • Then I calculated the Benjamini and Hochberg p value correction. I typed B-H_Pvalue in cell E1. I typed the following formula in cell E2: =(C2*5221)/D2 and pressed enter. I copied that equation to the entire column.
  • I typed "B-H_Pvalue" into cell F1.
  • I typed the following formula into cell F2: =IF(E2>1,1,E2) and pressed enter. I copied that equation to the entire column.
  • I selected columns A through F. I then sorted them by the MasterIndex in Column A in ascending order.
  • I copied column F and used Paste special > Paste values to paste it into the next column on the right of my "statistics" sheet.
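
The sort, rank, and correct steps above can be sketched in Python. This sketch mirrors the spreadsheet calculation (p value times the number of genes, divided by the rank, capped at 1) and returns the results in the original gene order, like sorting back by MasterIndex. The function name is hypothetical. Note that statistical packages usually add a further step that keeps the adjusted values from decreasing as the rank increases; that step is not part of this worksheet and is left out here.

```python
def benjamini_hochberg_simple(p_values):
    """Spreadsheet-style B-H correction: p * n / rank, capped at 1, in original order."""
    n = len(p_values)
    # Indices of the genes ordered by p value, smallest first (the "Rank" column)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    for rank, i in enumerate(order, start=1):
        adjusted[i] = min(p_values[i] * n / rank, 1.0)
    return adjusted

# Hypothetical unadjusted p values (not real data)
bh = benjamini_hochberg_simple([0.04, 0.01, 0.03])
```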

Prepare file for GenMAPP

  • I inserted a new worksheet and named it "forGenMAPP".
  • I went back to the "statistics" worksheet and Selected All and Copy.
  • I went to my new sheet and clicked on cell A1 and selected Paste Special, clicked on the Values radio button, and clicked OK. I then formatted this worksheet for import into GenMAPP.
  • I selected Columns B through Q (all the fold changes). I selected the menu item Format > Cells. Under the number tab, I selected 2 decimal places. I clicked OK.
  • I selected all the columns containing p values. I selected the menu item Format > Cells. Under the number tab, I selected 4 decimal places. I clicked OK.
  • I deleted the left-most Bonferroni p value column, preserving the one that showed the result of my "if" statement.
  • I inserted a column to the right of the "ID" column. I typed the header "SystemCode" into the top cell of this column. I filled the entire column (each cell) with the letter "N".
  • I selected the menu item File > Save As, and chose "Text (Tab-delimited) (*.txt)" from the file type drop-down menu. Excel made me click through a couple of warnings because it didn't like me going all independent and choosing a different file type than the native .xls. That was OK. My new *.txt file was now ready for import into GenMAPP. But before I did that, I wanted to know a few things about my data as shown in the next section.
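
The last two formatting steps (adding the "SystemCode" column and saving as tab-delimited text) can be sketched in Python as well. This is only an illustration under stated assumptions: the function name is hypothetical, the ID is assumed to be the first column, and the example row is a shortened version of the real column layout.

```python
import csv
import io

def to_genmapp_text(header, rows, system_code="N"):
    """Insert a SystemCode column after the ID column and return tab-delimited text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([header[0], "SystemCode"] + header[1:])
    for row in rows:
        writer.writerow([row[0], system_code] + row[1:])
    return out.getvalue()

# Shortened, illustrative layout: ID, one fold change column, one p value column
text = to_genmapp_text(
    ["ID", "Avg_LogFC_all", "Pvalue"],
    [["VC0028", "1.27", "0.0692"]],
)
```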

Sanity Check: Number of genes significantly changed

Before I moved on to the GenMAPP/MAPPFinder analysis, I wanted to perform a sanity check to make sure that I performed our data analysis correctly. I found out the number of genes that are significantly changed at various p value cut-offs and also compared my data analysis with the published results of Merrell et al. (2002).

  • I opened my spreadsheet and went to the "forGenMAPP" tab.
  • I clicked on cell A1 and selected the menu item Data > Filter > Autofilter. Little drop-down arrows appeared at the top of each column. This enabled me to filter the data according to criteria I set.
  • I clicked on the drop-down arrow on my "Pvalue" column. I selected "Custom". In the window that appeared, I set a criterion that filtered my data so that the Pvalue had to be less than 0.05.
    • How many genes have p value < 0.05? and what is the percentage (out of 5221)?
      • 948 genes, 18.16%
    • What about p < 0.01? and what is the percentage (out of 5221)?
      • 235 genes, 4.5%
    • What about p < 0.001? and what is the percentage (out of 5221)?
      • 24 genes, 0.46%
    • What about p < 0.0001? and what is the percentage (out of 5221)?
      • 2 genes, 0.04%
  • When I used a p value cut-off of p < 0.05, what I was saying is that I would have seen a gene expression change that deviated this far from zero by chance less than 5% of the time.
  • I had just performed 5221 T tests for significance. Another way to state what I was seeing with p < 0.05 is that I expected to see this magnitude of a gene expression change by chance in about 5% of my T tests, or 261 times. (Tested my understanding: http://xkcd.com/882/.) Since I had more than 261 genes that passed this cut-off, I knew that some genes were significantly changed. However, I didn't know which ones. To apply a more stringent criterion to my p values, I performed the Bonferroni and Benjamini and Hochberg corrections to these unadjusted p values. The Bonferroni correction was very stringent. The Benjamini-Hochberg correction was less stringent. To see this relationship, I filtered my data to determine the following:
    • How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 5221)?
      • 0 genes, 0%
    • How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 5221)?
      • 0 genes, 0%
  • The "Avg_LogFC_all" told me the size of the gene expression change and in which direction. Positive values were increases relative to the control; negative values were decreases relative to the control.
    • I kept the (unadjusted) "Pvalue" filter at p < 0.05, filtered the "Avg_LogFC_all" column to show all genes with an average log fold change greater than zero. How many are there? (and %)
      • 352 genes, 6.7%
    • I kept the (unadjusted) "Pvalue" filter at p < 0.05, filtered the "Avg_LogFC_all" column to show all genes with an average log fold change less than zero. How many are there? (and %)
      • 596, 11.4%
    • What about an average log fold change of > 0.25 and p < 0.05? (and %)
      • 339, 6.5%
    • Or an average log fold change of < -0.25 and p < 0.05? (and %) (These are more realistic values for the fold change cut-offs because they represent about a 20% change, which is about the level of detection of this technology.)
      • 579, 11.1%
  • In summary, the p value cut-off was not thought of as some magical number at which data became "significant". Instead, it was a moveable confidence level. If I wanted to be very confident of my data, I used a small p value cut-off. If I was OK with being less confident about a gene expression change and wanted to include more genes in my analysis, I used a larger p value cut-off. For the GenMAPP analysis below, I used the fold change cut-off of greater than 0.25 or less than -0.25 and the unadjusted p value cut off of p < 0.05 for my analysis because I wanted to include several hundred genes in my analysis.
  • What criteria did Merrell et al. (2002) use to determine a significant gene expression change? How does it compare to our method?
    • Merrell et al. used a two-class SAM analysis to determine a significant gene expression change, and genes with a twofold or greater change in expression were considered significantly changed. This differs from our method because we used p values to determine the significance of the change in expression.
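
The arithmetic behind this sanity check can be sketched in Python. The helper names are hypothetical; the total of 5221 genes and the cut-offs come from the steps above. The sketch shows how a gene count becomes a percentage, how many genes would be expected to pass a p value cut-off by chance alone, and how a log2 fold change of 0.25 converts to a ratio.

```python
def percent_of_genes(count, total=5221):
    """Percentage of all genes on the microarray, rounded to two decimal places."""
    return round(100 * count / total, 2)

def expected_by_chance(alpha, total=5221):
    """Number of tests expected to pass the cut-off by chance alone."""
    return alpha * total

def log2_to_ratio(log_fc):
    """Convert a log2 fold change back to a plain expression ratio."""
    return 2 ** log_fc

chance_hits = expected_by_chance(0.05)   # about 261 of the 5221 T tests
cutoff_ratio = log2_to_ratio(0.25)       # about 1.19, close to a 20% increase
```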

Sanity Check: Compare individual genes with known data

  • Merrell et al. (2002) report that genes with IDs: VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583 were all significantly changed in their data. Look these genes up in your spreadsheet. What are their fold changes and p values? Are they significantly changed in our analysis?
    • VC0028: Two entries
      • Fold change: 1.27 for first, 1.65 for second
      • P-value: 0.0692 for first, 0.0474 for second
      • not significantly changed for first, significantly changed for second
    • VC0941: Two entries
      • Fold change: -0.28 and 0.09
      • P-value: 0.1636 and 0.6759
      • not significantly changed
    • VC0869: Five entries
      • Fold changes: 2.12, 1.50, 1.59, 1.95, 2.20
      • P-value: 0.02, 0.0174, 0.0463, 0.0227, 0.002
      • All are significantly changed
    • VC0051: Two entries
      • Fold change: 1.89, 1.92
      • P-value: 0.016, 0.0139
      • Both are significantly changed
    • VC0647: Three entries
      • Fold change: -1.11, -0.94, -1.05
      • P-value: 0.0003, 0.0125, 0.0051
      • All significantly changed
    • VC0468:
      • Fold change: -0.17
      • P-value: 0.3350
      • Not significantly changed
    • VC2350:
      • Fold change: -2.40
      • P-value: 0.0130
      • This is significantly changed
    • VCA0583
      • Fold change: 1.06
      • P-value: 0.1011
      • Not significantly changed

  • Next I moved on to the part 2 page to continue the assignment, now using the programs GenMAPP and MAPPFinder. I copied the text from the OpenWetWare page onto this page and then edited it.

Map Onto Biological Pathways (GenMAPP & MAPPFinder)

Each time I launched GenMAPP, I needed to make sure that the correct Gene Database (.gdb) was loaded.

  • I looked in the lower left-hand corner of the window to see which Gene Database had been selected.
  • If I needed to change the Gene Database, I selected Data > Choose Gene Database. I navigated to the directory C:\GenMAPP 2 Data\Gene Databases and chose the correct one for my species.
  • For the exercise, I needed to download the appropriate Vibrio cholerae Gene Database.
    • Half of the class will use the Vc-Std_External_20090622.gdb Gene Database that was initially created by the Fall 2008 Biological Databases class.
    • Half of the class will use a new Vc-Std_External_20101022.gdb Gene Database that was created by Drs. Dahlquist and Dionisio a year later.
    • The members of a pair should each choose a different gene database.
      • I downloaded the Vc-Std_External_20090622.gdb Gene Database for this exercise.
  • I clicked on the link for the Gene Database to which I was assigned, downloaded the file, and saved it into the folder C:\GenMAPP 2 Data\Gene Databases, and extracted it.

GenMAPP Expression Dataset Manager Procedure

  • I launched the GenMAPP Program. I checked to make sure the correct Gene Database was loaded.
    • I looked in the lower, left-hand corner of the main GenMAPP Drafting Board window to see the name of the Gene Database that was loaded. If it was not the correct Gene Database or it said "No Gene Database", then I went to the Data > Choose Gene Database menu item and selected the Gene Database I needed to perform the analysis.
    • Remember, you and your partner are going to use different versions of the Vibrio cholerae Gene Database for this exercise.
  • I selected the Data menu from the main Drafting Board window and chose Expression Dataset Manager from the drop-down list. The Expression Dataset Manager window opened.
  • I selected New Dataset from the Expression Datasets menu. I selected the tab-delimited text file that I formatted for GenMAPP (.txt) in the procedure above from the file dialog box that appeared.
    • I needed to download my .txt file from the wiki onto my Desktop.
  • The Data Type Specification window appeared. GenMAPP was expecting that I was providing numerical data. If any of my columns had text (character) data, I checked the box next to the field (column) name.
    • The Vibrio data I had been working with did not have any text (character) data in it.
  • I allowed the Expression Dataset Manager to convert my data.
    • This took a few minutes depending on the size of the dataset and the computer’s memory and processor speed. When the process was complete, the converted dataset was active in the Expression Dataset Manager window and the file was saved in the same folder the raw data file was in, named the same except with a .gex extension; for example, MyExperiment.gex.
    • A message appeared saying that the Expression Dataset Manager could not convert one or more lines of data. Lines that generated an error during the conversion of a raw data file were not added to the Expression Dataset. Instead, an exception file was created. The exception file was given the same name as my raw data file with .EX before the extension (e.g., MyExperiment.EX.txt). The exception file contained all of my raw data, with the addition of a column named ~Error~. This column contained either error messages or, if the program found no errors, a single space character.
      • Record the number of errors. For your journal assignment, open the .EX.txt file and use the Data > Filter > Autofilter function to determine what the errors were for the rows that were not converted. Record this information in your individual journal page.
        • Number of errors: 772.
      • It is likely that you will have a different number of errors than your partner who is using a different version of the Vibrio cholerae Gene Database. Which of you has more errors? Why do you think that is? Record your answers in your journal page.
        • I had more errors than my partner did; I had 772 errors and he had only 121. This could be due to the different databases we used: since he had a more recently updated version than I did, one would expect fewer errors with the newer version than with the older one.
      • Upload your exceptions file: EX.txt to your wiki page.
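
The error tally from the exceptions file can also be sketched in Python instead of using the Autofilter. This is a rough sketch under stated assumptions: the function name is hypothetical, the sample rows and the error message text "not found" are made up for illustration, and only the ~Error~ column name and the single-space convention for error-free rows come from the description above.

```python
import csv
from collections import Counter

def count_errors(lines):
    """Tally the messages in the ~Error~ column of a tab-delimited exceptions file."""
    reader = csv.DictReader(lines, delimiter="\t")
    tally = Counter()
    for row in reader:
        message = (row.get("~Error~") or "").strip()
        if message:  # rows without errors hold only a single space
            tally[message] += 1
    return tally

# Made-up rows standing in for an .EX.txt file (not real data)
sample = [
    "ID\tSystemCode\tPvalue\t~Error~\n",
    "VC0001\tN\t0.0100\t \n",
    "VC9998\tN\t0.5000\tnot found\n",
    "VC9999\tN\t0.2000\tnot found\n",
]
errors = count_errors(sample)

# With a real file, the same function could be given an open file handle:
# with open("MyExperiment.EX.txt", newline="") as handle:
#     errors = count_errors(handle)
```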

10/22/15 Protocol

Map Onto Biological Pathways (GenMAPP & MAPPFinder) (continued)

  • I customized the new Expression Dataset by creating new Color Sets which contained the instructions to GenMAPP for displaying data on MAPPs.
    • I created a Color Set by filling in the following fields in the Color Set area of the Expression Dataset Manager: a name for the Color Set, the gene value, and the criteria that determined how a gene object was colored on the MAPP. I entered a name of 20 characters or fewer in the Color Set Name field.
    • The Gene Value was the data displayed next to the gene box on a MAPP. I selected the column of data to be used as the Gene Value from the drop-down list; [none] was also an option. I used "Avg_LogFC_all" for the Vibrio dataset I just created.
    • I activated the Criteria Builder by clicking the New button.
    • I entered a name for the criterion in the Label in Legend field.
    • I chose a color for the criterion by left-clicking on the Color box. I chose a color from the Color window that appeared and clicked OK.
    • I stated the criterion for color-coding a gene in the Criterion field.
      • A criterion was stated with relationships such as "this column greater than this value" or "that column less than or equal to that value". Individual relationships could be combined using as many ANDs and ORs as needed. A typical relationship is
[ColumnName] RelationalOperator Value
with the column name always enclosed in brackets and character values enclosed in single quotes. For example:
[Fold Change] >= 2
[p value] < 0.05
[Quality] = 'high'
This was equivalent to the queries that I performed on the command line when working with the PostgreSQL movie database. GenMAPP was using a graphical user interface (GUI) to help the user format the queries correctly. The easiest and safest way to create criteria was by choosing items from the Columns and Ops (operators) lists shown in the Criteria Builder. The Columns list contained all of the column headings from my Expression Dataset. To choose a column from the list, I clicked on the column heading. It appeared at the location of the cursor in the Criterion box. The Criteria Builder surrounded the column names with brackets.
The Ops (operators) list contained the relational operators that were used in the criteria: equals ( = ), greater than ( > ), less than ( < ), greater than or equal to ( >= ), less than or equal to ( <= ), is not equal to ( <> ). To choose an operator from the list, I clicked on the symbol. It appeared at the location of the insertion bar (cursor) in the Criterion box. The Criteria Builder automatically surrounded the operators with spaces.
The Ops list also contained the conjunctions AND and OR, which were used to make compound criteria. For example:
[Fold Change] > 1.2 AND [p value] <= 0.05
Parentheses controlled the order of evaluation. Anything in parentheses was evaluated first. Parentheses could be nested. For example:
[Control Average] = 100 AND ([Exp1 Average] > 100 OR [Exp2 Average] > 100)
Column names could be used anywhere a value could be, for example:
[Control Average] < [Experiment Average]
  • After completing a new criterion, I added the criterion entry (label, criterion, and color) to the Criteria List by clicking the Add button.
    • For the Vibrio dataset, I created two criteria. "Increased" was [Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05 and "Decreased" was [Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05.
      • The buttons to the right of the list represented actions that could be performed on individual criteria. To modify a criterion label, color, or the criterion itself, I first selected the criterion in the list by left-clicking on it, and then clicked the Edit button. This put the selected criterion into the Criteria Builder to be modified. I clicked the Save button to save changes to the modified criterion; I clicked the Add button to add it to the list as a separate criterion. To remove a criterion from the list, I left-clicked on the criterion to select it, and then clicked on the Delete button. The order of Criteria in the list had significance to GenMAPP. When applying an Expression Dataset and Color Set to a MAPP, GenMAPP examined the expression data for a particular gene object and applied the color for the first criterion in the list that was true. Therefore, it was imperative that when criteria overlapped I put the most important or least inclusive criteria in the list first. To change the order of the criteria in the list, I left-clicked on the criterion to select it and then clicked the Move Up or Move Down buttons. No criteria met and Not found were always the last two positions in the list.
  • I saved the entire Expression Dataset by selecting Save from the Expression Dataset menu. Changes made to a Color Set were not saved until I did this.
  • I exited the Expression Dataset Manager to view the Color Sets on a MAPP. I chose Exit from the Expression Dataset menu or clicked the close box in the upper right hand corner of the window.
  • Upload your .gex file to your journal entry page for later retrieval.
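
The rule that GenMAPP applies the color of the first criterion in the list that is true can be sketched in Python. This is an illustration of the ordering rule only, not GenMAPP's actual code; the function name is hypothetical, and the two criteria are the "Increased" and "Decreased" criteria defined above.

```python
def pick_label(gene, criteria):
    """Return the label of the first criterion that is true for this gene."""
    for label, is_met in criteria:
        if is_met(gene):
            return label
    return "No criteria met"

# Order matters: the first matching criterion in the list wins.
criteria = [
    ("Increased", lambda g: g["Avg_LogFC_all"] > 0.25 and g["Pvalue"] < 0.05),
    ("Decreased", lambda g: g["Avg_LogFC_all"] < -0.25 and g["Pvalue"] < 0.05),
]

# Values taken from the sanity check entries for VC0647 and VCA0583
vc0647 = pick_label({"Avg_LogFC_all": -1.11, "Pvalue": 0.0003}, criteria)
vca0583 = pick_label({"Avg_LogFC_all": 1.06, "Pvalue": 0.1011}, criteria)
```

Because these two criteria cannot both be true for the same gene, their order does not change the result here; the ordering rule only matters when criteria overlap.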

MAPPFinder Procedure

Note: You and your partner will both do the same criterion, either "Increased" or "Decreased", but your group does not need to do both "Increased" and "Decreased". Sign up for the criterion you want on the group list (Fall 2010 or Fall 2013) so that we can make sure that as a class we are covering both criteria.

  • I did decreased.
  • I launched the MAPPFinder program (or from within GenMAPP, select Tools > MAPPFinder).
  • I made sure that the Gene Database for the correct species was loaded.
  • I clicked on the button "Calculate New Results".
  • I clicked on "Find File" and chose my Expression Dataset file, for example, "MyDataset.gex", and clicked OK.
  • I chose the Color Set and Criteria with which to filter the data. I clicked on "Decreased".
  • I checked the boxes next to "Gene Ontology" and "p value".
  • I clicked the "Browse" button and created a meaningful filename for my results.
  • I clicked "Run MAPPFinder". The analysis took several minutes. At first it looked like the computer had stalled, but with patience it eventually started running.
  • When the results were calculated, a Gene Ontology browser opened showing my results. All of the Gene Ontology terms that had at least 3 genes measured and a p value of less than 0.05 were highlighted yellow. A term with a p value less than 0.05 was considered a "significant" result.
  • To see a list of the most significant Gene Ontology terms, I clicked on the menu item "Show Ranked List".
    • List the top 10 Gene Ontology terms in your individual journal entry.
  1. protein folding
  2. aromatic amino acid family biosynthetic process
  3. chorismate metabolic process
  4. cytoplasm
  5. intracellular part
  6. locomotion
  7. unfolded protein binding
  8. signal transducer activity
  9. molecular transducer activity
  10. cis-trans isomerase activity
    • Compare your list with your partner who used a different version of the Gene Database. Are your terms the same or different? Why do you think that is? Record your answer in your individual journal entry.
      • Our terms differ because we used different versions of the Gene Database. The earlier database that I used had fewer gene entries, and therefore fewer gene IDs, than my partner Kevin's database. This may explain why the top 10 Gene Ontology terms differ between the two databases: the genes added to the later database may have shifted which terms ranked highest. The only GO term common to our two lists was protein folding.
  • One of the things I did in MAPPFinder was to find the Gene Ontology term(s) with which a particular gene was associated. First, in the main MAPPFinder Browser window, I clicked on the button "Collapse the Tree". Then, I searched for the genes that were mentioned by Merrell et al. (2002), VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583. I typed the identifier for one of these genes into the MAPPFinder browser gene ID search field. I chose "OrderedLocusNames" from the drop-down menu to the right of the search field. I clicked on the GeneID Search button. The GO term(s) that were associated with that gene were highlighted in blue. List the GO terms associated with each of those genes in your individual journal. (Note: they might not all be found.) Are they the same as your partner who is using a different Gene Database? Why or why not?
    • GO terms with VC0028:
      • No terms found.
    • GO terms with VC0941:
      • No terms found.
    • GO terms with VC0869:
      • No terms found.
    • GO terms with VC0051:
      • No terms found.
    • GO terms with VC0647:
      • mRNA catabolic process, RNA processing, RNA binding, 3'-5'-exoribonuclease activity, transferase activity, nucleotidyltransferase activity, polyribonucleotide nucleotidyltransferase activity
    • GO terms with VC0468
      • No terms found.
    • GO terms with VC2350:
      • No terms found.
    • GO terms with VCA0583:
      • outer membrane-bound periplasmic space
  • I clicked on one of the GO terms that were associated with one of the genes I looked up in the previous step. A MAPP opened listing all of the genes (as boxes) associated with that GO term. The genes named within the map were based on the UniProt identification system. To match the gene of interest to its identification I went to the UniProt site and typed in my gene ID into the search bar. Moreover, the genes on the MAPP were color-coded with the gene expression data from the microarray experiment. List in your journal entry the name of the GO term you clicked on and whether the expression of the gene you were looking for changed significantly in the experiment.
    • I clicked on the nucleotidyltransferase activity term. The expression of the gene I was looking for (entry: Q9KU76) decreased significantly, as shown by its green highlighting on the MAPP.
    • I double-clicked on the gene box. This opened an Internet Explorer window called the "Backpage" for this gene. This page had links to pages for this gene in the public databases. Click on the links to find out the function of this gene and record your answer in your individual journal page.
      • The function of the Q9KU67_VIBCH gene is to catalyze the DNA-template-directed extension of the 3'-end of an RNA strand by one nucleotide at a time.
    • The MAPP that had just been created was stored in the directory, C:\GenMAPP 2 Data\MAPPs\VC GO. Upload this file and link to it in your journal.
  • In Windows, I made a copy of my results (XXX-CriterionX-GO.txt) file.
    • "XXX" refers to the name I gave to my results file.
    • "CriterionX" refers to either "Criterion0" or "Criterion1". Since computers start counting at zero, "Criterion0" was the first criterion in the list I clicked on ("Increased" if you followed the directions) and "Criterion1" is the second criterion in the list I clicked on ("Decreased" if you followed the directions).
    • Upload your results file to your journal page.
  • I launched Microsoft Excel. I opened the copies of the .txt files in Excel. This showed me the same data that I saw in the MAPPFinder Browser, but in tabular form.
  • I looked at the top of the spreadsheet. There were rows of information that gave me the background information on how MAPPFinder made the calculations. Compare this information with your partner who used a different version of the Vibrio Gene Database. Which numbers are different? Why are they different? Record this information in your individual journal entry.
    • The numbers that differ are: the genes meeting the criterion linked to a GO term, the probes linked to a UniProt ID, the genes linked to a GO term, and the N and R values (numbers of distinct genes in the GO) on which the z score was based. These numbers differ because the two databases contain different numbers of genes: the updated version has had more genes added to it. As a result, the values calculated from each database are not the same.
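To see why a change in N and R alone shifts the z score, here is a minimal Python sketch. It assumes the z score formula published for MAPPFinder, a standardized difference between observed and expected counts; that assumption should be checked against the MAPPFinder help file. N and R are the values named in the results header. The names n (genes measured in one GO term) and r (genes meeting the criterion in that term), the function name, and all of the numbers are hypothetical and are not taken from my results.

```python
import math

def mappfinder_z(r, n, R, N):
    """Z score for one GO term under the assumed MAPPFinder formula.

    N: distinct genes measured in all GO terms
    R: distinct genes meeting the criterion in all GO terms
    n: genes measured in this GO term (hypothetical name)
    r: genes meeting the criterion in this GO term (hypothetical name)
    """
    expected = n * R / N
    variance = n * (R / N) * (1 - R / N) * (1 - (n - 1) / (N - 1))
    return (r - expected) / math.sqrt(variance)

# Same GO term counts (r, n) scored against two invented database sizes.
z_small_db = mappfinder_z(r=5, n=20, R=100, N=1000)
z_large_db = mappfinder_z(r=5, n=20, R=100, N=2000)
print(z_small_db, z_large_db)
```

With everything else held fixed, a larger N lowers the expected number of changed genes in the term, so the same observed count gives a higher z score.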
  • I filtered this list to show the top GO terms represented in my data for both the "Increased" and "Decreased" criteria. I needed to filter my list down to about 20 terms. I clicked on a cell in the row of headers for the data. Then I went to the Data menu and clicked "Filter > Autofilter". Drop-down arrows appeared in the row of headers. I then chose to filter the data. I clicked on the drop-down arrow for the column I wished to filter and chose "(Custom…)". A window opened giving me choices on how I wanted to filter. I set these two filters:
    • Z Score (in column N) greater than 2
    • PermuteP (in column O) less than 0.05
  • I used these two filters depending on the number of terms I had:
    • Number Changed (in column I) greater than or equal to 4 or 5 AND less than 100
    • Percent Changed (in column L) greater than or equal to 25-50%
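The same four filters can be applied outside of Excel. The following is a minimal Python sketch, not the method used in this protocol. It assumes the results table is tab-delimited with column headers named "GO Name", "Number Changed", "Percent Changed", "Z Score", and "PermuteP". The sample rows, the function name, and the default thresholds (4 and 25%) are hypothetical choices for illustration.

```python
import csv
import io

# Hypothetical excerpt of a results table (tab-delimited).
# Column names and values are illustrative assumptions, not real results.
SAMPLE = (
    "GO Name\tNumber Changed\tPercent Changed\tZ Score\tPermuteP\n"
    "term A\t6\t40.0\t3.1\t0.01\n"
    "term B\t2\t50.0\t2.5\t0.03\n"
    "term C\t150\t30.0\t4.0\t0.001\n"
    "term D\t8\t10.0\t2.2\t0.04\n"
    "term E\t5\t25.0\t1.5\t0.2\n"
)

def filter_go_terms(rows, min_changed=4, max_changed=100, min_percent=25.0):
    """Return the names of GO terms that pass all four filters."""
    kept = []
    for row in rows:
        if (float(row["Z Score"]) > 2
                and float(row["PermuteP"]) < 0.05
                and min_changed <= int(row["Number Changed"]) < max_changed
                and float(row["Percent Changed"]) >= min_percent):
            kept.append(row["GO Name"])
    return kept

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
filtered = filter_go_terms(rows)
print(filtered)
```

Raising `min_changed` or `min_percent` shortens the list, which mirrors adjusting the Number Changed and Percent Changed filters until about 20 terms remain.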
  • Are any of your filtered GO terms closely related to one another, meaning are they a direct child or parent to another term in the list? You can judge this by comparing your spreadsheet with the MAPPFinder browser. Highlight the terms that fit this relationship with the same color in your Excel spreadsheet. Upload your .xls file to your journal page.
  • Interpret your results. Look up the definitions for any GO terms that are unfamiliar to you. The "official" definitions for GO terms can be found at http://www.geneontology.org. You can use one of the online biological dictionaries as a supplement, if needed. Write a paragraph relating the results of this GO analysis to the experiment performed (comparing laboratory-grown and patient-derived Vibrio cholerae). You need to give a biological interpretation: what does each of the GO terms in your filtered list have to do with the pathogenicity of the bacterium? You may consult with your partner on this, but your explanation on your individual journal page needs to be in your own words. This is where the real "brain power" comes in with interpreting DNA microarray data. Even experienced scientists struggle with this part. Use your creativity as a scientist to stretch your brain in this question.
      • The data used in this exercise compares gene expression in patient-derived and lab-grown samples of the bacterium, so the genes that were repressed during pathogenic behavior have the lower expression of the two. I detected two prominent themes in the filtered GO term list: transducer activities, and processes such as the aromatic amino acid family biosynthetic process and the chorismate metabolic process. The signal transducer and molecular transducer activities are both responsible for making sure the correct actions are carried out in the cell. For example, the signal transducer makes sure that the right amount of energy is provided to the cell in a given environment. The transducer activities relate to pathogenicity because the bacterium requires different levels of ATP and glucose for energy depending on its location. When the bacterium is in a hostile environment, it needs more processes and activities to be turned on in order to meet the energy demands of maintaining homeostasis. The metabolic processes were the other main theme in the list, and they are very relevant to the bacterium. When the bacterium is in a host, and therefore in an undesirable environment, it must suppress its metabolic processes in order to limit the energy it expends. The aromatic amino acid family biosynthetic process and the chorismate metabolic process will most likely be decreased when the bacterium is in an energy-lacking environment, because the bacterium needs the aromatic amino acids and chorismate in order to stay functional and alive. The cytoplasm and intracellular part terms relate to the bacterium because locomotion describes the movement of the bacterium through these parts of host cells. It is important that the bacterium has a means of locomotion that allows it to move through any environment it encounters so that it can survive and obtain the nutrients it needs to stay alive.
  • There is one other file you need to save to your journal page. It has a .gmf extension and should be in the same folder as the .gex file that you created with the GenMAPP Expression Dataset Manager. You will need this file to re-open your results in MAPPFinder.


Loyola Marymount University: website

