Emilysimso Week 8

From LMU BioDB 2015

Part One

Directions

To scale and center the data (between-chip normalization), I performed the following operations:

  • Inserted a new worksheet into my Excel file and named it "scaled_centered".
  • Went back to the "compiled_raw_data" worksheet, selected all of it, and copied. Went to my new "scaled_centered" worksheet, clicked on the upper, left-hand cell (cell A1), and pasted.
  • Inserted two rows in between the top row of headers and the first data row.
  • In cell A2, typed "Average" and in cell A3, typed "StdDev".
  • Computed the Average log ratio for each chip (each column of data). In cell B2, typed the following equation:
=AVERAGE(B4:B5224)
  • and pressed "Enter".
  • Computed the Standard Deviation of the log ratios on each chip (each column of data). In cell B3, typed the following equation:
=STDEV(B4:B5224)
  • and pressed "Enter".

Copied these two equations (cells B2 and B3) and pasted them into the empty cells in the rest of the columns.

Copied the column headings for all of my data columns and then pasted them to the right of the last data column so that I had a second set of headers above blank columns of cells. Edited the names of the columns so that they read: A1_scaled_centered, A2_scaled_centered, etc.

  • In cell N4, typed the following equation:
=(B4-B$2)/B$3


Copied and pasted this equation into the entire column. Copied and pasted the scaling and centering equation for each of the columns of data with the "_scaled_centered" column header.
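The scaling and centering formula above can be sketched outside of Excel. This is only an illustration in Python, not part of the original workflow; the input values are hypothetical, and statistics.stdev is used because, like Excel's STDEV, it computes the sample standard deviation.

```python
from statistics import mean, stdev

def scale_and_center(column):
    """Return (x - average) / standard deviation for every value in one chip's column."""
    avg = mean(column)   # Excel: =AVERAGE(B4:B5224)
    sd = stdev(column)   # Excel: =STDEV(B4:B5224), the sample standard deviation
    return [(x - avg) / sd for x in column]  # Excel: =(B4-B$2)/B$3

# Hypothetical log ratios for a single chip, for illustration only
log_ratios = [0.5, -1.0, 1.5, 0.0, -1.0]
scaled = scale_and_center(log_ratios)
print(scaled)
```

After this step each chip's column should have an average of about 0 and a standard deviation of about 1, which is what makes the chips comparable to one another.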

I then performed statistical analysis on the ratios.

  • Inserted a new worksheet and named it "statistics".
  • Went back to the "scaled_centered" worksheet and copied the first column ("ID").
  • Pasted the data into the first column of my new "statistics" worksheet.
  • Went back to the "scaled_centered" worksheet and copied the columns that were designated "_scaled_centered".
  • Went back to my new worksheet and clicked on the B1 cell. Selected "Paste Special" from the Edit menu. A window opened: clicked on the radio button for "Values" and clicked OK.
  • Deleted Rows 2 and 3 where it said "Average" and "StdDev" so that my data rows with gene IDs were immediately below the header row 1.
  • Typed the headers "Avg_LogFC_A", "Avg_LogFC_B", and "Avg_LogFC_C" into the top cells of the next three empty columns on the right of my worksheet.
  • Computed the average log fold change for the replicates for each patient by typing the equation:
=AVERAGE(B2:E2)
  • into cell N2. Copied this equation and pasted it into the rest of the column.

Created the equation for patients B and C and pasted it into their respective columns.

Typed the header "Avg_LogFC_all" into the first cell in the next empty column. Created the equation that computed the average of the three previous averages I calculated and pasted it into this entire column.
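The two levels of averaging described above (replicates within each patient, then the three patient averages) can be sketched as follows. This is a hypothetical illustration in Python; the replicate values and the assumption of four replicates per patient are made up for the example.

```python
from statistics import mean

def average_log_fold_changes(replicates_by_patient):
    """Average the replicates for each patient, then average those patient averages."""
    per_patient = {patient: mean(values)
                   for patient, values in replicates_by_patient.items()}
    overall = mean(list(per_patient.values()))  # plays the role of Avg_LogFC_all
    return per_patient, overall

# Hypothetical scaled and centered log ratios for one gene
one_gene = {
    "A": [0.2, 0.4, 0.6, 0.8],
    "B": [1.0, 1.0, 1.0, 1.0],
    "C": [-0.5, -0.5, 0.5, 0.5],
}
per_patient, overall = average_log_fold_changes(one_gene)
print(per_patient, overall)
```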

Inserted a new column next to the "Avg_LogFC_all" column that I computed in the previous step. Labeled the column "Tstat". This computed a T statistic that told me whether the scaled and centered average log ratio was significantly different from 0 (no change). Entered the equation:

=AVERAGE(N2:P2)/(STDEV(N2:P2)/SQRT(number of replicates))

(NOTE: in this case the number of replicates is 3) Copied the equation and pasted it into all rows in that column.
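As a rough check on the Tstat formula, the same calculation can be written in a few lines of Python. This is only a sketch; the three patient averages below are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(values):
    """One-sample T statistic against 0: average / (standard deviation / sqrt(n))."""
    n = len(values)  # number of replicates, 3 in this analysis
    return mean(values) / (stdev(values) / sqrt(n))

# Hypothetical Avg_LogFC_A, Avg_LogFC_B, and Avg_LogFC_C values for one gene
patient_averages = [0.8, 1.0, 1.2]
print(t_statistic(patient_averages))
```

A large absolute value means the average is far from 0 relative to how much the three patient averages disagree with each other.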

Labeled the top cell in the next column "Pvalue". In the cell below the label, entered the equation:

=TDIST(ABS(R2),degrees of freedom,2)

The number of degrees of freedom is the number of replicates minus one, so in my case there were 2 degrees of freedom. Copied the equation and pasted it into all rows in that column.
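For the special case used here (2 degrees of freedom), the two-tailed p value has a simple closed form, p = 1 - |t| / sqrt(2 + t^2), so it can be sketched in Python without a statistics package. This shortcut is an assumption that holds only for exactly 2 degrees of freedom; any other number of replicates would need a proper t distribution function.

```python
from math import sqrt

def two_tailed_p_df2(t):
    """Two-tailed p value for Student's t distribution with exactly 2 degrees of freedom."""
    return 1.0 - abs(t) / sqrt(2.0 + t * t)

print(two_tailed_p_df2(0.0))    # a T statistic of 0 gives p = 1.0
print(two_tailed_p_df2(4.303))  # close to 0.05, the usual critical value for 2 degrees of freedom
```

The second line shows why p values below 0.05 are hard to reach with only three replicates: the T statistic has to exceed about 4.3 in absolute value.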

Calculated the Bonferroni p value Correction

Labeled the next two columns to the right with the same label, Bonferroni_Pvalue. Typed the equation

=S2*5221

Upon completion of this single computation, used the trick to copy the formula throughout the column.

Replaced any corrected p value that is greater than 1 by the number 1 by typing the following formula into the first cell below the second Bonferroni_Pvalue header:

=IF(T2>1,1,T2)

Used the trick to copy the formula throughout the column.
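The two Bonferroni columns (multiply by the number of tests, then cap at 1) amount to the following sketch in Python. The p values are hypothetical; in my spreadsheet the number of tests was 5221.

```python
def bonferroni(p_values):
    """Multiply each p value by the number of tests and replace anything above 1 with 1."""
    n = len(p_values)  # 5221 in the spreadsheet
    return [min(p * n, 1.0) for p in p_values]  # Excel: =S2*5221, then =IF(T2>1,1,T2)

# Hypothetical unadjusted p values
print(bonferroni([0.001, 0.02, 0.5, 0.9]))
```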

Calculated the Benjamini & Hochberg p value Correction

  • Inserted a new worksheet named "B-H_Pvalue".
  • Copied and pasted the "ID" column from my previous worksheet into the first column of the new worksheet.
  • Inserted a new column on the very left and named it "MasterIndex".
  • Typed a "1" in cell A2 and a "2" in cell A3.
  • Selected both cells. Hovered my mouse over the bottom-right corner of the selection until it made a thin black + sign. Double-clicked on the + sign to fill the entire column with a series of numbers from 1 to 5221 (the number of genes on the microarray).
  • For the following, used Paste special > Paste values. Copied my unadjusted p values from my previous worksheet and pasted it into Column C.
  • Selected all of columns A, B, and C and sorted them by ascending values in Column C: clicked the A-to-Z sort button on the toolbar and, in the window that appeared, sorted by Column C, smallest to largest.
  • Typed the header "Rank" in cell D1. Typed "1" into cell D2 and "2" into cell D3. Selected both cells D2 and D3. Double-clicked on the plus sign on the lower right-hand corner of my selection to fill the column with a series of numbers from 1 to 5221.
  • Typed B-H_Pvalue in cell E1. Typed the following formula in cell E2:
=(C2*5221)/D2 
  • and pressed enter. Copied that equation to the entire column.
  • Typed "B-H_Pvalue" into cell F1.
  • Typed the following formula into cell F2:
=IF(E2>1,1,E2) 
  • and pressed enter. Copied that equation to the entire column.
  • Selected columns A through F. Sorted them by my MasterIndex in Column A in ascending order.
  • Copied column F and used Paste special > Paste values to paste it into the next column on the right of my "statistics" sheet.
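The sort, rank, correct, and re-sort steps above can be sketched in Python as follows. This mirrors the worksheet procedure only; the usual textbook form of the Benjamini and Hochberg adjustment adds a further step that keeps the adjusted values from decreasing as the rank increases, and that step is not part of the procedure I followed. The p values are hypothetical.

```python
def benjamini_hochberg_as_in_sheet(p_values):
    """Compute (p * n) / rank for each p value, cap at 1, and keep the original order."""
    n = len(p_values)  # 5221 in the spreadsheet
    # Sort positions by p value; the positions play the role of MasterIndex
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # Excel: =(C2*5221)/D2, then =IF(E2>1,1,E2)
        adjusted[i] = min(p_values[i] * n / rank, 1.0)
    return adjusted  # already back in MasterIndex order

# Hypothetical unadjusted p values
print(benjamini_hochberg_as_in_sheet([0.04, 0.01, 0.03, 0.5]))
```

Because the smallest p value is divided by rank 1, its corrected value is the same as its Bonferroni value; larger p values are divided by larger ranks, which is why this correction is less stringent overall.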

Prepared file for GenMAPP

  • Inserted a new worksheet and named it "forGenMAPP".
  • Went to the "statistics" worksheet and hit Select All and Copy.
  • Went to my new sheet and clicked on cell A1 and selected Paste Special, clicked on the Values radio button, and clicked OK.
  • Selected Columns B through Q (all the fold changes). Selected the menu item Format > Cells. Under the number tab, selected 2 decimal places. Clicked OK.
  • Selected all the columns containing p values. Selected the menu item Format > Cells. Under the number tab, selected 4 decimal places. Clicked OK.
  • Deleted the left-most Bonferroni p value column, preserving the one that showed the result of my "if" statement.
  • Inserted a column to the right of the "ID" column. Typed the header "SystemCode" into the top cell of this column. Filled the entire column (each cell) with the letter "N".
  • Selected the menu item File > Save As, and chose "Text (Tab-delimited) (*.txt)" from the file type drop-down menu.
  • Uploaded both the .xls and .txt files that I created to my journal page in the class wiki.
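The layout of the tab-delimited file can be illustrated with a short Python sketch. The gene IDs and numbers below are hypothetical, and only two data columns are shown; the real file contains all of the columns from the "statistics" worksheet.

```python
import csv
import io

def to_genmapp_text(rows):
    """Build tab-delimited text with a SystemCode column of "N" after the ID column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["ID", "SystemCode", "Avg_LogFC_all", "Pvalue"])
    for gene_id, log_fc, p_value in rows:
        # 2 decimal places for fold changes, 4 decimal places for p values
        writer.writerow([gene_id, "N", f"{log_fc:.2f}", f"{p_value:.4f}"])
    return buffer.getvalue()

# Hypothetical rows for illustration only
text = to_genmapp_text([("VC0001", 0.1234, 0.34561), ("VC0002", -0.4, 0.0123)])
print(text)
```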

Sanity Check: Number of genes significantly changed

  • Opened my spreadsheet and went to the "forGenMAPP" tab.
  • Clicked on cell A1 and selected the menu item Data > Filter > Autofilter.
  • Clicked on the drop-down arrow on my "Pvalue" column. Selected "Custom". In the window that appeared, I set a criterion that filtered my data so that the Pvalue had to be less than 0.05.
    • How many genes have p value < 0.05? and what is the percentage (out of 5221)?
    • What about p < 0.01? and what is the percentage (out of 5221)?
    • What about p < 0.001? and what is the percentage (out of 5221)?
    • What about p < 0.0001? and what is the percentage (out of 5221)?
  • I performed 5221 T tests for significance. To apply a more stringent criterion to my p values, I performed the Bonferroni and Benjamini and Hochberg corrections to these unadjusted p values. The Bonferroni correction is very stringent. The Benjamini-Hochberg correction is less stringent. To see this relationship, I filtered my data to determine the following:
    • How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 5221)?
    • How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 5221)?
    • Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "Avg_LogFC_all" column to show all genes with an average log fold change greater than zero. How many are there? (and %)
    • Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "Avg_LogFC_all" column to show all genes with an average log fold change less than zero. How many are there? (and %)
    • What about an average log fold change of > 0.25 and p < 0.05? (and %)
    • Or an average log fold change of < -0.25 and p < 0.05? (and %) (These are more realistic values for the fold change cut-offs because it represents about a 20% fold change which is about the level of detection of this technology.)
    • What criteria did Merrell et al. (2002) use to determine a significant gene expression change? How does it compare to our method?
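Each of the filter questions above is a count plus a percentage of the total number of genes. A minimal sketch in Python, using four hypothetical genes in place of the 5221 on the microarray:

```python
def count_and_percent(genes, predicate):
    """Count the genes that pass a filter and express the count as a percentage of all genes."""
    hits = sum(1 for gene in genes if predicate(gene))
    return hits, 100.0 * hits / len(genes)

# Hypothetical rows standing in for the "forGenMAPP" worksheet
genes = [
    {"Pvalue": 0.03, "Avg_LogFC_all": 0.40},
    {"Pvalue": 0.20, "Avg_LogFC_all": -0.10},
    {"Pvalue": 0.01, "Avg_LogFC_all": -0.30},
    {"Pvalue": 0.60, "Avg_LogFC_all": 0.05},
]

print(count_and_percent(genes, lambda g: g["Pvalue"] < 0.05))
print(count_and_percent(genes, lambda g: g["Pvalue"] < 0.05 and g["Avg_LogFC_all"] < -0.25))
```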

Sanity Check: Compare individual genes with known data

  • Merrell et al. (2002) report that genes with IDs: VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583 were all significantly changed in their data. Look these genes up in your spreadsheet. What are their fold changes and p values? Are they significantly changed in our analysis?

Notes

Opened up the spreadsheet for the first time

Scaled and centered the data

  • It was important to always reference back to the same rows by putting a dollar sign in front of the row number in B2 and B3 (becoming B$2 and B$3), because every cell in the column has to be compared to these same two values (the average and the standard deviation).

Performed statistical analysis on the data (averages, StdDev, Tstat, Pvalue, Bonferroni Pvalue)

File after work day on October 15, 2015 - File:Copy of Merrell Compiled Raw Data Vibrio ES 20151015.xls

Updated Files

Questions

  • 99 genes have a P value < 0.05 - 1.90%
  • 19 genes have a P value < 0.01 - 0.364%
  • 1 gene has a P value < 0.001 - 0.0192%
  • No genes have a P value < 0.0001 - 0%
    • These values do not match with known values given in instructions from Part 1

Created a new spreadsheet

Responses to Questions

  • 361 genes have a P value < 0.05 - 6.91%
  • 94 genes have a P value < 0.01 - 1.80%
  • 12 genes have a P value < 0.001 - 0.230%
  • 1 gene has a P value < 0.0001 - 0.0192%
  • No genes have a Bonferroni-corrected P value < 0.05 - 0%
  • No genes have a B-H corrected P value < 0.05 - 0%
  • 149 genes have a Pvalue < 0.05 and an Avg_LogFC_all > 0 - 2.85%
  • 212 genes have a Pvalue < 0.05 and an Avg_LogFC_all < 0 - 4.06%
  • No genes have a Pvalue < 0.05 and an Avg_LogFC_all > 0.25 - 0%
  • 6 genes have a Pvalue < 0.05 and an Avg_LogFC_all < -0.25 - 0.115%

Merrell et al. (2002) used the Significance Analysis of Microarrays (SAM) program. They ran a two-class SAM analysis that compared the strain grown in vitro with each individual sample. To be counted as significant, a gene also had to show at least a two-fold change. This is different from our analysis in that we did not apply a two-fold cut-off, but used averages of the data and a T statistic. We also were only looking at one data set, whereas it seems that they were comparing two data sets.

Sanity Check

  • VC0028 - Average Fold Change = 0.491200496, Pvalue = 0.721747872
  • VC0941 - Average Fold Change = -0.28389344, Pvalue = 0.608705768
  • VC0869 - Average Fold Change = 0.611346696, Pvalue = 0.722253317
  • VC0051 - Average Fold Change = 0.287569838, Pvalue = 0.554243951
  • VC0647 - Average Fold Change = 0.185860752, Pvalue = 0.657673082
  • VC0468 - Average Fold Change = 1.327859222, Pvalue = 0.251249028
  • VC2350 - Average Fold Change = -0.116594488, Pvalue = 0.29391318
  • VCA0583 - Average Fold Change = 0.187393646, Pvalue = 0.809011547

According to my analysis, these genes are not statistically significant.

Part Two

Directions

Taken from Sample Minds Part 2

  • Launched the GenMAPP Program and checked to make sure the correct Gene Database was loaded.
  • Looked in the lower, left-hand corner of the main GenMAPP Drafting Board window to see the name of the Gene Database that was loaded.
  • My partner and I used different versions of the Vibrio cholerae Gene Database for this exercise (I used 2009).
  • Selected the Data menu from the main Drafting Board window and chose Expression Dataset Manager from the drop-down list. The Expression Dataset Manager window opened.
  • Selected New Dataset from the Expression Datasets menu. Selected the tab-delimited text file that I formatted for GenMAPP (.txt) in the procedure above from the file dialog box that appeared.
  • The Data Type Specification window appeared.
    • Note: The Vibrio data I worked with did not have any text (character) data in it.
  • Allowed the Expression Dataset Manager to convert my data.
  • When the process was complete, the converted dataset was active in the Expression Dataset Manager window and the file saved in the same folder the raw data file was in, named the same except with a .gex extension; for example, MyExperiment.gex.
  • Recorded the number of errors for my version of the database and my partner's
  • Uploaded my exceptions file: EX.txt to my wiki page.
  • Customized the new Expression Dataset by creating new Color Sets which contain the instructions to GenMAPP for displaying data on MAPPs.
    • Created a Color Set by filling in the following different fields in the Color Set area of the Expression Dataset Manager: a name for the Color Set, the gene value, and the criteria that determine how a gene object is colored on the MAPP. Entered a name in the Color Set Name field that is 20 characters or fewer.
  • The Gene Value is the data displayed next to the gene box on a MAPP. Selected the column of data to be used as the Gene Value from the drop down list or select [none]. I used "Avg_LogFC_all" for the Vibrio dataset I created.
  • Activated the Criteria Builder by clicking the New button.
  • Entered a name for the criterion in the Label in Legend field.
  • Chose a color for the criterion by left-clicking on the Color box. Chose a color from the Color window that appeared and clicked OK.
  • Stated the criterion for color-coding a gene in the Criterion field.
    • A criterion is stated with relationships such as "this column greater than this value" or "that column less than or equal to that value". Individual relationships can be combined using as many ANDs and ORs as needed. A typical relationship is
[ColumnName] RelationalOperator Value
  • with the column name always enclosed in brackets and character values enclosed in single quotes. For example:
[Fold Change] >= 2
[p value] < 0.05
[Quality] = 'high'

The easiest and safest way to create criteria is by choosing items from the Columns and Ops (operators) lists shown in the Criteria Builder. The Columns list contains all of the column headings from my Expression Dataset. To choose a column from the list, I clicked on the column heading. It appeared at the location of the cursor in the Criterion box. The Criteria Builder surrounded the column names with brackets.

The Ops (operators) list contains the relational operators that may be used in the criteria: equals ( = ) greater than ( > ), less than ( < ), greater than or equal to ( >= ), less than or equal to ( <= ), is not equal to ( <> ).

The Ops list also contains the conjunctions AND and OR, which may be used to make compound criteria. For example:

[Fold Change] > 1.2 AND [p value] <= 0.05

Parentheses control the order of evaluation. Anything in parentheses is evaluated first. Parentheses may be nested. For example:

[Control Average] = 100 AND ([Exp1 Average] > 100 OR [Exp2 Average] > 100)

Column names may be used anywhere a value can, for example:

[Control Average] < [Experiment Average]
  • After completing a new criterion, I added the criterion entry (label, criterion, and color) to the Criteria List by clicking the Add button.
  • For the Vibrio dataset, I created two criteria. "Increased" was [Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05 and "Decreased" was [Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05.
  • Note: The buttons to the right of the list represent actions that can be performed on individual criteria. To modify a criterion label, color, or the criterion itself, first select the criterion in the list by left-clicking on it, and then click the Edit button. This puts the selected criterion into the Criteria Builder to be modified. Click the Save button to save changes to the modified criterion; click the Add button to add it to the list as a separate criterion. To remove a criterion from the list, left-click on the criterion to select it, and then click on the Delete button. The order of Criteria in the list has significance to GenMAPP. When applying an Expression Dataset and Color Set to a MAPP, GenMAPP examines the expression data for a particular gene object and applies the color for the first criterion in the list that is true. Therefore, it is imperative that when criteria overlap the user put the most important or least inclusive criteria in the list first. To change the order of the criteria in the list, left-click on the criterion to select it and then click the Move Up or Move Down buttons. No criteria met and Not found are always the last two positions in the list.
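The way the criteria list is applied (the first criterion in the list that is true determines the color) can be sketched in Python. This only illustrates the ordering rule described in the note above, not how GenMAPP is actually implemented; the gene values are hypothetical.

```python
# The two criteria I created, in list order; each pairs a legend label with a test
CRITERIA = [
    ("Increased", lambda g: g["Avg_LogFC_all"] > 0.25 and g["Pvalue"] < 0.05),
    ("Decreased", lambda g: g["Avg_LogFC_all"] < -0.25 and g["Pvalue"] < 0.05),
]

def classify(gene):
    """Return the label of the first criterion in the list that is true for this gene."""
    for label, test in CRITERIA:
        if test(gene):
            return label
    return "No criteria met"

# Hypothetical genes for illustration only
print(classify({"Avg_LogFC_all": -0.40, "Pvalue": 0.01}))
print(classify({"Avg_LogFC_all": 0.10, "Pvalue": 0.01}))
```

With overlapping criteria, moving an entry up or down in the list would change which label a gene receives, which is why the order matters.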

Saved the entire Expression Dataset by selecting Save from the Expression Dataset menu. Changes made to a Color Set were not saved until I did this.

Exited the Expression Dataset Manager to view the Color Sets on a MAPP. I chose Exit from the Expression Dataset menu.

Uploaded my .gex file to my journal entry page for later retrieval.

MAPPFinder Procedure

  • Launched the MAPPFinder program (or from within GenMAPP, select Tools > MAPPFinder).
  • Made sure that the Gene Database for the correct species was loaded. The name of the Gene Database appeared at the bottom of the window.
  • Clicked on the button "Calculate New Results".
  • Clicked on "Find File", chose my Expression Dataset file (for example, "MyDataset.gex"), and clicked OK.
  • Chose the Color Set and Criteria with which to filter the data. Clicked on either the "Increased" or the "Decreased" criterion in the right-hand box, depending on which one my group was doing. (Both could be selected by holding down the Control key while clicking.)
  • Checked the boxes next to "Gene Ontology" and "p value".
  • Clicked the "Browse" button and created a meaningful filename for my results.
  • Clicked "Run MAPPFinder".
  • When the results were calculated, a Gene Ontology browser opened showing my results. All of the Gene Ontology terms that had at least 3 genes measured and a p value of less than 0.05 were highlighted yellow. A term with a p value less than 0.05 is considered a "significant" result. Browsed through the tree to see my results.
  • To see a list of the most significant Gene Ontology terms, clicked on the menu item "Show Ranked List".
  • Listed the top 10 Gene Ontology terms in my individual journal entry.
  • Compared my list with my partner who used a different version of the Gene Database.

One of the things you can do in MAPPFinder is to find the Gene Ontology term(s) with which a particular gene is associated.

  • Clicked on the button "Collapse the Tree". Then, I searched for the genes that were mentioned by Merrell et al. (2002), VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583. Typed the identifier for one of these genes into the MAPPFinder browser gene ID search field. Chose "OrderedLocusNames" from the drop-down menu to the right of the search field. Clicked on the GeneID Search button. The GO term(s) that are associated with that gene were highlighted in blue. Listed the GO terms associated with each of those genes in my individual journal.
  • Clicked on one of the GO terms that are associated with one of the genes I looked up in the previous step. A MAPP opened listing all of the genes (as boxes) associated with that GO term. The genes named within the MAPP are based on the UniProt identification system. To match the gene of interest to its identifier, I went to the UniProt site and typed my gene ID into the search bar. Moreover, the genes on the MAPP were color-coded with the gene expression data from the microarray experiment. Listed in my journal entry the name of the GO term I clicked on and whether the expression of the gene I was looking for changed significantly in the experiment.
  • Double-clicked on the gene box. This opened an Internet Explorer window called the "Backpage" for this gene. This page had links to pages for this gene in the public databases. Clicked on the links to find out the function of this gene and recorded my answer in my individual journal page.
  • The MAPP that was just created was stored in the directory, C:\GenMAPP 2 Data\MAPPs\VC GO. Uploaded this file and linked to it in my journal.
    • In Windows, made a copy of my results (XXX-CriterionX-GO.txt) file.
      • "XXX" refers to the name I gave to my results file.
      • "CriterionX" refers to either "Criterion0" or "Criterion1". Since computers start counting at zero, "Criterion0" is the first criterion in the list I clicked on ("Increased" if you followed the directions) and "Criterion1" is the second criterion in the list I clicked on ("Decreased" if you followed the directions).
  • Uploaded my results file to my journal page.
  • Launched Microsoft Excel. Opened the copies of the .txt files in Excel.
  • Looked at the top of the spreadsheet. There are rows of information that give you the background information on how MAPPFinder made the calculations. Compared this information with my partner who used a different version of the Vibrio Gene Database.
  • Filtered this list to show the top GO terms represented in my data for both the "Increased" and "Decreased" criteria. Clicked on a cell in the row of headers for the data. Went to the Data menu and clicked "Filter > Autofilter". Drop-down arrows appeared in the row of headers. Clicked on the drop-down arrow for the column I chose to filter and chose "(Custom…)". Set these two filters:
Z Score (in column N) greater than 2
PermuteP (in column O) less than 0.05

Used these two filters depending on the number of terms:

Number Changed (in column I) greater than or equal to 4 or 5 AND less than 100
Percent Changed (in column L) greater than or equal to 25-50%
  • Saved my changes to an Excel spreadsheet. Selected File > Save As and selected Excel workbook (.xls) from the drop-down menu.
  • Highlighted the terms that fit a close relationship with the same color in my Excel spreadsheet. Uploaded my .xls file to my journal page.
  • Interpreted my results. Looked up the definitions for any GO terms that I was unfamiliar with. The "official" definitions for GO terms can be found at http://www.geneontology.org. Wrote a paragraph relating the results of this GO analysis to the experiment performed.
  • Saved the file with a .gmf extension (in the same folder as the .gex file that I created with the GenMAPP Expression Dataset Manager). Needed this file to re-open my results in MAPPFinder.
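The Excel filters in the steps above can be written as a single test per row. The sketch below is in Python and uses hypothetical rows; the "GO Name" key is an assumed column name, and the cut-offs of 4 for Number Changed and 25 for Percent Changed are just one choice from the ranges given in the directions.

```python
def passes_filters(row, min_changed=4, min_percent=25.0):
    """Apply the Z Score, PermuteP, Number Changed, and Percent Changed filters to one row."""
    return (row["Z Score"] > 2
            and row["PermuteP"] < 0.05
            and min_changed <= row["Number Changed"] < 100
            and row["Percent Changed"] >= min_percent)

# Hypothetical rows standing in for the MAPPFinder results file
rows = [
    {"GO Name": "hypothetical term 1", "Z Score": 3.1, "PermuteP": 0.01,
     "Number Changed": 6, "Percent Changed": 40.0},
    {"GO Name": "hypothetical term 2", "Z Score": 1.5, "PermuteP": 0.01,
     "Number Changed": 8, "Percent Changed": 35.0},
    {"GO Name": "hypothetical term 3", "Z Score": 2.8, "PermuteP": 0.02,
     "Number Changed": 2, "Percent Changed": 66.7},
]
kept = [row["GO Name"] for row in rows if passes_filters(row)]
print(kept)
```

In my own analysis I used a Percent Changed cut-off of 30%, which corresponds to calling passes_filters(row, min_percent=30.0).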

Conclusion

  • Wrote a paragraph that briefly summarized and gave a scientific conclusion for the work that I did for part 1 and 2 this week.

List of Files to Upload

  • My exceptions file from when I imported my data into GenMAPP: .EX.txt
  • My Expression Dataset file: .gex
  • My GO results file: XXX-CriterionX-GO.txt
  • My GO results saved as an Excel spreadsheet with filters applied: .xls
  • The MAPP I looked at: .mapp
  • The MAPPFinder GO mappings file: .gmf

Notes

  • Used the 2009 database
    • There were 772 errors in this data
  • My partner, Nicole, used the 2010 database
    • There were 121 errors in this data
  • It seems reasonable that there would be more errors in the 2009 version of the database, as the 2010 version would have corrected many of the past mistakes.
  • Here is the file EX.txt - File:Copy of Merrell Compiled Raw Data Vibrio ES 20151021.txt

Used red as the color for Increased and green for Decreased for the Color Sets.

Part of the "decreased" group

MAPPFinder Procedure Questions

Top 10 Gene Ontology terms (2009 database)

  1. cAMP metabolic process
  2. cAMP biosynthetic process
  3. adenylate cyclase activity
  4. cyclase activity
  5. cyclic nucleotide biosynthetic process
  6. phosphorus-oxygen lyase activity
  7. cyclic nucleotide metabolic process
  8. cellular component organization
  9. thiamin transport
  10. thiamin-transporting ATPase activity

Comparison with partner - Nicole's top 10 were the following (2010 database)

  1. glucose catabolic process
  2. hexose catabolic process
  3. glycolysis
  4. monosaccharide catabolic process
  5. cytoplasm
  6. alcohol catabolic process
  7. cellular carbohydrate catabolic process
  8. glucose metabolic process
  9. protein folding
  10. hexose metabolic process

It makes sense that these would be different, since the 2010 database has updated information.

Gene Ontology Terms (2009 database)

  • VC0028 - not found
  • VC0941 - not found
  • VC0869 - not found
  • VC0051 - not found
  • VC0647 - mRNA catabolic process, RNA processing, cytoplasm, RNA binding, 3'-5' exoribonuclease activity, transferase activity, nucleotidyltransferase activity, polynucleotide nucleotidyltransferase activity
  • VC0468 - not found
  • VC2350 - not found
  • VCA0583 - transport, outer membrane bounded periplasmic space, transporter activity

Comparison with partner - Nicole found the following terms (2010 database)

  • VC0028:
    • GO Terms: branched chain family amino acid biosynthetic process, cellular amino acid biosynthetic process, metabolic process, metal ion binding, iron-sulfur cluster binding, 4 iron, 4 sulfur cluster binding, catalytic activity, lyase activity, dihydroxy-acid dehydratase activity
  • VC0941:
    • GO Terms: glycine metabolic process, L-serine metabolic process, one-carbon metabolic process, cytoplasm, pyridoxal phosphate binding, catalytic activity, transferase activity, glycine hydroxymethyltransferase activity
  • VC0869:
    • GO Terms: glutamine metabolic process, purine nucleotide biosynthetic process, 'de novo' IMP biosynthetic process, cytoplasm, nucleotide binding, ATP binding, catalytic activity, ligase activity, phosphoribosylformylglycinamidine synthase activity
  • VC0051:
    • GO Terms: purine nucleotide biosynthetic process, 'de novo' IMP biosynthetic process, nucleotide binding, ATP binding, catalytic activity, lyase activity, carboxy-lyase activity, phosphoribosylaminoimidazole carboxylase activity
  • VC0647:
    • GO Terms: mRNA catabolic process, RNA processing, cytoplasm, mitochondrion, RNA binding, 3'-5'-exoribonuclease activity, transferase activity, nucleotidyltransferase activity, polyribonucleotide nucleotidyltransferase activity
  • VC0468:
    • GO Terms: glutathione biosynthetic process, metal ion binding, nucleotide binding, ATP binding, catalytic activity, ligase activity, glutathione synthase activity
  • VC2350:
    • GO Terms: deoxyribonucleotide catabolic process, metabolic process, cytoplasm, catalytic activity, lyase activity, deoxyribose-phosphate aldolase activity
  • VCA0583:
    • GO Terms: transport, outer membrane-bounded periplasmic space, transporter activity

Again, these were not the same between the 2009 and 2010 databases due to the changed information.

Clicked on "transporter activity" from VCA0583

Looked at the results file in Excel

  • It looked at percent changed, percent present, Z score, PermuteP, and the Adjusted P score
  • Comparison with partner - unable to do because my partner did not complete this part

Used the constraints given of Z Score (in column N) greater than 2 and PermuteP (in column O) less than 0.05. Also used Percent Changed (in column L) greater than or equal to 30%

Some of the GO terms were closely related. See color coding in following file: File:EmilysimsoMAPPFinder1-Criterion1-GO-Criterion1-GO - Copy.xls

Terms that required definitions

  • cyclase activity - Catalysis of a ring closure reaction
  • cyclic nucleotide biosynthetic process - The chemical reactions and pathways resulting in the formation of a cyclic nucleotide, a nucleotide in which the phosphate group is in diester linkage to two positions on the sugar residue
  • phosphorus-oxygen lyase activity - Catalysis of the cleavage of a phosphorus-oxygen bond by other means than by hydrolysis or oxidation, or conversely adding a group to a double bond
  • thiamin transport - Enables the transfer of thiamine from the outside of a cell to the inside across a membrane
  • capsule polysaccharide biosynthetic process - The chemical reactions and pathways resulting in the formation of polysaccharides that make up the capsule, a protective structure surrounding some species of bacteria and fungi

Note: Definitions taken from AmiGO 2

First, each of the GO terms in the filtered list was significantly changed over the course of the experiment, meaning they all have a role to play in the presence of V. cholerae. It appears that many of the GO terms deal with metabolism in some way (cAMP metabolic process, cyclic nucleotide metabolic process, glycerol-3-phosphate metabolic process, extracellular polysaccharide metabolic process), perhaps meaning that when infected with V. cholerae, the body is not able to create energy in an efficient way. This would perhaps help the spread of the disease because the body could not use its normal energy reserves to fight off the pathogen. This relates to the presence of "thiamin-transporting ATPase activity" on the list. ATP is an important energy source for the body, and if it cannot be transported, the body may be weakened against V. cholerae. The cyclic nucleotide biosynthetic process also deals with the linkage of sugars, which are incorporated in energy use. Thiamin is a vitamin, which could help with fighting off infection or maintaining the body, so if it were reduced, the body could also be weakened. Adenylate cyclase is another example of a part of the cellular energy homeostasis that naturally occurs. Again, if thrown off, the body would be compromised. Glycerol-3-phosphate helps with the movement of glycerol, a sugar alcohol. It appears that the dehydrogenase activity, metabolic process, and dehydrogenase complex of this compound are all affected by V. cholerae, signaling that this may be an important aspect of the disease's effect on the body. I think it is also interesting that translation was affected. If the body cannot produce new proteins, the items mentioned above cannot be remedied, meaning the body cannot protect itself against the pathogen. Finally, extracellular polysaccharides, which are carbohydrates, were altered, indicating that V. cholerae affects a wide range of body components, making it a more effective pathogen.

.gmf file - File:Copy of Merrell Compiled Raw Data Vibrio ES 20151020.gmf

Conclusion

Parts 1 and 2 of the above assignment looked at how microarray data analysis can be applied to a specific pathogen, in this case V. cholerae, to better understand how it interacts with the body.

In Part 1 of the assignment, we learned how to manipulate data to prepare it for analysis. This is an important step, since data must be in the proper format for a given program. We also looked at the Pvalues for various genes. This was important for realizing that not all of the genes in an organism are involved in a given action, which narrowed the focus for the next Part of the assignment. I think that this Part helped me see directly how statistics can be used as a tool by researchers in this field.

Part 2 looked at the specific components of a gene and how they relate to the effects of V. cholerae. By importing the data from Part 1 and inputting it into the various programs, we were able to see which pathways or actions are most significantly affected, which helped create some initial thoughts about how V. cholerae works in the body.

From this first look at V. cholerae through in-depth analysis, it appears that the pathogen attacks multiple pathways in the body, but focuses on metabolism and energy production. Perhaps this is to weaken the body to make it susceptible to greater infection and increase its effects. There are multiple genes involved in this interaction between the body and V. cholerae, and further investigation could be done as to how these genes interact, as well as their specific components. Microarray data analysis has great potential for researchers who look at disease, as it could lead to treatment plans in the future.


User: Emilysimso
