Taur.vil Week 8

Week 8 Individual Journal

Question Answers

Sanity Check

948 genes had a p-value <0.05.
235 genes had a p-value <0.01.
24 genes had a p-value <0.001.
2 genes had a p-value <0.0001.

352 of the genes with a p-value <0.05 had a positive Avg_LogFC_all value.
596 of the genes with a p-value <0.05 had a negative Avg_LogFC_all value.
339 of the genes with a p-value <0.05 had an Avg_LogFC_all greater than 0.25.
339 of the genes with a p-value <0.05 had an Avg_LogFC_all less than -0.25.
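
The tallies above can also be reproduced outside Excel. The following is only a minimal sketch, assuming each gene is available as a (p-value, Avg_LogFC_all) pair. The function name sanity_counts is hypothetical, and the three sample rows are the values listed for VC0028, VC0941, and VC2350 on this page, not the full dataset.

```python
def sanity_counts(genes, thresholds=(0.05, 0.01, 0.001, 0.0001), fc_cutoff=0.25):
    """Count genes under each p-value threshold and, among those with
    p < 0.05, count them by sign and size of the average log fold change."""
    # Number of genes whose p-value falls below each threshold
    below = {t: sum(1 for p, _ in genes if p < t) for t in thresholds}
    # Fold changes of the genes that pass the p < 0.05 filter
    significant = [fc for p, fc in genes if p < 0.05]
    return {
        "below": below,
        "positive": sum(1 for fc in significant if fc > 0),
        "negative": sum(1 for fc in significant if fc < 0),
        "above_cutoff": sum(1 for fc in significant if fc > fc_cutoff),
        "below_cutoff": sum(1 for fc in significant if fc < -fc_cutoff),
    }


# (p-value, Avg_LogFC_all) for VC0028, VC0941, VC2350 as listed on this page
rows = [(0.047, 1.653), (0.676, 0.093), (0.013, -2.402)]
counts = sanity_counts(rows)
print(counts)
```

A useful check is that the positive and negative counts add up to the number of genes with p < 0.05, as 352 + 596 = 948 does for the full dataset.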

In Merrell et al. (2002), the researchers used the SAM (Significance Analysis of Microarrays) program package to determine whether there were significant differences in gene expression. In their analysis, they required a 100% fold change, while we are using about a 20% fold change. Their stricter threshold decreases the chance of a false positive, but may also result in several false negatives.
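
To put the two thresholds on the same scale, a log fold change cutoff can be converted into a percent change. This is a small sketch that assumes the log ratios are base 2 (an assumption; the base is not stated on this page) and uses a hypothetical helper name, percent_change.

```python
def percent_change(log2_fc):
    """Convert a log2 fold change into a percent change relative to the reference.

    Assumes base-2 logs, which is an assumption made for this illustration.
    """
    return (2 ** log2_fc - 1) * 100


print(round(percent_change(0.25), 1))  # 18.9, i.e. "about a 20% fold change"
print(round(percent_change(1.0), 1))   # 100.0, i.e. a doubling
```

Under that assumption, the 0.25 cutoff used here corresponds to roughly a 19% change, while a 100% change corresponds to a log fold change of 1.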

VC0028: p-value=0.047; Average Fold Change=1.653
VC0941: p-value=0.676; Average Fold Change=0.093
VC0869: p-value=0.017; Average Fold Change=1.499
VC0051: p-value=0.014; Average Fold Change=1.922
VC0647: p-value<0.001; Average Fold Change=-1.113
VC0468: p-value=0.335; Average Fold Change=-0.169
VC2350: p-value=0.013; Average Fold Change=-2.402
VCA0583: p-value=0.101; Average Fold Change=1.063

The agreement between our results and those of Merrell et al. (2002) was mixed. Out of the eight genes, we found 5 to be significant, with a p-value less than 0.05 and an average fold change greater than 1 in magnitude. However, the other three were not significant (each had p>0.05).

Creating the GenMAPP

My run with the 2009 gene database reported 772 errors during the process of creating an expression dataset, while Hilda's run with the 2010 database reported 122. This is likely because more research was conducted and the database was updated during the intervening year.

MAPPFinder

Top ten gene ontologies:
1. protein folding, z score=5.887, permute p=0, adjusted p=0.001
2. chorismate metabolic process, z score=5.221, permute p=0, adjusted p=0.027
3. aromatic amino acid family biosynthetic process, z score=5.221, permute p=0, adjusted p=0.027
4. cytoplasm, z score=3.949, permute p=0, adjusted p=0.362
5. zinc ion binding, z score=3.617, permute p=0, adjusted p=0.941
6. intracellular part, z score=3.551, permute p=0, adjusted p=0.941
7. unfolded protein binding, z score=4.474, permute p=0.001, adjusted p=0.244
8. aromatic amino acid family metabolic process, z score=3.952, permute p=0.001, adjusted p=0.362
9. sugar:hydrogen symporter activity, z score=3.65, permute p=0.001, adjusted p=0.941
10. cation:sugar symporter activity, z score=3.65, permute p=0.001, adjusted p=0.941
- There was very little agreement between the gene ontologies that were ranked highest in the 2009 and 2010 documents. The only ones that appeared in both lists were Protein Folding and Cytoplasm.
- Part of this difference can likely be attributed to unclear entries in the initial data. For example, if the intracellular part in #6 was further identified and given a different name, then it could not be matched from one list to the other. Additionally, the roughly 650 genes (772 - 122) that were errors for the 2009 data but present in the 2010 data could account for some of the new genes, and the gene families could have been reorganized.
Examples from Merrell et al. (2002)
- VC0028: Was not found
- VC0941: Was not found
- VC0869: Was not found
- VC0051: Was not found
- VC0647: mRNA catabolic process, RNA processing, cytoplasm (an intracellular cell part), RNA binding, 3' to 5' exonuclease activity, nucleotidyltransferase activity (in subset of transferase activity), and polyribonucleotide nucleotidyltransferase activity.
- VC0468: Was not found
- VC2350: Was not found
- VCA0583: transport, transporter activity, and outer membrane-bounded periplasmic space
  - For genes that were found in the 2009 database, results agreed with those found by Hilda in the 2010 database. The one place where the two did not match was VC0647, which had the additional gene ontology of "mitochondrion" in the 2010 database; this is likely the result of new research.
  - Genes that were found in the 2010 file but not in the 2009 file were likely the result of new research or database updates in the intervening year.
Whether the expression of VC0647 changed:
- Determined PNP_VIBCH expression was decreased as the gene was highlighted green(*).
- The function of this gene is to assist in mRNA degradation by hydrolyzing single-stranded polyribonucleotides processively in the 3'- to 5'-direction (information found on uniprot).
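
The green highlight follows from the color-set criteria recorded in the digital notebook ([AvgLogFC_all]<-0.25 AND [Pvalue]<0.05 for decreases, the mirrored rule for increases). As a rough sketch, that rule can be written as a small function. The name classify is hypothetical, and the example values are the p-values and average fold changes listed in the Sanity Check section.

```python
def classify(avg_logfc, pvalue, fc_cutoff=0.25, p_cutoff=0.05):
    """Return the color a gene would get under the color-set criteria.

    red    = increased (AvgLogFC_all > 0.25 and p < 0.05)
    green  = decreased (AvgLogFC_all < -0.25 and p < 0.05)
    yellow = no change (everything else)
    """
    if pvalue < p_cutoff and avg_logfc > fc_cutoff:
        return "red"
    if pvalue < p_cutoff and avg_logfc < -fc_cutoff:
        return "green"
    return "yellow"


# (Average Fold Change, p-value) pairs taken from the Sanity Check list
examples = {
    "VC0028": (1.653, 0.047),
    "VC0941": (0.093, 0.676),
    "VC2350": (-2.402, 0.013),
    "VCA0583": (1.063, 0.101),
}
for gene, (fc, p) in examples.items():
    print(gene, classify(fc, p))
```

VC0647, with a p-value below 0.001 and an average fold change of -1.113, falls under the "green" branch of the same rule.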

Criterion Files

Criterion File Comparison
- The two criterion files differed slightly in names and had different dates.
- Almost all of the calculation summaries had different values:
  - 579 probes met the significance criteria in the 2009 data and 578 in the 2010 data.
  - 474 of the 2009 probes that met the filter linked with uniprot while 575 of the 2010 probes did so.
  - For the 2009 data, 255 genes meeting the criteria linked to a gene ontology term. 330 genes did so for the 2010 data.
  - Both data sets had 5221 probes in the dataset.
  - For the 2009 data, 4449 of these probes linked to uniprot while 5100 of them linked in the 2010 analysis.
  - 1990 of the 2009 genes linked to a gene ontology term and 2475 of the 2010 genes did so.
  - The numbers the z-score was based on reflected changes in these earlier values.
- These numbers are different because the database was updated between 2009 and 2010, increasing the number of genes that could be matched between probes in our dataset and uniprot's system.
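
The size of the database update is easier to see as a percentage of probes linked. This is a short illustrative calculation using the probe counts reported above; the helper name percent_linked is hypothetical.

```python
TOTAL_PROBES = 5221  # probes in the dataset, same for both years

# Probes that linked to uniprot, as reported in the two criterion files
linked_to_uniprot = {"2009": 4449, "2010": 5100}


def percent_linked(n_linked, total=TOTAL_PROBES):
    """Percentage of probes in the dataset that linked to the database."""
    return 100 * n_linked / total


for year, n in linked_to_uniprot.items():
    print(year, round(percent_linked(n), 1))  # 2009: 85.2, 2010: 97.7
```

The 772 unlinked probes for 2009 (5221 - 4449) match the number of errors reported when the expression dataset was created.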

Concluding Paragraph

Using microarray data, we examined differences in gene expression between a non-infectious, laboratory-grown Vibrio cholerae and a patient-derived, infective version. In particular, the work first determined overall differences in gene expression and then focused on the sixteen gene ontologies (GOs) that showed the most dramatic change in expression between conditions. These sixteen clustered into two family groups of three ontologies, two groups of two, and six individual gene ontologies. One group of three GOs included sugar transmembrane transporter activity, carbohydrate transmembrane transporter activity, and protein-N(PI)-phosphohistidine-sugar phosphotransferase activity. Changes in these GOs suggest that one of the major shifts in the transition from benign to infectious is in the cross-membrane transport of carbohydrates. The other group of three included cation:sugar symporter activity, solute:hydrogen symporter activity, and sugar:hydrogen symporter activity. Like the first group of three GOs, these are groups of genes that bring sugars and ions across cell membranes. Together, these two GO families seem to suggest that the cell requires greater amounts of energy while infectious than under laboratory conditions.

One of the two-member GO families consisted of cis-trans isomerase activity and its child group, peptidyl-prolyl cis-trans isomerase activity. These two gene families each play a role in modifying protein structures within the cell. The other two-member family was concerned with translation regulation and contained a general GO for translation regulator activity and a child group that was specific to nucleic acid binding. This suggests that this family contains transcription factors (TFs) turned on when Vibrio cholerae becomes infectious, and these transcription factors may be what influences the differential expression in other GOs and drives the cellular change to an infectious pathogen.

The individual GOs were more general than the families. Changes were observed in protein binding (particularly unfolded proteins), endonuclease activity, glucose metabolic process, and aromatic compound biosynthetic process. The only specific one was the phosphoenolpyruvate-dependent sugar phosphotransferase system, which is another instance of Vibrio cholerae increasing its uptake of carbohydrates from the environment.

These different groups of GOs lead to a better idea of how Vibrio cholerae changes as it becomes infectious. First of all, new transcription factors are expressed that could potentially be the driving force for other changes in expression. Second, the import-export balance of the cell adjusts and its metabolism changes. This can be seen in the changes in symporter activity, transmembrane transporters, and glucose metabolism in general. There are also changes in protein folding and structure and changes in the synthesis of aromatic compounds. As a whole, these changes suggest that the infectious state has an accelerated metabolism and a different protein makeup to help it infect the host organism.

Document Files:

GO mapping file (.gmf file)

20 GO Terms and Relationships

Digital Notebook:

Part 1

Downloaded original data (Merrell_Compiled_Raw_Data_Vibrio.xls) from [[1]]
Observed that the data collected had already been log transformed (there were negative numbers)
- Meant we could begin at the normalization step
Created a new sheet in Excel, copied in data from previous data sheet and titled it "scaled_centered".
In scaled_centered, inserted two empty rows and calculated average and standard deviation for each replicate using the Excel AVERAGE and STDEV functions.
Created a new column for each of the samples, relabeling them with a _sc (for scaled centered) after the name. Filled these columns with the scaled centered values calculated by taking the raw data minus the average for the sample (row 2) divided by the standard deviation (row 3).
- ex: (B4-B$2)/B$3
- This process served to normalize the data.
Created a new worksheet called statistics and copied the ID column into the new worksheet.
Pasted (using values only) the scaled and centered columns.
Deleted the rows for average and standard deviation.
Inserted columns to the right of the data for the average log fold change (FC) of patient and calculated the value by taking the average of the three technical replicates.
Calculated the t-stat for each gene in a new column by taking the average of the three biological replicates divided by (the standard deviation of the biological replicates divided by the square root of the sample size, which was three).
- ex: AVERAGE(N2:P2)/(STDEV(N2:P2)/SQRT(3))
Calculated the p-value in a new column by using Excel's TDIST function
- ex: TDIST(ABS(R2),degrees of freedom,2)
  - R2 referred to the t-stat calculated earlier. There were 2 degrees of freedom (three replicates minus one).
Took an average FC for each of the three biological replicates in a new column.
Copied into a new page titled forGenMAPP and inserted column 2 (System Code) where N was entered for each row.
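
The Part 1 calculations can be sketched in Python as a cross-check on the Excel formulas. This is an illustration under stated assumptions, not the procedure actually used: the function names are hypothetical, the sample values [1.0, 2.0, 3.0] are made up, and the p-value uses the closed form of the Student's t distribution that holds only for exactly 2 degrees of freedom (the case here, with three replicates).

```python
from math import sqrt
from statistics import mean, stdev


def scale_and_center(column):
    """Mirror (B4-B$2)/B$3: subtract the column average, then divide by the
    sample standard deviation (Excel STDEV is also the sample version)."""
    avg = mean(column)
    sd = stdev(column)
    return [(x - avg) / sd for x in column]


def t_statistic(replicates):
    """Mirror AVERAGE(range)/(STDEV(range)/SQRT(n)) for one gene."""
    n = len(replicates)
    return mean(replicates) / (stdev(replicates) / sqrt(n))


def two_tailed_p_df2(t):
    """Two-tailed p-value for a t statistic with exactly 2 degrees of freedom.

    Uses the closed form p = 1 - |t| / sqrt(2 + t^2), valid only for df = 2;
    it plays the role of TDIST(ABS(t), 2, 2).
    """
    return 1 - abs(t) / sqrt(2 + t * t)


# Made-up values standing in for one column / one gene's three replicates
sample = [1.0, 2.0, 3.0]
scaled = scale_and_center(sample)
t = t_statistic(sample)
p = two_tailed_p_df2(t)
print(scaled, round(t, 4), round(p, 4))
```

With only 2 degrees of freedom the t statistic has to exceed about 4.30 in magnitude before the p-value drops below 0.05, which is one reason relatively few genes pass the stricter thresholds.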

Part 2

Downloaded GenMAPP, my text files, and the 2009 Vibrio cholerae Gene Database
Set the database to the downloaded 2009 Vibrio cholerae Gene Database
In Expression Dataset Manager, created a new dataset by importing my forGenMAPP text file
- 772 errors reported, all of which were reported as gene not found (verified using Excel filters)
Created new color set (Pathogenic vs Lab)
- Selected increasing items as those which had an AvgLogFC change > 0.25 and a p-value less than 0.05. ([AvgLogFC_all]>0.25 AND [Pvalue]<0.05)
- Selected decreasing items using the same criteria, just with an inverted AvgLogFC change ([AvgLogFC_all]<-0.25 AND [Pvalue]<0.05)
- Colored increases red, decreases green, and no change yellow
Used GenMAPP tool MAPPFinder to create a table for decreases in gene expression, exported as DecreasedVibrio_2009_TV.
Opened gene ontology browser, clicked on "Show Ranked List" and recorded the top ten terms (see questions)
Using the search bar on the top of MAPPFinder, searched for VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583. This required switching the search type to OrderedLocusNames. Search example in image. Any results were highlighted in blue (example). For results, see questions above.
Clicked on the 3'-5'-exoribonuclease activity as an example of a gene ontology (GO) found while searching for VC0647 (result).
- Saw two genes in the section: PNP_VIBCH and RBN_VIBCH.
- Searched on uniprot for VC0647. Uniprot ID is PNP_VIBCH. search result
- Determined PNP_VIBCH expression was decreased as the gene was highlighted green(*).
Downloaded the criterion file from this page and opened it in Excel as a tab-delimited file.
Compared information lines in the Excel file to those in the 2010 Excel spreadsheet (comparison above in questions).
Set a filter on z score (Column N) for values greater than 2 and another one on PermuteP (Column O) for p<0.05.
Had 98 terms left, so filtered number changed (Column I) to values between 5 and 100 and percent changed (Column L) to values greater than 33%. (Limited total results to 16 terms.)
Searched each GO name in MAPPFinder and looked at nearest relatives for other GOs that appeared on my shortened list. Colored related members the same background color using cell fill (see Excel file above titled "20 GO Terms and Relationships")
Reviewed terminology on www.geneontology.org and wrote summary paragraph (at end of the questions section)
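
The Excel filters applied to the criterion file can be expressed as a single predicate. This is only a sketch: the dictionary keys and the four sample rows are hypothetical stand-ins for the spreadsheet columns, and "between 5 and 100" is assumed to include both endpoints.

```python
def passes_filter(term):
    """Apply the four filters used on the MAPPFinder results:
    z score > 2, PermuteP < 0.05, number changed between 5 and 100
    (assumed inclusive), and percent changed > 33."""
    return (
        term["z_score"] > 2
        and term["permute_p"] < 0.05
        and 5 <= term["number_changed"] <= 100
        and term["percent_changed"] > 33
    )


# Hypothetical rows; real values would come from the criterion file columns
terms = [
    {"name": "hypothetical term A", "z_score": 5.9, "permute_p": 0.0,
     "number_changed": 8, "percent_changed": 50.0},
    {"name": "hypothetical term B", "z_score": 3.6, "permute_p": 0.001,
     "number_changed": 3, "percent_changed": 60.0},   # too few genes changed
    {"name": "hypothetical term C", "z_score": 2.4, "permute_p": 0.03,
     "number_changed": 120, "percent_changed": 20.0},  # too many, too low %
    {"name": "hypothetical term D", "z_score": 1.5, "permute_p": 0.2,
     "number_changed": 10, "percent_changed": 40.0},   # fails z and p
]

kept = [t["name"] for t in terms if passes_filter(t)]
print(kept)
```

Filtering on number changed and percent changed together removes both very small terms, where one gene can swing the percentage, and very broad terms that say little about specific processes.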

By Tauras Vilgalys

As part of Biological Databases

Please Remember the Harassing of Deities is Strictly Prohibited

Never Forget Samson
