Malverso Week 8

From LMU BioDB 2015

Electronic Lab Notebook

Part One

  • The data from the Merrell et al. (2002) paper was accessed from the Stanford Microarray Database.
  • The Log2 of R/G Normalized Ratio (Median) has been copied from the raw data files downloaded from the Stanford Microarray Database.

Patient A

  • Sample 1: 24047.xls (A1)
  • Sample 2: 24048.xls (A2)
  • Sample 3: 24213.xls (A3)
  • Sample 4: 24202.xls (A4)

Patient B

  • Sample 5: 24049.xls (B1)
  • Sample 6: 24050.xls (B2)
  • Sample 7: 24203.xls (B3)
  • Sample 8: 24204.xls (B4)

Patient C

  • Sample 9: 24053.xls (C1)
  • Sample 10: 24054.xls (C2)
  • Sample 11: 24205.xls (C3)
  • Sample 12: 24206.xls (C4)
  • I downloaded the Merrell_Compiled_Raw_Data_Vibrio.xls file to my Desktop and saved it with my initials and the date.

Normalizing the Log Ratios

  • To scale and center the data I:
    • Inserted a new Worksheet into my Excel file, and named it "scaled_centered".
    • Going back to the "compiled_raw_data" worksheet, I selected all of the cells and copied them. I then went to the "scaled_centered" worksheet, clicked on the upper, left-hand cell (cell A1) and pasted the values.
    • I inserted two rows in between the top row of headers and the first data row.
    • In cell A2, I typed "Average" and in cell A3, "StdDev".
  • I then computed the Average log ratio for each chip (each column of data).
    • In cell B2, I typed =AVERAGE(B4:B5224) and then pressed Enter.
  • I then computed the Standard Deviation of the log ratios on each chip (each column of data).
    • In cell B3, I typed =STDEV(B4:B5224) and then pressed Enter.
    • I then clicked on B2 and dragged it across all of the columns to copy the equation across all the data. I repeated this with B3 as well. Excel automatically changed the equation to match the cell designations for those columns.
  • I copied the column headings for all of my data columns and then pasted them to the right of the last data column so that there was a second set of headers above blank columns of cells. I edited the names of the columns so that they read: A1_scaled_centered, A2_scaled_centered, etc.
  • In cell N4, I typed =(B4-B$2)/B$3 so that the data in cell B4 has the average subtracted from it (cell B2) and is divided by the standard deviation (cell B3). I used the dollar sign symbols in front of the "2" and "3" to tell Excel to always reference that row in the equation, even though I pasted it for the entire column of 5221 genes.
  • Why is this important? If I hadn't used the dollar signs, Excel would have shifted the row references as the formula was copied down, so each value would have had the cell two rows above it subtracted and been divided by the cell directly above it, instead of always using the average and standard deviation.
  • I copied and pasted this equation into the entire column by clicking on the original cell with my equation and positioning my cursor at the bottom right corner. When the cursor changed to a thin black plus sign (not a chubby white one) I double-clicked, and the formula was copied to the entire column of genes.
  • I then copied and pasted the scaling and centering equation for each of the columns of data with the "_scaled_centered" column header, making sure to adjust the equations to pertain to their respective columns.
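The scaling and centering step above can also be sketched outside of Excel. This is a minimal Python sketch, not the procedure I actually ran: the function name scale_and_center and the example numbers are made up for illustration. It subtracts the column average and divides by the sample standard deviation, which is what =(B4-B$2)/B$3 does for one chip.

```python
from statistics import mean, stdev


def scale_and_center(column):
    """Return (value - column average) / column standard deviation for each value."""
    avg = mean(column)
    sd = stdev(column)  # sample standard deviation (n - 1), like Excel's STDEV
    return [(value - avg) / sd for value in column]


# Hypothetical log ratios for one chip (not values from the Merrell data).
chip = [0.5, -1.0, 0.25, 1.25, -0.5]
scaled = scale_and_center(chip)
print([round(value, 3) for value in scaled])
```

After this step each column should have an average of 0 and a standard deviation of 1, which is a quick way to check the worksheet formulas.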

Performing Statistical Analysis on the Ratios

  • This is performed on the scaled and centered data produced in the previous step.
  • I inserted a new worksheet and named it "statistics".
  • Going back to the "scaled_centered" worksheet, I copied the first column ("ID") and pasted the data into the first column of the new "statistics" worksheet.
  • I also went back to the "scaled_centered" worksheet, copied the columns designated "_scaled_centered", and pasted them into the "statistics" worksheet by clicking on cell B1 and selecting "Paste Special" > “Values” from the Edit menu. This pasted the numerical results into my new worksheet instead of the equations.
  • Next I deleted Rows 2 and 3 where it says "Average" and "StdDev" so that my data rows with gene IDs are immediately below the header row 1.
  • I went to a new column on the right of my worksheet and typed the headers "Avg_LogFC_A", "Avg_LogFC_B", and "Avg_LogFC_C" into the top cells of the next three columns.
  • I computed the average log fold change for the replicates for each patient by typing =AVERAGE(B2:E2) into cell N2. I copied this equation and pasted it into the rest of the column.
  • Next I created the equation for patients B and C and pasted it into their respective columns.
  • I computed the average of the averages. I typed the header "Avg_LogFC_all" into the first cell in the next empty column and created the equation that will compute the average of the three previous averages and pasted it into this entire column.
  • I inserted a new column next to the "Avg_LogFC_all" column and labeled it "Tstat". This will compute a T statistic that tells us whether the scaled and centered average log ratio is significantly different than 0 (no change).
  • I entered the equation: =AVERAGE(N2:P2)/(STDEV(N2:P2)/SQRT(3)) and copied it and then pasted it into all rows in that column.
  • Next I labeled the top cell in the next column "Pvalue". In the cell below the label, I entered the equation: =TDIST(ABS(R2),degrees of freedom,2)
  • The number of degrees of freedom is the number of replicates minus one, so in this case there are 2 degrees of freedom. I copied the equation and pasted it into all rows in that column.
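As a cross-check on the Tstat and Pvalue columns, here is a small Python sketch under stated assumptions. The function names and the example averages are hypothetical. The p value function uses a closed form that is valid only for exactly 2 degrees of freedom (three replicates), so it stands in for TDIST(ABS(t),2,2) in this one case and is not a general replacement.

```python
from math import sqrt
from statistics import mean, stdev


def t_statistic(values):
    """One-sample t statistic against 0: average / (stdev / sqrt(n))."""
    return mean(values) / (stdev(values) / sqrt(len(values)))


def two_tailed_p_df2(t):
    """Two-tailed p value for a t distribution with exactly 2 degrees of freedom.

    Closed form 1 - |t| / sqrt(t^2 + 2); this holds only for df = 2.
    """
    return 1 - abs(t) / sqrt(t * t + 2)


# Hypothetical Avg_LogFC_A, Avg_LogFC_B, Avg_LogFC_C values for one gene.
averages = [1.0, 2.0, 3.0]
t = t_statistic(averages)
p = two_tailed_p_df2(t)
print(round(t, 4), round(p, 4))
```

With only 2 degrees of freedom, the t statistic has to be larger than about 4.3 in absolute value before the p value drops below 0.05.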

Calculating the Bonferroni p value Correction

  • I performed adjustments to the p value to correct for the multiple testing problem. I labeled the next two columns to the right with the same label, Bonferroni_Pvalue.
  • I typed the equation =S2*5221, and used the trick to copy the formula throughout the column.
  • I replaced any corrected p value that is greater than 1 by the number 1 by typing the following formula into the first cell below the second Bonferroni_Pvalue header: =IF(T2>1,1,T2). Using the trick, I copied the formula throughout the column.
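The two Bonferroni columns above (multiply by the number of tests, then cap at 1) can be sketched in a few lines of Python. The function name and the example p values are made up for illustration; only the 5221 comes from this analysis.

```python
def bonferroni(p_values, n_tests):
    """Multiply each p value by the number of tests and cap the result at 1."""
    return [min(p * n_tests, 1.0) for p in p_values]


# Hypothetical unadjusted p values; 5221 is the number of T tests performed.
adjusted = bonferroni([0.00001, 0.0003, 0.02], n_tests=5221)
print(adjusted)
```

This combines =S2*5221 and =IF(T2>1,1,T2) into a single step.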

Calculating the Benjamini & Hochberg p value Correction

I inserted a new worksheet named "B-H_Pvalue".

  • I copied and pasted the "ID" column from the previous worksheet into the first column of the new worksheet.
  • I then inserted a new column on the very left and named it "MasterIndex". This creates a numerical index of genes so that I could sort them back into the same order later.
  • I typed a "1" in cell A2 and a "2" in cell A3.
  • Selecting both cells, I hovered my mouse over the bottom-right corner of the selection until it made a thin black + sign. I double-clicked on the + sign to fill the entire column with a series of numbers from 1 to 5221 (the number of genes on the microarray).
  • For the following, I used Paste special > Paste values. I copied the unadjusted p values from the previous worksheet and pasted them into Column C.
  • I selected all of columns A, B, and C and sorted by ascending values on Column C by clicking the sort button from A to Z on the toolbar, and in the window that appears, sorting by column C, smallest to largest.
  • I typed the header "Rank" in cell D1. I typed "1" into cell D2 and "2" into cell D3. I selected both cells D2 and D3 and double-clicked on the plus sign on the lower right-hand corner of my selection to fill the column with a series of numbers from 1 to 5221.
  • I now calculated the Benjamini and Hochberg p value correction.
    • I typed B-H_Pvalue in cell E1 and typed the following formula in cell E2: =(C2*5221)/D2 and pressed enter and then copied that equation to the entire column.
    • I typed "B-H_Pvalue" into cell F1.
    • Then I typed the following formula into cell F2: =IF(E2>1,1,E2) and pressed enter, and copied that equation to the entire column.
  • I selected columns A through F and sorted them by MasterIndex in Column A in ascending order.
  • I copied column F and used Paste special > Paste values to paste it into the next column on the right of my "statistics" sheet.
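The sort, rank, adjust, and re-sort steps above can be sketched in Python as follows. This mirrors the spreadsheet procedure as I recorded it (p value times the number of genes, divided by rank, capped at 1); the function name and the example p values are hypothetical.

```python
def bh_spreadsheet(p_values):
    """Benjamini & Hochberg adjustment as done in the worksheet: p * m / rank, capped at 1."""
    m = len(p_values)
    # Sort the positions by p value; keeping the positions plays the role of MasterIndex.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    for rank, i in enumerate(order, start=1):
        # Writing back to position i restores the original gene order.
        adjusted[i] = min(p_values[i] * m / rank, 1.0)
    return adjusted


# Hypothetical unadjusted p values for four genes.
print(bh_spreadsheet([0.04, 0.001, 0.03, 0.5]))
```

Standard implementations of this correction usually also take a running minimum from the largest rank downward so that adjusted values never decrease as the rank increases. That step is not part of the worksheet procedure above, so it is left out of this sketch.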

Preparing the file for GenMAPP

  • I inserted a new worksheet and named it "forGenMAPP" and copied everything from the “statistics” worksheet and pasted it into cell A1, making sure that the values were pasted.
  • I selected Columns B through Q (all the fold changes). I then selected the menu item Format > Cells. Under the number tab, I selected 2 decimal places, and clicked OK.
  • I next selected all the columns containing p values. I selected the menu item Format > Cells. Under the number tab, I selected 4 decimal places and clicked OK.
  • I deleted the left-most Bonferroni p value column, preserving the one that shows the result of the "if" statement.
  • I inserted a column to the right of the "ID" column and typed the header "SystemCode" into the top cell. I filled the entire column (each cell) with the letter "N".
  • I selected the menu item File > Save As, and chose "Text (Tab-delimited) (*.txt)" from the file type drop-down menu. I clicked through the various warnings.
  • I then uploaded both the .xls and .txt files to this journal page.
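For reference, the formatting steps above (SystemCode column after ID, 2 decimal places for fold changes, 4 for p values, tab-delimited output) could be sketched in Python like this. The function name, the gene ID, and the values are made up, and treating any column whose name contains "Pvalue" as a p value column is an assumption of this sketch.

```python
import csv
import io


def to_genmapp_rows(rows):
    """Add SystemCode 'N' after ID and format numbers as fixed-decimal text."""
    formatted = []
    for row in rows:
        new_row = {"ID": row["ID"], "SystemCode": "N"}
        for key, value in row.items():
            if key == "ID":
                continue
            # Assumption: p value columns have "Pvalue" in their header.
            digits = 4 if "Pvalue" in key else 2
            new_row[key] = f"{value:.{digits}f}"
        formatted.append(new_row)
    return formatted


# Hypothetical row; not a value from my spreadsheet.
rows = [{"ID": "VC0001", "Avg_LogFC_all": 0.12345, "Pvalue": 0.012345}]
formatted = to_genmapp_rows(rows)

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=list(formatted[0]), delimiter="\t", lineterminator="\n"
)
writer.writeheader()
writer.writerows(formatted)
text = buffer.getvalue()
print(text)
```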

Sanity Check: Number of genes significantly changed

  • Before I moved on to the GenMAPP/MAPPFinder analysis, I wanted to perform a sanity check to make sure that I performed my data analysis correctly by comparing my results to the published results of Merrell et al. (2002).
  • I opened my spreadsheet and went to the "forGenMAPP" tab.
  • I clicked on cell A1 and selected the menu item Data > Filter > Autofilter. Little drop-down arrows appeared at the top of each column. This let me filter the data according to criteria.
  • I clicked on the drop-down arrow on the "Pvalue" column and selected "Custom". In the window that appeared, I set a criterion to filter my data so that the Pvalue had to be less than 0.05.
  • How many genes have p value < 0.05? and what is the percentage (out of 5221)? 948 genes. This is 18.16% of the genes.
  • What about p < 0.01? and what is the percentage (out of 5221)? 235 genes. This is 4.50% of the genes.
  • What about p < 0.001? and what is the percentage (out of 5221)? 24 genes. This is 0.46% of the genes.
  • What about p < 0.0001? and what is the percentage (out of 5221)? 2 genes. This is 0.04% of the genes.
  • When I used a p value cut-off of p < 0.05, what I am saying is that I would have seen a gene expression change that deviates this far from zero by chance less than 5% of the time.
  • I have just performed 5221 T tests for significance. Another way to state what I am seeing with p < 0.05 is that I would expect to see this magnitude of a gene expression change by chance in about 5% of my T tests, or 261 times. Since I have more than 261 genes that pass this cut-off, I know that some genes are significantly changed. However, I don't know which ones. To apply a more stringent criterion to my p values, I performed the Bonferroni and Benjamini and Hochberg corrections to these unadjusted p values. The Bonferroni correction is very stringent. The Benjamini-Hochberg correction is less stringent.
  • How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 5221)? 0 genes. This is 0% of the genes.
  • How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 5221)? 0 genes. This is 0% of the genes.
  • In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If I want to be very confident of my data, I would use a small p value cut-off. If I am OK with being less confident about a gene expression change and want to include more genes in the analysis, I can use a larger p value cut-off.
  • The "Avg_LogFC_all" tells the size of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
  • I kept the (unadjusted) "Pvalue" filter at p < 0.05, and filtered the "Avg_LogFC_all" column to show all genes with an average log fold change greater than zero.
    • How many are there? (and %) 352 genes. This is 6.74% of the genes.
  • I kept the (unadjusted) "Pvalue" filter at p < 0.05 and filtered the "Avg_LogFC_all" column to show all genes with an average log fold change less than zero.
    • How many are there? (and %) 596 genes. This is 11.42% of the genes.
    • What about an average log fold change of > 0.25 and p < 0.05? (and %) 339 genes. This is 6.49% of the genes.
    • Or an average log fold change of < -0.25 and p < 0.05? (and %) (These are more realistic values for the fold change cut-offs because they represent about a 20% fold change, which is about the level of detection of this technology.) 579 genes. This is 11.09% of the genes.
  • For the GenMAPP analysis below, I used the fold change cut-off of greater than 0.25 or less than -0.25 and the unadjusted p value cut off of p < 0.05 for my analysis because I want to include several hundred genes in my analysis.
  • What criteria did Merrell et al. (2002) use to determine a significant gene expression change? How does it compare to our method? Merrell et al. (2002) used a “two-class SAM analysis”, with significance requiring at least a twofold change in expression. Merrell et al. (2002) looked at the actual change in expression to figure out what was significant, while I used the p value, which is the probability of seeing a change this large by chance if there were no real change in expression. Merrell et al. (2002) used a more stringent method: they found 237 genes that were significantly changed, while we found 918 using the criteria p < 0.05 and an average log fold change > 0.25 or < -0.25.
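The percentages and the "about 261" figure in the sanity check above are simple arithmetic, sketched here in Python. The counts are the unadjusted p value counts I recorded; the function name percent_of_total is made up for this sketch.

```python
TOTAL_GENES = 5221


def percent_of_total(count, total=TOTAL_GENES):
    """Percentage of all genes on the microarray, rounded to 2 decimal places."""
    return round(100 * count / total, 2)


# Gene counts recorded above for each unadjusted p value cut-off.
counts = {"p < 0.05": 948, "p < 0.01": 235, "p < 0.001": 24, "p < 0.0001": 2}
for label, count in counts.items():
    print(label, count, percent_of_total(count))

# Number of tests expected to pass p < 0.05 by chance alone.
expected_by_chance = 0.05 * TOTAL_GENES
print(round(expected_by_chance))
```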

Sanity Check: Compare Individual Genes with Known Data

  • Merrell et al. (2002) report that genes with IDs: VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583 were all significantly changed in their data. I looked these genes up in my spreadsheet.
  • What are their fold changes and p values? Are they significantly changed in my analysis?
    • VC0028 (two entries)
      • fold change: 1.65 pvalue:.0474 This is significantly changed because the pvalue is less than .05 and the fold change is greater than .25.
      • fold change: 1.27 pvalue:.0692 This is not significantly changed because the pvalue is greater than .05.
    • VC0941 (two entries)
      • fold change:-.28 pvalue:.1636 This is not significantly changed because the pvalue is greater than .05.
      • fold change:.09 pvalue:.06759 This is not significantly changed because the pvalue is greater than .05.
    • VC0869 (five entries)
      • fold change:2.12 pvalue:.02 This is significantly changed.
      • fold change:1.50 pvalue:.0174 This is significantly changed.
      • fold change:1.59 pvalue:.0463 This is significantly changed.
      • fold change:1.95 pvalue:.0227 This is significantly changed.
      • fold change:2.20 pvalue:.002 This is significantly changed.
    • VC0051 (two entries)
      • fold change: 1.89 pvalue:.016 This is significantly changed.
      • fold change: 1.92 pvalue:.0139 This is significantly changed.
    • VC0647 (three entries)
      • fold change: -1.11 pvalue:.0003 This is significantly changed.
      • fold change:-0.94 pvalue:.0125 This is significantly changed.
      • fold change:-1.05 pvalue:.0051 This is significantly changed because the pvalue is less than .05.
    • VC0468
      • fold change:.17 pvalue:.3350 This is not significantly changed because the pvalue is greater than .05.
    • VC2350
      • fold change: -2.40 pvalue:.0130 This is significantly changed.
    • VCA0583
      • fold change:1.06 pvalue:.1011 This is not significantly changed because the pvalue is greater than .05.
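The decision rule I applied to each entry above can be written out as a short Python sketch. The function name classify and the labels are my own for illustration; the cut-offs are the ones chosen earlier (fold change beyond 0.25 in either direction and unadjusted p < 0.05), and the example values are three of the genes listed above.

```python
FC_CUTOFF = 0.25
P_CUTOFF = 0.05


def classify(log_fc, p_value):
    """Label one spot using the fold change and unadjusted p value cut-offs."""
    if p_value < P_CUTOFF and log_fc > FC_CUTOFF:
        return "increased"
    if p_value < P_CUTOFF and log_fc < -FC_CUTOFF:
        return "decreased"
    return "not significant"


# (fold change, p value) pairs recorded above; a gene can have several entries.
entries = {
    "VC0028": [(1.65, 0.0474), (1.27, 0.0692)],
    "VC2350": [(-2.40, 0.0130)],
    "VC0468": [(0.17, 0.3350)],
}
calls = {
    gene: [classify(fc, p) for fc, p in spots] for gene, spots in entries.items()
}
print(calls)
```

Genes with more than one entry, such as VC0028, can get a different call for each entry, which is why I listed every entry separately.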

Part Two

  • I downloaded the Vibrio cholerae Gene Database created by Drs. Dahlquist and Dionisio: 2010 Vc-Std_External_20101022.gdb. My partner, Kristin, downloaded the database from 2009.
  • I downloaded the file, and saved it into the folder C:\GenMAPP 2 Data\Gene Databases, and extracted it.

GenMAPP Expression Dataset Manager Procedure

  • I launched the GenMAPP Program and checked to make sure the correct Gene Database was loaded.
  • I selected the Data menu from the main Drafting Board window and chose Expression Dataset Manager from the drop-down list. The Expression Dataset Manager window opened.
  • I selected New Dataset from the Expression Datasets menu and selected the tab-delimited text file that I formatted for GenMAPP (.txt) in the procedure above from the file dialog box that appeared.
  • The Data Type Specification window appeared. GenMAPP expects numerical data. If any of my columns had contained text (character) data, I would have checked the box next to the field (column) name, but the data we have been working with does not have any text data in it.
  • I allowed the Expression Dataset Manager to convert my data, which was automatically saved with a .gex extension.
  • A message appeared saying that the Expression Dataset Manager could not convert one or more lines of data. Lines that generate an error during the conversion of a raw data file are not added to the Expression Dataset. Instead, exception files were created. The exception file was given the same name as my raw data file with .EX before the extension. The exception file contains all of my raw data, with the addition of a column named ~Error~. This column contains the error messages.
  • Record the number of errors. For your journal assignment, open the .EX.txt file and use the Data > Filter > Autofilter function to determine what the errors were for the rows that were not converted. I got 121 errors. All of the errors were labeled “Gene not found in OrderedLocusNames or any related system.”
  • It is likely that you will have a different number of errors than your partner who is using a different version of the Vibrio cholerae Gene Database. Which of you has more errors? Why do you think that is? My partner Kristin has more errors; she has 772. A lot of genes must have been added between the time her database was created and the time mine was, because with the 2010 database, more genes were found in “OrderedLocusNames”.
  • I uploaded my exceptions file: EX.txt to this wiki page.
  • I customized the new Expression Dataset by creating a new Color Set, which contains the instructions to GenMAPP for displaying data on MAPPs.
  • I created a Color Set by filling in the following fields in the Color Set area of the Expression Dataset Manager: a name for the Color Set, the gene value, and the criteria that determine how a gene object is colored on the MAPP.
    • I entered “LogFoldChange2” in the Color Set Name field
    • For the Gene Value I used "Avg_LogFC_all".
    • I activated the Criteria Builder by clicking the New button.
    • I entered “increased” for the criterion in the Label in Legend field, because that is what my partner and I were assigned.
    • I chose red for the color criterion and stated the criterion for color-coding a gene in the Criterion field.
  • For the Vibrio dataset, I created two criteria. "Increased" was [Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05 with the color red, and "Decreased" was [Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05 with the color green.
  • I clicked “add” after typing in the criterion.
  • I saved the entire Expression Dataset by selecting Save from the Expression Dataset menu.
  • I exited the Expression Dataset Manager to view the Color Sets on a MAPP.
  • I uploaded my .gex file to this page.
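Instead of using Autofilter on the exception file described above, the error messages could be tallied with a short Python sketch. This assumes the .EX.txt file is tab-delimited and that rows that converted without a problem have an empty ~Error~ cell; the function name, the gene IDs, and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def tally_errors(handle):
    """Count each distinct message in the ~Error~ column of an exception file."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Assumption: rows without an error have an empty ~Error~ cell.
    return Counter(row["~Error~"] for row in reader if row.get("~Error~"))


# Hypothetical two-row exception file (made-up IDs and values).
sample = (
    "ID\tSystemCode\tPvalue\t~Error~\n"
    "VC0001\tN\t0.0100\t\n"
    "VCX999\tN\t0.2000\tGene not found in OrderedLocusNames or any related system.\n"
)
errors = tally_errors(io.StringIO(sample))
print(errors)
```

To run it on a real file, the StringIO object would be replaced with an opened file handle for the .EX.txt file.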

MAPPFinder Procedure

  • I launched the MAPPFinder program and clicked on the button "Calculate New Results".
  • I clicked on "Find File" and chose the Expression Dataset file (.gex) and clicked OK.
  • I chose the Color Set and Criteria with which to filter the data by clicking on the "Increased" criteria in the right-hand box.
  • Next I checked the boxes next to "Gene Ontology" and "p value" and clicked the "Browse" button and created a filename for my results.
  • I clicked "Run MAPPFinder". The analysis took a few minutes, but eventually a Gene Ontology browser opened showing my results. All of the Gene Ontology terms that have at least 3 genes measured and a p value of less than 0.05 were highlighted yellow. A term with a p value less than 0.05 is considered a "significant" result.
  • To see a list of the most significant Gene Ontology terms, I clicked on the menu item "Show Ranked List".
  • List the top 10 Gene Ontology terms in your individual journal entry.
  1. branched chain family amino acid metabolic process
  2. branched chain family amino acid biosynthetic process
  3. IMP metabolic process
  4. IMP biosynthetic process
  5. purine ribonucleoside monophosphate biosynthetic process
  6. purine ribonucleoside monophosphate metabolic process
  7. purine nucleoside monophosphate metabolic process
  8. purine nucleoside monophosphate biosynthetic process
  9. 'de novo' IMP biosynthetic process
  10. arginine metabolic process
  • Compare your list with your partner who used a different version of the Gene Database. Are your terms the same or different? Why do you think that is? Record your answer in your individual journal entry. Our answers were completely different. This must be because some of the most significant genes were genes discovered/added to the database after 2009 when hers was published but before mine was published in 2010.
  • One of the things I can do in MAPPFinder is find the Gene Ontology term(s) with which a particular gene is associated.
    • First, in the main MAPPFinder Browser window, I clicked on the button "Collapse the Tree". Then, I searched for the genes that were mentioned by Merrell et al. (2002), VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583 by typing the identifier for each of these genes into the MAPPFinder browser gene ID search field.
    • I chose "OrderedLocusNames" from the drop-down menu to the right of the search field and clicked on the GeneID Search button. The GO term(s) that are associated with that gene are highlighted in blue.
    • I listed the GO terms associated with each of those genes below:
  • VC0028
    • branched chain family amino acid biosynthetic process
    • cellular amino acid biosynthetic process
    • glutamine family amino acid metabolic process
  • VC0941
    • metabolic process
    • metal ion binding
    • iron-sulfur cluster binding
    • 4 iron, 4 sulfur cluster binding
    • catalytic activity
    • lyase activity
    • dihydroxy-acid dehydratase activity
  • VC0869
    • glutamine metabolic process
    • purine nucleotide biosynthetic process
    • 'de novo' IMP biosynthetic process
    • cytoplasm
    • nucleotide binding
    • ATP binding
    • catalytic activity
    • ligase activity
    • phosphoribosylformylglycinamidine synthase activity
  • VC0051
    • purine nucleotide biosynthetic process
    • 'de novo' IMP biosynthetic process
    • nucleotide binding
    • ATP binding
    • catalytic activity
    • lyase activity
    • carboxy-lyase activity
    • phosphoribosylaminoimidazole carboxylase activity
  • VC0647
    • mRNA catabolic process
    • RNA processing
    • cytoplasm
    • mitochondrion
    • RNA binding
    • 3'-5'-exoribonuclease activity
    • transferase activity
    • nucleotidyltransferase activity
    • polyribonucleotide nucleotidyltransferase activity
  • VC0468
    • glutathione biosynthetic process
    • metal ion binding
    • nucleotide binding
    • ATP binding
    • catalytic activity
    • ligase activity
    • glutathione synthase activity
  • VC2350
    • deoxyribonucleotide catabolic process
    • metabolic process
    • cytoplasm
    • catalytic activity
    • lyase activity
    • deoxyribose-phosphate aldolase activity
  • VCA0583
    • transport
    • outer membrane-bounded periplasmic space
    • transporter activity
  • Are they the same as your partner who is using a different Gene Database? Why or why not? I had entries in my database for every single gene while Kristin did not. I did, however, have all of the terms she found for VC0647 and VCA0583, which are the only two of the above eight that she had results for. This is because my database has more data in it since it is newer.
  • I clicked on one of the GO terms, polyribonucleotide nucleotidyltransferase activity. A MAPP opened listing all of the genes (as boxes) associated with that GO term.
  • The genes named within the map are based on the UniProt identification system.
  • To match the gene of interest to its identification I went to the UniProt site and typed in the gene ID, PNP_VIBCH, into the search bar.
  • According to the UniProt website, this gene is "Involved in mRNA degradation. Catalyzes the phosphorolysis of single-stranded polyribonucleotides processively in the 3'- to 5'-direction"
  • The genes on the MAPP were color-coded with the gene expression data from the microarray experiment.
  • The GO term I clicked on was polyribonucleotide nucleotidyltransferase activity and the expression of the gene decreased significantly in the experiment.
  • I double-clicked on the gene box. This opened an Internet Explorer window called the "Backpage" for this gene. This page has links to pages for this gene in the public databases.
  • According to the "Backpage", the UniProt-defined function of this gene is: "Involved in mRNA degradation. Hydrolyzes of single-stranded polyribonucleotides processively in the 3'- to 5'-direction". This is slightly different from the UniProt function listed on the website. I wonder if that is just how it is worded or if one is more accurate than the other.
  • The MAPP that was created is stored in the directory, C:\GenMAPP 2 Data\MAPPs\VC GO. I uploaded this file and linked it to this page.
  • Next I made a copy of my results (-GO.txt) file.
  • I uploaded my results file to this journal page.
  • I launched Microsoft Excel and opened the copies of the .txt files in Excel. This showed the same data that I saw in the MAPPFinder Browser, but in tabular form.
  • On top of the spreadsheet there are rows of information that give me the background information on how MAPPFinder made the calculations.
  • Compare this information with your partner who used a different version of the Vibrio Gene Database. Which numbers are different? Why are they different?
  • An image of the top of the spreadsheet:

GenMAPPresults MA.jpg

  • Probes meeting the filter linked to a UniProt ID, genes meeting the criterion linked to a GO term, probes linked to a UniProt ID, and genes linked to a GO term were different. In each of these cases, my numbers were larger than Kristin's. This is likely because my database (2010) has more gene entries than Kristin's, which makes sense since mine is the updated version of the 2009 database.
  • I filtered this list to show the top GO terms represented in my data for both the "Increased" and "Decreased" criteria. I filtered my list down to 20 terms. I clicked on a cell in the row of headers for the data. Then I went to the Data menu and clicked "Filter > Autofilter". Drop-down arrows appeared in the row of headers, and I set these filters:
    • Z Score greater than 2.
    • PermuteP less than .05.
    • Number Changed greater than or equal to 5 AND less than 100.
    • Percent Changed greater than or equal to 26%
    • I saved my changes to an Excel spreadsheet (.xls).
  • I uploaded a .xls file to this journal page that showed the parent-child relationships between GO terms by highlighting their cells.
  • I interpreted my results by looking up the definitions for any GO terms that were unfamiliar to me, which was basically all of the GO terms. I found the definitions below at http://www.geneontology.org, and categorized them based on similar definitions.
    • branched chain family amino acid metabolic process - The chemical reactions and pathways involving amino acids containing a branched carbon skeleton, comprising isoleucine, leucine and valine.
    • branched chain family amino acid biosynthetic process - The chemical reactions and pathways resulting in the formation of amino acids containing a branched carbon skeleton, comprising isoleucine, leucine and valine.
    • IMP metabolic process - The chemical reactions and pathways involving IMP, inosine monophosphate.
    • IMP biosynthetic process - The chemical reactions and pathways resulting in the formation of IMP, inosine monophosphate.
    • 'de novo' IMP biosynthetic process - The chemical reactions and pathways resulting in the formation of IMP, inosine monophosphate, by the stepwise assembly of a purine ring on ribose 5-phosphate
    • purine ribonucleoside monophosphate biosynthetic process - The chemical reactions and pathways resulting in the formation of purine ribonucleoside monophosphate, a compound consisting of a purine base linked to a ribose sugar esterified with phosphate on the sugar.
    • purine ribonucleoside monophosphate metabolic process - The chemical reactions and pathways involving purine ribonucleoside monophosphate, a compound consisting of a purine base linked to a ribose sugar esterified with phosphate on the sugar.
    • purine nucleoside monophosphate metabolic process - The chemical reactions and pathways involving purine nucleoside monophosphate, a compound consisting of a purine base linked to a ribose or deoxyribose sugar esterified with phosphate on the sugar.
    • purine nucleoside monophosphate biosynthetic process - The chemical reactions and pathways resulting in the formation of purine nucleoside monophosphate, a compound consisting of a purine base linked to a ribose or deoxyribose sugar esterified with phosphate on the sugar.
    • arginine metabolic process - The chemical reactions and pathways involving arginine, 2-amino-5-(carbamimidamido)pentanoic acid.
  • I wrote a paragraph relating the results of this GO analysis to the experiment performed, with help from Kristin:
  • I can conclude from the definitions of the GO terms I gathered that the Vibrio cholerae collected from the patients showed gene expression increases in a few major areas. The significance of the increase in IMP metabolic process, IMP biosynthetic process, and the 'de novo' IMP biosynthetic process all point to an increase in the formation of IMP (inosine monophosphate). The significance of the increase in purine ribonucleoside monophosphate biosynthetic process, purine ribonucleoside monophosphate metabolic process, purine nucleoside monophosphate metabolic process, and purine nucleoside monophosphate biosynthetic process shows an increase in the formation of a "compound consisting of a purine base linked to a ribose sugar esterified with phosphate on the sugar" (geneontology.org). These were the two main categories of the definitions of the highest ranked GO terms, so they give accurate insight into what the differences between the patient Vibrio cholerae and the lab Vibrio cholerae are. These definitions show that the patient Vibrio cholerae was more prolific than the lab Vibrio cholerae because the GO terms were all involved in the synthesis of protein, in that they were either involved with or helped to form basic building blocks of DNA/RNA. This increase in proteins not only helps the Vibrio cholerae to stay alive, but also increases the pathogenicity of the Vibrio cholerae because according to this Wikipedia article the protein that Vibrio cholerae produces "causes profuse, watery diarrhea". The increase in GO terms that cause this diarrhea-causing protein to form shows that the patient Vibrio cholerae causes more diarrhea than the lab Vibrio cholerae.
  • I saved the file with the .gmf extension to this page.
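The four Autofilter criteria I set on the MAPPFinder results file above (Z Score, PermuteP, Number Changed, Percent Changed) can be expressed as one test per row. This is a minimal Python sketch; the function name and the example rows are made up and are not results from my file.

```python
def passes_filters(term):
    """Apply the four MAPPFinder result filters to one row of the results table."""
    return (
        term["Z Score"] > 2
        and term["PermuteP"] < 0.05
        and 5 <= term["Number Changed"] < 100
        and term["Percent Changed"] >= 26
    )


# Hypothetical rows, keyed by the column headers used for filtering.
rows = [
    {"GO Name": "term A", "Z Score": 3.1, "PermuteP": 0.01,
     "Number Changed": 8, "Percent Changed": 40.0},
    {"GO Name": "term B", "Z Score": 2.5, "PermuteP": 0.03,
     "Number Changed": 3, "Percent Changed": 50.0},
]
kept = [row["GO Name"] for row in rows if passes_filters(row)]
print(kept)
```

In this example "term B" is dropped because fewer than 5 genes changed, even though its Z Score and PermuteP pass.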

Conclusion

  • Write a paragraph that briefly summarizes and gives a scientific conclusion for the work that you did for part 1 and 2 this week.
  • In Part One of this assignment I downloaded and formatted data so that it could be interpreted by GenMAPP. This involved manipulating the data to get the values I needed in order to compare gene expression between the patient samples of Vibrio cholerae and the lab samples. In Part Two of this assignment, we used our newly created files as well as GenMAPP and MAPPFinder to find out what the increases and decreases in gene expression meant in relation to the pathogenicity of the Vibrio cholerae. I specifically analyzed the genes that had significant increases in expression. Through outside resources, I was able to find that the increases in gene expression that were apparent caused an increase in protein synthesis in the Vibrio cholerae found in the patients. I then found that the increase in protein signified an increase in at least one effect on a human suffering from this bacterium: the diarrhea. This analysis led me to conclude that the Vibrio cholerae from the patients was more dangerous and would affect humans worse than the Vibrio cholerae from the lab.


Files

  • All necessary files are included within the zipped folder.

File:Vibrio cholerae MA 2015.zip


Team Page

Heavy Metal HaterZ
