Overview of Microarray Data Analysis

Lab Journal

Part 1

Accessed BIOL398-01:Bioinformatics Laboratory
Downloaded the Merrell_Compiled_Raw_Data_Vibrio.xls file to Desktop.
File renamed Merrell_Compiled_Raw_Data_Vibrio_SL_10102013.xls to reflect ownership and date
Inserted a new Worksheet into Excel file, and named it "scaled_centered"
Copied data from "compiled_raw_data" worksheet into "scaled_centered" worksheet
Inserted two rows in between the top row of headers and the first data row.
- Typed "Average" in cell A2 and "StdDev" in cell A3
Typed "=AVERAGE(B4:B5224)" in cell B2.
Typed "=STDEV(B4:B5224)" in cell B3.
Copied both equations in cells B2 and B3 and pasted them into the empty cells in the rest of the columns.
Copied the column headings for all data columns and then pasted them to the right of the last data column.
Edited the names of the columns to A1_scaled_centered, A2_scaled_centered, etc.
Typed "=(B4-B$2)/B$3" in cell N4.
Copied and pasted equation into the entire column.
- Copied and pasted the scaling and centering equation for each of the columns of data with the "_scaled_centered" column header.
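The scaling and centering steps above amount to a per-column z-score. The following is a minimal Python sketch of that calculation, assuming the log ratios for one chip are held in a plain list with no missing values. The function name and the sample values are illustrative and not taken from the dataset.

```python
from statistics import mean, stdev

def scale_and_center(column):
    """Return (x - average) / stdev for each value, like =(B4-B$2)/B$3."""
    avg = mean(column)   # Excel: =AVERAGE(B4:B5224)
    sd = stdev(column)   # Excel: =STDEV(B4:B5224), the sample standard deviation
    return [(x - avg) / sd for x in column]

# Hypothetical log ratios for one chip (not real data)
example = [0.5, -1.0, 1.5, 0.0, -1.0]
scaled = scale_and_center(example)
```

After this step each column has an average of 0 and a standard deviation of 1, which is what makes the chips comparable to one another.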
Created new worksheet "statistics".
Copied the first column ("ID") of "scaled_centered" worksheet and pasted the data into the first column of "statistics" worksheet.
Copied the columns that are designated "_scaled_centered" of "scaled_centered" worksheet.
Clicked on the B1 cell. Selected "Paste Special" from the Edit menu. Clicked on the radio button for "Values" and clicked OK.
Typed the headers "Avg_LogFC_A", "Avg_LogFC_B", and "Avg_LogFC_C" into the top cells of the next three columns.
Typed "=AVERAGE(B2:E2)" into cell N2. Copied equation and pasted it into the rest of the column.
Typed equation for Patients B & C and pasted it into their columns
Typed the header "Avg_LogFC_all" into the first cell in the next empty column. Created equation to compute the average of the three previous averages and pasted it into entire column.
Inserted a new column next to the "Avg_LogFC_all" column. Labeled the column "Tstat". Typed "=AVERAGE(N2:P2)/(STDEV(N2:P2)/SQRT(number of replicates))" Copied the equation and pasted it into all rows in that column.
Labeled the top cell in the next column "Pvalue". In the cell below the label typed "=TDIST(ABS(R2),degrees of freedom,2)" Copied the equation and pasted it into all rows in that column.
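The Tstat and Pvalue formulas above can be sketched in Python as a one-sample t test against zero. This is only an approximation of what Excel does: the two-tailed p-value is obtained by numerically integrating the Student's t density with Simpson's rule, and the degrees of freedom are assumed to be the number of replicates minus one, which the journal entry does not state. All function names and the sample values are hypothetical.

```python
import math
from statistics import mean, stdev

def t_statistic(values):
    """One-sample t statistic against zero: mean / (stdev / sqrt(n))."""
    n = len(values)
    return mean(values) / (stdev(values) / math.sqrt(n))

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value, like =TDIST(ABS(t), df, 2), via Simpson's rule."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # integral of the density from 0 to |t|
    return max(0.0, 1 - 2 * area)

# Hypothetical per-patient averages for one gene (not real data)
averages = [1.0, 2.0, 3.0]
t = t_statistic(averages)
p = two_tailed_p(t, df=len(averages) - 1)  # assumes df = n - 1
```

If all replicate values are identical the standard deviation is zero and the division fails, which corresponds to the #DIV/0! error Excel shows for the same row.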
Created new worksheet "forGenMAPP".
Selected All and Copy on the "statistics" worksheet
Clicked on cell A1 of "forGenMAPP" and selected Paste Special, clicked on the Values radio button, and clicked OK.
Selected Columns B through Q (all the fold changes). Selected the menu item Format > Cells. Under the number tab, selected 2 decimal places. Clicked OK.
Selected Columns R and S. Selected the menu item Format > Cells. Under the number tab, selected 4 decimal places. Clicked OK.
Selected Columns N through S and Cut. Selected Column B by left-clicking on the "B" at the top of the column. Then right-clicked on the Column B header and selected "Insert Cut Cells"
Deleted Rows 2 and 3 where it says "Average" and "StdDev".
Inserted a column to the right of the "ID" column. Typed the header "SystemCode" into the top cell of this column. Filled the entire column (each cell) with the letter "N".
Selected the menu item File > Save As, and chose "Text (Tab-delimited) (*.txt)" from the file type drop-down menu.
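The last two formatting steps (adding the "SystemCode" column and saving as tab-delimited text) could also be done in a script. The sketch below assumes the rows have already been formatted as strings; the function name, the gene ID, and the values are made up for illustration.

```python
import csv
import io

def to_genmapp_text(header, rows, system_code="N"):
    """Insert a SystemCode column after ID and return tab-delimited text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([header[0], "SystemCode"] + list(header[1:]))
    for row in rows:
        # Every row gets the same system code, as in the spreadsheet
        writer.writerow([row[0], system_code] + list(row[1:]))
    return out.getvalue()

# Hypothetical header and row; the ID and values are not real data
header = ["ID", "Avg_LogFC_all", "Pvalue"]
rows = [["VC0001", "0.31", "0.0420"]]
text = to_genmapp_text(header, rows)
```

Writing `text` to a file with a .txt extension would give the same kind of file that the Save As step produces.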
Uploaded both the .xls and .txt files to journal page in the class wiki.

Part 2

Launched GenMAPP
Downloaded the new Vc-Std_External_20101022.gdb Gene Database from the XMLPipeDB SourceForge download page.
Clicked on the link for the Gene Database, downloaded the file, and saved it into the folder C:\GenMAPP 2 Data\Gene Databases, and extracted it.

Launched the GenMAPP Program.
Looked in the lower, left-hand corner of the main GenMAPP Drafting Board window to see the name of the Gene Database that is loaded.
Selected the Data menu from the main Drafting Board window and chose Expression Dataset Manager from the drop-down list.
Selected New Dataset from the Expression Datasets menu. Selected the tab-delimited text file (.txt) in the procedure above from the file dialog box that appears.
Allowed the Expression Dataset Manager to convert data.
After a few minutes, the converted dataset was active in the Expression Dataset Manager window and the file was saved in the same folder the raw data file was in, named the same except with a .gex extension
A message appeared saying that the Expression Dataset Manager could not convert one or more lines of data.
- 121 errors were detected. This was far fewer than the 722 errors that were found in my partner's dataset. This was likely because my partner was using an older version of the Vibrio cholerae database that had not been recently proofread.
Uploaded exceptions file: EX.txt to wiki page.
Customized the new Expression Dataset by creating new Color Sets, which contain the instructions to GenMAPP for displaying data on MAPPs.
- Red = Increased expression, Blue = Decreased expression, Gray = No change, White = No data
Selected "Avg_LogFC_all" for Gene Value.
Activated the Criteria Builder by clicking the "New" button.
Entered a name for the criterion in the Label in Legend field.
Created and named two criteria and chose a color for each: "increased" colored red and "decreased" colored blue.
Set the "increased" criterion as Avg_LogFC_all greater than 0.25 and a p-value less than 0.05 {([Avg_LogFC_all]>0.25 AND [Pvalue]<0.05)}
Set the "decreased" criterion as Avg_LogFC_all less than -0.25 and a p-value less than 0.05 {([Avg_LogFC_all]<-0.25 AND [Pvalue]<0.05)}
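The two criteria above can be expressed as a small Python function. This is a sketch of the logic only, not of how GenMAPP evaluates criteria internally; the function and parameter names are hypothetical, and the cutoffs are the ones recorded in this journal.

```python
def classify(avg_logfc_all, pvalue, cutoff=0.25, alpha=0.05):
    """Label a gene using the two criteria entered in the Criteria Builder."""
    if pvalue < alpha and avg_logfc_all > cutoff:
        return "increased"   # shown in red
    if pvalue < alpha and avg_logfc_all < -cutoff:
        return "decreased"   # shown in blue
    return "no change"       # shown in gray

# Hypothetical values for three genes (not real data)
labels = [classify(0.40, 0.01), classify(-0.30, 0.03), classify(0.40, 0.20)]
```

Note that a gene with a large fold change but a p-value of 0.05 or more falls into "no change", because both conditions must hold.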
Selected Save from Expression Dataset menu, saved as .gex file
Launched MAPPFinder
Chose "calculate new results"
Chose "find file" and selected the saved .gex file from previous steps
Selected the "increased" criterion in the right-hand box and checked the boxes for "Gene Ontology" and "p-value"
Clicked "browse" and saved file
Clicked "run MAPPFinder"
Clicked "show ranked list"
Top Ten Gene Ontology Terms:
- Branched chain family amino acid metabolic process
- Branched chain family amino acid biosynthetic process
- IMP metabolic process
- IMP biosynthetic process
- Purine nucleoside monophosphate metabolic process
- Purine ribonucleoside monophosphate biosynthetic process
- Purine ribonucleoside monophosphate metabolic process
- Purine nucleoside monophosphate biosynthetic process
- ‘de novo’ IMP biosynthetic process
- Arginine metabolic process
These search results were different from what my partner had come up with. This was most likely because my Gene Database was more up to date than the one my partner used.
Clicked on the button "Collapse the Tree" in the main MAPPFinder Browser window. Searched for the genes that were mentioned by Merrell et al. (2002), VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583.
Typed the identifier for one of these genes into the MAPPFinder browser gene ID search field. Chose "OrderedLocusNames" from the drop-down menu to the right of the search field.
Clicked on the GeneID Search button. The GO terms associated with that gene were highlighted in blue. Listed the GO terms associated with each of those genes below.
GO Terms Search Results:
- VC0028: Branched chain family amino acid biosynthetic process, Cellular amino acid biosynthetic process, Metabolic process, Metal ion binding, Iron-Sulfur cluster binding, 4 iron, 4 sulfur cluster binding, Catalytic activity, Lyase activity, Dihydroxy-acid dehydratase activity
- VC0941: Glycine Metabolic Process, L-serine Metabolic Process, One-Carbon Metabolic Process, Cytoplasm, Pyridoxal Phosphate Binding, Catalytic Activity, Transferase Activity, Glycine Hydroxymethyltransferase Activity
- VC0869: Glutamine Metabolic Process, Purine Nucleotide Biosynthetic Process, 'de novo' IMP Biosynthetic Process, Cytoplasm, Nucleotide Binding, ATP binding, Catalytic Activity, Ligase Activity, Phosphoribosylformylglycinamidine Synthase Activity
- VC0051: Purine Nucleotide Biosynthetic Process, 'de novo' IMP Biosynthetic Process, Nucleotide Binding, ATP Binding, Catalytic Activity, Lyase Activity, Carboxy-lyase Activity, Phosphoribosylaminoimidazole
- VC0647: mRNA Catabolic Process, RNA Processing, Cytoplasm, Mitochondrion, RNA Binding, 3'-5'-exoribonuclease Activity, Transferase Activity, Nucleotidyltransferase Activity, Polyribonucleotide Nucleotidyltransferase Activity
- VC0468: Glutathione Biosynthetic Process, Metal Ion Binding, Nucleotide Binding, ATP Binding, Catalytic Activity, Ligase Activity, Glutathione Synthase Activity
- VC2350: Deoxyribonucleotide Catabolic Process, Metabolic Process, Cytoplasm, Catalytic Activity, Lyase Activity, Deoxyribose-phosphate aldolase Activity
- VCA0583: Transport, Outer Membrane-Bounded Periplasmic Space, Transporter Activity
- My search results were much more extensive than my partner's. This was probably because my database was more recently updated and contained information about genes that had not been annotated at the time my partner's database was built.
Went to UniProt site and entered "VC0941" into the search box
Clicked on entry Q9KTG1
Clicked on GO term "one-carbon metabolic process". Gene expression did not change significantly
- The function of VC0941 is to catalyze the reversible interconversion of serine and glycine with tetrahydrofolate (THF) serving as the one-carbon carrier. This reaction supplies the one-carbon groups required for the biosynthesis of purines, thymidylate, methionine, and other important biomolecules. VC0941 also exhibits THF-independent aldolase activity toward beta-hydroxyamino acids, producing glycine and aldehydes through a retro-aldol mechanism.
Launched Microsoft Excel. Opened the copies of the .txt files in Excel. Clicked "Show all files" and clicked "Finish"
- Results differed extensively from my partner's. The only results that were shared were the number of probes and the number of probes that met the [Avg_LogFC_all]>0.25 and [Pvalue]<0.05 criterion. As previously stated, since my database was recently updated, it had information that was not available in my partner's database.
Clicked on a cell in the row of headers for the data. Under the Data menu, clicked "Filter > Autofilter". Clicked on the drop-down arrow for each column to be filtered and selected "(Custom…)". Selected two filters:
- Z Score (in column N) greater than 2
- PermuteP (in column O) less than 0.05
Set conditions for filters:
- Number Changed (in column I) greater than or equal to 4 or 5 AND less than 100
- Percent Changed (in column L) greater than or equal to 25-50%
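The four filters above can be combined into a single test per row. The sketch below assumes each row of the MAPPFinder results is available as a dictionary keyed by column name. The "GO Name" key, the row values, and the specific choices of 4 for Number Changed and 25 for Percent Changed are assumptions picked from the ranges listed above.

```python
def passes_filters(row, min_changed=4, max_changed=100, min_percent=25.0):
    """Return True if a MAPPFinder result row meets all four filters."""
    return (
        float(row["Z Score"]) > 2
        and float(row["PermuteP"]) < 0.05
        and min_changed <= int(row["Number Changed"]) < max_changed
        and float(row["Percent Changed"]) >= min_percent
    )

# Hypothetical result rows (not real MAPPFinder output)
results = [
    {"GO Name": "term A", "Z Score": "3.1", "PermuteP": "0.01",
     "Number Changed": "6", "Percent Changed": "40.0"},
    {"GO Name": "term B", "Z Score": "1.2", "PermuteP": "0.30",
     "Number Changed": "8", "Percent Changed": "30.0"},
    {"GO Name": "term C", "Z Score": "4.0", "PermuteP": "0.00",
     "Number Changed": "150", "Percent Changed": "60.0"},
]
kept = [r for r in results if passes_filters(r)]
```

Here only "term A" is kept: "term B" fails the Z Score and PermuteP filters, and "term C" has too many changed genes to be a specific term.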
Saved changes to an Excel spreadsheet. Selected File > Save As and chose Excel workbook (.xls) from the drop-down menu.
Uploaded .xls file to journal page.
With respect to the pathogenicity of V. cholerae, the GO terms that appeared make sense in how they can improve the function and survivability of the organism. Processes such as the Glycine Metabolic Process, L-serine Metabolic Process, and One-Carbon Metabolic Process are all important for maintaining the organism's metabolism and keeping it alive while it is in an outside environment. Catalytic Activity, Transferase Activity, and Glycine Hydroxymethyltransferase Activity all seem to relate to the replication of V. cholerae DNA. This is essential for a pathogenic cell, which needs proper DNA replication in order to infect host cells.