Blitvak Week 3

From LMU BioDB 2015

Individual Journal Assignment Week 3

Initial Preparations

  • PuTTY was downloaded, installed, and initialized
  • Connected to my.cs.lmu.edu workstation via PuTTY
  • Entered ~dondi/xmlpipedb/data using cd ~dondi/xmlpipedb/data
  • All results were checked using the ExPASy Translate Tool and the Nucleic Acid Sequence Massager, provided by Attotron
  • prokaryote.txt in ~dondi/xmlpipedb/data was examined using cat prokaryote.txt
  • prokaryote.txt was chosen for use in the first part of this assignment
  • The sequence in prokaryote.txt was copied and pasted into a separate file for future reference and checking (additionally, I found that text can be copied by highlighting and right-clicking)
  • The key goal of this first segment is to find the complementary strand of the sequence in prokaryote.txt. This should be accomplished by applying the base-pairing rules A-T and C-G
  • The key command in this assignment is sed; various kinds of pattern replacements, combined together, can prove very powerful (they should allow me to convert DNA to mRNA, and mRNA to an amino acid sequence)

Finding the Complementary Strand

  • sed "y/atcg/tagc/" was found to replace all lowercase a's, t's, c's, and g's with t's, a's, g's, and c's respectively (in lines of text); this command should allow me to find the complementary strand
  • The complementary strand of the nucleotide sequence in prokaryote.txt was found by using cat prokaryote.txt | sed "y/atcg/tagc/"
  • The given nucleotide sequence was:
tctactatatttcaataggtacgatggccaaagaagacaatattgaacttgaaacgttgcctaataccatgttccgcgtataacccagccgccagttccgctggcggcattttaac
  • The complementary strand, using cat prokaryote.txt | sed "y/atcg/tagc/", was found to be:
agatgatataaagttatccatgctaccggtttcttctgttataacttgaactttgcaacggattatggtacaaggcgcatattgggtcggcggtcaaggcgaccgccgtaaaattg
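
As a quick check of the transliteration on a smaller input, the sketch below wraps the same sed command in a shell function and applies it to the first twelve bases of the sequence above. The function name complement is an illustrative choice and not part of the assignment.

```shell
# complement: hypothetical helper name; reads lowercase DNA on stdin
# and swaps a<->t and c<->g one character at a time.
complement() {
    sed "y/atcg/tagc/"
}

# First twelve bases of the sequence in prokaryote.txt
echo "tctactatattt" | complement
```

This prints agatgatataaa, which agrees with the first twelve bases of the complementary strand listed above.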

Finding the 6 Reading Frames of prokaryote.txt

Initial Findings

  • While still in ~dondi/xmlpipedb/data, genetic-code.sed was examined using cat genetic-code.sed
  • genetic-code.sed was found to contain all of the sed replacement commands needed to convert any mRNA triplet to an amino acid
  • The large number of sed replacement commands in genetic-code.sed made it apparent that linking them all together in one pipeline would be difficult and tedious. Ideally, all of genetic-code.sed would be applied in a single command
  • cat prokaryote.txt | sed "s/^.//g" was found to remove the first letter from the nucleotide sequence (would be useful in finding the +2 and -2 reading frames, as that involves omitting the first sequence letter)
  • cat prokaryote.txt | sed "s/^..//g" was found to remove the first two letters from the nucleotide sequence (would be useful in finding the +3 and -3 reading frames, as that involves omitting the first two sequence letters)
  • cat prokaryote.txt | sed "s/.../ & /g" was found to make the nucleotide sequence a set of triplets, with spaces in between each (this makes the codons distinct from each other and allows them to be clear and readable for the program; this should be tied to a use of genetic-code.sed)
  • rev prokaryote.txt was found to reverse the sequence (changes the direction from 5' - 3' to 3' - 5', or vice versa; this should be useful in making the template strand run from 5' - 3' prior to working with it for its reading frames)
  • It was assumed that the sequence in prokaryote.txt ran from 5' to 3'
  • sed "s/[atcg]//g" was found to delete every lowercase a, t, c, and g (should allow the removal of any leftover letters that did not form codons and, thus, did not lead to an amino acid)
  • sed "y/t/u/" was found to replace every lowercase t with u; would be useful in converting a DNA sequence into RNA
  • It was realized that a file with a set of sed commands could be exploited by using sed -f <filename>; this command should pair well with genetic-code.sed!
  • I figured out that sed "s/.../ & /g" would eventually lead to an amino acid sequence with spaces in between each letter. sed "s/ //g" was found to delete any spaces (would be good to place it after the codons are converted to a sequence of amino acids)
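
The building blocks above can be tried one at a time on a short input before they are chained together. This is a minimal sketch that uses the first twelve bases of the sequence; the function names are made up for illustration, and only the sed commands themselves come from the notes above.

```shell
# Illustrative helper names (assumptions); each reads one line of
# lowercase DNA on stdin and writes the result to stdout.
drop_first()   { sed "s/^.//"; }       # shift by one base (+2 / -2 frames)
drop_two()     { sed "s/^..//"; }      # shift by two bases (+3 / -3 frames)
to_triplets()  { sed "s/.../ & /g"; }  # surround each codon with spaces
dna_to_rna()   { sed "y/t/u/"; }       # t -> u
strip_spaces() { sed "s/ //g"; }       # undo the spacing

seq="tctactatattt"   # first twelve bases of prokaryote.txt

echo "$seq" | drop_first                   # ctactatattt
echo "$seq" | drop_two                     # tactatattt
echo "$seq" | to_triplets                  # " tct  act  ata  ttt "
echo "$seq" | dna_to_rna                   # ucuacuauauuu
echo "$seq" | to_triplets | strip_spaces   # tctactatattt
```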

Finding the Reading Frames of the mRNA-like strand (5'-3')

  • +1 reading frame was found by using: cat prokaryote.txt | sed "s/.../ & /g" | sed "y/t/u/" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[atcg]//g"
    • Output: STIFQ-VRWPKKTILNLKRCLIPCSAYNPAASSAGGIL
  • +2 reading frame was found by using: cat prokaryote.txt | sed "s/^.//g" | sed "s/.../ & /g" | sed "y/t/u/" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[atcg]//g"
    • Output: LLYFNRYDGQRRQY-T-NVA-YHVPRITQPPVPLAAF-
  • +3 reading frame was found by using: cat prokaryote.txt | sed "s/^..//g" | sed "s/.../ & /g" | sed "y/t/u/" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[atcg]//g"
    • Output: YYISIGTMAKEDNIELETLPNTMFRV-PSRQFRWRHFN
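
The three pipelines differ only in how many leading bases are dropped, so they could be folded into one function that takes the offset as a parameter. The sketch below is one possible way to do that, under several assumptions: translate_frame is a hypothetical name, and the rules file is a small stand-in for genetic-code.sed (whose contents are not reproduced in this journal) that assumes one s/codon/letter/g rule per line and covers only the eight codons needed for the first twelve bases. The final clean-up uses [acgu] rather than [atcg], a small variation, because leftover t's have already become u's by that point.

```shell
# Stand-in for genetic-code.sed (ASSUMPTION: the real file's format may
# differ). Only the codons needed for the short test sequence are listed.
rules=$(mktemp)
cat > "$rules" <<'EOF'
s/ucu/S/g
s/acu/T/g
s/aua/I/g
s/uuu/F/g
s/cua/L/g
s/uau/Y/g
s/uac/Y/g
s/auu/I/g
EOF

# translate_frame OFFSET RULES_FILE (hypothetical helper)
# Drops OFFSET leading bases, splits into codons, converts to RNA,
# applies the codon rules, then removes spaces and leftover bases.
translate_frame() {
    sed "s/^.\{$1\}//" |
        sed "s/.../ & /g" |
        sed "y/t/u/" |
        sed -f "$2" |
        sed "s/ //g" |
        sed "s/[acgu]//g"
}

seq="tctactatattt"   # first twelve bases of prokaryote.txt
for offset in 0 1 2; do
    echo "$seq" | translate_frame "$offset" "$rules"
done
# The temporary rules file can be deleted afterwards with: rm -f "$rules"
```

The loop prints STIF, LLY, and YYI, which match the first residues of the +1, +2, and +3 outputs above.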

Finding the Reading Frames of the template strand (3'-5')

  • -1 reading frame was found by using: rev prokaryote.txt | sed "y/atcg/tagc/" | sed "s/.../ & /g" | sed "y/t/u/" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[atcg]//g"
    • Output: VKMPPAELAAGLYAEHGIRQRFKFNIVFFGHRTY-NIV
  • -2 reading frame was found by using: rev prokaryote.txt | sed "y/atcg/tagc/" | sed "s/^.//g" | sed "s/.../ & /g" | sed "y/t/u/" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[atcg]//g"
    • Output: LKCRQRNWRLGYTRNMVLGNVSSSILSSLAIVPIEI--
  • -3 reading frame was found by using: rev prokaryote.txt | sed "y/atcg/tagc/" | sed "s/^..//g" | sed "s/.../ & /g" | sed "y/t/u/" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[atcg]//g"
    • Output: -NAASGTGGWVIRGTWY-ATFQVQYCLLWPSYLLKYSR
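
The step that distinguishes the negative frames is the reverse complement, produced by rev followed by the complement transliteration. A minimal sketch of just that step follows, with reverse_complement as a hypothetical function name, applied to the last six bases of the sequence in prokaryote.txt.

```shell
# reverse_complement: hypothetical helper; reads lowercase DNA on stdin.
# rev reverses the line, then sed swaps each base for its partner.
reverse_complement() {
    rev | sed "y/atcg/tagc/"
}

# Last six bases of the sequence in prokaryote.txt
echo "tttaac" | reverse_complement
```

This prints gttaaa. As RNA those two codons are guu and aaa, which code for V and K, consistent with the start of the -1 output above.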

Checking Results

  • Using the ExPASy Translate Tool, tctactatatttcaataggtacgatggccaaagaagacaatattgaacttgaaacgttgcctaataccatgttccgcgtataacccagccgccagttccgctggcggcattttaac was entered and converted into the possible sequences of amino acids (output format was selected as compact). The 6 reading frames, as given by this tool, matched those found in the assignment

XMLPipeDB Match Practice

Preparations

  • The program xmlpipedb-match-1.1.1.jar was found in ~dondi/xmlpipedb/data
  • It was found that Java programs packaged as .jar files can be run by using java -jar <program name>
  • xmlpipedb-match-1.1.1.jar would be run, for the purpose of matching patterns, by using java -jar xmlpipedb-match-1.1.1.jar <pattern> < <filename>
  • 493.P_falciparum.xml was found in ~dondi/xmlpipedb/data and examined using cat 493.P_falciparum.xml; it took quite some time to fully load (viewing using more seems like a wonderful idea)
  • I figured out that there is a search function in more, initiated by typing /<search_text> and pressing enter

Working with XMLPipeDB Match

  1. Match command for the tallying of the occurrences of the pattern GO:000[567] in 493.P_falciparum.xml
    • java -jar xmlpipedb-match-1.1.1.jar GO:000[567] < 493.P_falciparum.xml can be used to match occurrences of GO:0005, GO:0006, and GO:0007
    • 3 total unique matches were found: GO:0005, GO:0006, and GO:0007
    • Occurrences of each unique match: 113 for GO:0007, 1100 for GO:0006, and 1371 for GO:0005
  2. Observing "in situ" occurrences of GO:000[567] in 493.P_falciparum.xml
    • more 493.P_falciparum.xml was used to make the viewing of the file more manageable
    • While in more, by typing /GO:0006 and pressing enter, a line containing the pattern GO:0006 was present at the top of the window (surrounded by the file's text)
    • Based on the surrounding text, the pattern likely represents the beginning portion of an ID string tied to various genes in a gene database for Plasmodium falciparum
    • In the text, it was found that various processes/metabolic pathways are connected to each database ID string (likely influenced by the genes in question)
  3. Match command for the tallying of the occurrences of the pattern \"Yu.*\" in 493.P_falciparum.xml
    • 3 total unique matches were found: "yu b.", "yu k.", and "yu m."
    • Occurrences of each unique match: 1 for "yu b.", 228 for "yu k.", and 1 for "yu m.".
    • I'm fairly certain that this pattern represents a person's name. By using more 493.P_falciparum.xml, and typing /\"Yu.*\", an example of an in-text line containing this pattern was found. It was observed that this pattern is preceded by <person name=
  4. Using Match and grep + wc to count occurrences of the pattern ATG in hs_ref_GRCh37_chr19.fa
    • hs_ref_GRCh37_chr19.fa was found in ~dondi/xmlpipedb/data
    • java -jar xmlpipedb-match-1.1.1.jar ATG < hs_ref_GRCh37_chr19.fa was employed to find the instances of ATG via Match
      • Output: 1 unique match, atg, was found. There are 830101 instances of atg in the file
    • grep "ATG" hs_ref_GRCh37_chr19.fa | wc was used to find the instances of ATG using grep + wc
      • Output: 502410 lines, 502410 words, and 35671048 characters (the output of grep + wc is unlabeled; it is always lines, words, and characters from left to right)
    • There is a large difference between the outputs of Match and grep + wc when counting occurrences of ATG. The difference arises because Match counts every individual instance of the ATG pattern (possibly several in a line), while grep + wc only counts the lines that contain at least one instance of ATG. grep + wc reports the same number for lines and words because each output line of grep contains no spaces or breaks, so wc sees each line as a single word
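
The gap between counting lines and counting occurrences can be reproduced on a tiny file. The sketch below is an illustration under stated assumptions: the four-line sample is made up, and it relies on the -o option (print each match on its own line), which GNU and BSD grep support but which is not part of POSIX. The last command adds -i because Match reported its result in lowercase (atg); whether Match actually ignores case is an assumption, not something confirmed in this journal.

```shell
# Made-up sample data (not taken from hs_ref_GRCh37_chr19.fa)
sample=$(mktemp)
cat > "$sample" <<'EOF'
ATGATG
CCATGC
GGGCCC
atgATG
EOF

# Lines that contain at least one ATG (what grep + wc counts)
grep "ATG" "$sample" | wc -l

# Every individual ATG, one per output line
grep -o "ATG" "$sample" | wc -l

# Every individual ATG or atg, ignoring case
grep -o -i "ATG" "$sample" | wc -l
```

On this sample the three counts are 3, 4, and 5: the first line holds two matches but is counted once by the plain grep, and the lowercase atg on the last line is only picked up when case is ignored.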

Brandon Litvak
BIOL 367, Fall 2015
