Rlegaspi Week 3

From LMU BioDB 2015

Individual Journal Assignment

Homework Partner

Kevin Wyllie

The Genetic Code, by Computer

Connect to the my.cs.lmu.edu workstation as shown in class and do the following exercises from there.

For these exercises, two files are available in the Keck lab system for practice; of course, you can always make your own sequences up. The practice files are ~dondi/xmlpipedb/data/prokaryote.txt and ~dondi/xmlpipedb/data/infA-E.coli-K12.txt.

Complement of a Strand

Write a sequence of piped text processing commands that, when given a nucleotide sequence, returns its complementary strand. In other words, fill in the question marks:

   cat sequence_file | ?????

Sequence file: ~dondi/xmlpipedb/data/prokaryote.txt

tctactatatttcaataggtacgatggccaaagaagacaatattgaacttgaaacgttgcctaataccatgttccgcgtataacccagccgccagttccgctggcggcattttaac

Sequence of Piped Text Processing Commands:

cat prokaryote.txt | sed "y/atcg/tagc/"

Result of Text Processing Commands: The Complementary Strand of the Nucleotide Sequence (3'-5' direction) from ~dondi/xmlpipedb/data/prokaryote.txt

agatgatataaagttatccatgctaccggtttcttctgttataacttgaactttgcaacggattatggtacaaggcgcatattgggtcggcggtcaaggcgaccgccgtaaaattg
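
As a quick sanity check, the same sed transliteration can be tried on a short made-up sequence before running it on the practice file. This is only a sketch: it assumes a lowercase, single-line sequence, and the complement function name and the test sequence atcggc are hypothetical, chosen for illustration.

```shell
# Complement a lowercase DNA sequence read from standard input.
# (Hypothetical helper; the sed command is the same one used above.)
complement() {
  sed "y/atcg/tagc/"
}

# Made-up test sequence, not taken from the practice files.
echo "atcggc" | complement          # prints tagccg (complement, written 3'-5')
echo "atcggc" | complement | rev    # prints gccgat (same strand, written 5'-3')
```

Piping the result through rev is what turns the 3'-5' complement into the 5'-3' strand needed for the negative reading frames.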


Reading Frames

Write 6 sets of text processing commands that, when given a nucleotide sequence, returns the resulting amino acid sequence, one for each possible reading frame for the nucleotide sequence. In other words, fill in the question marks:

   cat sequence_file | ?????

Sequence file: ~dondi/xmlpipedb/data/prokaryote.txt

tctactatatttcaataggtacgatggccaaagaagacaatattgaacttgaaacgttgcctaataccatgttccgcgtataacccagccgccagttccgctggcggcattttaac 
  • +1 Reading Frame Amino Acid Sequence
S T I F Q - V R W P K K T I L N L K R C L I P C S A Y N P A A S S A G G I L 
Command Sequence: cat prokaryote.txt | sed "s/..$//g" | sed "y/t/u/" | sed "s/.../& /g" | sed -f genetic-code.sed 

  • +2 Reading Frame Amino Acid Sequence
L L Y F N R Y D G Q R R Q Y - T - N V A - Y H V P R I T Q P P V P L A A F - 
Command Sequence: cat prokaryote.txt | sed "s/^.//g" | sed "s/.$//g" | sed "y/t/u/" | sed "s/.../& /g" | sed -f genetic-code.sed 
  • +3 Reading Frame Amino Acid Sequence
Y Y I S I G T M A K E D N I E L E T L P N T M F R V - P S R Q F R W R H F N 
Command Sequence: cat prokaryote.txt | sed "s/^..//g" | sed "y/t/u/" | sed "s/.../& /g" | sed -f genetic-code.sed 
  • -1 Reading Frame Amino Acid Sequence
V K M P P A E L A A G L Y A E H G I R Q R F K F N I V F F G H R T Y - N I V 
Command Sequence: cat prokaryote.txt | sed "y/tagc/aucg/" | rev | sed "s/.../& /g" | sed "s/..$//g" | sed -f genetic-code.sed 
  • -2 Reading Frame Amino Acid Sequence
L K C R Q R N W R L G Y T R N M V L G N V S S S I L S S L A I V P I E I - - 
Command Sequence: cat prokaryote.txt | sed "y/tagc/aucg/" | rev | sed "s/^.//g" | sed "s/.$//g" |  sed "s/.../& /g" | sed -f genetic-code.sed 
  • -3 Reading Frame Amino Acid Sequence
- N A A S G T G G W V I R G T W Y - A T F Q V Q Y C L L W P S Y L L K Y S R
Command Sequence: cat prokaryote.txt | sed "y/tagc/aucg/" | rev | sed "s/^..//g" | sed "s/.../& /g" | sed -f genetic-code.sed
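
The six command sequences above hard-code how many nucleotides to trim for this particular sequence length. A possible generalization, sketched below, deletes whatever incomplete codon is left over after splitting, so it should work for any length. The function names forward_codons and reverse_codons and the 10-nucleotide test sequence are hypothetical; the sketch assumes a lowercase, single-line sequence and stops at the codon stage, since genetic-code.sed is only available on the class workstation.

```shell
# Split a lowercase, single-line DNA sequence (stdin) into mRNA codons for
# forward reading frame $1 (1, 2, or 3). Hypothetical helper for illustration.
forward_codons() {
  cut -c "$1"- |               # frame 1 keeps everything, frame 2 drops 1 nt, frame 3 drops 2
    sed "y/t/u/" |             # DNA alphabet -> RNA alphabet
    sed "s/.../& /g" |         # put a space after every complete codon
    sed "s/[^ ]\{1,2\}$//" |   # delete a leftover 1- or 2-nucleotide tail, if any
    sed "s/ $//"               # drop the trailing space
}

# Reverse frames: complement, reverse to 5'-3', then split exactly as above.
reverse_codons() {
  sed "y/atcg/tagc/" | rev | forward_codons "$1"
}

dna="atgaaataag"   # made-up 10-nucleotide sequence
for frame in 1 2 3; do
  echo "$dna" | forward_codons "$frame"
done
echo "$dna" | reverse_codons 1
```

On the workstation, appending | sed -f genetic-code.sed to any of these calls should translate the codons, just as in the command sequences above.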

Checking My Work: ExPASy Translate Tool

ExPASy Translate Tool Results for prokaryote.txt DNA Sequence - Thank you, Brandon Klein, for uploading the image.

  • Note: The amino acid sequences that I recovered from using text processing commands are a match to the ExPASy Translate Tool results of the same DNA sequence from the prokaryote.txt file.

XMLPipeDB Match Practice

For your convenience, the XMLPipeDB Match Utility (xmlpipedb-match-1.1.1.jar) has been installed in the ~dondi/xmlpipedb/data directory alongside the other practice files. Use this utility to answer the following questions:

  1. What Match command tallies the occurrences of the pattern GO:000[567] in the 493.P_falciparum.xml file? java -jar xmlpipedb-match-1.1.1.jar "GO:000[567]" < 493.P_falciparum.xml
    • How many unique matches are there?
      • There are 3 unique matches.
    • How many times does each unique match appear?
      • "go:0007" appears 113 times. "go:0006" appears 1100 times. "go:0005" appears 1371 times.
  2. Try to find one such occurrence “in situ” within that file. Look at the neighboring content around that occurrence.
    • Describe how you did this.
      • First, I ran grep "GO:0005" 493.P_falciparum.xml, which listed many similar XML lines of the form <dbReference type="GO" id="GO:0005....">. I then ran cat 493.P_falciparum.xml to view the entire file, which is very large, and scrolled through it manually to find the pattern in context (lots of scrolling).
    • Based on where you find this occurrence, what kind of information does this pattern represent?
      • First, the file appears to contain a database of genome sequences of the parasite Plasmodium falciparum, to whose study many people contributed. Looking closely at the pattern and where it sits in the file, there are many <dbReference type="..."> elements, which leads me to believe that the pattern is a way to show relationships between the many different genome sequences. <dbReference type="GO" id="GO:0005..."> is just one of the many ways to relate genome sequences to one another.
  3. What Match command tallies the occurrences of the pattern \"Yu.*\" in the 493.P_falciparum.xml file? java -jar xmlpipedb-match-1.1.1.jar "\"Yu.*\"" < 493.P_falciparum.xml
    • How many unique matches are there?
      • There are 3 unique matches.
    • How many times does each unique match appear?
      • "yu b." appears 1 time. "yu k." appears 228 times. "yu m." appears 1 time.
    • What information do you think this pattern represents?
      • The pattern matches the last name and first initial of a person who contributed to the sequencing of the human malaria parasite Plasmodium falciparum.
  4. Use Match to count the occurrences of the pattern ATG in the hs_ref_GRCh37_chr19.fa file (this may take a while). Then, use grep and wc to do the same thing.
    • What answer does Match give you?
      • Match reports 1 unique match, which makes sense because I searched only for "ATG" and not for something like "AT[GCTA]" (which would give 4 unique matches). The pattern ATG occurs 830101 times.
    • What answer does grep + wc give you?
      • grep + wc reports 502410 lines, 502410 words, and 35671048 characters for the lines that contain ATG.
    • Explain why the counts are different. (Hint: Make sure you understand what exactly is being counted by each approach.)
      • Match counts every occurrence of the pattern. Since there was just one unique match (ATG, in any mix of upper-case and lower-case letters), it reported 830101 occurrences. grep + wc, however, counts the lines of text that contain ATG, so a line with several ATGs is counted only once; that gives 502410, which is smaller than the Match count. In addition, Match is case-insensitive while grep is case-sensitive by default (it only finds ATG in all capital letters), which further limits what grep finds.
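
The difference between counting lines and counting occurrences can be seen on a tiny made-up file. This is a sketch rather than a rerun of the chr19 exercise: the file contents are hypothetical, and it assumes a grep that supports the -o and -i options (GNU and BSD grep both do).

```shell
# Build a tiny made-up sequence file (hypothetical data, not chr19).
demo=$(mktemp)
printf 'ATGATGCC\nCCATGAAA\natgTTTTT\nGGGGGGGG\n' > "$demo"

# Lines that contain at least one upper-case ATG (what grep + wc counts).
# tr strips the padding that some wc implementations add.
lines=$(grep "ATG" "$demo" | wc -l | tr -d ' ')

# Every individual upper-case ATG: -o prints each match on its own line.
hits=$(grep -o "ATG" "$demo" | wc -l | tr -d ' ')

# Every ATG regardless of case: -i makes grep case-insensitive.
hits_any_case=$(grep -o -i "ATG" "$demo" | wc -l | tr -d ' ')

echo "lines=$lines hits=$hits hits_any_case=$hits_any_case"
# prints: lines=2 hits=3 hits_any_case=4
rm -f "$demo"
```

The first line of the demo file holds two ATGs but is counted once by the line-based approach, and the lower-case atg on the third line only shows up when case is ignored.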

Electronic Lab Notebook

  1. To complete the first part of The Genetic Code, by Computer (Complement of a Strand), I first had to connect to the my.cs.lmu.edu workstation from my MacBook Pro laptop. Thankfully, my homework partner Kevin told me in class that we need to be connected to the LMU network [i.e. Student(Secure)], which saved a lot of frustration because I was planning to work from home. I understand that connecting via the LMU network is required because the computer that runs my.cs.lmu.edu is on campus. Turning to the assignment itself, I found it useful to write my thoughts out on paper before using my Terminal app and the my.cs.lmu.edu workstation. I chose the shorter of the two practice files. When thinking about the complementary strand of the DNA sequence, I figured that all of the nucleotides needed to be changed into their complements (a to t, t to a, c to g, g to c). From Dondi's Tuesday lecture, I remembered the command used to replace characters with desired characters: sed "y/ / /". The piped text processing commands cat prokaryote.txt | sed "y/atcg/tagc/" produced the output that I expected, which was the complementary strand.
  2. To complete the second part of The Genetic Code, by Computer (Reading Frames), I followed the same login steps as in the first part of the assignment. Producing the amino acid sequences of all 6 reading frames was challenging because it required applying most of the skills learned in class. I knew that I first needed to read the file (cat prokaryote.txt). Then I had to figure out how to separate the nucleotides into three-nucleotide codons (sed "s/.../& /g") [Thank you, Dr. Dionisio, for helping me remember this important step; otherwise the result of the piped text processing commands would have been a mix of nucleotides and amino acids]. Then I needed to change the t's to u's (sed "y/t/u/"). I also had to cut the last two nucleotides from the DNA sequence because they would not be translated (sed "s/..$//g"); I used variations of this command to cut the nucleotides at the end or beginning that would not be translated. Finally, I entered the command that uses the genetic-code.sed file to translate the codons into the amino acid sequence (sed -f genetic-code.sed). Up to this point, the sequence of commands covers the +1, +2, and +3 reading frames; the -1, -2, and -3 reading frames involved more steps. I needed a command that would give me the 3'-5' complementary strand (sed "y/tagc/aucg/"), and then I needed to get the complementary strand into the 5'-3' direction (rev). Then I used the same commands to cut the nucleotides that would not be translated and used the genetic-code.sed file to translate. I am not going to lie: this part of the assignment required plenty of trial and error, and I would not have reached the results I wanted without the help of Dr. Dionisio's Introduction to the Command Line and his advice via email; it also required a lot of critical thinking and playing around with the different commands we, as a class, were taught in lectures. I checked my work using the ExPASy Translate Tool online and was glad to see that my answers matched the translation tool's results.
  3. Electronic Lab notebook entry of match utility practice

Links to User Page and Journal Pages

Ron Legaspi
BIOL 367, Fall 2015
