Anuvarsh Week 3

From LMU BioDB 2015

The Genetic Code, by Computer

Connect to the my.cs.lmu.edu workstation as shown in class and do the following exercises from there.

I did so by running the following command:

   ssh avarshne@my.cs.lmu.edu

and then inputting my password.

I then created a folder for this class.

   mkdir biodb2015

I then created a sequence_file.txt file containing the example sequence:

   echo 'agcggtatac' >sequence_file.txt

I then moved to Dondi's repository and copied over some files using the following commands:

   cd ~dondi/xmlpipedb/data
   cp genetic-code.sed ~avarshne/biodb2015
   cp xmlpipedb-match-1.1.1.jar ~avarshne/biodb2015
   cp prokaryote.txt ~avarshne/biodb2015
   cp infA-E.coli-K12.txt ~avarshne/biodb2015
   cp 493.P_falciparum.xml ~avarshne/biodb2015
   cp hs_ref_GRCh37_chr19.fa ~avarshne/biodb2015


Complement of a Strand

Write a sequence of piped text processing commands that, when given a nucleotide sequence, returns its complementary strand. In other words, fill in the question marks:

   cat sequence_file | ?????

For example, if sequence_file contains:

   agcggtatac

Then your text processing commands should display:

   tcgccatatg

To do this, I first worked out the steps the computer needs to perform, in order.

  1. The contents of sequence_file.txt must be printed with cat so that the next command in the pipe can work on the text within that file.
  2. Each base (A, T, C, and G) must be replaced with its complementary base (T, A, G, and C).

These steps can be achieved with the following command, which produces the following result:

   cat "sequence_file.txt" | sed "y/atcg/tagc/" 
   tcgccatatg
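As a small extension of the command above, the same substitution can be wrapped in a shell function. This is only a sketch: the function names complement and reverse_complement are my own, and the handling of uppercase bases is an addition that the assignment did not ask for.

```shell
# complement: read nucleotide lines on stdin and write the complementary strand.
# The y command maps each character in the first list to the character at the
# same position in the second list; uppercase bases are handled as well.
complement() {
    sed "y/atcgATCG/tagcTAGC/"
}

# reverse_complement: complement, then reverse, so the result reads 5' to 3'.
reverse_complement() {
    complement | rev
}

printf 'agcggtatac\n' | complement          # tcgccatatg
printf 'AGCGGTATAC\n' | reverse_complement  # GTATACCGCT
```

The first call reproduces the result above; the second shows the extra rev step that the reverse reading frames need later on.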

Reading Frames

Write 6 sets of text processing commands that, when given a nucleotide sequence, returns the resulting amino acid sequence, one for each possible reading frame for the nucleotide sequence. In other words, fill in the question marks:

   cat sequence_file | ?????

In this case, the steps that the computer needs to complete are as follows:

  1. Print the contents of the sequence_file.txt file so that they can be piped into the next command.
    • cat "sequence_file.txt"
  2. Convert the sequence to RNA. For the +1, +2, and +3 frames, replace every "t" with "u". For the -1, -2, and -3 frames, replace each base (A, T, C, and G) with its complementary RNA base (U, A, G, and C), and then reverse the strand so that it reads 5' to 3'.
    • sed "s/t/u/g"
    • sed "y/atcg/uagc/" | rev
  3. Remove any necessary bases from the beginning of the sequence in order to start at the correct reading frame.
    • either not applicable, sed "s/^.//g", or sed "s/^..//g"
  4. Add a space after every codon (every 3 characters).
    • sed "s/.../& /g"
  5. Use the sed commands already written in the genetic-code.sed file to convert each codon into its corresponding amino acid.
    • sed -f genetic-code.sed
  6. Remove all of the spaces that were added between codons.
    • sed "s/ //g"
  7. Remove any leftover bases that were not part of a complete codon and so could not be translated.
    • sed "s/[aucg]//g"
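To see what steps 2 through 4 produce before any translation happens, they can be run on their own. The sketch below covers only the forward frames. The function name codons_for_frame is hypothetical, and the final sed that trims a trailing space is an addition made only so that the output is easier to read.

```shell
# codons_for_frame SKIP: read a DNA line on stdin, replace t with u,
# drop SKIP leading bases (0, 1, or 2), and put a space after every codon.
codons_for_frame() {
    sed "s/t/u/g" | sed "s/^.\{$1\}//" | sed "s/.../& /g" | sed "s/ *$//"
}

printf 'agcggtatac\n' | codons_for_frame 0   # agc ggu aua c
printf 'agcggtatac\n' | codons_for_frame 1   # gcg gua uac
printf 'agcggtatac\n' | codons_for_frame 2   # cgg uau ac
```

The trailing "c" and "ac" in the first and third lines are the leftover bases that step 7 removes after translation.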

Because we are looking at 6 different reading frames on that fragment of DNA, 6 different commands need to be written, one for each reading frame. Each of the following commands represents one reading frame and is followed by the resulting amino acid sequence.

+1

   cat "sequence_file.txt" | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[aucg]//g"
   SGI

+2

   cat "sequence_file.txt" | sed "s/t/u/g" | sed "s/^.//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[aucg]//g"
   AVY

+3

   cat "sequence_file.txt" | sed "s/t/u/g" | sed "s/^..//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[aucg]//g"
   RY

-1

   cat "sequence_file.txt" | sed "y/atcg/uagc/" | rev | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[aucg]//g"
   VYR

-2

   cat "sequence_file.txt" | sed "y/atcg/uagc/" | rev | sed "s/^.//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[aucg]//g"
   YTA

-3

   cat "sequence_file.txt" | sed "y/atcg/uagc/" | rev | sed "s/^..//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[aucg]//g"
   IP
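The six commands above differ only in the strand conversion and in how many leading bases are removed, so they could also be generated by a loop. The following is a sketch rather than what I actually ran. The function name translate_frames is hypothetical, and the sketch assumes that the sequence file holds a single line of lowercase DNA and that the sed script passed as the second argument (for example genetic-code.sed) turns space-separated lowercase RNA codons into one-letter amino acid codes.

```shell
# translate_frames SEQ_FILE CODE_SED
# Prints one line per reading frame (+1 +2 +3 -1 -2 -3) followed by the
# amino acid sequence for that frame.
translate_frames() {
    seq_file=$1
    code_sed=$2
    for strand in + -; do
        if [ "$strand" = "+" ]; then
            # Forward frames: replace t with u.
            rna=$(sed "s/t/u/g" "$seq_file")
        else
            # Reverse frames: complementary RNA bases, then reverse the strand.
            rna=$(sed "y/atcg/uagc/" "$seq_file" | rev)
        fi
        for skip in 0 1 2; do
            protein=$(printf '%s\n' "$rna" |
                sed "s/^.\{$skip\}//" |
                sed "s/.../& /g" |
                sed -f "$code_sed" |
                sed "s/ //g" |
                sed "s/[aucg]//g")
            printf '%s%d %s\n' "$strand" "$((skip + 1))" "$protein"
        done
    done
}

# Example call (run from the directory that holds both files):
#   translate_frames sequence_file.txt genetic-code.sed
```

Each pass through the inner loop runs the same pipeline as one of the six commands above, with the number of skipped bases as the only thing that changes.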

Check Your Work

I checked my work with the ExPASy Translate tool. I entered my original sequence in RNA form (agcgguauac) and received the following results from the translator:

  • 5'3' Frame 1: S G I
  • 5'3' Frame 2: A V Y
  • 5'3' Frame 3: R Y
  • 3'5' Frame 1: V Y R
  • 3'5' Frame 2: Y T A
  • 3'5' Frame 3: I P

XMLPipeDB Match Practice

For your convenience, the XMLPipeDB Match Utility (xmlpipedb-match-1.1.1.jar) has been installed in the ~dondi/xmlpipedb/data directory alongside the other practice files. Use this utility to answer the following questions:

  1. What Match command tallies the occurrences of the pattern GO:000[567] in the 493.P_falciparum.xml file?
    • java -jar xmlpipedb-match-1.1.1.jar GO:000[567] < 493.P_falciparum.xml
    • How many unique matches are there?
      • 3
    • How many times does each unique match appear?
      • go:0007: 113
      • go:0006: 1100
      • go:0005: 1371
  2. Try to find one such occurrence “in situ” within that file. Look at the neighboring content around that occurrence.
    • <dbReference type="GO" id="GO:0007010">
    • Describe how you did this.
      • grep "GO:000[567]" 493.P_falciparum.xml
    • Based on where you find this occurrence, what kind of information does this pattern represent?
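      • The pattern "GO:000[567]" matches the id attribute of a dbReference element whose type is GO, so it appears to represent a Gene Ontology term ID that a record in the file is cross-referenced to.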
  3. What Match command tallies the occurrences of the pattern \"Yu.*\" in the 493.P_falciparum.xml file?
    • java -jar xmlpipedb-match-1.1.1.jar \"Yu.*\" < 493.P_falciparum.xml
    • How many unique matches are there?
      • 3
    • How many times does each unique match appear?
      • "yu b.": 1
      • "yu k.": 228
      • "yu m.": 1
    • What information do you think this pattern represents?
      • I think this pattern represents people's names.
  4. Use Match to count the occurrences of the pattern ATG in the hs_ref_GRCh37_chr19.fa file (this may take a while). Then, use grep and wc to do the same thing.
    • java -jar xmlpipedb-match-1.1.1.jar ATG < hs_ref_GRCh37_chr19.fa
    • grep "ATG" hs_ref_GRCh37_chr19.fa | wc
    • What answer does Match give you?
      • Total matches: atg: 830101
      • Total unique matches: 1
    • What answer does grep + wc give you?
      • Lines: 502410
      • Words: 502410
      • Characters: 35671048
    • Explain why the counts are different. (Hint: Make sure you understand what exactly is being counted by each approach.)
      • Match counts every individual occurrence of the search pattern in the file: within hs_ref_GRCh37_chr19.fa, "ATG" was found 830,101 times. Grep, by contrast, works line by line and prints each line that contains at least one match, so a line with several occurrences of "ATG" is still printed only once. Wc then reports 3 statistics for grep's output: line count, word count, and character count, respectively. Within hs_ref_GRCh37_chr19.fa, grep found 502,410 lines that contain "ATG". Because there are no spaces in the sequence, each line was counted as one word, so wc also reported 502,410 words, and the 35,671,048 characters are the combined length of those lines. The grep + wc count is lower than Match's because it counts matching lines rather than individual matches.
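The difference between counting matching lines and counting matches can be shown on a tiny made-up sample. This is a sketch, not something from the assignment: the function names and the sample sequence are hypothetical, and the -o option (print each match on its own line) is supported by GNU and BSD grep but is not part of POSIX.

```shell
# count_matching_lines PATTERN: number of input lines that contain PATTERN
# at least once. This is the first number that grep ... | wc reports.
count_matching_lines() {
    grep "$1" | wc -l | tr -d ' '
}

# count_matches PATTERN: number of non-overlapping occurrences of PATTERN.
# grep -o prints every match on its own line, so wc -l counts matches.
count_matches() {
    grep -o "$1" | wc -l | tr -d ' '
}

# Made-up sample: two of the three lines contain ATG, with three hits in all.
sample='ATGCCATG
CCCGGG
TTATGA'

printf '%s\n' "$sample" | count_matching_lines ATG   # 2
printf '%s\n' "$sample" | count_matches ATG          # 3
```

Even with -o, the total for the chromosome file might not equal Match's number exactly, since I do not know how Match treats letter case or occurrences that span a line break.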

Other Links

User Page: Anindita Varshneya
Class Page: BIOL/CMSI 367: Biological Databases, Fall 2015
Group Page: GÉNialOMICS
