Anuvarsh Week 4

From LMU BioDB 2015

Transcription and Translation "Taken to the Next Level"

Before anything else, I logged into my account using:

   ssh avarshne@my.cs.lmu.edu

I then entered my password and changed into the directory where I had copied infA-E.coli-K12.txt from Dondi's library.

   cd biodb2015

Modify the gene sequence string so that it highlights or “tags” the special sequences within this gene

To complete this task, I reviewed the Introduction to the Command Line page and looked over the More Text Processing Features page. My partner Ron Legaspi and I were then led through the first couple of steps of the homework in class. In particular, we learned how to add the -35 box and -10 box tags. We first searched infA-E.coli-K12.txt for all instances of the -35 sequence, which was provided to us as a hint on the homework assignment, using the following command:

   grep "tt[gt]ac[at]" infA-E.coli-K12.txt

When we ran this search, we noticed that there were 2 instances of this pattern with only two nucleotides between them. Because we understood that the -10 box must occur after the -35 box, we searched for the -10 box sequence while also searching for the -35 box. In this instance, grep alone was not enough, because it prints whole matching lines rather than marking where each pattern falls relative to the other. In order to locate both sequences relative to each other, we ran the following command:

   cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/    ***&***    /g" | sed "s/[ct]at[at]at/    ***&***    /g"
   ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgt    ***tataat***    tgcggtcgcagagttggttacgctca
   ttaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcgg
   cttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc    ***tttact***    ta    ***tttaca***
   gaacttcgg    ***cattat***   cttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattg
   aaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctc
   cggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaagg
   ccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggccttttt
   attttat
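
As a hedged aside, GNU grep can also report where each match falls when the -o and -b options are combined. The sketch below is not what we ran in class; it works on a short excerpt of the sequence around the promoter (the variable names seq and hits are only for illustration) and assumes GNU grep.

```shell
# Short excerpt of the sequence around the promoter, copied from the output above.
seq="gcgcaaatctttacttatttacagaacttcggcattatcttgccggttcaaattacgg"

# -o prints only the matching text; combined with -b, GNU grep prefixes
# each match with its 0-based byte offset in the input.
hits=$(printf '%s\n' "$seq" | grep -ob "tt[gt]ac[at]")
printf '%s\n' "$hits"
```

On this excerpt the two -35 candidates are reported 8 positions apart, which agrees with a 6-nucleotide box followed by a 2-nucleotide gap.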

At this point it became very clear that the "real" -35 box was the first instance of its sequence, and the "real" -10 box was the second instance of its sequence, that is, the first instance after the -35 box. We began by tagging the -35 box. To replace just the first instance of a sequence using sed, we found that we only needed to replace the "g" in sed "s///g" with "1", which tells sed to replace only the first match. We found this information in the More Text Processing Features page. The resulting command was as follows:

   cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1"
   ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccg
   ctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagcc
   gtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box>
   tatttacagaacttcggcattatcttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattg
   aaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatct
   ccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaa
   ggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcc
   tttttattttat
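
To make the effect of the trailing flag concrete, here is a small sketch on an excerpt of the sequence. The square brackets stand in for the real tags, and the variable names are only for illustration; the numeric flag (replace only the Nth match) and the g flag are standard sed behavior.

```shell
# Excerpt containing both -35 candidates (tttact and tttaca).
seq="gcgcaaatctttacttatttacagaacttcgg"

# Flag 1: mark only the first match on the line.
first=$(printf '%s\n' "$seq" | sed "s/tt[gt]ac[at]/[&]/1")
# Flag 2: mark only the second match on the line.
second=$(printf '%s\n' "$seq" | sed "s/tt[gt]ac[at]/[&]/2")
# Flag g: mark every match on the line.
all=$(printf '%s\n' "$seq" | sed "s/tt[gt]ac[at]/[&]/g")

printf '%s\n%s\n%s\n' "$first" "$second" "$all"
```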

To tag the -10 box, Dondi gave us the hint that we should insert a line break after the -35 box tag. This helps in tagging the -10 box accurately because the "correct" -10 box should be found only a few nucleotides downstream of the -35 box. With the line break in place, we can begin our search for the "correct" -10 box at line 2. To insert a line break after the -35 box tag, we referred to the More Text Processing Features page, which indicated that we should use &\n as the replacement. The resulting command was as follows:

   cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" 
   ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccg
   ataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcgga
   gtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box>
    tatttacagaacttcggcattatcttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgc
   aaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgc
   gcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgta
   gtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat

At this point, all we needed to do was search for and replace the first instance of the -10 box after the line break. In class, we were given the hint that, to apply a search and replace only to the second line of a text, we should modify sed "s///g" to look like sed "2s///g". This led us to the following command:

   cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" | 
   sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1"
   ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgat
   aaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaa
   tgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box>
    tatttacagaacttcgg <minus10box>cattat</minus10box> cttgccggttcaaattacggtagtgataccccagaggattag
   atggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgt
   ggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacc
   tgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggc
   ctttttattttat
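
The reason for the line break plus the "2" address can be seen on a toy string. The string below is made up for illustration: an upstream -10-like match (tataat) is stitched in front of a real excerpt of the promoter region, and all variable names are hypothetical. The sketch assumes GNU sed, which understands \n in the replacement.

```shell
# Made-up toy string: upstream tataat stitched in front of the promoter excerpt.
toy="tataatgcgcaaatctttacttatttacagaacttcggcattatcttgccg"

# Without a line address, the first -10 match in the whole text is the upstream one.
unaddressed=$(printf '%s\n' "$toy" |
  sed "s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1")

# With a break after the -35 tag, "2s" only sees the text downstream of the -35 box.
addressed=$(printf '%s\n' "$toy" |
  sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" |
  sed "s/<\/minus35box>/&\n/g" |
  sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1")

printf '%s\n' "$unaddressed"
printf '%s\n' "$addressed"
```

In the first result the tag lands on the upstream tataat; in the second it lands on cattat, the first match after the -35 box.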

At this point, class had ended and we were on our own. Ron and I split up, and I continued working on this assignment by myself. The assignment next asked us to tag the transcription start site (TSS), with the hint that it is the 12th nucleotide after the first nucleotide in the -10 box. Because the -10 box is 6 nucleotides long, and the TSS is 12 nucleotides *after* the first, I knew the transcription start site would be the 7th nucleotide after the -10 box. To find it, I first inserted another line break 6 nucleotides after the -10 box using the repetition shortcut outlined in More Text Processing Features. I could then tag the first nucleotide in line 3 as the transcription start site.

   cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" |
   sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1" | sed -r "s/<\/minus10box> (.){6}/&\n/g" |
   sed "3s/^./ <tss>&<\/tss> /g"
   ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccga
   taaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagta
   atgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box>
    tatttacagaacttcgg <minus10box>cattat</minus10box> cttgcc
    <tss>g</tss> gttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaa
   acgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgc
   atcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcc
   tgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
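
The position arithmetic can be checked on the text starting at the -10 box: if the first nucleotide of the box is character 1, the nucleotide 12 further along is character 13. The sketch below is an alternative way of expressing this, not the command I ran; it uses cut to read character 13 and a single sed -E substitution with capture groups to place the tag. The variable names are only for illustration.

```shell
# Sequence starting at the first nucleotide of the -10 box (cattat).
downstream="cattatcttgccggttcaaattacgg"

# Character 13 = 6 box nucleotides + 6 spacer nucleotides + 1.
tss_base=$(printf '%s\n' "$downstream" | cut -c13)

# Same idea in one substitution: group 1 is the box, group 2 the six
# nucleotides after it, group 3 the nucleotide that gets the tag.
tagged=$(printf '%s\n' "$downstream" |
  sed -E "s/^([ct]at[at]at)(.{6})(.)/\1\2 <tss>\3<\/tss> /")

printf '%s\n%s\n' "$tss_base" "$tagged"
```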

To tag the ribosome binding site (RBS), I followed the same pattern as before: I inserted a line break after the transcription start site tag, then searched for the ribosome binding sequence and replaced its first instance in line 4. The sequence for the RBS is gagg, as given in the hints portion of the homework assignment.

   cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" |
   sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1" | sed -r "s/<\/minus10box> (.){6}/&\n/g" |
   sed "3s/^./ <tss>&<\/tss> /g" | sed "s/<\/tss> /&\n/g" | sed "4s/gagg/ <rbs>&<\/rbs> /1"
   ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataag
   gaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccg
   aacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box>
    tatttacagaacttcgg <minus10box>cattat</minus10box> cttgcc
    <tss>g</tss> 
   gttcaaattacggtagtgatacccca <rbs>gagg</rbs> attagatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgt
   tgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgac
   gggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcga
   agagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat

To find the start codon, I followed the same pattern as before: I first added a new line after the ribosome binding site and then searched for the start codon sequence because the start codon would only exist after the RBS. On the mRNA the start codon is AUG, so the mRNA-like strand of DNA would be ATG. There can only be one start codon, so only the first instance of ATG after the RBS will be tagged.

   cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" |
   sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1" | sed -r "s/<\/minus10box> (.){6}/&\n/g" |
   sed "3s/^./ <tss>&<\/tss> /g" | sed "s/<\/tss> /&\n/g" | sed "4s/gagg/ <rbs>&<\/rbs> /1" | sed "s/<\/rbs> /&\n/g" |
   sed "5s/atg/ <start_codon>&<\/start_codon> /1"
   ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataag
   gaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgcc
   gaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box>
    tatttacagaacttcgg <minus10box>cattat</minus10box> cttgcc
    <tss>g</tss> 
   gttcaaattacggtagtgatacccca <rbs>gagg</rbs> 
   attag <start_codon>atg</start_codon> gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgt
   tccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtg
   actgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacga
   gtaaaaggtcggtttaaccggcctttttattttat

The stop codon can be UAA, UAG, or UGA on the mRNA, so on the mRNA-like strand it reads TAA, TAG, or TGA. Furthermore, the stop codon must lie a multiple of 3 nucleotides downstream of the start codon, because it must be a whole number of codons away from the start codon to be in the reading frame. To find the stop codon, I therefore followed the same procedure as earlier (inserting a line break after the previous tag), but before searching for and replacing the stop codon, I first split all of the nucleotides in line 6 into 3-nucleotide codons. Then I could search for the stop codon sequence and tag it. Note that the pattern t[ag][ag] also matches tgg, which codes for tryptophan rather than a stop; it gives the right answer here only because no in-frame tgg occurs before the tga. Finally, the spaces between the codons must be removed. When I first ran this command, I realized that it also removes the spaces surrounding the stop codon tag, so I went back and replaced the stop codon tag with the correctly spaced version.

   cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" |
   sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1" | sed -r "s/<\/minus10box> (.){6}/&\n/g" |
   sed "3s/^./ <tss>&<\/tss> /g" | sed "s/<\/tss> /&\n/g" | sed "4s/gagg/ <rbs>&<\/rbs> /1" | sed "s/<\/rbs> /&\n/g" |
   sed "5s/atg/ <start_codon>&<\/start_codon> /1" | sed "s/<\/start_codon> /&\n/g" | sed "6s/.../& /g" |
   sed "6s/t[ag][ag]/ <stop_codon>&<\/stop_codon> /1" | sed "6s/ //g" | sed "6s/<stop_codon>/ <stop_codon>/g" |
   sed "6s/<\/stop_codon>/<\/stop_codon> /g"
   ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataa
   ggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgc
   cgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box>
    tatttacagaacttcgg <minus10box>cattat</minus10box> cttgcc
    <tss>g</tss> 
   gttcaaattacggtagtgatacccca <rbs>gagg</rbs> 
   attag <start_codon>atg</start_codon> 
   gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttact
   gcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaa
   ggccgcattgtcttccgtagtcgc <stop_codon>tga</stop_codon> ttgttttaccgcctgatgggcgaagagaaagaacgagtaaa
   aggtcggtttaaccggcctttttattttat
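
The codon-splitting step, and the difference between the pattern t[ag][ag] and a stricter list of the three stop codons, can be sketched on a made-up five-codon string (gcc tgg aaa tga ttg). This string is not part of the infA gene; it was chosen only because it contains an in-frame tgg before the real stop. The stricter alternation with sed -E is my suggestion, not something from the homework hints.

```shell
# Made-up reading frame: gcc tgg aaa tga ttg
orf="gcctggaaatgattg"

# Split into codons so a 3-letter pattern can only match a whole codon.
frame=$(printf '%s\n' "$orf" | sed "s/.../& /g")

# t[ag][ag] covers taa, tag, tga and also tgg, so it stops too early here.
loose=$(printf '%s\n' "$frame" |
  sed "s/t[ag][ag]/<stop_codon>&<\/stop_codon>/1" | sed "s/ //g")

# Listing the three stop codons explicitly skips tgg.
strict=$(printf '%s\n' "$frame" |
  sed -E "s/(taa|tag|tga)/<stop_codon>\1<\/stop_codon>/1" | sed "s/ //g")

printf '%s\n%s\n' "$loose" "$strict"
```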

The terminator sequence was the most challenging sequence to tag. The hint in the homework assignment informed me that the first half of the terminator hairpin sequence is "AAAAGGT". The other half of the hairpin would need to be complementary to this half and in reverse order, with the exception that the T would pair with a G. A strict reverse complement would be "ACCTTTT"; letting the final T pair with a G instead gives "GCCTTTT" as the other half of the hairpin. The hint in the homework also informed me that the terminator does not end until 4 nucleotides after the end of the second half of the hairpin. Given this information, I could roughly work out which portion of the sequence I needed to tag. However, the pattern that I used for the previous search-and-tag steps would not be as useful, since I did not know the number of nucleotides between the first and second halves of the hairpin. To tag the terminator, I decided to tag the first half first, insert a line break, and then tag the second half.

   cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" |
   sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1" | sed -r "s/<\/minus10box> (.){6}/&\n/g" |
   sed "3s/^./ <tss>&<\/tss> /g" | sed "s/<\/tss> /&\n/g" | sed "4s/gagg/ <rbs>&<\/rbs> /1" | sed "s/<\/rbs> /&\n/g" |
   sed "5s/atg/ <start_codon>&<\/start_codon> /1" | sed "s/<\/start_codon> /&\n/g" | sed "6s/.../& /g" |
   sed "6s/t[ag][ag]/ <stop_codon>&<\/stop_codon> /1" | sed "6s/ //g" | sed "6s/<stop_codon>/ <stop_codon>/g" |
   sed "6s/<\/stop_codon>/<\/stop_codon> /g" | sed "s/<\/stop_codon> /&\n/g" | sed "7s/aaaaggt/ <terminator>& \n/1" |
   sed "8s/gcctttt..../&<\/terminator> /1"
   ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataag
   gaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgcc
   gaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box>
    tatttacagaacttcgg <minus10box>cattat</minus10box> cttgcc
    <tss>g</tss> 
   gttcaaattacggtagtgatacccca <rbs>gagg</rbs> 
   attag <start_codon>atg</start_codon> 
   gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttact
   gcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaa
   ggccgcattgtcttccgtagtcgc <stop_codon>tga</stop_codon> 
   ttgttttaccgcctgatgggcgaagagaaagaacgagt <terminator>aaaaggt 
   cggtttaaccggcctttttatt</terminator> ttat
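
Two parts of this reasoning can be sketched in the shell. First, the strict reverse complement of the first hairpin half can be computed with rev and tr (rev is assumed to be available, as it is on most Linux systems). Second, as an alternative to my two-step approach, a single sed -E pattern with .* in the middle can tag the whole terminator without knowing the loop length. The variable names are only for illustration, and the input is the tail of the sequence after the stop codon.

```shell
# Strict reverse complement of the first hairpin half.
first_half="aaaaggt"
strict_rc=$(printf '%s\n' "$first_half" | rev | tr "acgt" "tgca")

# Tail of the gene sequence, copied from the output above.
tail_seq="gagtaaaaggtcggtttaaccggcctttttattttat"

# One substitution: first half, any loop, second half, then 4 more nucleotides.
tagged=$(printf '%s\n' "$tail_seq" |
  sed -E "s/aaaaggt.*gcctttt.{4}/ <terminator>&<\/terminator> /")

printf '%s\n%s\n' "$strict_rc" "$tagged"
```

The strict reverse complement differs from the second half used in the search (gcctttt) only in its first letter, which is where the T pairs with a G.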

The only thing left to do is to remove all of the new lines that I created. In order to do this, I referred to the More Text Processing Features page and found the command for combining lines.

   cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" |
   sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1" | sed -r "s/<\/minus10box> (.){6}/&\n/g" |
   sed "3s/^./ <tss>&<\/tss> /g" | sed "s/<\/tss> /&\n/g" | sed "4s/gagg/ <rbs>&<\/rbs> /1" | sed "s/<\/rbs> /&\n/g" |
   sed "5s/atg/ <start_codon>&<\/start_codon> /1" | sed "s/<\/start_codon> /&\n/g" | sed "6s/.../& /g" |
   sed "6s/t[ag][ag]/ <stop_codon>&<\/stop_codon> /1" | sed "6s/ //g" | sed "6s/<stop_codon>/ <stop_codon>/g" |
   sed "6s/<\/stop_codon>/<\/stop_codon> /g" | sed "s/<\/stop_codon> /&\n/g" | sed "7s/aaaaggt/ <terminator>& \n/1" |
   sed "8s/gcctttt..../&<\/terminator> /1" | sed ':a;N;$!ba;s/\n//g'
   ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataagga
   atttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacc
   tgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat
   </minus10box> cttgcc <tss>g</tss> gttcaaattacggtagtgatacccca <rbs>gagg</rbs> attag <start_codon>atg
   </start_codon> gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcac
   gtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctga
   gcaaaggccgcattgtcttccgtagtcgc <stop_codon>tga</stop_codon> ttgttttaccgcctgatgggcgaagagaaagaacgagt 
   <terminator>aaaaggt cggtttaaccggcctttttatt</terminator> ttat
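
For reference, here is how I understand the line-joining command, shown on three short lines: :a defines a label, N appends the next input line to the current one, $!ba jumps back to the label unless the last line has been read, and the final substitution deletes every line break that has been collected. The tr -d version is an equivalent I am adding for comparison. This sketch assumes GNU sed, which accepts the commands separated by semicolons.

```shell
# Join lines with the sed loop: collect all lines, then delete the breaks.
joined_sed=$(printf 'cttgcc\n <tss>g</tss> \ngttcaaa\n' | sed ':a;N;$!ba;s/\n//g')

# Same result by deleting newline characters directly.
joined_tr=$(printf 'cttgcc\n <tss>g</tss> \ngttcaaa\n' | tr -d '\n')

printf '%s\n%s\n' "$joined_sed" "$joined_tr"
```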

What is the exact mRNA sequence that is transcribed from this gene?

To find the mRNA sequence, I essentially need the sequence from the transcription start site to the end of the terminator, without the tags. I remembered the section in More Text Processing Features about deleting lines, and at first decided to add a line break before the TSS and after the terminator and delete the first and last lines, leaving only the sequence between the TSS and terminator with its tags. To delete the tags, my first thought was to make a line break before and after each tag, delete the line that each tag sat on, and then remove all extra line breaks. This sounded like way too much work. Instead, I thought about what the tags have in common that separates them from the DNA sequence. Obviously, DNA sequences do not contain greater-than or less-than signs. For that reason, I decided that I could create a line break at every < or > and delete all of the lines that have a tag on them. In fact, I could break up the entire tagged sequence this way and delete all of the lines that I don't need! After deleting the unnecessary lines, I just need to remove all of the line breaks using the same command as earlier. After creating the line breaks using ... sed "s/ //g" | sed "s/[<>]/\n/g", I ended up with the following lines (numbered here for reference):

   1 - ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataaggaat
   ttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgttt
   gttgcgatttagcgcgcaaatc
   2 - minus35box
   3 - tttact
   4 - /minus35box
   5 - tatttacagaacttcgg
   6 - minus10box
   7 - cattat
   8 - /minus10box
   9 - cttgcc
   10 - tss
   11 - g
   12 - /tss
   13 - gttcaaattacggtagtgatacccca
   14 - rbs
   15 - gagg
   16 - /rbs
   17 - attag
   18 - start_codon
   19 - atg
   20 - /start_codon
   21 - gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttact
   gcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaa
   aggccgcattgtcttccgtagtcgc
   22 - stop_codon
   23 - tga
   24 - /stop_codon
   25 - ttgttttaccgcctgatgggcgaagagaaagaacgagt
   26 - terminator
   27 - aaaaggtcggtttaaccggcctttttatt
   28 - /terminator
   29 - ttat

I determined that I could delete the following lines: 1-10, 12, 14, 16, 18, 20, 22, 24, 26, 28-29. The command to delete these lines is sed "1,10D;12D;14D;16D;18D;20D;22D;24D;26D;28,29D". After that command, I need to rejoin all of the lines with sed ':a;N;$!ba;s/\n//g'. The result is still written in DNA letters; converting t to u, as in the translation step, would give the RNA alphabet. The final command that I entered, which returned just the mRNA strand, was as follows:

   cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" |
   sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1" | sed -r "s/<\/minus10box> (.){6}/&\n/g" |
   sed "3s/^./ <tss>&<\/tss> /g" | sed "s/<\/tss> /&\n/g" | sed "4s/gagg/ <rbs>&<\/rbs> /1" | sed "s/<\/rbs> /&\n/g" |
   sed "5s/atg/ <start_codon>&<\/start_codon> /1" | sed "s/<\/start_codon> /&\n/g" | sed "6s/.../& /g" |
   sed "6s/t[ag][ag]/ <stop_codon>&<\/stop_codon> /1" | sed "6s/ //g" | sed "6s/<stop_codon>/ <stop_codon>/g" |
   sed "6s/<\/stop_codon>/<\/stop_codon> /g" | sed "s/<\/stop_codon> /&\n/g" | sed "7s/aaaaggt/ <terminator>& \n/1" |
   sed "8s/gcctttt..../&<\/terminator> /1" | sed ':a;N;$!ba;s/\n//g' | sed "s/ //g" | sed "s/[<>]/\n/g" |
   sed "1,10D;12D;14D;16D;18D;20D;22D;24D;26D;28,29D" | sed ':a;N;$!ba;s/\n//g'
   ggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgc
   ctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgc
   atcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttt
   taccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttatt
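
A shorter route to the same kind of result, offered only as a sketch and not as the method I used, is to trim everything outside the TSS and terminator tags and then delete whatever sits between a < and the next >. The tagged string below is a shortened stand-in for the real tagged sequence (most of the middle is left out), and the variable names are hypothetical.

```shell
# Shortened stand-in for the fully tagged sequence.
tagged="aaatc <minus35box>tttact</minus35box> ta <tss>g</tss> gttcaaattacgg <terminator>aaaaggt cggtttaaccggcctttttatt</terminator> ttat"

# 1) drop everything up to and including <tss>
# 2) drop </terminator> and everything after it
# 3) delete any remaining tags, 4) delete the spaces
mrna=$(printf '%s\n' "$tagged" |
  sed -E "s/.*<tss>//; s/<\/terminator>.*//; s/<[^>]*>//g; s/ //g")

printf '%s\n' "$mrna"
```

This avoids counting line numbers by hand, at the cost of relying on the tag names being exactly tss and terminator.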

What is the amino acid sequence that is translated from this mRNA?

To find the amino acid sequence, I can use the same procedure as before to delete everything before the start codon and everything after the stop codon. Then I need to replace all t's with u's, add spaces to separate the codons, and use the genetic-code.sed file to convert the codons into their corresponding amino acids. The following lines can now be deleted: 1-18, 20, 22-29, with sed "1,18D;20D;22,29D". Then I rejoin the lines with sed ':a;N;$!ba;s/\n//g'. At this point, I can use the same command that I used last week to translate the mRNA sequence into its corresponding protein sequence: sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[aucg]//g".

   cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" |
   sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1" | sed -r "s/<\/minus10box> (.){6}/&\n/g" |
   sed "3s/^./ <tss>&<\/tss> /g" | sed "s/<\/tss> /&\n/g" | sed "4s/gagg/ <rbs>&<\/rbs> /1" | sed "s/<\/rbs> /&\n/g" |
   sed "5s/atg/ <start_codon>&<\/start_codon> /1" | sed "s/<\/start_codon> /&\n/g" | sed "6s/.../& /g" |
   sed "6s/t[ag][ag]/ <stop_codon>&<\/stop_codon> /1" | sed "6s/ //g" | sed "6s/<stop_codon>/ <stop_codon>/g" |
   sed "6s/<\/stop_codon>/<\/stop_codon> /g" | sed "s/<\/stop_codon> /&\n/g" | sed "7s/aaaaggt/ <terminator>& \n/1" |
   sed "8s/gcctttt..../&<\/terminator> /1" | sed ':a;N;$!ba;s/\n//g' | sed "s/ //g" | sed "s/[<>]/\n/g" | sed "1,18D;20D;22,29D" |
   sed ':a;N;$!ba;s/\n//g' | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[aucg]//g"
   MAKEDNIEMQGTVLETLPNTMFRVELENGHVVTAHISGKMRKNYIRILTGDKVTVELTPYDLSKGRIVFRSR
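
The translation steps can be tried on the first five codons of the coding sequence. Since the contents of genetic-code.sed are not reproduced on this page, the five -e rules below are a hypothetical stand-in that covers only the codons in this example; the real file handles all 64 codons.

```shell
# First five codons of the coding sequence, copied from the tagged output above.
cds="atggccaaagaagac"

# t -> u, split into codons, apply the stand-in codon table, remove the spaces.
protein=$(printf '%s\n' "$cds" |
  sed "s/t/u/g" |
  sed "s/.../& /g" |
  sed -e "s/aug/M/g" -e "s/gcc/A/g" -e "s/aaa/K/g" -e "s/gaa/E/g" -e "s/gac/D/g" |
  sed "s/ //g")

printf '%s\n' "$protein"
```

The result agrees with the first five letters of the full protein sequence above. Splitting into codons before applying the table matters, because the spaces keep a rule from matching across a codon boundary.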

Other Links

User Page: Anindita Varshneya
Class Page: BIOL/CMSI 367: Biological Databases, Fall 2015
Group Page: GÉNialOMICS
