Msaeedi23 Week 4

From LMU BioDB 2015

Transcription and Translation “Taken to the Next Level”

This computer exercise examines gene expression at a much more detailed level than before, requiring knowledge in both the biological aspects of the process and the translation of these steps into computer text-processing equivalents.

The following sequence represents a real gene, called infA and found in E. coli K12. As you might have guessed, it’s stored as infA-E.coli-K12.txt in ~dondi/xmlpipedb/data.

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgc
tcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgtt
gcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc
tttacttatttacagaacttcggcattatcttgccggttcaaattacggtagtgataccccagaggattagatggcc
aaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaa
cggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtga
ctgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatg
ggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat

For each of the following questions pertaining to this gene, provide (a) the actual answer, and (b) the sequence of text-processing commands that calculates this answer. Specific information about how these sequences can be identified is included after the list of questions.

  1. Modify the gene sequence string so that it highlights or “tags” the special sequences within this gene, as follows (ellipses indicate bases in the sequence; note the spaces before the start tag and after the end tag):
    • -35 box of the promoter
      ... <minus35box>...</minus35box> ...
    • -10 box of the promoter
      ... <minus10box>...</minus10box> ...
    • transcription start site
      ... <tss>...</tss> ...
    • ribosome binding site
      ... <rbs>...</rbs> ...
    • start codon
      ... <start_codon>...</start_codon> ...
    • stop codon
      ... <stop_codon>...</stop_codon> ...
    • terminator
      ... <terminator>...</terminator> ...
  2. What is the exact mRNA sequence that is transcribed from this gene?
  3. What is the amino acid sequence that is translated from this mRNA?

-35 Box

  • Using the sed command and the given consensus sequence, I was able to target the -35 box. I gave sed the pattern tt[gt]ac[at] and replaced the g flag with a 1 so that only the first occurrence of the sequence is matched. Running cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ & /1" isolated the desired sequence, but I still needed to add the start and end tags. As we discussed in class, preceding a "/" with a "\" makes sed keep the "/" as a literal character instead of treating it as a delimiter. Using this technique, I came up with the command cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1", which produced my desired output. The sequence came out as:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataagg 
aatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccg
aacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttacagaacttcggcattatcttgccggttcaa
attacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgt
agagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactg
accccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggttta
accggcctttttattttat
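
The first-occurrence substitution and the escaped slash can be tried on a short piece of the gene. This is only a sketch, assuming GNU sed; the variable names fragment and tagged are made up for the illustration, and the fragment is the stretch of the sequence around the -35 box. Note that the fragment contains a second match (tttaca) that the trailing 1 leaves untouched.

```shell
# Short stretch of the infA sequence around the -35 box (copied from the gene above).
fragment="gcgcaaatctttacttatttacag"

# The trailing 1 limits the substitution to the first match on the line;
# \/ keeps the slash of the closing tag from ending the sed expression.
tagged=$(echo "$fragment" | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1")

echo "$tagged"
```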

-10 box

  • Using a similar sequence of commands and the given consensus sequence, I was able to mark the -10 box. First I ran cat infA-E.coli-K12.txt | sed "s/[ct]at[at]at/ & /g" to locate the candidate sequences, which showed that there were multiple occurrences. I needed the occurrence that appears after the -35 box. To restrict the search, I inserted a newline right after the -35 box closing tag, so that everything downstream sits on line 2, and then ran the substitution on that line only: sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1". Once this was done, the newline was removed using the sed ':a;N;$!ba;s/\n//g' command. The resulting sequence came out to be:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataagg
aatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccg
aacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat
</minus10box> cttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaaggtaccgttcttga
aacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctg
acgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaag
agaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
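
The newline technique can be sketched on a toy sequence. This assumes GNU sed (for \n in the replacement); the variable names seq and tagged are hypothetical, and the toy sequence is a few made-up bases containing a decoy tataat followed by the real stretch of the gene from the -35 box onward. The decoy sits before the -35 box, so it stays on line 1 and is never tagged.

```shell
# Toy input: made-up prefix "gtataatgcgc" (holds a decoy tataat) + real promoter stretch.
seq="gtataatgcgctttacttatttacagaacttcggcattatcttgccgg"

# 1) tag the -35 box and break the line right after its closing tag
# 2) tag the first -10 match on line 2 only (the decoy is on line 1)
# 3) join the lines back together
tagged=$(echo "$seq" \
  | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" \
  | sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1" \
  | sed ':a;N;$!ba;s/\n//g')

echo "$tagged"
```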

transcription start site

  • The -10 box is six nucleotides long, and the transcription start site is taken here as the 12th nucleotide counting from the first nucleotide of the box, so it is the sixth nucleotide after the box ends. I isolated it by inserting a newline after the five nucleotides that follow the -10 box closing tag, tagging the first character of the new line, and then removing the newlines. The added commands for the transcription start site came out to: 2s/> (.){5}/&\n/g" | sed "3s/^./ <tss>&<\/tss> /g" | sed ':a;N;$!ba;s/\n//g'. Because (.){5} is extended regular expression syntax, the sed that runs it needs the -r option. The comprehensive command put together was
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" | sed -r "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1;2s/> (.){5}/&\n/g" | sed "3s/^./ <tss>&<\/tss> /g" | sed ':a;N;$!ba;s/\n//g' to produce:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataagg
aatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccg
aacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat
</minus10box> cttgc <tss>c</tss> ggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaag
gtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaacta
catccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgc
ctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
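
As an alternative to the newline-and-line-number approach, the same counting can be written as a single substitution with capture groups. This is a sketch of a different technique, not the command used above; it assumes GNU sed with -r, and the variable names line and tagged are made up. The input is a short piece of the already tagged sequence.

```shell
# Piece of the sequence with the -10 box already tagged.
line="ttcgg <minus10box>cattat</minus10box> cttgccggttcaa"

# Group 1: the closing tag, its trailing space, and the next five nucleotides.
# Group 2: the sixth nucleotide after the box, which gets the <tss> tags.
tagged=$(echo "$line" | sed -r "s/(<\/minus10box> .{5})(.)/\1 <tss>\2<\/tss> /")

echo "$tagged"
```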

ribosome binding site

  • Given that the ribosome binding site follows the consensus sequence gagg and comes after the tss, it was not too difficult to locate and tag. The tss begins the third line, so I searched this line for the target sequence by adding this code: 3s/gagg/ <rbs>&<\/rbs> /1" | sed ':a;N;$!ba;s/\n//g' directly after tagging the tss. The complete command came out to be: cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" | sed -r "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1;2s/> (.){5}/&\n/g" | sed "3s/^./ <tss>&<\/tss> /g;3s/gagg/ <rbs>&<\/rbs> /1" | sed ':a;N;$!ba;s/\n//g' and this produced:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataagg
aatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccg
aacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat
</minus10box> cttgc <tss>c</tss> ggttcaaattacggtagtgatacccca <rbs>gagg</rbs> attagatggccaaagaagacaat
attgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaa
tgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctg
attgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat

start codon

  • The start codon, atg, appears after the rbs. To target it, I inserted a newline after the rbs closing tag and searched the fourth line for the first atg. The added command comes out to:

sed "4s/atg/ <start_codon>&<\/start_codon> /1". The entire command came out to: cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" | sed -r "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1;2s/> (.){5}/&\n/g" | sed "3s/^./ <tss>&<\/tss> /g;3s/gagg/ <rbs>&<\/rbs> /1;3s/<\/rbs> /&\n/g" | sed "4s/atg/ <start_codon>&<\/start_codon> /1" | sed ':a;N;$!ba;s/\n//g' and produced:

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataagg
aatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccg
aacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat
</minus10box> cttgc <tss>c</tss> ggttcaaattacggtagtgatacccca <rbs>gagg</rbs> attag <start_codon>atg<
/start_codon> gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtca
cgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagc
aaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttatttta
t

stop codon

  • Targeting the stop codon differs from targeting the start codon, because the sequence must now be read in groups of three, in the same reading frame as the start codon. A newline was again used, this time after the start codon closing tag, along with a 5 before the "s/" in the sed command to target the fifth line. To account for the three possible stop codons (tga, tag, or taa), I split the line into codons and searched for all three at once. This was done using sed -r "5s/.../& /g;5s/tag|tga|taa/ <stop_codon>&<\/stop_codon> /1". After the search I needed to remove the spaces between the codons while putting spaces back around the tags: 5s/ //g;5s/<stop_codon>/ &/g;5s/<\/stop_codon>/& /g" | sed ':a;N;$!ba;s/\n//g'. Combining the entire command comes out to: cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" | sed -r "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1;2s/> (.){5}/&\n/g" | sed "3s/^./ <tss>&<\/tss> /g;3s/gagg/ <rbs>&<\/rbs> /1;3s/<\/rbs> /&\n/g" | sed "4s/atg/ <start_codon>&<\/start_codon> /1;4s/<\/start_codon> /&\n/g" | sed -r "5s/.../& /g;5s/tag|tga|taa/ <stop_codon>&<\/stop_codon> /1;5s/ //g;5s/<stop_codon>/ &/g;5s/<\/stop_codon>/& /g" | sed ':a;N;$!ba;s/\n//g' and this produced:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataagg
aatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccg
aacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat
</minus10box> cttgc <tss>c</tss> ggttcaaattacggtagtgatacccca <rbs>gagg</rbs> attag <start_codon>atg<
/start_codon> gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtca
cgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagc
aaaggccgcattgtcttccgtagtcgc <stop_codon>tga</stop_codon> ttgttttaccgcctgatgggcgaagagaaagaacgagtaaaag
gtcggtttaaccggcctttttattttat
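
Why the reading frame matters can be seen on the last few codons of the gene. The sketch below assumes GNU sed with -r; the variable names orf, naive and framed are made up, and orf is the end of the coding region starting on a codon boundary (gtc ttc cgt agt cgc tga ...). A search that ignores the frame stops at the tag hidden across cgt and agt, while splitting into codons first finds the real tga.

```shell
# Tail of the coding region, beginning in frame at the codon gtc.
orf="gtcttccgtagtcgctgattgttttacc"

# Ignoring the frame: the first hit is "tag", which straddles two codons.
naive=$(echo "$orf" | sed -r "s/tag|tga|taa/ <stop_codon>&<\/stop_codon> /1")

# Respecting the frame: split into codons, tag the first stop codon,
# remove the codon spaces, then put spaces back around the tags.
framed=$(echo "$orf" \
  | sed -r "s/.../& /g;s/tag|tga|taa/<stop_codon>&<\/stop_codon>/1;s/ //g;s/<stop_codon>/ &/;s/<\/stop_codon>/& /")

echo "$naive"
echo "$framed"
```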

Terminator

  • The hairpin proved to be a challenge, even with the hint that the terminator includes four nucleotides following the hairpin. The hairpin essentially bends around and pairs with itself, meaning that the complement of its first half appears downstream in reverse. Because the u in the mRNA pairs with a g instead of the usual a, the second half begins with a "g" instead of an "a," which means that "gcctttt" closes the hairpin and is included in the terminator. To account for both ends of the loop and the newlines I tried the command:

sed "6s/aaaaggt/ <terminator>&\n/g" | sed "7s/gcctttt..../&<\/terminator> /g" | sed ':a;N;$!ba;s/\n//g'. The complete command came out to: cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" | sed -r "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1;2s/> (.){5}/&\n/g" | sed "3s/^./ <tss>&<\/tss> /g;3s/gagg/ <rbs>&<\/rbs> /1;3s/<\/rbs> /&\n/g" | sed "4s/atg/ <start_codon>&<\/start_codon> /1;4s/<\/start_codon> /&\n/g" | sed -r "5s/.../& /g;5s/tag|tga|taa/ <stop_codon>&<\/stop_codon> /1;5s/ //g;5s/<stop_codon>/ &/g;5s/<\/stop_codon>/& /g;5s/<\/stop_codon> /&\n/g" | sed "6s/aaaaggt/ <terminator>&\n/g" | sed "7s/gcctttt..../&<\/terminator> /g" | sed ':a;N;$!ba;s/\n//g' to produce:

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataagg
aatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccg
aacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat
</minus10box> cttgc <tss>c</tss> ggttcaaattacggtagtgatacccca <rbs>gagg</rbs> attag <start_codon>atg<
/start_codon> gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtca
cgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagc
aaaggccgcattgtcttccgtagtcgc <stop_codon>tga</stop_codon> ttgttttaccgcctgatgggcgaagagaaagaacgagt <ter
minator>aaaaggtcggtttaaccggcctttttatt</terminator> ttat
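
The terminator can also be matched in one expression instead of two line-addressed steps. This is a sketch of an alternative, not the command used above; it assumes GNU sed with -r, and the variable names tail and tagged are made up. The input is the last stretch of the gene, and the pattern reads as: first half of the hairpin, anything in between, second half of the hairpin, then four more nucleotides.

```shell
# End of the infA sequence, starting a few bases before the hairpin.
tail="gaacgagtaaaaggtcggtttaaccggcctttttattttat"

# aaaaggt ... gcctttt plus four trailing nucleotides, wrapped in one pair of tags.
tagged=$(echo "$tail" | sed -r "s/aaaaggt.*gcctttt.{4}/ <terminator>&<\/terminator> /")

echo "$tagged"
```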


Base your commands on the following hints/guidelines about the gene, plus your own knowledge learned from the past few weeks:

  • The consensus sequence for the -10 site is [ct]at[at]at.
  • The consensus sequence for the -35 site is tt[gt]ac[at].
  • The ideal number of base pairs between the -35 and -10 box is 17, counting from the first nucleotide after the end of the -35 sequence up to the last nucleotide before the -10 sequence.
  • The transcription start site is located at the 12th nucleotide after the first nucleotide of the -10 box.
  • A consensus sequence for the ribosome binding site is gagg.
  • The first half of the terminator “hairpin” is aaaaggt, where the u in the mRNA binds with a g instead of the usual a.
  • The terminator includes 4 more nucleotides after the hairpin completes.
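
The 17-base spacing hint can be checked directly with a counted repetition. The sketch below assumes GNU sed with -r; the variable names promoter and spacer are made up, and promoter is the stretch of the gene that spans both boxes. The command prints only the bases captured between a -35 match and a -10 match that are exactly 17 apart.

```shell
# Stretch of the infA sequence covering the -35 box, the spacer, and the -10 box.
promoter="gcgcaaatctttacttatttacagaacttcggcattatcttgcc"

# -n with the p flag prints the line only if the substitution succeeded,
# i.e. only if the two consensus sequences are separated by exactly 17 bases.
spacer=$(echo "$promoter" | sed -r -n "s/.*tt[gt]ac[at](.{17})[ct]at[at]at.*/\1/p")

echo "spacer: $spacer (${#spacer} bases)"
```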

What is the exact mRNA sequence that is transcribed from this gene?

  • This is where I got completely lost. I know the mRNA sequence is transcribed from the transcription start site up to the end of the terminator, but it became too confusing to work backwards. The only approach that made sense was to insert newlines before and after every tag so that the tags could be removed easily, but this seemed like an excessive amount of work. Because I was completely stumped, I glanced at another student's work [blitvak] to help me get going. I realized that I could use sed several times in order to target and delete specific lines. Building on the commands used earlier in the assignment, and combining these with certain deletion commands, the command I first wrote down on paper and then entered came out to:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" | sed -r "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1;2s/> (.){5}/&\n/g" | sed "3s/^./ <tss>&<\/tss> /g;3s/gagg/ <rbs>&<\/rbs> /1;3s/<\/rbs> /&\n/g" | sed "4s/atg/ <start_codon>&<\/start_codon> /1;4s/<\/start_codon> /&\n/g" | sed -r "5s/.../& /g;5s/tag|tga|taa/ <stop_codon>&<\/stop_codon> /1;5s/ //g;5s/<stop_codon>/ &/g; 5s/<\/stop_codon>/& /g;5s/<\/stop_codon> /&\n/g" | sed "6s/aaaaggt/ <terminator>&\n/g" | sed "7s/gcctttt..../&<\/terminator> /g" | sed ':a;N;$!ba;s/\n//g' | sed "s/ //g" | sed -r "s/<|>/\n/g" | sed "1,10D;12D;14D;16D;18D;20D;22D;24D;26D;28,29D" | sed ':a;N;$!ba;s/\n//g' | sed "s/t/u/g"

  • Once the spaces had been removed and each tag and sequence segment had been separated onto its own line, I deleted all of the unnecessary lines and joined the rest back together. Finally, I replaced all of the t's with u's. The mRNA strand came out to be:

cgguucaaauuacgguagugauaccccagaggauuagauggccaaagaagacaauauugaaaugcaagguaccguucuug
aaacguugccuaauaccauguuccgcguagaguuagaaaacggucacgugguuacugcacacaucuccgguaaaaugcgca
aaaacuacauccgcauccugacgggcgacaaagugacuguugaacugaccccguacgaccugagcaaaggccgcauugu
cuuccguagucgcugauuguuuuaccgccugaugggcgaagagaaagaacgaguaaaaggucgguuuaaccggccuuuuuauu
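
Cutting out the transcript does not strictly require counting lines. The sketch below shows an alternative under stated assumptions: GNU sed with -r, made-up variable names tagged and mrna, and a heavily shortened, hypothetical tagged string standing in for the full tagged gene. Everything before the tss and after the terminator is dropped, the tags are stripped, and only then are the t's replaced, since the tag names themselves contain the letter t.

```shell
# Hypothetical, shortened stand-in for the fully tagged sequence.
tagged="aa <tss>c</tss> ggtt <rbs>gagg</rbs> att <terminator>aaaaggt</terminator> ttat"

# Drop everything up to <tss>, drop everything from </terminator> onward,
# strip the remaining tags and spaces, then transcribe t to u.
mrna=$(echo "$tagged" \
  | sed -r "s/.*<tss>//;s/<\/terminator>.*//;s/<[^>]*>//g;s/ //g;s/t/u/g")

echo "$mrna"
```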

What is the amino acid sequence that is translated from this mRNA?

To obtain the amino acid sequence we must use the genetic-code.sed file and make sure to account for the three-nucleotide codon separation for proper translation. The added command came out to: sed -f genetic-code.sed | sed "s/ //g". The entire command was as follows: cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" | sed -r "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1;2s/> (.){5}/&\n/g" | sed "3s/^./ <tss>&<\/tss> /g;3s/gagg/ <rbs>&<\/rbs> /1;3s/<\/rbs> /&\n/g" | sed "4s/atg/ <start_codon>&<\/start_codon> /1;4s/<\/start_codon> /&\n/g" | sed -r "5s/.../& /g;5s/tag|tga|taa/ <stop_codon>&<\/stop_codon> /1;5s/ //g;5s/<stop_codon>/ &/g; 5s/<\/stop_codon>/& /g;5s/<\/stop_codon> /&\n/g" | sed "6s/aaaaggt/ <terminator>&\n/g" | sed "7s/gcctttt..../&<\/terminator> /g" | sed ':a;N;$!ba;s/\n//g' | sed "s/ //g" | sed -r "s/<|>/\n/g" | sed "1,10D;12D;14D;16D;18D;20D;22D;24D;26D;28,29D" | sed ':a;N;$!ba;s/\n//g' | sed "s/t/u/g" | sed -f genetic-code.sed | sed "s/ //g" The sequence that was produced came out to be:

MAKEDNIEMQGTVLETLPNTMFRVELENGHVVTAHISGKMRKNYIRILTGDKVTVELTPYDLSKGRIVFRSR
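
The idea behind the translation step can be sketched on a toy mRNA. The contents of genetic-code.sed are not shown on this page, so the four-codon table below is a hypothetical stand-in for it, and the toy mRNA is the real leader and first four codons followed by a made-up stop. The sketch assumes GNU sed; the variable names mrna and protein are also made up. The sequence is cut at the first aug, split into codons, translated codon by codon, and truncated at the stop codon.

```shell
# Toy mRNA: real leader "gaggauuag" + aug gcc aaa gaa, then a made-up uga stop and tail.
mrna="gaggauuagauggccaaagaaugauug"

# 1) discard everything before the first aug (newline used as a temporary marker)
# 2) split into codons so later matches stay in frame
# 3) translate with a tiny hypothetical table, cut at the stop codon, drop spaces
protein=$(echo "$mrna" \
  | sed "s/aug/\n&/;s/.*\n//" \
  | sed -r "s/.../& /g" \
  | sed "s/aug /M/g;s/gcc /A/g;s/aaa /K/g;s/gaa /E/g;s/uga .*//;s/ //g")

echo "$protein"
```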


Computer Tips

  • Remember that sed is line-based, and that you can add and count lines to get certain things done, say strictly before or after a certain point.
  • Don’t forget how you enforced reading frames in Week 3.
  • If you do add lines or spaces to get the job done, make sure to clean up after yourself by removing them from the final answer.
  • This exercise is difficult enough that you might be thinking to yourself, “I’d rather do this by hand!” This sentiment is understandable, but when you find yourself feeling this way, consider the following:
    • Part of the difficulty is learning these things for the first time. Once you’ve gotten the hang of it, there’s no way that doing things by hand will be faster.
    • Consider trying to do this over and over, for multiple genes, with lots of potential variations. Doing this by hand not only takes longer at this point, but risks errors that a computer won’t make (once the correct commands have been determined).
  • Form your commands so that they can be strung together into a single pipeline of processing directives in the end. In other words, once you’ve figured out how to do each step, no human intervention should be needed to perform everything from beginning to end.
  • You will need the More Text Processing Features wiki page to complete this assignment. The How to Read XML Files wiki page gives you an idea for why the requested output was formatted the way it was.

Mahrad Saeedi

Class Whoopers Team Page
Assignment Links
Individual Journals
Shared Journals