Lenaolufson Week 4

From LMU BioDB 2015

Transcription and Translation "Taken to the Next Level"

  • The first step was opening the Terminal app and logging in to the course server to access the files for this assignment.
ssh eolufson@my.cs.lmu.edu
  • Then I went into the class folder and made a folder for this assignment.
cd biodb
mkdir week4
  • Then I went into Dondi's data directory and copied the assigned file into my folder.
cd ~dondi/xmlpipedb/data
cp infA-E.coli-K12.txt ~eolufson/biodb/week4
  • Next I went into my own directory to do the assignment.
cd ~eolufson/biodb/week4

This computer exercise examines gene expression at a much more detailed level than before, requiring knowledge in both the biological aspects of the process and the translation of these steps into computer text-processing equivalents.

The following sequence represents a real gene, called infA and found in E. coli K12. As you might have guessed, it’s stored as infA-E.coli-K12.txt in ~dondi/xmlpipedb/data.

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgc
tcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgtt
gcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc
tttacttatttacagaacttcggcattatcttgccggttcaaattacggtagtgataccccagaggattagatggcc
aaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaa
cggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtga
ctgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatg
ggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat

For each of the following questions pertaining to this gene, provide (a) the actual answer, and (b) the sequence of text-processing commands that calculates this answer. Specific information about how these sequences can be identified is included after the list of questions.

Modify the gene sequence string so that it highlights or “tags” the special sequences within this gene, as follows (ellipses indicate bases in the sequence; note the spaces before the start tag and after the end tag):

  • -35 box of the promoter
... <minus35box>...</minus35box> ...
  • By using the info that the consensus sequence for the -35 site is tt[gt]ac[at] as well as the hints and help from class, I was able to determine that in order to add a tag for the -35 box, the command is:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>/1"
  • The 1 is used instead of "g" at the end so that only the first match is tagged, rather than every match in the sequence. This is the output:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgc
gtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttag
cgcgcaaatc<minus35box>tttact</minus35box>tatttacagaacttcggcattatcttgccggttcaaattacggtagtgataccccagaggatt
agatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacac
atctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgt
agtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
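To see the difference between the 1 and g flags, here is a minimal sketch on a made-up toy sequence that contains two matches of the -35 consensus. The sequence and the variable names `seq`, `first_only`, and `all_matches` are assumptions for illustration, not part of the assignment file.

```shell
# Toy sequence (hypothetical) with two matches of tt[gt]ac[at]: ttgaca and tttact
seq="ccttgacaggtttactcc"

# Flag 1: tag only the first match on the line
first_only=$(printf '%s\n' "$seq" | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>/1")

# Flag g: tag every match on the line
all_matches=$(printf '%s\n' "$seq" | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>/g")

echo "$first_only"
echo "$all_matches"
```

With the 1 flag only ttgaca is wrapped in the tag; with g both ttgaca and tttact are wrapped.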
  • -10 box of the promoter
... <minus10box>...</minus10box> ...
  • By using the info that the consensus sequence for the -10 site is [ct]at[at]at, that there are 17 nucleotides between the -35 and the -10 box sites, and the instructions given in class, I was able to figure out that the command for the -10 box tag is:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" | 
sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1" | sed ':a;N;$!ba;s/\n//g'
  • The output of this command is:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcg
tttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box>
tatttacagaacttcgg <minus10box>cattat</minus10box> cttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatg
caaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacggg
cgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggt
ttaaccggcctttttattttat
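The reason for the newline trick can be sketched on a shortened stretch of the promoter region. The gene has an upstream tataat that also fits the -10 consensus, so searching the whole line would tag the wrong site. The shortened `seq` below and the variable names `naive` and `restricted` are assumptions for illustration, and the `\n` in the replacement relies on GNU sed.

```shell
# Shortened stand-in for the promoter region (hypothetical, stitched from the gene)
seq="gtgttataattgcgcaaatctttacttatttacagaacttcggcattatcttgccgg"

# Naive search: the first [ct]at[at]at on the line is the upstream tataat
naive=$(printf '%s\n' "$seq" | sed "s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1")

# Restricted search: break the line after the -35 box, search only line 2, then rejoin
restricted=$(printf '%s\n' "$seq" |
  sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" |
  sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1" |
  sed ':a;N;$!ba;s/\n//g')

echo "$naive"
echo "$restricted"
```

The last sed reads every line into the pattern space and deletes the newlines, which puts the sequence back on one line.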
  • transcription start site
... <tss>...</tss> ...
  • By using the info that the transcription start site is the 12th nucleotide counting from the first nucleotide of the -10 box, in addition to the help provided in class and from my homework partner, I realized that since the newline created after the -35 box was still there, the closing tag of the -10 box on the second line could be found by searching for "> ". The -10 box is 6 nucleotides long, so the tss is the 6th nucleotide after the end of the closing tag. I skipped the 5 nucleotides in between and added a newline, which puts the tss at the start of the third line. To make this easier on myself, I used -r with sed so I could write a repeated pattern like (.){5}. The command I entered is:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" | 
sed -r "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1;2s/> (.){5}/&\n/g" | sed "3s/^./ <tss>&<\/tss> /g" | 
sed ':a;N;$!ba;s/\n//g'
  • The output performed by this command is:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcg
tttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box>
tatttacagaacttcgg <minus10box>cattat</minus10box> cttgc <tss>c</tss>    
ggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgc
ctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctg
acgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgg
gcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
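The counting can be double-checked with a small sketch. The string `region` is the -10 box followed by the next few bases, copied from the tagged output; the variable names and the use of `cut` are my own assumptions, not part of the assignment.

```shell
# -10 box (cattat) plus the downstream bases from the tagged output
region="cattatcttgccggttcaaat"

# Nucleotides 1-6 are the -10 box; nucleotide 12 is the 6th base after the box
box=$(printf '%s\n' "$region" | cut -c1-6)
tss=$(printf '%s\n' "$region" | cut -c12)

# Same idea with sed -r: keep the box (6) and the next 5 bases, tag the base that follows
tagged=$(printf '%s\n' "$region" | sed -r "s/^(.{6})(.{5})(.)/\1\2 <tss>\3<\/tss> /")

echo "$box $tss"
echo "$tagged"
```

Both approaches land on the same c that the full command tags as the tss.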
  • Ribosome binding site
... <rbs>...</rbs> ...
  • I used the info that the consensus sequence for the ribosome binding site is gagg, as well as help from my homework partner, to figure out the correct command. The transcription start site begins the third line, so I only needed to search that line for the first gagg:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" | 
sed -r "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1;2s/> (.){5}/&\n/g" | 
sed "3s/^./ <tss>&<\/tss> /g;3s/gagg/ <rbs>&<\/rbs> /1" | sed ':a;N;$!ba;s/\n//g'
  • The output was as follows:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgt
caggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcg
caaatc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat</minus10box> cttgc <tss>c</tss> 
ggttcaaattacggtagtgatacccca <rbs>gagg</rbs> attagatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaatac
catgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaact
gaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
  • start codon
... <start_codon>...</start_codon> ...
  • This command was very similar to the one for the rbs: I created a newline after the rbs tag and searched for the first atg on the 4th line. The command is:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" | 
sed -r "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1;2s/> (.){5}/&\n/g" | 
sed "3s/^./ <tss>&<\/tss> /g;3s/gagg/ <rbs>&<\/rbs> /1;3s/<\/rbs> /&\n/g" | 
sed "4s/atg/ <start_codon>&<\/start_codon> /1" | sed ':a;N;$!ba;s/\n//g'
  • The output was:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaa
cgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc
 <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat</minus10box> cttgc <tss>c</tss> ggttcaa
attacggtagtgatacccca <rbs>gagg</rbs> attag <start_codon>atg</start_codon> gccaaagaagacaatattgaaatgcaaggtaccgttcttga
aacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatg
cgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgt
cttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
  • Stop codon
... <stop_codon>...</stop_codon> ...
  • The stop codon was challenging for me to figure out because it has to be in the same reading frame as the start codon, which takes more advanced use of the command line. I honestly had to look at the work of my peers to help me figure out the correct command, and after searching and asking my homework partner some questions, I came up with the following. It splits the fifth line into codons, tags the first stop codon (tag, tga, or taa), and then removes the extra spaces:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" | 
sed -r "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1;2s/> (.){5}/&\n/g" | 
sed "3s/^./ <tss>&<\/tss> /g;3s/gagg/ <rbs>&<\/rbs> /1;3s/<\/rbs> /&\n/g" | 
sed "4s/atg/ <start_codon>&<\/start_codon> /1;4s/<\/start_codon> /&\n/g" | 
sed -r "5s/.../& /g;5s/tag|tga|taa/ <stop_codon>&<\/stop_codon> /1;5s/ //g;5s/<stop_codon>/ &/g;5s/<\/stop_codon>/& /g" | 
sed ':a;N;$!ba;s/\n//g'
  • The output of this command was:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttc
gcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc
<minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat</minus10box> cttgc <tss>c</tss> ggttcaaattac
ggtagtgatacccca <rbs>gagg</rbs> attag <start_codon>atg</start_codon>
gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtg
gttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgta
gtcgc <stop_codon>tga</stop_codon> ttgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
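Why the codon split matters can be sketched on a shortened coding sequence. The string `orf` is stitched together from the first seven and last six codons of the region after the start codon, so it is an assumption for illustration, as are the names `naive` and `in_frame`.

```shell
# Shortened stand-in for the coding sequence: 7 codons from the start, 6 from the end
orf="gccaaagaagacaatattgaacgtagtcgctgattgttt"

# Naive search: the first tga spans two codons (at-tga-a), so it is out of frame
naive=$(printf '%s\n' "$orf" | sed -r "s/tag|tga|taa/ <stop_codon>&<\/stop_codon> /1")

# In-frame search: split into codons, tag the first stop codon, remove the spaces,
# then put single spaces back around the tag
in_frame=$(printf '%s\n' "$orf" |
  sed -r "s/.../& /g;s/tag|tga|taa/ <stop_codon>&<\/stop_codon> /1;s/ //g;s/<stop_codon>/ &/g;s/<\/stop_codon>/& /g")

echo "$naive"
echo "$in_frame"
```

Because each codon is followed by a space during the search, a three-letter pattern can only match a whole codon and never straddles two of them.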
  • terminator
... <terminator>...</terminator> ...
  • From my knowledge and Dondi's demonstration in class, I know that a hairpin forms when a sequence loops back and binds to itself. aaaaggt is the sequence that starts the hairpin, where the t binds with a g. Its partner gcctttt will also exist in the terminator, and this makes it simple enough to construct a command to tag the terminator:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" | 
sed -r "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1;2s/> (.){5}/&\n/g" | 
sed "3s/^./ <tss>&<\/tss> /g;3s/gagg/ <rbs>&<\/rbs> /1" | sed "3s/<\/rbs> /&\n/g" | 
sed "4s/atg/ <start_codon>&<\/start_codon> /1;4s/<\/start_codon> /&\n/g" | 
sed -r "5s/.../& /g;5s/tag|tga|taa/ <stop_codon>&<\/stop_codon> /1;5s/ //g;5s/<stop_codon>/ &/g;5s/<\/stop_codon>/& /g;
5s/<\/stop_codon> /&\n/g" | sed "6s/aaaaggt/ <terminator>&\n/g" | sed "7s/gcctttt..../&<\/terminator> /g" | 
sed ':a;N;$!ba;s/\n//g'
  • The output of the command is:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggta
acgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc
<minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat</minus10box> cttgc <tss>c</tss> ggttca
aattacggtagtgatacccca <rbs>gagg</rbs> attag <start_codon>atg</start_codon>g
ccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatg
cgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgc <stop_codon>tga</stop_codon> 
ttgttttaccgcctgatgggcgaagagaaagaacgagt <terminator>aaaaggtcggtttaaccggcctttttatt</terminator> ttat
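The pairing between the two ends of the hairpin can be checked with a short sketch that builds the reverse complement of the first stem. The variable names are assumptions, and `rev` and `tr` are just one convenient way to do this on the command line.

```shell
stem_start="aaaaggt"   # start of the terminator
stem_end="gcctttt"     # partner sequence near the end of the terminator

# Reverse the first stem and complement each base (a<->t, c<->g)
perfect=$(printf '%s\n' "$stem_start" | rev | tr 'acgt' 'tgca')

echo "perfect partner: $perfect"
echo "actual partner:  $stem_end"
```

The perfect partner would be acctttt, while the gene has gcctttt. The two differ only in the first base, which is the position where the t of aaaaggt sits across from a g instead of an a.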

What is the exact mRNA sequence that is transcribed from this gene?

  • I used sed many times in the command for this question because it lets me delete lines, so I can cut the data down to the form I want. I put each tag on its own line, then deleted the tag lines and everything outside the transcribed region (before the tss and after the terminator), joined what was left, and replaced every t with u. The command is:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" | 
sed -r "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1;2s/> (.){5}/&\n/g" | 
sed "3s/^./ <tss>&<\/tss> /g;3s/gagg/ <rbs>&<\/rbs> /1;3s/<\/rbs> /&\n/g" | 
sed "4s/atg/ <start_codon>&<\/start_codon> /1;4s/<\/start_codon> /&\n/g" | 
sed -r "5s/.../& /g;5s/tag|tga|taa/ <stop_codon>&<\/stop_codon> /1;5s/ //g;5s/<stop_codon>/ &/g;
5s/<\/stop_codon>/& /g;5s/<\/stop_codon> /&\n/g" | sed "6s/aaaaggt/ <terminator>&\n/g" | 
sed "7s/gcctttt..../&<\/terminator> /g" | sed ':a;N;$!ba;s/\n//g' | sed "s/ //g" | 
sed -r "s/<|>/\n/g" | sed "1,10D;12D;14D;16D;18D;20D;22D;24D;26D;28,29D" | sed ':a;N;$!ba;s/\n//g' | sed "s/t/u/g"
  • The sequence is:
cgguucaaauuacgguagugauaccccagaggauuagauggccaaagaagacaauauugaaaugcaagguaccguucuug
aaacguugccuaauaccauguuccgcguagaguuagaaaacggucacgugguuacugcacacaucuccgguaaaaugcgca
aaaacuacauccgcauccugacgggcgacaaagugacuguugaacugaccccguacgaccugagcaaaggccgcauugu
cuuccguagucgcugauuguuuuaccgccugaugggcgaagagaaagaacgaguaaaaggucgguuuaaccggccuuuuuauu
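As an alternative sketch (not the command I used above), the same extraction could probably be done without counting line numbers, by cutting at the tss and terminator tags directly. The short `tagged` string is a made-up stand-in for the fully tagged gene, and the variable names are assumptions.

```shell
# Hypothetical, shortened stand-in for the fully tagged sequence
tagged="aa <minus10box>cattat</minus10box> cttgc <tss>c</tss> ggtt <rbs>gagg</rbs> at <start_codon>atg</start_codon> gcc <stop_codon>tga</stop_codon> tt <terminator>aaaaggtcggcctttttatt</terminator> ttat"

# 1. drop everything up to and including <tss>
# 2. drop </terminator> and everything after it
# 3. delete the remaining tags, 4. delete spaces, 5. transcribe t to u
mrna=$(printf '%s\n' "$tagged" |
  sed 's/.*<tss>//;s/<\/terminator>.*//;s/<[^>]*>//g;s/ //g;s/t/u/g')

echo "$mrna"
```

The tags have to be deleted before the t to u step, otherwise the letters inside tag names like start_codon would be changed too.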

What is the amino acid sequence that is translated from this mRNA?

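  • Using the same technique as before, I kept only the start codon and the coding sequence up to (but not including) the stop codon, then separated that line into codons, similar to the week 3 assignment, and translated them with genetic-code.sed. The command is: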
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" | 
sed -r "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1;2s/> (.){5}/&\n/g" | 
sed "3s/^./ <tss>&<\/tss> /g;3s/gagg/ <rbs>&<\/rbs> /1;3s/<\/rbs> /&\n/g" | 
sed "4s/atg/ <start_codon>&<\/start_codon> /1;4s/<\/start_codon> /&\n/g" | 
sed -r "5s/.../& /g;5s/tag|tga|taa/ <stop_codon>&<\/stop_codon> /1;5s/ //g;5s/<stop_codon>/ &/g;
5s/<\/stop_codon>/& /g;5s/<\/stop_codon> /&\n/g" | sed "6s/aaaaggt/ <terminator>&\n/g" | 
sed "7s/gcctttt..../&<\/terminator> /g" | sed ':a;N;$!ba;s/\n//g' | sed -r "s/ //g;s/<|>/\n/g" | 
sed "1,18D;20D;22,29D" | sed ':a;N;$!ba;s/\n//g' | sed "s/.../& /g;s/t/u/g" | sed -f genetic-code.sed | sed "s/ //g"
  • The amino acid sequence is:
MAKEDNIEMQGTVLETLPNTMFRVELENGHVVTAHISGKMRKNYIRILTGDKVTVELTPYDLSKGRIVFRSR
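The translation step can be sketched with a handful of substitution rules. These rules are a hypothetical excerpt covering only the codons in this short example, not the actual contents of genetic-code.sed, and mapping the stop codon to "-" is my own assumption.

```shell
# First five codons of the mRNA coding region plus a stop codon (shortened example)
mrna="auggccaaagaagacuga"

# Split into codons, replace each codon with its amino acid letter, remove the spaces
protein=$(printf '%s\n' "$mrna" |
  sed "s/.../& /g" |
  sed -e "s/aug/M/g" -e "s/gcc/A/g" -e "s/aaa/K/g" \
      -e "s/gaa/E/g" -e "s/gac/D/g" -e "s/uga/-/g" |
  sed "s/ //g")

echo "$protein"
```

The result starts with MAKED, which matches the first five letters of the amino acid sequence above.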

Lenaolufson (talk) 22:33, 28 September 2015 (PDT)