Nanguiano Week 4

From LMU BioDB 2015

Transcription and Translation “Taken to the Next Level”

  • First, I needed to log in to my LMU CS account to access the data used in this week's assignment.
ssh nanguia1@my.cs.lmu.edu
  • Next, I needed to enter the folder that I'd created for the class, and create a new folder for this week's assignment.
cd biodb
mkdir week4
  • Next, I moved into Dondi's directory so I could obtain the file required for the assignment - infA-E.coli-K12.txt.
cd ~dondi/xmlpipedb/data
cp infA-E.coli-K12.txt ~nanguia1/biodb/week4
  • Then, I moved into my directory to prepare to do the assignment.
cd ~nanguia1/biodb/week4

For each of the following questions pertaining to this gene, provide (a) the actual answer, and (b) the sequence of text-processing commands that calculates this answer. Specific information about how these sequences can be identified is included after the list of questions.

Modify the gene sequence string so that it highlights or “tags” the special sequences within this gene

-35 box of the promoter

... <minus35box>...</minus35box> ...
  • First, I needed to identify the sequence that I'd be looking for within the file. The week 4 assignment indicated that the consensus sequence for the -35 promoter sequence is tt[gt]ac[at]. Thus, I knew I needed to plug this pattern into sed in order to find it. Because I wanted a single replacement of one sequence, I expected the sed substitute command to be the best option. My first attempt was sed "s/tt[gt]ac[at]/ & /g", which puts a space on either side of each match. This would test whether or not it was finding the sequence correctly, before I put in the tag.
  • I tested using the command cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ & /g". However, this command did not do what I wanted, since it changed every match that appeared, not just the first! Since I only wanted the first one to be changed, I did some research to find out how to change only the first occurrence using sed. Using this link from Stack Overflow, I learned that the g flag at the end of the command tells sed to change every occurrence. Changing it to 1 causes it to change only the first occurrence! Running cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ & /1" resulted in the output I expected. As a result, all that was left was to add the starting and ending tags at the two spaces.
  • However, this ended up being harder than expected. Because </minus35box> contains a / character, sed interpreted it as the end of the replacement text. The forward slash needed to be escaped so that sed would treat it as literal text rather than as part of the command. I knew that in other command-line contexts, a backslash placed before the offending character escapes it, allowing it to be read as an ordinary character. This held true for sed as well. The final command and output were as follows:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1"

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataagg 
aatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccg
aacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttacagaacttcggcattatcttgccggttcaa
attacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgt
agagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactg
accccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggttta
accggcctttttattttat
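
The difference between the g and 1 flags, and the escaping issue, can be seen on a much shorter string. The following is a minimal sketch, assuming GNU sed; the toy sequence is made up for illustration and is not part of the infA gene. It also shows that choosing a different delimiter (here |) is another way to avoid escaping the / in the closing tag.

```shell
# Toy sequence (hypothetical) containing two matches of tt[gt]ac[at]:
# "ttgaca" and "tttact".
seq="aattgacagggtttactcc"

# g flag: every match on the line is replaced.
all=$(printf '%s\n' "$seq" | sed "s/tt[gt]ac[at]/ & /g")

# 1 flag: only the first match is replaced.
first=$(printf '%s\n' "$seq" | sed "s/tt[gt]ac[at]/ & /1")

# Using | as the delimiter means the / in </minus35box> needs no backslash.
tagged=$(printf '%s\n' "$seq" | sed "s|tt[gt]ac[at]|<minus35box>&</minus35box>|1")

printf '%s\n%s\n%s\n' "$all" "$first" "$tagged"
```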

-10 box of the promoter

... <minus10box>...</minus10box> ...
  • Using what I had learned from the previous problem, as well as the hint from the week 4 assignment that the -10 box has the consensus sequence [ct]at[at]at, I began to formulate the command. Upon running a test to make sure that the sequence was being found correctly (cat infA-E.coli-K12.txt | sed "s/[ct]at[at]at/ & /g"), I was surprised to find matches both before and after the location that had been found for the -35 box. Knowing that the -10 box comes after the -35 box, with around 17 nucleotides between them, I knew that this time I could not simply change the first match, since the first match would not be correct. Here, it would be the second match that was correct. However, in an arbitrary string of text there could be more than one occurrence of the -10 box sequence before the -35 box. As a result, I wanted to restrict my search to the text after the -35 box. The way to do this, as stated by the text processing article, was to insert a newline after the target, then search only the second line, with the command sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1". The newline could then be removed with the command sed ':a;N;$!ba;s/\n//g'. The final command and output to display the -10 box alongside the -35 box were as follows:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" | 
sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1" | sed ':a;N;$!ba;s/\n//g'

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataagg
aatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccg
aacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat
</minus10box> cttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaaggtaccgttcttga
aacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctg
acgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaag
agaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
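
The newline trick can be tried on a short string first. This is a minimal sketch, assuming GNU sed (which understands \n in the replacement); the toy sequence is hypothetical and was built so that a -10 box match appears both before and after the -35 box. Note that the line address 2 only takes effect in a later sed process, because the sed that inserts the newline still counts its input as a single line.

```shell
# Toy sequence (hypothetical): "tataat" comes before the -35 box "ttgaca",
# and "cattat" comes after it. Both match [ct]at[at]at.
seq="ggtataatccttgacagggcattatcc"

# Naive approach: the first match is the wrong one (before the -35 box).
naive=$(printf '%s\n' "$seq" | sed "s/[ct]at[at]at/<minus10box>&<\/minus10box>/1")

# Newline approach: split after the -35 box, search only line 2, then rejoin.
restricted=$(printf '%s\n' "$seq" |
  sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" |
  sed "2s/[ct]at[at]at/<minus10box>&<\/minus10box>/1" |
  sed ':a;N;$!ba;s/\n//g')

printf '%s\n%s\n' "$naive" "$restricted"
```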

transcription start site

... <tss>...</tss> ...
  • The transcription start site, according to the week 4 assignment, was the 12th letter after the minus 10 box. As a result, I knew that I would have to start my search at the -10 box tag. If the newline that created a line after the -35 box was still there, the second line could be searched for the characters "> ", which mark the end of the closing -10 box tag. Since the -10 box itself is 6 nucleotides long, counting 12 letters from the start of the box means that the 6th nucleotide after the end of the tag is the transcription start site. I could place a newline after the 5 nucleotides that follow the tag, then surround the first character of the new third line with the tss tag. At the end, both newlines could be removed using the same command as before. To make matching those 5 nucleotides easier, I passed the -r option to sed, which allowed me to write a repetition count in the pattern. The command and output were as follows:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" | 
sed -r "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1;2s/> (.){5}/&\n/g" | sed "3s/^./ <tss>&<\/tss> /g" | 
sed ':a;N;$!ba;s/\n//g'

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataagg
aatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccg
aacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat
</minus10box> cttgc <tss>c</tss> ggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaag
gtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaacta
catccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgc
ctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
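
The counting step can be checked in isolation. The sketch below assumes GNU sed and uses a shortened, illustrative line that starts at the -10 box; it shows the two-step newline method described above next to a one-pass alternative that uses capture groups. The one-pass form is not the method used in this journal, only a comparison.

```shell
# Shortened line (illustrative) beginning at the -10 box.
line="<minus10box>cattat</minus10box>cttgccggtt"

# Two-step method: newline after the 5 letters that follow the tag,
# then tag the first character of line 2 and rejoin.
two_step=$(printf '%s\n' "$line" |
  sed -r "s/<\/minus10box>.{5}/&\n/" |
  sed "2s/^./<tss>&<\/tss>/" |
  sed ':a;N;$!ba;s/\n//g')

# One-pass alternative: group 1 is the tag plus 5 letters, group 2 is the TSS.
one_pass=$(printf '%s\n' "$line" |
  sed -r "s/(<\/minus10box>.{5})(.)/\1<tss>\2<\/tss>/")

printf '%s\n%s\n' "$two_step" "$one_pass"
```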

ribosome binding site

... <rbs>...</rbs> ...
  • The ribosome binding site is indicated by the consensus sequence gagg, according to the week 4 assignment. Additionally, it has to come after the transcription start site. Therefore, I knew I needed to search after the transcription start site for the ribosome binding site. Since the transcription start site is at the beginning of the third line, I can simply search that line for the sequence. This is done the same way as the search for the -10 box. The command and output were as follows:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" | 
sed -r "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1;2s/> (.){5}/&\n/g" | 
sed "3s/^./ <tss>&<\/tss> /g;3s/gagg/ <rbs>&<\/rbs> /1" | sed ':a;N;$!ba;s/\n//g'

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataagg
aatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccg
aacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat
</minus10box> cttgc <tss>c</tss> ggttcaaattacggtagtgatacccca <rbs>gagg</rbs> attagatggccaaagaagacaat
attgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaa
tgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctg
attgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat

start codon

... <start_codon>...</start_codon> ...
  • As with the rbs, the start codon, atg, will appear after the previous feature, in this case the rbs. I can place a newline after the rbs and search the new, fourth line for the start codon. The command and output were as follows:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" | 
sed -r "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1;2s/> (.){5}/&\n/g" | 
sed "3s/^./ <tss>&<\/tss> /g;3s/gagg/ <rbs>&<\/rbs> /1;3s/<\/rbs> /&\n/g" | 
sed "4s/atg/ <start_codon>&<\/start_codon> /1" | sed ':a;N;$!ba;s/\n//g'

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataagg
aatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccg
aacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat
</minus10box> cttgc <tss>c</tss> ggttcaaattacggtagtgatacccca <rbs>gagg</rbs> attag <start_codon>atg<
/start_codon> gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtca
cgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagc
aaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttatttta
t

stop codon

... <stop_codon>...</stop_codon> ...
  • The stop codon presents a new challenge. Following the start codon, the sequence must be read in groups of three. As a result, I must space out each of the codons before making the search, to ensure that the stop codon is in the same reading frame as the start codon. To do this, I began by putting a newline after the end tag of the start codon, in the same way that I added a newline after the rbs. Then, I needed to put a space after each codon. Having used this command in the week 3 assignment, I knew exactly what to use: sed "s/.../& /g", only this time I would include a 5 before the s to indicate that it should only act on the fifth line. There are three possibilities for the stop codon: tga, tag, or taa. This requires an "or" in the pattern that can match any one of several possibilities. The way this is done is by putting a "|" character between the possibilities. So I could search the fifth line for tag|tga|taa in sed after passing in the -r option. Then, I would need to remove the spaces between the codons, and put spaces around the tags. The command that I created and its output were as follows:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" | 
sed -r "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1;2s/> (.){5}/&\n/g" | 
sed "3s/^./ <tss>&<\/tss> /g;3s/gagg/ <rbs>&<\/rbs> /1;3s/<\/rbs> /&\n/g" | 
sed "4s/atg/ <start_codon>&<\/start_codon> /1;4s/<\/start_codon> /&\n/g" | 
sed -r "5s/.../& /g;5s/tag|tga|taa/ <stop_codon>&<\/stop_codon> /1;5s/ //g;5s/<stop_codon>/ &/g;5s/<\/stop_codon>/& /g" | 
sed ':a;N;$!ba;s/\n//g'

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataagg
aatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccg
aacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat
</minus10box> cttgc <tss>c</tss> ggttcaaattacggtagtgatacccca <rbs>gagg</rbs> attag <start_codon>atg<
/start_codon> gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtca
cgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagc
aaaggccgcattgtcttccgtagtcgc <stop_codon>tga</stop_codon> ttgttttaccgcctgatgggcgaagagaaagaacgagtaaaag
gtcggtttaaccggcctttttattttat
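
Why the codon spacing matters can be seen on a short example. This is a minimal sketch, assuming GNU sed; the toy coding sequence is hypothetical and was chosen so that an out-of-frame tga appears before the first in-frame stop codon, taa.

```shell
# Toy coding sequence (hypothetical). Read in frame: gct gac gtt taa ggc.
# An out-of-frame "tga" sits across the first two codons (gcTGAc).
coding="gctgacgtttaaggc"

# Naive search: finds the out-of-frame tga first.
naive=$(printf '%s\n' "$coding" |
  sed -r "s/tag|tga|taa/<stop_codon>&<\/stop_codon>/1")

# In-frame search: space the codons so a match cannot span a codon boundary,
# tag the first stop codon, then remove the spaces again.
in_frame=$(printf '%s\n' "$coding" |
  sed -r "s/.../& /g;s/tag|tga|taa/<stop_codon>&<\/stop_codon>/1;s/ //g")

printf '%s\n%s\n' "$naive" "$in_frame"
```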

terminator

... <terminator>...</terminator> ...
  • According to the week 4 assignment, the terminator hairpin loop starts with aaaaggt, where the t will end up binding with a g. This hairpin loop bends around and pairs with itself, and an additional 4 nucleotides follow the hairpin loop. Since I know the hairpin pairs with itself, I know that the complement of this sequence will appear in reverse, only it will begin with a g instead of an a, as the t pairs with a g. So I know that gcctttt will also exist in the terminator, and the 4 nucleotides following that sequence will also be included. Using everything I'd learned up to this point, I was able to construct a command that creates the tag for the terminator despite the added complexity: one sed places the opening tag before aaaaggt and a newline after it, and the next sed places the closing tag after gcctttt and the 4 nucleotides that follow. The command and final output are as follows:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" | 
sed -r "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1;2s/> (.){5}/&\n/g" | 
sed "3s/^./ <tss>&<\/tss> /g;3s/gagg/ <rbs>&<\/rbs> /1" | sed "3s/<\/rbs> /&\n/g" | 
sed "4s/atg/ <start_codon>&<\/start_codon> /1;4s/<\/start_codon> /&\n/g" | 
sed -r "5s/.../& /g;5s/tag|tga|taa/ <stop_codon>&<\/stop_codon> /1;5s/ //g;5s/<stop_codon>/ &/g;5s/<\/stop_codon>/& /g;
5s/<\/stop_codon> /&\n/g" | sed "6s/aaaaggt/ <terminator>&\n/g" | sed "7s/gcctttt..../&<\/terminator> /g" | 
sed ':a;N;$!ba;s/\n//g'

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataagg
aatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccg
aacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat
</minus10box> cttgc <tss>c</tss> ggttcaaattacggtagtgatacccca <rbs>gagg</rbs> attag <start_codon>atg<
/start_codon> gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtca
cgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagc
aaaggccgcattgtcttccgtagtcgc <stop_codon>tga</stop_codon> ttgttttaccgcctgatgggcgaagagaaagaacgagt <ter
minator>aaaaggtcggtttaaccggcctttttatt</terminator> ttat
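
The reverse-complement reasoning can be checked from the command line. This is a rough sketch, assuming GNU sed and that the rev and tr utilities are available; the variable names are made up for illustration. The strict reverse complement of aaaaggt differs from gcctttt only in its first letter, which is the g that pairs with the t. The single-command tagging shown afterwards is an alternative to the two-step approach above, not the command used in this journal, and it relies on gcctttt occurring only once after aaaaggt.

```shell
# Strict reverse complement of the start of the hairpin.
stem="aaaaggt"
revcomp=$(printf '%s\n' "$stem" | rev | tr 'acgt' 'tgca')

# Tail of the gene, copied from the end of the sequence above.
tail_seq="gagtaaaaggtcggtttaaccggcctttttattttat"

# One-command alternative: match from aaaaggt through gcctttt plus 4 letters.
tagged=$(printf '%s\n' "$tail_seq" |
  sed -r "s/aaaaggt.*gcctttt..../ <terminator>&<\/terminator> /")

printf '%s\n%s\n' "$revcomp" "$tagged"
```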

What is the exact mRNA sequence that is transcribed from this gene?

  • The mRNA sequence will be transcribed from the transcription start site all the way to the end of the terminator. To simplify this solution, I created a command to perform the computation for me, building on the command from the "terminator" part above. I knew I could delete everything before the transcription start site. If I placed a newline before the tss, I could remove the first line using the command "1D". However, this idea could be taken further. If I placed a newline before and after every tag, I could isolate the lines that contain nothing but tag names, and remove them. By running the commands sed "s/ //g" | sed -r "s/<|>/\n/g" after the commands listed in the "terminator" part above, I was able to break the block of text into 29 lines, with the tag names to be removed on every other line starting at line 2, as shown below.
1:  ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccga
    taaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggag
    taatgtgccgaacctgtttgttgcgatttagcgcgcaaatc
2:  minus35box
3:  tttact
4:  /minus35box
5:  tatttacagaacttcgg
6:  minus10box
7:  cattat
8:  /minus10box
9:  cttgc
10: tss
11: c
12: /tss
13: ggttcaaattacggtagtgatacccca
14: rbs
15: gagg
16: /rbs
17: attag
18: start_codon
19: atg
20: /start_codon
21: gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggtt
    actgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagc
    aaaggccgcattgtcttccgtagtcgc
22: stop_codon
23: tga
24: /stop_codon
25: ttgttttaccgcctgatgggcgaagagaaagaacgagt
26: terminator
27: aaaaggtcggtttaaccggcctttttatt
28: /terminator
29: ttat
  • The data before the transcription start site was located on lines 1-10, and the data after the terminator was on line 29. Therefore, I could run the command sed "1,10D;12D;14D;16D;18D;20D;22D;24D;26D;28,29D" and delete every one of the unnecessary lines. Removing the newlines left me with a string of DNA that needed only the t's converted to u's. The final command and output are as follows:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" | 
sed -r "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1;2s/> (.){5}/&\n/g" | 
sed "3s/^./ <tss>&<\/tss> /g;3s/gagg/ <rbs>&<\/rbs> /1;3s/<\/rbs> /&\n/g" | 
sed "4s/atg/ <start_codon>&<\/start_codon> /1;4s/<\/start_codon> /&\n/g" | 
sed -r "5s/.../& /g;5s/tag|tga|taa/ <stop_codon>&<\/stop_codon> /1;5s/ //g;5s/<stop_codon>/ &/g;
5s/<\/stop_codon>/& /g;5s/<\/stop_codon> /&\n/g" | sed "6s/aaaaggt/ <terminator>&\n/g" | 
sed "7s/gcctttt..../&<\/terminator> /g" | sed ':a;N;$!ba;s/\n//g' | sed "s/ //g" | 
sed -r "s/<|>/\n/g" | sed "1,10D;12D;14D;16D;18D;20D;22D;24D;26D;28,29D" | sed ':a;N;$!ba;s/\n//g' | sed "s/t/u/g"

cgguucaaauuacgguagugauaccccagaggauuagauggccaaagaagacaauauugaaaugcaagguaccguucuugaaacguugccuaauaccauguu
ccgcguagaguuagaaaacggucacgugguuacugcacacaucuccgguaaaaugcgcaaaaacuacauccgcauccugacgggcgacaaagugacuguuga
acugaccccguacgaccugagcaaaggccgcauugucuuccguagucgcugauuguuuuaccgccugaugggcgaagagaaagaacgaguaaaaggucgguu
uaaccggccuuuuuauu
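
For comparison, the same extraction can be sketched without counting line numbers. This is an alternative under stated assumptions (GNU sed, and a tagged string that contains exactly one <tss> and one </terminator>), not the command used above; the short tagged string is hypothetical.

```shell
# Short tagged string (hypothetical), laid out like the tagged gene above.
tagged="aa <tss>c</tss> gg <rbs>gagg</rbs> at <terminator>aaaaggt</terminator> ttat"

# 1. Drop everything up to and including <tss>.
# 2. Drop </terminator> and everything after it.
# 3. Strip the remaining tags and the spaces.
# 4. Only then convert t to u, so tag names are never altered.
mrna=$(printf '%s\n' "$tagged" |
  sed -r "s/.*<tss>//;s/<\/terminator>.*//;s/<[^>]*>//g;s/ //g;s/t/u/g")

printf '%s\n' "$mrna"
```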

What is the amino acid sequence that is translated from this mRNA?

  • To get the amino acid sequence, I went back to the command that I'd written to separate out the tags by line. This time, nothing before the start codon or after the end of the coding sequence needed to be kept. This meant that lines 1-18 as well as lines 22-29 (the stop codon and everything after it) could be removed. Line 20, which holds only the closing start codon tag, also needed to be removed. Therefore, the command could be adjusted to sed "1,18D;20D;22,29D".
  • To perform the conversion to the amino acid sequence, I performed the following series of commands to get genetic-code.sed from the week 3 assignment into the current directory:
  • First, I moved to the week 3 directory.
cd ../week3
  • Next, I copied the file to my directory.
cp genetic-code.sed ../week4
  • Lastly, I returned to the week 4 directory to continue the assignment.
cd ../week4
  • With the file in the directory, I was now able to start performing the conversion. As before, I needed to break the line up into codons so that they could be fed into genetic-code.sed. The procedure for doing this was the same as in the week 3 assignment. The full command and output were as follows:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" | 
sed -r "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1;2s/> (.){5}/&\n/g" | 
sed "3s/^./ <tss>&<\/tss> /g;3s/gagg/ <rbs>&<\/rbs> /1;3s/<\/rbs> /&\n/g" | 
sed "4s/atg/ <start_codon>&<\/start_codon> /1;4s/<\/start_codon> /&\n/g" | 
sed -r "5s/.../& /g;5s/tag|tga|taa/ <stop_codon>&<\/stop_codon> /1;5s/ //g;5s/<stop_codon>/ &/g;
5s/<\/stop_codon>/& /g;5s/<\/stop_codon> /&\n/g" | sed "6s/aaaaggt/ <terminator>&\n/g" | 
sed "7s/gcctttt..../&<\/terminator> /g" | sed ':a;N;$!ba;s/\n//g' | sed -r "s/ //g;s/<|>/\n/g" | 
sed "1,18D;20D;22,29D" | sed ':a;N;$!ba;s/\n//g' | sed "s/.../& /g;s/t/u/g" | sed -f genetic-code.sed | sed "s/ //g"

MAKEDNIEMQGTVLETLPNTMFRVELENGHVVTAHISGKMRKNYIRILTGDKVTVELTPYDLSKGRIVFRSR
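
The final translation step can be sketched on the first five codons of the coding sequence. The contents of genetic-code.sed are not shown in this journal, so the inline rules below are a hypothetical stand-in that covers only the five codons in the example; the sketch assumes GNU sed.

```shell
# First 15 letters of the coding sequence, starting at the start codon.
dna="atggccaaagaagac"

# Space out the codons and convert t to u, as in the full command.
codons=$(printf '%s\n' "$dna" | sed "s/.../& /g;s/t/u/g")

# Hypothetical stand-in for genetic-code.sed: one rule per codon used here.
protein=$(printf '%s\n' "$codons" |
  sed "s/aug/M/g;s/gcc/A/g;s/aaa/K/g;s/gaa/E/g;s/gac/D/g" |
  sed "s/ //g")

printf '%s\n' "$protein"
```

The result agrees with the first five letters of the amino acid sequence above.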

Links

Nicole Anguiano
BIOL 367, Fall 2015
