Troque Week 4

From LMU BioDB 2015
Jump to: navigation, search

User Page        Bio Databases Main Page       


Transcription and Translation “Taken to the Next Level”

First, login to the LMU CS server using ssh. Type in the following in a command prompt (Windows) or terminal (Mac) window:

ssh <username@my.cs.lmu.edu>

Enter your password. Note: You will not visibly see the cursor move when typing in your password so just keep typing. Then change directories to dondi's using the following commands to find the practice files and other miscellaneous files:

cd ~dondi/xmlpipedb/data

Here, you can use the command ls in order to see the list of files in the directory. Then we can start manipulating some files. Note: I collaborated with Lena Olufson when starting this assignment. We first decided to use grep in order to visually see where the pattern would be (it was actually my fault since we could've jumped to using sed right away, but I didn't read the assignment description thoroughly; I didn't notice that we were supposed to add the tags).

In this assignment, we will be manipulating the file infA-E.coli-K12.txt.

Adding tags to the strand

We start off by adding the -35 box of the promoter. The tag that we will add is

...<minus35box>...</minus35box>...

We do this by using the sed command in order to "replace" the empty string around the pattern we are looking for. For this part, we are looking for the pattern tt[gt]ac[at].

Type the following command to insert the tags around the pattern on each line of its occurrence:

sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/g" infA-E.coli-K12.txt

Since we want to keep the pattern in the same file, but we want to add the tags around it, we use the & symbol. We are also adding a new line after we add the tags using \n. This will be especially useful later on when we are looking for the -10 box. Note: we cannot simply type a forward slash (/) into the code; a regular forward slash is treated differently by the command line so we have to "escape" it using the escape character backslash (\). After running the command above, the command will output something like this:

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattacccc
gctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttag
ccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc<minus35box>tttact</minus35box>
ta<minus35box>tttaca</minus35box>
gaacttcggcattatcttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaa
ggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaa
atgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtc 
ttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat

(Note: This is actually just 2 lines of bases: the second line starts after the tag </minus35box>. For visual purposes, I decided to break up the strand into lines for this wiki.)

Notice that sed found 2 matches for the pattern for the -35 box. We'll have to decide which is the real one by adjusting our command for sed; more specifically, we need to change the last argument such that we are inspecting one of the two matches. For looking at just the first match, we use the command:

sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" infA-E.coli-K12.txt

For the second match only:

sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/2" infA-E.coli-K12.txt

We have to decide which one to use; in this case, we'll just choose the first one and hope we get lucky! So then we will have the following:

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattacccc
gctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttag
ccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc<minus35box>tttact</minus35box>
tatttacagaacttcggcattatcttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattg
aaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatct
ccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggcc
gcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggccttttt
attttat

Next, we can add the -10 box tag. We will use a similar command that we use in adding the -35 box:

sed "s/[ct]at[at]at/<minus10box>&<\/minus10box>\n/g"

When we pipeline the commands for -35 and -10,

cat infA-E.coli-K12.txt |  sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" |  sed "s/[ct]at[at]at/<minus10box>&<\/minus10box>\n/g"

the tags around the patterns found will be added so we have the following strand:

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgt<minus10box>tataat</minus10box>
tgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgc
tcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaa
atc<minus35box>tttact</minus35box> tatttacagaacttcgg<minus10box>cattat</minus10box>
cttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaa
acgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactac
atccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctga
ttgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat

Notice that this, too, matched with 2 patterns. We can determine which is the correct one by remembering that the -35 box always comes before the -10 box and that the -35 box and -10 box are generally 17 bases apart. This means that, from the end of the -35 box to the start of the -10 box, there are 17 bases. We use the sed again for the purpose of finding the 17 bases; for this reason, we turn to the information provided here for matching a certain number of characters without typing 17 dots and for selecting which match to use; in this case, it would be the second -10 box match. Since we do not care which bases they are, we use the "." as placeholder:

cat infA-E.coli-K12.txt |
sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" |
sed "s/[ct]at[at]at/<minus10box>&<\/minus10box>\n/1" |
sed -r "2s/^(.){17}/&\n/g"

So then we get:

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattacccc
gctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttag
ccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc<minus35box>tttact</minus35box>
tatttacagaacttcgg      
<minus10box>cattat</minus10box>
cttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaa
acgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactac
atccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctga
ttgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat

Next, we find the transcription start site. We know that it is located 12 characters after the start of the -10 box. Since the -10 box already has 6 characters, we should actually be looking for the 6 bases after the -10 box. We use sed again to find the 12th base after the -10 box; We look for 5 bases after the -10 box and the 6th one is the transcription start site to attach the <tss></tss> tags:

cat infA-E.coli-K12.txt | 
sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" | 
sed -r "2s/^(.){17}/&\n/g" |
sed "s/[ct]at[at]at/<minus10box>&<\/minus10box>\n/1" |
sed -r "5s/^(.){5}/&\n/g" |
sed "6s/^./<tss>&<\/tss>\n/g"

Then the result will be the following:

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgt<minus10box>tataat</minus10box>
tgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccg
ctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgc
aaatc<minus35box>tttact</minus35box>
tatttacagaacttcgg
<minus10box>cattat</minus10box>
cttgc
<tss>c</tss>
ggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgtt
gcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatcc
gcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattg
ttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat

Next is the ribosome binding site. We know that the sequence will be gagg. We also know that this pattern should come after the transcription start site and so we start our search from the end of the <tss> tag, i.e. we start on the 7th line.

cat infA-E.coli-K12.txt | 
sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" | 
sed -r "2s/^(.){17}/&\n/g" |
sed "s/[ct]at[at]at/<minus10box>&<\/minus10box>\n/1" |
sed -r "5s/^(.){5}/&\n/g" |
sed "6s/^./<tss>&<\/tss>\n/g" |
sed "7s/gagg/\n<rbs>&<\/rbs>\n/"

And we get the following:

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgt<minus10box>tataat</minus10box>
tgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccg
ctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgc
aaatc<minus35box>tttact</minus35box>
tatttacagaacttcgg
<minus10box>cattat</minus10box>
cttgc
<tss>c</tss>
ggttcaaattacggtagtgatacccca<rbs>gagg</rbs>
attagatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaa
acggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaac
tgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgag
taaaaggtcggtttaaccggcctttttattttat

(Note: the rbs tag is on the 8th line on a command window even though it's shown to be the 9th here.)

Next is the start_codon. Since the start codon in an mRNA sequence is aug, we should look for an atg in our strand.

cat infA-E.coli-K12.txt | 
sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" | 
sed -r "2s/^(.){17}/&\n/g" |
sed "s/[ct]at[at]at/<minus10box>&<\/minus10box>\n/1" |
sed -r "5s/^(.){5}/&\n/g" |
sed "6s/^./<tss>&<\/tss>\n/g" |
sed "7s/gagg/<rbs>&<\/rbs>\n/1" |
sed "8s/atg/<start_codon>&<\/start_codon>\n/1"

And we get the following:

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgt<minus10box>tataat</minus10box>
tgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccg
ctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgc
aaatc<minus35box>tttact</minus35box>
tatttacagaacttcgg
<minus10box>cattat</minus10box>
cttgc
<tss>c</tss>
ggttcaaattacggtagtgatacccca<rbs>gagg</rbs>
attag<start_codon>atg</start_codon>
gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtca
cgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccc
cgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaa
ggtcggtttaaccggcctttttattttat

For the stop codon, we have to separate the bases into groups of 3 after the start codon since we don't want to match just any three bases that follow the pattern for the stop codon. We add the following commands at the end of our pipeline from above to look for the first instance of our stop codon; the groups of three's that we should be looking for are tag, tga, and, taa since these match up with the mRNA stop codon:

cat infA-E.coli-K12.txt | 
sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" | 
sed -r "2s/^(.){17}/&\n/g" |
sed "s/[ct]at[at]at/<minus10box>&<\/minus10box>\n/1" |
sed -r "5s/^(.){5}/&\n/g" |
sed "6s/^./<tss>&<\/tss>\n/g" |
sed "7s/gagg/<rbs>&<\/rbs>\n/1" |
sed "8s/atg/<start_codon>&<\/start_codon>\n/1" |
sed "9s/.../& /" |
sed -r "9s/tag|tga|taa/<stop_codon>&<\/stop_codon>/" |
sed "9s/ //" |
sed "9s/<stop_codon>/ &/" |
sed "9s/<\/stop_codon>/& /" 

(Note: we have to turn the strand back from having spaces to having no spaces in between for the next set of commands).

Lastly, for the terminator, we know that the first part is the hairpin which binds to itself. Therefore, since the hairpin is known to be aaaaggt we need to match its reverse. The only difference for the terminator is that the g will not match with an a, but with a t so the following commands are used:

cat infA-E.coli-K12.txt | 
sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" | 
sed -r "2s/^(.){17}/&\n/g" |
sed "s/[ct]at[at]at/<minus10box>&<\/minus10box>\n/1" |
sed -r "5s/^(.){5}/&\n/g" |
sed "6s/^./<tss>&<\/tss>\n/g" |
sed "7s/gagg/<rbs>&<\/rbs>\n/1" |
sed "8s/atg/<start_codon>&<\/start_codon>\n/1" |
sed "9s/.../& /g" |
sed -r "9s/tag|tga|taa/<stop_codon>&<\/stop_codon>/g" |
sed "9s/ //g" |
sed "9s/<stop_codon>/ &/g" |
sed "9s/<\/stop_codon>/& /g" | 
sed "9s/aaaaggt/\n<terminator>&/g" |
sed "10s/gcctttt..../&<\/terminator>/g" |
sed ':a;N;$!ba'|
sed 's/\n//g'

This list of commands should output the same strand but with all the necessary tags.

mRNA transcribed from the gene

In order to get the mRNA strand from the DNA strand, we first have to extract the section of the DNA strand that is actually read from the rest of it. I was thinking of using some regex patterns for extracting the necessary strand, but couldn't think of a way to isolate the tags on the inside and still keep the strands I wanted to transcribe. I ended up looking through another students' methods and found that Nicole Anguiano's method is a lot more efficient than anything I thought of so borrowed her process. However, I did notice that here method only switched the t's with the u's. The correct way of translating DNA into mRNA would be to switch ALL bases with their corresponding mRNA base:

sed "s/ //g" |  sed -r "s/<|>/\n/g" | sed "1,10D;12D;14D;16D;18D;20D;22D;24D;26D;28,29D" | sed ':a;N;$!ba;s/\n//g' | sed "y/acgt/ugca/"

When the above commands are added to the end of the ones for Part1, the result is as follows:

ccaaguuuaaugccaucacuauggggucuccuaaucuaccgguuucuucuguuauaacuuuacguuccauggcaagaacuuugcaacggauuaugguacaaggcgcaucuc
aaucuuuugccagugcaccaaugacguguguagaggccauuuuacgcguuuuugauguaggcguaggacugcccgcuguuucacugacaacuugacuggggcaugcuggac
ucguuuccggcguaacagaaggcaucagcgacuaacaaaauggcggacuacccgcuucucuuucuugcucauuuuccagccaaauuggccggaaaaauaa

Amino acid sequence

Just like with Week 3's assignment, we can use the same set of commands to transcribe the mRNA sequence so that we get their corresponding amino acids. We use the file genetic-code.sed to make our translation easier since this file already has the amino acids with their respective codons. Then we can add the commands below to the ones we did for Part1 of this assignment:

cat infA-E.coli-K12.txt | 
sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" | 
sed -r "2s/^(.){17}/&\n/g" |
sed "s/[ct]at[at]at/<minus10box>&<\/minus10box>\n/1" |
sed -r "5s/^(.){5}/&\n/g" |
sed "6s/^./<tss>&<\/tss>\n/g" |
sed "7s/gagg/<rbs>&<\/rbs>\n/1" |
sed "8s/atg/<start_codon>&<\/start_codon>\n/1" |
sed "9s/.../& /g" |
sed -r "9s/tag|tga|taa/<stop_codon>&<\/stop_codon>/g" |
sed "9s/ //g" |
sed "9s/<stop_codon>/ &/g" |
sed "9s/<\/stop_codon>/& /g" | 
sed "9s/aaaaggt/\n<terminator>&/g" |
sed "10s/gcctttt..../&<\/terminator>/g" |
sed ':a;N;$!ba'|
sed 's/\n//g'|
sed "1,18D;20D;22,29D" |
sed ':a;N;$!ba;s/\n//g'|
sed "s/.../& /g" |
sed "y/acgt/ugca/" |
sed -f genetic-code.sed

Then the result will be:

YRFLLL-LYVPWQELCNGLWYKAHLNLLPVHQ-RV-RPFYAFLM-A-DCPLFH-QLDWGMLDSFPA-QKASA

Assignment Links

Weekly Assignments

Individual Journal Entries

Shared Journal Entries