Vpachec3 Week 4

Modify the Gene Sequence with Tags

-35 Box

-10 Box

My lab partner, Nicole, was a big help and walked me through the Week 4 homework. Here is how far we got:

vpachec3@ab201:/nfs/home/dondi/xmlpipedb/data$ cat infA-E.coli-K12.txt |sed "s/tt[gt]ac[at]/ <minus35box>& <\/minus35box> /1"|sed "s/[ct]at[at]at/ <minus10box>& <\/minus10box> /2"

This is what the command gave us:

 ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact </minus35box> tatttacagaacttcgg <minus10box>cattat </minus10box>     cttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
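As a small illustration of the two sed features doing the work here, `&` (the matched text) and the trailing occurrence number, the sketch below runs the -35 box pattern on a short made-up sequence. The sequence and the variable names are invented for the example and are not taken from infA-E.coli-K12.txt.

```shell
# Toy sequence (hypothetical) with two matches of tt[gt]ac[at]:
# "ttgaca" first, then "tttact".
seq="aattgacaggtttactcc"

# Occurrence 1: wrap the first match; & stands for whatever matched.
first=$(printf '%s\n' "$seq" | sed 's/tt[gt]ac[at]/<minus35box>&<\/minus35box>/1')

# Occurrence 2: same pattern, but only the second match is wrapped.
second=$(printf '%s\n' "$seq" | sed 's/tt[gt]ac[at]/<minus35box>&<\/minus35box>/2')

printf '%s\n' "$first" "$second"
```

Changing only the number at the end moves the tag from one candidate site to the next, which is how the commands above pick the first -35 match and the second -10 match.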

Right before we were stopped to bring it back into a larger group discussion, Nicole taught me that \n would break the sequence into two lines. We didn't get to apply it on the command line yet.

Transcription Start Site

Now I'm trying this on my own. I used \n to break the line as a first step toward adding the transcription start site. I wanted to break the sequence into two lines, so I used this command:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>& <\/minus35box> /1" |sed "s/[ct]at[at]at/ <minus10box>& <\/minus10box> /2"| sed "s/ <minus10box>/& \n/g"

However, I wanted to break the line after the minus 10 box so I modified the command:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>& <\/minus35box> /1" |sed "s/[ct]at[at]at/ <minus10box>& <\/minus10box> /2"| sed "s/ <\/minus10box> /&\n/g"

This command gave me:

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact </minus35box> tatttacagaacttcgg <minus10box>cattat </minus10box>

cttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat

Breaking it into two lines makes it easier to insert the transcription start site, because we were told: The transcription start site is located at the 12th nucleotide after the first nucleotide of the -10 box.

This means that I could count nucleotides to find the transcription start site and reuse commands from earlier to insert the tag. I used:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>& <\/minus35box> /1" |sed "s/[ct]at[at]at/ <minus10box>& <\/minus10box> /2"| sed "s/ <\/minus10box> /&\n/g"| sed "2s/cc/ <tts> /1"

However, this was problematic because it replaced the nucleotides rather than putting the tag around them. Therefore, the command needed to be revised.
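The difference between replacing a match and wrapping it can be seen on a tiny example. This is only a sketch on a made-up eight-letter string; the variable names are hypothetical.

```shell
# Toy line (hypothetical): the first bases after a -10 box.
line="cttgccgg"

# Without & in the replacement, the matched "cc" is thrown away.
replaced=$(printf '%s\n' "$line" | sed 's/cc/<tss>/')

# With &, the matched base is kept and the tags go around it.
# The trailing 2 selects the second "c" on the line.
wrapped=$(printf '%s\n' "$line" | sed 's/c/<tss>&<\/tss>/2')

printf '%s\n' "$replaced" "$wrapped"
```

In the first result two nucleotides are lost from the sequence; in the second every original base is still present.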

New command

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>& <\/minus35box> /1" |sed "s/[ct]at[at]at/ <minus10box>& <\/minus10box> /2"| sed "s/ <\/minus10box> /&\n/g"| sed "2s/c/ <tss>& <\/tss> /5"

This command did exactly what I needed:

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact </minus35box> tatttacagaacttcgg <minus10box>cattat </minus10box>

cttgccggttcaaatta <tss>c </tss>ggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat

Finally, to tie it all back together, I added this command at the end: sed ':a;N;$!ba;s/\n//g'

This leaves me with this sequence:

 ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact </minus35box> tatttacagaacttcgg <minus10box>cattat </minus10box> cttgccggttcaaatta <tss>c </tss> ggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
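For readers unfamiliar with the joining command, here is a minimal sketch of what it does, assuming GNU sed (the version where the semicolon-separated form is known to work). The three input lines are made up.

```shell
# :a      define a label named "a"
# N       append the next input line to the pattern space
# $!ba    if this is not the last line, branch back to "a"
# s/\n//g once everything is collected, delete every newline
joined=$(printf 'aaa\nccc\nggg\n' | sed ':a;N;$!ba;s/\n//g')

# tr gives the same result here and may be easier to read.
joined_tr=$(printf 'aaa\nccc\nggg\n' | tr -d '\n')

printf '%s\n' "$joined" "$joined_tr"
```

Either form turns the multi-line output back into a single sequence line.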

Ribosome Binding Site

I assume that adding the ribosome binding site will be similar to adding the other tags. The tricky part should be deciding where to put the tag in the sequence. This is the information given: "A consensus sequence for the ribosome binding site is gagg". So I'm going to break the sequence into two lines and look for the gagg pattern after the transcription start site.

So I broke the line into two:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>& <\/minus35box> /1" |sed "s/[ct]at[at]at/ <minus10box>& <\/minus10box> /2"| sed "s/ <\/minus10box> /&\n/g"| sed "2s/c/ <tss>& <\/tss> /5"|  sed ':a;N;$!ba;s/\n//g'| sed "s/ <\/tss> /&\n/g"

Then I added the ribosome binding site tag into the second line:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>& <\/minus35box> /1" |sed "s/[ct]at[at]at/ <minus10box>& <\/minus10box> /2"| sed "s/ <\/minus10box> /&\n/g"| sed "2s/c/ <tss>& <\/tss> /5"|  sed ':a;N;$!ba;s/\n//g'| sed "s/ <\/tss> /&\n/g"| sed "2s/gagg/ <rbs>& <\/rbs> /1"

Then put the two lines back together to get:

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact </minus35box> tatttacagaacttcgg <minus10box>cattat </minus10box> cttgccggttcaaatta <tss>c </tss> ggtagtgatacccca <rbs>gagg </rbs> attagatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
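The reason for splitting before searching is that gagg can also occur upstream of the transcription start site. The sketch below shows the split, tag, and rejoin cycle on a made-up sequence that has gagg on both sides of the tss tag. It assumes GNU sed, where \n in the replacement inserts a newline; the sequence and variable names are hypothetical.

```shell
# Toy sequence (hypothetical): one gagg before the tss, one after it.
seq="aagaggtt<tss>c</tss>aagaggaatg"

tagged=$(printf '%s\n' "$seq" \
  | sed 's/<\/tss>/&\n/' \
  | sed '2s/gagg/<rbs>&<\/rbs>/' \
  | sed ':a;N;$!ba;s/\n//g')
# Step 1 breaks the line right after </tss>.
# Step 2 tags the first gagg on line 2 only (downstream of the tss).
# Step 3 joins the two lines again.

printf '%s\n' "$tagged"
```

The upstream gagg is left alone because the `2` address restricts the substitution to the second line.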

Start Codon

From biology, we know that the start codon is atg and that it falls after the ribosome binding site. So I will divide the sequence into two lines once again and add the start codon tag. After this is done, I will put the two lines back together. Thinking about it now, I could have left the sequence split into two or even three lines until all the tags were added and then combined the lines afterward. However, this is good practice. Here is the final product:

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact </minus35box> tatttacagaacttcgg <minus10box>cattat </minus10box> cttgccggttcaaatta <tss>c </tss> ggtagtgatacccca <rbs>gagg </rbs> attag <start_codon>atg </start_codon> gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat

Stop Codon

Unlike the start codon, the stop codon has more than one possible sequence. So what I am thinking is to break the sequence into two lines and put in a command that searches for multiple combinations: sed "s/.../& /g"

So I tried this command:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>& <\/minus35box> /1" |sed "s/[ct]at[at]at/ <minus10box>& <\/minus10box> /2"| sed "s/ <\/minus10box> /&\n/g"| sed "2s/c/ <tss>& <\/tss> /5"|  sed ':a;N;$!ba;s/\n//g'| sed "s/ <\/tss> /&\n/g"| sed "2s/gagg/ <rbs>& <\/rbs> /1" |sed ':a;N;$!ba;s/\n//g'| sed "s/ <\/rbs> /&\n/g"| sed "2s/atg/ <start_codon>& <\/start_codon> /1"| sed ':a;N;$!ba;s/\n//g' | sed "s/ <\/start_codon> /&\n/g"| sed "2s/tga|tag|taa/ <stop_codon>& <\/stop_codon> /g"

And it did not appear to work. Initially, I had no clue why. I tried multiple variations of the command and it still wasn't working. My fallback method, not particularly efficient but workable, was to search for the three stop codons (tga, tag, taa) individually and see which of the three came first.

But then I realized the mistake: the codons need to be read in threes, which I hadn't specified in the first command. So now I need to add the sed "s/.../& /g" command so that the sequence is read in threes starting at the second line.

I added this command but it didn't insert the stop_codon in the appropriate place. It didn't even put it in the sequence.

sed  "2s/.../& /g"| sed "2s/tag|tga|taa/ <stop_codon>& <\/stop_codon> /1"

But alas, I finally figured it out! I used this command to add the tags:

sed  "2s/.../& /g"| sed -r "2s/tag |taa |tga /<stop_codon>& <\/stop_codon>/1"

And now I need to put the sequence back together

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact </minus35box> tatttacagaacttcgg <minus10box>cattat </minus10box> cttgccggttcaaatta <tss>c </tss> ggtagtgatacccca <rbs>gagg </rbs> attag <start_codon>atg </start_codon> gcc aaa gaa gac aat att gaa atg caa ggt acc gtt ctt gaa acg ttg cct aat acc atg ttc cgc gta gag tta gaa aac ggt cac gtg gtt act gca cac atc tcc ggt aaa atg cgc aaa aac tac atc cgc atc ctg acg ggc gac aaa gtg act gtt gaa ctg acc ccg tac gac ctg agc aaa ggc cgc att gtc ttc cgt agt cgc <stop_codon>tga  </stop_codon>ttg ttt tac cgc ctg atg ggc gaa gag aaa gaa cga gta aaa ggt cgg ttt aac cgg cct ttt tat ttt at

The second line is still in triplets but I am not going to change it since I feel like it will come in handy later on.
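Two things seem to be going on in the attempts above. In sed's basic regular expressions a bare `|` is an ordinary character, so `tga|tag|taa` only becomes an alternation with `-r` (or `-E`). And without the triplet spacing, the first textual match may sit out of frame. The sketch below shows both points on a made-up open reading frame; the sequence and variable names are hypothetical.

```shell
# Toy ORF (hypothetical), codons: gcc tta gca tga ttg
# An out-of-frame "tag" spans the end of codon 2 and start of codon 3.
orf="gccttagcatgattg"

# Naive: first textual match, ignoring the reading frame.
naive=$(printf '%s\n' "$orf" \
  | sed -E 's/tag|taa|tga/<stop_codon>&<\/stop_codon>/')

# Framed: split into triplets, match a whole codon followed by a
# space, then remove the spaces again.
framed=$(printf '%s\n' "$orf" \
  | sed -E 's/.../& /g' \
  | sed -E 's/(tag|taa|tga) /<stop_codon>\1<\/stop_codon> /' \
  | sed 's/ //g')

printf '%s\n' "$naive" "$framed"
```

The naive version tags the out-of-frame tag, while the framed version tags the in-frame tga, which mirrors why adding the space after each codon in the pattern made the command work.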

Terminator

At this point I am struggling, and I ended up using grep to find the hairpin. We know the hairpin is aaaaggt, so I used grep to search for it in the file. After that I used:

sed "s/gcctttttatt/&<\/terminator>/g"

But this did not work because I actually need them to not be separated into triplets.
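A quick way to see the problem is to count matches with and without the triplet spacing. This is a sketch on a short made-up fragment; the variable names are hypothetical, and `|| true` is only there so that a no-match result from grep does not abort a script run with `set -e`.

```shell
# Toy fragment (hypothetical), still split into triplets.
spaced="cga gta aaa ggt cgg"

# The spaces interrupt the pattern, so grep reports 0 matching lines.
hits_spaced=$(printf '%s\n' "$spaced" | grep -c 'aaaaggt' || true)

# Removing the spaces first lets the same pattern match.
hits_joined=$(printf '%s\n' "$spaced" | sed 's/ //g' | grep -c 'aaaaggt' || true)

printf 'spaced: %s, joined: %s\n' "$hits_spaced" "$hits_joined"
```

So one option, instead of starting over, would be to strip the spaces with `sed 's/ //g'` before searching for the terminator.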

Well, it looked like I had gotten myself into a hole that I couldn't get out of, so I redid my command. The command line I already had showed where each tag should go, so I used that information to write the new one.

sed "s/cattat/<minus10box>&<\/minus10box>/g" infA-E.coli-K12.txt | sed "s/tttact/<minus35box>&<\/minus35box>/g" | sed -r "s/<\/minus10box>.{5}/&<\/tss>/g" | sed "s/<\/tss>/<tss>c&/g" | sed "s/gagg/<rbs>&<\/rbs>/g" | sed -r "s/<\/rbs>.{8}/&<\/start_codon>/g" | sed "s/<\/start_codon>/<startcodon>atg&/g" | sed "1s/tga/<stop_codon>&<\/stop_codon>/3" | sed "s/aaaaggt/<terminator>&/g" | sed "s/gcctttttatt/&<\/terminator>/g"

and I FINALLY got the sequence:

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc<minus35box>tttact</minus35box>tatttacagaacttcgg<minus10box>cattat</minus10box>cttgc<tss>c</tss>cggttcaaattacggtagtgatacccca<rbs>gagg</rbs>attagatg<startcodon>atg</start_codon>gccaaagaagacaatat<stop_codon>tga</stop_codon>aatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagt<terminator>aaaaggtcggtttaaccggcctttttatt</terminator>ttat
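One thing worth double-checking in this output: the `.{5}` step followed by `<tss>c` writes a literal c into the sequence rather than wrapping the base that is already there, which appears to be why the result reads `cttgc<tss>c</tss>cggtt` where the input has `cttgccggtt` (the start codon step does the same with atg). A possible alternative, sketched below on a made-up fragment, is to capture the existing base and wrap it in a single substitution. It assumes the sixth base after the -10 box is the intended site, as the `.{5}` in the command suggests; the variable names are hypothetical.

```shell
# Toy fragment (hypothetical): a tagged -10 box and a few bases after it.
seq="gg<minus10box>cattat</minus10box>cttgccggtt"

# Group 1: the closing tag plus the next 5 bases. Group 2: the 6th base.
# The replacement puts the tss tags around group 2 without adding bases.
tagged=$(printf '%s\n' "$seq" \
  | sed -E 's/(<\/minus10box>.{5})(.)/\1<tss>\2<\/tss>/')

# Sanity check: removing all tags should give back the same bases.
bases_before=$(printf '%s\n' "$seq" | sed 's/<[^>]*>//g')
bases_after=$(printf '%s\n' "$tagged" | sed 's/<[^>]*>//g')

printf '%s\n' "$tagged"
```

Comparing the sequence with all tags stripped before and after is a cheap way to confirm that tagging has not added or removed any nucleotides.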

Q2

I used this command:

sed "y/atcg/uagc/"

This would cause this result:

aaaagugguguucuuacuuacaaaagccguguaaagaggggucucacaauauuaacgccagcgucucaaccaaugcgaguaauggggcgacggcuauuccuuaaaaagcgcaguccauugcggguagcaaauagaguggcgagggaauaugcaacgcgaaaaccacgccgaaucggcacacaaaagccucauuacacggcuuggacaaacaacgcuaaaucgcgcguuuag<minus35box>aaauga</minus35box>auaaaugucuugaagcc<minus10box>guaaua</minus10box>gaacg<ass>g</ass>gccaaguuuaaugccaucacuauggggu<rbs>cucc</rbs>uaaucuac<sauragodon>uac</saura_godon>cgguuucuucuguuaua<saop_godon>acu</saop_godon>uuacguuccauggcaagaacuuugcaacggauuaugguacaaggcgcaucucaaucuuuugccagugcaccaaugacguguguagaggccauuuuacgcguuuuugauguaggcguaggacugcccgcuguuucacugacaacuugacuggggcaugcuggacucguuuccggcguaacagaaggcaucagcgacuaacaaaauggcggacuacccgcuucucuuucuugcuca<aerminuaor>uuuuccagccaaauuggccggaaaaauaa</aerminuaor>aaua

And we know from biology that transcription runs from the transcription start site to the terminator. So I'm just going to cut that part out using copy and paste to get:

aacg<ass>g</ass>gccaaguuuaaugccaucacuauggggu<rbs>cucc</rbs>uaaucuac<sauragodon>uac</saura_godon>cgguuucuucuguuaua<saop_godon>acu</saop_godon>uuacguuccauggcaagaacuuugcaacggauuaugguacaaggcgcaucucaaucuuuugccagugcaccaaugacguguguagaggccauuuuacgcguuuuugauguaggcguaggacugcccgcuguuucacugacaacuugacuggggcaugcuggacucguuuccggcguaacagaaggcaucagcgacuaacaaaauggcggacuacccgcuucucuuucuugcuca<aerminuaor>uuuuccagccaaauuggccggaaaaauaa</aerminuaor>

And because the replacement happened throughout the whole sequence, including inside the tag names, I am going to go back and fix the tags. I know that for much larger sequences this might be a hassle, but in this case it is manageable.

 aacg<tss>g</tss>gccaaguuuaaugccaucacuauggggu<rbs>cucc</rbs>uaaucuac<start_codon>uac</start_codon>cgguuucuucuguuaua<stop_codon>acu</stop_codon>uuacguuccauggcaagaacuuugcaacggauuaugguacaaggcgcaucucaaucuuuugccagugcaccaaugacguguguagaggccauuuuacgcguuuuugauguaggcguaggacugcccgcuguuucacugacaacuugacuggggcaugcuggacucguuuccggcguaacagaaggcaucagcgacuaacaaaauggcggacuacccgcuucucuuucuugcuca<terminator>uuuuccagccaaauuggccggaaaaauaa</terminator>
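For longer sequences, the hand repair could be avoided by keeping the `y` command away from the tags. The sketch below is one possible way, assuming GNU sed (for \n in the replacement): put each tag on its own line, transliterate only the lines that do not start with `<`, and join everything again. It keeps the same atcg to uagc mapping used above; the input fragment and variable names are made up.

```shell
# Toy fragment (hypothetical) with one pair of tags.
tagged="aa<tss>c</tss>gt"

# Direct y/// also rewrites the letters inside the tag names.
mangled=$(printf '%s\n' "$tagged" | sed 'y/atcg/uagc/')

# Protected: isolate tags on their own lines, skip those lines,
# then remove the newlines again.
protected=$(printf '%s\n' "$tagged" \
  | sed 's/<[^>]*>/\n&\n/g' \
  | sed '/^</!y/atcg/uagc/' \
  | tr -d '\n')

printf '%s\n' "$mangled" "$protected"
```

The first result shows the same kind of damage seen above (`<tss>` becoming `<ass>`), while the second leaves the tag names intact.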

Q3

Now this command, sed "s/.../ & /g", I am familiar with because it tripped me up when I first tried to build my super long command line. I am going to use it here to start getting the amino acid sequence.

It also applied the command to the tags which was a bit messy but it would be cleaned up soon enough by adding this command:

sed -f genetic-code.sed
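The contents of genetic-code.sed are not shown on this page, so the sketch below uses a hypothetical four-rule stand-in (aug, gcc, aaa, and uga, with `-` marking the stop) just to illustrate the idea of splitting into codons and then substituting one letter per codon. The mRNA string and variable names are made up.

```shell
# Toy mRNA (hypothetical): aug gcc aaa uga
mrna="auggccaaauga"

# Step 1: put a space after every three letters.
codons=$(printf '%s\n' "$mrna" | sed 's/.../& /g')

# Step 2: replace each codon with a one-letter amino acid code.
# These four rules stand in for the full table in genetic-code.sed.
protein=$(printf '%s\n' "$codons" \
  | sed -e 's/aug/M/g' -e 's/gcc/A/g' -e 's/aaa/K/g' -e 's/uga/-/g' \
  | sed 's/ //g')

printf '%s\n' "$protein"
```

Because the codons are separated by spaces before the rules run, a rule can only match a whole in-frame codon and never a triplet that straddles two codons.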

Links

Vpachec3 User Page

Vpachec3 Week 4
