Blitvak Week 4

Individual Journal Assignment Week 4

Finding and tagging the minus35box and the minus10box of the promoter

I first found infA-E.coli-K12.txt by entering the correct directory, cd ~dondi/xmlpipedb/data.
I copied the sequence kept in that file to an external location for future reference/checking.
I assumed that the sequence is the mRNA-like strand and that it runs from 5' to 3'.
By reading the Week 4 Assignment Page, I found that the -10 box is generally [ct]at[at]at, and that the -35 box is generally tt[gt]ac[at].
I skimmed over the More Text Processing Features page, and I found that sed "s/Title/<h1>&<\/h1>/g" results in an output of <h1>Title</h1>; this command would be useful in tagging the sequence with its various parts.
I then tried cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>/g"| sed "s/[ct]at[at]at/<minus10box>&<\/minus10box>/g".
- Gave me the output:

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgt<minus10box>tataat</minus10box>tgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggag
taatgtgccgaacctgtttgttgcgatttagcgcgcaaatc<minus35box>tttact</minus35box>ta<minus35box>tttaca</minus35box>gaacttcgg<minus10box>cattat</minus10box>cttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgc
aaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaa
gagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
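The tagging step above can be tried on a small piece of the sequence. This is a minimal sketch, assuming GNU sed; the variable names excerpt and tagged are made up for illustration, and the excerpt is a short stretch of the promoter region copied from the output above with the tags removed.

```shell
# Short stretch of the promoter region (tags removed), copied from the output above
excerpt="gcgcaaatctttacttatttacagaacttcggcattatcttgcc"

# Wrap every -35 box candidate and every -10 box candidate in its tag;
# & stands for the text that the pattern matched
tagged=$(printf '%s\n' "$excerpt" \
  | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>/g" \
  | sed "s/[ct]at[at]at/<minus10box>&<\/minus10box>/g")

echo "$tagged"
```

Because & is replaced by whatever the bracketed pattern matched, the same command tags tttact and tttaca without either one being spelled out.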

I realized that there are two possibilities for the minus35box and one for the minus10box (since the minus10box must come after the minus35box, the first instance of a "minus10box" is to be ignored).
I looked at the Week 4 Assignment Page, and I found that there is an ideal number of 17 base pairs between the -35 and -10 box. Only <minus35box>tttact</minus35box> fits this criterion (it is 17 bp away from <minus10box>cattat</minus10box>).
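The spacing between the two boxes can also be counted rather than read off by eye. The following is only a sketch, assuming GNU sed; spacer_len is a hypothetical helper, and each input string holds one -35 candidate, the bases that follow it, and the -10 box, as they appear in the output above.

```shell
# Hypothetical helper: print how many bases lie between </minus35box> and <minus10box>
spacer_len() {
  spacer=$(printf '%s\n' "$1" | sed -r 's/.*<\/minus35box>(.*)<minus10box>.*/\1/')
  echo "${#spacer}"
}

# First candidate (tttact) and second candidate (tttaca), each paired with cattat
first=$(spacer_len "<minus35box>tttact</minus35box>tatttacagaacttcgg<minus10box>cattat</minus10box>")
second=$(spacer_len "<minus35box>tttaca</minus35box>gaacttcgg<minus10box>cattat</minus10box>")

echo "tttact: $first bp, tttaca: $second bp"
```

Here the first candidate gives 17 and the second gives 9, which agrees with choosing tttact.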
In the More Text Processing Features page, I found that sed "s/paragraph/&\n/g" results in a line-break right after the pattern gets matched/replaced (this would be useful in making the text more manageable for sed, as it is line centric). From referencing that same page, I found that replacing the g in the s///g format with a number, n, will result in sed replacing only the nth match on that line. Some of these sequences, that correspond to specific parts of the DNA strand, result in multiple matches; it would be fairly useful to limit sed to one replacement by making it only work on the nth match.
I realized that adding a number before the s in the s///g format will limit sed to that line (ex. executing 2s///g, results in a replacement being made on the 2nd line). It should be useful to make several line-breaks in order to make any matching easier; later, these line-breaks will have to be removed.
In an in-class work session, I learned that sed -r "<line#>s/^.{n}/<replacement>/g" limits sed to the first n characters of a line.
I executed cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1"| sed -r "2s/^.{17}/&<minus10box>\n/g" | sed -r "3s/^.{6}/&<\/minus10box>\n/g" (took me some time to realize that the -r was necessary for this command to work!). In making up this command, I decided to add a break after finding the minus35box and I limited sed to the first match (since I realized that the first match for the minus35box is, in fact, the correct one). Since the minus10box is 17 bp away from the minus35box, I decided to add the <minus10box> part of the tag after those 17 characters of the second line. I then decided to create another line-break after placing the first part of the minus10box tag; on the third line, I exploited the fact that the minus10box is 6 bp long in order to create a sed command that would add the last part of the tag, </minus10box>, after the actual minus10box.
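The line-break approach can be followed step by step on a short excerpt. This is a sketch, assuming GNU sed (for \n in the replacement and for -r); the names excerpt and promoter are made up, and the excerpt is copied from the promoter region above. Each sed runs as its own process in the pipe because a line created by \n only counts as a separate line for the next sed that reads it.

```shell
# Promoter excerpt (tags removed), copied from the sequence above
excerpt="gcgcaaatctttacttatttacagaacttcggcattatcttgccggttc"

promoter=$(printf '%s\n' "$excerpt" \
  | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" \
  | sed -r "2s/^.{17}/&<minus10box>\n/g" \
  | sed -r "3s/^.{6}/&<\/minus10box>\n/g" \
  | sed ':a;N;$!ba;s/\n//g')
# 1st sed: tag only the first -35 match and break the line after it
# 2nd sed: on line 2, skip the 17 bp spacer, open the -10 tag, break again
# 3rd sed: on line 3, close the -10 tag after its 6 bp, break again
# 4th sed: join all lines back into one

echo "$promoter"
```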

Output so far

Finding and tagging the tss and the ribosome binding site

From looking over the Week 4 Assignment Page, I learned that the tss is located 12 bp away from the start of the minus10box. I also learned that the ribosome binding site will have a sequence of gagg.
I decided to first find the tss, since it is located just after the minus10box. I would have to create a sed command that starts its search from the fourth line (since the base that corresponds to the tss will be somewhere on that line). I came up with sed -r "4s/^.{5}/&<tss>\n/g" | sed -r "5s/^.{1}/&<\/tss>\n/g", which are added to the set of the sed commands that I created in the previous section. Since the minus10box is 6 bp long, I decided to place the first part of the tss tag after 5 characters of the fourth line; the last part of the tag was placed by adding a line-break after the first part of the tag and then by adding </tss> after the first character of the next line through sed. The output of these commands results in the 12th nucleotide from the beginning of the minus10box being tagged as the transcription start site.
I used cat infA-E.coli-K12.txt | grep "gagg" to check if there are multiple matches for gagg. Since I only found one match, I used sed "s/gagg/<rbs>&<\/rbs>\n/g" to tag the ribosome binding site region and I added a line-break right after in order to make additional future sed commands easier.
The chain of commands so far is: cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1"| sed -r "2s/^.{17}/&<minus10box>\n/g" | sed -r "3s/^.{6}/&<\/minus10box>\n/g" | sed -r "4s/^.{5}/&<tss>\n/g" | sed -r "5s/^.{1}/&<\/tss>\n/g" | sed "s/gagg/<rbs>&<\/rbs>\n/g"
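The two steps in this section can be checked on small excerpts. This is a sketch, assuming GNU sed and a grep that supports -o; the variable names are made up, and the excerpts are copied from the sequence above. Note that grep -o with wc -l is swapped in here to count matches, whereas the plain grep used above prints every line that contains a match.

```shell
# Bases that follow </minus10box> in the tagged sequence
after_box="cttgccggttc"

# The -10 box is 6 bp, so 5 more bases lead up to the 12th nucleotide
tss_tagged=$(printf '%s\n' "$after_box" \
  | sed -r "1s/^.{5}/&<tss>\n/" \
  | sed -r "2s/^.{1}/&<\/tss>\n/" \
  | sed ':a;N;$!ba;s/\n//g')
echo "$tss_tagged"

# Count how many times gagg occurs in a stretch around the ribosome binding site
rbs_region="taccccagaggattagatggcc"
gagg_count=$(printf '%s\n' "$rbs_region" | grep -o "gagg" | wc -l | tr -d ' ')
echo "$gagg_count"
```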

Output so far

Finding and tagging the start and stop codons

I recalled that the start codon will be atg and that the stop codon will be either taa, tag, or tga. I referenced an RNA codon table.
I executed sed "s/atg/<start_codon>&<\/start_codon>/" with the current chain of commands and I found that there is just one start codon after the ribosome binding site (which is where it should be). I found that there are also several start codons appearing before the ribosome binding site but these will have to be ignored. sed "7s/atg/<start_codon>&<\/start_codon>\n/g", added onto the current chain, properly tags the start codon.
I don't know exactly where the stop codon will appear, but I know that it must come after the start codon and that it will be either taa, tag, or tga. From the More Text Processing Features page, I learned that sed can be given several alternative patterns to match by separating them with a vertical bar (|); this multiple-choice form of sed should be useful in finding any possible stop codons. I also found that sed commands can be applied to ranges of lines by writing startline#,endline# prior to the s in the "s///g" format of sed. I wrote and tested sed -r "8,100000s/taa|tag|tga/<stop_codon>&<\/stop_codon>\n/1" with the current command chain and I found that no stop codons were found and tagged. I looked over this sed command for some time and I couldn't find any faults; the line 8 to 100,000 range seemed alright, as did the multiple-choice sed command portion. I then decided to try to split the bp into triplets to make the job easier for this sed command. I tested the current chain with | sed "s/.../ & /g" | sed -r "8,100000s/taa|tag|tga/<stop_codon>&<\/stop_codon>\n/1" | sed "s/ //g", which splits the entire sequence into triplets after finding the start codon and removes the spaces after finding the stop codon. Because line 8 begins right after the start codon, the triplets on that line are in frame with it. The output of this chain revealed that there is, actually, a tga stop codon appearing after the start codon.
The chain of commands so far is: cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1"| sed -r "2s/^.{17}/&<minus10box>\n/g" | sed -r "3s/^.{6}/&<\/minus10box>\n/g" | sed -r "4s/^.{5}/&<tss>\n/g" | sed -r "5s/^.{1}/&<\/tss>\n/g" | sed "s/gagg/<rbs>&<\/rbs>\n/g" | sed -r "7s/atg/<start_codon>&<\/start_codon>\n/1" | sed "s/.../ & /g" | sed -r "8,100000s/taa|tag|tga/<stop_codon>&<\/stop_codon>\n/1" | sed "s/ //g"
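The effect of splitting into triplets can be seen on a toy example. This is only a sketch, assuming GNU sed; the 21-base string below is made up (it is not taken from infA) and stands for the bases right after a start codon, and the variable names are hypothetical.

```shell
# Made-up bases following a start codon: gcc att gaa atg cgc tga ttg
coding="gccattgaaatgcgctgattg"

# Without triplets, the first tga found straddles two codons (at|tga|a)
unframed=$(printf '%s\n' "$coding" \
  | sed -r "s/taa|tag|tga/<stop_codon>&<\/stop_codon>/1")

# With triplets separated by spaces, only a whole in-frame codon can match
framed=$(printf '%s\n' "$coding" \
  | sed "s/.../ & /g" \
  | sed -r "s/taa|tag|tga/<stop_codon>&<\/stop_codon>/1" \
  | sed "s/ //g")

echo "$unframed"
echo "$framed"
```

The spaces act as barriers, so a pattern of three letters can never match across a codon boundary; removing the spaces afterwards restores the original sequence with only the tag added.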

Output so far

Finding and tagging the Terminator region and cleaning up the sequence

From looking at the Week 4 Assignment Page, I learned that the first half of the terminator "hairpin" is aaaaggt and that the terminator continues for another 4 bp after the "hairpin" part concludes.
I tested sed "s/aaaaggt/<terminator>&\n/g" with the current chain of commands and I found that there is only one match (appears near the end, after the stop codon).
I realized, by reading the Week 4 Assignment Page, that, in mRNA, the u's in the hairpin connect to the g's in the hairpin. From observing some examples of hairpins, I found that the two halves of the hairpin are supported by several nucleotides between them (I also learned that the second half wraps around and connects to the first half in a reverse position).
I began to look at the nucleotides that come after the <terminator> part of the tag via sed "s/aaaaggt/<terminator>&\n/g" tied to the current chain, which are cggtttaaccggcctttttattttat.
- In mRNA, these nucleotides, along with the first half of the terminator, are aaaaggucgguuuaaccggccuuuuuauuuuau. aaaaggu, the first half of the hairpin loop, will need to bind to uuuuccg which is the reverse of the second half of the hairpin loop: gccuuuu. gccuuuu appears after aaaaggu with cgguuuaaccg in between. After aaaaggucgguuuaaccggccuuuu, the terminator includes 4 more bp, which makes aaaaggucgguuuaaccggccuuuuuauu the full terminator region, or aaaaggtcggtttaaccggcctttttatt in DNA.
aaaaggtcggtttaaccggcctttttatt is 29 characters long, and given that aaaaggt is 7 characters long, a sed command that works on the 22 characters after aaaaggt will be useful.
From the More Text Processing Features page, I learned that sed ':a;N;$!ba;s/\n//g' removes all of the line-breaks in text and makes it into a single "line".
I later noticed that there weren't any spaces between the tagged regions and the sequence itself, so I added sed -r "s/<minus35box>|<minus10box>|<tss>|<rbs>|<start_codon>|<stop_codon>|<terminator>/ &/g" | sed -r "s/<\/minus35box>|<\/minus10box>|<\/tss>|<\/rbs>|<\/start_codon>|<\/stop_codon>|<\/terminator>/& /g" to the current chain.
I added sed "s/aaaaggt/<terminator>&\n/g" | sed -r "10s/^.{22}/&<\/terminator>/g" | sed ':a;N;$!ba;s/\n//g' to the current chain; I found that every region was properly tagged and that all of the line-breaks were removed. sed -r "10s/^.{22}/&<\/terminator>/g" involves the first 22 characters of line 10 because the terminator is 29 bp (22 + 7 = 29).
The final chain of commands is: cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" | sed -r "2s/^.{17}/&<minus10box>\n/g" | sed -r "3s/^.{6}/&<\/minus10box>\n/g" | sed -r "4s/^.{5}/&<tss>\n/g" | sed -r "5s/^.{1}/&<\/tss>\n/g" | sed "s/gagg/<rbs>&<\/rbs>\n/g" | sed -r "7s/atg/<start_codon>&<\/start_codon>\n/1" | sed "s/.../ & /g" | sed -r "8,100000s/taa|tag|tga/<stop_codon>&<\/stop_codon>\n/1" | sed "s/ //g" | sed "s/aaaaggt/<terminator>&\n/g" | sed -r "10s/^.{22}/&<\/terminator>/g" | sed ':a;N;$!ba;s/\n//g' | sed -r "s/<minus35box>|<minus10box>|<tss>|<rbs>|<start_codon>|<stop_codon>|<terminator>/ &/g" | sed -r "s/<\/minus35box>|<\/minus10box>|<\/tss>|<\/rbs>|<\/start_codon>|<\/stop_codon>|<\/terminator>/& /g"
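The terminator step and the 29 = 7 + 22 arithmetic can be checked on the tail of the sequence. This is a sketch, assuming GNU sed; the variable names are made up, and tail_seq is the last stretch of the file as shown in the first output above. In this short excerpt the remainder lands on line 2, whereas in the full chain it is line 10.

```shell
# End of the sequence (tags removed), copied from the output above
tail_seq="acgagtaaaaggtcggtttaaccggcctttttattttat"

terminator_tagged=$(printf '%s\n' "$tail_seq" \
  | sed "s/aaaaggt/<terminator>&\n/g" \
  | sed -r "2s/^.{22}/&<\/terminator>/g" \
  | sed ':a;N;$!ba;s/\n//g')
# 1st sed: open the tag before aaaaggt and break the line after it
# 2nd sed: close the tag after the next 22 characters
# 3rd sed: join the lines back together
echo "$terminator_tagged"

# Length check on the full terminator region
terminator="aaaaggtcggtttaaccggcctttttatt"
echo "${#terminator}"
```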

Final Output

Fully Tagged Sequence

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttg
cgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat</minus10box> cttgc <tss>c</tss> ggttcaaattacggtagtgatacccca <rbs>gagg</rbs> attag <start_codon>atg</start_codon>  
gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgc  
<stop_codon>tga</stop_codon> ttgttttaccgcctgatgggcgaagagaaagaacgagt <terminator>aaaaggtcggtttaaccggcctttttatt</terminator> ttat

Finding the Exact mRNA Sequence

The exact mRNA sequence will involve all of the nucleotides from the transcription start site to the end of the terminator.
I decided to use the final chain of commands from the tagging section, except the last | sed -r "s/<minus35box>|<minus10box>|<tss>|<rbs>|<start_codon>|<stop_codon>|<terminator>/ &/g" | sed -r "s/<\/minus35box>|<\/minus10box>|<\/tss>|<\/rbs>|<\/start_codon>|<\/stop_codon>|<\/terminator>/& /g" portion, as the base for the chain of commands that will find the exact mRNA sequence.
I modified the sed -r "10s/^.{22}/&<\/terminator>/g" portion of the current chain with a line-break; using the current chain without sed ':a;N;$!ba;s/\n//g' shows that you can remove lines 1 through 4 in order to have the sequence start from the transcription start site. By reading the More Text Processing Features page, I found that you can delete lines in sed via sed "<line#>D"; ranges of lines can be deleted via sed "<firstline#>,<finalline#>D".
I added sed "1,4D" to the chain of commands in order to get rid of the first four lines (which are before the transcription start site). The segment after the terminator region is now considered to be line 7; I additionally added sed "7D" in order to get rid of that segment.
The big problem at this point is the tagging in the sequence; I decided to get rid of all of the tags, the long way, by using a long multiple choice sed command. I found that sed -r "s/<start_codon>|<\/start_codon>|<rbs>|<\/rbs>|<terminator>|<\/terminator>|<stop_codon>|<\/stop_codon>|<\/tss>//g" removes all of the leftover tagging without altering any bits of the actual sequence. To the end of this chain, I also added sed "s/t/u/g" to turn the DNA into the RNA (by turning the t's into u's, working with the mRNA-like strand).
The final chain of commands to turn the sequence into the actual mRNA is: cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" | sed -r "2s/^.{17}/&<minus10box>\n/g" | sed -r "3s/^.{6}/&<\/minus10box>\n/g" | sed -r "4s/^.{5}/&<tss>\n/g" | sed -r "5s/^.{1}/&<\/tss>\n/g" | sed "s/gagg/<rbs>&<\/rbs>\n/g" | sed -r "7s/atg/<start_codon>&<\/start_codon>\n/1" | sed "s/.../ & /g" | sed -r "8,100000s/taa|tag|tga/<stop_codon>&<\/stop_codon>\n/1" | sed "s/ //g" | sed "s/aaaaggt/<terminator>&\n/g" | sed -r "10s/^.{22}/&<\/terminator>\n/g" | sed "1,4D" | sed "7D" | sed -r "s/<start_codon>|<\/start_codon>|<rbs>|<\/rbs>|<terminator>|<\/terminator>|<stop_codon>|<\/stop_codon>|<\/tss>//g" | sed ':a;N;$!ba;s/\n//g' | sed "s/t/u/g"
- The output actual mRNA strand is:

cgguucaaauuacgguagugauaccccagaggauuagauggccaaagaagacaauauugaaaugcaagguaccguucuugaaacguugccuaauaccauguuccgcguagaguuagaaaacggucacgugguuacugcacacaucuc
cgguaaaaugcgcaaaaacuacauccgcauccugacgggcgacaaagugacuguugaacugaccccguacgaccugagcaaaggccgcauugucuuccguagucgcugauuguuuuaccgccugaugggcgaagagaaagaacgagu
aaaaggucgguuuaaccggccuuuuuauu
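The clean-up steps in this section can be tried on a few tagged lines. This is a sketch, assuming GNU sed; the five input lines are a shortened, made-up stand-in for the multi-line output of the tagging chain (one upstream line, three lines to keep, one trailing line), and the variable name mrna is hypothetical.

```shell
mrna=$(printf '%s\n' \
    'cttgc<tss>' \
    'c</tss>' \
    'ggtt<rbs>gagg</rbs>' \
    'attag<start_codon>atg</start_codon>' \
    'ttat' \
  | sed "1D" \
  | sed "4D" \
  | sed -r "s/<start_codon>|<\/start_codon>|<rbs>|<\/rbs>|<\/tss>//g" \
  | sed ':a;N;$!ba;s/\n//g' \
  | sed "s/t/u/g")
# 1D: drop the line before the transcription start site
# 4D: drop the trailing line (line 4 once the first line is gone)
# then strip the leftover tags, join the lines, and turn t into u

echo "$mrna"
```

The tags are stripped before the t to u substitution; doing it the other way around would also change the t's inside the tag names, and the tag patterns would no longer match.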

Finding the Translated Amino Acid Sequence

The translated amino acid sequence will run from the start codon until the stop codon; in order to find it, the regions prior to the start codon and those after the stop codon will have to be removed.
The genetic-code.sed file will likely prove to be useful again.
The first "start" codon in the actual mRNA sequence is the actual "start" codon. In order to remove the sequences behind it, I decided to have sed match the first aug and add a line-break before it. From that point, my plan is to delete the first line via sed. The command for this, added to the command chain created in the previous section regarding mRNA, is sed "s/aug/\naug/1" | sed "1D". Executing it results in a sequence that begins with the actual start codon.
Getting rid of everything after the actual stop codon seems a bit trickier; I would have to convert all of the bp back into codon triplets and then match the actual uga stop codon and add a line-break after it (then delete that line, in order to remove all of the bp after the stop codon). I figured out that sed "s/.../ & /g" | sed "s/uga/uga\n/g" | sed "s/ //g" | sed "2D" should work in conjunction with the current chain (including the start codon segment).
Using that chain of commands, I found the sequence to now run from the start codon until the stop codon. I should convert it, once more, into triplets and employ sed -f "genetic-code.sed" in order to convert each codon into its amino acid. Adding sed "s/.../ & /g" | sed -f "genetic-code.sed" | sed "s/ //g", I found, converts the mRNA sequence into the compact amino acid sequence (with no spaces).
The final chain of commands is: cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" | sed -r "2s/^.{17}/&<minus10box>\n/g" | sed -r "3s/^.{6}/&<\/minus10box>\n/g" | sed -r "4s/^.{5}/&<tss>\n/g" | sed -r "5s/^.{1}/&<\/tss>\n/g" | sed "s/gagg/<rbs>&<\/rbs>\n/g" | sed -r "7s/atg/<start_codon>&<\/start_codon>\n/1" | sed "s/.../ & /g" | sed -r "8,100000s/taa|tag|tga/<stop_codon>&<\/stop_codon>\n/1" | sed "s/ //g" | sed "s/aaaaggt/<terminator>&\n/g" | sed -r "10s/^.{22}/&<\/terminator>\n/g" | sed "1,4D" | sed "7D" | sed -r "s/<start_codon>|<\/start_codon>|<rbs>|<\/rbs>|<terminator>|<\/terminator>|<stop_codon>|<\/stop_codon>|<\/tss>//g" | sed ':a;N;$!ba;s/\n//g' | sed "s/t/u/g" | sed "s/aug/\naug/1" | sed "1D" | sed "s/.../ & /g" | sed "s/uga/uga\n/g" | sed "s/ //g" | sed "2D" | sed "s/.../ & /g" | sed -f "genetic-code.sed" | sed "s/ //g"
- Output amino acid sequence:

MAKEDNIEMQGTVLETLPNTMFRVELENGHVVTAHISGKMRKNYIRILTGDKVTVELTPYDLSKGRIVFRSR-
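The trimming and translation steps can be followed on a short example. This is a sketch, assuming GNU sed; the mRNA below is a shortened, made-up string modeled on the sequence above (a few bases before the start codon, five codons, a uga stop codon, and six trailing bases). The contents of genetic-code.sed are not shown in this journal, so the six -e rules are a hypothetical stand-in that covers only the codons in this example, with - standing for the stop codon as in the output above.

```shell
# Made-up short mRNA: gaggauuag | aug gcc aaa gaa gac uga | uuguuu
toy_mrna="gaggauuagauggccaaagaagacugauuguuu"

protein=$(printf '%s\n' "$toy_mrna" \
  | sed "s/aug/\naug/1" | sed "1D" \
  | sed "s/.../ & /g" | sed "s/uga/uga\n/g" | sed "s/ //g" | sed "2D" \
  | sed "s/.../ & /g" \
  | sed -e 's/aug/M/g' -e 's/gcc/A/g' -e 's/aaa/K/g' \
        -e 's/gaa/E/g' -e 's/gac/D/g' -e 's/uga/-/g' \
  | sed "s/ //g")
# line 1 of the pipe: break before the first aug and drop what precedes it
# line 2: split into codons, break after the in-frame uga, drop what follows
# line 3 onward: split into codons again, translate each one, remove the spaces

echo "$protein"
```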

Brandon Litvak
BIOL 367, Fall 2015
