Kzebrows Week 4

Modifying the Gene Sequence

To start this assignment I began by opening Terminal on my laptop. I entered

ssh kzebrows@my.cs.lmu.edu

followed by my password to log into the LMU CMSI database. As I usually do, I entered the following commands in order to enter Dr. Dionisio's directory, list the files in the directory, and choose the appropriate file for this assignment:

cd ~dondi/xmlpipedb/data; ls; cat infA-E.coli-K12.txt

This took me to the E.coli file and showed me the nucleotide sequence. To complete this assignment I frequently used this page as a resource.

I began by using grep to find the potential -35 box and -10 box because grep highlights the searched pattern in red. I simply entered

 cat infA-E.coli-K12.txt | grep "tt[gt]ac[at]"

which gave me two possible answers for the -35 box, tttact and tttaca, both of which fit the pattern. Now it was a matter of finding out which one was the correct one. I also searched for the -10 box using

 cat infA-E.coli-K12.txt | grep "[ct]at[at]at"

which also revealed two potential sites, tataat and cattat. To work out which sequences were the correct ones I needed to see both sets of matches together in context, which grep alone doesn't show, so I used sed instead. I chained the sed commands in a pipe and added three spaces on either side of each occurrence of the consensus sequences (both -35 and -10) to make them more visible. This is done with sed "s/<pattern>/   &   /g", where <pattern> is what I wish to find, & stands for the matched text, and the spaces on each side of the & are what get added around it (instructions found here). The pipe looked like this:

 cat infA-E.coli-K12.txt | sed "s/[ct]at[at]at/   &   /g" | sed "s/tt[gt]ac[at]/   &   /g"

This made it clear that it was the first -35 box option, tttact, and the second -10 box option, cattat, that I was looking for in this gene. Using this information, it was then much simpler for me to highlight the specific sequences for the assignment.
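As a small sketch of the two searching approaches above, here they are run on a made-up sequence (the variable names and the sequence are assumptions for illustration, not the infA data). grep -o lists each match on its own line, and the sed version pads each match with spaces so it stands out in context.

```shell
# Toy sequence, made up for illustration (not the infA data).
seq="ccgtttactgggtataatccgtttacaggcattatgg"

# grep -o prints only the matching parts, one per line.
matches=$(printf '%s\n' "$seq" | grep -o "tt[gt]ac[at]")
printf '%s\n' "$matches"

# sed pads every match with three spaces on each side; & is the matched text.
padded=$(printf '%s\n' "$seq" | sed "s/tt[gt]ac[at]/   &   /g")
printf '%s\n' "$padded"
```

The same idea works for the -10 box pattern by swapping in "[ct]at[at]at".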

To highlight the -35 box, I needed to use sed to put <minus35box> on each side of the first option, along with three spaces. To do this, I consulted the Text Processing page of the wiki and found out that to do this I can replace g with the number of the occurrence I wish to change. Because I only needed the first option to be highlighted (tttact), the command looked like this:

 cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/   <minus35box>&<\/minus35box>   /1"

Next, to highlight the -10 box, I did the same thing except my goal was to add <minus10box> to each side of the second -10 box option. The command looked like this:

 cat infA-E.coli-K12.txt | sed "s/[ct]at[at]at/   <minus10box>&<\/minus10box>   /2"

Which highlighted the -10 box, cattat.
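A minimal sketch of the occurrence number at the end of the substitution, using a made-up sequence with two -10 box candidates (the sequence and variable names are assumptions for illustration). The number after the last slash picks which match on the line gets replaced.

```shell
# Toy sequence with two candidates: tataat first, cattat second.
seq="ggtataatccgcattatgg"

# /1 tags only the first match on the line.
first=$(printf '%s\n' "$seq" | sed "s/[ct]at[at]at/<minus10box>&<\/minus10box>/1")

# /2 skips the first match and tags only the second one.
second=$(printf '%s\n' "$seq" | sed "s/[ct]at[at]at/<minus10box>&<\/minus10box>/2")

printf '%s\n%s\n' "$first" "$second"
```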

In order to find the transcription start site, I learned from the assignment page that the site is located at the 12th nucleotide after the first nucleotide of the -10 box. Since the -10 box is six nucleotides long, this means that the start of transcription was the sixth nucleotide after cattat. To find this, I broke up the gene and inserted a new line right after the -35 box. In the "picking lines" section of More Text Processing Features, I found that to restrict a substitution to the second line I had to replace sed s///g with sed 2s///g. This command looked like this:

  cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/   <minus35box>&<\/minus35box>   /1" | sed "s/<\/minus35box>/&\n/g" | sed    
  "2s/[ct]at[at]at/   <minus10box>&<\/minus10box>   /1"

I noted that it should be /1, not /2, after the -10 box because since I'm only looking at things after the -35 box it would be the first occurrence of [ct]at[at]at.
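The line-number trick can be sketched on a made-up sequence that has a -10 box candidate both before and after the -35 box (the sequence is an assumption for illustration, and GNU sed is assumed for \n in the replacement). After the line break, the 2 in front of s limits the search to the second line, so /1 there means the first match after the -35 box.

```shell
# Toy sequence: cattat appears before AND after the -35 box candidate tttact.
seq="cattatggtttactcccattatgg"

tagged=$(printf '%s\n' "$seq" \
  | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>/1" \
  | sed "s/<\/minus35box>/&\n/" \
  | sed "2s/[ct]at[at]at/<minus10box>&<\/minus10box>/1")

# Line 1 ends at the -35 box; only the cattat on line 2 is tagged.
printf '%s\n' "$tagged"
```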

My next goal was to find a command that would allow me to skip over 5 more nucleotides to the transcription start site <tss>...</tss> on the 6th nucleotide after the -10 box. I did this by adding the command

 sed -r "s/<\/minus10box> (.){5}/&\n/g"

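Which indicated that I meant to skip over 5 characters after the tag (the number in the curly braces). The -r option turns on extended regular expressions, which is what lets me write the repetition count as {5} without backslashes.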
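A small sketch of how the curly-brace count behaves when padding spaces follow the tag, on a made-up fragment (GNU sed assumed). Because . matches any character, spaces use up part of the count. The second command is an alternative, not the one used in this notebook: it consumes the spaces explicitly with " *" so the braces count nucleotides only.

```shell
# Toy fragment: closing tag, three padding spaces, then nucleotides.
line="</minus10box>   cttgccggtt"

# One space in the pattern plus 7 dots: two dots are spent on the remaining
# spaces, so only 5 nucleotides are skipped before the line break.
split=$(printf '%s\n' "$line" | sed -r "s/<\/minus10box> (.){7}/&\n/")

# Alternative sketch: ' *' swallows any padding, [acgt]{5} counts nucleotides.
alt=$(printf '%s\n' "$line" | sed -r "s/<\/minus10box> *[acgt]{5}/&\n/")

printf '%s\n' "$split"
```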

With {5}, this had me starting at the 10th nucleotide, not the 12th. I realized that this was because I had added extra spaces around the <minus10box>...</minus10box> tags, and the spaces counted as characters matched by (.). To fix this, I put 7 in the curly braces instead of 5, which gave me a new line at the right nucleotide (the 12th one). Then, to highlight the transcription start site I added

 sed "3s/^./<tss>&<\/tss> /g"

to tell the computer that I wished to add <tss> labels around the first character in the third line. The command looked like this:

  cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/   <minus35box>&<\/minus35box>   /1" | sed "s/<\/minus35box>/&\n/g" | sed   
  "2s/[ct]at[at]at/   <minus10box>&<\/minus10box>   /1" | sed -r "s/<\/minus10box> (.){7}/&\n/g" | sed "3s/^./<tss>&<\/tss> /g"

Next, to find the ribosome binding site (which has to be after the transcription start site), I searched the same line (line 3) for gagg, as hinted by the assignment page. I did this by invoking the command

 sed "3s/gagg/ <rbs>&<\/rbs> /1"

just like I did for the -35 box much earlier. The sequence then looked like this:

  cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/   <minus35box>&<\/minus35box>   /1" | sed "s/<\/minus35box>/&\n/g" | sed
  "2s/[ct]at[at]at/   <minus10box>&<\/minus10box>   /1" | sed -r "s/<\/minus10box> (.){7}/&\n/g" | sed "3s/^./<tss>&<\/tss> /g" | sed
  "3s/gagg/ <rbs>&<\/rbs> /1"

For the next part I needed to find the start codon, f-Met. This is coded for by AUG, but since this is the mRNA-like strand, the sequence is ATG. To find this ATG, I added a new line after the ribosome binding site and used sed to search for the next occurrence of ATG after that. I did this by adding two commands to the pipe, as seen below. These commands followed the same pattern as the ones for the other sites.

  cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/   <minus35box>&<\/minus35box>   /1" | sed "s/<\/minus35box>/&\n/g" | sed 
  "2s/[ct]at[at]at/   <minus10box>&<\/minus10box>   /1" | sed -r "s/<\/minus10box> (.){7}/&\n/g" | sed "3s/^./<tss>&<\/tss> /g" | sed 
  "3s/gagg/ <rbs>&<\/rbs> /1" | sed "s/<\/rbs>/&\n/g" | sed "4s/atg/ <start_codon>&<\/start_codon> /1"

Next I was presented with the challenge of finding the stop codon, which is TAA, TAG, or TGA on this strand of the DNA. From our Week 3 assignment I remembered that I would need to space the nucleotides out into 3-nucleotide codons in order to find the stop codon, and from Introduction to the Command Line I recalled the command for this, which was sed "s/.../& /g". I first began a new line after the start codon using the same newline command as earlier (sed "s/<\/start_codon>/&\n/g") and then invoked the codon spacing. Once everything was separated into codons it became very easy to find the stop codon, and all I had to do was tag it. The only difference from the earlier tags was that the search pattern was t[ag][ga], with each pair of brackets representing an either/or choice. I used /1 on the new line (line 5) in order to tag the first occurrence of t[ag][ga]. The pipe looked like this:

  cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/   <minus35box>&<\/minus35box>   /1" | sed "s/<\/minus35box>/&\n/g" | sed 
  "2s/[ct]at[at]at/   <minus10box>&<\/minus10box>   /1" | sed -r "s/<\/minus10box> (.){7}/&\n/g" | sed "3s/^./<tss>&<\/tss> /g" | sed 
  "3s/gagg/ <rbs>&<\/rbs> /1" | sed "s/<\/rbs>/&\n/g" | sed "4s/atg/ <start_codon>&<\/start_codon> /1" | sed 
  "s/<\/start_codon>/&\n/g" | sed "s/.../& /g" | sed "5s/t[ag][ga]/ <stop_codon>&<\/stop_codon> /1"

The final part of this portion of the assignment, locating the terminator, was the hardest. In class, Dr. Dionisio discussed with us how the first half of the sequence is AAAAGGT. Because it is a hairpin, however, I needed to find the reverse of this sequence, which is TGGAAAA, and find the complement, making the sequence ACCTTTT. Then Dr. Dionisio also said that the T binds with a G instead, so the second part of the sequence is actually GCCTTTT. We were also given the hint that there were 4 nucleotides after the terminator sequence.
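The reverse-and-complement reasoning above can be sketched with rev and tr (a minimal sketch; the variable names are assumptions for illustration). The last step swaps the leading a for g, following the G pairing described in class.

```shell
first_half="aaaaggt"

# rev reverses the string; tr swaps each base for its complement.
revcomp=$(printf '%s\n' "$first_half" | rev | tr acgt tgca)

# The leading a is replaced by g, since the t pairs with a g here instead.
second_half=$(printf '%s\n' "$revcomp" | sed "s/^a/g/")

printf '%s\n%s\n' "$revcomp" "$second_half"
```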

To start, I tried to add a new line directly after the first half of the sequence, which I already knew, using the newline command. At first it wouldn't work, but then I realized it was because I hadn't removed all of the spaces left over from finding the stop codon. I invoked sed "s/ //g" to get rid of the spaces and proceeded to add a new line after that, then I tagged the AAAAGGT sequence. This command set looked like this:

  cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/   <minus35box>&<\/minus35box>   /1" | sed "s/<\/minus35box>/&\n/g" | sed  
  "2s/[ct]at[at]at/   <minus10box>&<\/minus10box>   /1" | sed -r "s/<\/minus10box> (.){7}/&\n/g" | sed "3s/^./<tss>&<\/tss> /g" | sed 
  "3s/gagg/ <rbs>&<\/rbs> /1" | sed "s/<\/rbs>/&\n/g" | sed "4s/atg/ <start_codon>&<\/start_codon> /1" | sed 
  "s/<\/start_codon>/&\n/g" | sed "s/.../& /g" | sed "5s/t[ag][ga]/ <stop_codon>&<\/stop_codon> /1" | sed "s/ //g" | sed "5s/aaaaggt/ 
  <terminator>& /1"

For the last part, I needed to find where GCCTTTT was. To do this, I first added yet another line after the first half of the sequence. I then searched the new line for the last half of the sequence plus four dots (....) to stand for the four unknown characters after it. I also needed to remove all of the line breaks that I had made in highlighting all of these sites, which is done using the command sed ':a;N;$!ba;s/\n//g' from More Text Processing Features.

  cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/   <minus35box>&<\/minus35box>   /1" | sed "s/<\/minus35box>/&\n/g" | sed 
  "2s/[ct]at[at]at/   <minus10box>&<\/minus10box>   /1" | sed -r "s/<\/minus10box> (.){7}/&\n/g" | sed "3s/^./<tss>&<\/tss> /g" | sed 
  "3s/gagg/ <rbs>&<\/rbs> /1" | sed "s/<\/rbs>/&\n/g" | sed "4s/atg/ <start_codon>&<\/start_codon> /1" | sed   
  "s/<\/start_codon>/&\n/g" | sed "s/.../& /g" | sed "5s/t[ag][ga]/ <stop_codon>&<\/stop_codon> /1" | sed "s/ //g" | sed "5s/aaaaggt/ 
  <terminator>& /1" | sed "s/aaaaggt/&\n/g" | sed "6s/gcctttt..../&<\/terminator> /1"

The last four characters ended up being TATT, with TTAT left over after the terminator sequence ended. The final set of commands is this:

  cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/   <minus35box>&<\/minus35box>   /1" | sed "s/<\/minus35box>/&\n/g" | sed 
  "2s/[ct]at[at]at/   <minus10box>&<\/minus10box>   /1" | sed -r "s/<\/minus10box> (.){7}/&\n/g" | sed "3s/^./<tss>&<\/tss> /g" | sed  
  "3s/gagg/ <rbs>&<\/rbs> /1" | sed "s/<\/rbs>/&\n/g" | sed "4s/atg/ <start_codon>&<\/start_codon> /1" | sed 
  "s/<\/start_codon>/&\n/g" | sed "s/.../& /g" | sed "5s/t[ag][ga]/ <stop_codon>&<\/stop_codon> /1" | sed "s/ //g" | sed "5s/aaaaggt/ 
  <terminator>& /1" | sed "s/aaaaggt/&\n/g" | sed "6s/gcctttt..../&<\/terminator> /1" | sed ':a;N;$!ba;s/\n//g'

Which gives a final sequence that looks like this:

 ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccga
 taaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagt
 aatgtgccgaacctgtttgttgcgatttagcgcgcaaatc<minus35box>tttact</minus35box>tatttacagaacttcgg
 <minus10box>cattat</minus10box>cttgc<tss>c</tss>ggttcaaattacggtagtgatacccca<rbs>gagg<
 /rbs>attag<start_codon>atg</start_codon>gccaaagaagacaatat<stop_codon>tga</stop_codon>a
 atgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaa
 tgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgt
 agtcgctgattgttttac cgcctgatgggcgaagagaaagaacgagt <terminator>aaaaggt cggtttaaccggcctttttatt</terminator> ttat
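A small sketch of the line-joining step on made-up input (GNU sed assumed). The sed loop reads every line into one buffer and then deletes the line breaks. The tr version is an alternative that is not used in this notebook but gives the same text here.

```shell
# Three toy lines standing in for the broken-up gene.
joined=$(printf 'aaa<tss>c</tss>\nggg\nttt\n' | sed ':a;N;$!ba;s/\n//g')

# Alternative sketch: tr -d deletes every newline character.
joined_tr=$(printf 'aaa<tss>c</tss>\nggg\nttt\n' | tr -d '\n')

printf '%s\n' "$joined"
```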

Exact mRNA Sequence

In order to determine the exact sequence of mRNA from this gene, I wasn't quite sure where to start, so I looked at More Text Processing Features. First I removed the spaces by using sed "s/ //g" so that I could eventually isolate the nucleotides without the tags. Next, I used sed -r to get rid of anything that was in between the angle brackets (<...>). I tried adding this to the previous set of commands, using square brackets to indicate that I wanted to replace either < or > with a new line.

 sed "s/ //g" | sed -r "s/[<>]/n/g"

but all it did was replace each < and > with the letter n, and everything was still a big block of text. Then I realized that I had not added the backslash.

 sed "s/ //g" | sed -r "s/[<>]/\n/g"

I did this and the text was separated like so:

 ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccga
 taaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagt
 aatgtgccgaacctgtttgttgcgatttagcgcgcaaatc
 minus35box
 tttact
 /minus35box
 tatttacagaacttcgg
 minus10box
 cattat
 /minus10box
 cttgc
 tss
 c
 /tss 
 ggttcaaattacggtagtgatacccca
 rbs 
 gagg
 /rbs 
 attag
 start_codon
 atg
 /start_codon 
 gccaaagaagacaatat
 stop_codon
 tga
 /stop_codon
 aatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaa
 atgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactg accccgtacgacctgagcaaaggccgcattgtcttccg
 tagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagt
 terminator
 aaaaggtcggtttaaccggcctttttatt
 /terminator
 ttat
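The splitting step can be sketched on a short made-up tagged string (GNU sed assumed for \n in the replacement). Every < or > becomes a line break, so tag names and nucleotides end up on alternating lines.

```shell
tagged="aa<tss>c</tss>gg"

# [<>] matches either bracket; each one is replaced with a newline.
lines=$(printf '%s\n' "$tagged" | sed -r "s/[<>]/\n/g")

printf '%s\n' "$lines"
```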

Then it was just a matter of deleting the lines that didn't contain nucleotides using the delete command, D. Originally I thought this would be only the tag lines (2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, and 24), but then I realized that since transcription starts at the TSS and ends at the terminator, I needed to delete the lines before and after those, too. For example, 1,10D deletes lines 1-10. This took a few tries, as I first forgot the semicolons between the line numbers and then forgot that each line number needs its own D. At the end I got rid of the line breaks I had created using sed ':a;N;$!ba;s/\n//g'. This was the command:

 sed "s/ //g" | sed -r "s/[<>]/\n/g" | 
 sed "1,10D;12D;14D;16D;18D;20D;22D;24D;26D;28D;29D" | sed ':a;N;$!ba;s/\n//g'
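A minimal sketch of deleting lines by number, using the same five toy lines a split tag would produce (made up for illustration, GNU sed assumed). It uses lowercase d, which behaves the same as D here because each line is processed on its own. The second command is an alternative that names the lines to keep instead of the lines to drop.

```shell
# Lines: 1 = aa, 2 = tss, 3 = c, 4 = /tss, 5 = gg
# Delete lines 1-2 and line 4, then join what is left.
kept=$(printf 'aa\ntss\nc\n/tss\ngg\n' | sed "1,2d;4d" | sed ':a;N;$!ba;s/\n//g')

# Alternative sketch: -n suppresses output and p prints only lines 3 and 5.
kept_p=$(printf 'aa\ntss\nc\n/tss\ngg\n' | sed -n "3p;5p" | sed ':a;N;$!ba;s/\n//g')

printf '%s\n' "$kept"
```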

Then, like with last week's assignment, I needed to switch all T's to U's. I then separated it into 3-nucleotide codons. This was the final command sequence:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/   <minus35box>&<\/minus35box>   /1" | sed "s/<\/minus35box>/&\n/g" | sed "2s/[ct]at[at]at/   <minus10box>&<\/minus10box>   /1" | sed -r "s/<\/minus10box> (.){7}/&\n/g" | sed "3s/^./<tss>&<\/tss> /g" | sed "3s/gagg/ <rbs>&<\/rbs> /1" | sed "s/<\/rbs>/&\n/g" | sed "4s/atg/ <start_codon>&<\/start_codon> /1" | sed "s/<\/start_codon>/&\n/g" | sed "s/.../& /g" | sed "5s/t[ag][ga]/ <stop_codon>&<\/stop_codon> /1" | sed "s/ //g" | sed "5s/aaaaggt/ <terminator>& /1" | sed "s/aaaaggt/&\n/g" | sed "6s/gcctttt..../&<\/terminator> /1" | sed ':a;N;$!ba;s/\n//g' | sed "s/ //g" | sed -r "s/[<>]/\n/g" | sed "1,10D;12D;14D;16D;18D;20D;22D;24D;26D;28D;29D" | sed ':a;N;$!ba;s/\n//g' | sed "s/t/u/g"

and this was what I got for the exact mRNA sequence:

 cgguucaaauuacgguagugauaccccagaggauuagauggccaaagaagacaauauugaaaugcaagguaccguucuugaa
 cguugccuaauaccauguuccgcguagaguuagaaaacggucacgugguuacugcacacaucuccgguaaaaugcgcaaaaa
 cuacauccgcauccugacgggcgacaaagugacuguugaacugaccccguacgaccugagcaaaggccgcauugucuuccgu
 agucgcugauuguuuuaccgccugaugggcgaagagaaagaacgaguaaaaggucgguuuaaccggccuuuuuauu
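Switching T's to U's can be sketched on the first fifteen nucleotides after the transcription start site. The tr line is an alternative that is not used in this notebook but does the same single-character swap.

```shell
dna="cggttcaaattacgg"

# The g flag replaces every t on the line, not just the first.
mrna=$(printf '%s\n' "$dna" | sed "s/t/u/g")

# Alternative sketch: tr translates each t to u.
mrna_tr=$(printf '%s\n' "$dna" | tr t u)

printf '%s\n' "$mrna"
```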

What is the amino acid sequence translated from this mRNA?

The only difference from the mRNA step is where the sequence starts and stops. Translation begins at the start codon, AUG, and ends at the stop codon, which I found to be TGA (UGA in the mRNA).
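The reading-frame idea can be sketched on a made-up sequence (not the infA data, GNU sed assumed). Two choices here are assumptions added for illustration rather than steps from this notebook: the codons are counted starting directly at atg with no leading space, because s/.../& /g would count a space as one of the three characters, and the stop codons are spelled out as alternatives, because a bracket pattern like t[ag][ga] also matches tgg.

```shell
# Toy sequence: two bases, then atg gcc aaa tgg taa ccc.
seq="ccatggccaaatggtaaccc"

# Break the line just before the first atg and drop everything ahead of it.
from_start=$(printf '%s\n' "$seq" | sed "s/atg/\n&/" | sed "1d")

# Group into codons; the frame is set by the atg at the start of the line.
codons=$(printf '%s\n' "$from_start" | sed "s/.../& /g")

# Cut at the first whole codon that is taa, tag, or tga.
orf=$(printf '%s\n' "$codons" | sed -r "s/ (taa|tag|tga) .*//")

printf '%s\n' "$orf"
```

In this toy input tgg is kept as an ordinary codon and the cut happens at taa.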

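To do this, I went back and deleted lines 1-18 (everything up to the start codon) as well as lines 20 and 22-29, which left only the start codon and the sequence between it and the stop codon. I then joined those lines, separated the sequence into codons, switched T to U, and ran the result through genetic-code.sed.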

This was my final set of commands:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/   <minus35box>&<\/minus35box>   /1" | sed "s/<\/minus35box>/&\n/g" | sed "2s/[ct]at[at]at/   <minus10box>&<\/minus10box>   /1" | sed -r "s/<\/minus10box> (.){7}/&\n/g" | sed "3s/^./<tss>&<\/tss> /g" | sed "3s/gagg/ <rbs>&<\/rbs> /1" | sed "s/<\/rbs>/&\n/g" | sed "4s/atg/ <start_codon>&<\/start_codon> /1" | sed "s/<\/start_codon>/&\n/g" | sed "s/.../& /g" | sed "5s/t[ag][ga]/ <stop_codon>&<\/stop_codon> /1" | sed "s/ //g" | sed "5s/aaaaggt/ <terminator>& /1" | sed "s/aaaaggt/&\n/g" | sed "6s/gcctttt..../&<\/terminator> /1" | sed ':a;N;$!ba;s/\n//g' | sed "s/ //g" | sed -r "s/[<>]/\n/g" | sed "1,18D;20D;22,29D" | sed ':a;N;$!ba;s/\n//g' | sed "s/.../& /g" | sed "s/t/u/g" | sed -f genetic-code.sed
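The contents of genetic-code.sed are not shown here, so the following is only a sketch of how a sed script file of that kind could work. The four-rule script is hypothetical and covers just the first four codons of the coding sequence; the temporary file and variable names are also assumptions for illustration.

```shell
# Hypothetical mini translation table, one substitution per codon.
code_file=$(mktemp)
cat > "$code_file" <<'EOF'
s/aug/M/g
s/gcc/A/g
s/aaa/K/g
s/gaa/E/g
EOF

# Space into codons, switch t to u, translate with -f, then drop the spaces.
protein=$(printf 'atggccaaagaa\n' \
  | sed "s/.../& /g" \
  | sed "s/t/u/g" \
  | sed -f "$code_file" \
  | sed "s/ //g")
rm -f "$code_file"

printf '%s\n' "$protein"
```

Keeping the codons separated by spaces until after translation stops a rule from matching across a codon boundary.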