Jkuroda Week 4

From LMU BioDB 2015

1. I started this problem in class with Emily and got as far as the -10 box of the promoter. We walked through it together and figured out which sequences were the correct ones. I then found the remaining sequences using the same methods. I didn't run into any issues until I tried to find the start codon. For this case, as well as some of the later ones, I had to skip a line: the sed command uses line address 9 instead of 8, because the start codon has to be searched for on the line after the one where the ribosome binding site was found. Finding the stop codon was interesting because I first tried the regex pattern t[ag][ag], but quickly realized that it also matches tgg, which is not a stop codon. So I looked at more of the text processing features and found that the -r option (extended regular expressions) allows alternation with |, so the three stop codons can be listed explicitly as tag|tga|taa. I also needed an earlier strategy, separating the nucleotides into groups of three, so that the match would be the correct in-frame stop codon. Getting the terminator sequence was also a little different because I first had to find the first half of the sequence and then the final part of the sequence with an extra four nucleotides. After all of this, I got this text:

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcg
cgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgattt
agcgcgcaaatc<minus35box>tttact</minus35box>tatttacagaacttcgg<minus10box>cattat</minus10box>cttgcc<tss>g</tss>g
ttcaaattacggtagtgatacccca<rbs>gagg</rbs>attag<start_codon>atg</start_codon>gccaaagaagacaatattgaaatgcaaggtaccgt
tcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgg
gcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgc <stop_codon>tga</stop_codon>
ttgttttaccgcctgatgggcgaagagaaagaacgagt<terminator>aaaaggtcggtttaaccggcctttttatt</terminator>ttat

Using this sequence of commands:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/\n<minus35box>&<\/minus35box>\n/" | sed -r "3s/^.{17}/&\n/" |  
sed "4s/[ct]at[at]at/<minus10box>&<\/minus10box>\n/" | sed -r "5s/^.{6}/&\n/" | sed "6s/[atcg]/<tss>&<\/tss>\n/" | 
sed "7s/gagg/\n<rbs>&<\/rbs>\n/" | sed "9s/atg/\n<start_codon>&<\/start_codon>\n/" | 
sed -r "11s/.../& /g;11s/tag|tga|taa/<stop_codon>&<\/stop_codon>/1;11s/ //g;11s/<stop_codon>/ &/g;11s/<\/stop_codon>/& /g" | 
sed "11s/aaaaggt/\n<terminator>&/g" | sed "12s/gcctttt..../&<\/terminator>/g" | sed ':a;N;$!ba;s/\n//g'
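The stop-codon step can be checked in isolation. The following is only a minimal sketch, assuming GNU sed (for -r) and using short made-up test strings rather than the infA sequence. It shows why t[ag][ag] is too loose, and why grouping the nucleotides into threes before matching matters.

```shell
# Four candidate codons; only the first three are real stop codons.
codons="taa tag tga tgg"

# Character classes also accept tgg (tryptophan).
loose=$(echo "$codons" | sed "s/t[ag][ag]/<stop>/g")
# Alternation (needs -r) lists the three stop codons explicitly.
strict=$(echo "$codons" | sed -r "s/tag|tga|taa/<stop>/g")

echo "$loose"    # <stop> <stop> <stop> <stop>
echo "$strict"   # <stop> <stop> <stop> tgg

# Made-up reading frame: gat gac taa. An out-of-frame "tga" spans gat|gac.
seq="gatgactaa"

# Without grouping, the first match is the out-of-frame tga.
unframed=$(echo "$seq" | sed -r "s/tag|tga|taa/<stop_codon>&<\/stop_codon>/")
# Grouping into threes first forces the match to be a whole codon.
framed=$(echo "$seq" | sed -r "s/.../& /g;s/tag|tga|taa/<stop_codon>&<\/stop_codon>/;s/ //g")

echo "$unframed" # ga<stop_codon>tga</stop_codon>ctaa
echo "$framed"   # gatgac<stop_codon>taa</stop_codon>
```

The grouping works because every run of three non-space characters is exactly one codon, so a three-letter pattern cannot match across a codon boundary.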

2. To get the exact mRNA sequence from this gene, I had to get the nucleotides between the transcription start site and the end of the terminator. I thought of some inefficient methods for doing this, such as going through and meticulously taking out the tags as well as the unused sequence. However, I ended up using some commands that Nicole Anguiano used, because her method was much easier and more sensible. With her method, I split the whole text into 29 lines by breaking it at every < and >, so that each tag name and each stretch of sequence sits on its own line. I could then selectively delete lines to end up with only the mRNA strand (after switching around the nucleotides accordingly). Below is the string of commands used to arrive at the result. (Note that these need to be added to the end of the piped command above.)

sed "s/ //g" |  sed -r "s/<|>/\n/g" | sed "1,10D;12D;14D;16D;18D;20D;22D;24D;26D;28,29D" | 
sed ':a;N;$!ba;s/\n//g' | sed "y/atcg/uagc/"
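The split-and-delete technique is easier to see on a small input. This is a sketch only, assuming GNU sed and a made-up tagged string with two tags instead of seven. It uses lowercase d, which, as far as I know, behaves the same as the D in the command above when each line is processed on its own.

```shell
# Made-up tagged text: 2 tag pairs give 8 delimiters, so 9 lines after splitting.
tagged="aa<tss>g</tss>cc<rbs>gagg</rbs>tt"

# Line 1: aa   2: tss   3: g   4: /tss   5: cc   6: rbs   7: gagg   8: /rbs   9: tt
# Keep lines 3, 5 and 7 (sequence from the start site through the rbs),
# then join the surviving lines back into one.
kept=$(echo "$tagged" |
  sed -r "s/<|>/\n/g" |
  sed "1,2d;4d;6d;8,9d" |
  sed ':a;N;$!ba;s/\n//g')

echo "$kept"   # gccgagg
```

The same counting explains the 29 lines in the real data: seven tag pairs contain 28 angle brackets, and splitting at each one yields 29 lines.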

Finally, this is the mRNA sequence that I got. It is different from the one Nicole Anguiano got, because I replaced every nucleotide with its complement (the y/atcg/uagc/ step above) instead of just switching the t's for u's.

ccaaguuuaaugccaucacuauggggucuccuaaucuaccgguuucuucuguuauaacuuuacguuccauggcaagaacuuugcaacggauuaugguacaaggcgcaucuc
aaucuuuugccagugcaccaaugacguguguagaggccauuuuacgcguuuuugauguaggcguaggacugcccgcuguuucacugacaacuugacuggggcaugcuggac
ucguuuccggcguaacagaaggcaucagcgacuaacaaaauggcggacuacccgcuucucuuucuugcucauuuuccagccaaauuggccggaaaaauaa
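The difference between the two approaches can be seen on the first ten nucleotides of the transcript (ggttcaaatt, taken from the tagged text in part 1). This sketch only contrasts the two sed commands; it does not settle which result the assignment expects.

```shell
# First ten nucleotides starting at the transcription start site.
dna="ggttcaaatt"

# y/// maps character by character: a->u, t->a, c->g, g->c (complement, as RNA).
complemented=$(echo "$dna" | sed "y/atcg/uagc/")
# s/t/u/g only swaps t for u and leaves the other letters alone.
transcribed=$(echo "$dna" | sed "s/t/u/g")

echo "$complemented"   # ccaaguuuaa
echo "$transcribed"    # gguucaaauu
```

The complemented output matches the start of the sequence above, which confirms that the y command is what makes it differ from a plain t-to-u replacement.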

3. To get the amino acid sequence, I kept using the line deletion technique and deleted every line except the start codon and the coding sequence that follows it, which removed most of the 29 lines. After that I used the genetic-code.sed file to make the whole process much faster. Below is what I added to the command, after the same steps that split the text into 29 lines, to get the answer.

sed "1,18D;20D;22,29D" | sed ':a;N;$!ba;s/\n//g' | sed "s/.../& /g;s/t/u/g" | sed -f genetic-code.sed

Here is the amino acid sequence:

M A K E D N I E M Q G T V L E T L P N T M F R V E L E N G H V V T A H I S G K M R K N Y I R I L T G D K V T V 
E L T P Y D L S K G R I V F R S R
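The translation step can be sketched on the first three codons of the gene (atg gcc aaa). The contents of genetic-code.sed are not shown on this page, so the three substitution rules below are a hypothetical stand-in covering only these codons, not the real file.

```shell
# First nine nucleotides of the coding sequence, starting at the start codon.
dna="atggccaaa"

# Group into codons (each followed by a space) and convert t to u.
codons=$(echo "$dna" | sed "s/.../& /g;s/t/u/g")

# Hypothetical three-rule stand-in for genetic-code.sed.
protein=$(echo "$codons" | sed -e "s/aug/M/g" -e "s/gcc/A/g" -e "s/aaa/K/g")

echo "$codons"    # aug gcc aaa (with a trailing space)
echo "$protein"   # M A K (with a trailing space)
```

The result agrees with the first three residues of the sequence above. The trailing space comes from appending a space after every codon, including the last one.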


Josh Kuroda's page
