Troque Week 3

From LMU BioDB 2015
Jump to: navigation, search

User Page        Bio Databases Main Page       


The Genetic Code, by computer

First, login to the LMU CS server using ssh. Type in the following in a command prompt (Windows) or terminal (Mac) window:

ssh <username@my.cs.lmu.edu>

Enter your password. Note: You will not visibly see the cursor move when typing in your password so just keep typing. Then change directories to dondi's using the following commands to find the practice files and other miscellaneous files:

cd ~dondi/xmlpipedb/data

Here, you can use the command ls in order to see the list of files in the directory. Then we can start manipulating some files.


Complement of DNA

To find the complementary strand when given a standard 5' to 3' DNA strand, match each of the four base pairs A,T,C, and G with T, A, G, and C, respectively. Done in the computer, we use the sed command for replacing the bases with the ones they correspond to. To find the complement of the DNA strand in the file prokaryote.txt, the following command is used:

Command: sed "y/actg/tgac/" prokaryote.txt
Yields:  agatgatataaagttatccatgctaccggtttcttctgttataacttgaactttgcaacggattatggtacaaggcgcatattgggtcggcggtcaaggcgaccgccgtaaaattg

where "prokaryote.txt" is the file that contains the original DNA strand. Similarly, we can use this command for a file with a longer DNA strand:

Command: sed "y/actg/tgac/" infA-E.coli-K12.txt
(The resulting strand is not shown since it is too long).


Reading Frames

For each of the reading frames for the mRNA strand, we first have to convert the DNA strand into mRNA. We use the sed command again in order to change all the T's into U's:

sed "s/t/u/g"

Then we divide the strand into its codons so we use the wildcard "." (in this case, since we want 3 characters, we use "...") since we do not care which letters we look to replace. Then, we use "& " because we want to keep those same letters, but we want to add a space between them. This would result in the same string of letters, but with the space character every 3 letters.

sed "s/.../& /g"

Next, we use the existing file genetic-code.sed, which already has all the codons and their corresponding amino acids. We use the following command:

sed -f genetic-code.sed

Then, we will get something that looks like this (using the prokaryote.txt file as an example):

S T I F Q - V R W P K K T I L N L K R C L I P C S A Y N P A A S S A G G I L ac

Since this strand still has the residual bases at the end that are not converted to codons, we want to remove these bases. We use the command:

sed "y/acug/    /"

so that each of the residual bases is replaced with a space character. We can then remove ALL space characters with the following command:

sed "s/ //g"

For the +1 reading frame, the above commands would suffice and when we combine them into a pipeline of commands, we get the following:

+1:     cat sequence_file | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "y/acug/    /" | sed "s/ //g"

However, for the +2 and +3 reading frames, we have to shift reading the codons by 1 and 2 letters, respectively. The same commands from above are still used, but we add another sed command so that we shift by a certain number of letters. For +2, we add the command:

sed "s/^.//g"

so that we shift by 1 letter from the very first letter (the symbol "^" means that we only want the beginning character(s)).

+2:     cat sequence_file | sed "s/t/u/g" | sed "s/^.//g" | sed "s/.../& /g" | sed -f genetic-code.sed |
        sed "y/acug/    /" | sed "s/ //g"

For +3, similar to +2, we use the command:

sed "s/^..//g" 

for shifting by 2 letters in the strand. So we get the following pipeline of commands:

+3:     cat sequence_file | sed "s/t/u/g" | sed "s/^..//g" | sed "s/.../& /g" | sed -f genetic-code.sed | 
        sed "y/acug/    /" | sed "s/ //g"

For the -1, -2, and -3 reading frames, 2 additional commands are needed: the commands rev, to reverse the strand, and sed "y/acug/ugac/", to find the complementary mRNA strand. By doing this, we do not have to deviate much from our previous commands shown above. Instead, we are only adding 2 additional steps. The resulting reading frames are as follows:

-1:     rev sequence_file | sed "s/t/u/g" | sed "y/acug/ugac/" | sed "s/.../& /g" | sed -f genetic-code.sed |
        sed "y/acug/    /" | sed "s/ //g"
-2:     rev sequence_file | sed "s/t/u/g" | sed "y/acug/ugac/" | sed "s/^.//g" | sed "s/.../& /g" | sed -f genetic-code.sed | 
        sed "y/acug/    /" | sed "s/ //g"
-3:     rev sequence_file | sed "s/t/u/g" | sed "y/acug/ugac/" | sed "s/^..//g" | sed "s/.../& /g" | sed -f genetic-code.sed | 
        sed "y/acug/    /" | sed "s/ //g"


XMLPipeDB Match Practice

In order to properly use Match, we have to write java -jar <.jar file> <pattern> < <file to be searched>. We do this since Match is a java file; in order to run it, we have to use the keyword java followed by -jar. The Match command that tallies occurrences of the pattern GO:000[567] is the following:

java -jar xmlpipedb-match-1.1.1.jar "GO:000[567]" < 493.P_falciparum.xml

As a result of running the program, the number of unique matches and the number of times each of the unique matches occur are outputted. The following shows the summary of what the result is when the above code is entered into the command line.

  • There are 3 unique matches
  • The occurrences of each unique match is as follows:
    • The pattern go:0007 occurs 113 times.
    • The pattern go:0006 occurs 1100 times.
    • The pattern go:0005 occurs 1371 times.


Similarly, we can use the same set of keywords in order to find other patterns within the files. Trying to find one of the matches (using GO:0005 in this case) we use the grep command:

grep "GO:0005" 493.P_falciparum.xml
  • This gives us all occurrences of the pattern GO:0005. Observing this list of occurrences, the pattern GO:0005 is actually a part of the id's used in the .xml file. From speculation, I would say that each individual GO:0005 corresponds to a gene in the gene ontology page and each number after GO: is the individual identifier for each of the genes. This would mean that there are 113 genes that have the id that start with 0007, 1100 genes that have id's that start with 0006, and 1371 genes with identifier beginning with 0005.


The same keywords are also applied in finding the pattern below. The Match command that tallies occurrences of the pattern \"Yu.*\" is the following:

java -jar xmlpipedb-match-1.1.1.jar \"Yu.*\" < 493.P_falciparum.xml 

As a result of running the program, the number of unique matches and the number of times each of the unique matches occur are outputted. The following shows the summary of what the result is when the above code is entered into the command line.

  • There are 3 unique matches
  • The occurrences of each unique match is as follows:
    • The pattern yu b. occurs 1 time.
    • The pattern yu k. occurs 228 times.
    • The pattern yu m. occurs 1 time.
  • I believe this pattern represents the names of certain authors/contributors to the Gene Ontology database; more specifically, I think they are people whose last names are "Yu"


In tallying the number of occurrences of ATG in hs_ref_GRCh37_chr19.fa:

  • We type the following command in order to use Match: java -jar xmlpipedb-match-1.1.1.jar "ATG" < hs_ref_GRCh37_chr19.fa
    • Using Match, the number of unique occurrences is 1. The number of times it occurred in the file is 830101.
  • In order to use grep + wc in the command line, type in the following: grep "ATG" hs_ref_GRCh37_chr19.fa | wc
    • Using grep and wc, the word count of the pattern is 502410. Unsurprisingly, the number of lines in the file is also 502410. This is because grep only searches each new line for this particular three-letter pattern.
  • The counts are different because Match counts exactly how many times the regular expression appears in the file, regardless of whether or not the regex is in the same line or not. Using grep + wc, however, only the lines where the particular regex appears are counted. Since there are more occurrences in the entire file than there are lines, Match would result in a bigger number than using grep + wc alone.


Assignment Links

Weekly Assignments

Individual Journal Entries

Shared Journal Entries