Jwoodlee Week 6

From LMU BioDB 2015
Jump to: navigation, search

Lab Notebook

I sshed into my home directory on my.cs.lmu.edu using the built in command. I downloaded the file from the FDA home page using terminal, and then unzipped the file using terminal as well. I then had all the appropriate files in my home directory.

At first I wasn't familiar with how the data was formatted in application.txt so I wasn't sure what variables to include in the tables, but after looking at the top of the text document, I found out the variables are: ApplNo, ApplType, SponsorApplicant, MostRecentLabelAvailableFlag, CurrentPatentFlag, ActionType, Chemical_Type, Ther_Potential, and Orphan_Code. It appears the data types are: int primary key, varchar, varchar, boolean, boolean, varchar, int, varchar, and varchar respectively. So with this information it is now pretty easy to make a table for application, see questions below. Next I needed to figure out the variables for product.txt. Opening the file it is easy to see the names for each column: ApplNo, ProductNo, Form, Dosage, ProductMktStatus, TECode, ReferenceDrug, drugname, activeingred. Based on the values of data in each individual column I would say that the data types are: int, int, varchar, varchar, varchar, varchar, int, varchar, and varchar respectively. Again, from this info it is easy to make a table for product.txt.

With the tables setup my goal was then to add the appropriate data, to do this I used the sed command. The first line of application.txt is useless so I removed that: cat application.txt | sed "1D". Then, I wanted to remove all tabs because I thought there was a variable number of tabs between each piece of data. However, I was wrong and I discovered there were actually a bunch of extra spaces in the way of single tabs separating the data. What I thought was a variable number of tabs was really a variable number of spaces. Some of the data relies on spaces so I couldn't just delete all the spaces, instead I removed all spaces next to another space because no valuable spaces were ever right next to each other. I was then left with only the valuable spaces within the data. So I had: cat application.txt | sed "1D" | sed "s/ //g".

In class I heard the discussion about "\r" meaning the end of a line so I looked it up and it means "carriage return", which is a basically a character representing someone hitting the enter key. This became useful later.

I noticed many rows were short a few pieces of data so, assuming each piece of data is separated by a tab, I needed to insert a 'null' value in between all tabs that are right next to eachother. So I did this and noticed some rows were still short data which meant they had two pieces of data missing right next to each other, fortunately to remedy this I just ran the command again, so:

cat application.txt | sed "1D" | sed "s/ //g" | sed "s/\t\t/\tnull\t/g" | sed "s/\t\t/\tnull\t/g" .

Now I had what I wanted. I had every piece of data separated by a single tab with the nulls inserted into the correct place, time to add the insert command!

Whoops I realized that a lot of my rows only had 8 pieces of data when I really needed 9. I realized most data didn't have an Orphan_Code so I needed to tack null onto the end of a lot of rows. To do this I used the carriage return. Every line that ends with a tab and a carriage return needs to have a null added to the end, so I did this: cat application.txt | sed "1D" | sed "s/ //g" | sed "s/\t\t/\tnull\t/g" | sed "s/\t\t/\tnull\t/g" | sed "s/\t\r/\tnull\r/g"

Now that I have the data properly formatted and separated by tabs, I can insert the apostrophes for the varchars, and replace the tabs with commas.

With sed "s/<to be replaced>/<replacer>/1" one can replace the first instance of string in the <to be replaced> spot on the line. So using this it becomes pretty easy, albeit long, to insert commas and apostrophes. When all that is done there are a bunch of null values that look like: 'null' when they should have no apostrophes around them. To get rid of those simply replace 'null' with null. Then I used the carriage return to insert the final parenthesis, and I used the carrot to insert the beginning command. Final command is as follows:

cat application.txt | sed "1D" | sed "s/ //g" | sed "s/\t\t/\tnull\t/g" | sed "s/\t\t/\tnull\t/g" | sed "s/\t\r/\tnull\r/g" | sed "s/\t/,'/1" | sed "s/\t/','/1" | sed "s/\t/','/1" | sed "s/\t/','/1" | sed "s/\t/','/1" | sed "s/\t/',/1" | sed "s/\t/,'/1" | sed "s/\t/','/g" | sed "s/\r/'\r/g" | sed "s/'null'/null/g" | sed "s/^/insert into application(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values(/g" | sed "s/\r/);\r/g" > ~/public_html/application.sql.txt

After reading the first line form Product.txt and gleaning the information from it, it becomes useless: cat Product.txt | sed "1D". Fortunately this one doesn't appear to have any weird extra spaces, so the procedure should be: insert nulls into correct spaces, add commas and apostrophes, add beginning and end of command. So.... cat Product.txt | sed "1D"

The added wrinkle for this one is that it has apostrophes within some of the data so to make sure those are included simple replace all apostrophes with a double apostrophe before the inserting of commas and apostrophes begins. Other than that it is the same basic format and was rather easy to come up with using what I did for application.txt.

cat Product.txt | sed "1D" | sed "s/\t\t/\tnull\t/g" | sed "s/\t\t/\tnull\t/g" | sed "s/\t\r/\tnull\r/g" | sed "s/'/''/g" | sed "s/\t/,/1" | sed "s/\t/,'/1" | sed "s/\t/','/1" | sed "s/\t/',/1" | sed "s/\t/,'/1"| sed "s/\t/',/1" | sed "s/\t/,'/1" | sed "s/\t/','/1" | sed "s/\r/'\r/g" | sed "s/'null'/null/g" | sed "s/^/insert into product(ApplNo, ProductNo, Form, Dosage, ProductMktStatus, TECode, ReferenceDrug, drugname, activeingred) values(/g" | sed "s/\r/);\r/g" > ~/public_html/product.sql.txt


Questions to Answer

  • Provide the DDL (create table) statements that you used for your application and product tables.
    • create table application (ApplNo int primary key, ApplType varchar, SponsorApplicant varchar, MostRecentLabelAvailableFlag boolean, CurrentPatentFlag boolean, ActionType varchar, Chemical_Type int, Ther_Potential varchar, Orphan_Code varchar)
    • create table Product (ApplNo int, ProductNo int, Form varchar, Dosage varchar, ProductMktStatus varchar, TECode varchar, ReferenceDrug int, drugname varchar, activeingred varchar)
  • Provide the sed command sequences that you used to convert the raw text files into sequences of SQL insert statements.
    • cat application.txt | sed "1D" | sed "s/ //g" | sed "s/\t\t/\tnull\t/g" | sed "s/\t\t/\tnull\t/g" | sed "s/\t\r/\tnull\r/g" | sed "s/\t/,'/1" | sed "s/\t/','/1" | sed "s/\t/','/1" | sed "s/\t/','/1" | sed "s/\t/','/1" | sed "s/\t/',/1" | sed "s/\t/,'/1" | sed "s/\t/','/g" | sed "s/\r/'\r/g" | sed "s/'null'/null/g" | sed "s/^/insert into application(AppleNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values(/g" | sed "s/\r/);\r/g" > ~/public_html/application.sql.txt
    • cat Product.txt | sed "1D" | sed "s/\t\t/\tnull\t/g" | sed "s/\t\t/\tnull\t/g" | sed "s/\t\r/\tnull\r/g" | sed "s/'/''/g" | sed "s/\t/,/1" | sed "s/\t/,'/1" | sed "s/\t/','/1" | sed "s/\t/',/1" | sed "s/\t/,'/1" | sed "s/\t/',/1" | sed "s/\t/,'/1" | sed "s/\t/','/1" | sed "s/\r/'\r/g" | sed "s/'null'/null/g" | sed "s/^/insert into product(ApplNo, ProductNo, Form, Dosage, ProductMktStatus, TECode, ReferenceDrug, drugname, activeingred) values(/g" | sed "s/\r/);\r/g" > ~/public_html/product.sql.txt
  • Using the command line, how can you determine the number of records in each file? Provide the command.
    • The number or records is exactly the same as the number of lines in our modified file, therefore a simple piped wc command in place of the > ~/public_html/product.sql.txt will yield the result on either file.
  • Using SQL, how can you determine the number of records in the table corresponding to the file? Provide the SQL select statement.
    • Using Select count(*) from product or Select count(*) from application will yield the number of records in the table.
  • In your database, are these numbers the same or different? Explain why you think so.
    • When I ran the commands I got exactly the same numbers, so I would say they are the same. This is the case because assuming we formatted the text files correctly, each line holds one insert command which adds a record to the table, therefore the number of lines represents the number of records.
  • For the remaining questions, supply (a) the answer to the question and (b) the SQL statement that you used to answer the question.
  • What are the names of the drug products that are administered in the form INJECTABLE;INTRAVENOUS, SUBCUTANEOUS?
    1. "LOVENOX", "VELCADE", "VIDAZA", "ENOXAPARIN SODIUM", "ACTEMRA", "AZACITIDINE"
    2. select drugname from product where Form = 'INJECTABLE;INTRAVENOUS, SUBCUTANEOUS';
  • What are the names of the drug products whose active ingredient (activeingred) is ATROPINE?
    1. "ATROPEN" (x4), "ATROPINE"
    2. select drugname from product where activeingred = 'ATROPINE';
  • In what forms and dosages can the drug product named BENADRYL be administered?
    1. Forms: "CAPSULE;ORAL", "ELIXIR;ORAL", "CAPSULE;ORAL", "INJECTABLE;INJECTION", "INJECTABLE;INJECTION" dosage: "50MG","12.5MG/5ML","25MG","10MG/ML", "50MG/ML"
    2. select Form from product where drugname = 'BENADRYL'; select dosage from product where drugname = 'BENADRYL';
  • Which drug products have a name ending in ESTROL?
    1. "DIETHYLSTILBESTROL"(x11), "STILBESTROL"(x6), "DIENESTROL"(x1), "STILBESTROL"(x12)
    2. select drugname from Product where drugname ~'ESTROL$'
  • Produce a table listing all of the known values for the therapeutic_potential column in the application table and how many application records there are of each. (Side note: The therapeutic_potential codes are explained in the ReviewClass_Lookup.txt file, in case you’re interested.)
    1. Therapeutic_Potential; Count, S*; 3, S; 3460, P; 928, P*; 26
    2. select therapeutic_potential,count(*) from application group by therapeutic_potential
  • Produce a table listing all of the known values for the chemical_type column in the application table and how many application records there are of each. (Side note: The chemical_type codes are explained in the ChemTypeLookup.txt file, in case you’re interested.)
    1. 6;87, 14;34, 34;9, 8;27, 1;1271, 2;139, 3;1464, 10;2, 23;8, 4;323, 5;1046, 9;1, 24;8, 7;35
    2. select chemical_type,count(*) from application group by chemical_type
  • What are the names of the drug products that are sponsored (sponsor applicant column) by MERCK?
    1. "DECADRON""HUMORSOL""NEO-HYDELTRASOL""PRINIVIL""MAXALT-MLT""HYDROCORTONE""PERIACTIN""PROPECIA""PROSCAR""CLINORIL""PRINZIDE""ELSPAR""ALDOMET""ALDORIL D30""EMEND""DIUPRES-500""NEODECADRON""ALDOCLOR-250""MEFOXIN IN DEXTROSE 5% IN PLASTIC CONTAINER""ARAMINE""TIAMATE""BLOCADREN""CANCIDAS""CHIBROXIN""CORTONE""PEPCID""TRUSOPT""REDISOL""VIOXX""FLOROPRYL""ALPHAREDISOL""DECADRON-LA""MEFOXIN IN SODIUM CHLORIDE 0.9% IN PLASTIC CONTAINER""PEPCID RPD""NOROXIN""ZOCOR""LERITINE""DECADERM""DOLOBID""MANNITOL 25%""DECADRON W/ XYLOCAINE""ALDORIL D50""TIMOLIDE 10-25""CYCLAINE""HYDROPRES 25""AMINOHIPPURATE SODIUM""MEVACOR""MODURETIC 5-50""ALDORIL 25""SINGULAIR""COLBENEMID""DIUPRES-250""HYDELTRA-TBA""PRIMAXIN""BENEMID""MAXALT""FOSAMAX PLUS D""HYDRODIURIL""HYDELTRASOL""HYDROPRES 50""ZOLINZA""ALDORIL 15""FOSAMAX""DECASPRAY""COGENTIN""ALDOCLOR-150""PEPCID PRESERVATIVE FREE"
    2. select drugname from product,application where product.applno = application.applno and application.sponsorapplicant = 'MERCK' group by drugname
  • Which sponsor applicant companies have the text LABS in their names and have products whose active ingredients (activeingred) include both ASPIRIN and CAFFEINE?
    1. "WATSONLABS", "ACTAVISLABSUTINC"
    2. select sponsorapplicant from product,application where application.applno = product.applno and sponsorapplicant ~ 'LABS' and activeingred ~ 'ASPIRIN' and activeingred ~ 'CAFFEINE' group by sponsorapplicant


BIOL 367, Fall 2015, User Page, Team Page

Weekly Assignments Individual Journal Pages Shared Journal Pages