Kzebrows Week 6

Electronic Lab Notebook

Unzipping Files

For this week's assignment I worked in the lab in Seaver to use the PostgreSQL tool installed on the computers. I used

kzebrows@my.cs.lmu.edu

followed by my password to log into the LMU CMSI database.

Next, following the Direct Download/Unzipping Commands section of the Week 6 Assignment, I downloaded the FDA drug file directly to my.cs.lmu.edu and unzipped it. I opened each file using the cat application.txt or cat Product.txt commands to view them first.

Reworking Files

Provide the sed command sequences that you used to convert the raw text files into sequences of SQL insert statements.

This part of the assignment required the most time and effort. I worked closely with my homework partner, Anu, and also with Brandon K. It took my homework partner and I around four hours to figure out the first command.

Anu explained a lot of what she had worked out for the Application file and then from there I was able to figure out the majority of the Product file. With the Application process, she explained to me what she had done and I attempted to write the commands on my own computer to execute it. She explained to me that when looking at the file, the first step is to replace all tabs with a character and then any spaces afterwards. Initially we tried sed "s/\t/,/g", however; we realized that we couldn't just replace each tab/space combo with a comma because commas and other characters exist within the data itself in the Sponsor Applicant column. This became clear right after trying to put the data into a table. Instead, a star was chosen to replace these spaces/tabs; however, later it became apparent that there were stars in the data itself, so they were replaced with tildes ~.

  cat application.txt | sed "s/\t/~/g" | sed -r "s/( ){2}//g" | sed "s/ \~/~/g" | sed "s/\~ /~/g"

The above sequence of commands produced the following in sed:

  207917~N~GALDERMA LABS~False~False~AP~5~S~
  207925~N~VERTEX PHARMS INC~False~False~AP~3~P~
  207926~N~AKORN INC~False~False~AP~7~S~
  207931~N~ABBVIE INC~False~False~AP~5~P~
  207946~N~JANSSEN PHARMS~False~False~AP~5~P~
  207949~N~ACCORD HLTHCARE INC~False~False~TA~5~~

It then became apparent that there were some instances where there were two or three tildes in a row that needed to be replaced with the word "null" indicating empty values, as leaving that column empty would translate into an error when applied in PostgreSQL. This was done by adding the command sed "s/\~\~/~null~/g" | sed "s/\~\r$/~null/g".

Next it was clear that each tilde needed to be replaced with a comma. This command was found in the More Text Processing Features wiki because we needed to specify how to replace a certain instance in each line of data with a comma and/or apostrophe combo. Because certain features need to be interpreted as text they require apostrophes around them (e.g. Sponsor Applicant column), the sed command sed "s/pattern/replacement/instance to be replaced" was used. At first I tried sed "s/~/,'/1" | sed "s/~/','/2". This only changed the first instance and the third instance, reminding me that it would always be the first instance as the computer reads the commands in order. Apostrophes needed to be added around the ApplType, Sponsor Applicant, Action Type, Ther Potential, and Orphan Code columns. The line then looked like this:

  cat application.txt | sed "s/\t/~/g" | sed -r "s/( ){2}//g" | sed "s/ \~/~/g" | sed "s/\~ /~/g" | 
  sed "s/\~\~/~null~/g" | sed "s/\~\~/~null~/g" | sed "s/\~\r$/~null/g" | sed "s/~/,'/1" | sed "s/~/','/1" | 
  sed "s/~/',/1" | sed "s/~/,/1" | sed "s/~/,'/1" | sed "s/~/',/1" | sed "s/~/,'/1" | sed "s/~/','/1" |sed "s/~/,'/1" |

Which yielded this:

  207946,'N','JANSSEN PHARMS',False,False,'AP',5,'P','null
  207949,'N','ACCORD HLTHCARE INC',False,False,'TA',5,'null','null
  207955,'A','SPEAR PHARMS INC',False,False,'AP',null,'null','null
  207958,'N','APRECIA PHARMS CO',False,False,'AP',3,'P','null

Next, another apostrophe needed to be added to the end of each line, and each instance of 'null or null' needed to be replaced with simply the word null. This was done by adding the commands sed "s/\r$/'/g" | sed "s/'null'/null/g" | sed "s/'null/null/g" to the end of the command line. Then the data was inserted into the application based on what we did in class as well as in the Prepare Data for Insertion section of the PostgreSQL Tutorial page. Finally, the first line needed to be deleted as indicated in the Week 6 assignment page, which was done using the command sed ID .

Command to convert Application (written by Anu with edits from Kristin)

  cat application.txt | sed "s/\t/~/g" | sed -r "s/( ){2}//g" | sed "s/ \~/~/g" | sed "s/\~ /~/g" | 
  sed "s/\~\~/~null~/g" | sed "s/\~\~/~null~/g" | sed "s/\~\r$/~null/g" | sed "s/~/,'/1" | sed "s/~/','/1" | 
  sed "s/~/',/1" | sed "s/~/,/1" | sed "s/~/,'/1" | sed "s/~/',/1" | sed "s/~/,'/1" | sed "s/~/','/1" |sed "s/~/,'/1" | 
  sed "s/\r$/'/g" | sed "s/'null'/null/g" | sed "s/'null/null/g" | sed "s/^/insert into application(ApplNo,ApplType,
  SponsorApplicant, MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,
  Orphan_Code) values(/g" | sed "s/$/);/g" | sed 1D > ~/public_html/application.sql.txt

I was able to visit my page at http://my.cs.lmu.edu/~kzebrows/application.sql.txt, select all, and then add the data to PostgreSQL successfully.

Next, to determine the sed command that would enable me to insert the Product data into PostgreSQL, I started with similar protocol as in the Application file. First, I needed to replace all tabs/spaces with a character. In this case it was all right to use a star (although it had to be designated as \*) because the only stars appeared in the Federal Register warnings as I will address later. It was very similar to the Application command except for the placement of some of the apostrophe/comma combos. In the case of this file, apostrophes needed to be placed around the Form, Dosage, TECode, drugname, and activeingred column. Once I figured out that the TECode column is varchar, not int, the sequence of commands that I tried went like this:

  cat Product.txt | sed "s/\t/*/g" |
  sed -r "s/( ){4}//g" | sed -r "s/( ){2}//g" | sed "s/ \*/*/g" | sed "s/\* /*/g" | sed "s/\*\*/*null*/g" | 
  sed "s/\*\*/*null*/g" | sed "s/\*/,/1" | sed "s/\*/,'/1" | sed "s/\*/','/1" | sed "s/\*/',/1" | sed "s/\*/,'/1"
  | sed "s/\*/',/1" | sed "s/\*/,'/1" | sed "s/\*/','/1" | sed "s/\r$/'/g" |

This gave me the following output:

  207955,001,'GEL;TOPICAL','0.05%',1,'AB',0,'TRETINOIN','TRETINOIN'
  207958,001,'TABLET;ORAL','250MG',1,'null',0,'SPRITAM','LEVETIRACETAM'
  207958,002,'TABLET;ORAL','500MG',1,'null',0,'SPRITAM','LEVETIRACETAM'
  207958,003,'TABLET;ORAL','750MG',1,'null',0,'SPRITAM','LEVETIRACETAM'
  207958,004,'TABLET;ORAL','1GM',1,'null',0,'SPRITAM','LEVETIRACETAM'

It appeared as though the data was ready to be inserted into the PostgreSQL table from here. I tried it using the same protocol as before and received an error message that the star was appearing in the data set itself. I searched for these instances and found that they occurred in the following forms:

  **Federal Register determination that product was not discontinued or withdrawn for safety or efficacy reasons**
  **Federal Register determination that product was not discontinued or withdrawn for safety or efficacy**
  **Federal Register notice determination that product was not discontinued or withdrawn for safety or efficacy reasons**
  **Federal Registration determination that product was not discontinued or withdrawn for safety or efficacy reasons**

Unfortunately this note, while effectively the same message, was not consistent. We had to remove each of them manually using sed "s/\*\*message\*\*//g".

I tried running it again and received another error message, this time due to another note in the database. I removed it in a similar fashion using the command sed "s/See current Annual Edition, 1.8 Description of Special Situations, Levothyroxine Sodium//g". Note that these deletions had to be done at the beginning of the sequence so they did not prove to be a problem later on in the sequence.

I tried running it one more time and received another error message involving a 5'-phosphate sequence. Right after the other deletions I changed this into 5-phosphate via the command sed "s/'//g", replacing all apostrophes with two apostropes (not quotation marks) in the text. This successfully let me get rid of this problem.

After deleting the first line using the command sed 1D, we were finally able to insert this data into the system by adding sed "s/^/insert into product(ApplNo,ProductNo,Form,Dosage,ProductMktStatus,TECode,ReferenceDrug,drugname,activeingred) values(/g".

Final command sequence to convert Product (written by Kristin with edits from Anu)

  cat Product.txt | sed "s/\*\*Federal Register determination that product was not discontinued 
  or withdrawn for safety or efficacy reasons\*\*//g" | sed "s/\*\*Federal Register determination
  that product was not discontinued or withdrawn for safety or efficacy\*\*//g" | sed "s/\*\*Federal
  Register notice determination that product was not discontinued or withdrawn for safety or 
  efficacy reasons\*\*//g" | sed "s/\*\*Federal Registration determination that product was not discontinued
  or withdrawn for safety or efficacy reasons\*\*//g" | sed "s/See current Annual Edition, 1.8 
  Description of Special Situations, Levothyroxine Sodium//g" | sed "s/'//g" | sed "s/\t/*/g" |
  sed -r "s/( ){4}//g" | sed -r "s/( ){2}//g" | sed "s/ \*/*/g" | sed "s/\* /*/g" | sed "s/\*\*/*null*/g" | 
  sed "s/\*\*/*null*/g" | sed "s/\*/,/1" | sed "s/\*/,'/1" | sed "s/\*/','/1" | sed "s/\*/',/1" | sed "s/\*/,'/1"
  | sed "s/\*/',/1" | sed "s/\*/,'/1" | sed "s/\*/','/1" | sed "s/\r$/'/g" | sed "s/^/insert into    
  product(ApplNo,ProductNo,Form,Dosage,ProductMktStatus,TECode,ReferenceDrug,drugname,activeingred) 
  values(/g" | sed "s/$/);/g" | sed 1D > ~/public_html/Product.sql.txt

This portion of the assignment presented me with many challenges. As a biology major with no prior experience, I'm still not completely comfortable with the command line, although things that were difficult earlier in the semester are second nature to me now, which shows me that I'm making progress in learning how to use this new language. My homework partner was very helpful as she has had prior experience with coding, but without her it would have been much harder. Each file came with its own difficulties: with the Product.txt file, there was a lot of trial and error once actually trying to insert the data into PostgreSQL because of the different notes and deletions we had to make as a result. In the Application file, trying to delete spaces/tabs combo was a challenge. As Dr. Dionisio helpfully advised, we could have replaced all instances of \t using the command sed "s/\t/','/g". Then, however, we would have to delete the ones we didn't need manually. Instead of trying that and then working backwards we decided to add each comma/apostrophe combo manually, although we recognize that we could have done it the other way, too.

Creating Tables and Loading Files

Provide the DDL (create table) statements that you used for your application and product tables.

This required me to open the PostgreSQL Tutorial wiki page. I then created the following commands in SQL, following the protocol we did in class with the movie database. The command began with create table [application/Product]. Then, as Application was the primary key, the first thing I entered in parentheses after the create table command was (ApplNo int primary key; however, because Product has a foreign key as indicated by the assignment page, I just used (ApplNo int. Then I entered the names of each column that I viewed by opening the Product and Application files on Notepad, followed by either int for integers, varchar for varying characters, and boolean for columns with True or False values. The following commands to create tables in SQL looked like this:

Application:

  create table Application (ApplNo int primary key, ApplType varchar, SponsorApplicant varchar, 
  MostRecentLabelAvailableFlag boolean, CurrentPatentFlag boolean, ActionType varchar, 
  Chemical_Type int, Ther_Potential varchar, Orphan_Code varchar)

Product:

  create table Product (ApplNo int, ProductNo int, Form varchar, Dosage varchar, ProductMktStatus 
  int, TECode int, ReferenceDrug int, drugname varchar, activeingred varchar)

Note that originally, the TECode column was entered as int; however, upon creating the command (see next question) to convert the files from sed to SQL insert statements, I realized that because not everything in this code was an integer (SQL pointed this out to me with an instance of AA in the TECode column when I tried to insert the data into the SQL table) the command needed to be changed.

  create table Product (ApplNo int, ProductNo int, Form varchar, Dosage varchar, ProductMktStatus 
  int, TECode varchar, ReferenceDrug int, drugname varchar, activeingred varchar)

Once this tables were created I could simply copy and paste the data into PostgreSQL and view it in table format.

Questions to Answer

1. Provide the DDL (create table) statements that you used for your application and product tables.

Application:

  create table Application (ApplNo int primary key, ApplType varchar, SponsorApplicant varchar, 
  MostRecentLabelAvailableFlag boolean, CurrentPatentFlag boolean, ActionType varchar, 
  Chemical_Type int, Ther_Potential varchar, Orphan_Code varchar)

Product:

  create table Product (ApplNo int, ProductNo int, Form varchar, Dosage varchar, ProductMktStatus 
  int, TECode int, ReferenceDrug int, drugname varchar, activeingred varchar)

2. Provide the sed command sequences that you used to convert the raw text files into sequences of SQL insert statements.

Application:

  cat application.txt | sed "s/\t/~/g" | sed -r "s/( ){2}//g" | sed "s/ \~/~/g" | sed "s/\~ /~/g" | 
  sed "s/\~\~/~null~/g" | sed "s/\~\~/~null~/g" | sed "s/\~\r$/~null/g" | sed "s/~/,'/1" | sed "s/~/','/1" | 
  sed "s/~/',/1" | sed "s/~/,/1" | sed "s/~/,'/1" | sed "s/~/',/1" | sed "s/~/,'/1" | sed "s/~/','/1" |sed "s/~/,'/1" | 
  sed "s/\r$/'/g" | sed "s/'null'/null/g" | sed "s/'null/null/g" | sed "s/^/insert into application(ApplNo,ApplType,
  SponsorApplicant, MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,
  Orphan_Code) values(/g" | sed "s/$/);/g" | sed 1D > ~/public_html/application.sql.txt

Product:

  cat Product.txt | sed "s/\*\*Federal Register determination that product was not discontinued 
  or withdrawn for safety or efficacy reasons\*\*//g" | sed "s/\*\*Federal Register determination
  that product was not discontinued or withdrawn for safety or efficacy\*\*//g" | sed "s/\*\*Federal
  Register notice determination that product was not discontinued or withdrawn for safety or 
  efficacy reasons\*\*//g" | sed "s/\*\*Federal Registration determination that product was not discontinued
  or withdrawn for safety or efficacy reasons\*\*//g" | sed "s/See current Annual Edition, 1.8 
  Description of Special Situations, Levothyroxine Sodium//g" | sed "s/'//g" | sed "s/\t/*/g" |
  sed -r "s/( ){4}//g" | sed -r "s/( ){2}//g" | sed "s/ \*/*/g" | sed "s/\* /*/g" | sed "s/\*\*/*null*/g" | 
  sed "s/\*\*/*null*/g" | sed "s/\*/,/1" | sed "s/\*/,'/1" | sed "s/\*/','/1" | sed "s/\*/',/1" | sed "s/\*/,'/1"
  | sed "s/\*/',/1" | sed "s/\*/,'/1" | sed "s/\*/','/1" | sed "s/\r$/'/g" | sed "s/^/insert into    
  product(ApplNo,ProductNo,Form,Dosage,ProductMktStatus,TECode,ReferenceDrug,drugname,activeingred) 
  values(/g" | sed "s/$/);/g" | sed 1D > ~/public_html/Product.sql.txt

3. Using the command line, how can you determine the number of records in each file? Provide the command.

The command for this is wc from Intro to the Command Line, which lists the number of lines, words, and characters when executed.
For the Application text, therefore, I used wc application.txt, which gave me 19747 147336 1615694. This indicates that there were 19,747 lines, with each line corresponding to one record; however, because the first line is simply the column names, the actual number of records is 19,746.
For the Product text I used wc Product.txt which gave me 32771 369379 2856017. This means that there are 32,770 entries in the Product file.

4. Using SQL, how can you determine the number of records in the table corresponding to the file? Provide the SQL select statement.

To select all records from Application I used the command select * from application. This gave me 19,746 records. All I had to do was scroll down on the table to the last row.
I repeated the process for Product using select * from product. This gave me 32,770 records in the Product table.

5. In your database, are these numbers the same or different? Explain why you think so.

These numbers were the same in the database as in the command line, but only because I did not count one entry (the first entry) in the command line answer as an actual record. SQL knows not to count the column names as records (the number of records starts with the first drug that has an ID number) but the command line does not.

NOTE: For the remaining questions, supply (a) the answer to the question and (b) the SQL statement that you used to answer the question.

6. What are the names of the drug products that are administered in the form INJECTABLE;INTRAVENOUS, SUBCUTANEOUS?

LOVENOX, VELCADE, VIDAZA, ENOXAPARIN SODIUM, ACTEMRA, AZACITIDINE
Command used: select * from Product where form = 'INJECTABLE;INTRAVENOUS, SUBCUTANEOUS';
Note that initially, I didn't capitalize what I was searching for in the form table, which gave me 0 results. Only when I capitalized injectable;intravenous, subcutaneous did I get the 6 results.

7. What are the names of the drug products whose active ingredient (activeingred) is ATROPINE?

ATROPEN or ATROPINE.
Command used: select * from product where activeingred= 'ATROPINE';

8. In what forms and dosages can the drug product named BENADRYL be administered?

This was most easily done just by searching for Benadryl. It can be taken in the following ways in the following dosages:
- capsule;oral in 50 mg
- elixir;oral in 12.5 mg/ml
- capsule;oral in 25 mg
- injectable;injection in 10mg/ml
- injectable;inejction in 50mg/ml
Command used: select * from Product where drugname = 'BENADRYL'

9. Which drug products have a name ending in ESTROL?

DIETHYLSTILBESTROL, STILBESTROL, DIENESTROL
For this question I used the tilde to signify that I was trying to compare using patterns like grep and sed. Then I was able to use $ to indicate that I wanted only the drugs where the name had -estrol at the end. The command, therefore, was select * from Product where drugname ~ 'ESTROL$';
Note that this search yielded 24 results but only 3 different kinds of drugs. They vary in form and dosage.

10. Produce a table listing all of the known values for the therapeutic_potential column in the application table and how many application records there are of each. (Side note: The therapeutic_potential codes are explained in the ReviewClass_Lookup.txt file, in case you’re interested.)

Command used: select ther_potential, count (*) from application group by ther_potential

11. Produce a table listing all of the known values for the chemical_type column in the application table and how many application records there are of each. (Side note: The chemical_type codes are explained in the ChemTypeLookup.txt file, in case you’re interested.)

Command used: select chemical_type, count (*) from application group by chemical_group

12. What are the names of the drug products that are sponsored (sponsor applicant column) by MERCK?

"AMINOHIPPURATE SODIUM", "MANNITOL 25%", "REDISOL", "CORTONE", "BENEMID", "HYDROCORTONE", "CYCLAINE", "COGENTIN", "ARAMINE", "LERITINE", "HYDELTRA-TBA", "LERITINE", "FLOROPRYL", "HYDELTRASOL", "DIUPRES-500", "DECADRON", "HYDRODIURIL", "HUMORSOL", "HYDROPRES 25", "HYDROPRES 50", "COLBENEMID", "PERIACTIN", "DECASPRAY", "DECADRON W/ XYLOCAINE", "ALDOMET", "ALDORIL 15", "ALDORIL 25", "ALDORIL D30", "ALDORIL D50", "DECADERM", "ALDOCLOR-150", "ALDOCLOR-250", "DECADRON-LA", "CLINORIL", "BLOCADREN", "TIMOLIDE 10-25", "MODURETIC 5-50", "ALDOMET", "DOLOBID", "NOROXIN", "PEPCID", "PEPCID PRESERVATIVE FREE", "PRINIVIL", "MEVACOR", "PRINZIDE", "CHIBROXIN", "ZOCOR", "PRINZIDE", "PROSCAR", "TRUSOPT", "TIAMATE", "PEPCID RPD", "PROPECIA", "SINGULAIR", "MAXALT", "MAXALT-MLT", "VIOXX", "CANCIDAS", "EMEND", "FOSAMAX", "VIOXX", "FOSAMAX PLUS D", "ZOLINZA", "NEODECADRON", "NEO-HYDELTRASOL", "MEFOXIN IN SODIUM CHLORIDE 0.9% IN PLASTIC CONTAINER", "MEFOXIN IN DEXTROSE 5% IN PLASTIC CONTAINER", "PRIMAXIN", "ALPHAREDISOL", "ELSPAR"
Command used: Select * from Application inner join Product on (application.ApplNo = product.ApplNo) where sponsorapplicant = 'MERCK'

13. Which sponsor applicant companies have the text LABS in their names and have products whose active ingredients (activeingred) include both ASPIRIN and CAFFEINE?

Actavis Labs UT Inc and Watson Labs
Command used: select * from Application inner join Product on (application.ApplNo = Product.ApplNo) where sponsorapplication like '%LABS%' and activeingred like '%ASPIRIN%CAFFEINE%';