Anuvarsh Week 6

From LMU BioDB 2015
Jump to: navigation, search

The FDA Drug Database

Electronic Lab Notebook

Some parts of this assignment were done individually, but the commands were written with my homework partner, Kristen. (Yay, team work!)

The first file that I attempted to convert was the application.txt file. From the hints I knew that the columns in the file were separated by tabs, so my first task was to replace all of the tabs with a character that would indicate the end of a column. I initially chose to replace the tabs with a comma, but quickly realized that this was unwise because some of the Sponsor Applicants had commas in their name. I then attempted to signify the end of a column with a star (*), but learned that that would not work because some of the entries in the Ther-Potential column had stars's in them. I finally decided to signify the end of each column with a tilde (~). This worked out quite well because it did not interfere with the commas within the entries while also providing me leverage to modify the beginnings and ends of each entry.

After replacing all of the tabs with tildes, I realized that there were still many spaces in the document. I decided to remove all instances of two spaces. The reason I chose to eliminate pairs of spaces instead of all spaces in the document is because that would modify the Sponsor Applicant names and we do not want that. After removing all pairs of spaces, I removed the spaces before and after each tilde so that each column was separated only by a tilde. At this point I recognized that the file contained several null entries. After quickly playing around with Postgre, I learned that simply leaving a column empty (i.e. ",,") would result in an error and that instead, I needed to insert the term "null" into ever undefined entry. In order to do this, I searched the document for two tildes right next to each other and inserted "null" between them. Because in some cases there were 3 tildes next to each other, I needed to run this command twice. Finally, I knew that if each line did not end with a character (or if each line ended with a tilde), there was a missing entry in the last column, so I should add "null" to the end of those lines. At this point, all nulls and column breaks were in place and ready to be modified.

Before proceeding, I created a table on a piece of scratch paper next to me that identified all of the columns that required apostrophes around their entries. These were the columns that had data that were varchars. I then proceeded to replace all of the tildes with commas or commas and apostrophes to systematically surround each necessary entry with apostrophes. Weirdly enough (and I'm still not sure why), these apostrophes were not added to any line where the final entry was "null". Instead, it was only added to lines where the final character was one that belonged to an actual entry. This becomes important later. I then added a final apostrophe at the end of each line since the last column contained varchars. If "null" is surrounded with apostrophes, SQL will read it as a term instead of a point where there is no data entry. In order to prevent "null" from being read as a data entry, I replaced all instances of 'null' with null. I then repeated this process for any situations where there was only an apostrophe in front of "null". Finally, I entered the insert into... for SQL and finished off each line with a closing parentheses and semicolon. I then deleted the first line from the file because it contained heading information that we do not want to insert into our table. I was finally left with the following command that allowed me to quickly convert the entire application.txt file into a series of SQL commands.

Application.txt --> SQL commands

   cat application.txt | sed "s/\t/~/g" | sed -r "s/( ){2}//g" | sed "s/ \~/~/g" | sed "s/\~ /~/g" | 
   sed "s/~~/~null~/g" | sed "s/~~/~null~/g" | sed "s/~\r$/~null/g" | sed "s/~/,'/1" | sed "s/~/','/1" | sed "s/~/',/1" | 
   sed "s/~/,/1" | sed "s/~/,'/1" | sed "s/~/',/1" | sed "s/~/,'/1" | sed "s/~/','/1" |sed "s/~/,'/1" | sed "s/\r$/'/g" | 
   sed "s/'null'/null/g" | sed "s/'null/null/g" | sed "s/^/insert into application(ApplNo,ApplType,SponsorApplicant,
   MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values(/g" | 
   sed "s/$/);/g" | sed 1D > ~/public_html/application.sql.txt

After making progress with the application.txt file, I caught up with my partner who had taken the first few sed commands from the previous pipeline (after I explained how I figured it out and she tried it for herself) and applied them to the Product.txt file. (This is evident because she jumped into the Product.txt file while I still had stars instead of tildes - the tildes were the absolute last change that I made to our code). The first thing we noticed about this files were the several different versions of safety warnings that were embedded within the data. We found all of the derivatives of this warning and removed them completely. We did something similar for other random, unrelated pieces of text within the data. We then replaced all the apostrophes in the file with double apostrophes so that SQL would still show an apostrophe within those entries (This was important because there were some instances of "5'" within the file). We then proceeded to replace the tabs with stars, get rid of the spaces, add commas and apostrophes in the appropriate areas, and added the insert into... command and ending characters the same way as we did for the previous file. We also deleted the first line from this file because it only contained column headings instead of actual data.

We initially thought that the "TECODE" column contained integer values, but were surprised to see that the SQL commands were not working as we had hoped because some of the entries under the "TECODE" column were letters. We went back and added apostrophes around all of the entries in that column using the same method as earlier. After figuring that out, we were left with the following command that beautifully converted each line of the Product.txt into a SQL command.

Product.txt --> SQL commands

   cat Product.txt | sed "s/\*\*Federal Register determination that product was not discontinued or withdrawn for safety or 
   efficacy reasons\*\*//g" | sed "s/\*\*Federal Register determination that product was not discontinued or withdrawn for safety or 
   efficacy\*\*//g" | sed "s/\*\*Federal Register notice determination that product was not discontinued or withdrawn for safety or 
   efficacy reasons\*\*//g" | sed "s/\*\*Federal Registration determination that product was not discontinued or withdrawn for safety 
   or efficacy reasons\*\*//g" | sed "s/See current Annual Edition, 1.8 Description of Special Situations, Levothyroxine Sodium//g" | 
   sed "s/'//g" | sed "s/\t/*/g" | sed -r "s/( ){2}//g" | sed "s/ \*/*/g" | sed "s/\* /*/g" | sed "s/\*\*/*null*/g" | 
   sed "s/\*\*/*null*/g" | sed "s/\*/,/1" | sed "s/\*/,'/1" | sed "s/\*/','/1" | sed "s/\*/',/1" | sed "s/\*/,'/1" | sed "s/\*/',/1" | 
   sed "s/\*/,'/1" | sed "s/\*/','/1" | sed "s/\r$/'/g" | sed "s/^/insert into product(ApplNo,ProductNo,Form,Dosage,ProductMktStatus,
   TECode,ReferenceDrug,drugname,activeingred) values(/g" | sed "s/$/);/g" | sed 1D > ~/public_html/Product.sql.txt

Questions to Answer

  1. Provide the DDL (create table) statements that you used for your application and product tables.
    • For application.txt
      • create table application (ApplNo int primary key, ApplType varchar, SponsorApplicant varchar, MostRecentLabelAvailableFlag boolean, CurrentPatentFlag boolean, ActionType varchar, Chemical_Type int, Ther_Potential varchar, Orphan_Code varchar)
    • For product.txt
      • create table Product (ApplNo int, ProductNo int, Form varchar, Dosage varchar, ProductMktStatus varchar, TECode varchar, ReferenceDrug int, drugname varchar, activeingred varchar)
  2. Provide the sed command sequences that you used to convert the raw text files into sequences of SQL insert statements.
    • For application.txt
      • cat application.txt | sed "s/\t/~/g" | sed -r "s/( ){2}//g" | sed "s/ ~/~/g" | sed "s/~ /~/g" | sed "s/~~/~null~/g" | sed "s/~~/~null~/g" | sed "s/~\r$/~null/g" | sed "s/~/,'/1" | sed "s/~/','/1" | sed "s/~/',/1" | sed "s/~/,/1" | sed "s/~/,'/1" | sed "s/~/',/1" | sed "s/~/,'/1" | sed "s/~/','/1" |sed "s/~/,'/1" | sed "s/\r$/'/g" | sed "s/'null'/null/g" | sed "s/'null/null/g" | sed "s/^/insert into application(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values(/g" | sed "s/$/);/g" | sed 1D > ~/public_html/application.sql.txt
    • For product.txt
      • cat Product.txt | sed "s/\*\*Federal Register determination that product was not discontinued or withdrawn for safety or efficacy reasons\*\*//g" | sed "s/\*\*Federal Register determination that product was not discontinued or withdrawn for safety or efficacy\*\*//g" | sed "s/\*\*Federal Register notice determination that product was not discontinued or withdrawn for safety or efficacy reasons\*\*//g" | sed "s/\*\*Federal Registration determination that product was not discontinued or withdrawn for safety or efficacy reasons\*\*//g" | sed "s/See current Annual Edition, 1.8 Description of Special Situations, Levothyroxine Sodium//g" | sed "s/'/(double apostrophe)/g" | sed "s/\t/*/g" | sed -r "s/( ){2}//g" | sed "s/ \*/*/g" | sed "s/\* /*/g" | sed "s/\*\*/*null*/g" | sed "s/\*\*/*null*/g" | sed "s/\*/,/1" | sed "s/\*/,'/1" | sed "s/\*/','/1" | sed "s/\*/',/1" | sed "s/\*/,'/1" | sed "s/\*/',/1" | sed "s/\*/,'/1" | sed "s/\*/','/1" | sed "s/\r$/'/g" | sed "s/^/insert into product(ApplNo,ProductNo,Form,Dosage,ProductMktStatus,TECode,ReferenceDrug,drugname,activeingred) values(/g" | sed "s/$/);/g" | sed 1D > ~/public_html/Product.sql.txt
        • Due to formatting shortcuts inherent to MediaWiki (double apostrophe for italics), the sed command responsible for replacing all of the apostrophes in the original file with double apostrophes has been modified to explicitly say "double apostrophe". Without this modification, the rest of the code was italicized and the sed command itself was incorrect.
  3. Using the command line, how can you determine the number of records in each file? Provide the command.
    • For application.txt
      • wc application.txt
      • 19747 - 1 = 19746 records
    • For product.txt
      • wc Product.txt
      • 32771 - 1 = 32770 records
    • Then subtract 1 from the first number listed in both cases(since it represents the number of lines, and the first line consists of column headings while all other lines are records).
  4. Using SQL, how can you determine the number of records in the table corresponding to the file? Provide the SQL select statement.
    • For application.txt
      • select count(*) from application
      • 19746 records
    • For product.txt
      • select count(*) from product
      • 32770 records
  5. In your database, are these numbers the same or different? Explain why you think so.
    • After subtracting the line that contains the column from the command line answer, the numbers are the same. They are the same because ever line in each document (except the heading) was turned into an entry in the respective table.
    For the remaining questions, supply (a) the answer to the question and (b) the SQL statement that you used to answer the question.
  6. What are the names of the drug products that are administered in the form INJECTABLE;INTRAVENOUS, SUBCUTANEOUS?
    • a)"LOVENOX", "VELCADE", "VIDAZA", "ENOXAPARIN SODIUM", "ACTEMRA", "AZACITIDINE"
    • b) select drugname from product where form = 'INJECTABLE;INTRAVENOUS, SUBCUTANEOUS';
  7. What are the names of the drug products whose active ingredient (activeingred) is ATROPINE?
    • a)"ATROPEN", "ATROPINE"
    • b) select drugname from product where activeingred = 'ATROPINE';
  8. In what forms and dosages can the drug product named BENADRYL be administered?
    • a)"CAPSULE;ORAL";"50MG", "ELIXIR;ORAL";"12.5MG/5ML", "CAPSULE;ORAL";"25MG", "INJECTABLE;INJECTION";"10MG/ML", "INJECTABLE;INJECTION";"50MG/ML"
    • b)select form, dosage from product where drugname = 'BENADRYL';
  9. Which drug products have a name ending in ESTROL?
    • a)"DIETHYLSTILBESTROL", "STILBESTROL", "DIENESTROL"
    • b)select drugname from product where drugname like '%ESTROL';
  10. Produce a table listing all of the known values for the therapeutic_potential column in the application table and how many application records there are of each. (Side note: The therapeutic_potential codes are explained in the ReviewClass_Lookup.txt file, in case you’re interested.)
    • a)
(index) ther_potential count
1 S* 3
2 S 3459
3 P 928
4 P* 26
    • b)select ther_potential, count(*) from application where ther_potential like '%' group by ther_potential;
  1. Produce a table listing all of the known values for the chemical_type column in the application table and how many application records there are of each. (Side note: The chemical_type codes are explained in the ChemTypeLookup.txt file, in case you’re interested.)
    • a)
(index) chemical_type count
1 6 87
2 14 34
3 34 9
4 8 27
5 1 1271
6 2 139
7 3 1464
8 24 8
9 7 35
10 10 2
11 23 8
12 4 323
13 5 1046
14 9 1
    • b)select chemical_type, count(*) from application where chemical_type >=0 or chemical_type <0 group by chemical_type;
  1. What are the names of the drug products that are sponsored (sponsor applicant column) by MERCK?
    • a)"DECADRON", "HUMORSOL", "NEO-HYDELTRASOL", "PRINIVIL", "MAXALT-MLT", "HYDROCORTONE", "PERIACTIN", "PROPECIA", "PROSCAR", "CLINORIL", "PRINZIDE", "ELSPAR", "ALDOMET", "ALDORIL D30", "EMEND", "DIUPRES-500", "NEODECADRON", "ALDOCLOR-250", "MEFOXIN IN DEXTROSE 5% IN PLASTIC CONTAINER", "ARAMINE", "TIAMATE", "BLOCADREN", "CANCIDAS", "CHIBROXIN", "CORTONE", "PEPCID", "TRUSOPT", "REDISOL", "VIOXX", "FLOROPRYL", "ALPHAREDISOL", "DECADRON-LA", "MEFOXIN IN SODIUM CHLORIDE 0.9% IN PLASTIC CONTAINER", "PEPCID RPD", "NOROXIN", "ZOCOR", "LERITINE", "DECADERM", "DOLOBID", "MANNITOL 25%", "DECADRON W/ XYLOCAINE", "ALDORIL D50", "TIMOLIDE 10-25", "CYCLAINE", "HYDROPRES 25", "AMINOHIPPURATE SODIUM", "MEVACOR", "MODURETIC 5-50", "ALDORIL 25", "SINGULAIR", "COLBENEMID", "DIUPRES-250", "HYDELTRA-TBA", "PRIMAXIN", "BENEMID", "MAXALT", "FOSAMAX PLUS D", "HYDRODIURIL", "HYDELTRASOL", "HYDROPRES 50", "ZOLINZA", "ALDORIL 15", "FOSAMAX", "DECASPRAY", "COGENTIN", "ALDOCLOR-150", "PEPCID PRESERVATIVE FREE"
    • b)select drugname from product inner join application on (product.applno = application.applno) where sponsorapplicant = 'MERCK' group by drugname;
  2. Which sponsor applicant companies have the text LABS in their names and have products whose active ingredients (activeingred) include both ASPIRIN and CAFFEINE?
    • a)"ACTAVIS LABS UT INC", "WATSON LABS"
    • b)select sponsorapplicant from application inner join product on (application.applno = product.applno) where sponsorapplicant like '%LABS%' and activeingred like '%ASPIRIN%' and activeingred like '%CAFFEINE%' group by sponsorapplicant;

Other Links

User Page: Anindita Varshneya
Class Page: BIOL/CMSI 367: Biological Databases, Fall 2015
Group Page: GÉNialOMICS

Assignment Pages

Week 1 Assignment
Week 2 Assignment
Week 3 Assignment
Week 4 Assignment
Week 5 Assignment
Week 6 Assignment
Week 7 Assignment
Week 8 Assignment
Week 9 Assignment
Week 10 Assignment
Week 11 Assignment
Week 12 Assignment
No Week 13 Assignment
Week 14 Assignment
Week 15 Assignment

Individual Journals

Individual Journal Week 2
Individual Journal Week 3
Individual Journal Week 4
Individual Journal Week 5
Individual Journal Week 6
Individual Journal Week 7
Individual Journal Week 8
Individual Journal Week 9
Individual Journal Week 10
Individual Journal Week 11
Individual Journal Week 12
Individual Journal Week 14
Individual Journal Week 15

Shared Journals

Class Journal Week 1
Class Journal Week 2
Class Journal Week 3
Class Journal Week 4
Class Journal Week 5
Class Journal Week 6
Class Journal Week 7
Class Journal Week 8
Class Journal Week 9
GÉNialOMICS Journal Week 10
GÉNialOMICS Journal Week 11
GÉNialOMICS Journal Week 12
GÉNialOMICS Journal Week 14
GÉNialOMICS Journal Week 15