Bklein7 Week 6

From LMU BioDB 2015
Jump to: navigation, search

Electronic Lab Notebook

Acquiring Data From the Drugs@FDA ZIP File

Downloading and Viewing

  • To download the zip file directly from the FDA website, I typed the following into the command line:
curl -O http://www.fda.gov/downloads/Drugs/InformationOnDrugs/UCM054599.zip
  • Next, I unziped the FDA file
unzip UCM054599.zip
  • The resulting output included the application.txt and Product.txt files. To view the format of these files, I used more commands. The output I received appeared to have listed columns which were not clearly defined. Therefore, I visualized the data using Notepad. This made the columns easier to read.
    • In Notepad, I got a better idea of how I needed to edit the txt files to import them into a table in pgAdminIII.

Editing the .txt Files

  • application.txt
    1. Special Characters
      • To begin, I wanted to determine if there were any special characters in the text file such as semicolons or apostrophes that needed to be worked around. I did two searches in notepad for these characters and found that this was not the case.
    2. Removing Excess Tabs and Spaces
      • Next, I began the pipe with a cat command and then wanted to clean up the file to make it easier to read.
        • Data values were separated by tab delimiters on the FDA website. Therefore, I began by researching how to manipulate tabs on the command line. To figure out how to replace the tabs with commas, I did some google searching. I found a way to tackle this problem on electrictoolbox. Using this piece of information, I wrote a sed command to replace tabs with commas. This cleaned up excess spaces from the file, although not entirely.
        • In reviewing the output, I found there to be multiple instances of sequences of 2 or 4 spaces. Therefore, I wrote a sed -r command to delete these sequences.
        • Finally, I noticed that some values for the SponsorApplicant column included unnecessary spaces after the listed name. Therefore, I wrote a sed command to delete these extra spaces. The output given by this version of the command sequence only had spaces in the application sponsor names as was desired.
    3. Formatting the Values
      • Now that excess spaces were deleted, I needed to clean up the syntax so that the 9 data values would be read correctly by SQL.
        • The most glaring issue with the cleanup up output was the preponderance of missing data values. Doing a google search, I found here that missing values are labelled as "NULL" in the command line. Thus, I had to find a way to replace all missing values with NULL.
          • In every case except for the 9th data value at the end of the line, a null value was indicated by 2 commas in a row. Therefore, I wrote a sed command to insert the word NULL in between instances of 2 commas. Because occasionally there would be two missing values in a row, this command did not entirely work. Although I image there are more compact ways to address this issue, I simply repeated the command twice to clean up any skipped values.
          • To handle the null values at the end of a line, I wrote a sed command to add NULL at the end of each line that ended with a comma.
        • The next issue in formatting the data values for insert was placing all varchar inputs within single quotes. This presented a challenge because only some of the data values were varchars, whereas others were intergers. Additionally, some values were absent (null) and thus could not be set off by single quotes.
          • To tackle this issue, I wrote a long string of linked sed commands to selectively add single quotes before/after specific commas in each line. This way, I only added single quotes to data values in varchar columns.
        • To address the null values issue, I wrote a sed command to delete single quotes present around any occurrences of the word null.
        • In reviewing the output of the above commands, I noticed that some SponsorApplicant entries included commas as part of the data. This rendered my selective adding of single quotes around commas technique as ineffective for these lines. To solve this problem, I added a sed command at the very beginning of the pipe to replace any commas within the raw data with tildes. Then, at the end of the pipe, I added a command to replace these tildes with commas once again. This fix in formatting the data worked properly.
    4. Writing the Values as Part of SQL Commands
      • The next step in formatting the application.txt file was edit each line so it would read off as an SQL command.
        • Referencing the application.txt file in Notepad, I identified all the column names for the data. Then using a sed command, I added the appropriate insert into table string to the beginning of each line.
        • Next, I wrote a sed command to add the appropriate SQL closing to each line.
    5. Exporting the Data
      • Finally, the data output now appeared ready to export. To close out, I added a command to export the output to ~/public_html/application.sql.txt
        • When attempting to retrieve the above file, I noticed I forgot to delete the first line in the output (column labels). Therefore, I wrote one last sed command and exported this corrected file to ~/public_html/application2.sql.txt
    6. Final Product
      • The final command sequence used to create the input for SQL is as follows:
cat application.txt | sed "s/,/~/g" | sed "s/  /,/g" | sed -r "s/( ){2}//g" 
| sed "s/,,/,NULL,/g;s/,,/,NULL,/g" | sed "s/ ,/,/g" | sed "s/,/,'/1;s/,/','/2;
s/,/',/3;s/False/'False'/g;s/,/,'/5;s/,/',/6;s/,/,'/7;s/,/',/8" | sed "s/,\r$/,NULL/g;
s/V\r$/'V'/g" | sed "s/'NULL'/NULL/g" | sed "s/~/,/g" | sed "s/^/insert into 
application(ApplNo, ApplType, SponsorApplicant, MostRecentLabelAvailableFlag, 
CurrentPatentFlag, ActionType, Chemical_Type, Ther_Potential, Orphan_Code) 
values (/g" | sed "s/$/);/g" | sed "1D" > ~/public_html/application2.sql.txt

Note: After reviewing the Supplementary Information section on the Week 6 assignment page, I realized that the above command sequence could be significantly condensed. The extensive sequence of commands I made to selectively input commas was unnecessary, as int and boolean values are recognized by SQL in single quotes as well. Although the condensed sequence below is still longer than I would hope, this is mostly due to small issues with the text file that were present in only around 3 lines (e.g. replacing commas in sponsor names with <comma> and deleting excess spaces at the end of a sponsor name). Overall it is a much better sequence. I even came up with some new techniques, such as how to handle inserting NULL values in one command instead of 3 commands.

cat application.txt | sed "1D" | sed "s/,/<comma>/g;s/\t/,/g;s/  //g;s/,/','/g;s/^/insert 
into application(ApplNo, ApplType, SponsorApplicant, MostRecentLabelAvailableFlag, 
CurrentPatentFlag, ActionType, Chemical_Type, Ther_potential,Orphan_Code) values 
('/g;s/\r$/');/g;s/''/NULL/g;s/<comma>/,/g;s/ '/'/g"
  • Product.txt
    1. Special Characters
      • As above, I started with a search for special characters. Unlike before, I searched for commas in addition to semicolons and apostrophes. This file contained all three.
        • I added a sed command to double up all apostrophes present in the data to escape them in SQL.
        • I learned that the semicolons would not be misinterpreted by SQL as long as they were within a set of single quotes. Thankfully, all of the semicolons were present in varchar columns, so I did not need to edit them.
        • To address the comma issue I ran into for application.txt, I wrote a sed command to replace all commas in the data with <comma>.
    2. Removing Excess Tabs and Spaces
      • As above, I replaced tabs with commas. No other excess spaces were present in this text file.
    3. Formatting the Values
      • As with application.txt, I wrote commands to addNULL where missing values were present as needed.
      • As with application.txt, I wrote commands to add single quotes around varchar columns as needed.
      • As with application.txt, I wrote a command to remove single quotes from around instances of NULL.
      • Finally, I wrote a sed command to replace <comma> with an actual comma now that the single quotes had been introduced.
    4. Writing the Values as Part of SQL Commands
      • As with application.txt., I wrote commands to add the beginning and ends of insert into table commands to each line in the text file. In this case, I made the variable names specific to this table.
    5. Exporting the Data
      • I added a command to export the output to ~/public_html/Product.sql.txt
    6. Final Product
      • The final command sequence generated is as follows:
cat Product.txt | sed "s/'/''/g" | sed "s/,/<comma>/g" | sed "s/       /,/g" 
| sed "s/,,/,NULL,/g;s/,,/,NULL,/g" | sed "s/,\r$/,NULL/g" | sed "s/,/,'/2;s/,/','/3;
s/,/',/4;s/,/,'/5;s/,/',/6;s/,/,'/7;s/,/','/8;s/\r$/'/g" | sed "s/'NULL'/NULL/g" 
| sed "s/<comma>/,/g" | sed "s/^/insert into Product(ApplNo, ProductNo, Form, Dosage,
 ProductMktStatus, TECode, ReferenceDrug, drugname, activeingred) values (/g" 
| sed "s/$/);/g" | sed "1D" > ~/public_html/Product.sql.txt

Note: As with application.txt, I went back and condensed the above command sequence after reviewing the Supplementary Information section on the Week 6 assignment page.

cat Product.txt | sed "1D" | sed "s/'/<apos>/g;s/,/<comma>/g;s/\t/,/g;s/,/','/g;s/^/insert into Product
(ApplNo, ProductNo, Form, Dosage, ProductMktStatus, TECode, ReferenceDrug, drugname,
activeingred) values ('/g;s/\r$/');/g;s/''/NULL/g;s/<comma>/,/g;s/<apos>/''/g"

Creating a Database with the Drugs@FDA Data Using SQL

  • Creating the Database
    • To begin creating a database for the extracted FDA data, I logged into pgAdminIII and created a new database entitled bklein7_week6.
  • Creating the Tables
    • Within this database, I created the following tables: application & product.
      • To create the application table, I used the following command: create table application (ApplNo int primary key, ApplType varchar, SponsorApplicant varchar, MostRecentLabelAvailableFlag boolean, CurrentPatentFlag boolean, ActionType varchar, Chemical_Type int, Ther_Potential varchar, Orphan_Code varchar)
      • To create the product table, I used the following command: create table Product (ApplNo int references application, ProductNo int, Form varchar, Dosage varchar, ProductMktStatus int, TECode varchar, ReferenceDrug int, drugname varchar, activeingred varchar)
  • Importing Data into the Tables
    • application table
      • I copied over the SQL command sequence from [1] and pasted it into the SQL editor. From there, I ran the command sequence.
    • product table
      • I copied over the SQL command sequence from [2] and pasted it into the SQL editor. From there, I ran the command sequence.
  • The data was now available in pgAdminIII.

The FDA Drug Database

Questions to Answer

  1. Provide the DDL (create table) statements that you used for your application and product tables.
    • application table: create table application (ApplNo int primary key, ApplType varchar, SponsorApplicant varchar, MostRecentLabelAvailableFlag boolean, CurrentPatentFlag boolean, ActionType varchar, Chemical_Type int, Ther_Potential varchar, Orphan_Code varchar)
    • product table: create table Product (ApplNo int references application, ProductNo int, Form varchar, Dosage varchar, ProductMktStatus int, TECode varchar, ReferenceDrug int, drugname varchar, activeingred varchar)
  2. Provide the sed command sequences that you used to convert the raw text files into sequences of SQL insert statements.
    • The following sed command sequence was used to convert the application.txt file into sequences of SQL insert statements (can be found in an easier to read format in the section above): cat application.txt | sed "s/,/~/g" | sed "s/ /,/g" | sed -r "s/( ){2}//g" | sed "s/,,/,NULL,/g;s/,,/,NULL,/g" | sed "s/ ,/,/g" | sed "s/,/,'/1;s/,/','/2;s/,/',/3;s/False/'False'/g;s/,/,'/5;s/,/',/6;s/,/,'/7;s/,/',/8" | sed "s/,\r$/,NULL/g;s/V\r$/'V'/g" | sed "s/'NULL'/NULL/g" | sed "s/~/,/g" | sed "s/^/insert into application(ApplNo, ApplType, SponsorApplicant, MostRecentLabelAvailableFlag, CurrentPatentFlag, ActionType, Chemical_Type, Ther_Potential, Orphan_Code) values (/g" | sed "s/$/);/g" | sed "1D" > ~/public_html/application2.sql.txt
    • The following sed command sequence was used to convert the Product.txt file into sequences of SQL insert statements (can be found in an easier to read format in the section above): cat Product.txt | sed "s/'/''/g" | sed "s/,/<comma>/g" | sed "s/ /,/g" | sed "s/,,/,NULL,/g;s/,,/,NULL,/g" | sed "s/,\r$/,NULL/g" | sed "s/,/,'/2;s/,/','/3;s/,/',/4;s/,/,'/5;s/,/',/6;s/,/,'/7;s/,/','/8;s/\r$/'/g" | sed "s/'NULL'/NULL/g" | sed "s/<comma>/,/g" | sed "s/^/insert into Product(ApplNo, ProductNo, Form, Dosage, ProductMktStatus, TECode, ReferenceDrug, drugname, activeingred) values (/g" | sed "s/$/);/g" | sed "1D" > ~/public_html/Product.sql.txt
    • Updated versions (as detailed in Electronic Lab Notebook):
    • application.txt: cat application.txt | sed "1D" | sed "s/,/<comma>/g;s/\t/,/g;s/ //g;s/,/','/g;s/^/insert into application(ApplNo, ApplType, SponsorApplicant, MostRecentLabelAvailableFlag, CurrentPatentFlag, ActionType, Chemical_Type, Ther_potential,Orphan_Code) values ('/g;s/\r$/');/g;s/''/NULL/g;s/<comma>/,/g;s/ '/'/g"
    • Product.txt: cat Product.txt | sed "1D" | sed "s/'/<apos>/g;s/,/<comma>/g;s/\t/,/g;s/,/','/g;s/^/insert into Product(ApplNo, ProductNo, Form, Dosage, ProductMktStatus, TECode, ReferenceDrug, drugname, activeingred) values ('/g;s/\r$/');/g;s/''/NULL/g;s/<comma>/,/g;s/<apos>/''/g"
  3. Using the command line, how can you determine the number of records in each file? Provide the command.
    • For application.txt, I used the command wc application.txt. This yielded the output 19747 147336 1615694 application.txt.
    • The above output shows that the application.txt file has a total of 19,747 lines. Because the first line is the variable labels, we subtract one. The total number of records is thus 19746
    • For Product.txt, I used the command wc Product.txt. This yielded the output 32771 369379 2856017 Product.txt.
    • The above output shows that the Product.txt file has a total of 32,771 lines. Because the first line is the variable labels, we subtract one. The total number of records is thus 32770
  4. Using SQL, how can you determine the number of records in the table corresponding to the file? Provide the SQL select statement.
    • To count the number of records using SQL, I used the command select count(*) from table.
      • For the application table, I used the command select count(*) from application and received the output 19746. This matches the number of records calculated using the command line.
      • For the product table, I used the command select count(*) from product and received the output 32770. This matches the number of records calculated using the command line.
  5. In your database, are these numbers the same or different? Explain why you think so.
    • After the command line record count was adjusted, both numbers were the same. The only difference between the two was that command line counts the variable labels whereas SQL is able to differentiate these from actual records (as they were in fact entered separately). When this was accounted for, they both accounted for all of the records as would be expected.
  6. What are the names of the drug products that are administered in the form INJECTABLE;INTRAVENOUS, SUBCUTANEOUS?
    • SQL statement: select * from product where form='INJECTABLE;INTRAVENOUS, SUBCUTANEOUS'
    • Answer: LOVENOX, VELCADE, VIDAZA, ENOXAPARIN SODIUM, ACTEMRA, AZACITIDINE.
  7. What are the names of the drug products whose active ingredient (activeingred) is ATROPINE?
    • SQL statement: select * from product where activeingred='ATROPINE'
    • Answer: ATROPEN and ATROPINE.
  8. In what forms and dosages can the drug product named BENADRYL be administered?
    • SQL statement: select * from product where drugname='BENADRYL'
    • Answer: BENADRYL can be administered in the following ways:
      • 25mg capsule taken orally
      • 50mg capsule taken orally
      • 12.5mg/5ml elixir taken orally
      • 10mg/ml injection
      • 50mg/ml injection
  9. Which drug products have a name ending in ESTROL?
    • SQL statement: select * from product where drugname~'ESTROL$'
    • Answer: DIETHYLSTILBESTROL, STILBESTROL, AND DIENESTROL.
  10. Produce a table listing all of the known values for the therapeutic_potential column in the application table and how many application records there are of each. (Side note: The therapeutic_potential codes are explained in the ReviewClass_Lookup.txt file, in case you’re interested.)
    • SQL statement: select ther_potential, count(*) from application group by ther_potential order by count desc
    • Answer: Ther potential.PNG
  11. Produce a table listing all of the known values for the chemical_type column in the application table and how many application records there are of each. (Side note: The chemical_type codes are explained in the ChemTypeLookup.txt file, in case you’re interested.)
    • SQL statement: select chemical_type, count(*) from application group by chemical_type order by count desc
    • Answer: Chemical type.PNG
  12. What are the names of the drug products that are sponsored (sponsor applicant column) by MERCK?
    • SQL statement: select sponsorapplicant, drugname from application inner join product on (application.applno=product.applno) where sponsorapplicant='MERCK' order by drugname
    • Answer: ALDOCLOR-150, ALDOCLOR-250, ALDOMET, ALDORIL 15, ALDORIL 25, ALDORIL D30, ALDORIL D50, ALPHAREDISOL, AMINOHIPPURATE SODIUM, ARAMINE, BENEMID, BLOCADREN, CANCIDAS, CHIBROXIN, CLINORIL, COGENTIN, COLBENEMID, CORTONE, CYCLAINE, DECADERM, DECADRON, DECADRON-LA, DECADRON W/ XYLOCAINE, DECASPRAY, DIUPRES-250, DIUPRES-500, DOLOBID, ELSPAR, EMEND, FLOROPRYL, FOSAMAX, FOSAMAX PLUS D, HUMORSOL, HYDELTRA-TBA, HYDELTRASOL, HYDROCORTONE, HYDRODIURIL, HYDROPRES 25, HYDROPRES 50, LERITINE, MANNITOL 25%, MAXALT, MAXALT-MLT, MEFOXIN IN DEXTROSE 5% IN PLASTIC CONTAINER, MEFOXIN IN SODIUM CHLORIDE 0.9% IN PLASTIC CONTAINER, MEVACOR, MODURETIC 5-50, NEO-HYDELTRASOL, NEODECADRON, NOROXIN, PEPCID, PEPCID PRESERVATIVE FREE, PEPCID RPD, PERIACTIN, PRIMAXIN, PRINIVIL, PRINZIDE, PROPECIA, PROSCAR, REDISOL, SINGULAIR, TIAMATE, TIMOLIDE 10-25, TRUSPOT, VIOXX, ZOCOR, ZOLINZA.
  13. Which sponsor applicant companies have the text LABS in their names and have products whose active ingredients (activeingred) include both ASPIRIN and CAFFEINE?
    • SQL statement: select sponsorapplicant, activeingred from application inner join product on (application.applno=product.applno) where sponsorapplicant like '%LABS%' and activeingred like '%ASPIRIN%CAFFEINE%'
    • Note: The final search term is inclusive of all combinations of ASPIRIN and CAFFEINE, because the active ingredients are sorted in alphabetical order.
    • Answer: ACTAVIS LABS & WATSON LABS.

Links

Assignments Pages

Individual Journal Entries

Shared Journal Entries