Vpachec3 Week 6

From LMU BioDB 2015
Jump to: navigation, search

application text file

My partner,Kevin, and I met on Sunday at 1:30pm in Seaver 120 to begin our assignment. Downloading and unzipping the file was not a problem. The problem occurred when we had to condense the information. What took me a while to understand is how to get rid of the tabs and spaces efficiently and where to go from there. We had massive kickstart help from Anu!!! We are very grateful she was there on Sunday to help get the thought process going.


Here is the break down for each section of the command:

cat application.txt 

This opened the text file.

cat application.txt|sed "s/\t/~/g"

The added sed command replaced each tab instance with a tilde. We used a tilde as a filler because it was easier to put the tilde instead of another marker such as an asterick in terms of syntax. Now we need to get rid of the extra spaces in the line. However, we can't just use a command to get rid of the the spaces because some of the company names have spaces in them. Since the names only have one space we can make the command to get rid of spaces in pairs.

cat application.txt|sed "s/\t/~/g"|sed -r "s/( ){2}//g"

The second sed command would allow for any pairs of spaces to reduce.

I am going to jump in my explanation and explain the next four added commands (see below). I am lumping the explanation because they all are going for the same goal. The fist of the four gets rid of the remaining space in front of the tilde by replacing it with just a tilde, no space. The following step was to make sure that the columns with no answer in them (any empty entry in the columns) have null instead of an empty section. We need to do this exact same command twice because the computer will only read the groups of two if there were a greater number it would only count the first two. The last command of the four newly added makes sure that the end of line has a null if there is no information in the column for each row.

cat application.txt|sed "s/\t/~/g"|sed -r "s/( ){2}//g"| sed "s/ \~/~/g"| sed "s/~~/~null~/g"| sed "s/~~/~null~/g"| sed "s/~\r$/~null/g"


adding apostrophes to the varchar components in the file. We need the apostrophes so that they can be read as text in SQL. Booleans don't need the single quotes so we are in the clear for those. We also have to make sure that the nulls do not have apostrophes. The following pipeline shows that we need to have go by instance of each tilde to add in the apostrophe. And at the end, we had to make sure that the nulls didn't have apostrophes.

cat application.txt|sed "s/\t/~/g"|sed -r "s/( ){2}//g"| sed "s/ \~/~/g"| sed "s/~~/~null~/g"| sed "s/~~/~null~/g"| sed "s/~\r$/~null/g"| sed "s/~/~'/1"| sed "s/~/'~'/2" | sed "s/~/'~/3"| sed "s/~/~'/5"|sed "s/~/'~/6"| sed "s/~/~'/7"|sed "s/~/'~/8"|sed "s/~'null'~/~null~/g"| sed "s/\r$/'/g"| sed "s/~/~'/8"| sed "s/~'null/~null/g"


change the tilde into commas.

cat application.txt|sed "s/\t/~/g"|sed -r "s/( ){2}//g"| sed "s/ \~/~/g"| sed "s/~~/~null~/g"| sed "s/~~/~null~/g"| sed "s/~\r$/~null/g"| sed "s/~/~'/1"| sed "s/~/'~'/2" | sed "s/~/'~/3"| sed "s/~/~'/5"|sed "s/~/'~/6"| sed "s/~/~'/7"|sed "s/~/'~/8"|sed "s/~'null'~/~null~/g"| sed "s/\r$/'/g"| sed "s/~/~'/8"| sed "s/~'null/~null/g"|sed "s/~/,/g"


To get to our final command we just added the wording and syntax necessary for SQL in front of the line so that we can just copy and paste all the commands straight into SQL. http://my.cs.lmu.edu/~vpachec3/application.sql.txt

FINAL COMMAND

cat application.txt|sed "s/\t/~/g"|sed -r "s/( ){2}//g"| sed "s/ \~/~/g"| sed "s/~~/~null~/g"| sed "s/~~/~null~/g"| sed "s/~\r$/~null/g"| sed "s/~/~'/1"| sed "s/~/'~'/2" | sed "s/~/'~/3"| sed "s/~/~'/5"|sed "s/~/'~/6"| sed "s/~/~'/7"|sed "s/~/'~/8"|sed "s/~'null'~/~null~/g"| sed "s/\r$/'/g"| sed "s/~/~'/8"| sed "s/~'null/~null/g"|sed "s/~/,/g"|sed "s/^/insert into application(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_code) values(/g" |sed "s/$/);/g"  > ~/public_html/application.sql.txt

Product text file

cat Product.txt | sed "s/\t/~/g"|sed "s/~~/~null~/g"| sed "s/~/~'/2"|sed "s/~/'~'/3"|sed "s/~/'~/4"|sed "s/~/~'/7"| sed "s/~/'~'/8"|sed "s/~/,/g"|sed "s/^/insert into Product(ApplNo,ProductNo,Form,Dosage,ProductMktStatus,TECode,ReferenceDrug,drugname,activeingred) values(/g" |sed "s/\r$/);/g"  > ~/public_html/application.sql.txt

I used the same method to go about this pipe from the last pipe. However the above pipe left me with problems that need fixing. Here were the thing I needed to fix:

  1. There were two warning labels that became a problem.
  2. There were plenty of instances where there was a description similar to "5'-Phosphate" where the ' was meant to stand for prime. This was a problem because 'is also used as language so I had to take care of that.
  3. The column TECode, I had put as varchar when creating the table but I didn't put the column enteries with single quotes so that showed up as a problem.
  4. Finally, I need to have a command that got ride of the single quotes around the nulls.

Here is where I left off on Monday 10/12

Thus, after playing around with the different commands several times, I got the pipe that worked. http://my.cs.lmu.edu/~vpachec3/Product.sql.txt

FINAL COMMAND:

 cat Product.txt | sed "s/\t/~/g"|sed "s/'/\"/g"|sed "s/~~/~null~/g"| sed "s/~/~'/2"|sed "s/~/'~'/3"|sed "s/~/'~/4"|sed "s/~/~'/5"| sed s"/~/'~/6"|sed "s/~/~'/7"| sed "s/~/'~'/8"|sed "s/'null'/null/g"|sed "s/~/,/g"|sed "s/^/insert into Product(ApplNo,ProductNo,Form,Dosage,ProductMktStatus,TECode,ReferenceDrug,drugname,activeingred) values(/g" |sed "s/\r$/');/g"|sed "s/\*\*Federal Register determination that product was not discontinued or withdrawn for safety or efficacy Register notice determination that product was not discontinued or withdrawn for safety or efficacy reasons\*\*//g" > ~/public_html/Product.sql.txt


  • NOTE: I would like to acknowledge that after sending my pipe, Kevin and I realized that there was an easier way to get to the commands we need. Adding the tilde as a place holder was not needed. We used the tilde so we could count the instances so we could put in the commas and single quotes but it could have been better by replacing the tabs with commas and using the comma instances for putting in the single quotes.

Questions to Answer

1. Provide the DDL (create table) statements that you used for your application and product tables.


create table application(ApplNo int primary key, ApplType varchar, SponsorApplicant varchar, MostRecentLabelAvailableFlag boolean, CurrentPatentFlag boolean, ActionType varchar, Chemical_Type int, Ther_Potential varchar, Orphan_Code varchar)


create table Product(ApplNo int, ProductNo int, Form varchar, Dosage varchar, ProductMktStatus varchar, TECode varchar, ReferenceDrug int, drugname varchar, activeingred varchar)


2. Provide the sed command sequences that you used to convert the raw text files into sequences of SQL insert statements.


cat application.txt|sed "s/\t/~/g"|sed -r "s/( ){2}//g"| sed "s/ \~/~/g"| sed "s/~~/~null~/g"| sed "s/~~/~null~/g"| sed "s/~\r$/~null/g"| sed "s/~/~'/1"| sed "s/~/'~'/2" | sed "s/~/'~/3"| sed "s/~/~'/5"|sed "s/~/'~/6"| sed "s/~/~'/7"|sed "s/~/'~/8"|sed "s/~'null'~/~null~/g"| sed "s/\r$/'/g"| sed "s/~/~'/8"| sed "s/~'null/~null/g"|sed "s/~/,/g"|sed "s/^/insert into application(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_code) values(/g" |sed "s/$/);/g"  > ~/public_html/application.sql.txt


cat Product.txt | sed "s/\t/~/g"|sed "s/'/\"/g"|sed "s/~~/~null~/g"| sed "s/~/~'/2"|sed "s/~/'~'/3"|sed "s/~/'~/4"|sed "s/~/~'/5"| sed s"/~/'~/6"|sed "s/~/~'/7"| sed "s/~/'~'/8"|sed "s/'null'/null/g"|sed "s/~/,/g"|sed "s/^/insert into Product(ApplNo,ProductNo,Form,Dosage,ProductMktStatus,TECode,ReferenceDrug,drugname,activeingred) values(/g" |sed "s/\r$/');/g"|sed "s/\*\*Federal Register determination that product was not discontinued or withdrawn for safety or efficacy Register notice determination that product was not discontinued or withdrawn for safety or efficacy reasons\*\*//g" > ~/public_html/Product.sql.txt


3. Using the command line, how can you determine the number of records in each file? Provide the command.

Use this command:cat application.txt| wc and the first set of numbers is the number of records. For this case, you subtract one because the first line is just the column names. So there are 19746 lines which means there are 19746 records for the application text file. For the Product text file, it is 32770.


4. Using SQL, how can you determine the number of records in the table corresponding to the file? Provide the SQL select statement.

select count(*) from Product The number was the same 32770 select count(*) from application The number was the same 19746


5. In your database, are these numbers the same or different? Explain why you think so.

Initially, they would have been one off but I remember having to delete the first row in SQL because the first row is just the titles of the columns.



6. What are the names of the drug products that are administered in the form INJECTABLE;INTRAVENOUS, SUBCUTANEOUS?

select drugname from Product where form = 'INJECTABLE;INTRAVENOUS, SUBCUTANEOUS'

    1. LOVENOX
    2. VELCADE
    3. VIDAZA
    4. ENOXAPARIN SODIUM
    5. ACTEMRA
    6. AZACITIDINE


7. What are the names of the drug products whose active ingredient (activeingred) is ATROPINE?

select drugname from Product where activeingred = 'ATROPINE'

    1. ATROPEN
    2. ATROPINE


8. In what forms and dosages can the drug product named BENADRYL be administered?

select form, dosage from Product where drugname = 'BENADRYL'

    1. CAPSULE;ORAL;50MG
    2. ELIXIR;ORAL;12.5MG/5ML
    3. CAPSULE;ORAL";25MG"
    4. INJECTABLE;INJECTION;10MG/ML
    5. INJECTABLE;INJECTION";50MG/ML


9. Which drug products have a name ending in ESTROL?

select drugname from Product where drugname like '%ESTROL%'

    1. DIETHYLSTILBESTROL
    2. STILBESTROL
    3. DIENESTROL
    4. MEGESTROL ACETATE


10. Produce a table listing all of the known values for the therapeutic_potential column in the application table and how many application records there are of each. (Side note: The therapeutic_potential codes are explained in the ReviewClass_Lookup.txt file, in case you’re interested.)

select ther_potential, count (*) from application where ther_potential like '%' group by ther_potential

    1. S*;3
    2. S;3459
    3. P;928
    4. P*;26


11. Produce a table listing all of the known values for the chemical_type column in the application table and how many application records there are of each. (Side note: The chemical_type codes are explained in the ChemTypeLookup.txt file, in case you’re interested.)

I realized that I had put varchar for the chemical_type instead of int, so I had to go and change my table and to be able to do this command. After I did that, here is the command for SQL:select chemical_type, count (*) from application where chemical_type >=0 or chemical_type <0 group by chemical_type

6;87

14;34

34;9

8;27

1;1271

2;139

3;1464

24;8

7;35

10;2

23;8

4;323

5;1046

9;1


12. What are the names of the drug products that are sponsored (sponsor applicant column) by MERCK?

select drugname from product inner join application on (product.applno = application.applno) where sponsorapplicant = 'MERCK' group by drugname

    1. DECADRON
    2. HUMORSOL
    3. NEO-HYDELTRASOL
    4. PRINIVIL
    5. MAXALT-MLT
    6. HYDROCORTONE
    7. PERIACTIN
    8. PROPECIA
    9. PROSCAR
    10. CLINORIL
    11. PRINZIDE
    12. ELSPAR
    13. ALDOMET
    14. ALDORIL D30
    15. EMEND
    16. DIUPRES-500
    17. NEODECADRON
    18. ALDOCLOR-250
    19. MEFOXIN IN DEXTROSE 5% IN PLASTIC CONTAINER
    20. ARAMINE
    21. TIAMATE
    22. BLOCADREN
    23. CANCIDAS
    24. CHIBROXIN
    25. CORTONE
    26. PEPCID
    27. TRUSOPT
    28. REDISOL
    29. VIOXX
    30. FLOROPRYL
    31. ALPHAREDISOL
    32. DECADRON-LA
    33. MEFOXIN IN SODIUM CHLORIDE 0.9% IN PLASTIC CONTAINER"
    34. PEPCID RPD
    35. NOROXIN
    36. ZOCOR
    37. LERITINE
    38. DECADERM
    39. DOLOBID"
    40. MANNITOL 25%
    41. DECADRON W/ XYLOCAINE
    42. ALDORIL D50
    43. TIMOLIDE 10-25
    44. CYCLAINE
    45. HYDROPRES 25
    46. AMINOHIPPURATE SODIUM
    47. MEVACOR
    48. MODURETIC 5-50
    49. ALDORIL 25
    50. SINGULAIR
    51. COLBENEMID
    52. DIUPRES-250
    53. HYDELTRA-TBA
    54. PRIMAXIN
    55. BENEMID
    56. MAXALT
    57. FOSAMAX PLUS D
    58. HYDELTRASOL
    59. HYDROPRES 50
    60. ZOLINZA
    61. ALDORIL 15
    62. FOSAMAX
    63. DECASPRAY
    64. COGENTIN
    65. ALDOCLOR-150
    66. PEPCID PRESERVATIVE FREE


13. Which sponsor applicant companies have the text LABS in their names and have products whose active ingredients (activeingred) include both ASPIRIN and CAFFEINE?

select sponsorapplicant from application inner join product on (application.applno = product.applno) where sponsorapplicant like '%LABS%' and activeingred like '%ASPIRIN%' and activeingred like '%CAFFEINE%' group by sponsorapplicant

    1. ACTAVIS LABS UT INC
    2. WATSON LABS

Links

Vpachec3 User Page