Using the XMLPipeDB Match Utility

From LMU BioDB 2013

This page describes how to download and use the XMLPipeDB Match Utility, which can be thought of as “grep with counting.” General notes on downloading from SourceForge, decompressing/decoding archived files, and running Java programs are also included.

Downloading from SourceForge

XMLPipeDB Match can be downloaded from XMLPipeDB’s project site on SourceForge. When you get to that site, click on the Files link in the light gray navigation bar, then find the XMLPipeDB Match folder. Click on the XMLPipeDB Match link, and you will see the currently available versions. As of this writing, the software is at version 1.1.1.

Click on the XMLPipeDB Match 1.1.1 folder. You should now see two files: xmlpipedb-match-1.1.1.tar.gz and xmlpipedb-match-1.1.1.zip. Which file to choose depends on the compression formats described in the next section.

Decompression

Downloadable software is usually provided in a compressed format—that is, the files in the software have been processed so that they take up less space, resulting in a faster download. Compression also typically “packs” any software that consists of multiple files into a single file, again making the download process simpler.

However, compressed files can’t be used “as-is;” they need to be decompressed by an appropriate application. The key word here is “appropriate”—there are different kinds of decompression, and each kind may require a different application. Each kind of decompression can be thought of as a different file format, the way Microsoft Word files are either .doc or .docx files, or how images can be in .jpg, .gif, .png, or other formats.

The XMLPipeDB Match software is available in two compression formats: .tar.gz (called a “tarball” in techie circles) and .zip. The final software is identical; the only difference is how the two choices are compressed.

.tar.gz

The .tar.gz format is actually two formats: the first one, .tar, is responsible for grouping multiple files and folders into a single file. That single file is then compressed, producing the .gz. This format is generally most readily available on Unix-flavored operating systems like Linux or Mac OS X. On the command lines of both Linux and Mac OS X, you would “extract” the files in a tarball using this command:

tar xzf filename.tar.gz
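To try the command without touching any real download, the sketch below builds a tiny tarball in a scratch directory, lists its contents, and then extracts it. The folder and file names (demo, README) are made up for illustration.

```shell
# Work in a throwaway directory so nothing important is overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# Create a small folder and pack it into a tarball.
mkdir demo
echo "hello" > demo/README
tar czf demo.tar.gz demo      # c = create, z = gzip, f = archive file name
rm -r demo

# t = list the contents of the tarball without extracting anything.
listing=$(tar tzf demo.tar.gz)
echo "$listing"

# x = extract; the demo folder reappears next to the tarball.
tar xzf demo.tar.gz
extracted=$(cat demo/README)
echo "$extracted"
```

Listing with t before extracting with x is a handy habit, since it shows which folder the files will land in.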

The same operation can also be done from the graphical user interface by double-clicking on the file’s icon. On Windows, the open-source 7-Zip and shareware WinRAR applications can handle .tar.gz and other formats.

If you happen upon a file that ends in just .gz and not .tar.gz, then it is compressed only. Command-line decompression then requires a different command:

gunzip filename.gz

Graphical user interface approaches can figure this out on their own and don’t require that you do anything different.
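One detail worth seeing for yourself is that gzip and gunzip replace the file they work on rather than making a copy. A minimal sketch, using a made-up file called sample.txt in a scratch directory:

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir"

echo "ACGT" > sample.txt
gzip sample.txt            # sample.txt is replaced by sample.txt.gz
ls                         # only sample.txt.gz is listed now

gunzip sample.txt.gz       # sample.txt.gz is replaced by sample.txt
restored=$(cat sample.txt)
echo "$restored"
```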

.zip

The ZIP format is typically more familiar to Windows users, but is supported on Linux and Mac OS X as well. Unzipping from the command line of both Linux and Mac OS X uses this command:

unzip filename.zip

Alternatively, you can double-click on the file’s icon to unzip. On Windows, the ability to unzip depends on the version of Windows that you’re using. Some versions of Windows have the capability built-in; other versions require third-party applications such as the aforementioned 7-Zip and WinRAR, as well as WinZip.

There is a gotcha that you should be aware of when using the built-in Windows unzip functionality, if it is available to you: Windows defaults to doing “live” decompression; that is, as you double-click a .zip file and navigate through its contents, the .zip file may act like a folder, but in reality it’s still a single .zip file, which Windows opens as-you-go. This works fine when you’re just looking at files, but may cause confusion when you actually want to edit or run them. To be completely sure, right-click on the .zip and make sure to Extract the files so that they become actual files on the disk.

Other Formats

Other compression formats abound, including bzip2 (.bz2 files), compress (.Z files), .rar, and .7z, to name a few. It’s generally useful to understand how to deal with files such as these, since downloads are frequently compressed in some way.

Compression

Of course, if it is possible to decompress these files, then someone must have compressed them first. Many of the utilities listed above go both ways: they can create and extract compressed files. This won’t be covered in detail here, but look things up or ask Dr. Dionisio if you’d like to know more.

Running Java

The XMLPipeDB Match Utility is written in a programming language called Java—this is worth mentioning because:

  1. you might need to install Java first (not all computers come with it “out of the box”), and
  2. many Java programs (Match included) are not invoked using the regular, double-click way, or even by a typical one-word command, the way grep and sed are.

Checking for and/or Installing Java

The computers in the Keck Lab, including my.cs.lmu.edu, all have Java already installed, so when you’re on those machines there’s nothing further to do. If you’re interested in running Java on your own computer, however, you might need to install it. Mac OS X will automatically download Java when needed, so if you use that operating system there is nothing further to do. Some Windows computers come with Java, and some don’t; to check, look under the Add/Remove Programs control panel to see if Java is there. If it isn’t, then you can download it from the Java download page. Make sure to get the Java SE Development Kit, or JDK for short. There is another version of Java called the “Java SE Runtime Environment” or “JRE.” Do not get this one; get the JDK.

Java installation then proceeds like most Windows installations: run the setup program and follow the instructions. Accept the default values unless you’re more advanced and are comfortable with doing things differently.
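On a machine with a command line, a quick way to check for Java (before or after installing) is to ask the shell whether the java and javac commands can be found. This is only a sketch: it reports what is on the command search path, and javac is used here as a rough sign that the JDK, rather than only the JRE, is present.

```shell
# java runs programs; javac (the compiler) normally comes only with the JDK.
if command -v java >/dev/null 2>&1; then
    java_status="found"
    java -version 2>&1      # the version report is written to standard error
else
    java_status="missing"
fi

if command -v javac >/dev/null 2>&1; then
    javac_status="found"
else
    javac_status="missing"
fi

echo "java: $java_status, javac: $javac_status"
```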

Running Command-Line Java Programs

Java programs depend on another program, called (surprise surprise) java. This java program is then given a file, called a .jar, that contains the actual code to be run. Thus, running this type of Java program from the command line looks like this:

java -jar <filename>.jar <any additional information>

The Match utility is run in this manner. Once you have downloaded and uncompressed the utility from SourceForge, you’ll have a folder called xmlpipedb-match-1.1.1, within which are a README file and the .jar file itself, called xmlpipedb-match-1.1.1.jar.

The README file essentially contains the same information that is on this page, though with less background. The .jar file contains the actual Match program. Match requires a text pattern like those used by grep (“regular expressions”) plus a file on which to search for this pattern. Thus, running Match looks like this:

java -jar xmlpipedb-match-1.1.1.jar <pattern> < <filename-to-search>

Note the < symbol preceding the filename to search. Think of it as an arrow indicating that the data in the file should “go into” the Match utility.
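The < symbol is a feature of the command line itself, not of Match or Java, so it can be tried with any program that reads its input. The sketch below uses wc on a made-up file called notes.txt:

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir"

printf 'line one\nline two\n' > notes.txt

# With <, the shell opens notes.txt and feeds its contents into wc.
# (tr strips the padding spaces that some versions of wc print.)
count=$(wc -l < notes.txt | tr -d ' ')
echo "$count"
```

Here wc never sees the file name, only the data, which is the same arrangement Match relies on.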

Using the Utility

Finally, we can get down to using Match :) Note that everything prior to this section is largely background and general-purpose; if you are accustomed to downloading software, and more so if you have used Java software before, what came before was probably mostly review. If you’re new to all of this, consider what came before as background knowledge that will come in handy if you download more software in the future.

As mentioned, the XMLPipeDB Match Utility is essentially “grep with counting.” The “counting” involved goes beyond what wc can do; in particular, note that wc can only count lines, words, and characters. Piping grep into wc to count text matches only works if there is one match per line. But what if there are more?
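The difference between counting lines and counting matches is easy to see on a small case. The sketch below uses a made-up one-line file that contains the pattern twice; it assumes your grep supports the -o option (the GNU and BSD versions found on Linux and Mac OS X do), which prints each match on its own line.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir"

# One line of text that contains the pattern twice.
echo "GO:0000015 and GO:0000049" > one-line.txt

# grep prints whole matching lines, so wc -l sees a single line.
lines=$(grep "GO:00000.." one-line.txt | wc -l | tr -d ' ')

# With -o, each match is printed separately, so wc -l sees two.
matches=$(grep -o "GO:00000.." one-line.txt | wc -l | tr -d ' ')

echo "lines=$lines matches=$matches"    # prints: lines=1 matches=2
```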

Further, many patterns (like “TATA...ATG...ATT...TGA”) will actually find more than one specific piece of text. For example, “TATA...ATG...ATT...TGA” matches both TATACTTATGGTTATTTATTGA and TATACAAATGGAAATTTTATGA. grep treats every match in the same way; you can’t distinguish between different specific pieces of text.

This is where the XMLPipeDB Match Utility comes in. Some of the things you will need to do involve pattern matching (like grep) but also require accurate counting of the individual matched items. This is precisely how the Match utility differs from grep.

Note, for example, how the Match utility handles the pattern “TATA...ATG...ATT...TGA” on the sample data file hs_ref_GRCh37_chr19.fa:

$ java -jar ~/bin/xmlpipedb-match-1.1.1.jar "TATA...ATG...ATT...TGA" < hs_ref_GRCh37_chr19.fa 
tatagccatggagattccatga: 1
tatacaaatggaaattttatga: 1
tatacttatggttatttattga: 1

Total unique matches: 3

(in this example, the xmlpipedb-match-1.1.1.jar has been placed in a bin folder in the user’s home directory, as indicated by the ~ shorthand)

Like grep, the Match utility finds the pattern three times. Unlike grep, Match counts how many unique matches were found and displays each one with its count. Note also how everything comes up lowercase; Match makes the assumption that searches are case-insensitive.

Here’s a sample file search in which some of the unique matches appear more than once:

$ java -jar ~/bin/xmlpipedb-match-1.1.1.jar "GO:00000.." < 493.P_falciparum.xml
go:0000015: 1
go:0000049: 1
go:0000062: 5
go:0000036: 2
go:0000059: 4
go:0000070: 1

Total unique matches: 6

The pattern requested represents an identifier for a Gene Ontology term; in particular, it matches identifiers whose first five digits are zeros. The XMLPipeDB Match Utility finds that there are six such identifiers in the file 493.P_falciparum.xml. Three of them appear just once, while the others appear twice, four times, and five times. grep would have found these too, but you would only have known that the pattern was matched 14 (or fewer) times. (Quick sanity check question: why “or fewer?”)

May as well try it:

$ grep "GO:00000.." 493.P_falciparum.xml | wc
     14      56     714

Yes, 14 times. Unfortunately, you don’t know how many repeat matches were found, and how many repeats there were. Sure, you can do it manually:

$ grep "GO:00000.." 493.P_falciparum.xml   
  <dbReference type="Go" key="17" id="GO:0000015">
  <dbReference type="Go" key="24" id="GO:0000070">
  <dbReference type="Go" key="10" id="GO:0000059">
  <dbReference type="Go" key="20" id="GO:0000059">
  <dbReference type="Go" key="19" id="GO:0000036">
  <dbReference type="Go" key="14" id="GO:0000036">
  <dbReference type="Go" key="12" id="GO:0000062">
  <dbReference type="Go" key="17" id="GO:0000059">
  <dbReference type="Go" key="14" id="GO:0000062">
  <dbReference type="Go" key="15" id="GO:0000062">
  <dbReference type="Go" key="15" id="GO:0000062">
  <dbReference type="Go" key="15" id="GO:0000062">
  <dbReference type="Go" key="15" id="GO:0000049">
  <dbReference type="Go" key="17" id="GO:0000059">

...but do you really want to? What if there are hundreds of results, and not just 14? This is the XMLPipeDB Match Utility’s raison d’être.
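For comparison, a similar tally can be roughly approximated by chaining several standard tools together. This is only a sketch and not how Match itself works: sample.xml is a made-up three-line stand-in for the real data file, grep is assumed to support -o and -i, and because grep works one line at a time it cannot find a match that is split across two lines.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir"

# A made-up stand-in for the real data file.
cat > sample.xml <<'EOF'
<dbReference type="Go" key="12" id="GO:0000062">
<dbReference type="Go" key="14" id="GO:0000062">
<dbReference type="Go" key="17" id="GO:0000015">
EOF

# -o prints each match on its own line; -i ignores case.
# tr lowercases the matches, sort groups identical ones together,
# and uniq -c collapses each group into a count followed by the text.
tally=$(grep -o -i "GO:00000.." sample.xml | tr 'A-Z' 'a-z' | sort | uniq -c)
echo "$tally"

# Each line of the tally is one unique match.
unique=$(echo "$tally" | wc -l | tr -d ' ')
echo "Total unique matches: $unique"
```

That is four programs and some careful typing to get what Match reports in a single command.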
