Using Microsoft Excel to Compare ID Lists

From LMU BioDB 2015
Jump to: navigation, search

Comparing lists of IDs is a frequent quality assurance task within the XMLPipeDB project, as data move from one source to another. Did all known IDs from an XML file make it to the relational database? Did all the needed IDs get copied from the relational database to the .gdb gene database file? If there are any discrepancies, what exactly are they? That is, which IDs are in one list but not the other?

For large lists of IDs, manual inspection is impractical. Fortunately, if the lists to be compared can be pasted into Microsoft Excel, the MATCH function facilitates this comparison.

Running Example

Consider the following lists of IDs:

  • AB01, AB02, AC01, AC04, AD00, AD01
  • AB01, AB02, AC02, AC03, AC04, AD00, AD02, AD03

The two lists are clearly different, and for this small size, the differences are easy to pinpoint by inspection. However, when these lists consist of thousands of IDs, automation is certainly welcome.

Initial ID Setup

The main assumption of this tutorial is that the ID lists to compare can be exported/reformatted such that they can be pasted as columns into Microsoft Excel. Accomplishing this initial setup depends on the source of the list.

Export from XMLPipeDB Match

For ID lists produced by XMLPipeDB Match, you can send the data to a text file:

   java -jar xmlpipedb-match-1.1.1.jar "PATTERN" < file-to-search.xml > id-list-file.txt

When you open the resulting file in Excel, designate the colon (:) as the delimiter. You might also want to delete the final reported totals at the end of the file.

Export from PostgreSQL

For ID lists from PostgreSQL, you can run an SQL query in pgAdmin III but instead of sending it to the window, you can send the result to a file. To do this, click on the green play button with the archaic 3.5" floppy disk icon, with tooltip "Execute query, write result to file" (encircled in red in the screenshot below).

Pg-admin-sql-export.png

Write a query that will list all of the IDs in a single column, such as:

   select value from genenametype where type = 'ordered locus';

When you click on the "Execute query, write result to file" button, instead of seeing the results, you will get a dialog box asking for the file to save. You can accept the defaults and just supply the desired filename (in the desired location). When you click on OK, that file should appear at the requested location.

Export from Microsoft Access

For ID lists from Microsoft Access, click on the table whose contents you would like to export, click on External Data from the ribbon, then click on the Excel button in the Export section. Provide the filename and location then click OK.

Copy/Import into Microsoft Excel

Once the list files are ready, you can compare them using the procedure:

  1. Open a new Microsoft Excel worksheet.
  2. Open the files that contain the ID lists that you plan to compare.
  3. Paste each ID list to compare into its own column in the worksheet. Reserve a header row so that you can label each ID column.

Our running example would look like this after these steps, including spreadsheet row and column headings:

A B
1 ID List 1 ID List 2
2 AB01 AB01
3 AB02 AB02
4 AC01 AC02
5 AC04 AC03
6 AD00 AC04
7 AD01 AD00
8 AD02
9 AD03

What MATCH Does

The workhorse function here is MATCH. MATCH takes:

  • A value to look up
  • A range within which to look for that value
  • A “match type,” which, for our purposes is zero

The result of MATCH is the position within the given range in which the looked-up value can be found, or #N/A if the value is not present in the range. There are other match types available, but those are not used here and so will not be discussed.

Using MATCH and the running example above, we can check whether the AB01 ID from ID List 1 is also in ID List 2 using this invocation of MATCH:

=MATCH(A2, B2:B9, 0)

The value we are looking up, AB01, is in spreadsheet cell A2; the range in which we are looking for AB01 is the entire span of ID List 2, or B2:B9; finally, the match type that we want is zero. Since AB01 is also in ID List 2, located in the first row of the range B2:B9, MATCH will produce 1 in this case.

Similarly, this MATCH expression:

=MATCH(A6, B2:B9, 0)

…will produce 6, since AD00 is the sixth ID in the ID List 2 column.

In contrast, this MATCH expression:

=MATCH(A4, B2:B9, 0)

...will produce #N/A, because the ID AC01 is not in ID List 2.

Overall Strategy

With this in mind, comparing ID lists is a matter of using MATCH on every ID from one column for lookup in the other column. IDs in one column that are not in the other column will produce #N/A, and a number otherwise (indicating the position of the ID within the other column).

For our running example, to identify the IDs in ID List 1 that are not in ID List 2, to the following:

  1. Label spreadsheet column C with a heading like 1 to 2 or anything that indicates the kind of lookup being done.
  2. Type the following formula into cell C2:
    =MATCH(A2, B$2:B$9, 0)
    We use the spreadsheet $ “lock” so that these rows do not change when we copy-paste this formula.
  3. Copy this formula and paste it into the rest of column C, up to the row of the last ID of ID List 1 (in this case row 7).
  4. Do the same with spreadsheet column D, this time calling it 2 to 1 to indicate that this column retrieves the locations of IDs in ID List 2 within ID List 1, if they are there.
  5. Specifically, the initial formula in cell D2 looks like this:
    =MATCH(B2, A$2:A$7, 0)
    We look up cell B2 because ID List 2 is in column B, and our lookup range is A$2:A$7 because ID List 1 occupies this range.
  6. Copy this formula and paste it into the rest of column D, up to the row of the last ID of ID List 2 (in this case row 9).

The spreadsheet, with formulas shown, should now look like this:

A B C D
1 ID List 1 ID List 2 1 to 2 2 to 1
2 AB01 AB01 =MATCH(A2, B$2:B$9, 0) =MATCH(B2, A$2:A$7, 0)
3 AB02 AB02 =MATCH(A3, B$2:B$9, 0) =MATCH(B3, A$2:A$7, 0)
4 AC01 AC02 =MATCH(A4, B$2:B$9, 0) =MATCH(B4, A$2:A$7, 0)
5 AC04 AC03 =MATCH(A5, B$2:B$9, 0) =MATCH(B5, A$2:A$7, 0)
6 AD00 AC04 =MATCH(A6, B$2:B$9, 0) =MATCH(B6, A$2:A$7, 0)
7 AD01 AD00 =MATCH(A7, B$2:B$9, 0) =MATCH(B7, A$2:A$7, 0)
8 AD02 =MATCH(B8, A$2:A$7, 0)
9 AD03 =MATCH(B9, A$2:A$7, 0)

Final Result

The results computed by the formulas should appear as follows:

A B C D
1 ID List 1 ID List 2 1 to 2 2 to 1
2 AB01 AB01 1 1
3 AB02 AB02 2 2
4 AC01 AC02 #N/A #N/A
5 AC04 AC03 5 #N/A
6 AD00 AC04 6 4
7 AD01 AD00 #N/A 5
8 AD02 #N/A
9 AD03 #N/A

Columns can be sorted to jump straight to the #N/A’s, which are the results of interest in this comparison. The #N/A means that the looked-up ID was not found in the other ID range. For example, this table tells us that:

  • The ID AB02 is in both ID lists.
  • The ID AC04 is also in both ID lists; it is the 5th ID in ID List 2 (cell C5) and the 4th ID in ID List 1 (cell D6).
  • The ID AD01 is in ID List 1 but not ID List 2 (cell C7).
  • The ID AC03 is in ID List 2 but not ID List 1 (cell D5).

Once missing IDs from each list have been identified, further ID quality assurance work may continue.

Finding Duplicates in a List of IDs using COUNTIF

  • One can use the COUNTIF function to scan a range for the number of times that a certain value appears in that range. The example below puts one COUNTIF per row, counting the ID in that row. Then you can filter for the cells whose value is > 1.
    • One wrinkle is that, for the IDs that do appear more than once, you'll get more than one count, but you should easily see that in the filter results, especially if the IDs have been sorted A-Z.
IDs	Counts    Formula
a	2         =COUNTIF(A$2:A$12,A2)
b	1         =COUNTIF(A$2:A$12,A3)
c	3         =COUNTIF(A$2:A$12,A4)      
d	1         =COUNTIF(A$2:A$12,A5)       
a	2         =COUNTIF(A$2:A$12,A6)
e	2         =COUNTIF(A$2:A$12,A7)
f	1         =COUNTIF(A$2:A$12,A8)
e	2         =COUNTIF(A$2:A$12,A9)
c	3         =COUNTIF(A$2:A$12,A10)
c	3         =COUNTIF(A$2:A$12,A11)
g	1         =COUNTIF(A$2:A$12,A12)

Filter for unique values

In Excel you can also filter for unique values; paste that phrase into the help search box and follow the instructions there.