Part 2: Identifying Duplicates Effectively

Given that cleansing the data can be a multi-faceted approach involving different methods, I thought it would useful to get an idea of the impact simple to more complicated cleansing techniques might have on your data. To this end, I have taken a real-life vendor table with 88,877 records and performed a series of duplicate tests using “Vendor Name” as the sole criteria.  I have employed a simple duplicate test (exact match) and also performed simple cleansing techniques (e.g. convert to all letters to caps and remove all non-alphabetic/non-numeric characters) and more complex duplicate techniques and cleansing approaches.  These are presented in each result – allowing you to see the incremental value of both the cleansing and the duplicates techniques.

1. Exact Match of Vendor Name; No Cleansing or Normalization
For the first test, I ran a simple: DUPLICATES on Vend_Name – without any data normalization or cleansing. The result was 9,435 (10.6%) duplicate records.
• Advantages: easy to perform and very few (none) False Positives
• Disadvantages: a high number of False Negatives (e.g. “ABC Ltd.” and “ABC Ltd” and “A.B.C. LTD”)

2. Exact Match of Vendor Name; Minimal Cleansing (Convert to Caps and retain only alphabetic and numeric characters (i.e. no special characters))
Before running the duplicates test, I created a new Vendor Name field called Vend_Name2.  I converted all letters to caps and removed all non-alphabetic and non-numeric characters (e.g. “123 Ltd.” became “123 LTD”; “A.B. O’Leary Supplies” became “AB OLEARY SUPPLIES”; “A-1 Products” became “A1 PRODUCTS”; and “1-2-3 Step/Process Planners” became “123 STEPPLANNERS”.

I then ran: DUPLICATES on Vend_Name2 which identified 11,326 (12.74%) duplicates – a 20% increase in the number of duplicates compared to test #1.
• Advantages: easy to perform and very few False Positives
• Disadvantages: still has a high number of False Negatives (e.g. “ABC LTD” and “ABC LIMITED”)

3. Exact Match of Vendor Name after Cleansing and Normalization (Convert to Caps and retain only alphabetic and numeric characters and Normalize (e.g. “Ltd” and “Limited” became “ “))

Before running the duplicates test, I created a new Vendor Name field called Clean_VendName.  I converted all letters to caps and removed all non-alphabetic and non-numeric characters and removed all abbreviations (e.g. Ltd, Corp, Assoc, Mr, Mrs, LLB, Inc) such that (e.g. “123 Products Ltd.” became “123 PRODUCTS”; “A.B. O’Leary Supplies Corp” became “AB OLEARY SUPPLIES”; and “Mr. John Simkin Consultants” became “JOHN SIMKIN CONSULTANTS”.
I then ran: DUPLICATES on Clean_VendName which identified 15,184 (17.08%) duplicates – a 61% increase in the number of duplicates compared to no cleansing (test #1).
• Advantages: decreased number of False Negatives (identified “ABC Ltd” and “ABC LIMITED” as duplicates; and “1-2-3 Abc Products Ltd.” and “123/ABC Products Limited” as duplicates)
• Disadvantages: still has False Negatives because of lack of consistency in name (e.g. “Dr. John Smith” and “Smith John Dr”) or spacing (e.g. “ABC Limited” and “A B C Limited”)

4. Exact Match of Vendor Name after Sorting the Cleansed and Normalization alphabetically by word within name (e.g. “Smith, John” and “John Smith” became “JOHN SMITH”)

Before running the duplicates test, I created a new Vendor Name field called Sort_Clean_VendName by sorting the Clean_VendName alphabetically by the words in the name.  The original Vendor names “John Smith”; “Mr. John Smith”; “John Smith”; “Smith John Mr”; and “Smith John Ltd” all became “JOHN SMITH”.  I then ran: DUPLICATES on Sort_Clean_VendName which identified 15,288 (17.20%) duplicates – a 62% increase in the number of duplicates compared to no cleansing (test #1), but only a slight increase over the cleansed by non-sorted duplicates test.

Note: most of the additional 104 duplicates found were personal names rather than company names (e.g. “John Smith” and “Smith, John”) and these were typically employee names for payment of T&E expenses.
• Advantages: decreased number of False Negatives, but created some possible False Positives like “Allen Gerard” (last name ‘Gerard’) and “Gerard Allen” (Last name ‘Allen’) were identified as duplicates
• Disadvantages: still has False Negatives because of spelling errors in name (e.g. “John Smith” and “Jonh Smith”) or spacing (e.g. “ABX Supplies “ which became “ABX SUPPLIES” and “A B X Supplies” which became “A B SUPPLIES X” were not identified as duplicates.)

5. Exact Match of Vendor Name after Sorting the Cleansed and Normalization alphabetically by letter/number within name (e.g. “Smith, John” and “John Smith” both became “HHIJMNOST”)

Before running the duplicates test, I created a new Vendor Name field called AlphaNum_Sort_Clean_VendName by sorting the Clean_VendName alphabetically by the letters/number in the name.  The original Vendor names “Mr. J.A. Smith” and “J A Smith” and “Smith J.A. Mr” and “Smith JA Ltd” and “J A Smith” all became “AHIJMST”; and “A-1 RC Supplies Corp” and “A1 R C Supplies Ltd” both became “1ACEILPPRSSU”.

I then ran: DUPLICATES on AlphaNum_Sort_Clean_VendName which identified 15,962 (17.96%) duplicates – a 69% increase in the number of duplicates compared to no cleansing (test #1).

• Advantages: decreased number of False Negatives but created a few False Positive like “246135 INDUSTRIES” and “135246 INDUSTRIES” both became “123456DEIINRSSTU”; and “KIM E SMITH” and “MIKE SMITH” both became “EHIIKMMST”
• Disadvantages: still has False Negatives because of spelling errors in name (e.g. “John Smith” and “Jonh Smith”)

6. Metaphone3 – Exact Match of the metaphone value of Vendor Name after Sorting the Cleansed and Normalization alphabetically by word within the vendor name.

Before running the duplicates test, I used the field called Sort_Clean_VendName and ran a Java script to convert the field values to the metaphone value called Metaph_VendName (e.g. “JOHN SMITH” and “JONH SMITH” both became “JNSMO”).
Note: Metaphone3 is similar to Soundex but more comprehensive and powerful
I then ran: DUPLICATES on Metaph_VendName which identified 21, 918 (24.66%) duplicates – a 132% increase in the number of duplicates compared to no cleansing (test #1).
• Advantages: decreased number of False Negatives; increase number of True Positives (e.g. “John Smith” and “Jon Smith”; and “Allan Smith”, Allen Smith”, “Alan Smith”, and “Alan Smith”)
• Disadvantages: created a many False Positive particularly where the sorted clean name was short (e.g. “ABC”). To address this issues, I added a filter that removed all duplicates where the length of the Meta_VendName was <= 3. This identified 19,802 (22.28%) duplicates – a 109% increase over test #1 and had less False Positives.

7. Dice Coefficient – Non-Exact Match of the Vendor Name after Sorting the Cleansed and Normalization alphabetically by word within the vendor name.

I used the field called Sort_Clean_VendName and instead of performing a Duplicate test, I joined the Vendor master file to a cop[y of the file.
I created a key in both files that was equal to “1” and performed a JOIN Many-to-Many on this key. This meant every record was joined to every other record. This would produce 88878 * 88788 records (almost 7.9B records). However, I added a filter requiring that that matched records have a Dice Coefficient of 95%. This means that the two vendor names have to be 95% identical. For example, the Dice Coefficient of “MARGURITE JOHNSTONE” and “MARGURITE JONSTONE” is 0.914, meaning that the two strings are 91.4% identical. In this case the only difference is the “H” in JOHNSTONE).

The Dice Coefficient JOIN identified 130,788 (147.16%) duplicates (more duplicates than original records?).  This is because each record identified itself as a duplicate; also if there were 3 records with the same vendor name – the JOIN Many produced 3 * 3 = 9 duplicates.  To address this, I added a filter that specified that the Sort_Clean_VendName in table1 was not equal to the Sort_Clean_VendName in table2. This identified 1,412 duplicates. This may not seem like many duplicates compared to previous tests, but they are likely to be duplicates that are not an exact match. All other tests, except Metaphone, required there to be an exact match on the name.  The Dice Coefficient identified these plus duplicates where the name was not an exact match (e.g. “JOHNSTONE MANUFACTURING” and JOHNSTON MANUFACTURING”; “ABC SERVICES AND PRODUCTS” and “ABC SERVICE AND PRODUCTS”.
• Advantages: decreased number of False Negatives
• Disadvantages: the processing time is extremely long – (almost 9 hours) to compare the 7.9B possible matches – and may not justify the identification of the additional duplicates.

Conclusion
I hope that I have given you enough information to draw your own conclusion based on your objective, impact and risk tolerance for duplicates.  Typically, for my duplicate analysis I certainly want more than a simple duplicate test (test #1) before cleansing and normalizing the data.  Given the increase in the number of identified duplicates, at a minimum, I would Cleanse (Test #2) (e.g. all caps and remove special characters), Normalization (test #3) (e.g. change “Ltd” and “Limited” to “ “), and perform an Alpha_Numeric Sort (test #5) on the Name (e.g. “Smith, John” and “John Smith” both become “HHIJMNOST”).  I would also likely use Metaphone (test #6) – which will identify many of the same duplicates as Dice Coefficient, but was much faster (10 minutes versus 9 hours). However, unless the impact of having duplicates was high, I would not use Dice Coefficient (test #7).  I hope you enjoyed and benefited from this practical exercise in finding duplicate Vendors based on Name. A similar approach can be used to look for duplicate employee names, duplicate addresses (after normalizing “Street” and “St”; “Road” and Rd”; “E” and “East”; etc.) and other criteria. The steps are: cleanse, normalize, sort, and use Metaphone.

 

David Coderre is President of Computer Assisted Analysis Techniques and Solutions (CAATS). DDaveavid Coderre was awarded a lifetime achievement award by the Canadian body of the Institute of Internal Auditors for his contributions to the field of Internal Audit and CAATS,. David Coderre has published numerous books on the topic of Internal Audit and CAATs including Internal Audit: Efficiency through Automation and Computer Aided Fraud Prevention and Detection: A Step by Step Guide.