Identifying Duplicates Effectively

Identifying Duplicates Effectively
By Dave Coderre – Published on January 5, 2019

The concept of identifying duplicates is fairly simple: do two records have the same values?  If yes, then they are duplicates. However when dealing with names and addresses the process can be much more complicated.  It is obvious to the human mind that “Dr. John Smith” and “Smith, John Dr” are duplicates; but to the computer – and many duplicate tests approaches – these will not be identified as duplicates.  Addresses are even more challenging because of their free-form nature.  We would recognize “123 – 456 Main St E” and “123 – 456 Main Street East” and “456 Main St East #123” as duplicate addresses; but, again, the computer might not.

Dave is President of Computer Assisted Analysis Techniques and Solutions (CAATS )

If human are better, why not perform the duplicate test manually? Firstly, thousands of records would take many days to review manually.   Secondly, the order of presentation is important.  Unless the potential duplicates were listed one after the other – making comparison easier – we would not notice that record #2,950 (“Dr. John Smith”) and record #104,552 (“Smith, John Dr”) are duplicates. This means that we have to rely on the computer to assist us with our duplicate testing.  Before I discuss the merits of different approaches and algorithms for duplicate testing, I would like you to consider something: “Do you need to find all of the duplicates?” Your first reaction maybe “yes”, but if your objective is to determine if the controls preventing and detecting duplicates are working, then you only have to find ‘some’ of the duplicates. Finding 3,000 duplicates that were not prevented (entry controls) or detected (processing or output controls) allows you to conclude that the controls are not adequate and effective.  The fact that there may be an additional ‘x’ duplicates that you did not find is not relevant to the binary (Yes/No) question “Are the controls over duplicates working?”  Ideally, we would like to identify the majority of the duplicates so that you can provide a level of significance and impact to the analysis; and to support the requirement to fix the controls.  If you only found one duplicate you could legitimately say the controls over duplicates had failed, but the significance and impact of a single failure may not support the additional work required to tighten the controls.

The identification of duplicates ultimately comes down to finding a balance between the False Positives, False Negatives and True Duplicates.  The ideal situation is one where the number of False Positive (records identified as duplicates which are not duplicates) is Low; the number of False Negatives (records identified as not duplicate when they are duplicates) is Low; and the number of True Duplicates (correctly identified duplicates) is High.  This balance is directly affected by the criteria you select for the duplicates analysis. If you tighten your criteria for what constitutes a duplicate (e.g. using “Last Name, First Name and Initials” instead of “Last Name and First Name”), you will reduce the number of False Positives, but you will also increase the number of False Negatives. Conversely if you loosen the criteria (e.g. using “Last Name” instead of “Last Name and First Name”) you will get less False Negatives, but more False Positives. So, the selection of the criteria directly affects the identification of duplicates; and the number of False Positive and False Negatives.  Where the balance point exists for your organization will depend on: the objective of your test for duplicates (why are you trying to identify duplicates); the impact of having duplicates; the risk tolerance level for duplicates; and the amount of time and effort you are willing to spend on identifying and validating the duplicate test results.

After selecting your criteria, the next most important task is to cleanse the data. Cleaner and more consistent the data will produce a lower the number of False Negatives and a lower the number of False Positives. Without cleansing the data, comparing “Dr John Smith” and “Smith John Dr.” would result in a “not duplicate” assessment – a False Negative.  However, the test would be more effective if all records were cleansed and had the format “Last Name, First Name, Salutation” so both records would be “Smith, John, Dr.” and would be correctly identified as a duplicate.

So after “What criteria should I use?”, the next question should be “How can I cleanse (normalize) my criteria?”  A closely related questions is “How much time and effort should I spend on cleansing (normalizing) my data?”

If we go back to the factors that influence the response to this question, you can arrive at a better idea of how much time and effort should be expended. For example, if the objective of the duplicates analysis was to “Determine the size of a known fraud”, then the False Positive and the False Negatives would have a significant impact. Fortunately, in this situation, you are likely to be dealing with a smaller number of criteria values (e.g. an employee or a vendor) and you can manually cleanse the variations in the data (e.g. “A.B.C. Inc” “ABC Inc” “A B C Inc” and “ABC Incorporated” can all easily be manually cleansed to “ABC Inc”) producing a better duplicate result.

If the objective was to “Test controls over duplicates”, you would still want to cleanse the data to decrease the number of False Negatives. This can be done by the computer to automatically deal with variations in the data (e.g. “Ltd.” “Ltd” and “Limited” can all be normalized to “Limited”).  However, other variations can make the normalization process more difficult. Ultimately, the amount of time and effort to fully normalize the data should be kept to a minimum unless the impact of having duplicates is high.

Given that cleansing the data can be a multi-faceted approach involving different methods, I thought it would useful to get an idea of the impact simple to more complicated cleansing techniques might have on your data. To this end, I have taken a real-life vendor table with 88,877 records and performed a series of duplicate tests using “Vendor Name” as the sole criteria.  I have employed a simple duplicate test (exact match) and also performed simple cleansing techniques (e.g. convert to all letters to caps and remove all non-alphabetic/non-numeric characters) and more complex duplicate techniques and cleansing approaches.  These are presented in each result – allowing you to see the incremental value of both the cleansing and the duplicates techniques.

1. Exact Match of Vendor Name; No Cleansing or Normalization
For the first test, I ran a simple: DUPLICATES on Vend_Name – without any data normalization or cleansing. The result was 9,435 (10.6%) duplicate records.
• Advantages: easy to perform and very few (none) False Positives
• Disadvantages: a high number of False Negatives (e.g. “ABC Ltd.” and “ABC Ltd” and “A.B.C. LTD”)

2. Exact Match of Vendor Name; Minimal Cleansing (Convert to Caps and retain only alphabetic and numeric characters (i.e. no special characters))
Before running the duplicates test, I created a new Vendor Name field called Vend_Name2.  I converted all letters to caps and removed all non-alphabetic and non-numeric characters (e.g. “123 Ltd.” became “123 LTD”; “A.B. O’Leary Supplies” became “AB OLEARY SUPPLIES”; “A-1 Products” became “A1 PRODUCTS”; and “1-2-3 Step/Process Planners” became “123 STEPPLANNERS”.

I then ran: DUPLICATES on Vend_Name2 which identified 11,326 (12.74%) duplicates – a 20% increase in the number of duplicates compared to test #1.
• Advantages: easy to perform and very few False Positives
• Disadvantages: still has a high number of False Negatives (e.g. “ABC LTD” and “ABC LIMITED”)

3. Exact Match of Vendor Name after Cleansing and Normalization (Convert to Caps and retain only alphabetic and numeric characters and Normalize (e.g. “Ltd” and “Limited” became “ “))

Before running the duplicates test, I created a new Vendor Name field called Clean_VendName.  I converted all letters to caps and removed all non-alphabetic and non-numeric characters and removed all abbreviations (e.g. Ltd, Corp, Assoc, Mr, Mrs, LLB, Inc) such that (e.g. “123 Products Ltd.” became “123 PRODUCTS”; “A.B. O’Leary Supplies Corp” became “AB OLEARY SUPPLIES”; and “Mr. John Simkin Consultants” became “JOHN SIMKIN CONSULTANTS”.
I then ran: DUPLICATES on Clean_VendName which identified 15,184 (17.08%) duplicates – a 61% increase in the number of duplicates compared to no cleansing (test #1).
• Advantages: decreased number of False Negatives (identified “ABC Ltd” and “ABC LIMITED” as duplicates; and “1-2-3 Abc Products Ltd.” and “123/ABC Products Limited” as duplicates)
• Disadvantages: still has False Negatives because of lack of consistency in name (e.g. “Dr. John Smith” and “Smith John Dr”) or spacing (e.g. “ABC Limited” and “A B C Limited”)

4. Exact Match of Vendor Name after Sorting the Cleansed and Normalization alphabetically by word within name (e.g. “Smith, John” and “John Smith” became “JOHN SMITH”)

Before running the duplicates test, I created a new Vendor Name field called Sort_Clean_VendName by sorting the Clean_VendName alphabetically by the words in the name.  The original Vendor names “John Smith”; “Mr. John Smith”; “John Smith”; “Smith John Mr”; and “Smith John Ltd” all became “JOHN SMITH”.  I then ran: DUPLICATES on Sort_Clean_VendName which identified 15,288 (17.20%) duplicates – a 62% increase in the number of duplicates compared to no cleansing (test #1), but only a slight increase over the cleansed by non-sorted duplicates test.

Note: most of the additional 104 duplicates found were personal names rather than company names (e.g. “John Smith” and “Smith, John”) and these were typically employee names for payment of T&E expenses.
• Advantages: decreased number of False Negatives, but created some possible False Positives like “Allen Gerard” (last name ‘Gerard’) and “Gerard Allen” (Last name ‘Allen’) were identified as duplicates
• Disadvantages: still has False Negatives because of spelling errors in name (e.g. “John Smith” and “Jonh Smith”) or spacing (e.g. “ABX Supplies “ which became “ABX SUPPLIES” and “A B X Supplies” which became “A B SUPPLIES X” were not identified as duplicates.)

5. Exact Match of Vendor Name after Sorting the Cleansed and Normalization alphabetically by letter/number within name (e.g. “Smith, John” and “John Smith” both became “HHIJMNOST”)

Before running the duplicates test, I created a new Vendor Name field called AlphaNum_Sort_Clean_VendName by sorting the Clean_VendName alphabetically by the letters/number in the name.  The original Vendor names “Mr. J.A. Smith” and “J A Smith” and “Smith J.A. Mr” and “Smith JA Ltd” and “J A Smith” all became “AHIJMST”; and “A-1 RC Supplies Corp” and “A1 R C Supplies Ltd” both became “1ACEILPPRSSU”.

I then ran: DUPLICATES on AlphaNum_Sort_Clean_VendName which identified 15,962 (17.96%) duplicates – a 69% increase in the number of duplicates compared to no cleansing (test #1).

• Advantages: decreased number of False Negatives but created a few False Positive like “246135 INDUSTRIES” and “135246 INDUSTRIES” both became “123456DEIINRSSTU”; and “KIM E SMITH” and “MIKE SMITH” both became “EHIIKMMST”
• Disadvantages: still has False Negatives because of spelling errors in name (e.g. “John Smith” and “Jonh Smith”)

6. Metaphone3 – Exact Match of the metaphone value of Vendor Name after Sorting the Cleansed and Normalization alphabetically by word within the vendor name.

Before running the duplicates test, I used the field called Sort_Clean_VendName and ran a Java script to convert the field values to the metaphone value called Metaph_VendName (e.g. “JOHN SMITH” and “JONH SMITH” both became “JNSMO”).
Note: Metaphone3 is similar to Soundex but more comprehensive and powerful
I then ran: DUPLICATES on Metaph_VendName which identified 21, 918 (24.66%) duplicates – a 132% increase in the number of duplicates compared to no cleansing (test #1).
• Advantages: decreased number of False Negatives; increase number of True Positives (e.g. “John Smith” and “Jon Smith”; and “Allan Smith”, Allen Smith”, “Alan Smith”, and “Alan Smith”)
• Disadvantages: created a many False Positive particularly where the sorted clean name was short (e.g. “ABC”). To address this issues, I added a filter that removed all duplicates where the length of the Meta_VendName was <= 3. This identified 19,802 (22.28%) duplicates – a 109% increase over test #1 and had less False Positives.

7. Dice Coefficient – Non-Exact Match of the Vendor Name after Sorting the Cleansed and Normalization alphabetically by word within the vendor name.

I used the field called Sort_Clean_VendName and instead of performing a Duplicate test, I joined the Vendor master file to a cop[y of the file.
I created a key in both files that was equal to “1” and performed a JOIN Many-to-Many on this key. This meant every record was joined to every other record. This would produce 88878 * 88788 records (almost 7.9B records). However, I added a filter requiring that that matched records have a Dice Coefficient of 95%. This means that the two vendor names have to be 95% identical. For example, the Dice Coefficient of “MARGURITE JOHNSTONE” and “MARGURITE JONSTONE” is 0.914, meaning that the two strings are 91.4% identical. In this case the only difference is the “H” in JOHNSTONE).

The Dice Coefficient JOIN identified 130,788 (147.16%) duplicates (more duplicates than original records?).  This is because each record identified itself as a duplicate; also if there were 3 records with the same vendor name – the JOIN Many produced 3 * 3 = 9 duplicates.  To address this, I added a filter that specified that the Sort_Clean_VendName in table1 was not equal to the Sort_Clean_VendName in table2. This identified 1,412 duplicates. This may not seem like many duplicates compared to previous tests, but they are likely to be duplicates that are not an exact match. All other tests, except Metaphone, required there to be an exact match on the name.  The Dice Coefficient identified these plus duplicates where the name was not an exact match (e.g. “JOHNSTONE MANUFACTURING” and JOHNSTON MANUFACTURING”; “ABC SERVICES AND PRODUCTS” and “ABC SERVICE AND PRODUCTS”.
• Advantages: decreased number of False Negatives
• Disadvantages: the processing time is extremely long – (almost 9 hours) to compare the 7.9B possible matches – and may not justify the identification of the additional duplicates.

Conclusion
I hope that I have given you enough information to draw your own conclusion based on your objective, impact and risk tolerance for duplicates.  Typically, for my duplicate analysis I certainly want more than a simple duplicate test (test #1) before cleansing and normalizing the data.  Given the increase in the number of identified duplicates, at a minimum, I would Cleanse (Test #2) (e.g. all caps and remove special characters), Normalization (test #3) (e.g. change “Ltd” and “Limited” to “ “), and perform an Alpha_Numeric Sort (test #5) on the Name (e.g. “Smith, John” and “John Smith” both become “HHIJMNOST”).  I would also likely use Metaphone (test #6) – which will identify many of the same duplicates as Dice Coefficient, but was much faster (10 minutes versus 9 hours). However, unless the impact of having duplicates was high, I would not use Dice Coefficient (test #7).  I hope you enjoyed and benefited from this practical exercise in finding duplicate Vendors based on Name. A similar approach can be used to look for duplicate employee names, duplicate addresses (after normalizing “Street” and “St”; “Road” and Rd”; “E” and “East”; etc.) and other criteria. The steps are: cleanse, normalize, sort, and use Metaphone.

Dave Coderre – www.caats.ca

Computer Assisted Analysis Techniques and Solutions (CAATS)