Merging Duplicate Records: Getting Started

Sep 23

Duplicate Records in your Marketing Applications will cause problems until they are merged.

The problem with duplicates

Most marketing applications stacks will eventually run into issues caused by the accumulation of duplicate records: duplicate Leads, Contacts and Accounts.

Duplicates records in Marketing applications lead to the following problems:

Multiple, identical touches to the same individual
Incomplete or fragmented record of interactions with an individual which undermine the ability to personalize or send relevant communication.
Misleading statistics about audience, accounts and pipeline
Incorrect routing of lead, contacts and accounts to sales teams causing duplicate sales effort and confusing interactions to the potential customer
Increased record counts which lead to increased licensing costs

Because of this, it is clearly a desired goal to clean up an existing duplicate record problem and put in controls to reduce, or prevent, their occurrence.

A general approach for managing a deduplication effort

Every dedupe effort is a custom project. This is because the components of the marketing stack vary, the customizations vary, and the business meaning of what makes a duplicate and how to merge them will vary.

Yet, over the last several years, we have managed dozens of duplicate clean-up projects. We have learned some lessons, including a few painful ones. While no two merge projects are the same, there are some general strategies that can help you tackle a dedupe project that offer a better chance of a successful outcome.

Define what is, and what is not, a duplicate record
Analyze your existing data to get counts of duplicate records, including details of the types of duplicates
Define criteria for the “winner” and what fields should, and should not, be merged
Alert and prepare other teams that may be impacted by the clean-up, and take a backup
If needed, break the clean up into several iterations that start small and get larger

1. What is duplicate data?

Answering this question is the first step in resolving problems around duplicate data, and it is sometimes not as straightforward as it may first seem. For Marketing Operations application apps, I think a good start is to define a “duplicate” as two or more records that represent the same entity. An entity is a real-world thing that the marketing application communicates with such as a person (who is a lead, a contact, a customer) or an account (a company or organization).

What can sometimes get confusing is that the same person can represent multiple entities.

Often there is an assumption that a globally unique identifier such as an email address is what represents a unique entity. While this is often true it is not always true. For example, you might have the same person, with the same email address, working on behalf of multiple companies. Or, you might have the same person with the same email address sign up as a customer with different account ids.

In these situations, a composite set of fields is required to identify a unique entity, such as email + account ID or email + company. These are sometimes called “intentional duplicates” because there are multiple records with the same email address, but they represent different entities. When sending personalized marketing communications to them, it probably important to know which company or account the communication is referencing.

2. Analyze your data to get duplicate counts

Once you have defined what a duplicate record is, it is helpful to understand the scope of the problem. Develop reports that show how many duplicate records and further categorize the records by:

Lead and Contact - CRMs usually have this distinction and this can be important when running the actual merges
If you have “intentional duplicates” (as mentioned above), break out the difference between records with the same email address vs records that are actual duplicates.
Lead Source – if possible, this can help identify where the duplicate records or originating
Accounts ownership and duplicates – a report showing the number of duplicates by account, and also a list of duplicates that are owned by more than one account can be very useful

3. Define field merging rules

When two or more records are merged, one record will be selected as the “winner”. The winning record will persist in your systems, and the losing records will be merged into the winner and then removed. Therefore, it is critical to define:

What attributes will be used to select the winner
What fields from the losing records should override the winner’s fields.

Often the winning record is selected by date of creation, with the oldest records being the winner. However, there ae usually other fields that should be considered. For example, you would want Contacts to take precedence over Leads, and you might want to take the Status of the record into account.

Once the winner is picked, by default, the winner’s fields persisted over any losing record’s fields unless the winner has empty fields. This can lead to some data errors. For example, the winner’s fields might have address fields filled in but missing the “State” field. Then, a losing record’s state field could fill in the empty state on the winner with an incorrect value.

Also, there might be some custom fields with important values on the losing record that you want to make sure are persisted onto the winning record.

Sometimes the only way to deal with these types of situations is to write some custom logic that updates the winning record after merge. If so, it is critical to understand this before the merge takes place, as the data from the losing records is removed.

We have run “trial merges” that would simulate the output of running the merge, dumping the results to a file instead of actually running the merge in your applications, so that the results could be verified without actually updating the production data.

4. Prepare other stakeholders before a merge

There are often unintended consequences that occur after running a merge on a large batch of duplicates. This is when two records are merged, it is not just their fields that get combined but their activity history as well. This is a good thing because you want the merged record to have the full history of what the visitor has done. But be aware that merging activities can have downstream effects by affecting Lead Scoring and Lead Status of the newly merged record. This in turn, might cause the newly merged record to automatically entera new Marketing campaign triggering emails to be sent, and triggering Lead assignment rules to occur in the CRM. Remember, a merge updates data in your Marketing Stack…not just a single application.

Because of this, the sales team and other stakeholders should be given a heads up before you run a merge and give them an indication of the number of leads that will be removed after the merge.

We also have learned to always take a backup of the duplicate records. In practice I have found that a simple CSV dump of all the records that are involved with the key fields that were in place prior to the merge. In this way, if needed it is possible to reset data values in case something unexpected happens.

5. Run the Merge in groups, and small increments then scale up

We have learned to always group the dedupe runs by record type:

Lead to Lead
Contact to Contact
Lead to Contact

There are esoteric reasons for this, due to the way that CRMs deal with lead conversion. We have found the merging Leads to Contact can often run into permission or validation errors during implicit lead conversion and the underlying errors are not reported back to the Marketing Application that attempted the merge. In this case, all you will know is that the merge failed, but will not know why unless you can access the CRM error logs.

Also, when performing a large dedupe, we recommend to initially run a test case of about a dozen merges. Then, check not only the merge results but also the effects of any campaigns that may have been triggered due to lead status changes.

When satisfied, run another trial with a few hundred records and again check the results. This gives you an opportunity to adjust and merge rules, to temporarily disable any automated campaigns and check in with the sales team to verify no unintended consequences are causing problems.

Once you have had a chance to examine the results of this run, and are satisfied with the results, you will feel much more confident in running the full clean up.

Avoiding Duplicates

Once you complete a deduplication clean up, you realize that it involves a lot of work and would probably not look forward to repeating it every year. This is when you realize it is worth the effort to put measure in place to avoid duplicate or to clean them up continually.

There are off the shelf solutions as well as homegrown procedures that can be put in place to keep the number of duplicates to a Minium so that a big clean-up is never required.

Getting Started

Hopefully, this article has given you a framework to get started in tackling your duplicate problems, and to surface some issues you will likely face before you run into them on your own. If you have any questions on how Helix can help clean up your data, please let us know.

Peter Liske