Why duplicates are more expensive than they look
A duplicate contact is not just a wasted send. At a direct-mail rate of around £0.80 per piece, 3,000 duplicates in a 25,000-record B2C file cost £2,400 in print and postage before you have touched the creative. For a telemarketing campaign, dialling the same number twice in a week is a compliance event waiting to happen under the Telephone Preference Service (TPS) rules. For B2B email, a contact who receives the same message from two slightly different list segments may report it as spam, which hurts your sending domain's future deliverability.
Then there is the data enrichment problem. If you run CRM appending on a file that still contains duplicates, you will hit the same contact twice, spend the per-record fee twice, and potentially overwrite a complete record with a less complete one from the duplicate row. In our experience, clients who dedup before enrichment reduce their enrichment spend by 10% to 18% compared with those who do not.
The GDPR dimension is often overlooked. Under Article 5(1)(c) of UK GDPR, personal data must be "adequate, relevant and limited to what is necessary" for the purpose. Holding ten copies of the same contact's mobile number across different tables is hard to justify as minimal. The Information Commissioner's Office (ICO) guidance on data minimisation is explicit: if you hold the same data twice without a documented reason, you are carrying unnecessary risk.
What are the five deduplication best practices?
1. Define your match rules before the job runs
The single biggest mistake in dedup projects is letting the tool decide what counts as a duplicate. You must specify the match rules in advance and document them. The standard approach for UK marketing lists is a cascade of three passes, applied in order.
The first pass uses exact email address. Two records sharing the same email address are duplicates regardless of how the name fields are spelled. Email is the strongest single key for B2B and B2C alike, though it fails when the same person appears on different files with different addresses.
The second pass applies name plus postcode. Standardise to uppercase, strip punctuation, and reduce the postcode to the outward code (the first half) before comparing, so that formatting differences and mistyped inward codes do not hide genuine matches. Because an outward code covers many addresses, compare forename (or at least initial) and surname together rather than surname alone, otherwise siblings and other household members get wrongly collapsed. In a Manchester-based B2C file, "Jane Smith, M1 3AB" and "J Smith, M1 3AX" are likely the same person if nothing else distinguishes them, but "Jane Smith, M1 3AB" and "Jane Davies, M1 3AB" at the same postcode are probably not.
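As a minimal sketch of the normalisation behind the name plus postcode pass, the Python below builds a comparison key. The function names are hypothetical, and the outward-code logic assumes a standard UK postcode whose inward code is always the final three characters.

```python
import re


def normalise_name(name: str) -> str:
    """Uppercase, strip punctuation, and collapse repeated whitespace."""
    cleaned = re.sub(r"[^\w\s]|_", "", name.upper())
    return " ".join(cleaned.split())


def outward_code(postcode: str) -> str:
    """Return the outward code (first half) of a UK postcode.

    Assumption: the inward code is always the last three characters,
    so everything before it is the outward code.
    """
    compact = re.sub(r"\s+", "", postcode.upper())
    return compact[:-3] if len(compact) > 3 else compact


def name_postcode_key(name: str, postcode: str) -> tuple[str, str]:
    """Build the exact-match key used by the name plus postcode pass."""
    return normalise_name(name), outward_code(postcode)
```

An exact key like this treats "Jane Smith" and "J Smith" as different names, so initial-versus-forename variants still need either an initial-based key or a fuzzy comparison.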
The third pass uses name plus company name for B2B files. This catches contacts who changed personal email addresses but stayed at the same employer. Standardise company suffixes first: strip "Ltd", "Limited", "PLC", "LLP", and their variants to a root name before comparing. "Acme Ltd" and "Acme Limited" are the same company; failing to normalise them means the pass misses obvious duplicates.
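The suffix normalisation could look something like the sketch below. The suffix list is illustrative rather than exhaustive, and company_root is a hypothetical helper name.

```python
import re

# Illustrative suffix list only; extend it for the variants in your own data.
_SUFFIXES = r"(?:LTD|LIMITED|PLC|LLP)"


def company_root(name: str) -> str:
    """Reduce a company name to a root form for comparison."""
    upper = name.upper().replace(".", "")          # "P.L.C." -> "PLC"
    cleaned = re.sub(r"[^\w\s]|_", " ", upper)     # other punctuation -> space
    cleaned = " ".join(cleaned.split())            # collapse whitespace
    # Remove one or more trailing legal suffixes, but never the whole name.
    return re.sub(rf"(?:\s+{_SUFFIXES})+$", "", cleaned)
```

With this, company_root("Acme Ltd") and company_root("Acme Limited") both reduce to "ACME", so the name plus company pass can compare them directly.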
A composite pass, requiring two or more keys to agree, catches edge cases the single-key passes miss and reduces false positives. The right combination depends on your data, but name plus email domain plus postcode sector is a reliable optional fourth pass for large B2C files.
2. Select survivors by completeness and recency, not row position
Once you have identified a cluster of duplicate records, you need to decide which row to keep. Row position (keeping whichever row was loaded first) is the laziest and worst-performing rule. A record imported in 2019 will frequently be less accurate than one added in 2024, and the older row may have fewer populated fields.
The correct survivor selection logic scores each record in the cluster on two dimensions. Recency: the record with the most recent last_updated or date_added timestamp scores higher. Completeness: count the number of non-null, non-empty fields and score proportionally. The record with the highest combined score becomes the survivor.
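One way to sketch that scoring in Python is shown below. The field list, the equal 50/50 weighting, and the linear recency scale within the cluster are all assumptions made for illustration; tune them to your own data.

```python
from datetime import date

# Hypothetical list of fields that count towards completeness.
FIELDS = ("email", "phone", "postcode", "job_title")


def completeness(record: dict) -> float:
    """Share of scored fields that are non-null and non-empty (0.0 to 1.0)."""
    filled = sum(1 for f in FIELDS if str(record.get(f) or "").strip())
    return filled / len(FIELDS)


def pick_survivor(cluster: list[dict]) -> dict:
    """Return the record with the highest combined recency and completeness score."""
    dates = [r["last_updated"] for r in cluster]
    oldest, newest = min(dates), max(dates)
    span = (newest - oldest).days or 1  # avoid dividing by zero when dates are equal

    def score(record: dict) -> float:
        recency = (record["last_updated"] - oldest).days / span
        return 0.5 * recency + 0.5 * completeness(record)  # assumed equal weighting

    return max(cluster, key=score)


cluster = [
    {"id": 1, "last_updated": date(2019, 3, 1), "email": "jane@example.com"},
    {"id": 2, "last_updated": date(2024, 6, 1), "email": "jane@example.com",
     "phone": "0161 000 0000", "postcode": "M1 3AB"},
]
survivor = pick_survivor(cluster)
```

Here the 2024 record wins on both dimensions. With different weights, an older but fully populated record could beat a newer sparse one, which is why the weighting deserves a documented decision.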
Where a non-survivor in the cluster holds a value the survivor lacks (for example, one record has a mobile number and the other does not), consider a merge rather than a straight survivor selection: copy the unique values from the non-survivor into the survivor before suppressing the non-survivor row. This is sometimes called "golden record" creation. It takes more engineering than a simple dedup, but on enriched CRM data it is nearly always worth the effort. See our guide on enrichment and deduplication strategy for a full worked approach.
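A golden-record merge can be sketched as a gap-fill: the survivor keeps every value it already has and only borrows values for fields it is missing. The helper names below are hypothetical, and the sketch deliberately does not try to resolve conflicting non-empty values.

```python
def _has_value(value) -> bool:
    """True when a field is neither None nor an empty or whitespace-only string."""
    return value is not None and str(value).strip() != ""


def merge_golden(survivor: dict, non_survivors: list[dict], fields: tuple) -> dict:
    """Return a copy of the survivor with gaps filled from the non-survivors.

    Existing survivor values are never overwritten; the first non-survivor
    that has a value for a missing field supplies it.
    """
    golden = dict(survivor)
    for other in non_survivors:
        for field in fields:
            if not _has_value(golden.get(field)) and _has_value(other.get(field)):
                golden[field] = other[field]
    return golden


golden = merge_golden(
    {"id": 2, "email": "jane@example.com", "mobile": ""},
    [{"id": 1, "email": "jane.old@example.com", "mobile": "07000 000000"}],
    fields=("email", "mobile"),
)
```

In this example the survivor keeps its own email and gains the mobile number from the duplicate. Ordering the non-survivors by recency before calling the function would make the most recent value win when several duplicates can fill the same gap.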
3. Flag, suppress, and log: never delete
This is the rule most often ignored by teams who want a clean database and treat deletion as the obvious outcome. Do not delete duplicate records. Flag them with a status column (for example, dedup_status = 'suppressed'), record the ID of their survivor, and log when the suppression happened and which match rule triggered it.
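As a rough in-memory sketch of flag, suppress, and log, the functions below mark a record and append an audit entry without deleting anything. The column names follow the example above; the function names and the log structure are hypothetical.

```python
from datetime import datetime, timezone


def suppress(record: dict, survivor_id: int, match_rule: str, log: list) -> None:
    """Flag a duplicate as suppressed and append an audit entry; nothing is deleted."""
    record["dedup_status"] = "suppressed"
    record["survivor_id"] = survivor_id
    log.append({
        "original_id": record["id"],
        "survivor_id": survivor_id,
        "match_rule": match_rule,
        "event": "suppressed",
        "at": datetime.now(timezone.utc).isoformat(),
    })


def reverse_suppression(record: dict, log: list) -> None:
    """Undo a wrongly applied suppression and log the reversal as its own event."""
    log.append({
        "original_id": record["id"],
        "survivor_id": record.get("survivor_id"),
        "match_rule": None,
        "event": "reversed",
        "at": datetime.now(timezone.utc).isoformat(),
    })
    record["dedup_status"] = "active"
    record["survivor_id"] = None


audit_log: list = []
duplicate = {"id": 17, "email": "jane@example.com", "dedup_status": "active"}
suppress(duplicate, survivor_id=4, match_rule="exact_email", log=audit_log)
```

Because the row and its log entry both survive, a false positive can be put back with reverse_suppression rather than rebuilt from a backup.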
The audit trail serves three purposes. First, it lets you reverse the dedup if you discover an error. False positives happen: two contacts with the same name at the same postcode who are genuinely different people. Without the audit log, you cannot recover the wrongly suppressed record.
Second, it supports Subject Access Request (SAR) responses. Under Article 15 UK GDPR, a data subject who submits a SAR is entitled to know what data you hold. If you have deleted a duplicate row that contained their historical email address, you cannot confirm or deny holding that data. The suppressed row, still present and flagged, gives you the complete picture.
Third, it supports right-to-erasure requests under Article 17 UK GDPR. When a contact asks to be erased, you must locate and delete every copy of their data, including suppressed duplicates. The audit log tells you how many rows to erase and where they are. Without it, you risk leaving orphaned copies in archived files or backup tables.
GDPR note on suppression versus erasure
Suppression and erasure are not the same thing. A suppressed record is still personal data and remains subject to all UK GDPR obligations. Suppression prevents the record from being used for marketing; it does not constitute deletion. When a right-to-erasure request arrives, suppressed records must be deleted alongside active ones, unless a legitimate retention reason (such as a legal claim) applies.
4. Run deduplication before enrichment, and re-run after file merges
The sequencing matters. Enrichment appends third-party data to your records: telephone numbers, job titles, company firmographics, email addresses. If you enrich first and dedup second, you pay the enrichment fee per record including duplicates. Worse, you may have enriched the non-survivor record with a more current phone number, then suppressed that row, discarding the update.
Run dedup, select survivors, perform the golden-record merge if applicable, then enrich. Your enrichment hit rate will be higher (fewer stale duplicates pulling down the average) and your cost per clean record will drop.
Re-running dedup after any file merge is non-negotiable. A Manchester-based financial services firm buying a vertical-specific B2B file to append to their CRM will almost always introduce duplicates: their existing prospects appear on the bought file under slightly different name formatting or with a previous email address. Running the full cascade of match rules after the merge, before the combined file goes anywhere near a mailing platform, catches those overlaps. The same applies after any CRM migration, system integration, or data import from a third-party platform.
5. Keep the audit trail GDPR-ready
The audit trail requirement has already appeared in best practice 3, but it deserves its own section because the content of the log matters as much as its existence. A minimal GDPR-ready dedup log records, per suppressed record: the original row ID, the survivor row ID, the match rule that triggered the suppression, the date and time of suppression, and the source file or system that contained the record.
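The five fields listed above map naturally onto one row per suppressed record. The sketch below, with hypothetical class and field names, writes such entries out as CSV; an in-memory buffer stands in for a file in a controlled location.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from datetime import datetime, timezone


@dataclass(frozen=True)
class DedupLogEntry:
    """One audit row per suppressed record (field names are illustrative)."""
    original_row_id: int
    survivor_row_id: int
    match_rule: str
    suppressed_at: str   # ISO 8601 timestamp
    source: str          # source file or system that contained the record


def write_log(entries: list, handle) -> None:
    """Write the entries as CSV with a header row."""
    writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(DedupLogEntry)])
    writer.writeheader()
    for entry in entries:
        writer.writerow(asdict(entry))


entry = DedupLogEntry(
    original_row_id=17,
    survivor_row_id=4,
    match_rule="exact_email",
    suppressed_at=datetime.now(timezone.utc).isoformat(),
    source="bought_file_2024_06.csv",  # hypothetical file name
)
buffer = io.StringIO()  # stands in for a file opened with newline=""
write_log([entry], buffer)
```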
That last field, source file, is particularly important for bought-in data. When a contact bought from a third-party data provider is suppressed as a duplicate of an existing CRM contact, the log must show both origins. If that contact later submits a deletion request, you must delete the record from both the CRM and any archived copy of the original bought file. Without the source field in the log, you may satisfy the deletion request for the survivor but leave the original bought record intact in a file store.
Store the audit log in a system that persists beyond the working file. A dedup log that lives only in the same database as the marketing list gets lost if the list is archived or deleted. A separate audit table, or a CSV export stored to a controlled location, is the minimum. For teams handling large consumer files, a dedicated data governance platform is worth the investment.
Deduplication best practices: comparison table
| Best practice | What it covers | Common mistake | UK B2B files | UK B2C files |
|---|---|---|---|---|
| Define match rules first | Exact email, name + postcode, name + company, composite | Letting the tool use defaults; skipping normalisation of Ltd/Limited | 5%–15% duplicates found | 8%–20% duplicates found |
| Survivor selection by score | Recency + field completeness, with golden-record merge for unique values | Defaulting to first-in or last-in row position | Applies to all sizes | Applies to all sizes |
| Flag and suppress, not delete | Suppression with audit log; reversible dedup | Hard-deleting non-survivor rows before GDPR obligations are clear | Supports accountability under UK GDPR Article 5(2) | Supports accountability under UK GDPR Article 5(2) |
| Dedup before enrichment | Sequencing to avoid double enrichment cost and field overwrites | Enriching the full file including duplicates, then deduping | Saves 10%–18% enrichment cost | Saves 10%–18% enrichment cost |
| Re-run after merges | Post-merge dedup pass on combined files | Assuming the bought file has no overlap with the existing CRM | Critical after every file merge | Critical after every file merge |
What tools should I use for deduplication?
SQL for exact-match dedup at scale
SQL window functions handle exact-match deduplication cleanly on tens of millions of records without any additional software. The standard pattern uses ROW_NUMBER() OVER (PARTITION BY match_key ORDER BY completeness_score DESC, last_updated DESC). Rows where the row number is greater than one are the non-survivors. Set their dedup_status to suppressed, write the survivor ID into a survivor_id column, log the event, and you are done. This works on PostgreSQL, SQL Server, BigQuery, and most modern databases.
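The pattern can be tried end to end with Python's built-in sqlite3 module, as in the sketch below. The table layout, the sample rows, and the extra id tie-breaker in the ORDER BY are assumptions for illustration, and the bundled SQLite needs to be version 3.25 or later for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contacts (
    id INTEGER PRIMARY KEY,
    email TEXT,
    completeness_score INTEGER,
    last_updated TEXT,
    dedup_status TEXT DEFAULT 'active',
    survivor_id INTEGER
);
CREATE TABLE dedup_log (
    original_id INTEGER, survivor_id INTEGER, match_rule TEXT, suppressed_at TEXT
);
INSERT INTO contacts (id, email, completeness_score, last_updated) VALUES
    (1, 'jane@example.com', 3, '2019-03-01'),
    (2, 'jane@example.com', 5, '2024-06-01'),
    (3, 'sam@example.com',  4, '2023-01-15');
""")

# Rank rows within each match key; FIRST_VALUE gives the survivor's id.
# The trailing "id" in ORDER BY is an added tie-breaker for determinism.
RANKED = """
SELECT id,
       ROW_NUMBER() OVER (PARTITION BY email
           ORDER BY completeness_score DESC, last_updated DESC, id) AS rn,
       FIRST_VALUE(id) OVER (PARTITION BY email
           ORDER BY completeness_score DESC, last_updated DESC, id) AS keep_id
FROM contacts
"""

# (suppressed row id, survivor id) for every row ranked below first place.
non_survivors = [(row_id, keep_id)
                 for row_id, rn, keep_id in conn.execute(RANKED).fetchall()
                 if rn > 1]

with conn:  # one transaction: flag the rows and log the events together
    conn.executemany(
        "UPDATE contacts SET dedup_status = 'suppressed', survivor_id = ? WHERE id = ?",
        [(keep_id, row_id) for row_id, keep_id in non_survivors],
    )
    conn.executemany(
        "INSERT INTO dedup_log VALUES (?, ?, 'exact_email', datetime('now'))",
        non_survivors,
    )

status = dict(conn.execute("SELECT id, dedup_status FROM contacts"))
```

Here email is the match key; on a real file the PARTITION BY column would be whichever normalised key the current pass uses.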
The limit of SQL exact-match dedup is that it cannot catch fuzzy variants: "Jon Smith" and "Jonathan Smith" at the same postcode will not match on a string equality check. For smaller files (under 100,000 records), a manual review pass after exact dedup is feasible. For larger files, you need fuzzy matching.
Python for fuzzy matching
The Python dedupe library uses a trained machine-learning model to compare record pairs and assign match probabilities. It handles name variants, address abbreviations, and other common causes of missed exact-match duplicates. The trade-off is training time: you need to label a sample of pairs as matches or non-matches before the model performs well on your specific data. Budget at least half a day for initial training on a new dataset.
For simpler fuzzy work, the rapidfuzz library provides fast string similarity scoring (Levenshtein, Jaro-Winkler, token set ratio) without the training overhead. Combine it with a blocking step (for instance, comparing only records that share the same postcode outward code) to keep runtime manageable on large files.
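The sketch below shows the blocking idea: records are grouped by outward code and only compared within a group. To keep it runnable without extra installs, it uses difflib from the Python standard library as a stand-in scorer; a rapidfuzz scorer such as fuzz.ratio, which also returns a 0 to 100 score, could be swapped into similarity. The 75-point threshold and the record layout are assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Score two strings from 0 to 100 (stdlib stand-in for a rapidfuzz scorer)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def candidate_pairs(records: list[dict], threshold: float = 75.0) -> list[tuple]:
    """Return (id_a, id_b, score) for similar names within the same block."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record["outward"]].append(record)   # blocking key: outward code

    pairs = []
    for block in blocks.values():
        # Only compare records that share a block, never across blocks.
        for a, b in combinations(block, 2):
            score = similarity(a["name"], b["name"])
            if score >= threshold:
                pairs.append((a["id"], b["id"], round(score, 1)))
    return pairs


records = [
    {"id": 1, "name": "JON SMITH", "outward": "M1"},
    {"id": 2, "name": "JONATHAN SMITH", "outward": "M1"},
    {"id": 3, "name": "JANE DAVIES", "outward": "M1"},
    {"id": 4, "name": "JON SMITH", "outward": "LS1"},
]
pairs = candidate_pairs(records)
```

Record 4 is never compared with record 1 despite the identical name, because the two sit in different blocks. That is the trade-off of blocking: far fewer comparisons, at the cost of missing duplicates whose blocking key differs.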
When to use a specialist bureau
Specialist data bureaus add value in three situations: your volumes exceed a few million records and in-house fuzzy matching is too slow; the data is dirty enough (inconsistent address formatting, missing postcodes, mixed encoding) that configuring a reliable in-house pipeline is not cost-effective; or you need a third-party audit trail for compliance purposes. For routine dedup on files under 500,000 records, a well-structured SQL or Python pipeline is almost always faster and cheaper than an outsourced job.
For an end-to-end look at combining deduplication with enrichment in a single workflow, read our article on enrichment and deduplication strategy. If your dedup reveals structural gaps in your CRM data (fields consistently missing across large record cohorts), the CRM gap analysis guide covers how to categorise and prioritise those gaps before enrichment.
How does deduplication interact with the GDPR right to erasure?
Article 17 of UK GDPR gives data subjects the right to request deletion of their personal data where the data is no longer necessary for the purpose for which it was collected, where consent has been withdrawn, or where the processing was unlawful. The interaction with deduplication is specific and often mishandled.
When a contact submits an erasure request, the obvious step is to delete their record from the active marketing list. The less obvious step is to delete the suppressed duplicate rows too. Those rows still contain personal data (name, email, address) and are still subject to UK GDPR obligations. A dedup process that created those suppressed rows without a clear audit trail of which rows belong to which individual makes the erasure response much harder to execute cleanly.
The practical answer is to store individual identifiers consistently across the audit log. If your dedup log links every suppressed row back to a person_id or a canonical email address, an erasure request can be resolved with a single parameterised query that locates all rows for that identifier across active, suppressed, and archived tables. Without that link, the erasure team must search manually, which introduces the risk of missing rows and the corresponding risk of an ICO enforcement notice.
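A minimal sketch of that lookup, using sqlite3 and a hypothetical schema in which active and suppressed rows share a contacts table and archived rows sit in archived_contacts, might look like this. The person_id values and table names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contacts (id INTEGER PRIMARY KEY, person_id TEXT, email TEXT, dedup_status TEXT);
CREATE TABLE archived_contacts (id INTEGER PRIMARY KEY, person_id TEXT, email TEXT);
INSERT INTO contacts VALUES
    (1, 'p-100', 'jane@example.com', 'active'),
    (2, 'p-100', 'jane.old@example.com', 'suppressed'),
    (3, 'p-200', 'sam@example.com', 'active');
INSERT INTO archived_contacts VALUES (10, 'p-100', 'jane@example.com');
""")

# One parameterised query covering active, suppressed, and archived rows.
FIND_ALL_ROWS = """
SELECT 'contacts' AS source_table, id, dedup_status
FROM contacts WHERE person_id = :pid
UNION ALL
SELECT 'archived_contacts', id, 'archived'
FROM archived_contacts WHERE person_id = :pid
ORDER BY source_table, id
"""


def locate(person_id: str) -> list:
    """List every row held for one individual, wherever it lives."""
    return conn.execute(FIND_ALL_ROWS, {"pid": person_id}).fetchall()


def erase(person_id: str) -> int:
    """Delete every row for the individual and return how many were removed."""
    deleted = 0
    with conn:  # single transaction across both tables
        for table in ("contacts", "archived_contacts"):  # fixed list, not user input
            cursor = conn.execute(f"DELETE FROM {table} WHERE person_id = ?", (person_id,))
            deleted += cursor.rowcount
    return deleted


found = locate("p-100")
```

Running locate before erase gives the erasure team a record of what was found, which can be kept as evidence that the request was handled in full.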
One further point: suppression lists used for TPS or MPS compliance are a legitimate retention exception. A record kept solely to prevent future unsolicited contact (a "do not contact" flag) is not personal data held for marketing purposes; it is held to honour an opt-out. The ICO accepts this distinction. But the flag must contain only what is necessary to achieve that purpose: typically a hashed identifier or the minimum data needed to match future inbound records. A full record with job title, mobile number, and company revenue is not a suppression entry; it is a marketing record with a suppression flag, and it is still subject to full UK GDPR obligations.
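As an illustration of a minimal suppression entry, the sketch below stores only a keyed hash of the normalised email address and checks inbound records against it. The secret key, the function names, and the choice of HMAC-SHA256 are assumptions; whether a given hashing scheme is sufficient for your purposes is a question for your data protection lead.

```python
import hashlib
import hmac

# Hypothetical key; in practice it would come from a managed secret store.
SECRET_KEY = b"replace-with-a-managed-secret"


def suppression_key(email: str) -> str:
    """Return a keyed hash of the normalised email address."""
    normalised = email.strip().lower()
    return hmac.new(SECRET_KEY, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


# The suppression list holds hashes only: no name, job title, or phone number.
suppression_list = {suppression_key("Jane.Smith@Example.com")}


def is_suppressed(email: str) -> bool:
    """Check an inbound address against the suppression list."""
    return suppression_key(email) in suppression_list
```

Because the inbound address is normalised the same way before hashing, a bought file containing " jane.smith@example.com " still matches the stored entry.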
