A Detailed Guide to Using Entity Resolution Tools for Enterprise Projects

Javeria Gauhar / 9 min read.
May 4, 2021

Dirty, unstructured data, a dozen or more name variations, and inconsistent field definitions across disparate sources: this can of worms is a staple occupational hazard for any data analyst working on a project involving thousands of records. And the implications are anything but ordinary:

  1. Global financial institutions were fined $5.6 billion in penalties in 2020 for failing to meet compliance regulations
  2. Poor patient matching led to a third of claims being denied at healthcare organizations surveyed by Black Book Market Research
  3. Sales representatives lose 25% of their time to bad prospect data

So, here’s the key question: Is there a better way of overcoming these problems?

Unlike entity resolution tools, which can ingest data from multiple points and find non-exact matches at unparalleled speed, resolving entities manually with complex algorithms and techniques proves to be a far more costly (not to mention exhausting) endeavor. Research from Gartner has found that bad data quality costs companies $15 million every year, especially those with operations spanning multiple territories and business units.

This detailed guide will walk you through entity resolution, how it works, why manual entity resolution is problematic for enterprises, and why opting for entity resolution tools is optimal.

What is Entity Resolution?

The book Entity Resolution and Information Quality describes entity resolution (ER) as 'determining when references to real-world entities are equivalent (refer to the same entity) or not equivalent (refer to different entities)'.

In other words, it is the process of identifying and linking multiple records to the same entity when those records are described differently, and, conversely, of distinguishing records that look alike but refer to different entities.

For example, it asks the question: are the data entries 'Jon Snow' and 'John Snowden' the same person, or are they two different people entirely?

This also applies to addresses, postal and zip codes, social security numbers, etc.

ER is done by assessing the similarity of multiple records against unique identifiers: attributes that are least likely to change over time (such as social security numbers, dates of birth, or postal codes). Finding out whether records refer to the same entity involves matching them against a shared unique identifier.

For example, records for John Oneil, Johnathan O, and Johny O'neal can all be matched to the same person through a unique identifier such as a national ID number.

ER usually consists of linking and matching data across multiple records to find possible duplicates and then removing them, which is why it is used interchangeably with terms such as:

  1. Record linking
  2. Fuzzy matching
  3. Merge/purging
  4. Entity clustering
  5. Deduplication and more

How Entity Resolution Works in Practice

There are several steps involved in an ER activity. Let’s look at these in more detail.

Ingestion

This involves bringing all data from multiple sources under one centralized view. An enterprise often has data scattered across disparate databases, CRMs, Excel files, and PDFs, with fields stored in varying formats such as strings and dates.

For example, a large mortgage and financial services company can have a central database in MySQL, claims forms data in PDF, and its homeowners list in Excel. Importing data from all these sources will help set the stage for linking records and finding duplicates.

In other cases, combining different sources into one can also mean mapping the schemas of the source databases onto one predefined schema for further processing.
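
As a rough illustration only (the file names, table, and column mappings below are hypothetical), ingestion into a single centralized view might look like this in Python with pandas:

```python
import sqlite3
import pandas as pd

# One shared schema for the centralized view (hypothetical field names).
COMMON_SCHEMA = ["full_name", "date_of_birth", "national_id", "address"]

# Source 1: a relational database extract (hypothetical warehouse.db / customers table).
customers_db = pd.read_sql(
    "SELECT name AS full_name, dob AS date_of_birth, "
    "ssn AS national_id, addr AS address FROM customers",
    sqlite3.connect("warehouse.db"),
)

# Source 2: a homeowners list kept in Excel (hypothetical file and headers).
homeowners_xls = pd.read_excel("homeowners.xlsx").rename(columns={
    "Name": "full_name", "DOB": "date_of_birth",
    "SSN": "national_id", "Address": "address",
})

# Source 3: claims data already extracted from PDFs into CSV (hypothetical).
claims_csv = pd.read_csv("claims_export.csv").rename(columns={
    "claimant": "full_name", "birth_date": "date_of_birth",
    "id_number": "national_id", "home_address": "address",
})

# Stack everything into one centralized view, remembering each row's origin.
combined = pd.concat(
    [
        customers_db.assign(source="database")[COMMON_SCHEMA + ["source"]],
        homeowners_xls.assign(source="excel")[COMMON_SCHEMA + ["source"]],
        claims_csv.assign(source="pdf_export")[COMMON_SCHEMA + ["source"]],
    ],
    ignore_index=True,
)
print(combined.head())
```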

Profiling

After the data sources are imported, the next step is to check their health and identify statistical anomalies in the form of missing or inaccurate data and casing issues (i.e., inconsistent lowercase and uppercase). Ideally, a data analyst will find the potential problem areas that need to be fixed before doing any cleaning or entity resolving.

Here a user may want to check whether fields conform to regular expressions (RegEx) that define the expected string pattern for each data field. Based on this, the user can determine how many records are unclean or don't conform to the expected format.

Doing so can help reveal crucial data statistics (see the profiling sketch after this list), including but not limited to:

  1. Presence of null values, e.g., missing email addresses in lead gen forms
  2. Number of records with leading or trailing spaces, e.g., " David Matthews "
  3. Punctuation issues, e.g., hotmail,com instead of hotmail.com
  4. Casing issues, e.g., nEW yORK, dAVID mATTHEWS, MICROSOFT
  5. Presence of letters in numbers and vice versa, e.g., TEL-516 570-9251 for a contact number and NJ43 for a state
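
A minimal profiling sketch in Python with pandas, assuming the combined DataFrame built in the ingestion step (the email column is hypothetical):

```python
import pandas as pd

EMAIL_PATTERN = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"

def profile(df: pd.DataFrame) -> dict:
    """Collect simple health statistics before any cleaning or matching."""
    names = df["full_name"].fillna("")
    report = {
        # Null values per column (e.g., missing email addresses).
        "null_counts": df.isna().sum().to_dict(),
        # Records whose name carries leading or trailing whitespace.
        "untrimmed_names": int((names != names.str.strip()).sum()),
        # Names that are entirely upper- or lower-case (casing issues).
        "casing_issues": int((names.str.isupper() | names.str.islower()).sum()),
    }
    if "email" in df.columns:  # hypothetical column
        emails = df["email"].fillna("")
        report["invalid_emails"] = int((~emails.str.match(EMAIL_PATTERN)).sum())
    return report

# Usage: print(profile(combined))
```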

Deduplication and Record Linking

Through matching, multiple records that potentially relate to the same entity are joined and the duplicates removed (deduplicated) using unique identifiers. The matching technique can vary depending on the type of field: exact, fuzzy, or phonetic.

For names, for instance, exact matching is often used where unique identifiers such as SSN or address are accurate across the entire dataset. If the unique identifiers are inaccurate or invalid, fuzzy matching proves to be a much more reliable way to pair two similar records (e.g., John Snow and Jon Snowden).

Deduplication and record linking are, in most cases, understood to be one and the same thing. However, a key difference is that the former is about detecting duplicates and consolidating them within the same dataset, while the latter is about matching the deduplicated data against other datasets or data sources.
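
As an illustration of the idea rather than any particular tool's algorithm, a basic fuzzy comparison can be sketched with Python's standard library (the records and threshold below are invented):

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity score between two normalized strings."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

records = [  # hypothetical records
    {"id": 1, "full_name": "John Snow"},
    {"id": 2, "full_name": "Jon Snowden"},
    {"id": 3, "full_name": "Mary Major"},
]

# Pairs scoring above the threshold are flagged as likely duplicates
# for merging or for manual review.
THRESHOLD = 0.75
for left, right in combinations(records, 2):
    score = similarity(left["full_name"], right["full_name"])
    if score >= THRESHOLD:
        print(f"Possible duplicate: {left['id']} and {right['id']} (score={score:.2f})")
```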

Canonicalization

Canonicalization is another key step in ER, in which entities that have multiple representations are converted into a standard form. It involves taking the most complete information as the final record and leaving out outliers or noisy data that could distort the result.
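
One simple way to canonicalize a cluster of matched records is to keep, for each field, the most complete value; below is a minimal sketch with invented records, using string length as a crude proxy for completeness:

```python
def canonicalize(cluster: list[dict]) -> dict:
    """Build a 'golden record' by keeping the most complete value per field."""
    fields = list(dict.fromkeys(field for record in cluster for field in record))
    golden = {}
    for field in fields:
        values = [record.get(field) for record in cluster if record.get(field)]
        # Longest value stands in for "most complete"; real tools use richer rules.
        golden[field] = max(values, key=lambda v: len(str(v))) if values else None
    return golden

matched = [  # hypothetical cluster already resolved to one entity
    {"full_name": "John O'Neil", "address": "12 Main St", "phone": None},
    {"full_name": "Johny O'neal", "address": "12 Main Street, Springfield", "phone": "516-570-9251"},
]
print(canonicalize(matched))
# {'full_name': "Johny O'neal", 'address': '12 Main Street, Springfield', 'phone': '516-570-9251'}
```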


Blocking

When finding matches for an entity across hundreds of thousands of records, the number of candidate pairings that must be compared can run into the millions. To avoid this problem, blocking is used to limit the potential pairings using specific business rules, so that only records sharing a blocking key are compared.
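
A minimal blocking sketch, assuming a hypothetical business rule that builds the blocking key from the first letter of the surname and the postal-code prefix:

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record: dict) -> str:
    """Hypothetical rule: first letter of surname + first three digits of postal code."""
    surname = record["full_name"].split()[-1].lower()
    return f"{surname[0]}-{record['postal_code'][:3]}"

records = [  # hypothetical records
    {"id": 1, "full_name": "John O'Neil", "postal_code": "07302"},
    {"id": 2, "full_name": "Johny O'neal", "postal_code": "07302"},
    {"id": 3, "full_name": "Mary Major", "postal_code": "90210"},
]

blocks = defaultdict(list)
for record in records:
    blocks[blocking_key(record)].append(record)

# Only records inside the same block are ever compared, which cuts the number
# of candidate pairs dramatically compared with all-against-all matching.
for key, members in blocks.items():
    for left, right in combinations(members, 2):
        print(f"Compare record {left['id']} with record {right['id']} in block {key}")
```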

Challenges of Entity Resolution

Despite the many approaches and techniques available for ER, it falls short on several fronts. These include:

1. ER Works Well Only If the Data Is Rich and Consistent

Perhaps the biggest problem of ER is that the accuracy of the matches is dependent on the data’s richness and consistency across datasets.

For instance, deterministic matching is quite straightforward. Say you have 'Mike Rogers' in database 1 and 'Mike Rogers' in database 2. Through simple record linking (or exact matching), we can easily identify that one is a duplicate of the other.

However, probabilistic matching, where similar records exist in the form of misspellings, abbreviations, or nicknames (e.g., 'Mike Rogers' in database 1 and 'Michael Rogers' in database 2), is another story. A unique identifier (such as address, SSN, or birth date) may not be consistent across the databases, and any kind of exact or deterministic matching becomes nearly impossible, especially when dealing with data in large volumes.
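
To make this concrete, here is a toy sketch (the records and identifier formats are invented) of how exact comparison on a supposedly unique identifier breaks down as soon as formatting differs, which is what pushes large projects toward probabilistic matching:

```python
def digits_only(value: str) -> str:
    """Normalize an identifier by stripping everything except digits."""
    return "".join(ch for ch in value if ch.isdigit())

db1 = {"name": "Mike Rogers", "ssn": "123-45-6789"}    # hypothetical record
db2 = {"name": "Michael Rogers", "ssn": "123456789"}   # hypothetical record

# Naive deterministic matching fails purely because of formatting.
print(db1["ssn"] == db2["ssn"])                              # False

# Normalization rescues this case, but only when the identifier is present
# and correct in both sources; otherwise probabilistic (fuzzy) matching on
# names and other fields is needed.
print(digits_only(db1["ssn"]) == digits_only(db2["ssn"]))    # True
```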

2. ER Algorithms Don’t Scale Well

Big Data enterprise projects that deal with terabytes of data in the financial, government, or healthcare industries have too much information for traditional ER, record linking, and deduplication to work properly. The business rules required to make the algorithms work consistently would have to be designed for far larger data volumes.

For example, the blocking technique used to limit mismatched pairs when finding duplicates depends on the quality of the record fields. If fields contain errors, missing values, and variations, you can end up inserting data into the wrong blocks and face a higher rate of false negatives.

3. Manual ER is Complex

It is not uncommon for enterprise companies or institutions dealing with large volumes of data to run ER projects in-house. The rationale is that they can make use of technical resources (software engineers, consultants, database administrators) without having to purchase any of the entity resolution tools available on the market.

There are a few problems with this. Firstly, entity resolution isn’t a subset of software development. Sure, there are publicly available algorithms and blocking techniques that might be useful. But in the grand scheme of things, the skills required are vastly different. The user will have to:

  1. Combine disparate unstructured and structured data sources
  2. Be aware of different types of encoding, nicknames, variations for matching accuracy
  3. Know how to entity resolve records for different use-cases
  4. Ensure different matching techniques complement one another for consistency

Finding one person who ticks all these boxes is unlikely, and even if you do, the risk of them leaving the firm can put the entire project on hold.

4 Reasons Why Entity Resolution Tools Are Better

Entity resolution tools can provide many benefits that traditional ER can’t. These include:

1. Greater Match Accuracy

Dedicated entity resolution tools that have sophisticated fuzzy matching algorithms and entity resolving capabilities in place can give far better record linking and deduplication results than common ER algorithms.

When dealing with heterogeneous datasets, finding the similarity of two records can be exceptionally difficult due to the different types of entities, encodings, formatting issues, and languages. Schema changes can also pose a problem. Healthcare organizations, for example, use both SQL and NoSQL databases, and converting all data into a pre-defined schema through schema matching and data exchange can be risky, as a lot of valuable information can be lost in the process.

Furthermore, a data analyst may have to use several string metrics to do fuzzy matching effectively, such as Levenshtein distance, Jaro-Winkler distance, and Damerau-Levenshtein distance. Incorporating all of these manually to improve match accuracy can be problematic.
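
For a sense of what implementing even one of these metrics involves, here is a plain dynamic-programming sketch of Levenshtein (edit) distance; it illustrates the idea rather than any particular tool's implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    # previous_row[j] holds the distance between the processed prefix of a and b[:j].
    previous_row = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current_row = [i]
        for j, ch_b in enumerate(b, start=1):
            substitution_cost = 0 if ch_a == ch_b else 1
            current_row.append(min(
                previous_row[j] + 1,                      # deletion
                current_row[j - 1] + 1,                   # insertion
                previous_row[j - 1] + substitution_cost,  # substitution
            ))
        previous_row = current_row
    return previous_row[-1]

print(levenshtein("Jon Snow", "John Snowden"))  # 4
```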

Entity resolution tools, on the other hand, can seamlessly link records by employing a wide range of string metrics and other algorithms to deliver higher match rates.

2. Lower Time-To-First Result

In most cases, time is critical for ER projects, especially master data management (MDM) initiatives that require a single source of truth. The information relating to an entity can change within weeks or months, which can pose serious data quality risks.

Let's say a B2B sales and marketing organization wants to run campaigns on its top-tier accounts. Ideally, it will want to make sure that its targeted prospects haven't switched jobs, changed titles, or retired before committing any marketing spend. In such cases, completing ER within a deadline is critical.

ER, if done manually, can take 6+ months, by which time many records in databases may have become obsolete or inaccurate. Entity resolution tools, however, can take half as long, and more advanced tools can deliver a time-to-first-result of 15 minutes.

3. Better Scalability

Entity resolution tools are far more adept at ingesting data from multiple points and running record linkage, deduplication, and cleansing tasks at a much larger scale. Government databases, such as those containing tax collection and census data, store millions (if not billions) of records. A government institution deciding to do ER for fraud prevention, for instance, would be constrained by manual ER approaches and algorithms: a user would be inundated by the sheer volume of data to work through, and any business rules for blocking techniques designed to minimize the number of comparisons would be futile.

Entity resolution tools, however, can not only import data from various sources but also ensure that ER efficiency remains intact across large data volumes.

4. Cost-savings

Entity resolution tools, particularly for enterprise-level applications, can require a sizable investment. Data professionals tasked with ER may be reluctant to consider them for this reason alone. They may reason that doing it manually would be far more cost-effective and improve their chances of promotion.

Although this may sound reasonable at first glance, the costs of project delays, poor matching accuracy, and labor resources may end up becoming higher than that of an ER tool.

How to Choose the Right Entity Resolution Software

Choosing the right entity resolution software is equally important, as entity resolution tools differ in their features, scope, and value. Enterprises can have data stored in a wide variety of formats and sources such as Excel, delimited files, web applications, databases, and CRMs. Entity resolution software must be capable of importing data from these disparate sources for the specific use case.


Originally published here

Categories: Big Data, Strategy
Tags: big data analytics, big data quality, big data technology

About Javeria Gauhar

Javeria Gauhar is an experienced B2B/SaaS writer specializing in writing for Data Ladder. She is also a programmer with 2 years of experience in developing, testing, and maintaining enterprise software applications.
