In music or any data-related field, enriching data from an external source is a common still tedious task. External data can come from open datasets like wikipedia, web crawling (if legal) or any other data sources. The main idea is that you own locally your dataset, public or not, and you want to enrich it with public and rich data. In music, Musicbrainz and Discogs publish such datasets, that can be used for matching. As I recently worked with the RISM dataset (Répertoire International des Sources Musicale), I will use this as an example, to match person entities.
However, the algorithm I’ll build for you is source and field (and language) agnostic.
First things first
It (always) seems easy. A person in my database is the same as a person in the external source if… they share the same metadata. This is the core of our algorithm.
foreach ext_person in external_source if int_person from internal_source share_the_same_metadata int_person = ext_person # MATCHED! else # NOT MATCHED
Which metadata to compare
This is up to you, and it depends a lot of which metadata is available. In my case, it was decent to use the following metadata for classical music composers:
- Last name
- First name
- Date of birth
- Date of death
- Birth country
Generally, it is not possible to compare all int_persons to the current ext_person, so a lookup in my database is first necessary. A lookup such as
last_name=ext_person.last_name AND first_name=ext_person.first_name AND birth_date=ext_person.birth_date etc...
would work but be very limited. I will start with an OR instead of AND. It will get a very small subset of the entire database, and still big enough for interesting pruning for a second step.
Lookup function:
function lookup(ext_person) int_persons = select in my_database where last_name=ext_person.last_name OR first_name=ext_person.first_name OR birth_date=ext_person.first_date OR death_date=ext_person.death_date OR country=ext_person.country return int_persons
A matching function can be:
function matching(int_person, ext_person) if (ext_person.last_name==int_person.last_name && ext_person.first_name==int_person.first_name && ext_person.birth_date==int_person.birth_date && ext_person.death_date==int_person.death_date && ext_person.country==int_person.country) return true # MATCHED! else return false
Core algorithm updated:
foreach ext_person in external_source potential_int_persons = lookup(ext_person) foreach potential_int_persons as int_person if matching(int_person, ext_person) # MATCHED! else # NOT MATCHED
Blank values
If one field is blank from one side or the other, it means either the metadata is incomplete or it’s actually blank (applicable to death date).
In either case, we need to process them differently. If two metadata don’t match because one of them is blank, it does not necessarily means the two person are not the same…
int_person
last_name: Doe |
ext_person
last_name: Doe |
result
matched |
Two fields don’t match, but it’s the same person. Score = 3.
… on the other hand, two missing values will match, but it does not mean that they are the same.
int_person
last_name: Doe |
ext_person
last_name: Smith |
result
unmatched |
Three fields match, but it’s not the same person. Score = 3.
To match one single metadata, such a function is useful:
function matchMetadata(metadata_name, int_person, ext_person) # metadata_name can be birth_date for example if int_person.metadata_name==ext_person.metadata_name && int_person.metadata_name!=null && ext_person.metadata_name!=null # technically, this one is not needed because deduced from the two other conditions return true else return false # null==null is not considered a match
function matchDeathDate(int_person, ext_person) # this is a special case because null==null is acceptable here as maybe the person is not dead yet # a more advance script would be to check the birth_date to evaluate if the missing death_date is missing or potentially not available yet if int_person.death_date==ext_person.death_date return true # null==null is acceptable here else return false
The matching function becomes:
function matching(int_person, ext_person) if (matchMetadata(last_name, int_person, ext_person) && matchMetadata(first_name, int_person, ext_person) && matchMetadata(birth_date, int_person, ext_person) && matchDeathDate(int_person, ext_person) && matchMetadata(country, int_person, ext_person)) return true #MATCHED! else return false
Scoring matches
More realistically, it can be acceptable to not match all of the fields to consider it a match. In fact, regarding my classical music composers issue, all metadata can mismatch and still the person be the same. Let’s consider such a case (extreme, I admit) with the following composer:
Serguei Rachmaninoff
1 April 1873 in Semionovo, Russia – 28 March 1943 in Beverly Hills, USA
- Last name: Rachmaninoff can also be spelled Rachmaninow, depending on the language it has been transliterated (here English with -ff and German with -w). I wrote about such cases in my previous post : Translating Track Titles.
- First name: Similarly, first name are transliterated and translated differently according to the target language. For Russian persons, the second first name (son of …) can also be added, like in “Sergei Vasilyevich”, making the first name very different (for computers) from one source to another.
- Birth and death dates: One may think this one is accurate. It’s not! (although, it’s usually is) I missed some matches because at some point RISM used the Julian calendar rather than the Gregorian calendar… Rachmaninoff is born April 1st 1873 in the Gregorian calendar, which is March 20th 1873 in the Julian calendar… They don’t match, but we are talking about the same date here. (very deceptive, I know).
- Country: Rachmaninoff was Russian. No, wait, Soviet. No, wait, American! (he got his citizenship one month before his death). Again, sources can differ and still describe the same person. Person’s life are not always metadata-able.
This kind of confusing situations make matching a lot less reliable than it looked like on the first place.
One convenient way to work around these issues is to score each metadata matching. After comparing all metadata, we can look at the global score and then decide if the two persons are the same or not. In my experience, using scores is far more complex than it looks. Scoring, in fact, depends on the quality of the data on each side. For example, if RISM used the Julian calendar most of the time (it’s not the case), I would lower the birth and death dates matches. If the last names are expected to be exactly the same (not in my case, but it could be on many other cases), then it should be given more importance, i.e. a higher score.
The matchMetadata function now needs scores:
metadata_scores = {
"last_name": 2, # for example
"first_name": 1,
"birth_date": 1,
"death_date": 1,
"country": 1
}
and the functions use it:
function matchMetadata(metadata_name, int_person, ext_person) # metadata_name can be birth_date for example if int_person.metadata_name==ext_person.metadata_name && int_person.metadata_name!=null && ext_person.metadata_name!=null return metadata_scores[metadata_name] # now returns an int and not a bool else return 0
function matchDeathDate(int_person, ext_person) # this is a special case because null==null is acceptable here as maybe the person is not dead yet # a more advance script would be to check the birth_date to evaluate if the missing death_date is missing or potentially not available yet if int_person.metadata_name==ext_person.metadata_name return metadata_scores["death_date"] else return 0
The matching function will adds all the values and decide if it matches or not, according to a threshold:
threshold=5 function matching(ext_person, int_person) if (matchMetadata(last_name, int_person, ext_person) + matchMetadata(first_name, int_person, ext_person) + matchMetadata(birth_date, int_person, ext_person) + matchDeathDate(int_person, ext_person) + matchMetadata(country, int_person, ext_person) >= threshold) return true #MATCHED! else return false
Several matches
Depending on your threshold and scoring algorithm, it becomes possible to match one ext_person to several int_person (or the other way around). It’s generally not an expected behavior. If it happens, be sure to detect it and decide accordingly (raise a warning, cancel the matching, match it to both, etc.).
On the other hand, it can also detects that your own database has some duplicates. Use it then as a good reason to investigate further, by a human or another algorithm.
Not matched?
According to your use case, when a person does not match, you may want to add it to your own dataset (if you have the right to). If so, be extremely careful about the following things:
- Enrich your dataset such as the next lookup will possibly find it. In this way, you will also detect potential duplicates in the external source, and won’t reproduce them in your own dataset.
- Do not add all unmatched persons! According to this algorithm, some did not match because some metadata were null, or just below threshold, but still represented the same person. You take the risk to create duplicates. I encourage to have a manual look at the “unmatches” that almost make it (in this case a score of 4 for instance), and for the other “unmatches” to check if it was because metadata were really different or just missing. An option is to add the unmatched persons when the score is very low.
- Do not add entries for the sake of adding entries. If the risk of creating duplicates is too high, or if the metadata for a given ext_person is too poor, just skip it.
A create function can look like this:
lower_threshold = 2 # if the matching score is lower than that, create a new person function createNewPerson(ext_person, score) if score <= lower_threshold && isRichEnough(ext_person) # Check in this function that you have at least a last name and first name for instance addToDatabase(ext_person)
… and can be called by the matching function in the else statement:
function matching(ext_person, int_person) score = matchMetadata(last_name, int_person, ext_person) + matchMetadata(first_name, int_person, ext_person) + matchMetadata(birth_date, int_person, ext_person) + matchDeathDate(int_person, ext_person) + matchMetadata(country, int_person, ext_person) if (score >= threshold) return true else return createNewPerson(ext_person, score)
Match and enrich
On the other hand, if you matched one of your internal person, you may want to pull the extra metadata from the external source. It’s generally the reason to develop a matching algorithm in the first place. In this case, you can safely replace the null values of the int_person by the value of the the ext_person.
int_person
last_name: Doe
|
ext_person
last_name: Doe
|
result
matched MATCHED! |
enriched int_person
last_name: Doe
|
However, if the unmatched metadata are different, then you have conflicted data. There are many reasons this can happen, such as human error, formatting issues or translation issues. It’s up to you to decide to keep the original data, override it and/or raise a warning to a human.
int_person
last_name: Doe
|
ext_person
last_name: Doe
|
result
matched MATCHED! |
enriched int_person
last_name: Doe
|
The last words
Matching is a tedious task. It takes time to develop the right algorithm corresponding to your needs, and to explore the data of the external source to know what to expect.
I know it looks sexy to have the most data possible, however, it’s a very hard work to keep a huge database with qualitative data that you can trust over time. My only advice is to pay a careful attention to the data coming in at all times.