This tutorial deals with an easy use case: matching the identities of two datasets, called
deaths (a list of dead people) and
clients (a list of our clients, including some dead people), which are both reasonably big.
An advanced tutorial is available for dealing with a more complex case (over 1 million records),
where machine learning is necessary to find a large number of matches.
We first have to do data preparation, so we'll learn here how to cook your data with recipes, using a 4- or 5-step iteration method:
Iterating through these steps will allow you to create recipes and datasets for two purposes:
In the advanced tutorial, we will go through more steps, including:
But for now, we don’t need to.
We provide a matchID tutorial server, where you can log in with your GitHub account:
WARNING: This tutorial site should not be used for big use cases, as it is a small server (its vCPUs are 3x slower than an up-to-date laptop's, but there are 8 vCPUs).
You should see this screen right after login:
We'll first have to create a project. This is basically a folder containing datasets and recipes (data transformations).
Here we’ll simply name it
Which leads to:
Now you have your first dataset:
Note the composition of the screen:
- on the left pane: the YAML code (thanks to CodeMirror) with the dataset declaration (see the sketch just below);
- on the right pane: the dataset view, as declared in the YAML on the left.
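For reference, a freshly imported dataset declaration in the left pane looks roughly like the following. This is a minimal sketch: the upload connector and the exact table value are assumptions, and the auto-generated YAML on your server may differ slightly.

```yaml
datasets:
  deaths_test_csv:
    connector: upload            # assumption: uploaded files go through an upload connector
    table: deaths_test_csv.csv   # assumption: points to the uploaded CSV file
```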
About the data: it is clean synthetic data of French civil statuses. The file is a simple CSV with no formatting or encoding problems. In the advanced tutorial you'll learn to deal with fixed-width formats and encoding problems.
Create a new recipe:
A default recipe is created with no valid dataset; just replace it with the uploaded dataset,
deaths_test_csv. As this can be done right away, we also declare a
deaths output dataset, which we'll configure after finishing the recipe.
```yaml
recipes:
  dataprep_deaths_test:
    input: deaths_test_csv  # <==== necessary to change to continue
                            # there are advanced usages of datasets, such as filtered datasets, cf the advanced docs
    output: deaths          # to be configured after the recipe
    steps:                  # this is the beginning of the recipe
      - eval:
          - matchid_id: sha1(row)  # just change the default 'new_col' to matchid_id, which will be used further on
```
Save it (
Save button or
Ctrl+S); it should display the first imported dataset, but with an additional column,
matchid_id, which is basically a hash of the row:
So now you have an interactive way to deal with your data. Every new step of the recipe adds a new transformation to your data (see the sketch below). You can find the exhaustive list of recipe steps here, and see advanced recipes in the advanced tutorial.
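For instance, steps are applied in order, top to bottom. A hypothetical second step would chain like this (the nom_upper column and the upper() expression are ours, not part of the tutorial, assuming eval expressions are row-wise Python as sha1(row) suggests):

```yaml
    steps:
      - eval:
          - matchid_id: sha1(row)        # step 1: the hash id from above
      - eval:
          - nom_upper: DCD_NOM.upper()   # step 2 (hypothetical): a further transformation
```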
We now have to create the
deaths dataset, which we earlier declared as the output dataset of the recipe.
Just create it from the menu and paste this content:
```yaml
datasets:
  deaths:
    connector: elasticsearch  # this is the predeclared connector to the dockerized elasticsearch
    table: deaths             # name of the index
```
And don't forget to save it (
Save button or
Ctrl+S).
Note that you can configure many options for an elasticsearch dataset, as illustrated in the advanced tutorial.
Once everything is configured, you can go to the recipe
dataprep_deaths_test and run it with the green button:
This run is needed to index the deaths with Elasticsearch, which will enable a recall of up to 98% in real use cases.
You can follow the job directly at the bottom, in the “Real logs”:
This should take about 45 seconds on your laptop to index the 71k rows.
The last line of the job log should summarize the run time and any errors of the recipe:
```
2018-04-02 21:17:39.715749 - 0:00:42.308387 - end : run - Recipe dataprep_deaths_test successfully fininshed with no error, 71404 lines processed, 71404 lines written
```
You should be able to follow the former steps with the new clients file.
Then import the dataset, clients_test_csv.
This data is quite easy: the only data preparation we need to do is to change the separator in the recipe to
sep: "\\s+|,|;" (see the sketch below).
Create the dataset
clients_x_deaths from the menu:
```yaml
datasets:
  clients_x_deaths:
    connector: elasticsearch
    table: clients_x_deaths
```
Here comes the first important part: the fuzzy match with Elasticsearch. We choose Elasticsearch here as it is quite versatile for fuzzy matching (it can perform n-gram, phonetic, and string-distance-tolerant searches).
So, now you have to match every client against the already-indexed-in-step-1 deaths.
First create a recipe named clients_deaths_matching_test:
```yaml
recipes:
  clients_deaths_matching_test:
    input: clients_test_csv
    output: clients_x_deaths
    steps:
      - join:
          type: elasticsearch
          dataset: deaths
          query:
            size: 1
            query:
              bool:
                must:
                  - match:
                      DCD_NOM: Nom
                  - match:
                      DCD_DATE_NAISSANCE: Date
```
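Note that size: 1 asks Elasticsearch to return only the single best-scoring death candidate for each client row.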
So you'll see these first matching results:
Some observations:
- keep_unmatched: true needs to be added in the
join to make clients with no match appear.
Here is a more complete search:
```yaml
recipes:
  clients_deaths_matching_test:
    input: clients_test_csv
    output: clients_x_deaths
    steps:
      - join:
          type: elasticsearch
          dataset: deaths
          keep_unmatched: True  # keeps rows with no match
          query:
            size: 1
            query:
              bool:
                must:
                  - match:
                      DCD_NOM:
                        query: Nom
                        fuzziness: auto  # tolerate fuzzy (up to 2 errors in name)
                  - match:
                      DCD_DATE_NAISSANCE: Date
                  - match:
                      DCD_PRENOMS:  # at least one token should match, with up to 2 errors
                        query: Prenom
                        fuzziness: auto
                should:
                  - match:  # if the place of birth matches it is better, but not mandatory
                      DCD_COMMUNE_NAISSANCE: Lieu
```
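As a reminder of Elasticsearch bool semantics: all must clauses are required, while should clauses are optional here and only boost the relevance score of documents that satisfy them.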
Now we see there is some noise, which we should be able to filter easily.
As we are in a recipe, we can add additional steps; we'll use them to score the distances between names, places and dates of birth. Just add these lines to the current recipe, which will help remove wrong matches:
```yaml
      - eval:  # scores
          - score_date: levenshtein_norm(hit_DCD_DATE_NAISSANCE, Date)
          - score_nom: levenshtein_norm(hit_DCD_NOM, Nom)
          - score_prenom: jw(hit_DCD_PRENOMS, Prenom)
          - score_lieu: jw(hit_DCD_COMMUNE_NAISSANCE, Lieu)
          - confiance: round(100 * score_lieu * score_nom * score_prenom * score_date)
```
This will add a
confiance column you'll be able to filter on.
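For example (illustrative numbers, not taken from the dataset): with score_date = 1.0, score_nom = 0.9, score_prenom = 0.95 and score_lieu = 0.8, we get confiance = round(100 × 0.8 × 0.9 × 0.95 × 1.0) = 68, which would pass both thresholds below.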
Depending on whether you want a new clients file enriched with death matches, or a clients_x_deaths file containing only the intersection, you should configure your recipe either to blank the lines of bad matches or to remove them:
To blank bad matches:
```yaml
      - eval:  # blank low-score lines
          - hit_DCD_COMMUNE_NAISSANCE: hit_DCD_COMMUNE_NAISSANCE if (confiance > 30) else ""
          - hit_DCD_DATE_NAISSANCE: hit_DCD_DATE_NAISSANCE if (confiance > 30) else ""
          - hit_DCD_NOM: hit_DCD_NOM if (confiance > 30) else ""
          - hit_DCD_PRENOMS: hit_DCD_PRENOMS if (confiance > 30) else ""
          - hit_matchid_id: hit_matchid_id if (confiance > 30) else ""
```
Or to filter bad matches:
```yaml
      - keep:
          where: confiance > 20
```
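Note that keep acts as a row filter: only the lines where the where condition holds are passed on to the output.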
It should take about 3 minutes on a laptop to process the 16k rows. Go to
clients_x_deaths to see the results!
We now have the results. To make them more visual, we can display them in the validation app. Simply declare the dataset in a special way to format the results:
```yaml
datasets:
  clients_x_deaths:
    connector: elasticsearch
    table: clients_x_deaths
    validation:
      columns:
        - field: matchid_id
          label: Id
          display: false
          searchable: true
        - field:
            - Nom
            - hit_DCD_NOM
          label: nom
          display: true
          searchable: true
          callBack: formatDiff
        - field:
            - Prenom
            - hit_DCD_PRENOMS
          label: prenom
          display: true
          searchable: true
          callBack: formatDiff
        - field:
            - Date
            - hit_DCD_DATE_NAISSANCE
          label: date
          display: true
          searchable: true
          callBack: formatDate
          appliedClass:
            head: head-centered
            body: has-text-centered
        - field:
            - Lieu
            - hit_DCD_COMMUNE_NAISSANCE
          label: Lieu
          display: true
          searchable: true
          callBack: coloredDiff
        - field: confiance
          label: Score
          display: true
          searchable: false
          type: score
          callBack: formatNumber
          appliedClass:
            head: head-centered
            body: has-text-centered min-column-width-100
      view:
        display: true
        column_name: view
      fields:
        operation: excluded
        names:
          - none
      scores:
        column: confiance
        range:
          - 0
          - 100
        colors:
          success: 80
          info: 60
          warning: 30
          danger: 0
        statisticsInterval: 5
        preComputed:
          decision: 55
          indecision:
            - 40
            - 65
```
Don't forget to save the configuration and to refresh the page.
Then you should have a blue
Validation button you can click to get this final display:
This first quick-and-dirty try gives quite good results for scores above 40:
Note: this configuration can be avoided if you have previously mapped your column names, as in the advanced tutorial.
For advanced results, you should get a strong environment (at least 8 vCPUs; we recommend 16 vCPUs, and the more the better), and go to the advanced tutorial, which deals with datasets of more than 1M records.