At the end of 2016, the French President launched a new program largely inspired by Barack Obama’s Presidential Innovation Fellows (PIF).
This program was called Entrepreneur d’Intérêt Général and the first session took place from January 2017 to October 2017.
Martin was picked for one of the challenges inside the French Ministry of the Interior, under the supervision of Daniel ANSELLEM and Fabien ANTOINE. That is where Fabien and Martin met and decided to launch matchID as an open-source project.
The challenge consisted of removing all deceased persons from the French driving licenses database.
Plain SQL, without any fuzziness, matched more than 70% of the records between the deaths database and the drivers database.
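An exact join like the SQL one can be sketched as follows. This is a minimal illustration, not matchID code; the field names (`last_name`, `first_name`, `birth_date`) are hypothetical, not the real schema of either database.

```python
# Exact matching sketch: join two person lists on a normalized identity key.
# Field names are illustrative only.

def identity_key(person):
    """Build a normalized key; exact matching fails on any spelling variation."""
    return (
        person["last_name"].strip().upper(),
        person["first_name"].strip().upper(),
        person["birth_date"],
    )

def exact_match(deaths, drivers):
    """Return driver records whose identity key appears in the deaths list."""
    dead_keys = {identity_key(p) for p in deaths}
    return [d for d in drivers if identity_key(d) in dead_keys]

deaths = [{"last_name": "Dupont", "first_name": "Jean", "birth_date": "1950-03-01"}]
drivers = [
    {"last_name": "DUPONT", "first_name": "Jean ", "birth_date": "1950-03-01"},
    {"last_name": "Dupond", "first_name": "Jean", "birth_date": "1950-03-01"},
]
matches = exact_match(deaths, drivers)
# only the first driver matches; "Dupond" is missed despite likely being the same person
```

The missed "Dupond" record shows why exact joins plateau: any typo or transcription variant breaks the key, which is what motivates the fuzzy techniques below.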
But to reach higher matching levels (in both recall and precision), we needed better data cleansing, fuzzy matching and machine learning.
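Fuzzy matching can be sketched with an edit-based similarity from the standard library. This is only an illustration of the idea, assuming a hypothetical `0.8` threshold; it is not the scoring used by matchID.

```python
import difflib

def name_similarity(a, b):
    """Normalized similarity in [0, 1] based on difflib's ratio."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_match(name, candidates, threshold=0.8):
    """Return (candidate, score) pairs whose similarity exceeds the threshold."""
    scored = [(c, name_similarity(name, c)) for c in candidates]
    return [(c, s) for c, s in scored if s >= threshold]

hits = fuzzy_match("Dupont", ["Dupond", "Durand", "Dupont-Martin"])
# catches "Dupond" despite the final-letter typo; rejects the others
```

In practice the threshold trades precision against recall, which is exactly why the evaluation discipline described later matters.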
Our first attempts to get good results were successful thanks to Data Science Studio from the company Dataiku.
But as we talked about our solution for matching people’s identities within the French administration, many people asked whether they could use it on their own datasets. In many administrations, licensed software is strictly controlled.
So we decided to rewrite the project from scratch and to create an adaptive solution for record linkage, with a focus on performance evaluation. You get it now: it’s matchID.
There is a large variety of record linkage use cases. Here is a non-exhaustive list of use cases that justify an adaptive solution:
Some questions to ask:
Depending on the answers to these questions, the solution can be very different. All use cases should be implementable with matchID, but we have only implemented a dozen of them, and each needed specific pipelines. We do our best to publish all the code, but every use case has to be anonymized first, and publishing is work in itself; we do it regularly.
We had some failures in methodology. Depending on your time, organisation and computation power, you can take various approaches.
You should first ask how to iterate quickly and evaluate your results as soon as possible, and only then add more computationally expensive algorithms depending on those first results.
Evaluation should drive every decision, and thus annotation is necessary. Annotating a hundred pairs, and progressively more depending on your use case, should be a focus of your organisation. If you aim for a precision of 99.99%, you will need more than 10k annotations. It’s time consuming, but the effort depends on what you want.
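The evaluation loop above boils down to comparing predicted links against a hand-annotated gold set. A minimal sketch, with record IDs and pairs that are purely illustrative:

```python
def precision_recall(predicted_pairs, annotated_true_pairs):
    """Compute precision and recall of predicted links against annotated truth.

    Each pair is (id_in_dataset_a, id_in_dataset_b).
    """
    predicted = set(predicted_pairs)
    gold = set(annotated_true_pairs)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# toy annotated sample: three true links, two predictions (one wrong)
gold = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}
pred = {("a1", "b1"), ("a2", "b9")}
p, r = precision_recall(pred, gold)  # p = 0.5, r = 1/3
```

The sample-size remark follows from the same arithmetic: to distinguish 99.99% precision from 99.9%, you need enough annotated pairs for errors at the 1-in-10,000 level to show up at all, hence more than 10k annotations.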
A project cannot easily live inside an administration without being open-sourced.
So we open-sourced it. To avoid any risk of divulging classified material, you will not get all of our recipes - only the best algorithms.