Datasets

from csv to elasticsearch

Connectors

Connectors are a declaration to a filesystem or database.

Supported connectors are:

filesystem
Elasticsearch
PostGreSQL

SQLAlchemy is used for PostGreSQL and should could be any SQL connexion, but only PostGreSQL has at been tested yet. See the connectors roadmap for support of other databases in the future.

Default connectors are included in the initial configuration (and can me modified or removed):

upload: filesystem upload/, which is connected to the upload api
referential_data: filesystem referential_data/, used for referential data like french citycodes, countries, …
models: filesystem models/, used for storing machine learning kernels
elasticsearch: connector to the elasticsearch provided with docker
postgres: connector to the PostGreSQL provided with docker (you have to run make postgres from commandline to start the docker database)

To create other connectors, we haven’t provided an online editor for now, your have to add it manually in the conf/connectors/connectors.yml:

connectors:
  upload:
    type: filesystem
    database: upload/
    chunk: 2000                 # <==== default chunks size (number of rows) for reading and writing the datasets
  referential_data:
    type: filesystem
    database: referential_data/
    chunk: 10000
  models:
    type: filesystem
    database: models/
  elasticsearch:
    type: elasticsearch
    host: elasticsearch        # <===== hostname of an elasticsearch node, can be an array for multiple data nodes for better performance
    port: 9200                 # <===== port of elasticsearch
    thread_count: 3            # <===== default number of threads for injecting data
    chunk: 500                 # <===== default number of row for read and write
    chunk_search: 20           # <===== default number of rows for bulk search
  postgres:
    type: sql
    uri: postgres://user:password@postgres:5432 # <==== URI for SQL Alchemy
    chunk: 2000

Datasets

Datasets are a file, a set of files, a table or an index within a connector.

datasets:
  my_dataset:
    connector: upload         # <===== upload, referential_datal, elasticsearch, postgres or any custom connector
    table: my_dataset
    chunk:                    # <===== override connector's default chunk

filesystem datasets

NO EXCEL NOR OPENDOCUMENT HERE !

Note that we’ll probably never make the effort of dealing with excel (xls, xlsx) and opendocument (odt) formats as they are too rich to be processed efficiently and with stability. Even if formatting is cute, it is meaning-less in an automation world, where formatting has to be carried by metadatas, which have to be data by themselves.

If you want to be serious about re-producing data transformation you have to forget about editing manually data without making it auditable, and self-certification is a wrong security pattern in a globally connected world.

So you’ll have to meet us half way and export your excel files in csv, and know what is your business process to version your data.

CSV

Comma-separated files have been the classical I/O in opendata for many years. Even if it’s quite a mess for data types, it is a go-between, between strongly structured format like XML for IT specialists and the mainstream office excel-style.

If you’re familiar with using many sources you know CSV is not standard and that most guesser can’t guess all : separator, encoding, escaping, etc. We didn’t have time to build a top-of Pandas for guessing, and prefered to enable you to set manual options to handle all the cases.

For robust and stable processing, we desactivated every guessing of types, so every cell is string or unicode in input: casting data types will be possible next within recipes. For the same reason, we didn’t deal here with the na_values and keep_default_na is forced to False.

option	default	other	objective
sep	;	any regex	specify columns separator
header	infer	false	use included head
encoding	utf8	latin1 …	specify the encoding if not ascii
names		[col, names]	replace header (column) names
compression	infer	None, gzip …	specify if compressed
skiprows	0	any number	skip n rows before processing

Fwf

This is the old brother of the CSV, the fixed-width tabular is a variant, but in some cases this will be more stable to parse your tabular files as fixed width. Often Oracle or PostGreSQL exports are to be parsed like that.

Like in csv, for robust and stable processing, we desactivated every guessing of types, so that every cell is string or unicode in input: casting data types will be possible next within recipes. For the same reason, we didn’t deal here with the na_values and keep_default_na is forced to False.

option	default	other	objective
encoding	utf8	latin1 …	specify the encoding if not ascii
names		[col, names]	replace header (column) names
width	[1000]	[2, 5, 1, …]	columns width for the fixed format
compression	infer	None, gzip …	specify if compressed
skiprows	0	any number	skip n rows before processing

msgpack

Well you may not be familiar with that format. This is a simple and quite robust format for backing up data when the data is typed. We could have chosen HDFS (strong, but boring for integration, and ageing), json/bson (quite slow), or pickle (too much instability between versions), or parquet (seducing, but the pandas library doesn’t deal well with chunks).

other formats

Check the roadmap about future support for json, xml or any other type.

(PostGre)SQL

There is no specific option when using a (PostGre)SQL dataset at the moment.

elasticsearch

option	default	other	objective
random_view	True	False	display random sample by default
select	{“query”: {“match_all”: “”} }	any es query	query for filtering index
doc_type	table name	any string	change document type name
body	{}	cf infra	settings and mappings
max_tries	3	any integer	number of retries (with exponential backoff)
thread_count	connector value	any integer	number of threads for inserting
timeout	10	time in seconds	timeout for bulk indexing & reading
safe	True	False	doesn’t use `_id` field to index, which can lead to doubles when retrying
chunk_search	connector value	any number	number of row for search queries when using fuzzy join

index settings, mapping

The big challenge with elasticsearch datasets and scaling is the tuning possibilities. The body value is the equivalent of the body value when creating the index with curl, except it is written in yaml instead of json, which makes it easier to read.

Here is an example of configuration :

datasets:
  my_dataset:
    connector: elasticsearch
    table: my_dataset
    body:
      settings:
        index.refresh_interval: 30s     # <==== disable the "instant" indexing, which speed up indexing
        index.number_of_replicas: 0     # <==== disable replicas, which speeds up indexing
        index.number_of_shards: 9       # <==== should be a multiple of your nodes number, and increased if large datasets
                                        #       a shard should have less that 1M docs
      mappings:
        agrippa:
          _all:                         # <==== disable "_all" fileds indexing, for speeding up
            enabled: False
          dynamic: False
          properties:                   # <==== specify here the fields only needed for fuzzy join
            matchid_id:
              type: keyword
            matchid_name_match:
              type: text
            matchid_date_birth_str:
              type: keyword
            matchid_location_city:
              type: keyword
            matchid_location_country:
              type: keyword

For large clusters, you have to choose between low and large number_of_replicas for more robust indices. The body setting can be used for specific indexing and parsing, like ngrams and phonetics.

Validations options

Elasticsearch can be used to validate matches (as seen in [tutorial])(/tutorial#step-3-validate-matches-and-train-rescoring-with-machine-learning).

The validation option is activated by adding the validation: true option :

datasets:
  clients_x_deaths:
    connector: elasticsearch
    table: clients_x_deaths
    validation: true

Validation mode is configured with the conf/matchID_validation.conf file for default behaviour. They can be overriden for each validation dataset like this :

    [...]
    validation:
      actions:
        display: true
        action:
          label: Results
          indecision_display: true   # if you don't want to use indecision
        done:
          label: Status
      elasticsearch:
        size: 200                    # number of search
      columns:                       # colum specific action
        - field: matchid_id
          label: Id
          display: false                 # some fields can be used for selection but not being displayed
          searchable: true
        - field:                         # array gathers two fields in one column
            - matchid_name_last_src
            - hit_matchid_name_last_src
          label: last name               #
          display: true
          searchable: true
          callBack: formatDiff           # this callback highlights the differents between the two first elements of an array
        - field:
            - matchid_name_first_src
            - hit_matchid_name_first_src
          label: first name
          display: true
          searchable: true
          callBack: formatDiff
        - field:
            - matchid_sex
            - hit_matchid_sex
          label: sex
          display: true
          searchable: true
          callBack: formatSex
          appliedClass:
            head: head-centered
            body: has-text-centered
        - field:
            - matchid_date_birth_str
            - hit_matchid_date_birth_str
          label: birth date
          display: true
          searchable: true
          callBack: formatDate
          appliedClass:
            head: head-centered
            body: has-text-centered
        - field:
            - matchid_location_city
            - hit_matchid_location_city
          label: birth city
          display: true
          searchable: true
          callBack: coloredDiff
        - field: matchid_hit_distance
          label: Distance
          display: true
          searchable: false
          callBack: formatDistance
          appliedClass:
            head: head-centered
            body: has-text-centered
        - field: matchid_clique_size    # example of an added field
          label: Group size
          display: true
          searchable: false
          callBack: formatNumber
          appliedClass:
            head: head-centered
            body: has-text-centered
        - field: matchid_clique_id
          label: Group id
          display: true
          searchable: true
          callBack: formatNumber
          appliedClass:
            head: head-centered
            body: has-text-centered
        - field: confiance
          label: Score
          display: true
          searchable: false
          type: score
          callBack: formatNumber
          appliedClass:
            head: head-centered
            body: has-text-centered min-column-width-100
        - field: matchid_hit_score
          label: Pre_Score
          display: true
          searchable: false
          type: score
          callBack: formatNumber
          appliedClass:
            head: head-centered
            body: has-text-centered min-column-width-100
      view:
        display: true
        column_name: view
        fields:
          operation: excluded
          names:
            - none
      scores:
        column: confiance
        range:                # if you have a specific range for scoring
          - 0
          - 100
        colors:                # color depending coloring customization
          success: 80
          info: 60
          warning: 30
          danger: 0
        statisticsInterval: 5
        preComputed:           #  if you want to change the "pre-decision" thresholds
          decision: 55
          indecision:
            - 40
            - 65

Datasets

from csv to elasticsearch

Connectors

Datasets

filesystem datasets

NO EXCEL NOR OPENDOCUMENT HERE !

CSV

Fwf

msgpack

other formats

(PostGre)SQL

elasticsearch

index settings, mapping

Validations options

Fabien ANTOINE

Cristian PEREZ BROKATE

Simon CAQUÉ