Welcome to the documentation of the ijsbeer ai pipeline!

The purpose of the ijsbeer ai pipeline is to enrich historical Dutch texts with useful data. This consists of four separate modules:

  1. String-to-sentence: split input string into words

  2. Post-correction: Correct any presumed mistakes in the HTR for each word.

  3. Named Entity Recognition (NER) using BERT.

  4. Named Entity Recognition (NER) using lists. This can be defined to only search within the results of the NER BERT.

  5. Modernisation of the words. This step is done last so that found entities can be skipped.

Each of these steps is documented in more detail in the corresponding subpackage of the lib package documentation.

Typical usage

The intended use of this pipeline is that it is containerised and called via POST-requests to the server. Below we present a number of examples of server calls as they might be used by a user of this pipeline. The examples are presented by calls from Python, but can easily be implemented in whatever language the user wants.

First, we import the requests package in Python and specify the URL for our host.

import requests
URL = "http://0.0.0.0:5002/

A “default” call executes all steps in the pipeline.

server_response = requests.post(
    url=URL + 'pipeline',
    json={"input_data": "Brieven van Cornelis Lardijn, Eijndhoven, 25 maart 2021"}
)
full_json = server_response.json()

This would produce an output like (where most entried have been omited for clarity)

{
  "message": "Success",
  "results": [
    [
      {},
      {
        "begin_char": 30,
        "bio": "B-location",
        "end_char": 40,
        "entity_chars": [
          [
            30,
            41
          ]
        ],
        "labels": {
          "BERT": {
            "date": {
              "bio": "O",
              "label_probabilities": {
                "<PAD>": 0.0021120477467775345,
                "B": 0.002442566677927971,
                "I": 0.001100641442462802,
                "O": 0.008404848165810108
              }
            },
            "location": {
              "bio": "B",
              "label_probabilities": {
                "<PAD>": 0.0021120477467775345,
                "B": 0.946304202079773,
                "I": 0.001456045312806964,
                "O": 0.008404848165810108
              }
            },
            "person": {
              "bio": "O",
              "label_probabilities": {
                "<PAD>": 0.0021120477467775345,
                "B": 0.036251552402973175,
                "I": 0.001928190584294498,
                "O": 0.008404848165810108
              }
            }
          },
          "lists": []
        },
        "modernisation": "Eijndhoven,",
        "ner": true,
        "post_correction": "Eijndhoven,",
        "remove_whitespace_for_modernisation": false,
        "word": "Eijndhoven,"
      },
      {}
    ]
  ]
}

If desired, the pipeline can be limited to only using a number of steps, e.g.

import requests
URL = "http://0.0.0.0:5002/
server_response = requests.post(
    url=URL + 'pipeline',
    json={"input_data": "Brieven van Cornelis Lardijn, Eijndhoven, 25 maart 2021"}
)
full_json = server_response.json()

Results in

{
  "message": "Success",
  "results": [
    [
        {},
        {
          "begin_char": 30,
          "end_char": 41,
          "modernisation": "Eindhoven,",
          "ner": true,
          "post_correction": "Eijndhoven,",
          "remove_whitespace_for_modernisation": false,
          "word": "Eijndhoven,"
        },
        {}
        ]
    ]
}

Note that because BERT was not used, the named Entity “Eindhoven” is modernized in this example, where it was not before.

If the string_to_sentences is not called by the pipeline, the input must be provided in terms of a list of lists of dicts, as usually output by string_to_sentences. This can be useful if performing a step separate from others. In the example below, we use the result from above and only call ner_bert on the result.

server_response = requests.post(
    url=URL + "pipeline",
    json={
        "input_data": full_json["results"],
        "steps": ["ner_bert"]
    }
)
server_response.json()

More details on this interface are given in the server package documentation.

Indices and tables