lib.modernisation package

Submodules

lib.modernisation.modernisation module

class lib.modernisation.modernisation.Modernisation(modernisable_entities=('O', 'B-time', 'I-time'), min_word_length=3)

Bases: object

Class responsible for translating historical Dutch into modern Dutch. It uses a variety of techniques. The goal is solely to aid the user in reading the historical texts.

__call__(sentences)
Parameters:

sentences (List[List[Dict]]) – List of lists of words

Returns:

The same list of lists, with the fields “remove_whitespace_for_modernisation” and “modernisation” added to each word. The former is currently unused, but could be used if modernisation is extended to consider multiple words.
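The shape of the data can be illustrated with a minimal stand-in, which is not the real implementation. The key "text" and the placeholder values written into the two fields are assumptions made for illustration only; the actual word dictionaries and field types may differ.

```python
from typing import Dict, List


def add_modernisation_fields(sentences: List[List[Dict]]) -> List[List[Dict]]:
    """Hypothetical stand-in for Modernisation.__call__ that only adds the two documented fields."""
    for sentence in sentences:
        for word in sentence:
            # "text" is an assumed key; the placeholder copies the word unchanged.
            word["modernisation"] = word["text"]
            # Documented as currently unused; False is an assumed placeholder value.
            word["remove_whitespace_for_modernisation"] = False
    return sentences


sentences = [[{"text": "ende"}, {"text": "soo"}]]
result = add_modernisation_fields(sentences)
```

The input keeps its nesting (a list of sentences, each a list of word dictionaries); only the per-word dictionaries gain fields.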

__init__(modernisable_entities=('O', 'B-time', 'I-time'), min_word_length=3)
Parameters:
  • modernisable_entities (Union[List[str], Tuple[str, …]]) – Which entity types can be modernised.

  • min_word_length – Minimum length of words for which to apply full modernisation. Shorter words are only matched against line-break and abbreviation patterns.

lib.modernisation.regex_rules module

class lib.modernisation.regex_rules.RegexRules

Bases: object

Deals with modernising historical Dutch words based on regular expressions. All regular expressions are divided into three categories:

  1. Line-break regular expressions that define how words can be divided over multiple lines in the historical text.

  2. Direct word-to-word translations.

  3. Further regular expressions on parts of words, all of which were found heuristically.

The first and third are both applied by calls to the regex_subs method. The second is handled by the dict_lookup method.

__init__()

Loads the line-break rules, dictionaries and other rules from various sources.

dict_lookup(word_form_lowercase)
Parameters:

word_form_lowercase (str) – Lower case form of the word to modernise

Returns:

A tuple: the first element is the replacement found (if any), the second indicates whether a replacement was found (True) or not (False).
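The return convention can be sketched as follows. EXAMPLE_DICT and dict_lookup_sketch are hypothetical names, the two entries are illustrative only, and returning the unchanged input when nothing is found is an assumption; the documentation does not state what the first element holds in that case.

```python
from typing import Dict, Tuple

# Hypothetical word-to-word table; the real dictionaries are loaded by RegexRules.__init__.
EXAMPLE_DICT: Dict[str, str] = {"ende": "en", "soo": "zo"}


def dict_lookup_sketch(word_form_lowercase: str) -> Tuple[str, bool]:
    """Return (replacement, found) for a lower-case word form."""
    replacement = EXAMPLE_DICT.get(word_form_lowercase)
    if replacement is None:
        # Assumption: fall back to the unchanged input when no entry exists.
        return word_form_lowercase, False
    return replacement, True
```

A caller can use the boolean to decide whether to fall through to the next modernisation strategy.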

regex_subs(word_form_lowercase, filename)
Parameters:
  • word_form_lowercase (str) – Lower case form of the word to modernise

  • filename (str) – The filename for which to apply regexes.

Returns:

A tuple: the first element is the result of applying the regular expressions, the second indicates whether the regexes changed the word-form (True) or not (False).
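A minimal sketch of the (result, changed) convention is given below. Unlike the real method, which selects its rules by filename, this hypothetical regex_subs_sketch takes the rules directly as (pattern, replacement) pairs; EXAMPLE_RULES is made up for illustration and is not taken from the package's rule files.

```python
import re
from typing import List, Tuple

# Hypothetical rules, applied in order; not the package's actual rule set.
EXAMPLE_RULES: List[Tuple[str, str]] = [
    (r"ae", "aa"),   # e.g. "jaer" -> "jaar"
    (r"ck$", "k"),   # e.g. "boeck" -> "boek"
]


def regex_subs_sketch(word_form_lowercase: str, rules: List[Tuple[str, str]]) -> Tuple[str, bool]:
    """Apply every rule in sequence and report whether the word form changed."""
    result = word_form_lowercase
    for pattern, replacement in rules:
        result = re.sub(pattern, replacement, result)
    return result, result != word_form_lowercase
```

Comparing the final result with the input, rather than tracking each substitution, keeps the changed flag correct even if two rules happen to cancel each other out.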

tree_regexes = {'initial_letter': {'regex_form': re.compile('^\\^[a-z0-9].+'), 'key_from_regex_fun': <function RegexRules.<lambda>>, 'key_from_word_fun': <function RegexRules.<lambda>>}, None: {'regex_form': re.compile('.+'), 'key_from_regex_fun': <function RegexRules.<lambda>>, 'key_from_word_fun': <function RegexRules.<lambda>>}}

lib.modernisation.syllable_corrector module

class lib.modernisation.syllable_corrector.SyllableCorrector(file_name=None)

Bases: object

Modernisation strategy that substitutes syllables, and combinations thereof, with their modernised equivalents. The default replacements are given by the human-readable switches.json. The splitting into syllables is handled by the SyllableTokenizer class.

__call__(word_form_lowercase, verbose=False)
Parameters:
  • word_form_lowercase – The lower case word-form to be modernised.

  • verbose – Flag whether to output extra info for debugging (True) or not (False).

Returns:

A tuple: the first element is the result of applying the syllable corrector, the second indicates whether the word-form was changed (True) or not (False).

__init__(file_name=None)
Parameters:

file_name – The name of the switches file; defaults to switches.json. The path is given relative to the constants.DATA_DIR directory.

lib.modernisation.syllable_tokenizer module

class lib.modernisation.syllable_tokenizer.SyllableTokenizer(pad='<PAD>', start='[CLS]', end='[SEP]', unk='[UNK]')

Bases: object

Splits words into syllables based on some basic rules from (modern) Dutch. Four basic rules are followed:

  1. Two vowels are separated by a single consonant: break before that consonant

  2. Two vowels are separated by multiple consonants: break after the first consonant

  3. Compound words must be split according to their components, e.g. broodoven (bread oven) becomes brood.oven and not broo.doven as rule 1 would suggest. This rule is currently not implemented.

  4. The syllable formed must be pronounceable. This rule is implemented by modifying rule 2 to only allow syllables to start with combinations of consonants that are considered pronounceable in Dutch. This list is given by syll_starts.txt. The list was formed by studying all consonant combinations that exist at the beginning of Dutch words from a frequency table.

See www.dutchgrammar.com for more elaborate examples.
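Rules 1, 2 and 4 can be sketched roughly as below. This is not the SyllableTokenizer implementation: split_syllables is a hypothetical helper, ALLOWED_STARTS is a tiny made-up stand-in for syll_starts.txt, treating y as a vowel is an assumption, and the ij digraph and the special tokens are ignored.

```python
import re
from typing import List

VOWELS = "aeiouy"  # assumption: y counts as a vowel
# Hypothetical, heavily shortened stand-in for syll_starts.txt.
ALLOWED_STARTS = {"br", "dr", "gr", "kl", "kr", "pl", "pr", "sch", "sl", "sp", "st", "str", "tr", "vr"}


def split_syllables(word: str) -> List[str]:
    """Split a lower-case word using rules 1, 2 and 4 (rule 3 is not handled)."""
    # Alternating runs of vowels and consonants, e.g. "winter" -> w, i, nt, e, r
    runs = re.findall(rf"[{VOWELS}]+|[^{VOWELS}]+", word)
    syllables: List[str] = []
    current = ""
    for i, run in enumerate(runs):
        is_consonants = run[0] not in VOWELS
        between_vowels = 0 < i < len(runs) - 1
        if not (is_consonants and between_vowels):
            current += run
            continue
        if len(run) == 1:
            cut = 0  # rule 1: break before the single consonant
        else:
            cut = 1  # rule 2: break after the first consonant
            # rule 4: move the break right until the next syllable start is allowed
            while len(run) - cut > 1 and run[cut:] not in ALLOWED_STARTS:
                cut += 1
        syllables.append(current + run[:cut])
        current = run[cut:]
    syllables.append(current)
    return syllables


print(split_syllables("water"))      # ['wa', 'ter']
print(split_syllables("winter"))     # ['win', 'ter']
print(split_syllables("angstig"))    # ['ang', 'stig']
print(split_syllables("broodoven"))  # ['broo', 'do', 'ven']
```

The last call shows the consequence of leaving out rule 3: the compound broodoven is split purely on vowel and consonant patterns rather than on its components.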

IJ_PATTERN = re.compile('ij')
IJ_SUB_PATTERN = re.compile('ÿ')
SAFETY_PATTERN = re.compile('^[a-z ]+$')
__init__(pad='<PAD>', start='[CLS]', end='[SEP]', unk='[UNK]')
Parameters:
  • pad (str) – Representation of the pad-token

  • start (str) – Representation of the start-token

  • end (str) – Representation of the end-token

  • unk (str) – Representation for any unknown token

decode(tokens)
Parameters:

tokens (List[str]) – Tokens of the word

Returns:

Word formed from the tokens

encode(word)
Parameters:

word (str) – Word to be split

Returns:

List of tokens for the word