lib.modernisation package¶
Submodules¶
lib.modernisation.modernisation module¶
-
class
lib.modernisation.modernisation.
Modernisation
(modernisable_entities=('O', 'B-time', 'I-time'), min_word_length=3)¶ Bases:
object
Class responsible for translating historical Dutch into modern Dutch. It uses a variety of techniques. The sole goal is to aid the user in reading the historical texts.
-
__call__
(sentences)¶ - Parameters:
sentences (
List
[List
[Dict
]]) – List of lists of words
- Returns:
The same list of lists, with the fields “remove_whitespace_for_modernisation” and “modernisation” added to each word. The former is currently unused, but could be used if modernisation is extended to consider multiple words.
-
__init__
(modernisable_entities=('O', 'B-time', 'I-time'), min_word_length=3)¶ - Parameters:
modernisable_entities (
Union
[List
[str
],Tuple
[str
, …]]) – Which entity types can be modernised.
min_word_length – Minimum length of words for which to apply full modernisation. Shorter words are only matched against line-break and abbreviation patterns.
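The calling convention described above can be illustrated with a minimal, self-contained sketch. It does not import the real class. The key names `text` and `entity`, the `B-person` label, the helper `add_modernisation` and the toy rules are all assumptions made for illustration; only the two added field names and the default arguments come from the documentation. For simplicity, words shorter than `min_word_length` are left unchanged here, whereas the real class still matches them against line-break and abbreviation patterns.

```python
from typing import Callable, Dict, List

# Hypothetical key names: the real word dictionaries may use different ones.
TEXT_KEY = "text"
ENTITY_KEY = "entity"


def add_modernisation(
    sentences: List[List[Dict]],
    modernise_word: Callable[[str], str],
    modernisable_entities=("O", "B-time", "I-time"),
    min_word_length: int = 3,
) -> List[List[Dict]]:
    """Annotate every word in place and return the same nested list."""
    for sentence in sentences:
        for word in sentence:
            text = word[TEXT_KEY]
            eligible = (
                word.get(ENTITY_KEY, "O") in modernisable_entities
                and len(text) >= min_word_length
            )
            # Ineligible words keep their original form in this sketch.
            word["modernisation"] = modernise_word(text) if eligible else text
            # Documented as currently unused, so a constant is stored here.
            word["remove_whitespace_for_modernisation"] = False
    return sentences


# Toy replacement rules, invented for this example.
toy_rules = {"ende": "en", "hebbe": "heb"}
sentences = [[
    {"text": "ende", "entity": "O"},
    {"text": "ik", "entity": "O"},
    {"text": "hebbe", "entity": "B-person"},
]]
result = add_modernisation(sentences, lambda w: toy_rules.get(w, w))
```

In this sketch only the first word is modernised: the second is shorter than `min_word_length` and the third carries an entity label outside `modernisable_entities`.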
-
lib.modernisation.regex_rules module¶
-
class
lib.modernisation.regex_rules.
RegexRules
¶ Bases:
object
Deals with modernising historical Dutch words based on regular expressions. All regular expressions fall into three categories:
1. Line-break regular expressions that define how words can be divided over multiple lines in the historical text.
2. Direct word-to-word translations.
3. Further regular expressions on parts of words; these were all found heuristically.
The first and third are both applied by calls to the
regex_subs
method. The second is handled by the
dict_lookup
method.
-
__init__
()¶ Loads the line-break rules, dictionaries and other rules from various sources.
-
dict_lookup
(word_form_lowercase)¶ - Parameters:
word_form_lowercase (
str
) – Lower case form of the word to modernise
- Returns:
A tuple: the first element is the replacement found (if any), the second indicates whether a replacement was found (True) or not (False).
-
regex_subs
(word_form_lowercase, filename)¶ - Parameters:
word_form_lowercase (
str
) – Lower case form of the word to modernise
filename (
str
) – The filename for which to apply regexes.
- Returns:
A tuple: the first element is the result of applying the regular expressions, the second indicates whether the regexes changed the word-form (True) or not (False).
-
tree_regexes
= {'initial_letter': {'regex_form': re.compile('^\\^[a-z0-9].+'), 'key_from_regex_fun': <function RegexRules.<lambda>>, 'key_from_word_fun': <function RegexRules.<lambda>>}, None: {'regex_form': re.compile('.+'), 'key_from_regex_fun': <function RegexRules.<lambda>>, 'key_from_word_fun': <function RegexRules.<lambda>>}}¶
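The `(result, changed)` return convention shared by `dict_lookup` and `regex_subs` can be sketched as follows. This is a toy stand-in, not the library's implementation: the dictionary entries and regex rules are invented, the `filename` argument of the real `regex_subs` is omitted, and returning the unchanged word on a failed lookup is an assumption (the documentation only says the first element is the replacement "if any").

```python
import re
from typing import Dict, Tuple

# Toy data, invented for illustration; the real rules are loaded from files.
TOY_DICTIONARY: Dict[str, str] = {"ende": "en", "oock": "ook"}
TOY_REGEXES = [
    (re.compile(r"ae"), "aa"),   # e.g. jaer -> jaar
    (re.compile(r"ck$"), "k"),   # e.g. volck -> volk
]


def dict_lookup(word_form_lowercase: str) -> Tuple[str, bool]:
    """Direct word-to-word translation (category 2)."""
    if word_form_lowercase in TOY_DICTIONARY:
        return TOY_DICTIONARY[word_form_lowercase], True
    # Assumption: hand back the input unchanged when nothing is found.
    return word_form_lowercase, False


def regex_subs(word_form_lowercase: str) -> Tuple[str, bool]:
    """Apply every substitution rule in order (categories 1 and 3)."""
    result = word_form_lowercase
    for pattern, replacement in TOY_REGEXES:
        result = pattern.sub(replacement, result)
    return result, result != word_form_lowercase
```

Returning the boolean alongside the string lets a caller stop at the first strategy that actually changed the word, without comparing strings itself.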
lib.modernisation.syllable_corrector module¶
-
class
lib.modernisation.syllable_corrector.
SyllableCorrector
(file_name=None)¶ Bases:
object
Modernisation strategy that substitutes syllables, and combinations thereof, with their modernised equivalents. The default replacements used are given by the human-readable
switches.json
. The splitting into syllables is handled by the
SyllableTokenizer
class.
-
__call__
(word_form_lowercase, verbose=False)¶ - Parameters:
word_form_lowercase – The lower case word-form to be modernised.
verbose – Flag whether to output extra info for debugging (True) or not (False).
- Returns:
A tuple: the first element is the result of applying the syllable corrector, the second indicates whether the corrector changed the word-form (True) or not (False).
-
__init__
(file_name=None)¶ - Parameters:
file_name – The name of the switches file; defaults to switches.json. The path is given relative to the constants.DATA_DIR directory.
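A minimal sketch of syllable-level substitution, assuming the switches form a flat mapping from a historical syllable (or a run of adjacent syllables) to its modern equivalent. The actual structure of `switches.json` is not documented here, so the mapping, the function name `correct_syllables` and the greedy longest-match strategy are all assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical switches; the real switches.json may be structured differently.
TOY_SWITCHES: Dict[str, str] = {"ghe": "ge", "daen": "daan", "ende": "en"}


def correct_syllables(
    syllables: List[str], switches: Dict[str, str]
) -> Tuple[str, bool]:
    """Replace syllables, preferring the longest matching run of syllables."""
    out: List[str] = []
    i = 0
    while i < len(syllables):
        # Try the longest combination starting at position i first.
        for j in range(len(syllables), i, -1):
            chunk = "".join(syllables[i:j])
            if chunk in switches:
                out.append(switches[chunk])
                i = j
                break
        else:
            # No switch applies: keep the syllable as it is.
            out.append(syllables[i])
            i += 1
    word = "".join(out)
    return word, word != "".join(syllables)
```

With the toy switches, `["ghe", "daen"]` is corrected syllable by syllable, while `["en", "de"]` is matched as one combination.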
-
lib.modernisation.syllable_tokenizer module¶
-
class
lib.modernisation.syllable_tokenizer.
SyllableTokenizer
(pad='<PAD>', start='[CLS]', end='[SEP]', unk='[UNK]')¶ Bases:
object
Splits words into syllables based on some basic rules from (modern) Dutch. Four basic rules are followed:
1. Two vowels are separated by a single consonant: break before that consonant.
2. Two vowels are separated by multiple consonants: break after the first consonant.
3. Compound words must be split according to their components, e.g. broodoven (bread oven) becomes brood.oven and not broo.doven as rule 1 would suggest. This rule is currently not implemented.
4. The syllable formed must be pronounceable. This rule is implemented by modifying rule 2 to only allow syllables to start with combinations of consonants that are considered pronounceable in Dutch. This list is given by
syll_starts.txt
. The list was formed by studying all consonant combinations that exist at the beginning of Dutch words from a frequency table.
See www.dutchgrammar.com for more elaborate examples.
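The implemented rules (single consonant, multiple consonants, pronounceable onset) can be sketched as below. This is one possible reading of the description, not the library's code: the vowel set, the small onset list standing in for `syll_starts.txt`, and the function name `split_syllables` are assumptions. The compound-word rule is left out, as in the original.

```python
import re
from typing import List, Set

VOWELS = "aeiouy"
# Hypothetical subset of pronounceable onsets; the real list is syll_starts.txt.
TOY_SYLL_STARTS: Set[str] = {"st", "str", "kr", "pl", "tr", "sch", "br"}


def split_syllables(word: str, syll_starts: Set[str] = TOY_SYLL_STARTS) -> List[str]:
    if not word:
        return []
    # Alternating runs of vowels and consonants: "venster" -> v, e, nst, e, r
    runs = re.findall(rf"[{VOWELS}]+|[^{VOWELS}]+", word)
    syllables: List[str] = []
    current = ""
    for index, run in enumerate(runs):
        is_vowel_run = run[0] in VOWELS
        between_vowels = 0 < index < len(runs) - 1
        if is_vowel_run or not between_vowels:
            # Vowels and leading/trailing consonants stay in the current syllable.
            current += run
            continue
        if len(run) == 1:
            keep = 0  # single consonant: break before it
        else:
            keep = 1  # multiple consonants: break after the first one
            # Pronounceability: shift the break right until the remainder
            # is a known onset or a single consonant.
            while len(run) - keep > 1 and run[keep:] not in syll_starts:
                keep += 1
        syllables.append(current + run[:keep])
        current = run[keep:]
    syllables.append(current)
    return syllables
```

Because compounds are not detected, `split_syllables("broodoven")` yields `broo`, `do`, `ven` rather than a split at the component boundary.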
-
IJ_PATTERN
= re.compile('ij')¶
-
IJ_SUB_PATTERN
= re.compile('ÿ')¶
-
SAFETY_PATTERN
= re.compile('^[a-z ]+$')¶
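How these three patterns are used is not stated. A plausible reading, offered only as an assumption, is that the digraph `ij` is temporarily replaced by the single character `ÿ` so that it is treated as one vowel during splitting and restored afterwards, and that `SAFETY_PATTERN` checks that the input contains only lowercase letters and spaces beforehand. The helper names below are hypothetical.

```python
import re

# Patterns copied from the class attributes documented above.
IJ_PATTERN = re.compile("ij")
IJ_SUB_PATTERN = re.compile("ÿ")
SAFETY_PATTERN = re.compile("^[a-z ]+$")


def is_safe(word: str) -> bool:
    """True if the word consists solely of lowercase a-z and spaces."""
    return SAFETY_PATTERN.match(word) is not None


def protect_ij(word: str) -> str:
    """Assumed pre-processing step: collapse 'ij' into a single placeholder."""
    return IJ_PATTERN.sub("ÿ", word)


def restore_ij(word: str) -> str:
    """Assumed post-processing step: expand the placeholder back to 'ij'."""
    return IJ_SUB_PATTERN.sub("ij", word)
```

Under this reading, checking `is_safe` before `protect_ij` guarantees that any `ÿ` seen later was introduced by the substitution, so the round trip is lossless.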
-
__init__
(pad='<PAD>', start='[CLS]', end='[SEP]', unk='[UNK]')¶ - Parameters:
pad (
str
) – Representation of the pad-token
start (
str
) – Representation of the start-token
end (
str
) – Representation of the end-token
unk (
str
) – Representation for any unknown token
-
decode
(tokens)¶ - Parameters:
tokens (
List
[str
]) – Tokens of the word
- Returns:
Word formed from the tokens
-
encode
(word)¶ - Parameters:
word (
str
) – Word to be split
- Returns:
List of tokens for the word