lib.string_to_sentences package

Submodules

lib.string_to_sentences.replacer module

class lib.string_to_sentences.replacer.Replacer(from_key, to_key)

Bases: object

Handles several types of regex-based operations on one or multiple words.

__init__(from_key, to_key)
Parameters:
  • from_key (str) – The key of the word dict that is the input for the methods, e.g. “word”

  • to_key (str) – The key of the word dict that is to contain the output of the methods, e.g. “word”

get_replacement_inputs(words_sub, patterns)
Parameters:
  • words_sub (List[Dict]) – The words that are found to be a match to the patterns.

  • patterns (List[Pattern]) – The patterns to which the words were matched.

Returns:

Those parts of the words_sub that match the group(s) in the corresponding regex patterns.
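As a rough illustration of the return value described above, the sketch below pairs each matched word with its pattern and collects the regex groups. It is not the library's code: the helper name group_parts, the from_key default and the tuple-per-word return shape are all assumptions.

```python
import re

def group_parts(words_sub, patterns, from_key="word"):
    """Hypothetical helper: collect the regex groups matched in each word."""
    parts = []
    for word, pattern in zip(words_sub, patterns):
        match = pattern.search(word[from_key])
        # An empty tuple marks a word that did not match (assumed convention).
        parts.append(match.groups() if match else ())
    return parts

# The pattern below is the last_word_regex listed for StringToSentences.
example_pattern = re.compile('(^[^ ]+[a-z])([.!?]$)')
example_parts = group_parts([{"word": "Weerld!"}], [example_pattern])
```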

joinor()
Returns:

A function (N.B. the method returns a function, not a value) that takes a series of words and joins them into a single word. To be used as an input for Replacer.replace_words(..., replace_function=Replacer.joinor(), ...)
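The function-returning pattern can be sketched as a small factory. This is a minimal sketch under assumptions, not the actual implementation: the name make_joiner, the handling of begin_char/end_char and the choice to return a one-element list are hypothetical.

```python
def make_joiner(from_key="word", to_key="word"):
    """Hypothetical factory: returns a function that joins word dicts."""
    def join(words):
        # Start from a copy of the first word so that its begin_char is kept.
        merged = dict(words[0])
        merged[to_key] = "".join(w[from_key] for w in words)
        # Assumption: the joined word ends where the last input word ends.
        if "end_char" in words[-1]:
            merged["end_char"] = words[-1]["end_char"]
        return [merged]
    return join

join = make_joiner()
joined = join([
    {"word": "Weer", "begin_char": 0, "end_char": 4},
    {"word": "ld", "begin_char": 5, "end_char": 7},
])
```

Returning a closure lets the caller configure the keys once and then hand the resulting function to whatever code performs the replacement.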

replace_words(words, patterns, replace_function, recursive=False)
Parameters:
  • words (List[Dict]) – The list of words in which to search for patterns and apply replacements.

  • patterns (List[Pattern]) – The patterns to search.

  • replace_function (Callable) – The replace_function is called on the words found to match the patterns and the output is placed in the list of words in their place.

  • recursive (bool) – Whether to apply the replacement also to words that have already been replaced. If False (default), the search will skip to the end of the replaced words after a replacement.

Returns:

return_matches(words, patterns, start=0)
Parameters:
  • words (List[Dict]) – List of words through which to search for the patterns.

  • patterns (List[Union[str, Pattern]]) – The sequence of patterns to search for. A subsequence of the words is matched against the sequence of patterns: each word must match the pattern at the corresponding position.

  • start (int) – Which element of the patterns is considered to be the start of the match.

Returns:

Indices of matches, offset by the start parameter.
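One way to picture the matching is a sliding window over the words, where every word in the window must match the pattern at the same position. The sketch below assumes that reading; the function name find_sequence_matches, the key argument and the exact meaning of the offset (added to the index of the window's first word) are assumptions rather than documented behaviour.

```python
import re

def find_sequence_matches(words, patterns, key="word", start=0):
    """Hypothetical sketch: indices where consecutive words match the patterns."""
    compiled = [re.compile(p) if isinstance(p, str) else p for p in patterns]
    if not compiled:
        return []
    hits = []
    for i in range(len(words) - len(compiled) + 1):
        window = words[i:i + len(compiled)]
        if all(p.search(w[key]) for p, w in zip(compiled, window)):
            # Assumption: the reported index is the window start plus `start`.
            hits.append(i + start)
    return hits

sample_words = [{"word": w} for w in ["New", "York", "is", "New", "York"]]
sample_patterns = [r"^New$", re.compile(r"^York$")]
```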

splittor(regex)
Parameters:

regex (Pattern) – The regex that splits a word into multiple words. Regex groups indicate what should become new words.

Returns:

A function (N.B. the method returns a function, not a value) that takes a single word and splits it into several words. To be used as an input for Replacer.replace_words(..., replace_function=Replacer.splittor(regex), ...)
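A group-based split could look roughly like the following sketch, in which each non-empty regex group becomes a new word. The factory name make_splitter and its key defaults are hypothetical, and character offsets are deliberately left out for brevity.

```python
import re

def make_splitter(regex, from_key="word", to_key="word"):
    """Hypothetical factory: returns a function that splits one word dict."""
    def split(word):
        match = regex.search(word[from_key])
        if match is None:
            # No match: the word is returned unchanged.
            return [word]
        # Each non-empty group becomes a new word.
        return [{to_key: part} for part in match.groups() if part]
    return split

# The pattern below is the last_word_regex listed for StringToSentences.
split_last_word = make_splitter(re.compile('(^[^ ]+[a-z])([.!?]$)'))
split_result = split_last_word({"word": "Weerld!"})
```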

lib.string_to_sentences.string_to_sentences module

class lib.string_to_sentences.string_to_sentences.StringToSentences(**_)

Bases: object

Converts a given string to a list of lists of dicts; see __call__ for the form of the final result.

__call__(text)
Parameters:

text – A string representing the historic text. Words are considered to be separated, broadly speaking, by spaces and/or newlines. A double newline is interpreted as a sentence separator. Sentences are also separated by other, punctuation-based methods.

Returns:

A list of lists of word dicts. An example of a single word dict is given below.

out = {"word": "Weerld!", "begin_char": 7, "end_char": 14, "ner": True}

Here begin_char and end_char give the word's original position inside text and can be used to locate the word in the original document. The “ner” key indicates whether this word should be used for named entity recognition.
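The offsets in the example are consistent with a half-open slice (end_char exclusive), since 14 - 7 equals the length of "Weerld!". Assuming that convention holds, a word can be recovered from the source text as sketched below; the sample text and the helper name locate are illustrative only.

```python
def locate(text, word):
    """Hypothetical helper: slice the original text using the word's offsets."""
    # Assumes begin_char is inclusive and end_char is exclusive.
    return text[word["begin_char"]:word["end_char"]]

sample_text = "Hallo, Weerld!"
sample_word = {"word": "Weerld!", "begin_char": 7, "end_char": 14, "ner": True}
located = locate(sample_text, sample_word)
```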

__init__(**_)
Parameters:

_ – Unused kwarg argument, included for symmetry with the other pipeline steps.

static compile_line_word_split_patterns()
static get_list_of_dict_word_and_chars(text)

Split the string into words where a word can be a normal word, but also a linebreak, space or any other special character.
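A split of this kind can be sketched with re.split and a capturing group, which keeps the separators as words of their own. The sketch reuses the split_rule pattern listed for this class; the function name words_with_offsets and the exact set of output keys are assumptions.

```python
import re

# Same pattern as the split_rule attribute listed for this class.
SPLIT_RULE = re.compile('([ \n])')

def words_with_offsets(text):
    """Hypothetical sketch: split text and track character offsets."""
    words = []
    position = 0
    # The capturing group makes re.split return the separators as well.
    for part in SPLIT_RULE.split(text):
        if part:  # skip empty strings produced between adjacent separators
            words.append({
                "word": part,
                "begin_char": position,
                "end_char": position + len(part),
            })
        position += len(part)
    return words

offset_words = words_with_offsets("Hallo, Weerld!")
```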

last_word_regex = re.compile('(^[^ ]+[a-z])([.!?]$)')
line_word_splits(words)

Deal with detectable word splits caused by line breaks. For NER, as well as for post-correction or modernisation, it is better to look at the whole word.
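The documentation does not state which splits count as detectable. Purely as an illustration, the sketch below assumes one such case: a word ending in a hyphen, followed by a newline, followed by a word starting with a lowercase letter. The three patterns and the function name rejoin_line_splits are hypothetical.

```python
import re

# Hypothetical detection rule: "Wee-", "\n", "rld" is rejoined to "Weerld".
HYPHEN_END = re.compile(r"^.+-$")
LINE_BREAK = re.compile(r"^\n$")
LOWER_START = re.compile(r"^[a-z]")

def rejoin_line_splits(words):
    """Illustrative sketch: merge words that were split across a line break."""
    patterns = [HYPHEN_END, LINE_BREAK, LOWER_START]
    result = []
    i = 0
    while i < len(words):
        window = words[i:i + len(patterns)]
        matched = len(window) == len(patterns) and all(
            p.search(w["word"]) for p, w in zip(patterns, window)
        )
        if matched:
            first, _, last = window
            result.append({
                "word": first["word"][:-1] + last["word"],  # drop the hyphen
                "begin_char": first["begin_char"],
                "end_char": last["end_char"],
            })
            i += len(patterns)  # skip past the words that were merged
        else:
            result.append(words[i])
            i += 1
    return result

rejoined = rejoin_line_splits([
    {"word": "Wee-", "begin_char": 0, "end_char": 4},
    {"word": "\n", "begin_char": 4, "end_char": 5},
    {"word": "rld", "begin_char": 5, "end_char": 8},
])
```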

ner_pattern = re.compile('(^[^a-zA-Z0-9:;.,?!"’\'`]$)')
replacer = <lib.string_to_sentences.replacer.Replacer object>
set_ner_flag(words)

Set ner to False for words that match the NER (anti)pattern.
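The ner_pattern listed for this class matches a single character that is neither alphanumeric nor one of a few punctuation marks, so separators such as a space or a newline match it. A minimal sketch of the flagging step, assuming every word otherwise gets ner set to True, might look like this; the function name flag_ner is hypothetical.

```python
import re

# Same pattern as the ner_pattern attribute listed for this class.
NER_PATTERN = re.compile('(^[^a-zA-Z0-9:;.,?!"’\'`]$)')

def flag_ner(words):
    """Hypothetical sketch: mark words matching the anti-pattern as non-NER."""
    # Assumption: words that do not match the anti-pattern get ner=True.
    return [
        {**word, "ner": NER_PATTERN.search(word["word"]) is None}
        for word in words
    ]

flagged = flag_ner([{"word": "Weerld!"}, {"word": " "}, {"word": "\n"}, {"word": "."}])
```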

split_in_sentences(words)

Deal with the line breaks.

split_rule = re.compile('([ \n])')