lib.ner_lists package

Submodules

lib.ner_lists.direct_finder module

class lib.ner_lists.direct_finder.DirectFinder(data_dir, entity_type, word_getter)

Bases: lib.ner_lists.finder.Finder

Finds entities from lists where the searchable forms are neither permuted nor matched fuzzily. One object is instantiated for each type of entity in the project, e.g. person, location and date.

__init__(data_dir, entity_type, word_getter)
Parameters:
  • data_dir (str) – Path to the data files to use, relative to the pipeline_data directory, e.g. “ner_lists/SZSA”.

  • entity_type (str) – See Finder

  • word_getter (str) – See Finder

class lib.ner_lists.direct_finder.FindEntitiesGivenDirectList(direct_list)

Bases: lib.ner_lists.finder.FindEntities

Finds entities from a single list where the searchable forms are neither permuted nor matched fuzzily.

__call__(i_sentence, sentence)
Parameters:
  • i_sentence (int) – A number indicating which sentence we are searching

  • sentence (List[Dict]) – List of words. Note that this is not a “true sentence” in the sense used throughout the project. It is a subset of all words, depending on the object’s get_words.

Return type:

List[Entity]

Returns:

A list of found entities.

__init__(direct_list)
Parameters:

direct_list (Dict) – The list to search, given by a dictionary where the keys are the searchables and the values represent the (canonical) information on each entity.
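The exact, non-permuted lookup described above can be sketched as follows. This is only an illustration, not the package's implementation: the helper name `find_direct`, the `"text"` key on the word dicts, the `max_len` window and the longest-match-first strategy are all assumptions.

```python
from typing import Dict, List, Tuple


def find_direct(
    sentence: List[Dict],
    direct_list: Dict[Tuple[str, ...], List[Dict]],
    max_len: int = 3,
) -> List[Tuple[int, int, Dict]]:
    """Return (first_word, past_last_word, info) for every exact match."""
    hits = []
    i = 0
    while i < len(sentence):
        matched = False
        # Try the longest window first so multi-word searchables win.
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            key = tuple(w["text"].lower() for w in sentence[i:i + n])
            if key in direct_list:
                for info in direct_list[key]:
                    hits.append((i, i + n, info))
                i += n  # continue after the matched words
                matched = True
                break
        if not matched:
            i += 1
    return hits


# Hypothetical input: word dicts with an assumed "text" key.
words = [{"text": "Fort"}, {"text": "Batavia"}, {"text": "is"}, {"text": "old"}]
direct = {("fort", "batavia"): [{"canonical_form": "Fort Batavia"}]}
hits = find_direct(words, direct)
```

Because the keys are compared for equality only, a misspelled word produces no hit here; that case is what the fuzzy finders are for.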

lib.ner_lists.entity module

class lib.ner_lists.entity.Entity(sentence, begin, end, score, searchable, canonical_form, extra_attributes)

Bases: object

Defines an Entity found in the text, contains all necessary properties as fields.

__init__(sentence, begin, end, score, searchable, canonical_form, extra_attributes)
Parameters:
  • sentence (int) – The number of the sentence in which the entity was found

  • begin (int) – Begin character of the entity

  • end (int) – End character of the entity

  • score (float) – Score of the match

  • searchable (str) – The “searchable” form of the word(s) that was/were matched

  • canonical_form (str) – The canonical form of the entity

  • extra_attributes (Dict[str, str]) – Any extra attributes

to_dict(**kwargs)
Parameters:

kwargs – Additional key-value pairs to place on the dict.

Returns:

A dictionary containing the relevant information on the entity, used as data for the output JSON.
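As a rough illustration of the fields listed above, a stand-in for Entity could look like this. The class name `EntitySketch` is hypothetical, and which keys the real to_dict emits is not documented here, so dumping every field is an assumption.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class EntitySketch:
    """Hypothetical stand-in mirroring the documented constructor arguments."""

    sentence: int
    begin: int
    end: int
    score: float
    searchable: str
    canonical_form: str
    extra_attributes: Dict[str, str] = field(default_factory=dict)

    def to_dict(self, **kwargs) -> Dict:
        # Assumption: all fields are emitted, then the extra pairs are added.
        out = asdict(self)
        out.update(kwargs)
        return out


entity = EntitySketch(
    sentence=0,
    begin=10,
    end=20,
    score=0.93,
    searchable="middelburg",
    canonical_form="Middelburg",
    extra_attributes={"geometry": "Point(51.5021 3.6141)"},
)
record = entity.to_dict(entity_type="location")
```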

lib.ner_lists.finder module

class lib.ner_lists.finder.FindEntities(*args, **kwargs)

Bases: object

Abstract base class for the FindEntitiesGiven… classes

abstract __call__(*args, **kwargs)
Parameters:
  • i_sentence – A number indicating which sentence we are searching

  • sentence – List of words. Note that this is not a “true sentence” in the sense used throughout the project. It is a subset of all words, depending on the object’s get_words.

Returns:

A list of found entities.

abstract __init__(*args, **kwargs)

class lib.ner_lists.finder.Finder(entity_type, entity_finders, word_getter)

Bases: object

Finds entities from lists. Derived classes exist for different types of matchings.

__call__(sentences)
Parameters:

sentences (List[List[Dict]]) – List of lists of words.

Returns:

None, the words are modified

__init__(entity_type, entity_finders, word_getter)
Parameters:
  • entity_type (str) – The type of entity for which the finder is searching, e.g. “person”.

  • entity_finders (Dict[str, FindEntities]) – The entity finders that look through a specific list or tree.

  • word_getter (str) – Passed on to the class GetWords to determine how to get the list of searchable words.

static create_lists(dirpath)
Parameters:

dirpath (str) – The directory where to search for files.

Return type:

Dict[str, Dict]

Returns:

A dictionary where each key-value-pair is a combination of a filename with a corresponding dictionary of searchables.

lib.ner_lists.finder.load_file(dirpath, list_name, extension)
Parameters:
  • dirpath (str) – The directory where the file is located

  • list_name (str) – The name of the file

  • extension (str) – The extension of the file

Return type:

Dict[Tuple[str, …], List[Dict]]

Returns:

A dictionary in which each key is a searchable, pointing to a list of dictionaries containing the keys “searchable”, “canonical_form” and “extra_attributes”. The first is given as a tuple of words (to allow reordering of the elements later), the second as a string and the third as a dictionary.

Example:

dict_searchables[('middelburg',)] = [
    {
        'searchable': ('middelburg',),
        'canonical_form': 'Maspeth',
        'extra_attributes': {'geometry': 'Point(40.75 -73.9333)'}
    },
    {
        'searchable': ('middelburg',),
        'canonical_form': 'Middelburg',
        'extra_attributes': {'geometry': 'Point(51.5021 3.6141)'}
    },
    {
        'searchable': ('middelburg',),
        'canonical_form': 'Middelburg',
        'extra_attributes': {'geometry': 'Point(0 0)'}
    }
]
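One way to arrive at a structure of this shape is to group rows by their searchable tuple. The sketch below assumes the rows have already been read into dictionaries; the on-disk file format is not documented here, and the helper name `group_by_searchable` and the third sample row are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_searchable(rows: List[Dict]) -> Dict[Tuple[str, ...], List[Dict]]:
    """Group rows so that every searchable tuple maps to all its candidates."""
    grouped = defaultdict(list)
    for row in rows:
        # Assumption: the searchable is lower-cased and split on whitespace.
        key = tuple(row["searchable"].lower().split())
        grouped[key].append(
            {
                "searchable": key,
                "canonical_form": row["canonical_form"],
                "extra_attributes": dict(row.get("extra_attributes", {})),
            }
        )
    return dict(grouped)


rows = [
    {"searchable": "Middelburg", "canonical_form": "Maspeth"},
    {"searchable": "Middelburg", "canonical_form": "Middelburg"},
    {"searchable": "Fort Batavia", "canonical_form": "Batavia"},  # made-up row
]
dict_searchables = group_by_searchable(rows)
```

Keeping a list per key is what allows one searchable, such as “middelburg”, to point to several canonical entities.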

lib.ner_lists.fuzzy_finder module

class lib.ner_lists.fuzzy_finder.FindEntitiesGivenList(fuzzy_list, cutoff_score)

Bases: lib.ner_lists.finder.FindEntities

Finds entities from a single list where the searchable forms are not to be permuted but are to be matched fuzzily.

__call__(i_sentence, sentence)
Parameters:
  • i_sentence (int) – A number indicating which sentence we are searching

  • sentence (List[Dict]) – List of words. Note that this is not a “true sentence” in the sense used throughout the project. It is a subset of all words, depending on the object’s get_words.

Return type:

List[Entity]

Returns:

A list of found entities.

__init__(fuzzy_list, cutoff_score)
Parameters:
  • fuzzy_list (Dict) – The list to search, given by a dictionary where the keys are the searchables and the values represent the (canonical) information on each entity.

  • cutoff_score (float) – The lowest score for entities to still be considered a hit.

class lib.ner_lists.fuzzy_finder.FuzzyFinder(data_dir, entity_type, word_getter, cutoff_score)

Bases: lib.ner_lists.finder.Finder

Finds entities from lists where the searchable forms are not to be permuted but are to be matched fuzzily. An object is instantiated for each different type of entity in the project, e.g. person, location and date.

__init__(data_dir, entity_type, word_getter, cutoff_score)
Parameters:
  • data_dir (str) – Path to the data files to use, relative to the pipeline_data directory, e.g. “ner_lists/SZSA”.

  • entity_type (str) – See Finder

  • word_getter (str) – See Finder

  • cutoff_score (float) – The lowest score for entities to still be considered a hit.

lib.ner_lists.fuzzy_matcher module

class lib.ner_lists.fuzzy_matcher.FuzzyListMatcher(fuzzy_list, cutoff_score)

Bases: lib.ner_lists.fuzzy_matcher.FuzzyMatcher

Finds matches given a “list”, a list of “searchables” grouped by a key element of the searchable. The key element is searched for first, after which different orderings of the match are considered.

__init__(fuzzy_list, cutoff_score)
Parameters:
  • fuzzy_list (Dict[str, Dict]) – The list through which to search for entities

  • cutoff_score (float) – Score below which to disregard matches.

find_matches(i_sentence, sentence, n_approx=2)
Parameters:
  • i_sentence (int) – An index of the sentence, used to match it to the correct place later on.

  • sentence (List[Dict]) – List of words to match against the list

  • n_approx (int) – The approximate expected number of words in a hit. It influences the threshold for the first hit. Setting it higher results in a higher threshold, causing faster performance but possibly missed hits. Setting it too low results in poor performance.

Returns:

Yields entities as they are found

yield_matches(i_sentence, sentence, i)
Parameters:
  • i_sentence (int) – string index

  • sentence (List[Dict]) – list of words

  • i (int) – index of word in sentence (which matches the key of an element in the tree-dictionary)

class lib.ner_lists.fuzzy_matcher.FuzzyMatcher(cutoff_score)

Bases: object

Class to match words with list elements

__init__(cutoff_score)
Parameters:

cutoff_score (float) – Score below which to disregard matches.

static get_close_matches_scores(word_group, possibilities, n, cutoff)
Parameters:
  • word_group – a string, might be multiple words concatenated without spaces.

  • possibilities – list of strings. word_group will be compared to all strings in this list.

  • n – positive integer. the top-n matches will be returned.

  • cutoff – percentage of characters that should match before something is considered a match

Returns:

Either an empty list (no match/hit) or a size-2 tuple with the score and the word_group string.
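The idea of returning close matches together with their scores can be sketched with the standard library's difflib. This is an assumption about the approach, suggested only by the method name; the helper `close_matches_with_scores` is hypothetical and returns a list of (score, candidate) pairs rather than the exact return shape documented above.

```python
from difflib import SequenceMatcher
from heapq import nlargest
from typing import List, Tuple


def close_matches_with_scores(
    word_group: str,
    possibilities: List[str],
    n: int = 3,
    cutoff: float = 0.6,
) -> List[Tuple[float, str]]:
    """Return up to n (score, candidate) pairs with score >= cutoff, best first."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    matcher = SequenceMatcher()
    matcher.set_seq2(word_group)  # seq2 is cached, so set the fixed string there
    scored = []
    for candidate in possibilities:
        matcher.set_seq1(candidate)
        ratio = matcher.ratio()  # 2 * matches / total length, between 0 and 1
        if ratio >= cutoff:
            scored.append((ratio, candidate))
    return nlargest(n, scored)


matches = close_matches_with_scores(
    "fortbatavia", ["fortbatavja", "bataviafort", "amsterdam"], n=2, cutoff=0.7
)
```

Note that a plain character-sequence ratio penalises reordered words: “bataviafort” scores below the 0.7 cutoff against “fortbatavia” here, which is why permuted forms need separate handling.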

score(word_group, word_list, cutoff_score=None)
Parameters:
  • cutoff_score – threshold for % of characters that should match between word_group and an el. of word_list

  • word_group – a string, which is a concatenated list of words that are to be matched

  • word_list – a list of (space-less concatenated) words that are compared to the word_group. Elements might be permutations of others, e.g. fortbatavia and bataviafort

Returns:

A number, 0<=number<=1. It is the percentage of characters that match in the best match (from the list)
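Scoring a word group against a list that contains permuted concatenations might look like the sketch below. It uses difflib's ratio as a stand-in for “percentage of characters that match”; the helper names and the choice to return 0.0 below the cutoff are assumptions, not the documented behaviour.

```python
from difflib import SequenceMatcher
from itertools import permutations
from typing import List


def concatenated_orderings(words: List[str]) -> List[str]:
    """All space-less concatenations of the words, one per ordering."""
    return ["".join(order) for order in permutations(words)]


def best_score(word_group: str, word_list: List[str], cutoff_score: float = 0.0) -> float:
    """Best similarity ratio between word_group and any element of word_list."""
    best = 0.0
    for candidate in word_list:
        best = max(best, SequenceMatcher(None, word_group, candidate).ratio())
    # Assumption: scores under the cutoff are reported as no match (0.0).
    return best if best >= cutoff_score else 0.0


candidates = concatenated_orderings(["fort", "batavia"])
result = best_score("bataviafort", candidates, cutoff_score=0.8)
```

The number of orderings grows factorially with the number of words, so this approach is only practical for short searchables.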

class lib.ner_lists.fuzzy_matcher.FuzzyTreeMatcher(tree, cutoff_score)

Bases: lib.ner_lists.fuzzy_matcher.FuzzyMatcher

Finds matches given a “tree”, where a “tree” is a list of “searchables” grouped by a key element of the searchable. The key element is searched for first, after which different orderings of the match are considered.

__init__(tree, cutoff_score)
Parameters:
  • tree (Dict[str, Dict]) – The tree through which to search for entities, each key points to a list of “searchables” containing that key as one of the elements.

  • cutoff_score (float) – Score below which to disregard matches.

find_matches(i_sentence, sentence, n_approx=2)
Parameters:
  • i_sentence (int) – An index of the sentence, used to match it to the correct place later on.

  • sentence (List[Dict]) – List of words to match against the tree

  • n_approx (int) – The approximate expected number of words in a hit. It influences the threshold for the first hit. Setting it higher results in a higher threshold, causing faster performance but possibly missed hits. Setting it too low results in poor performance.

Returns:

Yields entities as they are found

yield_matches(i_sentence, sentence, i, entity_key)
Parameters:
  • i_sentence (int) – string index

  • sentence (List[Dict]) – list of words

  • i (int) – index of word in sentence (which matches the key of an element in the tree-dictionary)

  • entity_key (str) – The tree is essentially a large dictionary. entity_key is the key of the most likely entity

Returns:

Yields entities as they are found

lib.ner_lists.fuzzy_permutative_finder module

class lib.ner_lists.fuzzy_permutative_finder.FindEntitiesGivenTree(tree, cutoff_score)

Bases: lib.ner_lists.finder.FindEntities

__call__(i_sentence, sentence)
Parameters:
  • i_sentence – A number indicating which sentence we are searching

  • sentence – List of words. Note that this is not a “true sentence” in the sense used throughout the project. It is a subset of all words, depending on the object’s get_words.

Return type:

List[Entity]

Returns:

A list of found entities.

__init__(tree, cutoff_score)
Parameters:
  • tree – The list to search, given by a dictionary where the keys are the searchables and the values represent the (canonical) information on each entity.

  • cutoff_score – The lowest score for entities to still be considered a hit.

class lib.ner_lists.fuzzy_permutative_finder.FuzzyPermutativeFinder(data_dir, entity_type, word_getter, cutoff_score)

Bases: lib.ner_lists.finder.Finder

Finds entities from lists where the searchable forms are to be permuted and to be matched fuzzily. An object is instantiated for each different type of entity in the project, e.g. person, location and date.

__init__(data_dir, entity_type, word_getter, cutoff_score)
Parameters:
  • data_dir (str) – Path to the data files to use, relative to the pipeline_data directory, e.g. “ner_lists/SZSA”.

  • entity_type (str) – See Finder

  • word_getter (str) – See Finder

  • cutoff_score (float) – The lowest score for entities to still be considered a hit.

lib.ner_lists.fuzzy_permutative_finder.create_freq_table(all_list)
Parameters:

all_list (Dict) – A dictionary of searchable-canonical pairs.

Returns:

A dictionary stating how often each searchable element occurs in the list.
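A frequency table of this kind can be sketched with collections.Counter. The helper name `count_elements` is hypothetical, and counting each word at most once per searchable is an assumption.

```python
from collections import Counter
from typing import Dict, Tuple


def count_elements(all_list: Dict[Tuple[str, ...], object]) -> Dict[str, int]:
    """Count in how many searchables each word occurs."""
    counts = Counter()
    for searchable in all_list:
        # set() makes a repeated word within one searchable count once.
        counts.update(set(searchable))
    return dict(counts)


# Made-up searchables; the values are irrelevant for counting.
all_list = {("fort", "batavia"): [], ("fort", "zeelandia"): [], ("batavia",): []}
freq = count_elements(all_list)
```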

lib.ner_lists.fuzzy_permutative_finder.create_tree(all_list)
Parameters:

all_list (Dict) – A dictionary of searchable-canonical pairs

Return type:

Dict[str, Dict]

Returns:

A dictionary where the keys are the first word that would be searched, the values are nested dicts of the same form as the input, i.e. searchable-canonical pairs.
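The grouping into a “tree” could be sketched as below. How the real function picks “the first word that would be searched” is not stated; choosing the rarest word of each searchable is purely an assumption made for this illustration, as are the helper name `build_tree` and the sample data.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Searchable = Tuple[str, ...]


def build_tree(
    all_list: Dict[Searchable, List[Dict]]
) -> Dict[str, Dict[Searchable, List[Dict]]]:
    """Group searchable-canonical pairs under one key word per searchable."""
    freq = Counter(word for searchable in all_list for word in set(searchable))
    tree = defaultdict(dict)
    for searchable, info in all_list.items():
        # Assumption: the rarest word is the most selective key to search first.
        # min() keeps the earliest word when frequencies tie.
        key = min(searchable, key=lambda word: freq[word])
        tree[key][searchable] = info
    return dict(tree)


all_list = {
    ("fort", "batavia"): [{"canonical_form": "Batavia"}],
    ("fort", "zeelandia"): [{"canonical_form": "Zeelandia"}],
}
tree = build_tree(all_list)
```

With this choice the common word “fort” never becomes a key, so a sentence containing “fort” alone does not trigger a scan of every fort in the list.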

lib.ner_lists.fuzzy_permutative_finder.remove_overlap(entities, sentence_length)
Parameters:
  • entities (List[Entity]) – The found entities by the permutative searching.

  • sentence_length (int) – The length of the sentence in which the entities occur.

Return type:

List[Entity]

Returns:

A trimmed list of found entities where overlapping results are discarded.
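Discarding overlapping results can be sketched as a greedy selection that keeps the highest-scoring span first. The selection rule, the half-open `[begin, end)` convention and the `Span` type are assumptions; the real function works on Entity objects and may resolve conflicts differently.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """Hypothetical minimal stand-in for an Entity."""

    begin: int
    end: int
    score: float


def drop_overlaps(spans: List[Span], sentence_length: int) -> List[Span]:
    """Keep the best-scoring spans so that no two kept spans overlap."""
    taken = [False] * sentence_length
    kept = []
    for span in sorted(spans, key=lambda s: s.score, reverse=True):
        end = min(span.end, sentence_length)  # guard against out-of-range ends
        if not any(taken[span.begin:end]):
            kept.append(span)
            for pos in range(span.begin, end):
                taken[pos] = True
    return sorted(kept, key=lambda s: s.begin)


spans = [Span(0, 5, 0.80), Span(3, 9, 0.95), Span(10, 14, 0.70)]
kept = drop_overlaps(spans, sentence_length=15)
```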

lib.ner_lists.get_searchable_words module

class lib.ner_lists.get_searchable_words.GetWords(**kwargs)

Bases: object

Abstract base class for the GetWordsNer and GetWordsBert classes

abstract __call__(sentences)

abstract __init__(**kwargs)

classmethod from_string(getter_name)
Parameters:

getter_name (str) – Name of the derived class, either “NER” or “BERT”

Returns:

The uninstantiated class of the desired kind.
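A classmethod that maps a name to an uninstantiated subclass can be sketched as follows. All class names ending in `Sketch` are hypothetical, the error raised for unknown names is an assumption, and the BERT variant is only a placeholder because its selection logic is not described here.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type


class WordGetterSketch(ABC):
    """Hypothetical stand-in for the abstract word-getter base class."""

    @abstractmethod
    def __call__(self, sentences: List[List[Dict]]) -> List[List[Dict]]:
        ...

    @classmethod
    def from_string(cls, getter_name: str) -> Type["WordGetterSketch"]:
        # The subclasses are looked up at call time, after they are defined.
        registry = {"NER": NerGetterSketch, "BERT": BertGetterSketch}
        try:
            return registry[getter_name]
        except KeyError:
            raise ValueError(f"Unknown word getter: {getter_name!r}") from None


class NerGetterSketch(WordGetterSketch):
    def __call__(self, sentences):
        # Keep only the words flagged with ner == True, per sentence.
        return [[w for w in s if w.get("ner") is True] for s in sentences]


class BertGetterSketch(WordGetterSketch):
    def __init__(self, entity_type: str):
        self.entity_type = entity_type

    def __call__(self, sentences):
        return sentences  # placeholder: the real selection is not shown


getter_cls = WordGetterSketch.from_string("NER")
getter = getter_cls()
```

Returning the class rather than an instance lets the caller supply constructor arguments, such as the entity type, that differ per subclass.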

class lib.ner_lists.get_searchable_words.GetWordsBert(entity_type, **kwargs)

Bases: lib.ner_lists.get_searchable_words.GetWords

__call__(sentences)
Parameters:

sentences – List of lists. Each element is a dict that represents a word. The intermediate-level lists each represent a sentence. Spaces and commas are still considered words at this level.

Returns:

found_entities. A list of lists. Each intermediate list represents a BERT-found entity of type self.type_of_list (e.g. a location)

__init__(entity_type, **kwargs)
Parameters:
  • entity_type – The type of entity (e.g. location or person) which should be searched.

  • kwargs – any additional key-word arguments are passed to GetWords

class lib.ner_lists.get_searchable_words.GetWordsNer(**kwargs)

Bases: lib.ner_lists.get_searchable_words.GetWords

__call__(sentences)
Returns:

A list of lists, where the inner lists contain all words that have word['ner'] == True for a sentence.

__init__(**kwargs)
Parameters:

kwargs – any additional key-word arguments are passed to GetWords

lib.ner_lists.ner_lists module

class lib.ner_lists.ner_lists.NerLists(data_dir, word_getter, cutoff_score, entity_types)

Bases: object

Find entities by comparing words with lists

__call__(sentences)
Parameters:

sentences (List[List[Dict]]) – List of lists of words

Returns:

The same sentences, adorned with information found in various types of lists.

__init__(data_dir, word_getter, cutoff_score, entity_types)
Parameters:
  • data_dir (Union[List[str], str]) – Path of the lists, given as either a string or a list of strings; a list is joined using os.path.join.

  • word_getter (str) – Indicates how the words to be searched through are obtained. Currently only “NER” and “BERT” are supported.

  • cutoff_score (float) – The cut-off score below which matches will not be returned.

  • entity_types (List[str]) – Which entity types to search for, e.g. [“person”, “location”].
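Accepting data_dir as either a string or a list of path components could be handled along these lines. The helper name `resolve_data_dir` is hypothetical, and the exact way the package calls os.path.join is an assumption.

```python
import os
from typing import List, Union


def resolve_data_dir(data_dir: Union[List[str], str]) -> str:
    """Turn a path given as a string or as components into a single path."""
    if isinstance(data_dir, str):
        return data_dir
    # An empty list would raise a TypeError here; callers should pass
    # at least one component.
    return os.path.join(*data_dir)


as_string = resolve_data_dir("ner_lists/SZSA")
as_parts = resolve_data_dir(["ner_lists", "SZSA"])
```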

static prefill_empty_labels(sentences)
Parameters:

sentences (List[List[Dict]]) – List of lists of words

Returns:

None, the word dicts are modified by adding the “labels” key, if the word has “ner” set to True.
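The in-place prefilling can be sketched as below. The function name `prefill_labels` is hypothetical, and using an empty list as the initial value of “labels” is an assumption based only on the method name.

```python
from typing import Dict, List


def prefill_labels(sentences: List[List[Dict]]) -> None:
    """Add an empty "labels" entry to every word flagged as NER, in place."""
    for sentence in sentences:
        for word in sentence:
            if word.get("ner") is True:
                # setdefault leaves labels that already exist untouched.
                word.setdefault("labels", [])


sentences = [[{"text": "Middelburg", "ner": True}, {"text": "is", "ner": False}]]
prefill_labels(sentences)
```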