lib.ner_lists package¶
Submodules¶
lib.ner_lists.direct_finder module¶
-
class
lib.ner_lists.direct_finder.
DirectFinder
(data_dir, entity_type, word_getter)¶ Bases:
lib.ner_lists.finder.Finder
Finds entities from lists where the searchable forms are not to be permuted nor to be matched fuzzily. An object is instantiated for each different type of entity in the project, e.g. person, location and date.
-
__init__
(data_dir, entity_type, word_getter)¶ - Parameters:
data_dir (
str
) – Path to the data files to use, relative to the pipeline_data directory, e.g. “ner_lists/SZSA”.
entity_type (
str
) – See Finder
word_getter (
str
) – See Finder
-
-
class
lib.ner_lists.direct_finder.
FindEntitiesGivenDirectList
(direct_list)¶ Bases:
lib.ner_lists.finder.FindEntities
Finds entities from a single list where the searchable forms are not to be permuted nor to be matched fuzzily.
-
__call__
(i_sentence, sentence)¶ - Parameters:
i_sentence (
int
) – A number indicating which sentence we are searching.
sentence (
List
[Dict
]) – List of words. Note that this is not a “true sentence” in the sense used throughout the project. It is a subset of all words, depending on the object’s get_words
.
- Return type:
List
[Entity
]
- Returns:
A list of found entities.
-
__init__
(direct_list)¶ - Parameters:
direct_list (
Dict
) – The list to search, given by a dictionary where the keys are the searchables and the values represent the (canonical) information on each entity.
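The exact, unpermuted lookup described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the helper `find_direct`, the `text` key on each word dict, and the shape of the values in `direct_list` are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical direct list: keys are searchables (tuples of words),
# values hold the canonical information on each entity.
direct_list: Dict[Tuple[str, ...], List[Dict]] = {
    ("middelburg",): [{"canonical_form": "Middelburg"}],
    ("fort", "batavia"): [{"canonical_form": "Fort Batavia"}],
}


def find_direct(sentence: List[Dict], direct_list) -> List[Tuple[int, int, Dict]]:
    """Return (begin_word, end_word, info) for every exact match.

    Words are compared in their original order (no permutations)
    and must match a searchable exactly (no fuzzy matching).
    """
    max_len = max((len(key) for key in direct_list), default=0)
    hits = []
    for start in range(len(sentence)):
        for length in range(1, max_len + 1):
            if start + length > len(sentence):
                break
            window = sentence[start:start + length]
            # "text" is an assumed key; the real word dicts may differ.
            key = tuple(word["text"].lower() for word in window)
            for info in direct_list.get(key, []):
                hits.append((start, start + length, info))
    return hits


sentence = [{"text": t} for t in ["Het", "fort", "Batavia", "bij", "Middelburg"]]
hits = find_direct(sentence, direct_list)
```

Because the lookup is a plain dictionary access per candidate window, the cost grows with the sentence length times the longest searchable, independent of the size of the list.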
-
lib.ner_lists.entity module¶
-
class
lib.ner_lists.entity.
Entity
(sentence, begin, end, score, searchable, canonical_form, extra_attributes)¶ Bases:
object
Defines an Entity found in the text, contains all necessary properties as fields.
-
__init__
(sentence, begin, end, score, searchable, canonical_form, extra_attributes)¶ - Parameters:
sentence (
int
) – The number of the sentence in which the entity was found.
begin (
int
) – Begin character of the entity.
end (
int
) – End character of the entity.
score (
float
) – Score of the match.
searchable (
str
) – The “searchable” form of the word(s) that was/were matched.
canonical_form (
str
) – The canonical form of the entity.
extra_attributes (
Dict
[str
,str
]) – Any extra attributes
-
to_dict
(**kwargs)¶ - Parameters:
kwargs – Additional key-value pairs to place on the dict.
- Returns:
A dictionary containing the relevant information on the entity, to be used as data in the output JSON.
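To make the field list and the `**kwargs` behaviour of `to_dict` concrete, here is a minimal sketch. `EntitySketch` is a hypothetical stand-in; which fields the real `to_dict` emits and how they are named in the output is not stated here, so the keys below are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class EntitySketch:
    """Hypothetical stand-in mirroring the documented constructor arguments."""
    sentence: int
    begin: int
    end: int
    score: float
    searchable: str
    canonical_form: str
    extra_attributes: Dict[str, str] = field(default_factory=dict)

    def to_dict(self, **kwargs) -> Dict:
        # Assumed output keys; the real method may select or rename fields.
        data = {
            "sentence": self.sentence,
            "begin": self.begin,
            "end": self.end,
            "score": self.score,
            "searchable": self.searchable,
            "canonical_form": self.canonical_form,
            "extra_attributes": dict(self.extra_attributes),
        }
        # Additional key-value pairs are placed on the dict.
        data.update(kwargs)
        return data


entity = EntitySketch(0, 4, 14, 1.0, "middelburg", "Middelburg",
                      {"geometry": "Point(51.5021 3.6141)"})
as_dict = entity.to_dict(entity_type="location")
```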
-
lib.ner_lists.finder module¶
-
class
lib.ner_lists.finder.
FindEntities
(*args, **kwargs)¶ Bases:
object
Abstract base class for the FindEntitiesGiven… classes
-
abstract
__call__
(*args, **kwargs)¶ - Parameters:
i_sentence – A number indicating which sentence we are searching
sentence – List of words. Note that this is not a “true sentence” in the sense used throughout the project. It is a subset of all words, depending on the object’s
get_words
.
- Returns:
A list of found entities.
-
abstract
__init__
(*args, **kwargs)¶
-
class
lib.ner_lists.finder.
Finder
(entity_type, entity_finders, word_getter)¶ Bases:
object
Finds entities from lists. Derived classes exist for different types of matchings.
-
__call__
(sentences)¶ - Parameters:
sentences (
List
[List
[Dict
]]) – List of lists of words.
- Returns:
None; the words are modified in place.
-
__init__
(entity_type, entity_finders, word_getter)¶ - Parameters:
entity_type (
str
) – The type of entity for which the finder is searching, e.g. “person”.
entity_finders (
Dict
[str
,FindEntities
]) – The entity finders that look through a specific list or tree.
word_getter (
str
) – Passed on to the class GetWords
to determine how to get the list of searchable words.
-
static
create_lists
(dirpath)¶ - Parameters:
dirpath (
str
) – The directory in which to search for files.
- Return type:
Dict
[str
,Dict
]- Returns:
A dictionary where each key-value-pair is a combination of a filename with a corresponding dictionary of searchables.
-
-
lib.ner_lists.finder.
load_file
(dirpath, list_name, extension)¶ - Parameters:
dirpath (
str
) – The directory where the file is located.
list_name (
str
) – The name of the file.
extension (
str
) – The extension of the file
- Return type:
Dict
[Tuple
[str
, …],List
[Dict
]]
- Returns:
A dictionary in which each key is a searchable, pointing to a list of dictionaries containing the keys “searchable”, “canonical_form” and “extra_attributes”. The first is given by a tuple of words (to allow reordering of the elements later), the second as a string and the third as a dictionary.
Example:
    dict_searchables[('middelburg',)] = [
        {
            'searchable': ('middelburg',),
            'canonical_form': 'Maspeth',
            'extra_attributes': {'geometry': 'Point(40.75 -73.9333)'}
        },
        {
            'searchable': ('middelburg',),
            'canonical_form': 'Middelburg',
            'extra_attributes': {'geometry': 'Point(51.5021 3.6141)'}
        },
        {
            'searchable': ('middelburg',),
            'canonical_form': 'Middelburg',
            'extra_attributes': {'geometry': 'Point(0 0)'}
        }
    ]
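A structure of this shape can be built by grouping entries on their searchable tuple. The sketch below assumes the entries have already been read into a list of row dicts; the on-disk file format is not described here, and `group_by_searchable` is a hypothetical helper, not the package's `load_file`.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_searchable(rows: List[Dict]) -> Dict[Tuple[str, ...], List[Dict]]:
    """Group rows under the tuple of lower-cased words of their searchable."""
    grouped = defaultdict(list)
    for row in rows:
        # A tuple of words keeps the elements separate, so that they
        # can be reordered later, and is hashable as a dictionary key.
        searchable = tuple(row["searchable"].lower().split())
        grouped[searchable].append({
            "searchable": searchable,
            "canonical_form": row["canonical_form"],
            "extra_attributes": row.get("extra_attributes", {}),
        })
    return dict(grouped)


# Hypothetical, already-parsed rows.
rows = [
    {"searchable": "Middelburg", "canonical_form": "Middelburg",
     "extra_attributes": {"geometry": "Point(51.5021 3.6141)"}},
    {"searchable": "middelburg", "canonical_form": "Maspeth",
     "extra_attributes": {"geometry": "Point(40.75 -73.9333)"}},
    {"searchable": "Fort Batavia", "canonical_form": "Fort Batavia"},
]
dict_searchables = group_by_searchable(rows)
```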
lib.ner_lists.fuzzy_finder module¶
-
class
lib.ner_lists.fuzzy_finder.
FindEntitiesGivenList
(fuzzy_list, cutoff_score)¶ Bases:
lib.ner_lists.finder.FindEntities
Finds entities from a single list where the searchable forms are not to be permuted but are to be matched fuzzily.
-
__call__
(i_sentence, sentence)¶ - Parameters:
i_sentence (
int
) – A number indicating which sentence we are searching.
sentence (
List
[Dict
]) – List of words. Note that this is not a “true sentence” in the sense used throughout the project. It is a subset of all words, depending on the object’s get_words
.
- Return type:
List
[Entity
]
- Returns:
A list of found entities.
-
__init__
(fuzzy_list, cutoff_score)¶ - Parameters:
fuzzy_list (
Dict
) – The list to search, given by a dictionary where the keys are the searchables and the values represent the (canonical) information on each entity.
cutoff_score (
float
) – The lowest score for entities to still be considered a hit.
-
class
lib.ner_lists.fuzzy_finder.
FuzzyFinder
(data_dir, entity_type, word_getter, cutoff_score)¶ Bases:
lib.ner_lists.finder.Finder
Finds entities from lists where the searchable forms are not to be permuted but are to be matched fuzzily. An object is instantiated for each different type of entity in the project, e.g. person, location and date.
-
__init__
(data_dir, entity_type, word_getter, cutoff_score)¶ - Parameters:
data_dir (
str
) – Path to the data files to use, relative to the pipeline_data directory, e.g. “ner_lists/SZSA”.
entity_type (
str
) – See Finder
word_getter (
str
) – See Finder
cutoff_score (
float
) – The lowest score for entities to still be considered a hit.
-
lib.ner_lists.fuzzy_matcher module¶
-
class
lib.ner_lists.fuzzy_matcher.
FuzzyListMatcher
(fuzzy_list, cutoff_score)¶ Bases:
lib.ner_lists.fuzzy_matcher.FuzzyMatcher
Finds matches given a “list”, a list of “searchables” grouped by a key element of the searchable. The key element is searched for first, after which different orderings of the match are considered.
-
__init__
(fuzzy_list, cutoff_score)¶ - Parameters:
fuzzy_list (
Dict
[str
,Dict
]) – The list through which to search for entities.
cutoff_score (
float
) – Score below which to disregard matches.
-
find_matches
(i_sentence, sentence, n_approx=2)¶ - Parameters:
i_sentence (
int
) – An index of the sentence, used to match it to the correct place later on.
sentence (
List
[Dict
]) – List of words to match against the tree
n_approx (
int
) – The approximate expected number of words in a hit. It influences the threshold for the first hit. Setting it higher results in a higher threshold, causing faster performance but possibly missed hits. Setting it too low results in poor performance.
- Returns:
Yields entities as they are found
-
yield_matches
(i_sentence, sentence, i)¶ - Parameters:
i_sentence (
int
) – string index
sentence (
List
[Dict
]) – list of words
i (
int
) – index of word in sentence (which matches the key of an element in the tree-dictionary)
-
-
class
lib.ner_lists.fuzzy_matcher.
FuzzyMatcher
(cutoff_score)¶ Bases:
object
Class to match words with list elements
-
__init__
(cutoff_score)¶ - Parameters:
cutoff_score (
float
) – Score below which to disregard matches.
-
static
get_close_matches_scores
(word_group, possibilities, n, cutoff)¶ - Parameters:
word_group – a string, might be multiple words concatenated without spaces.
possibilities – list of strings. word_group will be compared to all strings in this list.
n – positive integer. the top-n matches will be returned.
cutoff – percentage of characters that should match before something is considered a match
- Returns:
Either an empty list (no match/hit) or a size-2 tuple with the score and the word_group string.
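The idea of returning the top-n candidates together with their scores can be sketched with the standard library's `difflib.SequenceMatcher`. Whether the package uses `difflib` is an assumption, and the hypothetical helper `close_matches_with_scores` returns a list of `(score, candidate)` pairs, which may differ from the documented return shape.

```python
from difflib import SequenceMatcher
from heapq import nlargest
from typing import List, Tuple


def close_matches_with_scores(word_group: str, possibilities: List[str],
                              n: int, cutoff: float) -> List[Tuple[float, str]]:
    """Return up to n (score, candidate) pairs with score >= cutoff, best first."""
    scored = []
    for candidate in possibilities:
        # ratio() is 2 * matches / total length, a value between 0 and 1.
        ratio = SequenceMatcher(None, word_group, candidate).ratio()
        if ratio >= cutoff:
            scored.append((ratio, candidate))
    return nlargest(n, scored)


matches = close_matches_with_scores(
    "fortbatavia", ["fortbatavia", "fortbatavja", "amsterdam"], n=2, cutoff=0.8
)
```

An empty list signals that no candidate reached the cutoff.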
-
score
(word_group, word_list, cutoff_score=None)¶ - Parameters:
cutoff_score – threshold for the percentage of characters that should match between word_group and an element of word_list
word_group – a string, which is a concatenated list of words that are to be matched
word_list – a list of (space-less concatenated) words that are compared to the word_group. Elements might be permutations of others, e.g. fortbatavia and bataviafort.
- Returns:
A number between 0 and 1 (inclusive): the fraction of characters that match in the best match (from the list).
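As an illustration of scoring a word group against permuted, space-less candidates, the sketch below builds the candidates with `itertools.permutations` and takes the best `difflib` ratio. The helper `best_score` is hypothetical, and returning 0.0 for scores below the cutoff is an assumption about behaviour that is not specified here.

```python
from difflib import SequenceMatcher
from itertools import permutations
from typing import List


def best_score(word_group: str, word_list: List[str],
               cutoff_score: float = 0.0) -> float:
    """Return the best similarity ratio in [0, 1], or 0.0 if below the cutoff."""
    best = max(
        (SequenceMatcher(None, word_group, word).ratio() for word in word_list),
        default=0.0,
    )
    # Assumed behaviour: scores under the cutoff are reported as no match.
    return best if best >= cutoff_score else 0.0


# Every ordering of the words, concatenated without spaces.
word_list = ["".join(p) for p in permutations(["fort", "batavia"])]
result = best_score("bataviafort", word_list, cutoff_score=0.8)
```

The number of permutations grows factorially with the number of words, so this only stays cheap for short searchables.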
-
-
class
lib.ner_lists.fuzzy_matcher.
FuzzyTreeMatcher
(tree, cutoff_score)¶ Bases:
lib.ner_lists.fuzzy_matcher.FuzzyMatcher
Finds matches given a “tree”, where a “tree” is a list of “searchables” grouped by a key element of the searchable. The key element is searched for first, after which different orderings of the match are considered.
-
__init__
(tree, cutoff_score)¶ - Parameters:
tree (
Dict
[str
,Dict
]) – The tree through which to search for entities, each key points to a list of “searchables” containing that key as one of the elements.
cutoff_score (
float
) – Score below which to disregard matches.
-
find_matches
(i_sentence, sentence, n_approx=2)¶ - Parameters:
i_sentence (
int
) – An index of the sentence, used to match it to the correct place later on.
sentence (
List
[Dict
]) – List of words to match against the tree
n_approx (
int
) – The approximate expected number of words in a hit. It influences the threshold for the first hit. Setting it higher results in a higher threshold, causing faster performance but possibly missed hits. Setting it too low results in poor performance.
- Returns:
Yields entities as they are found
-
yield_matches
(i_sentence, sentence, i, entity_key)¶ - Parameters:
i_sentence (
int
) – string index
sentence (
List
[Dict
]) – list of words
i (
int
) – index of word in sentence (which matches the key of an element in the tree-dictionary)
entity_key (
str
) – The tree is essentially a large dictionary. entity_key is the key of the most likely entity
- Returns:
Yields entities as they are found
-
lib.ner_lists.fuzzy_permutative_finder module¶
-
class
lib.ner_lists.fuzzy_permutative_finder.
FindEntitiesGivenTree
(tree, cutoff_score)¶ Bases:
lib.ner_lists.finder.FindEntities
-
__call__
(i_sentence, sentence)¶ - Parameters:
i_sentence – A number indicating which sentence we are searching
sentence – List of words. Note that this is not a “true sentence” in the sense used throughout the project. It is a subset of all words, depending on the object’s
get_words
.
- Return type:
List
[Entity
]- Returns:
A list of found entities.
-
__init__
(tree, cutoff_score)¶ - Parameters:
tree – The list to search, given by a dictionary where the keys are the searchables and the values represent the (canonical) information on each entity.
cutoff_score – The lowest score for entities to still be considered a hit.
-
-
class
lib.ner_lists.fuzzy_permutative_finder.
FuzzyPermutativeFinder
(data_dir, entity_type, word_getter, cutoff_score)¶ Bases:
lib.ner_lists.finder.Finder
Finds entities from lists where the searchable forms are to be permuted and to be matched fuzzily. An object is instantiated for each different type of entity in the project, e.g. person, location and date.
-
__init__
(data_dir, entity_type, word_getter, cutoff_score)¶ - Parameters:
data_dir (
str
) – Path to the data files to use, relative to the pipeline_data directory, e.g. “ner_lists/SZSA”.
entity_type (
str
) – See Finder
word_getter (
str
) – See Finder
cutoff_score (
float
) – The lowest score for entities to still be considered a hit.
-
-
lib.ner_lists.fuzzy_permutative_finder.
create_freq_table
(all_list)¶ - Parameters:
all_list (
Dict
) – A dictionary of searchable-canonical pairs.
- Returns:
A dictionary stating how often each searchable element occurs in the list.
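Counting how often each searchable element occurs can be sketched with `collections.Counter`. The helper `freq_table` is hypothetical, and it assumes that the keys of `all_list` are tuples of words, as in the `load_file` example.

```python
from collections import Counter
from typing import Dict, Tuple


def freq_table(all_list: Dict[Tuple[str, ...], object]) -> Dict[str, int]:
    """Count in how many searchable positions each word occurs."""
    counts = Counter()
    for searchable in all_list:
        # Each searchable is assumed to be a tuple of words.
        counts.update(searchable)
    return dict(counts)


# Hypothetical searchable-canonical pairs.
all_list = {
    ("fort", "batavia"): "Fort Batavia",
    ("batavia",): "Batavia",
    ("fort", "zeelandia"): "Fort Zeelandia",
}
table = freq_table(all_list)
```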
-
lib.ner_lists.fuzzy_permutative_finder.
create_tree
(all_list)¶ - Parameters:
all_list (
Dict
) – A dictionary of searchable-canonical pairs.
- Return type:
Dict
[str
,Dict
]
- Returns:
A dictionary where the keys are the first word that would be searched, the values are nested dicts of the same form as the input, i.e. searchable-canonical pairs.
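The grouping into a tree can be sketched as below. Which word of a searchable is "searched first" is not specified here; this sketch assumes it is the least frequent word (ties broken alphabetically), which is only one plausible choice. `build_tree` is a hypothetical helper, not the package's `create_tree`.

```python
from collections import Counter, defaultdict
from typing import Dict, Tuple


def build_tree(all_list: Dict[Tuple[str, ...], str]) -> Dict[str, Dict]:
    """Group searchable-canonical pairs under one key word per searchable."""
    freq = Counter(word for searchable in all_list for word in searchable)
    tree = defaultdict(dict)
    for searchable, canonical in all_list.items():
        # Assumption: the rarest word is the most selective one to search first.
        key = min(searchable, key=lambda word: (freq[word], word))
        tree[key][searchable] = canonical
    return dict(tree)


# Hypothetical searchable-canonical pairs.
all_list = {
    ("fort", "batavia"): "Fort Batavia",
    ("batavia",): "Batavia",
    ("fort", "zeelandia"): "Fort Zeelandia",
}
tree = build_tree(all_list)
```

Each value has the same form as the input, so a matcher can first locate a key word in the sentence and then only compare against the searchables stored under that key.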
-
lib.ner_lists.fuzzy_permutative_finder.
remove_overlap
(entities, sentence_length)¶
lib.ner_lists.get_searchable_words module¶
-
class
lib.ner_lists.get_searchable_words.
GetWords
(**kwargs)¶ Bases:
object
Abstract base class for the GetWordsNer and GetWordsBert classes
-
abstract
__call__
(sentences)¶
-
abstract
__init__
(**kwargs)¶
-
classmethod
from_string
(getter_name)¶ - Parameters:
getter_name (
str
) – Name of the derived class, either “NER” or “BERT”.
- Returns:
The uninstantiated class of the desired kind.
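A classmethod that maps a name to an uninstantiated subclass can be sketched as follows. All class names ending in `Sketch` are hypothetical stand-ins, and raising `ValueError` for an unknown name is an assumption; the real behaviour for unsupported names is not documented here.

```python
from abc import ABC, abstractmethod


class GetWordsSketch(ABC):
    """Hypothetical stand-in for the abstract base class."""

    @abstractmethod
    def __call__(self, sentences):
        ...

    @classmethod
    def from_string(cls, getter_name: str):
        # The registry is resolved at call time, after the subclasses exist.
        registry = {"NER": GetWordsNerSketch, "BERT": GetWordsBertSketch}
        try:
            return registry[getter_name]
        except KeyError:
            raise ValueError(f"Unknown word getter: {getter_name!r}") from None


class GetWordsNerSketch(GetWordsSketch):
    def __call__(self, sentences):
        return sentences  # placeholder body for the sketch


class GetWordsBertSketch(GetWordsSketch):
    def __call__(self, sentences):
        return sentences  # placeholder body for the sketch


# The class is returned uninstantiated; the caller constructs it.
getter_cls = GetWordsSketch.from_string("NER")
getter = getter_cls()
```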
-
class
lib.ner_lists.get_searchable_words.
GetWordsBert
(entity_type, **kwargs)¶ Bases:
lib.ner_lists.get_searchable_words.GetWords
-
__call__
(sentences)¶ - Parameters:
sentences – list of lists. Each element is a dict that represents a word. The intermediate level lists represent a sentence each. Spaces and commas are still considered words at this level.
- Returns:
found_entities. A list of lists. Each intermediate list represents a BERT-found entity of type self.type_of_list (e.g. a location)
-
__init__
(entity_type, **kwargs)¶ - Parameters:
entity_type – The type of entity (e.g. location or person) which should be searched.
kwargs – any additional keyword arguments are passed to GetWords
-
-
class
lib.ner_lists.get_searchable_words.
GetWordsNer
(**kwargs)¶ Bases:
lib.ner_lists.get_searchable_words.GetWords
-
__call__
(sentences)¶ - Returns:
A list of lists, where the inner lists contain all words that have word[‘ner’] == True for a sentence.
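The filtering described above amounts to a nested list comprehension. In this sketch the `text` key is an assumed field of the word dicts and `words_with_ner` is a hypothetical helper.

```python
from typing import Dict, List


def words_with_ner(sentences: List[List[Dict]]) -> List[List[Dict]]:
    """Keep, per sentence, only the words whose 'ner' flag is set."""
    return [
        [word for word in sentence if word.get("ner")]
        for sentence in sentences
    ]


# Hypothetical input: two sentences of word dicts.
sentences = [
    [{"text": "Jan", "ner": True}, {"text": "woont", "ner": False}],
    [{"text": "in"}, {"text": "Middelburg", "ner": True}],
]
searchable_words = words_with_ner(sentences)
```

The outer list keeps one entry per sentence, so sentence indices stay aligned with the input even when a sentence contains no flagged words.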
-
__init__
(**kwargs)¶ - Parameters:
kwargs – any additional keyword arguments are passed to GetWords
-
lib.ner_lists.ner_lists module¶
-
class
lib.ner_lists.ner_lists.
NerLists
(data_dir, word_getter, cutoff_score, entity_types)¶ Bases:
object
Find entities by comparing words with lists
-
__call__
(sentences)¶ - Parameters:
sentences (
List
[List
[Dict
]]) – List of lists of words.
- Returns:
The same sentences, adorned with information found in various types of lists.
-
__init__
(data_dir, word_getter, cutoff_score, entity_types)¶ - Parameters:
data_dir (
Union
[List
[str
],str
]) – Path of the lists, given as either a string or a list of strings; it is processed using os.path.join.
word_getter (
str
) – Indicates how the words to be searched through are obtained. Currently only “NER” and “BERT” are supported.
cutoff_score (
float
) – The cut-off score below which matches will not be returned.
entity_types (
List
[str
]) – Which entity types to search for, e.g. [“person”, “location”].
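Accepting `data_dir` as either a string or a list of strings can be sketched as below. `resolve_data_dir` is a hypothetical helper; how the package actually combines the path with the pipeline_data directory is not shown here.

```python
import os
from typing import List, Union


def resolve_data_dir(data_dir: Union[List[str], str]) -> str:
    """Turn a string or a list of path components into a single path."""
    if isinstance(data_dir, str):
        return data_dir
    # A list of components is joined with the platform's separator.
    return os.path.join(*data_dir)


from_list = resolve_data_dir(["ner_lists", "SZSA"])
from_string = resolve_data_dir("ner_lists/SZSA")
```

Passing a list avoids hard-coding a path separator in configuration files.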
-
static
prefill_empty_labels
(sentences)¶ - Parameters:
sentences (
List
[List
[Dict
]]) – List of lists of words.
- Returns:
None, the word dicts are modified by adding the “labels” key, if the word has “ner” set to True.
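The in-place prefill can be sketched as follows. Using an empty list as the initial value of “labels” is an assumption, as is the `text` key; `prefill_labels` is a hypothetical helper.

```python
from typing import Dict, List


def prefill_labels(sentences: List[List[Dict]]) -> None:
    """Add an empty 'labels' entry to every word that has 'ner' set."""
    for sentence in sentences:
        for word in sentence:
            if word.get("ner"):
                # setdefault keeps labels that are already present.
                word.setdefault("labels", [])


sentences = [
    [{"text": "Jan", "ner": True}, {"text": "woont", "ner": False}],
    [{"text": "Middelburg", "ner": True, "labels": ["location"]}],
]
prefill_labels(sentences)
```

Since the word dicts are mutated in place, the function returns nothing and the caller keeps working with the same objects.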
-