API reference
- class neofuzz.process.Process(vectorizer, metric='cosine', metric_kwds=None, n_neighbors=30, n_trees=None, leaf_size=None, pruning_degree_multiplier=1.5, diversify_prob=1.0, n_search_trees=1, tree_init=True, init_graph=None, init_dist=None, random_state=None, low_memory=True, max_candidates=None, n_iters=None, delta=0.001, n_jobs=None, compressed=False, parallel_batch_queries=False, verbose=False)
TheFuzz-compatible process class for quickly searching options. Apart from the vectorizer, all parameters refer to the approximate nearest neighbour search.
- Parameters:
vectorizer (sklearn vectorizer) – Some kind of vectorizer model that can vectorize strings. You could use tf-idf, bow or even a Pipeline that has multiple steps.
metric (string or callable, default 'cosine') – The metric to use for computing nearest neighbors. If a callable is used it must be a numba njit compiled function. Supported metrics include:
euclidean
manhattan
chebyshev
minkowski
canberra
braycurtis
mahalanobis
wminkowski
seuclidean
cosine
correlation
haversine
hamming
jaccard
dice
russelrao
kulsinski
rogerstanimoto
sokalmichener
sokalsneath
yule
hellinger
wasserstein-1d
Metrics that take arguments (such as minkowski, mahalanobis etc.) can have arguments passed via the metric_kwds dictionary. At this time care must be taken and dictionary elements must be ordered appropriately; this will hopefully be fixed in the future.
metric_kwds (dict, default None) – Arguments to pass on to the metric, such as the p value for Minkowski distance.
n_neighbors (int, default 30) – The number of neighbors to use in the k-neighbor graph data structure used for fast approximate nearest neighbor search. Larger values will result in more accurate search results at the cost of computation time.
n_trees (int, default None) – This implementation uses random projection forests for initializing the index build process. This parameter controls the number of trees in that forest. A larger number will result in more accurate neighbor computation at the cost of performance. The default of None means a value will be chosen based on the size of the data.
leaf_size (int, default None) – The maximum number of points in a leaf for the random projection trees. The default of None means a value will be chosen based on n_neighbors.
pruning_degree_multiplier (float, default 1.5) – How aggressively to prune the graph. Since the search graph is undirected (and thus includes nearest neighbors and reverse nearest neighbors), vertices can have very high degree. The graph will be pruned such that no vertex has degree greater than pruning_degree_multiplier * n_neighbors.
diversify_prob (float, default 1.0) – The search graph gets “diversified” by removing potentially unnecessary edges. This controls the volume of edges removed. A value of 0.0 ensures that no edges get removed, and larger values result in significantly more aggressive edge removal. A value of 1.0 will prune all edges that it can.
tree_init (bool, default True) – Whether to use random projection trees for initialization.
init_graph (np.ndarray, default None) – 2D array of indices of candidate neighbours of the shape (data.shape[0], n_neighbours). If the j-th neighbour of the i-th instance is unknown, use init_graph[i, j] = -1.
init_dist (np.ndarray, default None) – 2D array with the same shape as init_graph, such that metric(data[i], data[init_graph[i, j]]) equals init_dist[i, j].
random_state (int, RandomState instance or None, default None) – If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
algorithm (str, default 'standard') – This implementation provides an alternative algorithm for construction of the k-neighbors graph used as a search index. The alternative algorithm can be fast for large n_neighbors values. The 'alternative' algorithm has been deprecated and is no longer available.
low_memory (boolean, default True) – Whether to use a lower memory, but more computationally expensive approach to index construction.
max_candidates (int, default None) – Internally each “self-join” keeps a maximum number of candidates (nearest neighbors and reverse nearest neighbors) to be considered. This value controls this aspect of the algorithm. Larger values will provide more accurate search results later, but potentially at non-negligible computation cost in building the index. Don’t tweak this value unless you know what you’re doing.
n_iters (int, default None) – The maximum number of NN-descent iterations to perform. The NN-descent algorithm can abort early if limited progress is being made, so this only controls the worst case. Don’t tweak this value unless you know what you’re doing. The default of None means a value will be chosen based on the size of the data.
delta (float, default 0.001) – Controls the early abort due to limited progress. Larger values will result in earlier aborts, providing less accurate indexes, and less accurate searching. Don’t tweak this value unless you know what you’re doing.
n_jobs (int or None, default None) – The number of parallel jobs to run for neighbors index construction. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
compressed (bool, default False) – Whether to prune out data not needed for searching the index. This will result in a significantly smaller index, particularly useful for saving, but will remove information that might otherwise be useful.
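As a rough sketch of how these parameters fit together, the snippet below builds a `Process` around a character-level scikit-learn `TfidfVectorizer` and evaluates the degree cap described under `pruning_degree_multiplier`. The helper names `max_vertex_degree` and `build_process` are hypothetical, the vectorizer settings are just one plausible configuration, and the library imports are deferred so that the arithmetic helper runs even without neofuzz or scikit-learn installed.

```python
import math


def max_vertex_degree(n_neighbors: int = 30,
                      pruning_degree_multiplier: float = 1.5) -> int:
    """Upper bound on vertex degree after pruning, per the parameter docs.

    Hypothetical helper: it only evaluates
    pruning_degree_multiplier * n_neighbors, rounded down.
    """
    return math.floor(pruning_degree_multiplier * n_neighbors)


def build_process(n_neighbors: int = 30, metric: str = "cosine"):
    """Build a Process around a character n-gram tf-idf vectorizer.

    Hypothetical helper; imports are deferred so this module can be
    loaded without neofuzz or scikit-learn installed.
    """
    from sklearn.feature_extraction.text import TfidfVectorizer
    from neofuzz.process import Process

    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 5))
    # Everything besides the vectorizer configures the nearest neighbour search.
    return Process(vectorizer, metric=metric, n_neighbors=n_neighbors)


print(max_vertex_degree())  # 45 with the defaults: floor(1.5 * 30)
```

Raising `n_neighbors` therefore raises the degree cap proportionally, which is one reason larger values cost more computation.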
- index(options: Iterable[str])
Indexes all options for fast querying.
- Parameters:
options (iterable of str) – All options in which we want to search.
- query(search_terms: Iterable[str], limit: int = 10) -> Tuple[ndarray, ndarray]
Searches for the given terms in the options.
- Parameters:
search_terms (iterable of str) – Terms to search for.
limit (int, default 10) – Number of closest matches to return.
- Returns:
indices (array of shape (n_search_terms, limit)) – Indices of the closest options to each search term.
distances (array of shape (n_search_terms, limit)) – Distances from the closest options to each search term.
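Since `query` returns positions rather than strings, a small wrapper can map them back to the indexed options. The following is a minimal sketch, assuming `process` is any object exposing the `index` and `query` methods documented here (for example one created with `neofuzz.process.char_ngram_process()`); the function name `search_options` is hypothetical.

```python
from typing import Iterable, List, Tuple


def search_options(process, options: Iterable[str],
                   search_terms: Iterable[str],
                   limit: int = 10) -> List[List[Tuple[str, float]]]:
    """Index `options`, then return (option, distance) pairs per search term.

    Hypothetical helper built only on the documented index() and query().
    """
    options = list(options)  # keep a concrete list so positions stay valid
    process.index(options)
    indices, distances = process.query(search_terms, limit=limit)

    results = []
    for row_indices, row_distances in zip(indices, distances):
        # Row k holds the `limit` closest options for the k-th search term.
        results.append([
            (options[int(i)], float(d))
            for i, d in zip(row_indices, row_distances)
        ])
    return results
```

Smaller distances mean closer matches. The sketch assumes `limit` does not exceed the number of indexed options; behaviour beyond that is not described here.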
- extract(query: str, choices: Optional[Iterable[str]] = None, limit: int = 10) -> List[Tuple[str, int]]
TheFuzz-compatible querying.
- Parameters:
query (str) – Query string to search for.
choices (iterable of str, default None) – Choices to iterate through. If the options are already indexed, this parameter is ignored; otherwise it will be used for indexing.
limit (int, default 10) – Number of results to return.
- Returns:
List of closest terms and their similarity to the query term.
- Return type:
list of (str, int)
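Because `extract` returns `(term, similarity)` pairs, results can be filtered with ordinary list operations. A minimal sketch, assuming `process` has already been indexed and that a higher score means a closer match; the helper name `confident_matches` and the cut-off of 80 are illustrative choices, not part of neofuzz.

```python
from typing import List, Tuple


def confident_matches(process, query: str, min_score: int = 80,
                      limit: int = 10) -> List[Tuple[str, int]]:
    """Return only the extract() results scoring at least `min_score`.

    Hypothetical helper; `process` is assumed to be indexed already.
    """
    matches = process.extract(query, limit=limit)
    return [(term, score) for term, score in matches if score >= min_score]
```

A suitable threshold depends on the vectorizer and metric in use, so it is worth checking against a handful of known matches first.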
- extractOne(query: str, choices: Optional[Iterable[str]] = None) -> Tuple[str, int]
TheFuzz-compatible extraction of one item.
- Parameters:
query (str) – Query string to search for.
choices (iterable of str, default None) – Choices to iterate through. If the options are already indexed, this parameter is ignored; otherwise it will be used for indexing.
- Returns:
result (str) – Closest term to given search term.
score (int) – Similarity score.
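A common use of `extractOne` is mapping messy input onto a fixed vocabulary. The sketch below assumes `process` was indexed on the canonical terms beforehand; the helper name `canonicalize` and the `min_score` cut-off are hypothetical.

```python
from typing import Dict, Iterable, Optional


def canonicalize(process, raw_values: Iterable[str],
                 min_score: int = 0) -> Dict[str, Optional[str]]:
    """Map each raw value to its closest indexed term.

    Hypothetical helper: values whose best score falls below
    `min_score` map to None instead of a weak match.
    """
    mapping: Dict[str, Optional[str]] = {}
    for value in raw_values:
        best_term, score = process.extractOne(value)
        mapping[value] = best_term if score >= min_score else None
    return mapping
```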
- ratio(s1: str, s2: str) -> int
Calculates similarity of two strings.
- Parameters:
s1 (str) – First string.
s2 (str) – Second string.
- Returns:
Similarity of the two strings (1-100).
- Return type:
int
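To build intuition for what an integer similarity score between two strings can measure, here is a plain-Python sketch that compares character n-gram counts with cosine similarity and scales the result to an integer. It is an illustration only: the function names are hypothetical, and this section does not state how `ratio` is computed internally, so the numbers it produces should not be expected to match neofuzz.

```python
import math
from collections import Counter
from typing import Tuple


def char_ngrams(text: str, ngram_range: Tuple[int, int] = (1, 2)) -> Counter:
    """Count all character n-grams of `text` with n in the given range."""
    low, high = ngram_range
    counts: Counter = Counter()
    for n in range(low, high + 1):
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts


def ngram_similarity(s1: str, s2: str,
                     ngram_range: Tuple[int, int] = (1, 2)) -> int:
    """Cosine similarity of n-gram counts, scaled to an integer 0-100."""
    a = char_ngrams(s1, ngram_range)
    b = char_ngrams(s2, ngram_range)
    dot = sum(a[gram] * b[gram] for gram in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    if norm == 0:
        return 0  # at least one string produced no n-grams
    return round(100 * dot / norm)
```

Identical strings score 100 and strings sharing no n-grams score 0; everything else lands in between.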
- to_disk(filename: Union[str, Path])
Persists indexed process to disk.
- Parameters:
filename (str or Path) – File path to save the process to, e.g. process.joblib
- static from_disk(filename: Union[str, Path])
Loads indexed process from disk.
- Parameters:
filename (str or Path) – File path to load the process from, e.g. process.joblib
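Taken together, `to_disk` and `from_disk` allow an index to be built once and reused later. A minimal round-trip sketch, assuming `process` is an indexed object exposing both methods as documented; the helper name `save_and_reload` is hypothetical.

```python
from pathlib import Path
from typing import Union


def save_and_reload(process, filename: Union[str, Path]):
    """Persist an indexed process, then load it back from the same file.

    Hypothetical helper: from_disk is documented as static, so it is
    called on the class of `process` rather than on the instance.
    """
    path = Path(filename)
    process.to_disk(path)
    return type(process).from_disk(path)
```

In practice the two calls usually live in different programs: one builds and saves the index, the other loads it at start-up and only queries.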
- neofuzz.process.char_ngram_process(ngram_range: Tuple[int, int] = (1, 5), tf_idf: bool = True, metric: str = 'cosine') -> Process
Basic character n-gram based fuzzy search process.
- Parameters:
ngram_range (tuple of (int, int), default (1, 5)) – Lower and upper boundary of n-values for the character n-grams.
tf_idf (bool, default True) – Flag signifying whether the features should be tf-idf weighted.
metric (str, default 'cosine') – Distance metric to use for fuzzy search.
- Returns:
Fuzzy search process.
- Return type: