API reference

class neofuzz.process.Process(vectorizer, metric='euclidean', metric_kwds=None, n_neighbors=30, n_trees=None, leaf_size=None, pruning_degree_multiplier=1.5, diversify_prob=1.0, n_search_trees=1, tree_init=True, init_graph=None, init_dist=None, random_state=None, low_memory=True, max_candidates=None, n_iters=None, delta=0.001, n_jobs=None, compressed=False, parallel_batch_queries=False, verbose=False)

TheFuzz-compatible process class for quickly searching a set of options. Apart from the vectorizer, all parameters refer to the approximate nearest neighbour search.

Parameters:
  • vectorizer (sklearn vectorizer) – Some kind of vectorizer model that can vectorize strings. You could use tf-idf, bow or even a Pipeline that has multiple steps.

  • metric (string or callable, default 'euclidean') –

    The metric to use for computing nearest neighbors. If a callable is used it must be a numba njit compiled function. Supported metrics include:

    • euclidean

    • manhattan

    • chebyshev

    • minkowski

    • canberra

    • braycurtis

    • mahalanobis

    • wminkowski

    • seuclidean

    • cosine

    • correlation

    • haversine

    • hamming

    • jaccard

    • dice

    • russelrao

    • kulsinski

    • rogerstanimoto

    • sokalmichener

    • sokalsneath

    • yule

    • hellinger

    • wasserstein-1d

    Metrics that take arguments (such as minkowski, mahalanobis etc.) can have arguments passed via the metric_kwds dictionary. At this time care must be taken and dictionary elements must be ordered appropriately; this will hopefully be fixed in the future.

  • metric_kwds (dict, default None) – Arguments to pass on to the metric, such as the p value for Minkowski distance.

  • n_neighbors (int, default 30) – The number of neighbors to use in the k-neighbor graph data structure used for fast approximate nearest neighbor search. Larger values will result in more accurate search results at the cost of computation time.

  • n_trees (int, default None) – This implementation uses random projection forests for initializing the index build process. This parameter controls the number of trees in that forest. A larger number will result in more accurate neighbor computation at the cost of performance. The default of None means a value will be chosen based on the size of the data.

  • leaf_size (int, default None) – The maximum number of points in a leaf for the random projection trees. The default of None means a value will be chosen based on n_neighbors.

  • pruning_degree_multiplier (float, default 1.5) – How aggressively to prune the graph. Since the search graph is undirected (and thus includes nearest neighbors and reverse nearest neighbors) vertices can have very high degree – the graph will be pruned such that no vertex has degree greater than pruning_degree_multiplier * n_neighbors.

  • diversify_prob (float, default 1.0) – The search graph gets “diversified” by removing potentially unnecessary edges. This controls the volume of edges removed. A value of 0.0 ensures that no edges get removed, and larger values result in significantly more aggressive edge removal. A value of 1.0 will prune all edges that it can.

  • tree_init (bool, default True) – Whether to use random projection trees for initialization.

  • init_graph (np.ndarray, default None) – 2D array of indices of candidate neighbours of the shape (data.shape[0], n_neighbours). If the j-th neighbour of the i-th instance is unknown, use init_graph[i, j] = -1.

  • init_dist (np.ndarray, default None) – 2D array with the same shape as init_graph, such that metric(data[i], data[init_graph[i, j]]) equals init_dist[i, j].

  • random_state (int, RandomState instance or None, default None) – If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

  • algorithm (str, default 'standard') – This implementation provides an alternative algorithm for construction of the k-neighbors graph used as a search index. The alternative algorithm can be fast for large n_neighbors values. The 'alternative' algorithm has been deprecated and is no longer available.

  • low_memory (boolean, default True) – Whether to use a lower memory, but more computationally expensive approach to index construction.

  • max_candidates (int, default None) – Internally each “self-join” keeps a maximum number of candidates (nearest neighbors and reverse nearest neighbors) to be considered. This value controls this aspect of the algorithm. Larger values will provide more accurate search results later, but potentially at non-negligible computation cost in building the index. Don’t tweak this value unless you know what you’re doing.

  • n_iters (int, default None) – The maximum number of NN-descent iterations to perform. The NN-descent algorithm can abort early if limited progress is being made, so this only controls the worst case. Don’t tweak this value unless you know what you’re doing. The default of None means a value will be chosen based on the size of the data.

  • delta (float, default 0.001) – Controls the early abort due to limited progress. Larger values will result in earlier aborts, providing less accurate indexes, and less accurate searching. Don’t tweak this value unless you know what you’re doing.

  • n_jobs (int or None, default None) – The number of parallel jobs to run for neighbors index construction. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

  • compressed (bool, default False) – Whether to prune out data not needed for searching the index. This will result in a significantly smaller index, particularly useful for saving, but will remove information that might otherwise be useful.
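
As a rough illustration of what a metric argument such as p means for the Minkowski distance, here is a minimal standard-library sketch. The `minkowski` function below is a hypothetical helper written for this example, not neofuzz's or the nearest neighbour backend's implementation; it only shows how a dictionary of keyword arguments, in the spirit of metric_kwds, changes the distance that is computed.

```python
def minkowski(x, y, p=2):
    """Minkowski distance of order p between two equal-length vectors.

    Illustrative helper only; not part of neofuzz.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Keyword arguments for the metric, in the spirit of metric_kwds.
metric_kwds = {"p": 1}

x = [0.0, 0.0]
y = [3.0, 4.0]

# With p=1 the distance is the sum of absolute differences (Manhattan).
manhattan_like = minkowski(x, y, **metric_kwds)
# With p=2 the distance is the ordinary Euclidean distance.
euclidean_like = minkowski(x, y, p=2)

print(manhattan_like, euclidean_like)
```

The same pair of points is 7 units apart for p=1 and 5 units apart for p=2, so the choice of metric arguments can change which options count as closest.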

index(options: Iterable[str])

Indexes all options for fast querying.

Parameters:

options (iterable of str) – All options in which we want to search.

query(search_terms: Iterable[str], limit: int = 10) → Tuple[ndarray, ndarray]

Searches for the given terms in the options.

Parameters:
  • search_terms (iterable of str) – Terms to search for.

  • limit (int, default 10) – Number of closest matches to return.

Returns:

  • indices (array of shape (n_search_terms, limit)) – Indices of the closest options to each search term.

  • distances (array of shape (n_search_terms, limit)) – Distances from the closest options to each search term.
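
To make the shape of the two arrays concrete, the following standard-library sketch performs an exact, brute-force search over already-vectorized options. It is only a stand-in: `brute_force_query` is a hypothetical helper, it uses plain lists and Euclidean distance, and it does not reflect how neofuzz performs its approximate search.

```python
import math


def brute_force_query(search_vectors, option_vectors, limit=10):
    """Return (indices, distances), each with one row per search vector.

    Hypothetical exact-search helper; rows hold the `limit` closest options,
    ordered from nearest to farthest.
    """
    limit = min(limit, len(option_vectors))
    indices, distances = [], []
    for query_vector in search_vectors:
        # Pair every option with its distance, then keep the closest ones.
        scored = sorted(
            (math.dist(query_vector, option), i)
            for i, option in enumerate(option_vectors)
        )[:limit]
        indices.append([i for _, i in scored])
        distances.append([d for d, _ in scored])
    return indices, distances


options = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
queries = [[0.9, 0.0], [4.0, 4.0]]

indices, distances = brute_force_query(queries, options, limit=2)
print(indices)    # one row of option indices per query
print(distances)  # matching row of distances per query
```

Row i of each result belongs to the i-th search term, and position j in that row refers to its j-th closest option.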

extract(query: str, choices: Optional[Iterable[str]] = None, limit: int = 10) → List[Tuple[str, int]]

TheFuzz compatible querying.

Parameters:
  • query (str) – Query string to search for.

  • choices (iterable of str, default None) – Choices to iterate through. If the options are already indexed, this parameter is ignored, otherwise it will be used for indexing.

  • limit (int, default 10) – Number of results to return.

Returns:

List of closest terms and their similarity to the query term.

Return type:

list of (str, int)
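
The sketch below shows the shape of such a result, a list of (term, score) pairs ordered from best to worst match, and how it might be consumed. It is an assumption-laden illustration: `extract_sketch` is a hypothetical helper that scores strings with `difflib.SequenceMatcher` from the standard library, which is not the vector-based scoring neofuzz uses.

```python
from difflib import SequenceMatcher


def extract_sketch(query, choices, limit=10):
    """Return up to `limit` (choice, score) pairs, best match first.

    Hypothetical stand-in scorer: scores are SequenceMatcher ratios
    scaled to integers between 0 and 100.
    """
    scored = [
        (choice, int(round(100 * SequenceMatcher(None, query, choice).ratio())))
        for choice in choices
    ]
    # Highest similarity first, mirroring a "closest terms" ordering.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


results = extract_sketch("apple", ["banana", "apply", "apple"], limit=2)
for term, score in results:
    print(f"{term}: {score}")
```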

extractOne(query: str, choices: Optional[Iterable[str]] = None) → Tuple[str, int]

TheFuzz compatible extraction of one item.

Parameters:
  • query (str) – Query string to search for.

  • choices (iterable of str, default None) – Choices to iterate through. If the options are already indexed, this parameter is ignored, otherwise it will be used for indexing.

Returns:

  • result (str) – Closest term to given search term.

  • score (int) – Similarity score.

ratio(s1: str, s2: str) → int

Calculates similarity of two strings.

Parameters:
  • s1 (str) – First string.

  • s2 (str) – Second string.

Returns:

Similarity of the two strings (1-100).

Return type:

int

neofuzz.process.char_ngram_process(ngram_range: Tuple[int, int] = (1, 5), tf_idf: bool = True, metric: str = 'cosine') → Process

Basic character n-gram based fuzzy search process.

Parameters:
  • ngram_range (tuple of (int, int), default (1, 5)) – Lower and upper boundary of n-values for the character n-grams.

  • tf_idf (bool, default True) – Flag signifying whether the features should be tf-idf weighted.

  • metric (str, default 'cosine') – Distance metric to use for fuzzy search.

Returns:

Fuzzy search process.

Return type:

Process
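
To give a feel for character n-gram features combined with the cosine metric, here is a minimal standard-library sketch. It is not neofuzz's implementation: the helpers `char_ngram_counts` and `cosine_distance` are hypothetical, they use raw counts without tf-idf weighting, and they compare strings pairwise instead of building a search index.

```python
import math
from collections import Counter


def char_ngram_counts(text, ngram_range=(1, 5)):
    """Count every character n-gram of `text` with n inside ngram_range."""
    low, high = ngram_range
    counts = Counter()
    for n in range(low, high + 1):
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty string as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


target = char_ngram_counts("neofuzz")
close = cosine_distance(target, char_ngram_counts("neofuz"))
far = cosine_distance(target, char_ngram_counts("banana"))
print(close < far)  # a near-duplicate should be closer than an unrelated word
```

A misspelling shares most of its n-grams with the original string, so its distance stays small, while strings with no characters in common end up at distance 1.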