Encoders

Turftopic encodes documents using sentence transformers by default. You can always change the encoder model, either by passing the name of a sentence transformer from the Hugging Face Hub to a model, or by passing a SentenceTransformer instance.

Here's an example of building a multilingual topic model by using multilingual embeddings:

from sentence_transformers import SentenceTransformer
from turftopic import GMM

trf = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

model = GMM(10, encoder=trf)

# or

model = GMM(10, encoder="paraphrase-multilingual-MiniLM-L12-v2")

Different encoders differ in performance and model size. To make an informed choice about which embedding model you should be using, check out the Massive Text Embedding Benchmark (https://huggingface.co/spaces/mteb/leaderboard).

Asymmetric and Instruction-tuned Embedding Models

Some embedding models are meant to be used with prompts, or encode queries and passages differently. Microsoft's E5 models, for instance, are all prompted by default, and omitting the prompts is detrimental to performance.

In these cases, you're better off NOT passing a string to Turftopic models, but explicitly loading the model using sentence-transformers.

Here's an example of using instruct models for keyword retrieval with KeyNMF. In this case, documents will serve as the queries and words as the passages:

from turftopic import KeyNMF
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer(
    "intfloat/multilingual-e5-large-instruct",
    prompts={
        "query": "Instruct: Retrieve relevant keywords from the given document. Query: ",
        "passage": "Passage: ",
    },
    # Make sure to set default prompt to query!
    default_prompt_name="query",
)
model = KeyNMF(10, encoder=encoder)

And here's a regular asymmetric example:

encoder = SentenceTransformer(
    "intfloat/e5-large-v2",
    prompts={
        "query": "query: ",
        "passage": "passage: ",
    },
    # Make sure to set default prompt to query!
    default_prompt_name="query",
)
model = KeyNMF(10, encoder=encoder)

Performance tips

From sentence-transformers version 3.2.0 onwards, you can significantly speed up some models by using the ONNX backend instead of regular torch.

pip install "sentence-transformers[onnx,onnx-gpu]"

from turftopic import SemanticSignalSeparation
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")

model = SemanticSignalSeparation(10, encoder=encoder)

External Embeddings

If you do not have the computational resources to run embedding models on your own infrastructure, you can also use high-quality third-party embeddings. Turftopic currently supports OpenAI, Voyage, and Cohere embeddings.

turftopic.encoders.base.ExternalEncoder

Bases: ABC

Base class for external encoder models.

Source code in turftopic/encoders/base.py
class ExternalEncoder(ABC):
    """Base class for external encoder models."""

    @abstractmethod
    def encode(self, sentences: Iterable[str]) -> np.ndarray:
        """Encodes sentences into an embedding matrix.

        Parameters
        ----------
        sentences: Iterable[str]
            Sentences to get embeddings for.

        Returns
        -------
        ndarray of shape (n_docs, n_dimensions)
            Embedding matrix.
        """
        pass

encode(sentences) abstractmethod

Encodes sentences into an embedding matrix.

Parameters:

  • sentences (Iterable[str]): Sentences to get embeddings for. Required.

Returns:

  • ndarray of shape (n_docs, n_dimensions): Embedding matrix.

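If you need a provider that Turftopic doesn't ship an encoder for, you can subclass ExternalEncoder yourself. Here's a minimal sketch: call_provider_api is a hypothetical stand-in for your provider's SDK call, and the batching mirrors the built-in encoders below.

from typing import Iterable

import numpy as np

from turftopic import GMM
from turftopic.encoders.base import ExternalEncoder

def call_provider_api(batch: list[str]) -> list[list[float]]:
    # Hypothetical stand-in for your provider's SDK call;
    # it must return one embedding vector per input sentence.
    return [[0.0] * 384 for _ in batch]

class CustomEncoder(ExternalEncoder):
    """Sends documents to a (hypothetical) third-party API in batches."""

    def __init__(self, batch_size: int = 25):
        self.batch_size = batch_size

    def encode(self, sentences: Iterable[str]) -> np.ndarray:
        sentences = list(sentences)
        result = []
        for i in range(0, len(sentences), self.batch_size):
            result.extend(call_provider_api(sentences[i : i + self.batch_size]))
        return np.array(result)

model = GMM(10, encoder=CustomEncoder())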

turftopic.encoders.CohereEmbeddings

Bases: ExternalEncoder

Encoder model using embeddings from Cohere.

The available models are:

  • embed-english-v3.0
  • embed-multilingual-v3.0
  • embed-english-light-v3.0
  • embed-multilingual-light-v3.0
  • embed-english-v2.0
  • embed-english-light-v2.0
  • embed-multilingual-v2.0

from turftopic.encoders import CohereEmbeddings
from turftopic import GMM

model = GMM(10, encoder=CohereEmbeddings())
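
Note that, as the source below shows, the client reads your API key from the COHERE_KEY environment variable and raises a KeyError if it is missing, so set it before constructing the encoder:

import os

os.environ["COHERE_KEY"] = "<your-api-key>"  # or export COHERE_KEY=... in your shell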

Parameters:

  • model (str, default "embed-english-v3.0"): Embedding model to use from Cohere.
  • input_type (str, default "clustering"): Input type passed to the embedding model.
  • batch_size (int, default 25): Size of the batches that will be sent to Cohere's API.
Source code in turftopic/encoders/cohere.py
class CohereEmbeddings(ExternalEncoder):
    """Encoder model using embeddings from Cohere.

    The available models are:

     - `embed-english-v3.0`
     - `embed-multilingual-v3.0`
     - `embed-english-light-v3.0`
     - `embed-multilingual-light-v3.0`
     - `embed-english-v2.0`
     - `embed-english-light-v2.0`
     - `embed-multilingual-v2.0`

    ```python
    from turftopic.encoders import CohereEmbeddings
    from turftopic import GMM

    model = GMM(10, encoder=CohereEmbeddings())
    ```

    Parameters
    ----------
    model: str, default "embed-english-v3.0"
        Embedding model to use from Cohere.

    input_type: str, default "clustering"
        Input type passed to the embedding model.

    batch_size: int, default 25
        Sizes of the batches that will be sent to Cohere's API.
    """

    def __init__(
        self,
        model: str = "embed-english-v3.0",
        input_type: str = "clustering",
        batch_size: int = 25,
    ):
        import cohere

        try:
            self.client = cohere.Client(os.environ["COHERE_KEY"])
        except KeyError as e:
            raise KeyError(
                "You have to set the COHERE_KEY environment"
                " variable to use Cohere embeddings."
            ) from e
        self.model = model
        self.input_type = input_type
        self.batch_size = batch_size

    def encode(self, sentences: Iterable[str]):
        result = []
        for b in batched(sentences, self.batch_size):
            response = self.client.embed(b, input_type=self.input_type)
            result.extend(response.embeddings)
        return np.array(result)

turftopic.encoders.OpenAIEmbeddings

Bases: ExternalEncoder

Encoder model using embeddings from OpenAI.

The available models are:

  • text-embedding-3-large
  • text-embedding-3-small
  • text-embedding-ada-002

from turftopic.encoders import OpenAIEmbeddings
from turftopic import GMM

model = GMM(10, encoder=OpenAIEmbeddings())
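
As the source below shows, your API key is read from the OPENAI_KEY environment variable (a KeyError is raised if it's missing), and an organization can optionally be specified via OPENAI_ORG:

import os

os.environ["OPENAI_KEY"] = "<your-api-key>"
os.environ["OPENAI_ORG"] = "<your-org-id>"  # optional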

Parameters:

  • model (str, default "text-embedding-3-large"): Embedding model to use from OpenAI.
  • batch_size (int, default 25): Size of the batches that will be sent to OpenAI's API.
Source code in turftopic/encoders/openai.py
class OpenAIEmbeddings(ExternalEncoder):
    """Encoder model using embeddings from OpenAI.

    The available models are:

     - `text-embedding-3-large`
     - `text-embedding-3-small`
     - `text-embedding-ada-002`

    ```python
    from turftopic.encoders import OpenAIEmbeddings
    from turftopic import GMM

    model = GMM(10, encoder=OpenAIEmbeddings())
    ```

    Parameters
    ----------
    model: str, default "text-embedding-3-large"
        Embedding model to use from OpenAI.

    batch_size: int, default 25
        Sizes of the batches that will be sent to OpenAI's API.

    """

    def __init__(
        self, model: str = "text-embedding-3-large", batch_size: int = 25
    ):
        import openai

        try:
            openai.api_key = os.environ["OPENAI_KEY"]
        except KeyError as e:
            raise KeyError(
                "You have to set the OPENAI_KEY environment"
                " variable to use OpenAI embeddings."
            ) from e
        openai.organization = os.getenv("OPENAI_ORG")
        self.model = model
        self.batch_size = batch_size

    def encode(self, sentences: Iterable[str]):
        import openai

        result = []
        for b in batched(sentences, self.batch_size):
            resp = openai.Embedding.create(
                input=b, model=self.model
            )  # fmt: off
            result.extend([_["embedding"] for _ in resp["data"]])
        return np.array(result)

turftopic.encoders.VoyageEmbeddings

Bases: ExternalEncoder

Encoder model using embeddings from VoyageAI.

The available models are:

  • voyage-2
  • voyage-lite-2-instruct

from turftopic.encoders import VoyageEmbeddings
from turftopic import GMM

model = GMM(10, encoder=VoyageEmbeddings())
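
As with the other providers, your API key is read from an environment variable, in this case VOYAGE_KEY:

import os

os.environ["VOYAGE_KEY"] = "<your-api-key>"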

Parameters:

  • model (str, default "voyage-lite-2-instruct"): Embedding model to use from Voyage.
  • batch_size (int, default 25): Size of the batches that will be sent to Voyage's API.
Source code in turftopic/encoders/voyage.py
class VoyageEmbeddings(ExternalEncoder):
    """Encoder model using embeddings from VoyageAI.

    The available models are:

     - `voyage-2`
     - `voyage-lite-2-instruct`

    ```python
    from turftopic.encoders import VoyageEmbeddings
    from turftopic import GMM

    model = GMM(10, encoder=VoyageEmbeddings())
    ```

    Parameters
    ----------
    model: str, default "voyage-lite-2-instruct"
        Embedding model to use from Voyage.

    batch_size: int, default 25
        Sizes of the batches that will be sent to Voyage's API.

    """

    def __init__(
        self, model: str = "voyage-lite-2-instruct", batch_size: int = 25
    ):
        import voyageai

        try:
            voyageai.api_key = os.environ["VOYAGE_KEY"]
        except KeyError as e:
            raise KeyError(
                "You have to set the VOYAGE_KEY environment"
                " variable to use Voyage embeddings."
            ) from e
        self.model = model
        self.batch_size = batch_size

    def encode(self, sentences: Iterable[str]):
        from voyageai import get_embeddings

        result = []
        for b in batched(sentences, self.batch_size):
            response = get_embeddings(b, self.model)
            result.extend(response)
        return np.array(result)