Encoders
Turftopic by default encodes documents using sentence transformers.
You can change the encoder either by passing the name of a sentence transformer from the Hugging Face Hub to a model, or by passing a SentenceTransformer instance.
Here's an example of building a multilingual topic model by using multilingual embeddings:
from sentence_transformers import SentenceTransformer
from turftopic import GMM
trf = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
model = GMM(10, encoder=trf)
# or
model = GMM(10, encoder="paraphrase-multilingual-MiniLM-L12-v2")
Encoders differ in both performance and model size. To make an informed choice about which embedding model to use, check out the Massive Text Embedding Benchmark.
Asymmetric and Instruction-tuned Embedding Models
Some embedding models can be used together with prompting, or encode queries and passages differently. Microsoft's E5 models are, for instance, all prompted by default, and it would be detrimental to performance not to do so yourself.
In these cases, you're better off NOT passing a string to Turftopic models, but explicitly loading the model using sentence-transformers.
Here's an example of using instruct models for keyword retrieval with KeyNMF. In this case, documents will serve as the queries and words as the passages:
from turftopic import KeyNMF
from sentence_transformers import SentenceTransformer
encoder = SentenceTransformer(
    "intfloat/multilingual-e5-large-instruct",
    prompts={
        "query": "Instruct: Retrieve relevant keywords from the given document. Query: ",
        "passage": "Passage: ",
    },
    # Make sure to set default prompt to query!
    default_prompt_name="query",
)
model = KeyNMF(10, encoder=encoder)
And a regular, asymmetric example:
from turftopic import KeyNMF
from sentence_transformers import SentenceTransformer
encoder = SentenceTransformer(
    "intfloat/e5-large-v2",
    prompts={
        "query": "query: ",
        "passage": "passage: ",
    },
    # Make sure to set default prompt to query!
    default_prompt_name="query",
)
model = KeyNMF(10, encoder=encoder)
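To make the role of the prompts concrete: sentence-transformers prepends the selected prompt string to each input text before encoding it. The snippet below is a minimal pure-Python sketch of that behaviour. `apply_prompt` is a hypothetical helper written for illustration only; it is not part of sentence-transformers or Turftopic, and it does no embedding.

```python
from typing import Dict, Iterable, List, Optional

# Same prompt table as in the asymmetric example above.
PROMPTS = {"query": "query: ", "passage": "passage: "}


def apply_prompt(
    texts: Iterable[str],
    prompts: Dict[str, str],
    prompt_name: Optional[str] = None,
    default_prompt_name: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: prefix each text with the chosen prompt string."""
    # An explicit prompt name wins; otherwise fall back to the default.
    name = prompt_name if prompt_name is not None else default_prompt_name
    prefix = prompts[name] if name is not None else ""
    return [prefix + text for text in texts]


# Encoded without an explicit prompt name: the default ("query") applies.
documents = apply_prompt(
    ["Cats purr when content."], PROMPTS, default_prompt_name="query"
)
# Encoded with an explicit prompt name: the passage prefix applies.
words = apply_prompt(
    ["cat", "purr"], PROMPTS, prompt_name="passage", default_prompt_name="query"
)

print(documents)  # ['query: Cats purr when content.']
print(words)      # ['passage: cat', 'passage: purr']
```

This is why the default prompt matters: any text encoded without an explicit prompt name silently receives the default prefix, so setting it to "query" means documents are treated as queries.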
Performance tips
From sentence-transformers version 3.2.0 onwards, you can significantly speed up some models by using the onnx backend instead of regular torch.
pip install "sentence-transformers[onnx,onnx-gpu]"
from turftopic import SemanticSignalSeparation
from sentence_transformers import SentenceTransformer
encoder = SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")
model = SemanticSignalSeparation(10, encoder=encoder)
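Since the onnx backend depends on the installed sentence-transformers version, it can be useful to check the version before requesting it. The following is a rough sketch using only the standard library. The helper names (`parse_release`, `supports_onnx_backend`, `onnx_backend_available`) are made up for this example, and the check covers the version number only, not whether the onnx extras are actually installed.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Tuple

# First release with the onnx backend, per the text above.
MIN_ONNX_VERSION = (3, 2, 0)


def parse_release(version_string: str) -> Tuple[int, ...]:
    """Parse leading numeric components, e.g. '3.2.0.dev0' -> (3, 2, 0)."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    # Pad so that '3.2' compares equal to '3.2.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def supports_onnx_backend(version_string: str) -> bool:
    """True if the version is at least 3.2.0 (pre-release tags are ignored)."""
    return parse_release(version_string) >= MIN_ONNX_VERSION


def onnx_backend_available() -> bool:
    """Check the locally installed sentence-transformers, if there is one."""
    try:
        return supports_onnx_backend(version("sentence-transformers"))
    except PackageNotFoundError:
        return False


print(supports_onnx_backend("3.2.0"))  # True
print(supports_onnx_backend("3.1.1"))  # False
```

If `onnx_backend_available()` returns True, pass `backend="onnx"` as in the example above; otherwise construct the SentenceTransformer without the `backend` argument.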
External Embeddings
If you do not have the computational resources to run embedding models on your own infrastructure, you can also use high-quality third-party embeddings. Turftopic currently supports OpenAI, Voyage and Cohere embeddings.
turftopic.encoders.base.ExternalEncoder
Bases: ABC
Base class for external encoder models.
encode(sentences)
abstractmethod
Encodes sentences into an embedding matrix.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sentences | Iterable[str] | Sentences to get embeddings for. | required |
Returns:
Type | Description |
---|---|
ndarray of shape (n_docs, n_dimensions) | Embedding matrix. |
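The interface above is small: an encoder takes an iterable of strings and returns a matrix of shape (n_docs, n_dimensions). Below is a minimal sketch of a class satisfying that contract. It deliberately uses a local stand-in base class (`ExternalEncoderSketch`) instead of importing Turftopic, and `HashingEncoder` is a toy invented for this example that counts hashed tokens. It is not a useful embedding model and not how any of the bundled encoders work.

```python
import hashlib
from abc import ABC, abstractmethod
from typing import Iterable

import numpy as np


class ExternalEncoderSketch(ABC):
    """Stand-in mirroring the documented interface; not Turftopic's class."""

    @abstractmethod
    def encode(self, sentences: Iterable[str]) -> np.ndarray:
        """Encode sentences into an (n_docs, n_dimensions) matrix."""


class HashingEncoder(ExternalEncoderSketch):
    """Toy encoder: counts whitespace tokens hashed into fixed buckets."""

    def __init__(self, n_dimensions: int = 16):
        self.n_dimensions = n_dimensions

    def encode(self, sentences: Iterable[str]) -> np.ndarray:
        rows = []
        for sentence in sentences:
            row = np.zeros(self.n_dimensions)
            for token in sentence.lower().split():
                # Deterministic hash, so the same token always hits the same bucket.
                digest = hashlib.sha256(token.encode("utf-8")).digest()
                bucket = int.from_bytes(digest[:4], "little") % self.n_dimensions
                row[bucket] += 1.0
            rows.append(row)
        if not rows:
            # Keep the 2D shape contract even for empty input.
            return np.zeros((0, self.n_dimensions))
        return np.stack(rows)


embeddings = HashingEncoder(n_dimensions=16).encode(
    ["a small test", "another test document"]
)
print(embeddings.shape)  # (2, 16)
```

A real implementation would subclass `turftopic.encoders.base.ExternalEncoder` and call a remote embedding service inside `encode`.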
turftopic.encoders.CohereEmbeddings
Bases: ExternalEncoder
Encoder model using embeddings from Cohere.
The available models are:
embed-english-v3.0
embed-multilingual-v3.0
embed-english-light-v3.0
embed-multilingual-light-v3.0
embed-english-v2.0
embed-english-light-v2.0
embed-multilingual-v2.0
from turftopic.encoders import CohereEmbeddings
from turftopic import GMM
model = GMM(10, encoder=CohereEmbeddings())
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model | str | Embedding model to use from Cohere. | 'embed-english-v3.0' |
input_type | str | Input type passed to the embedding model. | 'clustering' |
batch_size | int | Sizes of the batches that will be sent to Cohere's API. | 25 |
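As an illustration of what `batch_size` controls, the sketch below splits a corpus into consecutive batches of at most 25 sentences, one batch per API request. It assumes batches are simple consecutive chunks. `batched` is a hypothetical helper for this example, and the actual batching code in Turftopic may differ.

```python
from typing import Iterable, Iterator, List


def batched(sentences: Iterable[str], batch_size: int = 25) -> Iterator[List[str]]:
    """Hypothetical helper: yield consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[str] = []
    for sentence in sentences:
        batch.append(sentence)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Emit the final, possibly shorter batch.
    if batch:
        yield batch


corpus = [f"document {i}" for i in range(60)]
sizes = [len(batch) for batch in batched(corpus, batch_size=25)]
print(sizes)  # [25, 25, 10]
```

Under this assumption, a corpus of 60 documents results in three requests. The same parameter appears on the OpenAI and Voyage encoders below.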
turftopic.encoders.OpenAIEmbeddings
Bases: ExternalEncoder
Encoder model using embeddings from OpenAI.
The available models are:
text-embedding-3-large
text-embedding-3-small
text-embedding-ada-002
from turftopic.encoders import OpenAIEmbeddings
from turftopic import GMM
model = GMM(10, encoder=OpenAIEmbeddings())
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model | str | Embedding model to use from OpenAI. | 'text-embedding-3-large' |
batch_size | int | Sizes of the batches that will be sent to OpenAI's API. | 25 |
turftopic.encoders.VoyageEmbeddings
Bases: ExternalEncoder
Encoder model using embeddings from VoyageAI.
The available models are:
voyage-2
voyage-lite-2-instruct
from turftopic.encoders import VoyageEmbeddings
from turftopic import GMM
model = GMM(10, encoder=VoyageEmbeddings())
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model | str | Embedding model to use from Voyage. | 'voyage-lite-2-instruct' |
batch_size | int | Sizes of the batches that will be sent to Voyage's API. | 25 |