# Online Topic Modeling
Some models in Turftopic can be fitted in an online manner (currently this only includes KeyNMF). These models can be fitted on minibatches instead of the entire corpus at once.
Use Cases:

- You can use online fitting when you have very large corpora at hand, and it would be impractical to fit a model on all of them at once.
- You have new data flowing in constantly, and need a model that can morph its topics based on the incoming data. You can also do this in a dynamic fashion (see the sketch after this list).
- You need to finetune an already fitted topic model on novel data.
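For the dynamic use case, a minimal sketch follows, assuming that your version of Turftopic exposes a `partial_fit_dynamic()` method on KeyNMF (see the Dynamic Topic Modeling documentation) and reusing the `batched()` helper defined in the next section. The time bins must be fixed up front so that every minibatch is assigned to the same time axis:

```python
from datetime import datetime

from turftopic import KeyNMF

model = KeyNMF(10)
# Fixed time bins, shared across all minibatches
bins = [datetime(2020, 1, 1), datetime(2021, 1, 1), datetime(2022, 1, 1)]
corpus: list[str] = [...]
timestamps: list[datetime] = [...]
for batch in batched(zip(corpus, timestamps), 200):
    text_batch, ts_batch = zip(*batch)
    model.partial_fit_dynamic(
        list(text_batch), timestamps=list(ts_batch), bins=bins
    )
```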
## Batch Fitting
We will use the `batched()` function from the itertools recipes to produce batches.
In newer versions of Python (>=3.12) you can simply import it:

```python
from itertools import batched
```

On older versions you can define it yourself using the recipe from the itertools documentation:

```python
from itertools import islice


def batched(iterable, n: int):
    "Batch data into tuples of length n. The last batch may be shorter."
    if n < 1:
        raise ValueError("n must be at least one")
    it = iter(iterable)
    while batch := tuple(islice(it, n)):
        yield batch
```
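As a quick sanity check, the helper splits an iterable into fixed-size chunks, with a shorter final chunk:

```python
>>> list(batched("ABCDEFG", 3))
[('A', 'B', 'C'), ('D', 'E', 'F'), ('G',)]
```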
You can fit a model to a very large corpus in batches like so:

```python
from turftopic import KeyNMF

model = KeyNMF(10, top_n=5)

corpus = ["some string", "etc", ...]

for batch in batched(corpus, 200):
    batch = list(batch)
    model.partial_fit(batch)
```
You might want to train in epochs, so that the model sees the same documents multiple times. This can be useful in numerous settings:

```python
for epoch in range(5):
    for batch in batched(corpus, 200):
        batch = list(batch)
        model.partial_fit(batch)
```
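If you do train in multiple epochs, you may also want to reshuffle the corpus between epochs so that the composition of the minibatches varies from pass to pass. A minimal sketch using the standard library:

```python
import random

corpus = list(corpus)
for epoch in range(5):
    # Reshuffle so that each epoch yields differently composed batches
    random.shuffle(corpus)
    for batch in batched(corpus, 200):
        model.partial_fit(list(batch))
```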
## Finetuning a Model
You can pretrain a topic model on a large corpus and then finetune it on a novel corpus the model has not seen before. This will morph the model's topics to fit the corpus at hand.
In this example we will load a pretrained KeyNMF model from disk (see Model Loading and Saving).
```python
from turftopic import load_model

model = load_model("pretrained_keynmf_model")

new_corpus: list[str] = [...]

# Finetune the model to the new corpus
model.partial_fit(new_corpus)

model.to_disk("finetuned_model/")
```
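After finetuning, it is worth checking how the topics have shifted. Assuming the model exposes Turftopic's usual `print_topics()` helper, you can inspect the updated topic descriptions:

```python
# Print the highest-ranking terms for each updated topic
model.print_topics()
```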
## Precomputed Embeddings

In the case of very large corpora it is common to precompute embeddings before fitting the model.
You can still do this with `partial_fit()`; you just have to be careful to correctly match the embedding indices with the corpus indices.
We provide an example of correct usage here.
You might have a `utils.py` file with a function to load your corpus:

```python
def load_corpus() -> list[str]:
    """Function that loads the corpus from some source."""
    ...
```
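For illustration, here is one hypothetical way to implement this function, assuming the corpus lives in a newline-delimited text file called `corpus.txt` (both the file name and the layout are assumptions made for this sketch):

```python
def load_corpus() -> list[str]:
    """Load the corpus from a newline-delimited text file."""
    with open("corpus.txt", encoding="utf-8") as in_file:
        return [line.strip() for line in in_file if line.strip()]
```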
Then you have a file which computes the embeddings and saves them to disk:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

from utils import load_corpus

corpus = load_corpus()

trf = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = trf.encode(corpus)

np.save("embeddings.npy", embeddings)
```
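For long corpora, `encode()` batches the input internally; you can tune its `batch_size` and display a progress bar to track long-running encoding jobs:

```python
# Tune the internal batch size and show progress for long encoding runs
embeddings = trf.encode(corpus, batch_size=64, show_progress_bar=True)
np.save("embeddings.npy", embeddings)
```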
This file then trains the model on the precomputed embeddings:

```python
import numpy as np

# On Python < 3.12, use the batched() recipe from the Batch Fitting section
from itertools import batched

from turftopic import KeyNMF
from utils import load_corpus

corpus = load_corpus()
embeddings = np.load("embeddings.npy")

model = KeyNMF(10, encoder="all-MiniLM-L6-v2")
for batch in batched(zip(corpus, embeddings), 200):
    text_batch, embedding_batch = zip(*batch)
    text_batch = list(text_batch)
    embedding_batch = np.stack(embedding_batch)
    model.partial_fit(text_batch, embeddings=embedding_batch)
```
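Note that `zip()` silently truncates to the shorter of its inputs, so a misaligned corpus and embedding matrix will not raise an error on their own. A cheap sanity check before fitting guards against this:

```python
corpus = load_corpus()
embeddings = np.load("embeddings.npy")
# zip() truncates silently, so verify alignment explicitly up front
assert len(corpus) == embeddings.shape[0], "corpus/embedding size mismatch"
```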