Clustering Topic Models

Clustering topic models conceptualize topic modeling as a clustering task. In these models, a topic is essentially a tightly packed group of documents in semantic space.

The first contextually sensitive clustering topic model was Top2Vec, and BERTopic has since iterated on the idea.

Turftopic contains flexible implementations of these models, giving you control over each step in the process while keeping extra dependencies to a minimum. While the models themselves can be configured to be equivalent to the BERTopic and Top2Vec implementations, Turftopic might not offer some of the implementation-specific features that those libraries boast.

How do clustering models work?

Dimensionality Reduction

from sklearn.manifold import TSNE
from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(dimensionality_reduction=TSNE())

It is common practice to reduce the dimensionality of the embeddings before clustering them, in order to avoid the curse of dimensionality, an issue that affects many clustering models. By default, Turftopic performs dimensionality reduction with scikit-learn's TSNE implementation, but you are free to specify the model to be used for this step.

What reduction model should I choose?

The impact of the choice of dimensionality reduction method has not yet been explored in the literature, so our knowledge of it is limited. Top2Vec and BERTopic both use UMAP, which has a number of desirable properties over alternatives: it arranges data points into cluster-like structures, preserves global structure better than TSNE, and is fast.
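
If you want to follow their lead, you can pass a UMAP instance as the dimensionality reduction step. This is a minimal sketch, assuming the optional umap-learn package is installed (it is not a Turftopic dependency):

import umap
from turftopic import ClusteringTopicModel

# Reduce embeddings to 5 dimensions with cosine distance before clustering
model = ClusteringTopicModel(
    dimensionality_reduction=umap.UMAP(n_components=5, metric="cosine")
)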

Clustering

from sklearn.cluster import OPTICS
from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(clustering=OPTICS())

After reducing the dimensionality of the embeddings, they are clustered with a clustering model. As HDBSCAN has only been part of scikit-learn since version 1.3.0, Turftopic uses OPTICS as its default.

What clustering model should I choose?

Some clustering models are capable of discovering the number of clusters in the data (HDBSCAN, DBSCAN, OPTICS, etc.). In practice, however, this frequently results in a very large number of topics on large corpora, which is impractical for interpretation. The models' hyperparameters can be adjusted to counteract this behaviour, but the impact of hyperparameter choice on topic quality is more or less unknown. You can also use models with a predefined number of clusters (e.g. KMeans); these, however, typically produce lower topic quality.
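
If you prefer to fix the number of topics in advance, a minimal sketch with scikit-learn's KMeans looks like this (the choice of 15 clusters is arbitrary here):

from sklearn.cluster import KMeans
from turftopic import ClusteringTopicModel

# Force the clustering step to find exactly 15 topics
model = ClusteringTopicModel(clustering=KMeans(n_clusters=15))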

Term importance

Clustering topic models rely on post-hoc term importance estimation. Multiple methods can be used for this in Turftopic.

Proximity to Cluster Centroids

The solution introduced in Top2Vec (Angelov, 2020) is to estimate a term's importance for a given topic from the cosine similarity of its embedding to the centroid of the document embeddings in the cluster.

from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(feature_importance="centroid")

Weaknesses

  • Topics can be too specific => low within-topic coverage
  • Assumes spherical clusters => could give incorrect results

Strengths

  • Clean topics
  • Highly specific topics

c-TF-IDF

c-TF-IDF (Grootendorst, 2022) is a weighting scheme based on the number of occurrences of terms in each cluster. Terms that also occur frequently in other clusters are inversely weighted, so that words specific to a topic gain larger importance.

from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(feature_importance="soft-c-tf-idf")
# or
model = ClusteringTopicModel(feature_importance="c-tf-idf")

By default, Turftopic uses a modified version of c-TF-IDF, called Soft-c-TF-IDF.

Weaknesses

  • Topics can be contaminated with stop words
  • Lower topic quality

Strengths

  • Theoretically correct
  • More within-topic coverage

Click to see formula
  • Let \(X\) be the document-term matrix, where each element \(X_{ij}\) corresponds to the number of times word \(j\) occurs in document \(i\).
  • Estimate the weight of term \(j\) for topic \(z\):
    \(tf_{zj} = \frac{t_{zj}}{w_z}\), where \(t_{zj} = \sum_{i \in z} X_{ij}\) is the number of occurrences of the word in the topic and \(w_{z} = \sum_{j} t_{zj}\) is the total number of words in the topic.
  • Estimate the inverse document/topic frequency of term \(j\):
    \(idf_j = \log(\frac{N}{\sum_z |t_{zj}|})\), where \(N\) is the total number of documents.
  • Calculate the importance of term \(j\) for topic \(z\):
    \(\text{Soft-c-TF-IDF}_{zj} = tf_{zj} \cdot idf_j\)

You can also use the original c-TF-IDF formula if you intend to replicate the behaviour of BERTopic exactly. The two formulas tend to give similar results, though the implications of choosing one over the other have not been thoroughly evaluated.

Click to see formula
  • Let \(X\) be the document-term matrix, where each element \(X_{ij}\) corresponds to the number of times word \(j\) occurs in document \(i\).
  • Estimate the weight of term \(j\) for topic \(z\):
    \(tf_{zj} = \frac{t_{zj}}{w_z}\), where \(t_{zj} = \sum_{i \in z} X_{ij}\) is the number of occurrences of the word in the topic and \(w_{z} = \sum_{j} t_{zj}\) is the total number of words in the topic.
  • Estimate the inverse document/topic frequency of term \(j\):
    \(idf_j = \log(1 + \frac{A}{\sum_z |t_{zj}|})\), where \(A = \frac{\sum_z \sum_j t_{zj}}{Z}\) is the average number of words per topic, and \(Z\) is the number of topics.
  • Calculate the importance of term \(j\) for topic \(z\):
    \(\text{c-TF-IDF}_{zj} = tf_{zj} \cdot idf_j\)
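
As a minimal sketch of the formulas above (not Turftopic's implementation, which operates on sparse matrices), Soft-c-TF-IDF can be computed from a dense document-term matrix and one-hot cluster assignments like this; the original c-TF-IDF differs only in the idf term:

import numpy as np

def soft_ctf_idf_sketch(doc_term_matrix, doc_topic_matrix):
    # doc_term_matrix: (n_docs, n_vocab) term counts, assumed dense here
    # doc_topic_matrix: (n_docs, n_topics) one-hot cluster assignments
    n_docs = doc_term_matrix.shape[0]
    # t_zj: number of occurrences of word j in topic z
    term_topic_counts = doc_topic_matrix.T @ doc_term_matrix
    # w_z: total number of words in topic z
    words_per_topic = term_topic_counts.sum(axis=1, keepdims=True)
    tf = term_topic_counts / words_per_topic
    # idf_j = log(N / sum_z t_zj); assumes every term occurs at least once
    idf = np.log(n_docs / term_topic_counts.sum(axis=0))
    return tf * idf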

Recalculating Term Importance

You can also choose to recalculate term importances with a different method after fitting the model:

from turftopic import ClusteringTopicModel

model = ClusteringTopicModel().fit(corpus)
model.estimate_components(feature_importance="centroid")
model.estimate_components(feature_importance="soft-c-tf-idf")

Hierarchical Topic Merging

A weakness of approaches built on density-based clustering methods is that they frequently find a very large number of topics. To limit the number of topics in a topic model, you can use hierarchical topic merging.

Merge Smallest

The approach used in the Top2Vec package is to repeatedly merge the smallest topic into the topic closest to it (excluding the outlier cluster) until the number of topics is reduced to the desired amount.

You can achieve this behaviour like so:

from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(n_reduce_to=10, reduction_method="smallest")

Agglomerative Clustering

In BERTopic, topics are merged based on agglomerative clustering with average linkage, and term importances are then re-estimated. You can do this in Turftopic as well:

model = ClusteringTopicModel(n_reduce_to=10, reduction_method="agglomerative")

You can also merge topics after fitting the model, using the reduce_topics() method:

model = ClusteringTopicModel().fit(corpus)
model.reduce_topics(n_reduce_to=20, reduction_method="smallest")

To reset topics to the original clustering, use the reset_topics() method:

model.reset_topics()

Manual Topic Merging

You can also manually merge topics using the join_topics() method.

model = ClusteringTopicModel()
model.fit(texts, embeddings=embeddings)
# Joins topics 0, 1 and 2 into a single topic with ID 0
model.join_topics([0, 1, 2])

How do I use BERTopic and Top2Vec in Turftopic?

You can create BERTopic and Top2Vec models in Turftopic by modifying all model parameters and hyperparameters to match the defaults in those other packages.

You will need umap-learn and scikit-learn>=1.3.0 installed to be able to use UMAP and HDBSCAN:

pip install umap-learn "scikit-learn>=1.3.0"

BERTopic

You will need to set the clustering model to HDBSCAN and dimensionality reduction to UMAP. BERTopic also uses the original c-tf-idf formula and agglomerative topic joining.

Show code
from turftopic import ClusteringTopicModel
from sklearn.cluster import HDBSCAN
import umap

berttopic = ClusteringTopicModel(
    dimensionality_reduction=umap.UMAP(
        n_neighbors=10,
        n_components=5,
        min_dist=0.0,
        metric="cosine",
    ),
    clustering=HDBSCAN(
        min_cluster_size=15,
        metric="euclidean",
        cluster_selection_method="eom",
    ),
    feature_importance="c-tf-idf",
    reduction_method="agglomerative"
)

Top2Vec

You will need to set the clustering model to HDBSCAN and dimensionality reduction to UMAP. Top2Vec uses centroid-based feature importance and the 'smallest' topic merging method.

Show code
top2vec = ClusteringTopicModel(
    dimensionality_reduction=umap.UMAP(
        n_neighbors=15,
        n_components=5,
        metric="cosine"
    ),
    clustering=HDBSCAN(
        min_cluster_size=15,
        metric="euclidean",
        cluster_selection_method="eom",
    ),
    feature_importance="centroid",
    reduction_method="smallest"
)

Theoretically, the configurations above should result in the same behaviour as the other two packages, but there might be minor differences in implementation. We do not intend to keep up with changes in Top2Vec's and BERTopic's internal implementation details indefinitely.

Dynamic Modeling

Clustering models are also capable of dynamic topic modeling. This is done by fitting the clustering model over the entire corpus, as we assume that a single semantic model generates the documents across all time bins.

from turftopic import ClusteringTopicModel

model = ClusteringTopicModel().fit_dynamic(corpus, timestamps=ts, bins=10)
model.print_topics_over_time()
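
The timestamps are expected to be datetime objects, one per document (see fit_transform_dynamic below). A minimal sketch of preparing them from hypothetical ISO-formatted date strings:

from datetime import datetime

date_strings: list[str] = ["2020-01-17", "2021-06-02", ...]  # one date per document
ts = [datetime.fromisoformat(date) for date in date_strings]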

API Reference

turftopic.models.cluster.ClusteringTopicModel

Bases: ContextualModel, ClusterMixin, DynamicTopicModel

Topic models, which assume topics to be clusters of documents in semantic space. Models also include a dimensionality reduction step to aid clustering.

from turftopic import ClusteringTopicModel
from sklearn.cluster import HDBSCAN
import umap

corpus: list[str] = ["some text", "more text", ...]

# Construct a Top2Vec-like model
model = ClusteringTopicModel(
    dimensionality_reduction=umap.UMAP(5),
    clustering=HDBSCAN(),
    feature_importance="centroid"
).fit(corpus)
model.print_topics()

Parameters:

  • encoder (Union[Encoder, str], default 'sentence-transformers/all-MiniLM-L6-v2'): Model to encode documents/terms, all-MiniLM-L6-v2 is the default.
  • vectorizer (Optional[CountVectorizer], default None): Vectorizer used for term extraction. Can be used to prune or filter the vocabulary.
  • dimensionality_reduction (Optional[TransformerMixin], default None): Dimensionality reduction step to run before clustering. Defaults to TSNE with cosine distance. To imitate the behavior of BERTopic or Top2Vec you should use UMAP.
  • clustering (Optional[ClusterMixin], default None): Clustering method to use for finding topics. Defaults to OPTICS with 25 minimum cluster size. To imitate the behavior of BERTopic or Top2Vec you should use HDBSCAN.
  • feature_importance (Literal['c-tf-idf', 'soft-c-tf-idf', 'centroid', 'bayes'], default 'soft-c-tf-idf'): Method for estimating term importances. 'centroid' uses distances from cluster centroid similarly to Top2Vec. 'c-tf-idf' uses BERTopic's c-tf-idf. 'soft-c-tf-idf' uses Soft c-TF-IDF from GMM, the results should be very similar to 'c-tf-idf'. 'bayes' uses Bayes' rule.
  • n_reduce_to (Optional[int], default None): Number of topics to reduce topics to. The specified reduction method will be used to merge them. By default, topics are not merged.
  • reduction_method (Literal['agglomerative', 'smallest'], default 'agglomerative'): Method used to reduce the number of topics post-hoc. When 'agglomerative', BERTopic's topic reduction method is used, where topic vectors are hierarchically clustered. When 'smallest', the smallest topic gets merged into the closest non-outlier cluster until the desired number is achieved, similarly to Top2Vec.
  • random_state (Optional[int], default None): Random state to use so that results are exactly reproducible.
Source code in turftopic/models/cluster.py
class ClusteringTopicModel(ContextualModel, ClusterMixin, DynamicTopicModel):
    """Topic models, which assume topics to be clusters of documents
    in semantic space.
    Models also include a dimensionality reduction step to aid clustering.

    ```python
    from turftopic import ClusteringTopicModel
    from sklearn.cluster import HDBSCAN
    import umap

    corpus: list[str] = ["some text", "more text", ...]

    # Construct a Top2Vec-like model
    model = ClusteringTopicModel(
        dimensionality_reduction=umap.UMAP(5),
        clustering=HDBSCAN(),
        feature_importance="centroid"
    ).fit(corpus)
    model.print_topics()
    ```

    Parameters
    ----------
    encoder: str or SentenceTransformer
        Model to encode documents/terms, all-MiniLM-L6-v2 is the default.
    vectorizer: CountVectorizer, default None
        Vectorizer used for term extraction.
        Can be used to prune or filter the vocabulary.
    dimensionality_reduction: TransformerMixin, default None
        Dimensionality reduction step to run before clustering.
        Defaults to TSNE with cosine distance.
        To imitate the behavior of BERTopic or Top2Vec you should use UMAP.
    clustering: ClusterMixin, default None
        Clustering method to use for finding topics.
        Defaults to OPTICS with 25 minimum cluster size.
        To imitate the behavior of BERTopic or Top2Vec you should use HDBSCAN.
    feature_importance: {'soft-c-tf-idf', 'c-tf-idf', 'bayes', 'centroid'}, default 'soft-c-tf-idf'
        Method for estimating term importances.
        'centroid' uses distances from cluster centroid similarly
        to Top2Vec.
        'c-tf-idf' uses BERTopic's c-tf-idf.
        'soft-c-tf-idf' uses Soft c-TF-IDF from GMM, the results should
        be very similar to 'c-tf-idf'.
        'bayes' uses Bayes' rule.
    n_reduce_to: int, default None
        Number of topics to reduce topics to.
        The specified reduction method will be used to merge them.
        By default, topics are not merged.
    reduction_method: 'agglomerative', 'smallest'
        Method used to reduce the number of topics post-hoc.
        When 'agglomerative', BERTopic's topic reduction method is used,
        where topic vectors are hierarchically clustered.
        When 'smallest', the smallest topic gets merged into the closest
        non-outlier cluster until the desired number
        is achieved similarly to Top2Vec.
    random_state: int, default None
        Random state to use so that results are exactly reproducible.
    """

    def __init__(
        self,
        encoder: Union[
            Encoder, str
        ] = "sentence-transformers/all-MiniLM-L6-v2",
        vectorizer: Optional[CountVectorizer] = None,
        dimensionality_reduction: Optional[TransformerMixin] = None,
        clustering: Optional[ClusterMixin] = None,
        feature_importance: Literal[
            "c-tf-idf",
            "soft-c-tf-idf",
            "centroid",
            "bayes",
        ] = "soft-c-tf-idf",
        n_reduce_to: Optional[int] = None,
        reduction_method: Literal[
            "agglomerative", "smallest"
        ] = "agglomerative",
        random_state: Optional[int] = None,
    ):
        self.encoder = encoder
        self.random_state = random_state
        if feature_importance not in [
            "c-tf-idf",
            "soft-c-tf-idf",
            "centroid",
            "bayes",
        ]:
            raise ValueError(feature_message)
        if isinstance(encoder, int):
            raise TypeError(integer_message)
        if isinstance(encoder, str):
            self.encoder_ = SentenceTransformer(encoder)
        else:
            self.encoder_ = encoder
        if vectorizer is None:
            self.vectorizer = default_vectorizer()
        else:
            self.vectorizer = vectorizer
        if clustering is None:
            self.clustering = OPTICS(min_samples=25)
        else:
            self.clustering = clustering
        if dimensionality_reduction is None:
            self.dimensionality_reduction = TSNE(
                n_components=2, metric="cosine", random_state=random_state
            )
        else:
            self.dimensionality_reduction = dimensionality_reduction
        self.feature_importance = feature_importance
        self.n_reduce_to = n_reduce_to
        self.reduction_method = reduction_method

    def _calculate_topic_vectors(
        self, is_in_slice: Optional[np.ndarray] = None
    ) -> np.ndarray:
        label_to_idx = {label: idx for idx, label in enumerate(self.classes_)}
        n_topics = len(self.classes_)
        n_dims = self.embeddings.shape[1]
        topic_vectors = np.full((n_topics, n_dims), np.nan)
        for label in np.unique(self.labels_):
            doc_idx = self.labels_ == label
            if is_in_slice is not None:
                doc_idx = doc_idx & is_in_slice
            topic_vectors[label_to_idx[label], :] = np.mean(
                self.embeddings[doc_idx], axis=0
            )
        return topic_vectors

    def _merge_agglomerative(self, n_reduce_to: int) -> np.ndarray:
        n_topics = self.components_.shape[0]
        res = {old_label: old_label for old_label in self.classes_}
        if n_topics <= n_reduce_to:
            return self.labels_
        interesting_topic_vectors = np.stack(
            [
                vec
                for label, vec in zip(self.classes_, self.topic_vectors_)
                if label != -1
            ]
        )
        old_labels = [label for label in self.classes_ if label != -1]
        new_labels = AgglomerativeClustering(
            n_clusters=n_reduce_to,
            metric="cosine",
            linkage="average",
        ).fit_predict(interesting_topic_vectors)
        res = {}
        if -1 in self.classes_:
            res[-1] = -1
        for i_old, i_new in zip(old_labels, new_labels):
            res[i_old] = i_new
        return np.array([res[label] for label in self.labels_])

    def _merge_smallest(self, n_reduce_to: int):
        merge_inst = smallest_hierarchical_join(
            self.topic_vectors_[self.classes_ != -1],
            self.topic_sizes_[self.classes_ != -1],
            self.classes_[self.classes_ != -1],
            n_reduce_to,
        )
        labels = np.copy(self.labels_)
        for from_topic, to_topic in merge_inst:
            labels[labels == from_topic] = to_topic
        return labels

    def reduce_topics(
        self,
        n_reduce_to: int,
        reduction_method: Literal["smallest", "agglomerative"],
    ) -> np.ndarray:
        """Reduces the clustering to the desired amount with the given method.

        Parameters
        ----------
        n_reduce_to: int, default None
            Number of topics to reduce topics to.
            The specified reduction method will be used to merge them.
            By default, topics are not merged.
        reduction_method: 'agglomerative', 'smallest'
            Method used to reduce the number of topics post-hoc.
            When 'agglomerative', BERTopic's topic reduction method is used,
            where topic vectors are hierarchically clustered.
            When 'smallest', the smallest topic gets merged into the closest
            non-outlier cluster until the desired number
            is achieved similarly to Top2Vec.

        Returns
        -------
        ndarray of shape (n_documents)
            New cluster labels for documents.
        """
        if not hasattr(self, "original_labels_"):
            self.original_labels_ = self.labels_
        if reduction_method == "smallest":
            self.labels_ = self._merge_smallest(n_reduce_to)
        elif reduction_method == "agglomerative":
            self.labels_ = self._merge_agglomerative(n_reduce_to)
        self.estimate_components(self.feature_importance)
        return self.labels_

    def join_topics(self, topic_ids: list[int]):
        """Joins given topic together into one topic and reestimates term importances.

        Example:
        ```python
        model.join_topics([0,3,2])
        ```

        Parameters
        ----------
        topic_ids: list[int]
            Topic IDs to join together.
            The new topic will get the lowest ID.
        """
        topic_ids = sorted(topic_ids)
        new_topic = topic_ids[0]
        new_labels = []
        self.original_labels_ = self.labels_
        for label in self.labels_:
            if label in topic_ids:
                new_labels.append(new_topic)
            else:
                new_labels.append(label)
        self.labels_ = np.array(new_labels)
        self.estimate_components(self.feature_importance)

    def reset_topics(self):
        """Resets topic reductions to the original clustering."""
        if not hasattr(self, "original_labels_"):
            warnings.warn("Topics have never been reduced, nothing to reset.")
        else:
            self.labels_ = self.original_labels_
            self.estimate_components(self.feature_importance)

    def estimate_components(
        self,
        feature_importance: Literal[
            "centroid", "soft-c-tf-idf", "bayes", "c-tf-idf"
        ],
    ) -> np.ndarray:
        """Estimates feature importances based on a fitted clustering.

        Parameters
        ----------
        feature_importance: {'soft-c-tf-idf', 'c-tf-idf', 'bayes', 'centroid'}, default 'soft-c-tf-idf'
            Method for estimating term importances.
            'centroid' uses distances from cluster centroid similarly
            to Top2Vec.
            'c-tf-idf' uses BERTopic's c-tf-idf.
            'soft-c-tf-idf' uses Soft c-TF-IDF from GMM, the results should
            be very similar to 'c-tf-idf'.
            'bayes' uses Bayes' rule.

        Returns
        -------
        ndarray of shape (n_components, n_vocab)
            Topic-term matrix.
        """
        self.topic_names_ = None
        if getattr(self, "labels_", None) is None:
            raise NotFittedError(
                "The model has not been fitted yet, please fit the model before estimating temporal components."
            )
        clusters = np.unique(self.labels_)
        self.classes_ = np.sort(clusters)
        self.topic_sizes_ = np.array(
            [np.sum(self.labels_ == label) for label in self.classes_]
        )
        self.topic_vectors_ = self._calculate_topic_vectors()
        document_topic_matrix = label_binarize(
            self.labels_, classes=self.classes_
        )
        if feature_importance == "soft-c-tf-idf":
            self.components_ = soft_ctf_idf(
                document_topic_matrix, self.doc_term_matrix
            )  # type: ignore
        elif feature_importance == "centroid":
            if not hasattr(self, "vocab_embeddings"):
                self.vocab_embeddings = self.encoder_.encode(
                    self.vectorizer.get_feature_names_out()
                )  # type: ignore
                if (
                    self.vocab_embeddings.shape[1]
                    != self.topic_vectors_.shape[1]
                ):
                    raise ValueError(
                        NOT_MATCHING_ERROR.format(
                            n_dims=self.topic_vectors_.shape[1],
                            n_word_dims=self.vocab_embeddings.shape[1],
                        )
                    )
            self.components_ = cluster_centroid_distance(
                self.topic_vectors_,
                self.vocab_embeddings,
            )
        elif feature_importance == "bayes":
            self.components_ = bayes_rule(
                document_topic_matrix, self.doc_term_matrix
            )
        else:
            self.components_ = ctf_idf(
                document_topic_matrix, self.doc_term_matrix
            )
        return self.components_

    def fit_predict(
        self, raw_documents, y=None, embeddings: Optional[np.ndarray] = None
    ) -> np.ndarray:
        """Fits model and predicts cluster labels for all given documents.

        Parameters
        ----------
        raw_documents: iterable of str
            Documents to fit the model on.
        y: None
            Ignored, exists for sklearn compatibility.
        embeddings: ndarray of shape (n_documents, n_dimensions), optional
            Precomputed document encodings.

        Returns
        -------
        ndarray of shape (n_documents)
            Cluster label for all documents (-1 for outliers)
        """
        console = Console()
        with console.status("Fitting model") as status:
            if embeddings is None:
                status.update("Encoding documents")
                embeddings = self.encoder_.encode(raw_documents)
                console.log("Encoding done.")
            self.embeddings = embeddings
            status.update("Extracting terms")
            self.doc_term_matrix = self.vectorizer.fit_transform(raw_documents)
            console.log("Term extraction done.")
            status.update("Reducing Dimensionality")
            reduced_embeddings = self.dimensionality_reduction.fit_transform(
                embeddings
            )
            console.log("Dimensionality reduction done.")
            status.update("Clustering documents")
            self.labels_ = self.clustering.fit_predict(reduced_embeddings)
            console.log("Clustering done.")
            status.update("Estimating parameters.")
            self.estimate_components(self.feature_importance)
            console.log("Parameter estimation done.")
            if self.n_reduce_to is not None:
                n_topics = self.classes_.shape[0]
                status.update(
                    f"Reducing topics from {n_topics} to {self.n_reduce_to}"
                )
                self.reduce_topics(self.n_reduce_to, self.reduction_method)
                console.log(
                    f"Topic reduction done from {n_topics} to {self.n_reduce_to}."
                )
                status.update("Reestimating parameters.")
                self.estimate_components(self.feature_importance)
                console.log("Reestimation done.")
        console.log("Model fitting done.")
        self.doc_topic_matrix = label_binarize(
            self.labels_, classes=self.classes_
        )
        return self.labels_

    def fit_transform(
        self, raw_documents, y=None, embeddings: Optional[np.ndarray] = None
    ):
        labels = self.fit_predict(raw_documents, y, embeddings)
        return label_binarize(labels, classes=self.classes_)

    def estimate_temporal_components(
        self,
        time_labels,
        time_bin_edges,
        feature_importance: Literal[
            "c-tf-idf", "soft-c-tf-idf", "centroid", "bayes"
        ],
    ) -> np.ndarray:
        """Estimates temporal components based on a fitted topic model.

        Parameters
        ----------
        feature_importance: {'soft-c-tf-idf', 'c-tf-idf', 'bayes', 'centroid'}, default 'soft-c-tf-idf'
            Method for estimating term importances.
            'centroid' uses distances from cluster centroid similarly
            to Top2Vec.
            'c-tf-idf' uses BERTopic's c-tf-idf.
            'soft-c-tf-idf' uses Soft c-TF-IDF from GMM, the results should
            be very similar to 'c-tf-idf'.
            'bayes' uses Bayes' rule.

        Returns
        -------
        ndarray of shape (n_time_bins, n_components, n_vocab)
            Temporal topic-term matrix.
        """
        if getattr(self, "components_", None) is None:
            raise NotFittedError(
                "The model has not been fitted yet, please fit the model before estimating temporal components."
            )
        n_comp, n_vocab = self.components_.shape
        self.time_bin_edges = time_bin_edges
        n_bins = len(self.time_bin_edges) - 1
        self.temporal_components_ = np.full(
            (n_bins, n_comp, n_vocab),
            np.nan,
            dtype=self.components_.dtype,
        )
        self.temporal_importance_ = np.zeros((n_bins, n_comp))
        for i_timebin in np.unique(time_labels):
            topic_importances = self.doc_topic_matrix[
                time_labels == i_timebin
            ].sum(axis=0)
            if not topic_importances.sum() == 0:
                topic_importances = topic_importances / topic_importances.sum()
            self.temporal_importance_[i_timebin, :] = topic_importances
            t_dtm = self.doc_term_matrix[time_labels == i_timebin]
            t_doc_topic = self.doc_topic_matrix[time_labels == i_timebin]
            if feature_importance == "c-tf-idf":
                self.temporal_components_[i_timebin] = ctf_idf(
                    t_doc_topic, t_dtm
                )
            elif feature_importance == "soft-c-tf-idf":
                self.temporal_components_[i_timebin] = soft_ctf_idf(
                    t_doc_topic, t_dtm
                )
            elif feature_importance == "bayes":
                self.temporal_components_[i_timebin] = bayes_rule(
                    t_doc_topic, t_dtm
                )
            elif feature_importance == "centroid":
                t_topic_vectors = self._calculate_topic_vectors(
                    time_labels == i_timebin,
                )
                components = cluster_centroid_distance(
                    t_topic_vectors,
                    self.vocab_embeddings,
                )
                mask_terms = t_dtm.sum(axis=0).astype(np.float64)
                mask_terms = np.squeeze(np.asarray(mask_terms))
                components[:, mask_terms == 0] = np.nan
                self.temporal_components_[i_timebin] = components
        return self.temporal_components_

    def fit_transform_dynamic(
        self,
        raw_documents,
        timestamps: list[datetime],
        embeddings: Optional[np.ndarray] = None,
        bins: Union[int, list[datetime]] = 10,
    ):
        time_labels, self.time_bin_edges = self.bin_timestamps(
            timestamps, bins
        )
        if hasattr(self, "components_"):
            doc_topic_matrix = label_binarize(
                self.labels_, classes=self.classes_
            )
        else:
            doc_topic_matrix = self.fit_transform(
                raw_documents, embeddings=embeddings
            )
        n_comp, n_vocab = self.components_.shape
        n_bins = len(self.time_bin_edges) - 1
        self.temporal_components_ = np.zeros(
            (n_bins, n_comp, n_vocab), dtype=doc_topic_matrix.dtype
        )
        self.temporal_importance_ = np.zeros((n_bins, n_comp))
        if embeddings is None:
            embeddings = self.encoder_.encode(raw_documents)
        self.embeddings = embeddings
        self.estimate_temporal_components(
            time_labels, self.time_bin_edges, self.feature_importance
        )
        return doc_topic_matrix

estimate_components(feature_importance)

Estimates feature importances based on a fitted clustering.

Parameters:

  • feature_importance (Literal['centroid', 'soft-c-tf-idf', 'bayes', 'c-tf-idf'], required): Method for estimating term importances. 'centroid' uses distances from cluster centroid similarly to Top2Vec. 'c-tf-idf' uses BERTopic's c-tf-idf. 'soft-c-tf-idf' uses Soft c-TF-IDF from GMM, the results should be very similar to 'c-tf-idf'. 'bayes' uses Bayes' rule.

Returns:

  • ndarray of shape (n_components, n_vocab): Topic-term matrix.

Source code in turftopic/models/cluster.py
def estimate_components(
    self,
    feature_importance: Literal[
        "centroid", "soft-c-tf-idf", "bayes", "c-tf-idf"
    ],
) -> np.ndarray:
    """Estimates feature importances based on a fitted clustering.

    Parameters
    ----------
    feature_importance: {'soft-c-tf-idf', 'c-tf-idf', 'bayes', 'centroid'}, default 'soft-c-tf-idf'
        Method for estimating term importances.
        'centroid' uses distances from cluster centroid similarly
        to Top2Vec.
        'c-tf-idf' uses BERTopic's c-tf-idf.
        'soft-c-tf-idf' uses Soft c-TF-IDF from GMM, the results should
        be very similar to 'c-tf-idf'.
        'bayes' uses Bayes' rule.

    Returns
    -------
    ndarray of shape (n_components, n_vocab)
        Topic-term matrix.
    """
    self.topic_names_ = None
    if getattr(self, "labels_", None) is None:
        raise NotFittedError(
            "The model has not been fitted yet, please fit the model before estimating temporal components."
        )
    clusters = np.unique(self.labels_)
    self.classes_ = np.sort(clusters)
    self.topic_sizes_ = np.array(
        [np.sum(self.labels_ == label) for label in self.classes_]
    )
    self.topic_vectors_ = self._calculate_topic_vectors()
    document_topic_matrix = label_binarize(
        self.labels_, classes=self.classes_
    )
    if feature_importance == "soft-c-tf-idf":
        self.components_ = soft_ctf_idf(
            document_topic_matrix, self.doc_term_matrix
        )  # type: ignore
    elif feature_importance == "centroid":
        if not hasattr(self, "vocab_embeddings"):
            self.vocab_embeddings = self.encoder_.encode(
                self.vectorizer.get_feature_names_out()
            )  # type: ignore
            if (
                self.vocab_embeddings.shape[1]
                != self.topic_vectors_.shape[1]
            ):
                raise ValueError(
                    NOT_MATCHING_ERROR.format(
                        n_dims=self.topic_vectors_.shape[1],
                        n_word_dims=self.vocab_embeddings.shape[1],
                    )
                )
        self.components_ = cluster_centroid_distance(
            self.topic_vectors_,
            self.vocab_embeddings,
        )
    elif feature_importance == "bayes":
        self.components_ = bayes_rule(
            document_topic_matrix, self.doc_term_matrix
        )
    else:
        self.components_ = ctf_idf(
            document_topic_matrix, self.doc_term_matrix
        )
    return self.components_

estimate_temporal_components(time_labels, time_bin_edges, feature_importance)

Estimates temporal components based on a fitted topic model.

Parameters:

  • feature_importance (Literal['c-tf-idf', 'soft-c-tf-idf', 'centroid', 'bayes'], required): Method for estimating term importances. 'centroid' uses distances from cluster centroid similarly to Top2Vec. 'c-tf-idf' uses BERTopic's c-tf-idf. 'soft-c-tf-idf' uses Soft c-TF-IDF from GMM, the results should be very similar to 'c-tf-idf'. 'bayes' uses Bayes' rule.

Returns:

  • ndarray of shape (n_time_bins, n_components, n_vocab): Temporal topic-term matrix.

Source code in turftopic/models/cluster.py
def estimate_temporal_components(
    self,
    time_labels,
    time_bin_edges,
    feature_importance: Literal[
        "c-tf-idf", "soft-c-tf-idf", "centroid", "bayes"
    ],
) -> np.ndarray:
    """Estimates temporal components based on a fitted topic model.

    Parameters
    ----------
    feature_importance: {'soft-c-tf-idf', 'c-tf-idf', 'bayes', 'centroid'}, default 'soft-c-tf-idf'
        Method for estimating term importances.
        'centroid' uses distances from cluster centroid similarly
        to Top2Vec.
        'c-tf-idf' uses BERTopic's c-tf-idf.
        'soft-c-tf-idf' uses Soft c-TF-IDF from GMM, the results should
        be very similar to 'c-tf-idf'.
        'bayes' uses Bayes' rule.

    Returns
    -------
    ndarray of shape (n_time_bins, n_components, n_vocab)
        Temporal topic-term matrix.
    """
    if getattr(self, "components_", None) is None:
        raise NotFittedError(
            "The model has not been fitted yet, please fit the model before estimating temporal components."
        )
    n_comp, n_vocab = self.components_.shape
    self.time_bin_edges = time_bin_edges
    n_bins = len(self.time_bin_edges) - 1
    self.temporal_components_ = np.full(
        (n_bins, n_comp, n_vocab),
        np.nan,
        dtype=self.components_.dtype,
    )
    self.temporal_importance_ = np.zeros((n_bins, n_comp))
    for i_timebin in np.unique(time_labels):
        topic_importances = self.doc_topic_matrix[
            time_labels == i_timebin
        ].sum(axis=0)
        if not topic_importances.sum() == 0:
            topic_importances = topic_importances / topic_importances.sum()
        self.temporal_importance_[i_timebin, :] = topic_importances
        t_dtm = self.doc_term_matrix[time_labels == i_timebin]
        t_doc_topic = self.doc_topic_matrix[time_labels == i_timebin]
        if feature_importance == "c-tf-idf":
            self.temporal_components_[i_timebin] = ctf_idf(
                t_doc_topic, t_dtm
            )
        elif feature_importance == "soft-c-tf-idf":
            self.temporal_components_[i_timebin] = soft_ctf_idf(
                t_doc_topic, t_dtm
            )
        elif feature_importance == "bayes":
            self.temporal_components_[i_timebin] = bayes_rule(
                t_doc_topic, t_dtm
            )
        elif feature_importance == "centroid":
            t_topic_vectors = self._calculate_topic_vectors(
                time_labels == i_timebin,
            )
            components = cluster_centroid_distance(
                t_topic_vectors,
                self.vocab_embeddings,
            )
            mask_terms = t_dtm.sum(axis=0).astype(np.float64)
            mask_terms = np.squeeze(np.asarray(mask_terms))
            components[:, mask_terms == 0] = np.nan
            self.temporal_components_[i_timebin] = components
    return self.temporal_components_

fit_predict(raw_documents, y=None, embeddings=None)

Fits model and predicts cluster labels for all given documents.

Parameters:

  • raw_documents (iterable of str, required): Documents to fit the model on.
  • y (None, default None): Ignored, exists for sklearn compatibility.
  • embeddings (Optional[ndarray] of shape (n_documents, n_dimensions), default None): Precomputed document encodings.

Returns:

  • ndarray of shape (n_documents): Cluster label for all documents (-1 for outliers).

Source code in turftopic/models/cluster.py
def fit_predict(
    self, raw_documents, y=None, embeddings: Optional[np.ndarray] = None
) -> np.ndarray:
    """Fits model and predicts cluster labels for all given documents.

    Parameters
    ----------
    raw_documents: iterable of str
        Documents to fit the model on.
    y: None
        Ignored, exists for sklearn compatibility.
    embeddings: ndarray of shape (n_documents, n_dimensions), optional
        Precomputed document encodings.

    Returns
    -------
    ndarray of shape (n_documents)
        Cluster label for all documents (-1 for outliers)
    """
    console = Console()
    with console.status("Fitting model") as status:
        if embeddings is None:
            status.update("Encoding documents")
            embeddings = self.encoder_.encode(raw_documents)
            console.log("Encoding done.")
        self.embeddings = embeddings
        status.update("Extracting terms")
        self.doc_term_matrix = self.vectorizer.fit_transform(raw_documents)
        console.log("Term extraction done.")
        status.update("Reducing Dimensionality")
        reduced_embeddings = self.dimensionality_reduction.fit_transform(
            embeddings
        )
        console.log("Dimensionality reduction done.")
        status.update("Clustering documents")
        self.labels_ = self.clustering.fit_predict(reduced_embeddings)
        console.log("Clustering done.")
        status.update("Estimating parameters.")
        self.estimate_components(self.feature_importance)
        console.log("Parameter estimation done.")
        if self.n_reduce_to is not None:
            n_topics = self.classes_.shape[0]
            status.update(
                f"Reducing topics from {n_topics} to {self.n_reduce_to}"
            )
            self.reduce_topics(self.n_reduce_to, self.reduction_method)
            console.log(
                f"Topic reduction done from {n_topics} to {self.n_reduce_to}."
            )
            status.update("Reestimating parameters.")
            self.estimate_components(self.feature_importance)
            console.log("Reestimation done.")
    console.log("Model fitting done.")
    self.doc_topic_matrix = label_binarize(
        self.labels_, classes=self.classes_
    )
    return self.labels_

join_topics(topic_ids)

Joins the given topics together into one topic and reestimates term importances.

Example:

model.join_topics([0,3,2])

Parameters:

  • topic_ids (list[int], required): Topic IDs to join together. The new topic will get the lowest ID.

Source code in turftopic/models/cluster.py
def join_topics(self, topic_ids: list[int]):
    """Joins given topic together into one topic and reestimates term importances.

    Example:
    ```python
    model.join_topics([0,3,2])
    ```

    Parameters
    ----------
    topic_ids: list[int]
        Topic IDs to join together.
        The new topic will get the lowest ID.
    """
    topic_ids = sorted(topic_ids)
    new_topic = topic_ids[0]
    new_labels = []
    self.original_labels_ = self.labels_
    for label in self.labels_:
        if label in topic_ids:
            new_labels.append(new_topic)
        else:
            new_labels.append(label)
    self.labels_ = np.array(new_labels)
    self.estimate_components(self.feature_importance)

reduce_topics(n_reduce_to, reduction_method)

Reduces the clustering to the desired amount with the given method.

Parameters:

  • n_reduce_to (int, required): Number of topics to reduce topics to. The specified reduction method will be used to merge them. By default, topics are not merged.
  • reduction_method (Literal['smallest', 'agglomerative'], required): Method used to reduce the number of topics post-hoc. When 'agglomerative', BERTopic's topic reduction method is used, where topic vectors are hierarchically clustered. When 'smallest', the smallest topic gets merged into the closest non-outlier cluster until the desired number is achieved, similarly to Top2Vec.

Returns:

  • ndarray of shape (n_documents): New cluster labels for documents.

Source code in turftopic/models/cluster.py
def reduce_topics(
    self,
    n_reduce_to: int,
    reduction_method: Literal["smallest", "agglomerative"],
) -> np.ndarray:
    """Reduces the clustering to the desired amount with the given method.

    Parameters
    ----------
    n_reduce_to: int, default None
        Number of topics to reduce topics to.
        The specified reduction method will be used to merge them.
        By default, topics are not merged.
    reduction_method: 'agglomerative', 'smallest'
        Method used to reduce the number of topics post-hoc.
        When 'agglomerative', BERTopic's topic reduction method is used,
        where topic vectors are hierarchically clustered.
        When 'smallest', the smallest topic gets merged into the closest
        non-outlier cluster until the desired number
        is achieved similarly to Top2Vec.

    Returns
    -------
    ndarray of shape (n_documents)
        New cluster labels for documents.
    """
    if not hasattr(self, "original_labels_"):
        self.original_labels_ = self.labels_
    if reduction_method == "smallest":
        self.labels_ = self._merge_smallest(n_reduce_to)
    elif reduction_method == "agglomerative":
        self.labels_ = self._merge_agglomerative(n_reduce_to)
    self.estimate_components(self.feature_importance)
    return self.labels_

reset_topics()

Resets topic reductions to the original clustering.

Source code in turftopic/models/cluster.py
def reset_topics(self):
    """Resets topic reductions to the original clustering."""
    if not hasattr(self, "original_labels_"):
        warnings.warn("Topics have never been reduced, nothing to reset.")
    else:
        self.labels_ = self.original_labels_
        self.estimate_components(self.feature_importance)