Clustering Topic Models

Clustering topic models conceptualize topic modeling as a clustering task. Essentially a topic for these models is a tightly packed group of documents in semantic space. The first contextually sensitive clustering topic model was introduced with Top2Vec, and BERTopic has also iterated on this idea.

If you are looking for a probabilistic/soft-clustering model you should also check out GMM.

Figure 1: Interactive figure to explore cluster structure in a clustering topic model.

How do clustering models work?

Step 1: Dimensionality Reduction

It is common practice to reduce the dimensionality of the embeddings before clustering them. This is done to mitigate the curse of dimensionality, an issue that affects many clustering models. By default, Turftopic performs dimensionality reduction with TSNE, but you are free to specify any model to use for this step.

Choose a dimensionality reduction method

from sklearn.manifold import TSNE
from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(dimensionality_reduction=TSNE(n_components=2, metric="cosine"))

TSNE is a classic method for producing non-linear, lower-dimensional representations of high-dimensional embeddings. It has an inherent clustering property, which helps clustering models find groups in the data. While it is widely used, it has well-known issues, such as poor representation of global relations and artificial clusters.

Use openTSNE for better performance!

By default, a scikit-learn implementation is used, but if you have the openTSNE package installed on your system, Turftopic will automatically use it. You can potentially speed up your clustering topic models by multiple orders of magnitude.

pip install turftopic[opentsne]
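
If you prefer to be explicit, you can also pass openTSNE's scikit-learn-compatible estimator yourself. The snippet below is a sketch that assumes openTSNE's openTSNE.sklearn.TSNE wrapper and its default parameter names:

from openTSNE.sklearn import TSNE  # assumes the openTSNE package is installed
from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(
    dimensionality_reduction=TSNE(n_components=2, metric="cosine", n_jobs=-1)
)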

pip install umap-learn
from umap import UMAP
from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(dimensionality_reduction=UMAP(n_components=2, metric="cosine"))

UMAP is a widely applicable non-linear dimensionality reduction method and is typically the default choice for topic discovery in clustering topic models. UMAP is faster than TSNE and is also substantially better at representing global structure in your dataset.

from sklearn.decomposition import PCA
from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(dimensionality_reduction=PCA(n_components=2))

Principal Component Analysis is one of the most widely used dimensionality reduction techniques in machine learning. It is a linear method that projects embeddings onto the first N principal components, ordered by the amount of variance they capture in the data. PCA is substantially faster than manifold methods, but it does not aid clustering models as well as TSNE or UMAP.

Step 2: Document Clustering

After the dimensionality of document embeddings is reduced, topics are discovered by clustering the document embeddings in this lower-dimensional space. Turftopic is entirely clustering-model agnostic: any scikit-learn-compatible clustering model may be used.

Choose a clustering method

from sklearn.cluster import HDBSCAN
from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(clustering=HDBSCAN())

HDBSCAN is a density-based clustering method that can find clusters of varying density. It determines the number of clusters in the data automatically and can also detect outliers. While HDBSCAN has many advantageous properties, it can be hard to make an informed choice about its hyperparameters.

from sklearn.cluster import KMeans
from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(clustering=KMeans(n_clusters=10))

The KMeans algorithm finds clusters by locating a prespecified number of mean vectors that minimize the squared distance of embeddings in a cluster to their mean. KMeans is a very fast algorithm, but it makes strong assumptions about cluster shapes, cannot detect outliers, and requires you to specify the number of clusters before fitting.
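
Because Turftopic only requires a scikit-learn-compatible clustering estimator (one with a fit_predict() method), other models can be plugged in as well. The example below is purely illustrative, using scikit-learn's AgglomerativeClustering rather than one of the defaults discussed above:

from sklearn.cluster import AgglomerativeClustering
from turftopic import ClusteringTopicModel

# Any clustering estimator with fit_predict() works; this particular choice is just an illustration.
model = ClusteringTopicModel(
    clustering=AgglomerativeClustering(n_clusters=10, metric="cosine", linkage="average")
)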

Step 3: Calculate term importance scores

Clustering topic models rely on post-hoc term importance estimation, meaning that topic descriptions are calculated from clusters that have already been discovered. Multiple methods are available in Turftopic for estimating the importance of words and phrases for each topic.

Choose a term importance estimation method

from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(feature_importance="soft-c-tf-idf")
# or
model = ClusteringTopicModel(feature_importance="c-tf-idf")

Weaknesses

  • Topics can be contaminated with stop words
  • Lower topic quality

Strengths

  • Theoretically more correct
  • More within-topic coverage

c-TF-IDF (Grootendorst, 2022) is a weighting scheme based on the number of occurrences of terms in each cluster. Terms that frequently occur in other clusters are down-weighted, so that words specific to a topic gain larger importance. By default, Turftopic uses a modified version of c-TF-IDF, called Soft-c-TF-IDF, which is more robust to stop words.


Click to see formulas

Soft-c-TF-IDF

  • Let \(X\) be the document term matrix where each element (\(X_{ij}\)) corresponds with the number of times word \(j\) occurs in a document \(i\).
  • Estimate weight of term \(j\) for topic \(z\):
    \(tf_{zj} = \frac{t_{zj}}{w_z}\), where \(t_{zj} = \sum_{i \in z} X_{ij}\) is the number of occurrences of a word in a topic and \(w_{z}= \sum_{j} t_{zj}\) is all words in the topic
  • Estimate inverse document/topic frequency for term \(j\):
    \(idf_j = \log(\frac{N}{\sum_z |t_{zj}|})\), where \(N\) is the total number of documents.
  • Calculate importance of term \(j\) for topic \(z\):
    \(\text{Soft-c-TF-IDF}_{zj} = tf_{zj} \cdot idf_j\)

c-TF-IDF

  • Let \(X\) be the document term matrix where each element (\(X_{ij}\)) corresponds with the number of times word \(j\) occurs in a document \(i\).
  • \(tf_{zj} = \frac{t_{zj}}{w_z}\), where \(t_{zj} = \sum_{i \in z} X_{ij}\) is the number of occurrences of a word in a topic and \(w_{z}= \sum_{j} t_{zj}\) is all words in the topic
  • Estimate inverse document/topic frequency for term \(j\):
    \(idf_j = \log(1 + \frac{A}{\sum_z |t_{zj}|})\), where \(A = \frac{\sum_z \sum_j t_{zj}}{Z}\) is the average number of words per topic, and \(Z\) is the number of topics.
  • Calculate importance of term \(j\) for topic \(z\):
    \(\text{c-TF-IDF}_{zj} = tf_{zj} \cdot idf_j\)

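To make these formulas concrete, here is a minimal NumPy sketch of Soft-c-TF-IDF written directly from the definitions above. It is an illustration, not Turftopic's internal implementation:

import numpy as np

def soft_ctf_idf_sketch(doc_term_matrix: np.ndarray, labels: np.ndarray) -> np.ndarray:
    # doc_term_matrix: (n_documents, n_vocab) term counts; labels: cluster label per document
    classes = np.unique(labels)
    n_docs = doc_term_matrix.shape[0]
    # t_zj: total occurrences of term j in topic z
    t = np.stack([doc_term_matrix[labels == z].sum(axis=0) for z in classes])
    tf = t / t.sum(axis=1, keepdims=True)           # tf_zj = t_zj / w_z
    idf = np.log(n_docs / np.abs(t).sum(axis=0))    # idf_j = log(N / sum_z |t_zj|)
    return tf * idf                                 # shape: (n_topics, n_vocab)
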
from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(feature_importance="centroid")

Weaknesses

  • Low within-topic coverage
  • Assumes spherical clusters

Strengths

  • Clean topics
  • Highly specific topics

In Top2Vec (Angelov, 2020), term importance scores are estimated from word embeddings' similarity to the centroid vectors of clusters. This approach typically produces cleaner and more specific topic descriptions, but it might not be the optimal choice, since it makes assumptions about cluster shapes and only describes the centers of clusters accurately.
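
As a rough sketch, centroid-based importance amounts to ranking words by the cosine similarity of their embeddings to each cluster's centroid. The function below illustrates the principle and is not Turftopic's exact code:

import numpy as np

def centroid_importance_sketch(doc_embeddings, vocab_embeddings, labels):
    # doc_embeddings: (n_documents, dim); vocab_embeddings: (n_vocab, dim); labels: cluster per document
    classes = np.unique(labels)
    centroids = np.stack([doc_embeddings[labels == z].mean(axis=0) for z in classes])
    centroids = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    vocab = vocab_embeddings / np.linalg.norm(vocab_embeddings, axis=1, keepdims=True)
    return centroids @ vocab.T  # cosine similarities, shape: (n_topics, n_vocab)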

You can also choose to recalculate term importances with a different method after fitting the model:

from turftopic import ClusteringTopicModel

model = ClusteringTopicModel().fit(corpus)
model.estimate_components(feature_importance="centroid")
model.estimate_components(feature_importance="soft-c-tf-idf")

Hierarchical Topic Merging

A weakness of approaches built on density-based clustering methods is that they frequently find a very large number of topics. To limit the number of topics in a topic model, you can hierarchically merge topics until you reach the desired number. Turftopic allows you to use a number of popular methods for merging topics in clustering models.

Choose a topic reduction method

from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(n_reduce_to=10, reduction_method="average")
# or 
model.reduce_topics(10, reduction_method="single", metric="cosine")

Topics discovered by a clustering model can be merged using agglomerative clustering. For a detailed discussion of linkage methods and hierarchical clustering, consult SciPy's documentation. All linkage methods compatible with SciPy can be used as topic reduction methods in Turftopic.
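
Under the hood, this amounts to running SciPy's linkage on the topic representations and cutting the resulting tree. The sketch below uses made-up topic vectors and only illustrates the mechanism; it is not Turftopic's internal code:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Placeholder topic vectors; in Turftopic these are either topic-word components or cluster centroids
topic_vectors = np.random.default_rng(0).normal(size=(25, 384))
Z = linkage(topic_vectors, method="average", metric="cosine")
merged = fcluster(Z, t=10, criterion="maxclust")  # assigns the 25 original topics to 10 merged topics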

from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(n_reduce_to=10, reduction_method="smallest")
# or 
model.reduce_topics(10, reduction_method="smallest", metric="cosine")

The approach used in the Top2Vec package is to repeatedly merge the smallest topic (excluding the outlier cluster) into the topic closest to it, until the number of topics is reduced to the desired amount. This approach is remarkably fast and usually quite effective, since it doesn't require computing full linkages.
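
Schematically, the strategy looks like the following simplified sketch (not Turftopic's implementation; for brevity, topic vectors are not recomputed after each merge):

import numpy as np
from sklearn.metrics.pairwise import cosine_distances

def merge_smallest_sketch(topic_vectors, topic_sizes, n_reduce_to):
    # topic_vectors: (n_topics, dim); topic_sizes: document counts per topic (outlier topic excluded)
    topics = list(range(len(topic_sizes)))
    sizes = np.asarray(topic_sizes, dtype=float)
    merged_into = {}
    while len(topics) > n_reduce_to:
        smallest = min(topics, key=lambda z: sizes[z])
        dists = cosine_distances(topic_vectors[[smallest]], topic_vectors[topics])[0]
        dists[topics.index(smallest)] = np.inf   # never merge a topic into itself
        closest = topics[int(np.argmin(dists))]
        sizes[closest] += sizes[smallest]        # documents of the smallest topic join the closest one
        merged_into[smallest] = closest
        topics.remove(smallest)
    return merged_into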

All clustering models have a hierarchy property, with which you can explore the topic hierarchy discovered by your model. For a detailed discussion of hierarchical modeling, check out the Hierarchical modeling page.

print(model.hierarchy)
Root:
├── -1: documented, obsolete, et4000, concerns, dubious, embedded, hardware, xfree86, alternative, seeking
├── 20: hitter, pitching, batting, hitters, pitchers, fielder, shortstop, inning, baseman, pitcher
├── 284: nhl, goaltenders, canucks, sabres, hockey, bruins, puck, oilers, canadiens, flyers
│ ├── 242: sportschannel, espn, nbc, nhl, broadcasts, broadcasting, broadcast, mlb, cbs, cbc
│ │ ├── 171: stadium, tickets, mlb, ticket, sportschannel, mets, inning, nationals, schedule, cubs
│ │ │ └── ...
│ │ └── 21: sportschannel, nbc, espn, nhl, broadcasting, broadcasts, broadcast, hockey, cbc, cbs
│ └── 236: nhl, goaltenders, canucks, sabres, puck, oilers, andreychuk, bruins, goaltender, leafs
...

You can also manually merge topics by using the join_topics() method of cluster hierarchies.

# Joins topics 0, 1 and 2 and creates a merged topic with ID 4
model.hierarchy.join_topics([0, 1, 2], joint_id=4)

If you want to reset topics to their original state, you can call reset_topics()

model.reset_topics()

Dynamic Topic Modeling

Clustering models are also capable of dynamic topic modeling. This is done by fitting the clustering model over the entire corpus, as we expect that there is only one semantic model generating the documents.

For a detailed discussion, see Dynamic Models.

from turftopic import ClusteringTopicModel

model = ClusteringTopicModel().fit_dynamic(corpus, timestamps=ts, bins=10)
model.print_topics_over_time()

Visualization

You can interactively explore clusters using datamapplot directly in Turftopic! You will first have to install datamapplot for this to work:

pip install turftopic[datamapplot]
from turftopic import ClusteringTopicModel
from turftopic.namers import OpenAITopicNamer

model = ClusteringTopicModel(feature_importance="centroid").fit(corpus)

namer = OpenAITopicNamer("gpt-4o-mini")
model.rename_topics(namer)

fig = model.plot_clusters_datamapplot()
fig.save("clusters_visualization.html")
fig

See Figure 1

Info

If you are not running Turftopic from a Jupyter notebook, make sure to call fig.show(). This will open up a new browser tab with the interactive figure.

BERTopic and Top2Vec-like models in Turftopic

You can create BERTopic- and Top2Vec-like models in Turftopic by setting the model's parameters and hyperparameters to match the defaults in those packages. You will need umap-learn and scikit-learn>=1.3.0 installed to be able to use UMAP and HDBSCAN:

pip install umap-learn "scikit-learn>=1.3.0"

Create BERTopic and Top2Vec models

from turftopic import ClusteringTopicModel
from sklearn.cluster import HDBSCAN
import umap

bertopic = ClusteringTopicModel(
    dimensionality_reduction=umap.UMAP(
        n_neighbors=10,
        n_components=5,
        min_dist=0.0,
        metric="cosine",
    ),
    clustering=HDBSCAN(
        min_cluster_size=15,
        metric="euclidean",
        cluster_selection_method="eom",
    ),
    feature_importance="c-tf-idf",
    reduction_method="average"
    reduction_distance_metric="cosine",
    reduction_topic_representation="component",
)
from turftopic import ClusteringTopicModel
from sklearn.cluster import HDBSCAN
import umap

top2vec = ClusteringTopicModel(
    dimensionality_reduction=umap.UMAP(
        n_neighbors=15,
        n_components=5,
        metric="cosine"
    ),
    clustering=HDBSCAN(
        min_cluster_size=15,
        metric="euclidean",
        cluster_selection_method="eom",
    ),
    feature_importance="centroid",
    reduction_method="smallest"
    reduction_distance_metric="cosine",
    reduction_topic_representation="centroid",
)

In theory, the model specifications above should result in the same behaviour as the other two packages, but there might be minor differences in implementation. We do not intend to keep up with changes to Top2Vec's and BERTopic's internal implementation details indefinitely.

API Reference

turftopic.models.cluster.ClusteringTopicModel

Bases: ContextualModel, ClusterMixin, DynamicTopicModel

Topic models, which assume topics to be clusters of documents in semantic space. Models also include a dimensionality reduction step to aid clustering.

from turftopic import ClusteringTopicModel
from sklearn.cluster import HDBSCAN
import umap

corpus: list[str] = ["some text", "more text", ...]

# Construct a Top2Vec-like model
model = ClusteringTopicModel(
    dimensionality_reduction=umap.UMAP(5),
    clustering=HDBSCAN(),
    feature_importance="centroid"
).fit(corpus)
model.print_topics()

Parameters:

encoder: Union[Encoder, str], default 'sentence-transformers/all-MiniLM-L6-v2'
    Model to encode documents/terms, all-MiniLM-L6-v2 is the default.
vectorizer: Optional[CountVectorizer], default None
    Vectorizer used for term extraction. Can be used to prune or filter the vocabulary.
dimensionality_reduction: Optional[TransformerMixin], default None
    Dimensionality reduction step to run before clustering. Defaults to TSNE with cosine distance. To imitate the behavior of BERTopic or Top2Vec you should use UMAP.
clustering: Optional[ClusterMixin], default None
    Clustering method to use for finding topics. Defaults to HDBSCAN with a minimum cluster size of 25. To imitate the behavior of BERTopic or Top2Vec you should use HDBSCAN.
feature_importance: WordImportance, default 'soft-c-tf-idf'
    Method for estimating term importances. 'centroid' uses distances from cluster centroids similarly to Top2Vec. 'c-tf-idf' uses BERTopic's c-tf-idf. 'soft-c-tf-idf' uses Soft c-TF-IDF from GMM; the results should be very similar to 'c-tf-idf'. 'bayes' uses Bayes' rule.
n_reduce_to: Optional[int], default None
    Number of topics to reduce topics to. The specified reduction method will be used to merge them. By default, topics are not merged.
reduction_method: LinkageMethod, default 'average'
    Method used for hierarchically merging topics. Can be "smallest", which is Top2Vec's default merging strategy, or any of the linkage methods listed in SciPy's documentation.
reduction_distance_metric: DistanceMetric, default 'cosine'
    Distance metric to use for hierarchical topic reduction.
reduction_topic_representation: TopicRepresentation, default 'component'
    Topic representation used for hierarchical clustering. If 'component', the topic-word importance scores are used as topic vectors (as in BERTopic); if 'centroid', the centroid vectors of clusters are used as topic vectors (as in Top2Vec).
random_state: Optional[int], default None
    Random state to use so that results are exactly reproducible.

Source code in turftopic/models/cluster.py
class ClusteringTopicModel(ContextualModel, ClusterMixin, DynamicTopicModel):
    """Topic models, which assume topics to be clusters of documents
    in semantic space.
    Models also include a dimensionality reduction step to aid clustering.

    ```python
    from turftopic import ClusteringTopicModel
    from sklearn.cluster import HDBSCAN
    import umap

    corpus: list[str] = ["some text", "more text", ...]

    # Construct a Top2Vec-like model
    model = ClusteringTopicModel(
        dimensionality_reduction=umap.UMAP(5),
        clustering=HDBSCAN(),
        feature_importance="centroid"
    ).fit(corpus)
    model.print_topics()
    ```

    Parameters
    ----------
    encoder: str or SentenceTransformer
        Model to encode documents/terms, all-MiniLM-L6-v2 is the default.
    vectorizer: CountVectorizer, default None
        Vectorizer used for term extraction.
        Can be used to prune or filter the vocabulary.
    dimensionality_reduction: TransformerMixin, default None
        Dimensionality reduction step to run before clustering.
        Defaults to TSNE with cosine distance.
        To imitate the behavior of BERTopic or Top2Vec you should use UMAP.
    clustering: ClusterMixin, default None
        Clustering method to use for finding topics.
        Defaults to HDBSCAN with a minimum cluster size of 25.
        To imitate the behavior of BERTopic or Top2Vec you should use HDBSCAN.
    feature_importance: WordImportance, default 'soft-c-tf-idf'
        Method for estimating term importances.
        'centroid' uses distances from cluster centroid similarly
        to Top2Vec.
        'c-tf-idf' uses BERTopic's c-tf-idf.
        'soft-c-tf-idf' uses Soft c-TF-IDF from GMM, the results should
        be very similar to 'c-tf-idf'.
        'bayes' uses Bayes' rule.
    n_reduce_to: int, default None
        Number of topics to reduce topics to.
        The specified reduction method will be used to merge them.
        By default, topics are not merged.
    reduction_method: LinkageMethod, default 'average'
        Method used for hierarchically merging topics.
        Could be "smallest", which is Top2Vec's default merging strategy, or
        any of the linkage methods listed in [SciPy's documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html)
    reduction_distance_metric: DistanceMetric, default 'cosine'
        Distance metric to use for hierarchical topic reduction.
    reduction_topic_representation: {'component', 'centroid'}, default 'component'
        Topic representation used for hierarchical clustering.
        If 'component' the topic-word importance scores will be used as topic vectors, (this is how it's done in BERTopic)
        if 'centroid' the centroid vectors of clusters will be used as topic vectors (Top2Vec).
    random_state: int, default None
        Random state to use so that results are exactly reproducible.
    """

    def __init__(
        self,
        encoder: Union[
            Encoder, str
        ] = "sentence-transformers/all-MiniLM-L6-v2",
        vectorizer: Optional[CountVectorizer] = None,
        dimensionality_reduction: Optional[TransformerMixin] = None,
        clustering: Optional[ClusterMixin] = None,
        feature_importance: WordImportance = "soft-c-tf-idf",
        n_reduce_to: Optional[int] = None,
        reduction_method: LinkageMethod = "average",
        reduction_distance_metric: DistanceMetric = "cosine",
        reduction_topic_representation: TopicRepresentation = "component",
        random_state: Optional[int] = None,
    ):
        self.encoder = encoder
        self.random_state = random_state
        if feature_importance not in VALID_WORD_IMPORTANCE:
            raise ValueError(
                f"feature_importance must be one of {VALID_WORD_IMPORTANCE} got {feature_importance} instead."
            )
        if reduction_method not in VALID_LINKAGE_METHODS:
            raise ValueError(
                f"Topic reduction method has to be one of: {VALID_LINKAGE_METHODS}, but got {reduction_method} instead."
            )
        if reduction_distance_metric not in VALID_DISTANCE_METRICS:
            raise ValueError(
                f"Distance metric should be one of: {VALID_DISTANCE_METRICS}, but got {reduction_distance_metric} instead."
            )
        if reduction_topic_representation not in VALID_TOPIC_REPRESENTATIONS:
            raise ValueError(
                f"Topic representation should be one of: {VALID_TOPIC_REPRESENTATIONS}, but got {reduction_topic_representation} instead."
            )
        if isinstance(encoder, int):
            raise TypeError(integer_message)
        if isinstance(encoder, str):
            self.encoder_ = SentenceTransformer(encoder)
        else:
            self.encoder_ = encoder
        if vectorizer is None:
            self.vectorizer = default_vectorizer()
        else:
            self.vectorizer = vectorizer
        if clustering is None:
            self.clustering = HDBSCAN(
                min_samples=10,
                min_cluster_size=25,
            )
        else:
            self.clustering = clustering
        if dimensionality_reduction is None:
            self.dimensionality_reduction = build_tsne(
                n_components=2,
                metric="cosine",
                perplexity=15,
                random_state=random_state,
            )
        else:
            self.dimensionality_reduction = dimensionality_reduction
        self.feature_importance = feature_importance
        self.reduction_distance_metric = reduction_distance_metric
        self.reduction_topic_representation = reduction_topic_representation
        self.n_reduce_to = n_reduce_to
        self.reduction_method = reduction_method

    @property
    def topic_representations(self) -> np.ndarray:
        if self.reduction_topic_representation == "component":
            return self.components_
        else:
            return self._calculate_topic_vectors()

    def _calculate_topic_vectors(
        self,
        is_in_slice: Optional[np.ndarray] = None,
        classes: Optional[np.ndarray] = None,
        embeddings: Optional[np.ndarray] = None,
        labels: Optional[np.ndarray] = None,
    ) -> np.ndarray:
        if classes is None:
            classes = self.classes_
        if embeddings is None:
            embeddings = self.embeddings
        if labels is None:
            labels = self.labels_
        label_to_idx = {label: idx for idx, label in enumerate(classes)}
        n_topics = len(classes)
        n_dims = embeddings.shape[1]
        topic_vectors = np.full((n_topics, n_dims), np.nan)
        for label in np.unique(labels):
            doc_idx = labels == label
            if is_in_slice is not None:
                doc_idx = doc_idx & is_in_slice
            topic_vectors[label_to_idx[label], :] = np.mean(
                embeddings[doc_idx], axis=0
            )
        return topic_vectors

    def estimate_components(
        self, feature_importance: Optional[WordImportance] = None
    ) -> np.ndarray:
        """Estimates feature importances based on a fitted clustering.

        Parameters
        ----------
        feature_importance: WordImportance, default None
            Method for estimating term importances.
            'centroid' uses distances from cluster centroid similarly
            to Top2Vec.
            'c-tf-idf' uses BERTopic's c-tf-idf.
            'soft-c-tf-idf' uses Soft c-TF-IDF from GMM, the results should
            be very similar to 'c-tf-idf'.
            'bayes' uses Bayes' rule.

        Returns
        -------
        ndarray of shape (n_components, n_vocab)
            Topic-term matrix.
        """
        if feature_importance is not None:
            if feature_importance not in VALID_WORD_IMPORTANCE:
                raise ValueError(
                    f"feature_importance must be one of {VALID_WORD_IMPORTANCE} got {feature_importance} instead."
                )
            self.feature_importance = feature_importance
        self.hierarchy.estimate_components()
        return self.components_

    def reduce_topics(
        self,
        n_reduce_to: int,
        reduction_method: Optional[LinkageMethod] = None,
        metric: Optional[DistanceMetric] = None,
    ) -> np.ndarray:
        """Reduces the clustering to the desired amount with the given method.

        Parameters
        ----------
        n_reduce_to: int, default None
            Number of topics to reduce topics to.
            The specified reduction method will be used to merge them.
            By default, topics are not merged.
        reduction_method: LinkageMethod, default None
            Method used for hierarchically merging topics.
            Could be "smallest", which is Top2Vec's default merging strategy, or
            any of the linkage methods listed in [SciPy's documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html)
        reduction_distance_metric: DistanceMetric, default None
            Distance metric to use for hierarchical topic reduction.

        Returns
        -------
        ndarray of shape (n_documents)
            New cluster labels for documents.
        """
        if not hasattr(self, "original_labels_"):
            self.original_labels_ = self.labels_
            self.original_names_ = self.topic_names
            self.original_classes_ = self.classes_
        if reduction_method is None:
            reduction_method = self.reduction_method
        if metric is None:
            metric = self.reduction_distance_metric
        self.hierarchy.reduce_topics(
            n_reduce_to, method=reduction_method, metric=metric
        )
        return self.labels_

    def reset_topics(self):
        """Resets topics to the original cllustering."""
        original_labels = getattr(self, "original_labels_", None)
        if original_labels is None:
            warnings.warn("Topics have never been reduced, nothing to reset.")
        else:
            self.hierarchy = ClusterNode.create_root(
                self, labels=self.original_labels_
            )
            self.topic_names_ = self.original_names_

    @property
    def classes_(self):
        try:
            return self.hierarchy.classes_
        except AttributeError as e:
            raise AttributeError(
                "Model has not been fitted yet, and doesn't have classes_"
            ) from e

    @property
    def components_(self):
        try:
            return self.hierarchy.components_
        except AttributeError as e:
            raise AttributeError(
                "Model has not been fitted yet, and doesn't have components_"
            ) from e

    @property
    def labels_(self):
        try:
            return self.hierarchy.labels_
        except AttributeError as e:
            raise AttributeError(
                "Model has not been fitted yet, and doesn't have labels_"
            ) from e

    @property
    def document_topic_matrix(self):
        return label_binarize(self.labels_, classes=self.classes_)

    def join_topics(
        self, to_join: Sequence[int], joint_id: Optional[int] = None
    ):
        """Joins the given topics in the cluster hierarchy to a single topic.

        Parameters
        ----------
        to_join: Sequence of int
            Topics to join together by ID.
        joint_id: int, default None
            New ID for the joint cluster.
            Default is the smallest ID of the topics to join.
        """
        self.hierarchy.join_topics(to_join, joint_id=joint_id)

    def fit_predict(
        self, raw_documents, y=None, embeddings: Optional[np.ndarray] = None
    ) -> np.ndarray:
        """Fits model and predicts cluster labels for all given documents.

        Parameters
        ----------
        raw_documents: iterable of str
            Documents to fit the model on.
        y: None
            Ignored, exists for sklearn compatibility.
        embeddings: ndarray of shape (n_documents, n_dimensions), optional
            Precomputed document encodings.

        Returns
        -------
        ndarray of shape (n_documents)
            Cluster label for all documents (-1 for outliers)
        """
        console = Console()
        with console.status("Fitting model") as status:
            if embeddings is None:
                status.update("Encoding documents")
                embeddings = self.encoder_.encode(raw_documents)
                console.log("Encoding done.")
            self.embeddings = embeddings
            status.update("Extracting terms")
            self.doc_term_matrix = self.vectorizer.fit_transform(raw_documents)
            console.log("Term extraction done.")
            status.update("Reducing Dimensionality")
            self.reduced_embeddings = (
                self.dimensionality_reduction.fit_transform(embeddings)
            )
            console.log("Dimensionality reduction done.")
            status.update("Clustering documents")
            labels = self.clustering.fit_predict(self.reduced_embeddings)
            console.log("Clustering done.")
            status.update("Estimating parameters.")
            # Initializing hierarchy
            self.hierarchy = ClusterNode.create_root(self, labels=labels)
            console.log("Parameter estimation done.")
            if self.n_reduce_to is not None:
                n_topics = self.classes_.shape[0]
                status.update(
                    f"Reducing topics from {n_topics} to {self.n_reduce_to}"
                )
                self.reduce_topics(
                    self.n_reduce_to,
                    self.reduction_method,
                    self.reduction_distance_metric,
                )
                console.log(
                    f"Topic reduction done from {n_topics} to {self.n_reduce_to}."
                )
        console.log("Model fitting done.")
        return self.labels_

    def fit_transform(
        self, raw_documents, y=None, embeddings: Optional[np.ndarray] = None
    ):
        labels = self.fit_predict(raw_documents, y, embeddings)
        document_topic_matrix = label_binarize(labels, classes=self.classes_)
        document_topic_matrix = document_topic_matrix * cosine_similarity(
            self.embeddings, self._calculate_topic_vectors()
        )
        return document_topic_matrix

    def estimate_temporal_components(
        self,
        time_labels,
        time_bin_edges,
        feature_importance: Optional[WordImportance] = None,
    ) -> np.ndarray:
        """Estimates temporal components based on a fitted topic model.

        Parameters
        ----------
        feature_importance: WordImportance, default None
            Method for estimating term importances.
            'centroid' uses distances from cluster centroid similarly
            to Top2Vec.
            'c-tf-idf' uses BERTopic's c-tf-idf.
            'soft-c-tf-idf' uses Soft c-TF-IDF from GMM, the results should
            be very similar to 'c-tf-idf'.
            'bayes' uses Bayes' rule.

        Returns
        -------
        ndarray of shape (n_time_bins, n_components, n_vocab)
            Temporal topic-term matrix.
        """
        if getattr(self, "components_", None) is None:
            raise NotFittedError(
                "The model has not been fitted yet, please fit the model before estimating temporal components."
            )
        if feature_importance is None:
            feature_importance = self.feature_importance
        n_comp, n_vocab = self.components_.shape
        self.time_bin_edges = time_bin_edges
        n_bins = len(self.time_bin_edges) - 1
        self.temporal_components_ = np.full(
            (n_bins, n_comp, n_vocab),
            np.nan,
            dtype=self.components_.dtype,
        )
        self.temporal_importance_ = np.zeros((n_bins, n_comp))
        for i_timebin in np.unique(time_labels):
            topic_importances = self.document_topic_matrix[
                time_labels == i_timebin
            ].sum(axis=0)
            if not topic_importances.sum() == 0:
                topic_importances = topic_importances / topic_importances.sum()
            self.temporal_importance_[i_timebin, :] = topic_importances
            t_dtm = self.doc_term_matrix[time_labels == i_timebin]
            t_doc_topic = self.document_topic_matrix[time_labels == i_timebin]
            if feature_importance == "c-tf-idf":
                self.temporal_components_[i_timebin] = ctf_idf(
                    t_doc_topic, t_dtm
                )
            elif feature_importance == "soft-c-tf-idf":
                self.temporal_components_[i_timebin] = soft_ctf_idf(
                    t_doc_topic, t_dtm
                )
            elif feature_importance == "bayes":
                self.temporal_components_[i_timebin] = bayes_rule(
                    t_doc_topic, t_dtm
                )
            elif feature_importance == "centroid":
                t_topic_vectors = self._calculate_topic_vectors(
                    is_in_slice=time_labels == i_timebin,
                )
                components = cluster_centroid_distance(
                    t_topic_vectors,
                    self.vocab_embeddings,
                )
                mask_terms = t_dtm.sum(axis=0).astype(np.float64)
                mask_terms = np.squeeze(np.asarray(mask_terms))
                components[:, mask_terms == 0] = np.nan
                self.temporal_components_[i_timebin] = components
        return self.temporal_components_

    def fit_transform_dynamic(
        self,
        raw_documents,
        timestamps: list[datetime],
        embeddings: Optional[np.ndarray] = None,
        bins: Union[int, list[datetime]] = 10,
    ):
        time_labels, self.time_bin_edges = self.bin_timestamps(
            timestamps, bins
        )
        if hasattr(self, "components_"):
            doc_topic_matrix = label_binarize(
                self.labels_, classes=self.classes_
            )
        else:
            doc_topic_matrix = self.fit_transform(
                raw_documents, embeddings=embeddings
            )
        n_comp, n_vocab = self.components_.shape
        n_bins = len(self.time_bin_edges) - 1
        self.temporal_components_ = np.zeros(
            (n_bins, n_comp, n_vocab), dtype=doc_topic_matrix.dtype
        )
        self.temporal_importance_ = np.zeros((n_bins, n_comp))
        if embeddings is None:
            embeddings = self.encoder_.encode(raw_documents)
        self.embeddings = embeddings
        self.estimate_temporal_components(
            time_labels, self.time_bin_edges, self.feature_importance
        )
        return doc_topic_matrix

    @staticmethod
    def _labels_to_indices(labels, classes):
        n_classes = len(classes)
        class_to_index = dict(zip(classes, np.arange(n_classes)))
        return np.array([class_to_index[label] for label in labels])

    def plot_clusters_datamapplot(
        self, dimensions: tuple[int, int] = (0, 1), *args, **kwargs
    ):
        try:
            import datamapplot
        except ModuleNotFoundError as e:
            raise ModuleNotFoundError(
                "You need to install datamapplot to be able to use plot_clusters_datamapplot()."
            ) from e
        coordinates = self.reduced_embeddings[:, dimensions]
        coordinates = scale(coordinates) * 4
        indices = self._labels_to_indices(self.labels_, self.classes_)
        labels = np.array(self.topic_names)[indices]
        if -1 in self.classes_:
            i_outlier = np.where(self.classes_ == -1)[0][0]
            kwargs["noise_label"] = self.topic_names[i_outlier]
        plot = datamapplot.create_interactive_plot(
            coordinates, labels, *args, **kwargs
        )

        def show_fig():
            with tempfile.TemporaryDirectory() as temp_dir:
                file_name = Path(temp_dir).joinpath("fig.html")
                plot.save(file_name)
                webbrowser.open("file://" + str(file_name.absolute()), new=2)
                time.sleep(2)

        plot.show = show_fig
        plot.write_html = plot.save
        return plot

estimate_components(feature_importance=None)

Estimates feature importances based on a fitted clustering.

Parameters:

feature_importance: Optional[WordImportance], default None
    Method for estimating term importances. 'centroid' uses distances from cluster centroid similarly to Top2Vec. 'c-tf-idf' uses BERTopic's c-tf-idf. 'soft-c-tf-idf' uses Soft c-TF-IDF from GMM; the results should be very similar to 'c-tf-idf'. 'bayes' uses Bayes' rule.

Returns:

ndarray of shape (n_components, n_vocab)
    Topic-term matrix.

Source code in turftopic/models/cluster.py
def estimate_components(
    self, feature_importance: Optional[WordImportance] = None
) -> np.ndarray:
    """Estimates feature importances based on a fitted clustering.

    Parameters
    ----------
    feature_importance: WordImportance, default None
        Method for estimating term importances.
        'centroid' uses distances from cluster centroid similarly
        to Top2Vec.
        'c-tf-idf' uses BERTopic's c-tf-idf.
        'soft-c-tf-idf' uses Soft c-TF-IDF from GMM, the results should
        be very similar to 'c-tf-idf'.
        'bayes' uses Bayes' rule.

    Returns
    -------
    ndarray of shape (n_components, n_vocab)
        Topic-term matrix.
    """
    if feature_importance is not None:
        if feature_importance not in VALID_WORD_IMPORTANCE:
            raise ValueError(
                f"feature_importance must be one of {VALID_WORD_IMPORTANCE} got {feature_importance} instead."
            )
        self.feature_importance = feature_importance
    self.hierarchy.estimate_components()
    return self.components_

estimate_temporal_components(time_labels, time_bin_edges, feature_importance=None)

Estimates temporal components based on a fitted topic model.

Parameters:

feature_importance: Optional[WordImportance], default None
    Method for estimating term importances. 'centroid' uses distances from cluster centroid similarly to Top2Vec. 'c-tf-idf' uses BERTopic's c-tf-idf. 'soft-c-tf-idf' uses Soft c-TF-IDF from GMM; the results should be very similar to 'c-tf-idf'. 'bayes' uses Bayes' rule.

Returns:

ndarray of shape (n_time_bins, n_components, n_vocab)
    Temporal topic-term matrix.

Source code in turftopic/models/cluster.py
def estimate_temporal_components(
    self,
    time_labels,
    time_bin_edges,
    feature_importance: Optional[WordImportance] = None,
) -> np.ndarray:
    """Estimates temporal components based on a fitted topic model.

    Parameters
    ----------
    feature_importance: WordImportance, default None
        Method for estimating term importances.
        'centroid' uses distances from cluster centroid similarly
        to Top2Vec.
        'c-tf-idf' uses BERTopic's c-tf-idf.
        'soft-c-tf-idf' uses Soft c-TF-IDF from GMM, the results should
        be very similar to 'c-tf-idf'.
        'bayes' uses Bayes' rule.

    Returns
    -------
    ndarray of shape (n_time_bins, n_components, n_vocab)
        Temporal topic-term matrix.
    """
    if getattr(self, "components_", None) is None:
        raise NotFittedError(
            "The model has not been fitted yet, please fit the model before estimating temporal components."
        )
    if feature_importance is None:
        feature_importance = self.feature_importance
    n_comp, n_vocab = self.components_.shape
    self.time_bin_edges = time_bin_edges
    n_bins = len(self.time_bin_edges) - 1
    self.temporal_components_ = np.full(
        (n_bins, n_comp, n_vocab),
        np.nan,
        dtype=self.components_.dtype,
    )
    self.temporal_importance_ = np.zeros((n_bins, n_comp))
    for i_timebin in np.unique(time_labels):
        topic_importances = self.document_topic_matrix[
            time_labels == i_timebin
        ].sum(axis=0)
        if not topic_importances.sum() == 0:
            topic_importances = topic_importances / topic_importances.sum()
        self.temporal_importance_[i_timebin, :] = topic_importances
        t_dtm = self.doc_term_matrix[time_labels == i_timebin]
        t_doc_topic = self.document_topic_matrix[time_labels == i_timebin]
        if feature_importance == "c-tf-idf":
            self.temporal_components_[i_timebin] = ctf_idf(
                t_doc_topic, t_dtm
            )
        elif feature_importance == "soft-c-tf-idf":
            self.temporal_components_[i_timebin] = soft_ctf_idf(
                t_doc_topic, t_dtm
            )
        elif feature_importance == "bayes":
            self.temporal_components_[i_timebin] = bayes_rule(
                t_doc_topic, t_dtm
            )
        elif feature_importance == "centroid":
            t_topic_vectors = self._calculate_topic_vectors(
                is_in_slice=time_labels == i_timebin,
            )
            components = cluster_centroid_distance(
                t_topic_vectors,
                self.vocab_embeddings,
            )
            mask_terms = t_dtm.sum(axis=0).astype(np.float64)
            mask_terms = np.squeeze(np.asarray(mask_terms))
            components[:, mask_terms == 0] = np.nan
            self.temporal_components_[i_timebin] = components
    return self.temporal_components_

fit_predict(raw_documents, y=None, embeddings=None)

Fits model and predicts cluster labels for all given documents.

Parameters:

raw_documents: iterable of str
    Documents to fit the model on.
y: None
    Ignored, exists for sklearn compatibility.
embeddings: ndarray of shape (n_documents, n_dimensions), default None
    Precomputed document encodings.

Returns:

ndarray of shape (n_documents)
    Cluster label for all documents (-1 for outliers).

Source code in turftopic/models/cluster.py
def fit_predict(
    self, raw_documents, y=None, embeddings: Optional[np.ndarray] = None
) -> np.ndarray:
    """Fits model and predicts cluster labels for all given documents.

    Parameters
    ----------
    raw_documents: iterable of str
        Documents to fit the model on.
    y: None
        Ignored, exists for sklearn compatibility.
    embeddings: ndarray of shape (n_documents, n_dimensions), optional
        Precomputed document encodings.

    Returns
    -------
    ndarray of shape (n_documents)
        Cluster label for all documents (-1 for outliers)
    """
    console = Console()
    with console.status("Fitting model") as status:
        if embeddings is None:
            status.update("Encoding documents")
            embeddings = self.encoder_.encode(raw_documents)
            console.log("Encoding done.")
        self.embeddings = embeddings
        status.update("Extracting terms")
        self.doc_term_matrix = self.vectorizer.fit_transform(raw_documents)
        console.log("Term extraction done.")
        status.update("Reducing Dimensionality")
        self.reduced_embeddings = (
            self.dimensionality_reduction.fit_transform(embeddings)
        )
        console.log("Dimensionality reduction done.")
        status.update("Clustering documents")
        labels = self.clustering.fit_predict(self.reduced_embeddings)
        console.log("Clustering done.")
        status.update("Estimating parameters.")
        # Initializing hierarchy
        self.hierarchy = ClusterNode.create_root(self, labels=labels)
        console.log("Parameter estimation done.")
        if self.n_reduce_to is not None:
            n_topics = self.classes_.shape[0]
            status.update(
                f"Reducing topics from {n_topics} to {self.n_reduce_to}"
            )
            self.reduce_topics(
                self.n_reduce_to,
                self.reduction_method,
                self.reduction_distance_metric,
            )
            console.log(
                f"Topic reduction done from {n_topics} to {self.n_reduce_to}."
            )
    console.log("Model fitting done.")
    return self.labels_

join_topics(to_join, joint_id=None)

Joins the given topics in the cluster hierarchy to a single topic.

Parameters:

to_join: Sequence[int]
    Topics to join together by ID.
joint_id: Optional[int], default None
    New ID for the joint cluster. Default is the smallest ID of the topics to join.
Source code in turftopic/models/cluster.py
def join_topics(
    self, to_join: Sequence[int], joint_id: Optional[int] = None
):
    """Joins the given topics in the cluster hierarchy to a single topic.

    Parameters
    ----------
    to_join: Sequence of int
        Topics to join together by ID.
    joint_id: int, default None
        New ID for the joint cluster.
        Default is the smallest ID of the topics to join.
    """
    self.hierarchy.join_topics(to_join, joint_id=joint_id)

reduce_topics(n_reduce_to, reduction_method=None, metric=None)

Reduces the clustering to the desired amount with the given method.

Parameters:

n_reduce_to: int
    Number of topics to reduce topics to. The specified reduction method will be used to merge them.
reduction_method: Optional[LinkageMethod], default None
    Method used for hierarchically merging topics. Can be "smallest", which is Top2Vec's default merging strategy, or any of the linkage methods listed in SciPy's documentation. Defaults to the model's reduction_method.
metric: Optional[DistanceMetric], default None
    Distance metric to use for hierarchical topic reduction. Defaults to the model's reduction_distance_metric.

Returns:

ndarray of shape (n_documents)
    New cluster labels for documents.

Source code in turftopic/models/cluster.py
def reduce_topics(
    self,
    n_reduce_to: int,
    reduction_method: Optional[LinkageMethod] = None,
    metric: Optional[DistanceMetric] = None,
) -> np.ndarray:
    """Reduces the clustering to the desired amount with the given method.

    Parameters
    ----------
    n_reduce_to: int, default None
        Number of topics to reduce topics to.
        The specified reduction method will be used to merge them.
        By default, topics are not merged.
    reduction_method: LinkageMethod, default None
        Method used for hierarchically merging topics.
        Could be "smallest", which is Top2Vec's default merging strategy, or
        any of the linkage methods listed in [SciPy's documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html)
    reduction_distance_metric: DistanceMetric, default None
        Distance metric to use for hierarchical topic reduction.

    Returns
    -------
    ndarray of shape (n_documents)
        New cluster labels for documents.
    """
    if not hasattr(self, "original_labels_"):
        self.original_labels_ = self.labels_
        self.original_names_ = self.topic_names
        self.original_classes_ = self.classes_
    if reduction_method is None:
        reduction_method = self.reduction_method
    if metric is None:
        metric = self.reduction_distance_metric
    self.hierarchy.reduce_topics(
        n_reduce_to, method=reduction_method, metric=metric
    )
    return self.labels_

reset_topics()

Resets topics to the original clustering.

Source code in turftopic/models/cluster.py
def reset_topics(self):
    """Resets topics to the original cllustering."""
    original_labels = getattr(self, "original_labels_", None)
    if original_labels is None:
        warnings.warn("Topics have never been reduced, nothing to reset.")
    else:
        self.hierarchy = ClusterNode.create_root(
            self, labels=self.original_labels_
        )
        self.topic_names_ = self.original_names_