Clustering Topic Models

Clustering topic models conceptualize topic modeling as a clustering task. In essence, a topic in these models is a tightly packed group of documents in semantic space.

The first contextually sensitive clustering topic model was introduced with Top2Vec, and BERTopic has also iterated on this idea.

Turftopic contains flexible implementations of these models, giving you control over each step of the process while keeping extra dependencies to a minimum. While the models themselves can be equivalent to the BERTopic and Top2Vec implementations, Turftopic might not offer some of the implementation-specific features that those libraries boast.

The Model

1. Dimensionality Reduction

It is common practice in the clustering topic modeling literature to reduce the dimensionality of the embeddings before clustering them. This avoids the curse of dimensionality, an issue that affects many clustering models.

By default, Turftopic performs dimensionality reduction with scikit-learn's TSNE implementation, but users are free to specify the model used for this step.

The impact of the choice of dimensionality reduction method is not well understood, and has not yet been explored in the literature. Top2Vec and BERTopic both use UMAP, which has a number of desirable properties over alternatives: it arranges data points into cluster-like structures, preserves global structure better than TSNE, and is fast. You can pass any dimensionality reduction model to Turftopic, as sketched below.
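
For instance, you could use UMAP instead of the default TSNE (a minimal sketch; umap-learn is an optional dependency you would need to install, and the parameter values are illustrative only):

from turftopic import ClusteringTopicModel
import umap

# Reduce embeddings to 5 dimensions with UMAP before clustering
model = ClusteringTopicModel(
    dimensionality_reduction=umap.UMAP(n_components=5, metric="cosine")
)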

2. Clustering

After reducing the dimensionality of the embeddings, they are clustered with a clustering model. As HDBSCAN has only been part of scikit-learn since version 1.3.0, Turftopic uses OPTICS as its default.

Some clustering models are capable of discovering the number of clusters in the data. This is a useful, and as yet unchallenged, property of clustering topic models.

Practice suggests, however, that in large corpora this frequently results in a very large number of topics, which is impractical for interpretation. The models' hyperparameters can be adjusted to account for this behaviour (see the sketch below), but the impact of hyperparameter choice on topic quality is largely unknown.
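
One way to counteract an inflated topic count is to pass a clustering model with a larger minimum cluster size (a minimal sketch; the value 50 is illustrative, not a recommendation):

from sklearn.cluster import OPTICS

from turftopic import ClusteringTopicModel

# A larger minimum cluster size typically yields fewer, larger topics
model = ClusteringTopicModel(clustering=OPTICS(min_samples=50))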

3a. Term Importance: Proximity to Cluster Centroids

Clustering topic models rely on post-hoc term importance estimation. Currently there are two methods used for this.

The approach introduced in Top2Vec (Angelov, 2020) estimates a term's importance for a given topic from the cosine similarity of its embedding to the centroid of the document embeddings in the cluster.

Terms Close to the Topic Vector
(figure from Top2Vec documentation)

This has three implications:

  1. Topic descriptions are very specific. As the closest terms to the topic vector are selected, they tend to also be very close to each other. The issue with this is that many of the documents in a topic might not get proper coverage.
  2. It is assumed that the clusters are convex and spherical. This might not at all be the case, and especially when clusters are concave, the closest terms to the centroid might end up describing a different, or nonexistent topic. In other words: The mean might not be a representative datapoint of the population.
  3. Noise rarely gets into topic descriptions. Since function words or contaminating terms are not very likely to be closest to the topic vector, descriptions are typically clean.
Centroids of Non-Convex Clusters
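
To make the idea concrete, here is a minimal sketch of centroid-based term importance (an illustration of the concept, not Turftopic's internal implementation; the function and argument names are illustrative):

from sklearn.metrics.pairwise import cosine_similarity

def centroid_term_importance(doc_embeddings, labels, vocab_embeddings, topic):
    # doc_embeddings: (n_documents, n_dims), labels: cluster label per document
    # vocab_embeddings: (n_terms, n_dims), encoded with the same encoder
    centroid = doc_embeddings[labels == topic].mean(axis=0, keepdims=True)
    # A term's importance is its cosine similarity to the topic centroid
    return cosine_similarity(vocab_embeddings, centroid).ravel()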

3b. Term Importance: c-TF-IDF

The solution suggested by Grootendorst (2022) to this issue was c-TF-IDF.

c-TF-IDF is a weighting scheme based on the number of occurrences of terms in each cluster. Terms that frequently occur in other clusters are weighted down, so that words specific to a topic gain larger importance.

Let \(X\) be the document-term matrix, where each element \(X_{ij}\) corresponds to the number of times word \(j\) occurs in document \(i\).

By default, Turftopic uses a modified version of c-TF-IDF, called Soft-c-TF-IDF, which is calculated in the following manner:

  • Estimate weight of term \(j\) for topic \(z\):
    \(tf_{zj} = \frac{t_{zj}}{w_z}\), where \(t_{zj} = \sum_{i \in z} X_{ij}\) is the number of occurrences of a word in a topic and \(w_{z}= \sum_{j} t_{zj}\) is all words in the topic
  • Estimate inverse document/topic frequency for term \(j\):
    \(idf_j = log(\frac{N}{\sum_z |t_{zj}|})\), where \(N\) is the total number of documents.
  • Calculate importance of term \(j\) for topic \(z\):
    \(\text{Soft-c-TF-IDF}_{zj} = tf_{zj} \cdot idf_j\)
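
To make the formulas concrete, here is a minimal NumPy sketch of the computation (a simplified illustration, not Turftopic's internal implementation; it assumes dense arrays and that every term occurs at least once):

import numpy as np

def soft_ctf_idf_sketch(doc_topic_matrix, doc_term_matrix):
    # doc_topic_matrix: (n_documents, n_topics) one-hot cluster memberships
    # doc_term_matrix: (n_documents, n_terms) raw term counts
    term_topic = doc_topic_matrix.T @ doc_term_matrix        # t_zj
    tf = term_topic / term_topic.sum(axis=1, keepdims=True)  # tf_zj = t_zj / w_z
    n_docs = doc_term_matrix.shape[0]                        # N
    idf = np.log(n_docs / np.abs(term_topic).sum(axis=0))    # idf_j
    return tf * idf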

You can also use the original c-TF-IDF formula if you intend to replicate the behaviour of BERTopic exactly. The two formulas tend to give similar results, though the implications of choosing one over the other have not been thoroughly evaluated. The original c-TF-IDF is calculated in the following manner:

  • Estimate weight of term \(j\) for topic \(z\):
    \(tf_{zj} = \frac{t_{zj}}{w_z}\), where \(t_{zj} = \sum_{i \in z} X_{ij}\) is the number of occurrences of a word in a topic and \(w_{z}= \sum_{j} t_{zj}\) is all words in the topic
  • Estimate inverse document/topic frequency for term \(j\):
    \(idf_j = log(1 + \frac{A}{\sum_z |t_{zj}|})\), where \(A = \frac{\sum_z \sum_j t_{zj}}{Z}\) is the average number of words per topic, and \(Z\) is the number of topics.
  • Calculate importance of term \(j\) for topic \(z\):
    \(\text{c-TF-IDF}_{zj} = tf_{zj} \cdot idf_j\)

This solution is generally preferred over centroid-based term importance (and is the default in Turftopic), as it is more likely to give correct results. On the other hand, c-TF-IDF can be sensitive to words with atypical statistical properties (such as stop words), and can result in low diversity between topics when clusters are joined post-hoc.

4. Hierarchical Topic Merging

A weakness of approaches based on density-based clustering methods is that they all too frequently find a very large number of topics. To limit the number of topics in a topic model, you can use hierarchical topic merging.

Merge Smallest

The approach used in the Top2Vec package is to repeatedly merge the smallest topic into the topic closest to it (excluding the outlier cluster) until the number of topics is reduced to the desired amount.

You can achieve this behaviour like so:

from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(n_reduce_to=10, reduction_method="smallest")

Agglomerative Clustering

In BERTopic, topics are merged with agglomerative clustering using average linkage, after which term importances are re-estimated. You can do this in Turftopic as well:

model = ClusteringTopicModel(n_reduce_to=10, reduction_method="agglomerative")

BERTopic and Top2Vec

Turftopic's implementation differs from BERTopic and Top2Vec in multiple places. You can, however, construct models in Turftopic that imitate the behaviour of these other packages.

The main differences from these packages are:

  • Dimensionality reduction in BERTopic and Top2Vec is done with UMAP.
  • Clustering in BERTopic and Top2Vec is done with HDBSCAN.
  • Turftopic does not include many of the visualization and model-specific utilities that BERTopic does.

To get as close as possible to the functionality of the two other packages, you can manually set the clustering and dimensionality reduction models when creating a ClusteringTopicModel.

You will need UMAP and scikit-learn>=1.3.0:

pip install umap-learn "scikit-learn>=1.3.0"

This is how you build a BERTopic-like model in Turftopic:

from turftopic import ClusteringTopicModel
from sklearn.cluster import HDBSCAN
import umap

# The default parameters of BERTopic are also included so that the behaviour
# is as close as possible
bertopic = ClusteringTopicModel(
    dimensionality_reduction=umap.UMAP(
        n_neighbors=10,
        n_components=5,
        min_dist=0.0,
        metric="cosine",
    ),
    clustering=HDBSCAN(
        min_cluster_size=15,
        metric="euclidean",
        cluster_selection_method="eom",
    ),
    feature_importance="c-tf-idf",
    reduction_method="agglomerative"
)

This is how you build a Top2Vec-like model in Turftopic:

top2vec = ClusteringTopicModel(
    dimensionality_reduction=umap.UMAP(
        n_neighbors=15,
        n_components=5,
        metric="cosine"
    ),
    clustering=HDBSCAN(
        min_cluster_size=15,
        metric="euclidean",
        cluster_selection_method="eom",
    ),
    feature_importance="centroid",
    reduction_method="smallest"
)

In theory, the model configurations above should result in the same behaviour as the other two packages, but there might be minor differences in implementation. We do not intend to keep up with changes in Top2Vec's and BERTopic's internal implementation details indefinitely.

(Optional) 5. Dynamic Modeling

Clustering models are also capable of dynamic topic modeling. A single clustering model is fitted over the entire corpus, as we expect that there is only one semantic model generating the documents. To gain temporal representations for topics, the corpus is divided into equally sized, or arbitrarily chosen, time slices, and term importances are then estimated for each time slice separately using Soft-c-TF-IDF, c-TF-IDF, or distances from the cluster centroid. When distance from cluster centroids is used to estimate term importances in dynamic modeling, cluster centroids are computed based on documents and terms present within a given time slice.
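
A minimal usage sketch (the corpus and timestamps are placeholders you would supply yourself; the number of bins is illustrative):

from datetime import datetime

from turftopic import ClusteringTopicModel

corpus: list[str] = ["some text", "more text", ...]
timestamps: list[datetime] = [datetime(2012, 5, 3), datetime(2013, 11, 21), ...]

model = ClusteringTopicModel()
# Fit on the whole corpus, then estimate term importances per time bin
doc_topic_matrix = model.fit_transform_dynamic(corpus, timestamps, bins=10)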

Considerations

Strengths

  • Automatic Discovery of the Number of Topics: Clustering models can find the number of topics by themselves. This is a useful quality of these models, as practitioners can rarely make an informed decision about the number of topics a priori.
  • No Assumptions of Normality: With clustering models you can avoid making assumptions about cluster shapes. This is in contrast with GMMs, which assume topics to be Gaussian components.
  • Outlier Detection: Clustering methods such as OPTICS, HDBSCAN, or DBSCAN include outlier detection. This way, outliers do not influence topic representations.
  • Not Affected by Embedding Size: Since the models include dimensionality reduction, they are not as influenced by the curse of dimensionality as other methods.

Weaknesses

  • Scalability: Clustering models typically cannot be fitted in an online fashion, and manifold learning is usually inefficient in large corpora. When the number of texts is huge, the number of topics also gets inflated, which is impractical for interpretation.
  • Lack of Nuance: The models are unable to capture multiple topics in a document or capture uncertainty around topic labels. This makes the models impractical for longer texts as well.
  • Sensitivity to Hyperparameters: While you do not have to set the number of topics directly, the hyperparameters you choose have a huge impact on the number of topics you will end up with. You can counteract this to a certain extent with hierarchical merging. (see figure)
  • Transductivity: Some clustering methods are transductive, meaning you can't predict topical content for new documents, as they would change the cluster structure.
Effect of UMAP's and HDBSCAN's Hyperparameters on the Number of Topics in 20 Newsgroups

API Reference

turftopic.models.cluster.ClusteringTopicModel

Bases: ContextualModel, ClusterMixin, DynamicTopicModel

Topic models, which assume topics to be clusters of documents in semantic space. Models also include a dimensionality reduction step to aid clustering.

from turftopic import ClusteringTopicModel
from sklearn.cluster import HDBSCAN
import umap

corpus: list[str] = ["some text", "more text", ...]

# Construct a Top2Vec-like model
model = ClusteringTopicModel(
    dimensionality_reduction=umap.UMAP(5),
    clustering=HDBSCAN(),
    feature_importance="centroid"
).fit(corpus)
model.print_topics()

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| encoder | Union[Encoder, str] | Model to encode documents/terms, all-MiniLM-L6-v2 is the default. | 'sentence-transformers/all-MiniLM-L6-v2' |
| vectorizer | Optional[CountVectorizer] | Vectorizer used for term extraction. Can be used to prune or filter the vocabulary. | None |
| dimensionality_reduction | Optional[TransformerMixin] | Dimensionality reduction step to run before clustering. Defaults to TSNE with cosine distance. To imitate the behavior of BERTopic or Top2Vec you should use UMAP. | None |
| clustering | Optional[ClusterMixin] | Clustering method to use for finding topics. Defaults to OPTICS with 25 minimum cluster size. To imitate the behavior of BERTopic or Top2Vec you should use HDBSCAN. | None |
| feature_importance | Literal['c-tf-idf', 'soft-c-tf-idf', 'centroid'] | Method for estimating term importances. 'centroid' uses distances from cluster centroid similarly to Top2Vec. 'c-tf-idf' uses BERTopic's c-tf-idf. 'soft-c-tf-idf' uses Soft c-TF-IDF from GMM, the results should be very similar to 'c-tf-idf'. | 'soft-c-tf-idf' |
| n_reduce_to | Optional[int] | Number of topics to reduce topics to. The specified reduction method will be used to merge them. By default, topics are not merged. | None |
| reduction_method | Literal['agglomerative', 'smallest'] | Method used to reduce the number of topics post-hoc. When 'agglomerative', BERTopic's topic reduction method is used, where topic vectors are hierarchically clustered. When 'smallest', the smallest topic gets merged into the closest non-outlier cluster until the desired number is achieved, similarly to Top2Vec. | 'agglomerative' |
| random_state | Optional[int] | Random state to use so that results are exactly reproducible. | None |
Source code in turftopic/models/cluster.py
class ClusteringTopicModel(ContextualModel, ClusterMixin, DynamicTopicModel):
    """Topic models, which assume topics to be clusters of documents
    in semantic space.
    Models also include a dimensionality reduction step to aid clustering.

    ```python
    from turftopic import ClusteringTopicModel
    from sklearn.cluster import HDBSCAN
    import umap

    corpus: list[str] = ["some text", "more text", ...]

    # Construct a Top2Vec-like model
    model = ClusteringTopicModel(
        dimensionality_reduction=umap.UMAP(5),
        clustering=HDBSCAN(),
        feature_importance="centroid"
    ).fit(corpus)
    model.print_topics()
    ```

    Parameters
    ----------
    encoder: str or SentenceTransformer
        Model to encode documents/terms, all-MiniLM-L6-v2 is the default.
    vectorizer: CountVectorizer, default None
        Vectorizer used for term extraction.
        Can be used to prune or filter the vocabulary.
    dimensionality_reduction: TransformerMixin, default None
        Dimensionality reduction step to run before clustering.
        Defaults to TSNE with cosine distance.
        To imitate the behavior of BERTopic or Top2Vec you should use UMAP.
    clustering: ClusterMixin, default None
        Clustering method to use for finding topics.
        Defaults to OPTICS with 25 minimum cluster size.
        To imitate the behavior of BERTopic or Top2Vec you should use HDBSCAN.
    feature_importance: 'soft-c-tf-idf', 'c-tf-idf' or 'centroid', default 'soft-c-tf-idf'
        Method for estimating term importances.
        'centroid' uses distances from cluster centroid similarly
        to Top2Vec.
        'c-tf-idf' uses BERTopic's c-tf-idf.
        'soft-c-tf-idf' uses Soft c-TF-IDF from GMM, the results should
        be very similar to 'c-tf-idf'.
    n_reduce_to: int, default None
        Number of topics to reduce topics to.
        The specified reduction method will be used to merge them.
        By default, topics are not merged.
    reduction_method: 'agglomerative', 'smallest'
        Method used to reduce the number of topics post-hoc.
        When 'agglomerative', BERTopic's topic reduction method is used,
        where topic vectors are hierarchically clustered.
        When 'smallest', the smallest topic gets merged into the closest
        non-outlier cluster until the desired number
        is achieved similarly to Top2Vec.
    random_state: int, default None
        Random state to use so that results are exactly reproducible.
    """

    def __init__(
        self,
        encoder: Union[
            Encoder, str
        ] = "sentence-transformers/all-MiniLM-L6-v2",
        vectorizer: Optional[CountVectorizer] = None,
        dimensionality_reduction: Optional[TransformerMixin] = None,
        clustering: Optional[ClusterMixin] = None,
        feature_importance: Literal[
            "c-tf-idf", "soft-c-tf-idf", "centroid"
        ] = "soft-c-tf-idf",
        n_reduce_to: Optional[int] = None,
        reduction_method: Literal[
            "agglomerative", "smallest"
        ] = "agglomerative",
        random_state: Optional[int] = None,
    ):
        self.encoder = encoder
        self.random_state = random_state
        if feature_importance not in ["c-tf-idf", "soft-c-tf-idf", "centroid"]:
            raise ValueError(feature_message)
        if isinstance(encoder, int):
            raise TypeError(integer_message)
        if isinstance(encoder, str):
            self.encoder_ = SentenceTransformer(encoder)
        else:
            self.encoder_ = encoder
        if vectorizer is None:
            self.vectorizer = default_vectorizer()
        else:
            self.vectorizer = vectorizer
        if clustering is None:
            self.clustering = OPTICS(min_samples=25)
        else:
            self.clustering = clustering
        if dimensionality_reduction is None:
            self.dimensionality_reduction = TSNE(
                n_components=2, metric="cosine", random_state=random_state
            )
        else:
            self.dimensionality_reduction = dimensionality_reduction
        self.feature_importance = feature_importance
        self.n_reduce_to = n_reduce_to
        self.reduction_method = reduction_method

    def _merge_agglomerative(self, n_reduce_to: int) -> np.ndarray:
        n_topics = self.components_.shape[0]
        res = {old_label: old_label for old_label in self.classes_}
        if n_topics <= n_reduce_to:
            return self.labels_
        interesting_topic_vectors = np.stack(
            [
                vec
                for label, vec in zip(self.classes_, self.topic_vectors_)
                if label != -1
            ]
        )
        old_labels = [label for label in self.classes_ if label != -1]
        new_labels = AgglomerativeClustering(
            n_clusters=n_reduce_to,
            metric="cosine",
            linkage="average",
        ).fit_predict(interesting_topic_vectors)
        res = {}
        if -1 in self.classes_:
            res[-1] = -1
        for i_old, i_new in zip(old_labels, new_labels):
            res[i_old] = i_new
        return np.array([res[label] for label in self.labels_])

    def _merge_smallest(self, n_reduce_to: int):
        merge_inst = smallest_hierarchical_join(
            self.topic_vectors_[self.classes_ != -1],
            self.topic_sizes_[self.classes_ != -1],
            self.classes_[self.classes_ != -1],
            n_reduce_to,
        )
        labels = np.copy(self.labels_)
        for from_topic, to_topic in merge_inst:
            labels[labels == from_topic] = to_topic
        return labels

    def _estimate_parameters(
        self,
        embeddings: np.ndarray,
        doc_term_matrix: np.ndarray,
    ):
        clusters = np.unique(self.labels_)
        self.classes_ = np.sort(clusters)
        self.topic_sizes_ = np.array(
            [np.sum(self.labels_ == label) for label in self.classes_]
        )
        self.topic_vectors_ = calculate_topic_vectors(self.labels_, embeddings)
        self.vocab_embeddings = self.encoder_.encode(
            self.vectorizer.get_feature_names_out()
        )  # type: ignore
        document_topic_matrix = label_binarize(
            self.labels_, classes=self.classes_
        )
        if self.feature_importance == "soft-c-tf-idf":
            self.components_ = soft_ctf_idf(
                document_topic_matrix, doc_term_matrix
            )  # type: ignore
        elif self.feature_importance == "centroid":
            self.components_ = cluster_centroid_distance(
                self.topic_vectors_,
                self.vocab_embeddings,
                metric="cosine",
            )
        else:
            self.components_ = ctf_idf(document_topic_matrix, doc_term_matrix)

    def fit_predict(
        self, raw_documents, y=None, embeddings: Optional[np.ndarray] = None
    ) -> np.ndarray:
        """Fits model and predicts cluster labels for all given documents.

        Parameters
        ----------
        raw_documents: iterable of str
            Documents to fit the model on.
        y: None
            Ignored, exists for sklearn compatibility.
        embeddings: ndarray of shape (n_documents, n_dimensions), optional
            Precomputed document encodings.

        Returns
        -------
        ndarray of shape (n_documents)
            Cluster label for all documents (-1 for outliers)
        """
        console = Console()
        with console.status("Fitting model") as status:
            if embeddings is None:
                status.update("Encoding documents")
                embeddings = self.encoder_.encode(raw_documents)
                console.log("Encoding done.")
            status.update("Extracting terms")
            self.doc_term_matrix = self.vectorizer.fit_transform(raw_documents)
            console.log("Term extraction done.")
            status.update("Reducing Dimensionality")
            reduced_embeddings = self.dimensionality_reduction.fit_transform(
                embeddings
            )
            console.log("Dimensionality reduction done.")
            status.update("Clustering documents")
            self.labels_ = self.clustering.fit_predict(reduced_embeddings)
            console.log("Clustering done.")
            status.update("Estimating parameters.")
            self._estimate_parameters(
                embeddings,
                self.doc_term_matrix,
            )
            console.log("Parameter estimation done.")
            if self.n_reduce_to is not None:
                n_topics = self.classes_.shape[0]
                status.update(
                    f"Reducing topics from {n_topics} to {self.n_reduce_to}"
                )
                if self.reduction_method == "agglomerative":
                    self.labels_ = self._merge_agglomerative(self.n_reduce_to)
                else:
                    self.labels_ = self._merge_smallest(self.n_reduce_to)
                console.log(
                    f"Topic reduction done from {n_topics} to {self.n_reduce_to}."
                )
                status.update("Reestimating parameters.")
                self._estimate_parameters(
                    embeddings,
                    self.doc_term_matrix,
                )
                console.log("Reestimation done.")
        console.log("Model fitting done.")
        return self.labels_

    def fit_transform(
        self, raw_documents, y=None, embeddings: Optional[np.ndarray] = None
    ):
        labels = self.fit_predict(raw_documents, y, embeddings)
        return label_binarize(labels, classes=self.classes_)

    def fit_transform_dynamic(
        self,
        raw_documents,
        timestamps: list[datetime],
        embeddings: Optional[np.ndarray] = None,
        bins: Union[int, list[datetime]] = 10,
    ):
        time_labels, self.time_bin_edges = self.bin_timestamps(
            timestamps, bins
        )
        if hasattr(self, "components_"):
            doc_topic_matrix = label_binarize(
                self.labels_, classes=self.classes_
            )
        else:
            doc_topic_matrix = self.fit_transform(
                raw_documents, embeddings=embeddings
            )
        n_comp, n_vocab = self.components_.shape
        n_bins = len(self.time_bin_edges) - 1
        self.temporal_components_ = np.zeros(
            (n_bins, n_comp, n_vocab), dtype=doc_topic_matrix.dtype
        )
        self.temporal_importance_ = np.zeros((n_bins, n_comp))
        if embeddings is None:
            embeddings = self.encoder_.encode(raw_documents)
        for i_timebin in np.unique(time_labels):
            topic_importances = doc_topic_matrix[time_labels == i_timebin].sum(
                axis=0
            )
            topic_importances = topic_importances / topic_importances.sum()
            t_doc_term_matrix = self.doc_term_matrix[time_labels == i_timebin]
            t_doc_topic_matrix = doc_topic_matrix[time_labels == i_timebin]
            if "c-tf-idf" in self.feature_importance:
                if self.feature_importance == "soft-c-tf-idf":
                    components = soft_ctf_idf(
                        t_doc_topic_matrix, t_doc_term_matrix
                    )
                elif self.feature_importance == "c-tf-idf":
                    components = ctf_idf(t_doc_topic_matrix, t_doc_term_matrix)
            elif self.feature_importance == "centroid":
                time_index = time_labels == i_timebin
                t_topic_vectors = calculate_topic_vectors(
                    self.labels_,
                    embeddings,
                    time_index,
                )
                topic_mask = np.isnan(t_topic_vectors).all(
                    axis=1, keepdims=True
                )
                t_topic_vectors[:] = 0
                components = cluster_centroid_distance(
                    t_topic_vectors,
                    self.vocab_embeddings,
                    metric="cosine",
                )
                components *= topic_mask
                mask_terms = t_doc_term_matrix.sum(axis=0).astype(np.float64)
                mask_terms[mask_terms == 0] = np.nan
                components *= mask_terms
            self.temporal_components_[i_timebin] = components
            self.temporal_importance_[i_timebin] = topic_importances
        return doc_topic_matrix

fit_predict(raw_documents, y=None, embeddings=None)

Fits model and predicts cluster labels for all given documents.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| raw_documents | | Documents to fit the model on. | required |
| y | | Ignored, exists for sklearn compatibility. | None |
| embeddings | Optional[ndarray] | Precomputed document encodings. | None |

Returns:

| Type | Description |
| --- | --- |
| ndarray of shape (n_documents) | Cluster label for all documents (-1 for outliers) |

Source code in turftopic/models/cluster.py
def fit_predict(
    self, raw_documents, y=None, embeddings: Optional[np.ndarray] = None
) -> np.ndarray:
    """Fits model and predicts cluster labels for all given documents.

    Parameters
    ----------
    raw_documents: iterable of str
        Documents to fit the model on.
    y: None
        Ignored, exists for sklearn compatibility.
    embeddings: ndarray of shape (n_documents, n_dimensions), optional
        Precomputed document encodings.

    Returns
    -------
    ndarray of shape (n_documents)
        Cluster label for all documents (-1 for outliers)
    """
    console = Console()
    with console.status("Fitting model") as status:
        if embeddings is None:
            status.update("Encoding documents")
            embeddings = self.encoder_.encode(raw_documents)
            console.log("Encoding done.")
        status.update("Extracting terms")
        self.doc_term_matrix = self.vectorizer.fit_transform(raw_documents)
        console.log("Term extraction done.")
        status.update("Reducing Dimensionality")
        reduced_embeddings = self.dimensionality_reduction.fit_transform(
            embeddings
        )
        console.log("Dimensionality reduction done.")
        status.update("Clustering documents")
        self.labels_ = self.clustering.fit_predict(reduced_embeddings)
        console.log("Clustering done.")
        status.update("Estimating parameters.")
        self._estimate_parameters(
            embeddings,
            self.doc_term_matrix,
        )
        console.log("Parameter estimation done.")
        if self.n_reduce_to is not None:
            n_topics = self.classes_.shape[0]
            status.update(
                f"Reducing topics from {n_topics} to {self.n_reduce_to}"
            )
            if self.reduction_method == "agglomerative":
                self.labels_ = self._merge_agglomerative(self.n_reduce_to)
            else:
                self.labels_ = self._merge_smallest(self.n_reduce_to)
            console.log(
                f"Topic reduction done from {n_topics} to {self.n_reduce_to}."
            )
            status.update("Reestimating parameters.")
            self._estimate_parameters(
                embeddings,
                self.doc_term_matrix,
            )
            console.log("Reestimation done.")
    console.log("Model fitting done.")
    return self.labels_