Clustering Topic Models
Clustering topic models conceptualize topic modeling as a clustering task. Essentially, a topic for these models is a tightly packed group of documents in semantic space.
The first contextually sensitive clustering topic model was introduced with Top2Vec, and BERTopic has also iterated on this idea.
Turftopic contains flexible implementations of these models where you have control over each of the steps in the process, while sticking to a minimal amount of extra dependencies. While the models themselves can be equivalent to BERTopic and Top2Vec implementations, Turftopic might not offer some of the implementation-specific features that the other libraries boast.
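For orientation, here is a minimal end-to-end run with the default components (the corpus below is a stand-in for your own documents):

from turftopic import ClusteringTopicModel

corpus: list[str] = ["some text", "more text", ...]

# Defaults: TSNE dimensionality reduction, OPTICS clustering, Soft-c-TF-IDF
model = ClusteringTopicModel().fit(corpus)
model.print_topics()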
The Model
1. Dimensionality Reduction
It is common practice in the clustering topic modeling literature to reduce the dimensionality of the embeddings before clustering them. This is done to avoid the curse of dimensionality, an issue that affects many clustering models.
In Turftopic, dimensionality reduction is done by default with scikit-learn's TSNE implementation, but users are free to specify the model used for this step.
The impact of the choice of dimensionality reduction method has not yet been explored in the literature, so our knowledge of it is limited. Top2Vec and BERTopic both use UMAP, which has a number of desirable properties over alternatives (arranging data points into cluster-like structures, better preservation of global structure than TSNE, speed).
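Swapping in a different dimensionality reduction step only requires passing a scikit-learn-compatible transformer; PCA below is merely an illustrative stand-in:

from sklearn.decomposition import PCA
from turftopic import ClusteringTopicModel

# Replace the default TSNE step with PCA (illustrative choice)
model = ClusteringTopicModel(dimensionality_reduction=PCA(n_components=5))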
2. Clustering
After reducing the dimensionality of the embeddings, they are clustered with a clustering model. As HDBSCAN has only been part of scikit-learn since version 1.3.0, Turftopic uses OPTICS as its default.
Some clustering models are capable of discovering the number of clusters in the data. This is a useful property of clustering topic models that other approaches have yet to match.
Practice suggests, however, that in large corpora this frequently results in a very large number of topics, which is impractical for interpretation. The models' hyperparameters can be adjusted to counteract this behaviour, as shown below, but the impact of hyperparameter choice on topic quality is largely unknown.
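For instance, you can pass a clustering model with adjusted hyperparameters when constructing the topic model (a sketch; min_cluster_size=50 is an arbitrary illustrative value, and sklearn's HDBSCAN requires scikit-learn>=1.3.0):

from sklearn.cluster import HDBSCAN
from turftopic import ClusteringTopicModel

# A larger minimum cluster size typically yields fewer, coarser topics
model = ClusteringTopicModel(clustering=HDBSCAN(min_cluster_size=50))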
3a. Term Importance: Proximity to Cluster Centroids
Clustering topic models rely on post-hoc term importance estimation. Currently, two methods are in use.
The solution introduced in Top2Vec (Angelov, 2020) is to estimate a term's importance for a given topic from the cosine similarity of its embedding to the centroid of the embeddings in a cluster (a minimal sketch of this follows the list below).
This has three implications:
- Topic descriptions are very specific. As the closest terms to the topic vector are selected, they tend to also be very close to each other. The issue with this is that many of the documents in a topic might not get proper coverage.
- It is assumed that clusters are convex and spherical. This may not be the case at all, and especially when clusters are concave, the terms closest to the centroid might end up describing a different, or even nonexistent, topic. In other words: the mean might not be a representative data point of the population.
- Noise rarely gets into topic descriptions. Since function words or contaminating terms are not very likely to be closest to the topic vector, descriptions are typically clean.
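Here is a minimal numpy sketch of the idea, assuming precomputed, L2-normalized document and vocabulary-term embeddings along with cluster labels (an illustration of the technique, not Turftopic's internal code):

import numpy as np

def centroid_term_importance(doc_embeddings, term_embeddings, labels):
    # Importance of each term for each topic: cosine similarity of the
    # term's embedding to the centroid of the topic's document embeddings
    importances = []
    for label in np.unique(labels[labels != -1]):  # skip the outlier cluster (-1)
        centroid = doc_embeddings[labels == label].mean(axis=0)
        centroid = centroid / np.linalg.norm(centroid)
        # with L2-normalized term embeddings, dot product == cosine similarity
        importances.append(term_embeddings @ centroid)
    return np.stack(importances)  # shape: (n_topics, n_terms)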
3b. Term Importance: c-TF-IDF
The solution suggested by Grootendorst (2022) to this issue was c-TF-IDF.
c-TF-IDF is a weighting scheme based on the number of occurrences of terms in each cluster. Terms that frequently occur in other clusters are weighted down, so that words specific to a topic gain larger importance.
Let \(X\) be the document-term matrix, where each element \(X_{ij}\) corresponds to the number of times word \(j\) occurs in document \(i\).
By default, Turftopic uses a modified version of c-TF-IDF, called Soft-c-TF-IDF, which is calculated in the following manner:
- Estimate weight of term \(j\) for topic \(z\): \(tf_{zj} = \frac{t_{zj}}{w_z}\), where \(t_{zj} = \sum_{i \in z} X_{ij}\) is the number of occurrences of a word in a topic and \(w_{z} = \sum_j t_{zj}\) is the total number of words in the topic.
- Estimate inverse document/topic frequency for term \(j\): \(idf_j = \log\left(\frac{N}{\sum_z |t_{zj}|}\right)\), where \(N\) is the total number of documents.
- Calculate the importance of term \(j\) for topic \(z\): \(\text{Soft-c-TF-IDF}_{zj} = tf_{zj} \cdot idf_j\)
You can also use the original c-TF-IDF formula if you intend to replicate the behaviour of BERTopic exactly. The two formulas tend to give similar results, though the implications of choosing one over the other have not been thoroughly evaluated. c-TF-IDF is calculated in the following manner:
- Estimate weight of term \(j\) for topic \(z\): \(tf_{zj} = \frac{t_{zj}}{w_z}\), where \(t_{zj} = \sum_{i \in z} X_{ij}\) is the number of occurrences of a word in a topic and \(w_{z} = \sum_j t_{zj}\) is the total number of words in the topic.
- Estimate inverse document/topic frequency for term \(j\): \(idf_j = \log\left(1 + \frac{A}{\sum_z |t_{zj}|}\right)\), where \(A = \frac{\sum_z \sum_j t_{zj}}{Z}\) is the average number of words per topic, and \(Z\) is the number of topics.
- Calculate the importance of term \(j\) for topic \(z\): \(\text{c-TF-IDF}_{zj} = tf_{zj} \cdot idf_j\)
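To make the two weighting schemes concrete, here is a small numpy sketch of both, following the formulas above (an illustration, not Turftopic's internal code; it assumes a dense document-term matrix and that every vocabulary term occurs at least once):

import numpy as np

def topic_term_counts(X, labels):
    # t[z, j]: number of occurrences of word j in documents belonging to topic z
    topics = np.unique(labels[labels != -1])  # skip the outlier cluster (-1)
    return np.stack([X[labels == z].sum(axis=0) for z in topics])

def soft_ctf_idf(X, labels):
    t = topic_term_counts(X, labels)
    tf = t / t.sum(axis=1, keepdims=True)             # tf_zj = t_zj / w_z
    idf = np.log(X.shape[0] / np.abs(t).sum(axis=0))  # log(N / sum_z |t_zj|)
    return tf * idf

def ctf_idf(X, labels):
    t = topic_term_counts(X, labels)
    tf = t / t.sum(axis=1, keepdims=True)
    avg_words = t.sum() / t.shape[0]                     # A: average words per topic
    idf = np.log(1 + avg_words / np.abs(t).sum(axis=0))  # log(1 + A / sum_z |t_zj|)
    return tf * idf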
This solution is generally preferable to centroid-based term importance (and is the default in Turftopic), as it is more likely to give correct results. On the other hand, c-TF-IDF can be sensitive to words with atypical statistical properties (e.g. stop words), and can result in low diversity between topics when clusters are joined post-hoc.
4. Hierarchical Topic Merging
A weakness of approaches based on density-based clustering methods is that they all too frequently find a very large number of topics. To limit the number of topics in a topic model, you can use hierarchical topic merging.
Merge Smallest
The approach used in the Top2Vec package is to repeatedly merge the smallest topic (excluding the outlier cluster) into the topic closest to it, until the number of topics is reduced to the desired amount.
You can achieve this behaviour like so:
from turftopic import ClusteringTopicModel
model = ClusteringTopicModel(n_reduce_to=10, reduction_method="smallest")
Agglomerative Clustering
In BERTopic, topics are merged with agglomerative clustering using average linkage, after which term importances are re-estimated. You can do this in Turftopic as well:
model = ClusteringTopicModel(n_reduce_to=10, reduction_method="agglomerative")
BERTopic and Top2Vec
Turftopic's implementation differs from BERTopic and Top2Vec in multiple places. You can, however, construct models in Turftopic that imitate the behaviour of these other packages.
The main differences to these packages are:
- Dimensionality reduction in BERTopic and Top2Vec is done with UMAP.
- Clustering in BERTopic and Top2Vec is done with HDBSCAN.
- Turftopic does not include many of the visualization and model-specific utilities that BERTopic does.
To get as close as possible to the behaviour of the two other packages, you can manually set the clustering and dimensionality reduction models when creating a model:
You will need UMAP and scikit-learn>=1.3.0:
pip install umap-learn "scikit-learn>=1.3.0"
This is how you build a BERTopic-like model in Turftopic:
from turftopic import ClusteringTopicModel
from sklearn.cluster import HDBSCAN
import umap
# I also included the default parameters of BERTopic so that the behaviour is as
# close as possible
bertopic = ClusteringTopicModel(
    dimensionality_reduction=umap.UMAP(
        n_neighbors=10,
        n_components=5,
        min_dist=0.0,
        metric="cosine",
    ),
    clustering=HDBSCAN(
        min_cluster_size=15,
        metric="euclidean",
        cluster_selection_method="eom",
    ),
    feature_importance="c-tf-idf",
    reduction_method="agglomerative",
)
This is how you build a Top2Vec-like model in Turftopic:
top2vec = ClusteringTopicModel(
    dimensionality_reduction=umap.UMAP(
        n_neighbors=15,
        n_components=5,
        metric="cosine",
    ),
    clustering=HDBSCAN(
        min_cluster_size=15,
        metric="euclidean",
        cluster_selection_method="eom",
    ),
    feature_importance="centroid",
    reduction_method="smallest",
)
Theoretically, the model descriptions above should result in the same behaviour as the other two packages, but there might be minor differences in implementation. We do not intend to keep up with changes in Top2Vec's and BERTopic's internal implementation details indefinitely.
(Optional) 5. Dynamic Modeling
Clustering models are also capable of dynamic topic modeling. A clustering model is fitted over the entire corpus, as we expect that there is only one semantic model generating the documents. To gain temporal representations for topics, the corpus is divided into equal, or arbitrarily chosen, time slices, and term importances are then estimated with Soft-c-TF-IDF, c-TF-IDF, or distances from the cluster centroid for each time slice separately. When distances from cluster centroids are used to estimate topic importances in dynamic modeling, cluster centroids are computed based on the documents and terms present within a given time slice.
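A short sketch of how this looks in practice, assuming the fit_transform_dynamic() and print_topics_over_time() methods of Turftopic's dynamic modeling interface, with one datetime per document and bins controlling the number of time slices:

from datetime import datetime
from turftopic import ClusteringTopicModel

corpus: list[str] = ["some text", "more text", ...]
timestamps: list[datetime] = [...]  # one timestamp per document

model = ClusteringTopicModel()
# Cluster the whole corpus once, then estimate term importances per time slice
doc_topic_matrix = model.fit_transform_dynamic(corpus, timestamps=timestamps, bins=10)
model.print_topics_over_time()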
Considerations
Strengths
- Automatic Discovery of the Number of Topics: Clustering models can find the number of topics by themselves. This is a useful quality, as practitioners can rarely make an informed decision about the number of topics a priori.
- No Assumptions of Normality: With clustering models you can avoid making assumptions about cluster shapes. This is in contrast with GMMs, which assume topics to be Gaussian components.
- Outlier Detection: OPTICS, HDBSCAN and DBSCAN include outlier detection, so outliers do not influence topic representations.
- Not Affected by Embedding Size: Since the models include dimensionality reduction, they are not as influenced by the curse of dimensionality as other methods.
Weaknesses
- Scalability: Clustering models typically cannot be fitted in an online fashion, and manifold learning is usually inefficient on large corpora. When the number of texts is huge, the number of topics also gets inflated, which is impractical for interpretation.
- Lack of Nuance: The models are unable to capture multiple topics in a document or uncertainty around topic labels. This makes them impractical for longer texts as well.
- Sensitivity to Hyperparameters: While you do not have to set the number of topics directly, the hyperparameters you choose have a huge impact on the number of topics you will end up with. You can counteract this to a certain extent with hierarchical merging.
- Transductivity: Some clustering methods are transductive, meaning you can't predict topical content for new documents, as they would change the cluster structure.
API Reference
turftopic.models.cluster.ClusteringTopicModel
Bases: ContextualModel, ClusterMixin, DynamicTopicModel
Topic models that assume topics to be clusters of documents in semantic space. These models also include a dimensionality reduction step to aid clustering.
from turftopic import ClusteringTopicModel
from sklearn.cluster import HDBSCAN
import umap
corpus: list[str] = ["some text", "more text", ...]
# Construct a Top2Vec-like model
model = ClusteringTopicModel(
    dimensionality_reduction=umap.UMAP(n_components=5),
    clustering=HDBSCAN(),
    feature_importance="centroid",
).fit(corpus)
model.print_topics()
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| encoder | Union[Encoder, str] | Model to encode documents/terms, all-MiniLM-L6-v2 is the default. | 'sentence-transformers/all-MiniLM-L6-v2' |
| vectorizer | Optional[CountVectorizer] | Vectorizer used for term extraction. Can be used to prune or filter the vocabulary. | None |
| dimensionality_reduction | Optional[TransformerMixin] | Dimensionality reduction step to run before clustering. Defaults to TSNE with cosine distance. To imitate the behavior of BERTopic or Top2Vec you should use UMAP. | None |
| clustering | Optional[ClusterMixin] | Clustering method to use for finding topics. Defaults to OPTICS with 25 minimum cluster size. To imitate the behavior of BERTopic or Top2Vec you should use HDBSCAN. | None |
| feature_importance | Literal['c-tf-idf', 'soft-c-tf-idf', 'centroid'] | Method for estimating term importances. 'centroid' uses distances from cluster centroid similarly to Top2Vec. 'c-tf-idf' uses BERTopic's c-TF-IDF. 'soft-c-tf-idf' uses Soft-c-TF-IDF from GMM; the results should be very similar to 'c-tf-idf'. | 'soft-c-tf-idf' |
| n_reduce_to | Optional[int] | Number of topics to reduce topics to. The specified reduction method will be used to merge them. By default, topics are not merged. | None |
| reduction_method | Literal['agglomerative', 'smallest'] | Method used to reduce the number of topics post-hoc. When 'agglomerative', BERTopic's topic reduction method is used, where topic vectors are hierarchically clustered. When 'smallest', the smallest topic gets merged into the closest non-outlier cluster until the desired number is achieved, similarly to Top2Vec. | 'agglomerative' |
| random_state | Optional[int] | Random state to use so that results are exactly reproducible. | None |
fit_predict(raw_documents, y=None, embeddings=None)
Fits model and predicts cluster labels for all given documents.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| raw_documents | | Documents to fit the model on. | required |
| y | | Ignored, exists for sklearn compatibility. | None |
| embeddings | Optional[ndarray] | Precomputed document encodings. | None |
Returns:
| Type | Description |
|---|---|
| ndarray of shape (n_documents) | Cluster label for all documents (-1 for outliers) |
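For instance (a short sketch, reusing the model and corpus from the example above):

cluster_labels = model.fit_predict(corpus)
# -1 marks outlier documents that were not assigned to any topic
n_outliers = (cluster_labels == -1).sum()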