Clustering Topic Models
Clustering topic models conceptualize topic modeling as a clustering task. Essentially, a topic for these models is a tightly packed group of documents in semantic space.
The first contextually sensitive clustering topic model was introduced with Top2Vec, and BERTopic has also iterated on this idea.
Turftopic contains flexible implementations of these models where you have control over each of the steps in the process, while sticking to a minimal amount of extra dependencies. While the models themselves can be equivalent to BERTopic and Top2Vec implementations, Turftopic might not offer some of the implementation-specific features that the other libraries boast.
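For orientation, here is a minimal end-to-end run with the default components (the corpus below is a stand-in for your own documents):

from turftopic import ClusteringTopicModel

corpus: list[str] = ["some text", "more text", ...]

# Defaults: TSNE dimensionality reduction, OPTICS clustering, Soft-c-TF-IDF
model = ClusteringTopicModel().fit(corpus)
model.print_topics()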
The Model
1. Dimensionality Reduction
It is common practice in the clustering topic modeling literature to reduce the dimensionality of the embeddings before clustering them. This is done to avoid the curse of dimensionality, an issue that affects many clustering models.
In Turftopic, dimensionality reduction is done by default with scikit-learn's TSNE implementation, but users are free to specify the model used for this step.
The impact of the choice of dimensionality reduction method has not yet been explored in the literature, so our knowledge of it is limited. Top2Vec and BERTopic both use UMAP, which has a number of desirable properties over alternatives (arranging data points into cluster-like structures, better preservation of global structure than TSNE, speed).
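Swapping in a different dimensionality reduction step only requires passing a scikit-learn-compatible transformer; PCA below is merely an illustrative stand-in:

from sklearn.decomposition import PCA
from turftopic import ClusteringTopicModel

# Replace the default TSNE step with PCA (illustrative choice)
model = ClusteringTopicModel(dimensionality_reduction=PCA(n_components=5))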
2. Clustering
After reducing the dimensionality of the embeddings, they are clustered with a clustering model. As HDBSCAN has only been part of scikit-learn since version 1.3.0, Turftopic uses OPTICS as its default.
Some clustering models are capable of discovering the number of clusters in the data. This is a useful property of clustering topic models that other approaches have yet to match.
Practice suggests, however, that in large corpora this frequently results in a very large number of topics, which is impractical for interpretation. The models' hyperparameters can be adjusted to counteract this behaviour, as shown below, but the impact of hyperparameter choice on topic quality is largely unknown.
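For instance, you can pass a clustering model with adjusted hyperparameters when constructing the topic model (a sketch; min_cluster_size=50 is an arbitrary illustrative value, and sklearn's HDBSCAN requires scikit-learn>=1.3.0):

from sklearn.cluster import HDBSCAN
from turftopic import ClusteringTopicModel

# A larger minimum cluster size typically yields fewer, coarser topics
model = ClusteringTopicModel(clustering=HDBSCAN(min_cluster_size=50))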
3a. Term Importance: Proximity to Cluster Centroids
Clustering topic models rely on post-hoc term importance estimation. Currently, two methods are in use.
The solution introduced in Top2Vec (Angelov, 2020) is to estimate a term's importance for a given topic from the cosine similarity of its embedding to the centroid of the embeddings in a cluster (a minimal sketch of this follows the list below).
This has three implications:
- Topic descriptions are very specific. As the closest terms to the topic vector are selected, they tend to also be very close to each other. The issue with this is that many of the documents in a topic might not get proper coverage.
- It is assumed that clusters are convex and spherical. This may not be the case at all, and especially when clusters are concave, the terms closest to the centroid might end up describing a different, or even nonexistent, topic. In other words: the mean might not be a representative data point of the population.
- Noise rarely gets into topic descriptions. Since function words or contaminating terms are not very likely to be closest to the topic vector, descriptions are typically clean.
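Here is a minimal numpy sketch of the idea, assuming precomputed, L2-normalized document and vocabulary-term embeddings along with cluster labels (an illustration of the technique, not Turftopic's internal code):

import numpy as np

def centroid_term_importance(doc_embeddings, term_embeddings, labels):
    # Importance of each term for each topic: cosine similarity of the
    # term's embedding to the centroid of the topic's document embeddings
    importances = []
    for label in np.unique(labels[labels != -1]):  # skip the outlier cluster (-1)
        centroid = doc_embeddings[labels == label].mean(axis=0)
        centroid = centroid / np.linalg.norm(centroid)
        # with L2-normalized term embeddings, dot product == cosine similarity
        importances.append(term_embeddings @ centroid)
    return np.stack(importances)  # shape: (n_topics, n_terms)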
3b. Term Importance: c-TF-IDF
The solution suggested by Grootendorst (2022) to this issue was c-TF-IDF.
c-TF-IDF is a weighting scheme based on the number of occurrences of terms in each cluster. Terms that frequently occur in other clusters are weighted down, so that words specific to a topic gain larger importance.
Let \(X\) be the document-term matrix, where each element \(X_{ij}\) corresponds to the number of times word \(j\) occurs in document \(i\).
By default, Turftopic uses a modified version of c-TF-IDF, called Soft-c-TF-IDF, which is calculated in the following manner:
- Estimate weight of term \(j\) for topic \(z\): \(tf_{zj} = \frac{t_{zj}}{w_z}\), where \(t_{zj} = \sum_{i \in z} X_{ij}\) is the number of occurrences of a word in a topic and \(w_{z} = \sum_j t_{zj}\) is the total number of words in the topic.
- Estimate inverse document/topic frequency for term \(j\): \(idf_j = \log\left(\frac{N}{\sum_z |t_{zj}|}\right)\), where \(N\) is the total number of documents.
- Calculate the importance of term \(j\) for topic \(z\): \(\text{Soft-c-TF-IDF}_{zj} = tf_{zj} \cdot idf_j\)
You can also use the original c-TF-IDF formula if you intend to replicate the behaviour of BERTopic exactly. The two formulas tend to give similar results, though the implications of choosing one over the other have not been thoroughly evaluated. c-TF-IDF is calculated in the following manner:
- Estimate weight of term \(j\) for topic \(z\): \(tf_{zj} = \frac{t_{zj}}{w_z}\), where \(t_{zj} = \sum_{i \in z} X_{ij}\) is the number of occurrences of a word in a topic and \(w_{z} = \sum_j t_{zj}\) is the total number of words in the topic.
- Estimate inverse document/topic frequency for term \(j\): \(idf_j = \log\left(1 + \frac{A}{\sum_z |t_{zj}|}\right)\), where \(A = \frac{\sum_z \sum_j t_{zj}}{Z}\) is the average number of words per topic, and \(Z\) is the number of topics.
- Calculate the importance of term \(j\) for topic \(z\): \(\text{c-TF-IDF}_{zj} = tf_{zj} \cdot idf_j\)
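To make the two weighting schemes concrete, here is a small numpy sketch of both, following the formulas above (an illustration, not Turftopic's internal code; it assumes a dense document-term matrix and that every vocabulary term occurs at least once):

import numpy as np

def topic_term_counts(X, labels):
    # t[z, j]: number of occurrences of word j in documents belonging to topic z
    topics = np.unique(labels[labels != -1])  # skip the outlier cluster (-1)
    return np.stack([X[labels == z].sum(axis=0) for z in topics])

def soft_ctf_idf(X, labels):
    t = topic_term_counts(X, labels)
    tf = t / t.sum(axis=1, keepdims=True)             # tf_zj = t_zj / w_z
    idf = np.log(X.shape[0] / np.abs(t).sum(axis=0))  # log(N / sum_z |t_zj|)
    return tf * idf

def ctf_idf(X, labels):
    t = topic_term_counts(X, labels)
    tf = t / t.sum(axis=1, keepdims=True)
    avg_words = t.sum() / t.shape[0]                     # A: average words per topic
    idf = np.log(1 + avg_words / np.abs(t).sum(axis=0))  # log(1 + A / sum_z |t_zj|)
    return tf * idf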
This solution is generally preferable to centroid-based term importance (and is the default in Turftopic), as it is more likely to give correct results. On the other hand, c-TF-IDF can be sensitive to words with atypical statistical properties (e.g. stop words), and can result in low diversity between topics when clusters are joined post-hoc.
4. Hierarchical Topic Merging
A weakness of approaches based on density-based clustering methods is that they all too frequently find a very large number of topics. To limit the number of topics in a topic model, you can use hierarchical topic merging.
Merge Smallest
The approach used in the Top2Vec package is to repeatedly merge the smallest topic (excluding the outlier cluster) into the topic closest to it, until the number of topics is reduced to the desired amount.
You can achieve this behaviour like so:
from turftopic import ClusteringTopicModel
model = ClusteringTopicModel(n_reduce_to=10, reduction_method="smallest")
Agglomerative Clustering
In BERTopic, topics are merged with agglomerative clustering using average linkage, after which term importances are re-estimated. You can do this in Turftopic as well:
model = ClusteringTopicModel(n_reduce_to=10, reduction_method="agglomerative")
BERTopic and Top2Vec
Turftopic's implementation differs from BERTopic and Top2Vec in multiple places. You can, however, construct models in Turftopic that imitate the behaviour of these other packages.
The main differences to these packages are:
- Dimensionality reduction in BERTopic and Top2Vec is done with UMAP.
- Clustering in BERTopic and Top2Vec is done with HDBSCAN.
- Turftopic does not include many of the visualization and model-specific utilities that BERTopic does.
To get as close as possible to the behaviour of the two other packages, you can manually set the clustering and dimensionality reduction models when creating a model:
You will need UMAP and scikit-learn>=1.3.0:
pip install umap-learn "scikit-learn>=1.3.0"
This is how you build a BERTopic-like model in Turftopic:
from turftopic import ClusteringTopicModel
from sklearn.cluster import HDBSCAN
import umap
# I also included the default parameters of BERTopic so that the behaviour is as
# close as possible
bertopic = ClusteringTopicModel(
    dimensionality_reduction=umap.UMAP(
        n_neighbors=10,
        n_components=5,
        min_dist=0.0,
        metric="cosine",
    ),
    clustering=HDBSCAN(
        min_cluster_size=15,
        metric="euclidean",
        cluster_selection_method="eom",
    ),
    feature_importance="c-tf-idf",
    reduction_method="agglomerative",
)
This is how you build a Top2Vec-like model in Turftopic:
top2vec = ClusteringTopicModel(
    dimensionality_reduction=umap.UMAP(
        n_neighbors=15,
        n_components=5,
        metric="cosine",
    ),
    clustering=HDBSCAN(
        min_cluster_size=15,
        metric="euclidean",
        cluster_selection_method="eom",
    ),
    feature_importance="centroid",
    reduction_method="smallest",
)
Theoretically, the model descriptions above should result in the same behaviour as the other two packages, but there might be minor differences in implementation. We do not intend to keep up with changes in Top2Vec's and BERTopic's internal implementation details indefinitely.
(Optional) 5. Dynamic Modeling
Clustering models are also capable of dynamic topic modeling. A clustering model is fitted over the entire corpus, as we expect that there is only one semantic model generating the documents. To gain temporal representations for topics, the corpus is divided into equal, or arbitrarily chosen, time slices, and term importances are then estimated with Soft-c-TF-IDF, c-TF-IDF, or distances from the cluster centroid for each time slice separately. When distances from cluster centroids are used to estimate topic importances in dynamic modeling, cluster centroids are computed based on the documents and terms present within a given time slice.
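A short sketch of how this looks in practice, assuming the fit_transform_dynamic() and print_topics_over_time() methods of Turftopic's dynamic modeling interface, with one datetime per document and bins controlling the number of time slices:

from datetime import datetime
from turftopic import ClusteringTopicModel

corpus: list[str] = ["some text", "more text", ...]
timestamps: list[datetime] = [...]  # one timestamp per document

model = ClusteringTopicModel()
# Cluster the whole corpus once, then estimate term importances per time slice
doc_topic_matrix = model.fit_transform_dynamic(corpus, timestamps=timestamps, bins=10)
model.print_topics_over_time()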
Considerations
Strengths
- Automatic Discovery of the Number of Topics: Clustering models can find the number of topics by themselves. This is a useful quality, as practitioners can rarely make an informed decision about the number of topics a priori.
- No Assumptions of Normality: With clustering models you can avoid making assumptions about cluster shapes. This is in contrast with GMMs, which assume topics to be Gaussian components.
- Outlier Detection: OPTICS, HDBSCAN and DBSCAN include outlier detection, so outliers do not influence topic representations.
- Not Affected by Embedding Size: Since the models include dimensionality reduction, they are not as influenced by the curse of dimensionality as other methods.
Weaknesses
- Scalability: Clustering models typically cannot be fitted in an online fashion, and manifold learning is usually inefficient on large corpora. When the number of texts is huge, the number of topics also gets inflated, which is impractical for interpretation.
- Lack of Nuance: The models are unable to capture multiple topics in a document or uncertainty around topic labels. This makes them impractical for longer texts as well.
- Sensitivity to Hyperparameters: While you do not have to set the number of topics directly, the hyperparameters you choose have a huge impact on the number of topics you will end up with. You can counteract this to a certain extent with hierarchical merging.
- Transductivity: Some clustering methods are transductive, meaning you can't predict topical content for new documents, as they would change the cluster structure.
API Reference
turftopic.models.cluster.ClusteringTopicModel
Bases: ContextualModel, ClusterMixin, DynamicTopicModel
Topic models that assume topics to be clusters of documents in semantic space. These models also include a dimensionality reduction step to aid clustering.
from turftopic import ClusteringTopicModel
from sklearn.cluster import HDBSCAN
import umap
corpus: list[str] = ["some text", "more text", ...]
# Construct a Top2Vec-like model
model = ClusteringTopicModel(
    dimensionality_reduction=umap.UMAP(n_components=5),
    clustering=HDBSCAN(),
    feature_importance="centroid",
).fit(corpus)
model.print_topics()
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| encoder | Union[Encoder, str] | Model to encode documents/terms, all-MiniLM-L6-v2 is the default. | 'sentence-transformers/all-MiniLM-L6-v2' |
| vectorizer | Optional[CountVectorizer] | Vectorizer used for term extraction. Can be used to prune or filter the vocabulary. | None |
| dimensionality_reduction | Optional[TransformerMixin] | Dimensionality reduction step to run before clustering. Defaults to TSNE with cosine distance. To imitate the behavior of BERTopic or Top2Vec you should use UMAP. | None |
| clustering | Optional[ClusterMixin] | Clustering method to use for finding topics. Defaults to OPTICS with 25 minimum cluster size. To imitate the behavior of BERTopic or Top2Vec you should use HDBSCAN. | None |
| feature_importance | Literal['c-tf-idf', 'soft-c-tf-idf', 'centroid'] | Method for estimating term importances. 'centroid' uses distances from cluster centroid similarly to Top2Vec. 'c-tf-idf' uses BERTopic's c-TF-IDF. 'soft-c-tf-idf' uses Soft-c-TF-IDF from GMM; the results should be very similar to 'c-tf-idf'. | 'soft-c-tf-idf' |
| n_reduce_to | Optional[int] | Number of topics to reduce topics to. The specified reduction method will be used to merge them. By default, topics are not merged. | None |
| reduction_method | Literal['agglomerative', 'smallest'] | Method used to reduce the number of topics post-hoc. When 'agglomerative', BERTopic's topic reduction method is used, where topic vectors are hierarchically clustered. When 'smallest', the smallest topic gets merged into the closest non-outlier cluster until the desired number is achieved, similarly to Top2Vec. | 'agglomerative' |
| random_state | Optional[int] | Random state to use so that results are exactly reproducible. | None |
fit_predict(raw_documents, y=None, embeddings=None)
Fits model and predicts cluster labels for all given documents.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| raw_documents | | Documents to fit the model on. | required |
| y | | Ignored, exists for sklearn compatibility. | None |
| embeddings | Optional[ndarray] | Precomputed document encodings. | None |
Returns:
| Type | Description |
|---|---|
| ndarray of shape (n_documents) | Cluster label for all documents (-1 for outliers) |
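For instance (a short sketch, reusing the model and corpus from the example above):

cluster_labels = model.fit_predict(corpus)
# -1 marks outlier documents that were not assigned to any topic
n_outliers = (cluster_labels == -1).sum()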