
C-Top2Vec

Contextual Top2Vec (Angelov and Inkpen, 2024) is a late-interaction topic model that uses windowed representations of documents.

Info

This part of the documentation is still in the works. More information, visualizations and benchmark results are on their way.

The model is essentially the same as wrapping a regular Top2Vec model in LateWrapper, but we provide a convenience class in Turftopic so that it's easy for you to initialize. It comes pre-loaded with the following features:

  • The same hyperparameters as in Angelov and Inkpen (2024)
  • A phrase vectorizer that discovers frequently occurring phrases based on pointwise mutual information (PMI)
  • A LateSentenceTransformer encoder by default; you can specify any model.

Our implementation is considerably more flexible than the original top2vec package, allowing you to use more powerful or novel embedding models.
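To make this concrete, here is a minimal sketch of the manual construction that the convenience class replaces. The import path for LateWrapper is an assumption, and the convenience class additionally swaps in a PhraseVectorizer and a LateSentenceTransformer encoder for you:

from turftopic import ClusteringTopicModel, LateWrapper  # import path for LateWrapper assumed

# Roughly what CTop2Vec() sets up under the hood
# (default values taken from the API reference below)
model = LateWrapper(
    ClusteringTopicModel(feature_importance="centroid"),
    window_size=50,  # windowed token aggregation
    step_size=40,
)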

Tip

For more info about multi-vector/late-interaction models, read our User Guide.

Example Usage

You should install Turftopic with UMAP to be able to use C-Top2Vec:

pip install turftopic[umap-learn]

Then use the topic model as you would use any other model in Turftopic:

from turftopic import CTop2Vec

corpus: list[str] = ["some text", "more text", ...]

# n_reduce_to merges the discovered clusters down to 5 topics
model = CTop2Vec(n_reduce_to=5)
doc_topic_matrix = model.fit_transform(corpus)

model.print_topics()
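If you want a quick sanity check of the result, the returned matrix can be inspected directly; the (n_documents, n_topics) shape noted below is an assumption for illustration:

import numpy as np

print(doc_topic_matrix.shape)                    # assumed shape: (n_documents, n_topics)
print(np.argmax(doc_topic_matrix, axis=1)[:10])  # most prominent topic for the first 10 documents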

Citation

Please cite Angelov and Inkpen (2024) and Turftopic when using C-Top2Vec in publications:

@article{Kardos2025,
  title = {Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers},
  doi = {10.21105/joss.08183},
  url = {https://doi.org/10.21105/joss.08183},
  year = {2025},
  publisher = {The Open Journal},
  volume = {10},
  number = {111},
  pages = {8183},
  author = {Kardos, Márton and Enevoldsen, Kenneth C. and Kostkan, Jan and Kristensen-McLachlan, Ross Deans and Rocca, Roberta},
  journal = {Journal of Open Source Software} 
}

@inproceedings{angelov-inkpen-2024-topic,
    title = "Topic Modeling: Contextual Token Embeddings Are All You Need",
    author = "Angelov, Dimo  and
      Inkpen, Diana",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-emnlp.790/",
    doi = "10.18653/v1/2024.findings-emnlp.790",
    pages = "13528--13539",
    abstract = "The goal of topic modeling is to find meaningful topics that capture the information present in a collection of documents. The main challenges of topic modeling are finding the optimal number of topics, labeling the topics, segmenting documents by topic, and evaluating topic model performance. Current neural approaches have tackled some of these problems but none have been able to solve all of them. We introduce a novel topic modeling approach, Contextual-Top2Vec, which uses document contextual token embeddings, it creates hierarchical topics, finds topic spans within documents and labels topics with phrases rather than just words. We propose the use of BERTScore to evaluate topic coherence and to evaluate how informative topics are of the underlying documents. Our model outperforms the current state-of-the-art models on a comprehensive set of topic model evaluation metrics."
}

API Reference

turftopic.models.cluster.CTop2Vec

Bases: LateWrapper

Convenience function to construct a CTop2Vec model in Turftopic. The model is essentially the same as ClusteringTopicModel in a Late Wrapper with defaults that resemble CTop2Vec. This includes:

  1. A late interaction embedding model, with windowed aggregation
  2. UMAP reduction
  3. HDBSCAN clustering
  4. Centroid term importance
  5. Phrase vectorizer
pip install turftopic[umap-learn]
from turftopic import CTop2Vec

corpus: list[str] = ["some text", "more text", ...]

model = CTop2Vec().fit(corpus)
model.print_topics()
Source code in turftopic/models/cluster.py
class CTop2Vec(LateWrapper):
    """Convenience function to construct a CTop2Vec model in Turftopic.
    The model is essentially the same as ClusteringTopicModel in a Late Wrapper
    with defaults that resemble CTop2Vec. This includes:

    1. A late interaction embedding model, with windowed aggregation
    2. UMAP reduction
    3. HDBSCAN clustering
    4. Centroid term importance
    5. Phrase vectorizer

    ```bash
    pip install turftopic[umap-learn]
    ```

    ```python
    from turftopic import CTop2Vec

    corpus: list[str] = ["some text", "more text", ...]

    model = CTop2Vec().fit(corpus)
    model.print_topics()
    ```
    """

    def __init__(
        self,
        encoder: Union[
            Encoder, str, MultimodalEncoder
        ] = "sentence-transformers/all-MiniLM-L6-v2",
        vectorizer: Optional[CountVectorizer] = None,
        dimensionality_reduction: Optional[TransformerMixin] = None,
        clustering: Optional[ClusterMixin] = None,
        feature_importance: WordImportance = "centroid",
        n_reduce_to: Optional[int] = None,
        reduction_method: LinkageMethod = "smallest",
        reduction_distance_metric: DistanceMetric = "cosine",
        reduction_topic_representation: TopicRepresentation = "centroid",
        window_size: Optional[int] = 50,
        step_size: Optional[int] = 40,
        pooling: Optional[Callable] = np.nanmean,
        random_state: Optional[int] = None,
    ):
        if dimensionality_reduction is None:
            try:
                from umap import UMAP
            except ModuleNotFoundError as e:
                raise ModuleNotFoundError(
                    "UMAP is not installed in your environment, but Top2Vec requires it."
                ) from e
            dimensionality_reduction = UMAP(
                n_neighbors=15,
                n_components=5,
                min_dist=0.0,
                metric="cosine",
                random_state=random_state,
            )
        if clustering is None:
            clustering = HDBSCAN(
                min_cluster_size=15,
                metric="euclidean",
                cluster_selection_method="eom",
            )
        self.encoder = encoder
        if isinstance(encoder, str):
            encoder = LateSentenceTransformer(encoder)
        if vectorizer is None:
            vectorizer = PhraseVectorizer()
        self.dimensionality_reduction = dimensionality_reduction
        self.clustering = clustering
        self.feature_importance = feature_importance
        self.n_reduce_to = n_reduce_to
        self.reduction_method = reduction_method
        self.reduction_distance_metric = reduction_distance_metric
        self.reduction_topic_representation = reduction_topic_representation
        self.random_state = random_state
        model = ClusteringTopicModel(
            encoder=encoder,
            vectorizer=vectorizer,
            dimensionality_reduction=dimensionality_reduction,
            clustering=clustering,
            n_reduce_to=n_reduce_to,
            random_state=random_state,
            feature_importance=feature_importance,
            reduction_method=reduction_method,
            reduction_distance_metric=reduction_distance_metric,
            reduction_topic_representation=reduction_topic_representation,
        )
        super().__init__(
            model,
            window_size=window_size,
            step_size=step_size,
            pooling=pooling,
        )
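
Each of the components listed in the API reference (UMAP reduction, HDBSCAN clustering, term importance, windowing) can be swapped out through the constructor shown above. A sketch with illustrative, untuned parameter values follows; scikit-learn's HDBSCAN is used here as an assumption, but any scikit-learn-compatible ClusterMixin should work:

from sklearn.cluster import HDBSCAN
from umap import UMAP

from turftopic import CTop2Vec

model = CTop2Vec(
    encoder="sentence-transformers/all-MiniLM-L6-v2",
    dimensionality_reduction=UMAP(n_components=10, metric="cosine"),  # replaces the default 5-component UMAP
    clustering=HDBSCAN(min_cluster_size=25),  # coarser clusters than the default of 15
    window_size=100,  # larger windows, fewer vectors per document
    step_size=80,
    n_reduce_to=10,  # merge the discovered clusters down to 10 topics
)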