
C-Top2Vec

Contextual Top2Vec (Angelov and Inkpen, 2024) is a late-interaction topic model that uses windowed representations of documents.

Info

This part of the documentation is still in the works. More information, visualizations and benchmark results are on their way.

The model is essentially the same as wrapping a regular Top2Vec model in LateWrapper, but we provide a convenience class in Turftopic so that it's easy for you to initialize. It comes pre-loaded with the following features:

  • The same hyperparameters as in Angelov and Inkpen (2024)
  • A phrase vectorizer that discovers frequently occurring phrases based on pointwise mutual information (PMI)
  • A LateSentenceTransformer encoder by default; you can specify any model.

Our implementation is considerably more flexible than the original top2vec package, allowing you to use more powerful or novel embedding models.
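To make this concrete, here is a minimal sketch of the manual construction that the convenience class replaces. The import path for LateWrapper is an assumption, and the convenience class additionally swaps in a PhraseVectorizer and a LateSentenceTransformer encoder for you:

from turftopic import ClusteringTopicModel, LateWrapper  # import path for LateWrapper assumed

# Roughly what CTop2Vec() sets up under the hood
# (default values taken from the API reference below)
model = LateWrapper(
    ClusteringTopicModel(feature_importance="centroid"),
    window_size=50,  # windowed token aggregation
    step_size=40,
)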

Tip

For more info about multi-vector/late-interaction models, read our User Guide.

Example Usage

You should install Turftopic with UMAP to be able to use C-Top2Vec:

pip install turftopic[umap-learn]

Then use the topic model as you would use any other model in Turftopic:

from turftopic import CTop2Vec

corpus: list[str] = ["some text", "more text", ...]

# n_reduce_to merges the discovered clusters down to 5 topics
model = CTop2Vec(n_reduce_to=5)
doc_topic_matrix = model.fit_transform(corpus)

model.print_topics()
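If you want a quick sanity check of the result, the returned matrix can be inspected directly; the (n_documents, n_topics) shape noted below is an assumption for illustration:

import numpy as np

print(doc_topic_matrix.shape)                    # assumed shape: (n_documents, n_topics)
print(np.argmax(doc_topic_matrix, axis=1)[:10])  # most prominent topic for the first 10 documents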

Citation

Please cite Angelov and Inkpen (2024) and Turftopic when using C-Top2Vec in publications:

@article{Kardos2025,
  title = {Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers},
  doi = {10.21105/joss.08183},
  url = {https://doi.org/10.21105/joss.08183},
  year = {2025},
  publisher = {The Open Journal},
  volume = {10},
  number = {111},
  pages = {8183},
  author = {Kardos, Márton and Enevoldsen, Kenneth C. and Kostkan, Jan and Kristensen-McLachlan, Ross Deans and Rocca, Roberta},
  journal = {Journal of Open Source Software} 
}

@inproceedings{angelov-inkpen-2024-topic,
    title = "Topic Modeling: Contextual Token Embeddings Are All You Need",
    author = "Angelov, Dimo  and
      Inkpen, Diana",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-emnlp.790/",
    doi = "10.18653/v1/2024.findings-emnlp.790",
    pages = "13528--13539",
    abstract = "The goal of topic modeling is to find meaningful topics that capture the information present in a collection of documents. The main challenges of topic modeling are finding the optimal number of topics, labeling the topics, segmenting documents by topic, and evaluating topic model performance. Current neural approaches have tackled some of these problems but none have been able to solve all of them. We introduce a novel topic modeling approach, Contextual-Top2Vec, which uses document contextual token embeddings, it creates hierarchical topics, finds topic spans within documents and labels topics with phrases rather than just words. We propose the use of BERTScore to evaluate topic coherence and to evaluate how informative topics are of the underlying documents. Our model outperforms the current state-of-the-art models on a comprehensive set of topic model evaluation metrics."
}

API Reference

turftopic.models.cluster.CTop2Vec

Bases: LateWrapper

Convenience function to construct a CTop2Vec model in Turftopic. The model is essentially the same as ClusteringTopicModel in a Late Wrapper with defaults that resemble CTop2Vec. This includes:

  1. A late interaction embedding model, with windowed aggregation
  2. UMAP reduction
  3. HDBSCAN clustering
  4. Centroid term importance
  5. Phrase vectorizer
pip install turftopic[umap-learn]
from turftopic import CTop2Vec

corpus: list[str] = ["some text", "more text", ...]

model = CTop2Vec().fit(corpus)
model.print_topics()
Source code in turftopic/models/cluster.py
class CTop2Vec(LateWrapper):
    """Convenience function to construct a CTop2Vec model in Turftopic.
    The model is essentially the same as ClusteringTopicModel in a Late Wrapper
    with defaults that resemble CTop2Vec. This includes:

    1. A late interaction embedding model, with windowed aggregation
    2. UMAP reduction
    3. HDBSCAN clustering
    4. Centroid term importance
    5. Phrase vectorizer

    ```bash
    pip install turftopic[umap-learn]
    ```

    ```python
    from turftopic import CTop2Vec

    corpus: list[str] = ["some text", "more text", ...]

    model = CTop2Vec().fit(corpus)
    model.print_topics()
    ```
    """

    def __init__(
        self,
        encoder: Union[
            Encoder, str, MultimodalEncoder
        ] = "sentence-transformers/all-MiniLM-L6-v2",
        vectorizer: Optional[CountVectorizer] = None,
        dimensionality_reduction: Optional[TransformerMixin] = None,
        clustering: Optional[ClusterMixin] = None,
        feature_importance: WordImportance = "centroid",
        n_reduce_to: Optional[int] = None,
        reduction_method: LinkageMethod = "smallest",
        reduction_distance_metric: DistanceMetric = "cosine",
        reduction_topic_representation: TopicRepresentation = "centroid",
        window_size: Optional[int] = 50,
        step_size: Optional[int] = 40,
        pooling: Optional[Callable] = np.nanmean,
        random_state: Optional[int] = None,
    ):
        if dimensionality_reduction is None:
            try:
                from umap import UMAP
            except ModuleNotFoundError as e:
                raise ModuleNotFoundError(
                    "UMAP is not installed in your environment, but Top2Vec requires it."
                ) from e
            dimensionality_reduction = UMAP(
                n_neighbors=15,
                n_components=5,
                min_dist=0.0,
                metric="cosine",
                random_state=random_state,
            )
        if clustering is None:
            clustering = HDBSCAN(
                min_cluster_size=15,
                metric="euclidean",
                cluster_selection_method="eom",
            )
        self.encoder = encoder
        if isinstance(encoder, str):
            encoder = LateSentenceTransformer(encoder)
        if vectorizer is None:
            vectorizer = PhraseVectorizer()
        self.dimensionality_reduction = dimensionality_reduction
        self.clustering = clustering
        self.feature_importance = feature_importance
        self.n_reduce_to = n_reduce_to
        self.reduction_method = reduction_method
        self.reduction_distance_metric = reduction_distance_metric
        self.reduction_topic_representation = reduction_topic_representation
        self.random_state = random_state
        model = ClusteringTopicModel(
            encoder=encoder,
            vectorizer=vectorizer,
            dimensionality_reduction=dimensionality_reduction,
            clustering=clustering,
            n_reduce_to=n_reduce_to,
            random_state=random_state,
            feature_importance=feature_importance,
            reduction_method=reduction_method,
            reduction_distance_metric=reduction_distance_metric,
            reduction_topic_representation=reduction_topic_representation,
        )
        super().__init__(
            model,
            window_size=window_size,
            step_size=step_size,
            pooling=pooling,
        )
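
Each of the components listed in the API reference (UMAP reduction, HDBSCAN clustering, term importance, windowing) can be swapped out through the constructor shown above. A sketch with illustrative, untuned parameter values follows; scikit-learn's HDBSCAN is used here as an assumption, but any scikit-learn-compatible ClusterMixin should work:

from sklearn.cluster import HDBSCAN
from umap import UMAP

from turftopic import CTop2Vec

model = CTop2Vec(
    encoder="sentence-transformers/all-MiniLM-L6-v2",
    dimensionality_reduction=UMAP(n_components=10, metric="cosine"),  # replaces the default 5-component UMAP
    clustering=HDBSCAN(min_cluster_size=25),  # coarser clusters than the default of 15
    window_size=100,  # larger windows, fewer vectors per document
    step_size=80,
    n_reduce_to=10,  # merge the discovered clusters down to 10 topics
)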