
Concept Vector Projection

Concept Vector Projection is an embedding-based method for extracting continuous sentiment (or other) scores from free-text documents.

Figure 1: Schematic Overview of Concept Vector Projection.
Figure from Lyngbæk et al. (2025)

The method rests on the idea that a concept vector can be constructed by encoding positive and negative seed phrases with a transformer, averaging the embeddings within each group, and taking the difference of the two mean vectors. Other documents' embeddings are then projected onto the concept vector by taking the dot product with it, which yields a continuous score for how strongly each document relates to the concept.
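
As a rough illustration of this arithmetic, the sketch below builds a concept vector from toy embeddings and projects two documents onto it. The 3-dimensional arrays are made-up stand-ins for encoder output, and the helper name `concept_vector` is hypothetical rather than part of Turftopic. Scaling the concept vector to unit length mirrors what the class source in the API reference does.

```python
import numpy as np


def concept_vector(positive_emb: np.ndarray, negative_emb: np.ndarray) -> np.ndarray:
    """Difference of the mean seed embeddings, scaled to unit length."""
    cv = positive_emb.mean(axis=0) - negative_emb.mean(axis=0)
    return cv / np.linalg.norm(cv)


# Toy stand-ins for encoder output: one row per seed phrase.
positive_emb = np.array([[1.0, 0.0, 0.0], [1.0, 2.0, 0.0]])
negative_emb = np.array([[-1.0, 0.0, 0.0], [-1.0, 2.0, 0.0]])
cv = concept_vector(positive_emb, negative_emb)  # points along the first axis

# Toy document embeddings: one row per document.
doc_emb = np.array([[0.5, 0.1, 0.3], [-0.2, 0.9, 0.0]])
scores = doc_emb @ cv  # one continuous score per document
print(scores)
```

In this toy setup the seed groups differ only along the first axis, so each document's score is simply its first coordinate: positive for the first document, negative for the second.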

Usage

Single Concept

When projecting onto a single concept, you should specify the seeds as a tuple of (positive phrases, negative phrases), in that order.

from turftopic import ConceptVectorProjection

positive = [
    "I love this product",
    "This is absolutely lovely",
    "My daughter is going to adore this"
]
negative = [
    "This product is not at all as advertised, I'm very displeased",
    "I hate this",
    "What a horrible way to deal with people"
]
cvp = ConceptVectorProjection(seeds=(positive, negative))

test_documents = ["My cute little doggy", "Few this is digusting"]
doc_concept_matrix = cvp.transform(test_documents)
print(doc_concept_matrix)
[[0.24265897]
 [0.01709663]]

Multiple Concepts

When projecting documents to multiple concepts at once, you will need to specify seeds for each concept, as well as its name. Internally this is handled with an OrderedDict, which you can either specify yourself, or Turftopic can do it for you:

import pandas as pd
from collections import OrderedDict

cuteness_seeds = (["Absolutely adorable", "I love how he dances with his little feet"], ["What a big slob of an abomination", "A suspicious old man sat next to me on the bus today"])
bullish_seeds = (["We are going to the moon", "This stock will prove an incredible investment"], ["I will short the hell out of them", "Uber stocks drop 7% in value after down-time."])

# Either specify it like this:
seeds = [("cuteness", cuteness_seeds), ("bullish", bullish_seeds)]
# or as an OrderedDict:
seeds = OrderedDict([("cuteness", cuteness_seeds), ("bullish", bullish_seeds)])
cvp = ConceptVectorProjection(seeds=seeds)

test_documents = ["What an awesome investment", "Tiny beautiful kitty-cat"]
doc_concept_matrix = cvp.transform(test_documents)
concept_df = pd.DataFrame(doc_concept_matrix, columns=cvp.get_feature_names_out())
print(concept_df)
   cuteness   bullish
0  0.085957  0.288779
1  0.269454  0.009495
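
The three accepted seed formats all end up in the same ordered mapping. The sketch below restates the normalisation logic from the class constructor as a standalone function. The name `normalize_seeds` is hypothetical and not part of Turftopic's API, and the seed phrases are shortened versions of the ones above.

```python
from collections import OrderedDict


def normalize_seeds(seeds):
    """Return an OrderedDict mapping concept name -> (positive, negative)."""
    if isinstance(seeds, OrderedDict):
        return seeds
    # A bare (positive, negative) tuple of phrase lists is a single concept.
    if isinstance(seeds, tuple) and len(seeds) == 2 and isinstance(seeds[0][0], str):
        return OrderedDict([("default", seeds)])
    # Otherwise assume a list of (name, (positive, negative)) pairs.
    return OrderedDict(seeds)


single = normalize_seeds((["I love this"], ["I hate this"]))
multi = normalize_seeds([
    ("cuteness", (["Absolutely adorable"], ["What a big slob of an abomination"])),
    ("bullish", (["We are going to the moon"], ["I will short the hell out of them"])),
])
print(list(single))  # ['default']
print(list(multi))   # ['cuteness', 'bullish']
```

Judging by this check, a two-element tuple of (name, seeds) pairs would be mistaken for a single concept, because its first element starts with a string. Passing multiple concepts as a list or an OrderedDict avoids that ambiguity.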

Citation

Please cite Lyngbæk et al. (2025) and Turftopic when using Concept Vector Projection in publications:

@article{
  Kardos2025,
  title = {Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers},
  doi = {10.21105/joss.08183},
  url = {https://doi.org/10.21105/joss.08183},
  year = {2025},
  publisher = {The Open Journal},
  volume = {10},
  number = {111},
  pages = {8183},
  author = {Kardos, Márton and Enevoldsen, Kenneth C. and Kostkan, Jan and Kristensen-McLachlan, Ross Deans and Rocca, Roberta},
  journal = {Journal of Open Source Software} 
}

@incollection{Lyngbaek2025,
  title = {Continuous Sentiment Scores for Literary and Multilingual Contexts},
  author = {Laurits Lyngbaek and Pascale Feldkamp and Yuri Bizzoni and Kristoffer L. Nielbo and Kenneth Enevoldsen},
  year = {2025},
  booktitle = {Computational Humanities Research 2025},
  publisher = {Anthology of Computers and the Humanities},
  pages = {480--497},
  editor = {Taylor Arnold and Margherita Fantoli and Ruben Ros},
  doi = {10.63744/nVu1Zq5gRkuD}
}

API Reference

turftopic.models.cvp.ConceptVectorProjection

Bases: BaseEstimator, TransformerMixin

Concept Vector Projection model from Lyngbæk et al. (2025). Can be used to project document embeddings onto a difference projection vector between positive and negative seed phrases. The primary use case is sentiment analysis, and continuous sentiment scores, especially for languages where dedicated models are not available.

Parameters:

Name Type Description Default
seeds Union[Seeds, list[tuple[str, Seeds]], OrderedDict[str, Seeds]]

If you want to project to a single concept, then a tuple of (list of positive terms, list of negative terms).
If there are multiple concepts, they should be specified as (name, Seeds) tuples in a list. Alternatively, seeds can be an OrderedDict with the names of the concepts being the keys, and the tuples of positive and negative seeds as the values.

required
encoder Union[Encoder, str, MultimodalEncoder]

Model to produce document representations, paraphrase-multilingual-mpnet-base-v2 is the default per Lyngbæk et al. (2025).

'sentence-transformers/paraphrase-multilingual-mpnet-base-v2'
Source code in turftopic/models/cvp.py
class ConceptVectorProjection(BaseEstimator, TransformerMixin):
    """Concept Vector Projection model from [Lyngbæk et al. (2025)](https://doi.org/10.63744/nVu1Zq5gRkuD)
    Can be used to project document embeddings onto a difference projection vector between positive and negative seed phrases.
    The primary use case is sentiment analysis, and continuous sentiment scores,
    especially for languages where dedicated models are not available.

    Parameters
    ----------
    seeds: (list[str], list[str]) or list of (str, (list[str], list[str]))
        If you want to project to a single concept, then
        a tuple of (list of positive terms, list of negative terms). <br>
        If there are multiple concepts, they should be specified as (name, Seeds) tuples in a list.
        Alternatively, seeds can be an OrderedDict with the names of the concepts being the keys,
        and the tuples of positive and negative seeds as the values.
    encoder: str or SentenceTransformer
        Model to produce document representations, paraphrase-multilingual-mpnet-base-v2 is the default
        per Lyngbæk et al. (2025).
    """

    def __init__(
        self,
        seeds: Union[Seeds, list[tuple[str, Seeds]], OrderedDict[str, Seeds]],
        encoder: Union[
            Encoder, str, MultimodalEncoder
        ] = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
    ):
        self.seeds = seeds
        if isinstance(seeds, OrderedDict):
            self._seeds = seeds
        elif (
            (len(seeds) == 2)
            and (isinstance(seeds, tuple))
            and (isinstance(seeds[0][0], str))
        ):
            self._seeds = OrderedDict([("default", seeds)])
        else:
            self._seeds = OrderedDict(seeds)
        self.encoder = encoder
        if isinstance(encoder, str):
            self.encoder_ = SentenceTransformer(encoder)
        else:
            self.encoder_ = encoder
        self.classes_ = np.array([name for name in self._seeds])
        self.concept_matrix_ = []
        for _, (positive, negative) in self._seeds.items():
            positive_emb = self.encoder_.encode(positive)
            negative_emb = self.encoder_.encode(negative)
            cv = np.mean(positive_emb, axis=0) - np.mean(negative_emb, axis=0)
            self.concept_matrix_.append(cv / np.linalg.norm(cv))
        self.concept_matrix_ = np.stack(self.concept_matrix_)

    def get_feature_names_out(self):
        """Returns concept names in an array."""
        return self.classes_

    def fit_transform(self, raw_documents=None, y=None, embeddings=None):
        """Project documents onto the concept vectors.

        Parameters
        ----------
        raw_documents: list[str] or None
            List of documents to project to the concept vectors.
        embeddings: ndarray of shape (n_documents, n_dimensions)
            Document embeddings (must be created with the same encoder as the concept vectors).

        Returns
        -------
        document_concept_matrix: ndarray of shape (n_documents, n_concepts)
            Prevalence of each concept in each document.
        """
        if (raw_documents is None) and (embeddings is None):
            raise ValueError(
                "Either embeddings or raw_documents has to be passed, both are None."
            )
        if embeddings is None:
            embeddings = self.encoder_.encode(raw_documents)
        return embeddings @ self.concept_matrix_.T

    def transform(self, raw_documents=None, embeddings=None):
        """Project documents onto the concept vectors.

        Parameters
        ----------
        raw_documents: list[str] or None
            List of documents to project to the concept vectors.
        embeddings: ndarray of shape (n_documents, n_dimensions)
            Document embeddings (must be created with the same encoder as the concept vectors).

        Returns
        -------
        document_concept_matrix: ndarray of shape (n_documents, n_concepts)
            Prevalence of each concept in each document.
        """
        return self.fit_transform(raw_documents, embeddings=embeddings)

    def to_disk(self, out_dir: Union[Path, str]):
        """Persists model to directory on your machine.

        Parameters
        ----------
        out_dir: Path | str
            Directory to save the model to.
        """
        out_dir = Path(out_dir)
        out_dir.mkdir(exist_ok=True)
        package_versions = get_package_versions()
        with out_dir.joinpath("package_versions.json").open("w") as ver_file:
            ver_file.write(json.dumps(package_versions))
        joblib.dump(self, out_dir.joinpath("model.joblib"))

    def push_to_hub(self, repo_id: str):
        """Uploads model to HuggingFace Hub

        Parameters
        ----------
        repo_id: str
            Repository to upload the model to.
        """
        api = HfApi()
        api.create_repo(repo_id, exist_ok=True)
        with tempfile.TemporaryDirectory() as tmp_dir:
            readme_path = Path(tmp_dir).joinpath("README.md")
            with readme_path.open("w") as readme_file:
                readme_file.write(create_readme(self, repo_id))
            self.to_disk(tmp_dir)
            api.upload_folder(
                folder_path=tmp_dir,
                repo_id=repo_id,
                repo_type="model",
            )

fit_transform(raw_documents=None, y=None, embeddings=None)

Project documents onto the concept vectors.

Parameters:

Name Type Description Default
raw_documents

List of documents to project to the concept vectors.

None
embeddings

Document embeddings (must be created with the same encoder as the concept vectors).

None

Returns:

Name Type Description
document_concept_matrix ndarray of shape (n_documents, n_concepts)

Prevalence of each concept in each document.
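
A small sketch of the projection this method performs, assuming the concept matrix holds one unit-length row per concept as in the class source above. The random arrays are stand-ins for real encoder output, and the dimensionality of 8 is arbitrary. It shows that the result has one column per concept.

```python
import numpy as np

rng = np.random.default_rng(0)
n_documents, n_concepts, n_dimensions = 4, 2, 8

# Stand-in for the fitted concept matrix: one unit-length row per concept.
concept_matrix = rng.normal(size=(n_concepts, n_dimensions))
concept_matrix /= np.linalg.norm(concept_matrix, axis=1, keepdims=True)

# Stand-in for precomputed document embeddings.
embeddings = rng.normal(size=(n_documents, n_dimensions))

# Same matrix product as in fit_transform: documents x concepts.
document_concept_matrix = embeddings @ concept_matrix.T
print(document_concept_matrix.shape)  # (4, 2)
```

With the real model, embeddings computed once can be reused in the same way by passing them through the `embeddings` argument, provided they come from the same encoder as the concept vectors.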


get_feature_names_out()

Returns concept names in an array.


push_to_hub(repo_id)

Uploads the model to the Hugging Face Hub.

Parameters:

Name Type Description Default
repo_id str

Repository to upload the model to.

required

to_disk(out_dir)

Persists model to directory on your machine.

Parameters:

Name Type Description Default
out_dir Union[Path, str]

Directory to save the model to.

required

transform(raw_documents=None, embeddings=None)

Project documents onto the concept vectors.

Parameters:

Name Type Description Default
raw_documents

List of documents to project to the concept vectors.

None
embeddings

Document embeddings (must be created with the same encoder as the concept vectors).

None

Returns:

Name Type Description
document_concept_matrix ndarray of shape (n_documents, n_concepts)

Prevalence of each concept in each document.
