Concept Vector Projection

Concept Vector Projection is an embedding-based method for extracting continuous sentiment (or other) scores from free-text documents.

The method constructs a concept vector by encoding positive and negative seed phrases with a transformer, averaging the embeddings of each group, and taking the difference between the two mean vectors. Other documents' embeddings are then projected onto the concept vector by taking the dot product with it, which yields a continuous score for how strongly each document relates to the concept.
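The computation described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than Turftopic's own code path: the four-dimensional arrays are made-up stand-ins for transformer embeddings, and the function names `concept_vector` and `project` are hypothetical. The unit-length normalization mirrors the constructor source listed in the API reference.

```python
import numpy as np


def concept_vector(positive_emb, negative_emb):
    """Difference of the mean seed embeddings, scaled to unit length."""
    cv = np.mean(positive_emb, axis=0) - np.mean(negative_emb, axis=0)
    return cv / np.linalg.norm(cv)


def project(doc_emb, cv):
    """Dot product of every document embedding with the concept vector."""
    return doc_emb @ cv


# Toy stand-ins for encoder output (assumed values, not real embeddings)
positive_emb = np.array([[1.0, 0.0, 0.5, 0.0], [0.8, 0.2, 0.5, 0.0]])
negative_emb = np.array([[-1.0, 0.0, 0.5, 0.0], [-0.8, 0.2, 0.5, 0.0]])
doc_emb = np.array([[0.9, 0.1, 0.4, 0.0], [-0.7, 0.1, 0.6, 0.0]])

cv = concept_vector(positive_emb, negative_emb)
scores = project(doc_emb, cv)
print(scores)  # first document scores positive, second negative
```

Only the first dimension separates the two seed groups in this toy data, so the concept vector points along that axis and the scores simply read off the documents' first coordinate.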

Usage

Single Concept

When projecting onto a single concept, pass the seeds as a tuple of (positive phrases, negative phrases), in that order.

from turftopic import ConceptVectorProjection

positive = [
    "I love this product",
    "This is absolutely lovely",
    "My daughter is going to adore this"
]
negative = [
    "This product is not at all as advertised, I'm very displeased",
    "I hate this",
    "What a horrible way to deal with people"
]
cvp = ConceptVectorProjection(seeds=(positive, negative))

test_documents = ["My cute little doggy", "Few this is digusting"]
doc_concept_matrix = cvp.transform(test_documents)
print(doc_concept_matrix)

[[0.24265897]
 [0.01709663]]

Multiple Concepts

When projecting documents to multiple concepts at once, you will need to specify seeds for each concept, as well as its name. Internally this is handled with an OrderedDict, which you can either specify yourself, or Turftopic can do it for you:

import pandas as pd
from collections import OrderedDict

cuteness_seeds = (["Absolutely adorable", "I love how he dances with his little feet"], ["What a big slob of an abomination", "A suspicious old man sat next to me on the bus today"])
bullish_seeds = (["We are going to the moon", "This stock will prove an incredible investment"], ["I will short the hell out of them", "Uber stocks drop 7% in value after down-time."])

# Either specify it like this:
seeds = [("cuteness", cuteness_seeds), ("bullish", bullish_seeds)]
# or as an OrderedDict:
seeds = OrderedDict([("cuteness", cuteness_seeds), ("bullish", bullish_seeds)])
cvp = ConceptVectorProjection(seeds=seeds)

test_documents = ["What an awesome investment", "Tiny beautiful kitty-cat"]
doc_concept_matrix = cvp.transform(test_documents)
concept_df = pd.DataFrame(doc_concept_matrix, columns=cvp.get_feature_names_out())
print(concept_df)

   cuteness   bullish
0  0.085957  0.288779
1  0.269454  0.009495
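All three accepted seed formats end up as one ordered mapping from concept name to a (positive, negative) pair. The sketch below re-implements that normalization as a standalone function, following the constructor logic listed in the API reference; the name `normalize_seeds` is hypothetical and not part of Turftopic.

```python
from collections import OrderedDict


def normalize_seeds(seeds):
    """Return an OrderedDict mapping concept name -> (positive, negative)."""
    if isinstance(seeds, OrderedDict):
        return seeds
    # A bare (positive, negative) tuple: its first element is a list of strings
    if isinstance(seeds, tuple) and len(seeds) == 2 and isinstance(seeds[0][0], str):
        return OrderedDict([("default", seeds)])
    # Otherwise assume an iterable of (name, (positive, negative)) pairs
    return OrderedDict(seeds)


single = (["I love this"], ["I hate this"])
named = [
    ("cuteness", (["Adorable"], ["Hideous"])),
    ("bullish", (["To the moon"], ["Sell now"])),
]

print(list(normalize_seeds(single)))  # ['default']
print(list(normalize_seeds(named)))   # ['cuteness', 'bullish']
```

Two consequences follow from this logic. A single concept is given the name `default`, and multiple concepts are safest passed as a list or an OrderedDict: a tuple of exactly two (name, seeds) pairs would pass the single-concept check, because its first element starts with the name string.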

Citation

Please cite Lyngbæk et al. (2025) and Turftopic when using Concept Vector Projection in publications:

@article{Kardos2025,
  title = {Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers},
  doi = {10.21105/joss.08183},
  url = {https://doi.org/10.21105/joss.08183},
  year = {2025},
  publisher = {The Open Journal},
  volume = {10},
  number = {111},
  pages = {8183},
  author = {Kardos, Márton and Enevoldsen, Kenneth C. and Kostkan, Jan and Kristensen-McLachlan, Ross Deans and Rocca, Roberta},
  journal = {Journal of Open Source Software} 
}

@incollection{Lyngbaek2025,
  title = {Continuous Sentiment Scores for Literary and Multilingual Contexts},
  author = {Laurits Lyngbaek and Pascale Feldkamp and Yuri Bizzoni and Kristoffer L. Nielbo and Kenneth Enevoldsen},
  year = {2025},
  booktitle = {Computational Humanities Research 2025},
  publisher = {Anthology of Computers and the Humanities},
  pages = {480--497},
  editor = {Taylor Arnold and Margherita Fantoli and Ruben Ros},
  doi = {10.63744/nVu1Zq5gRkuD}
}

API Reference

`turftopic.models.cvp.ConceptVectorProjection`

Bases: BaseEstimator, TransformerMixin

Concept Vector Projection model from Lyngbæk et al. (2025). Can be used to project document embeddings onto the difference vector between positive and negative seed phrases. The primary use case is sentiment analysis with continuous sentiment scores, especially for languages where dedicated models are not available.

Parameters:

Name	Type	Description	Default
`seeds`	`Union[Seeds, list[tuple[str, Seeds]], OrderedDict[str, Seeds]]`	If you want to project to a single concept, then a tuple of (list of positive terms, list of negative terms). If there are multiple concepts, they should be specified as (name, Seeds) tuples in a list. Alternatively, seeds can be an OrderedDict with the names of the concepts being the keys, and the tuples of positive and negative seeds as the values.	required
`encoder`	`Union[Encoder, str, MultimodalEncoder]`	Model to produce document representations, paraphrase-multilingual-mpnet-base-v2 is the default per Lyngbæk et al. (2025).	`'sentence-transformers/paraphrase-multilingual-mpnet-base-v2'`
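Judging from the constructor source below, a non-string `encoder` is stored as-is and is only asked to `encode` lists of texts, so any object with a compatible `encode` method should in principle work. The class below is a hypothetical stand-in (a hashed bag-of-words, not a real language model) meant only to illustrate that interface; whether it satisfies everything Turftopic's `Encoder` type expects is an assumption.

```python
import hashlib

import numpy as np


class HashingEncoder:
    """Hypothetical toy encoder: hashed bag-of-words with L2 normalization."""

    def __init__(self, n_dimensions=64):
        self.n_dimensions = n_dimensions

    def encode(self, sentences):
        embeddings = np.zeros((len(sentences), self.n_dimensions))
        for row, sentence in enumerate(sentences):
            for token in sentence.lower().split():
                # Stable hash so that results are reproducible across runs
                digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
                embeddings[row, int(digest, 16) % self.n_dimensions] += 1.0
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        # Guard against division by zero for empty documents
        return embeddings / np.maximum(norms, 1e-12)


encoder = HashingEncoder()
embeddings = encoder.encode(["I love this product", "I hate this"])
print(embeddings.shape)  # (2, 64)
```

An instance could then be passed as `encoder=HashingEncoder()`, which may be convenient in unit tests where downloading a transformer model is undesirable.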

Source code in turftopic/models/cvp.py

class ConceptVectorProjection(BaseEstimator, TransformerMixin):
    """Concept Vector Projection model from [Lyngbæk et al. (2025)](https://doi.org/10.63744/nVu1Zq5gRkuD)
    Can be used to project document embeddings onto a difference projection vector between positive and negative seed phrases.
    The primary use case is sentiment analysis, and continuous sentiment scores,
    especially for languages where dedicated models are not available.

    Parameters
    ----------
    seeds: (list[str], list[str]) or list of (str, (list[str], list[str]))
        If you want to project to a single concept, then
        a tuple of (list of positive terms, list of negative terms). <br>
        If there are multiple concepts, they should be specified as (name, Seeds) tuples in a list.
        Alternatively, seeds can be an OrderedDict with the names of the concepts being the keys,
        and the tuples of positive and negative seeds as the values.
    encoder: str or SentenceTransformer
        Model to produce document representations, paraphrase-multilingual-mpnet-base-v2 is the default
        per Lyngbæk et al. (2025).
    """

    def __init__(
        self,
        seeds: Union[Seeds, list[tuple[str, Seeds]], OrderedDict[str, Seeds]],
        encoder: Union[
            Encoder, str, MultimodalEncoder
        ] = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
    ):
        self.seeds = seeds
        if isinstance(seeds, OrderedDict):
            self._seeds = seeds
        elif (
            (len(seeds) == 2)
            and (isinstance(seeds, tuple))
            and (isinstance(seeds[0][0], str))
        ):
            self._seeds = OrderedDict([("default", seeds)])
        else:
            self._seeds = OrderedDict(seeds)
        self.encoder = encoder
        if isinstance(encoder, str):
            self.encoder_ = SentenceTransformer(encoder)
        else:
            self.encoder_ = encoder
        self.classes_ = np.array([name for name in self._seeds])
        self.concept_matrix_ = []
        for _, (positive, negative) in self._seeds.items():
            positive_emb = self.encoder_.encode(positive)
            negative_emb = self.encoder_.encode(negative)
            cv = np.mean(positive_emb, axis=0) - np.mean(negative_emb, axis=0)
            self.concept_matrix_.append(cv / np.linalg.norm(cv))
        self.concept_matrix_ = np.stack(self.concept_matrix_)

    def get_feature_names_out(self):
        """Returns concept names in an array."""
        return self.classes_

    def fit_transform(self, raw_documents=None, y=None, embeddings=None):
        """Project documents onto the concept vectors.

        Parameters
        ----------
        raw_documents: list[str] or None
            List of documents to project to the concept vectors.
        embeddings: ndarray of shape (n_documents, n_dimensions)
            Document embeddings (have to be created with the same encoder as the concept vectors).

        Returns
        -------
        document_concept_matrix: ndarray of shape (n_documents, n_concepts)
            Prevalence of each concept in each document.
        """
        if (raw_documents is None) and (embeddings is None):
            raise ValueError(
                "Either embeddings or raw_documents has to be passed, both are None."
            )
        if embeddings is None:
            embeddings = self.encoder_.encode(raw_documents)
        return embeddings @ self.concept_matrix_.T

    def transform(self, raw_documents=None, embeddings=None):
        """Project documents onto the concept vectors.

        Parameters
        ----------
        raw_documents: list[str] or None
            List of documents to project to the concept vectors.
        embeddings: ndarray of shape (n_documents, n_dimensions)
            Document embeddings (have to be created with the same encoder as the concept vectors).

        Returns
        -------
        document_concept_matrix: ndarray of shape (n_documents, n_concepts)
            Prevalence of each concept in each document.
        """
        return self.fit_transform(raw_documents, embeddings=embeddings)

    def to_disk(self, out_dir: Union[Path, str]):
        """Persists model to directory on your machine.

        Parameters
        ----------
        out_dir: Path | str
            Directory to save the model to.
        """
        out_dir = Path(out_dir)
        out_dir.mkdir(exist_ok=True)
        package_versions = get_package_versions()
        with out_dir.joinpath("package_versions.json").open("w") as ver_file:
            ver_file.write(json.dumps(package_versions))
        joblib.dump(self, out_dir.joinpath("model.joblib"))

    def push_to_hub(self, repo_id: str):
        """Uploads model to HuggingFace Hub

        Parameters
        ----------
        repo_id: str
            Repository to upload the model to.
        """
        api = HfApi()
        api.create_repo(repo_id, exist_ok=True)
        with tempfile.TemporaryDirectory() as tmp_dir:
            readme_path = Path(tmp_dir).joinpath("README.md")
            with readme_path.open("w") as readme_file:
                readme_file.write(create_readme(self, repo_id))
            self.to_disk(tmp_dir)
            api.upload_folder(
                folder_path=tmp_dir,
                repo_id=repo_id,
                repo_type="model",
            )

`fit_transform(raw_documents=None, y=None, embeddings=None)`

Project documents onto the concept vectors.

Parameters:

Name	Type	Description	Default
`raw_documents`		List of documents to project to the concept vectors.	`None`
`embeddings`		Document embeddings (have to be created with the same encoder as the concept vectors).	`None`

Returns:

Name	Type	Description
`document_concept_matrix`	`ndarray of shape (n_documents, n_concepts)`	Prevalence of each concept in each document.
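Because each concept vector is normalized to unit length in the constructor, every returned score is the scalar projection of a document embedding onto that concept direction, and the output has one column per concept. The sketch below checks both properties with random arrays standing in for real embeddings; `concept_matrix` plays the role of `concept_matrix_`, and the cosine rescaling at the end is an optional extra step, not something the library does.

```python
import numpy as np

rng = np.random.default_rng(0)
n_documents, n_dimensions, n_concepts = 5, 16, 2

# Random stand-ins for document embeddings and unit-length concept vectors
embeddings = rng.normal(size=(n_documents, n_dimensions))
concept_matrix = rng.normal(size=(n_concepts, n_dimensions))
concept_matrix /= np.linalg.norm(concept_matrix, axis=1, keepdims=True)

# Same operation as the return statement of fit_transform
scores = embeddings @ concept_matrix.T
print(scores.shape)  # (5, 2)

# Dividing by the document norms turns the scores into cosine similarities
cosine = scores / np.linalg.norm(embeddings, axis=1, keepdims=True)
```

Raw scores therefore depend on the length of the document embeddings; rescaling to cosine similarity is one way to compare documents when the encoder does not normalize its output.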

Source code in turftopic/models/cvp.py

def fit_transform(self, raw_documents=None, y=None, embeddings=None):
    """Project documents onto the concept vectors.

    Parameters
    ----------
    raw_documents: list[str] or None
        List of documents to project to the concept vectors.
    embeddings: ndarray of shape (n_documents, n_dimensions)
        Document embeddings (have to be created with the same encoder as the concept vectors).

    Returns
    -------
    document_concept_matrix: ndarray of shape (n_documents, n_concepts)
        Prevalence of each concept in each document.
    """
    if (raw_documents is None) and (embeddings is None):
        raise ValueError(
            "Either embeddings or raw_documents has to be passed, both are None."
        )
    if embeddings is None:
        embeddings = self.encoder_.encode(raw_documents)
    return embeddings @ self.concept_matrix_.T

`get_feature_names_out()`

Returns concept names in an array.

Source code in turftopic/models/cvp.py

def get_feature_names_out(self):
    """Returns concept names in an array."""
    return self.classes_

`push_to_hub(repo_id)`

Uploads model to HuggingFace Hub

Parameters:

Name	Type	Description	Default
`repo_id`	`str`	Repository to upload the model to.	required

Source code in turftopic/models/cvp.py

def push_to_hub(self, repo_id: str):
    """Uploads model to HuggingFace Hub

    Parameters
    ----------
    repo_id: str
        Repository to upload the model to.
    """
    api = HfApi()
    api.create_repo(repo_id, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp_dir:
        readme_path = Path(tmp_dir).joinpath("README.md")
        with readme_path.open("w") as readme_file:
            readme_file.write(create_readme(self, repo_id))
        self.to_disk(tmp_dir)
        api.upload_folder(
            folder_path=tmp_dir,
            repo_id=repo_id,
            repo_type="model",
        )

`to_disk(out_dir)`

Persists model to directory on your machine.

Parameters:

Name	Type	Description	Default
`out_dir`	`Union[Path, str]`	Directory to save the model to.	required

Source code in turftopic/models/cvp.py

def to_disk(self, out_dir: Union[Path, str]):
    """Persists model to directory on your machine.

    Parameters
    ----------
    out_dir: Path | str
        Directory to save the model to.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(exist_ok=True)
    package_versions = get_package_versions()
    with out_dir.joinpath("package_versions.json").open("w") as ver_file:
        ver_file.write(json.dumps(package_versions))
    joblib.dump(self, out_dir.joinpath("model.joblib"))

`transform(raw_documents=None, embeddings=None)`

Project documents onto the concept vectors.

Parameters:

Name	Type	Description	Default
`raw_documents`		List of documents to project to the concept vectors.	`None`
`embeddings`		Document embeddings (have to be created with the same encoder as the concept vectors).	`None`

Returns:

Name	Type	Description
`document_concept_matrix`	`ndarray of shape (n_documents, n_concepts)`	Prevalence of each concept in each document.
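Since `transform` accepts precomputed `embeddings`, documents can be encoded once and then projected onto several sets of concepts without invoking the encoder again. The sketch below illustrates this with a standalone function that mirrors the branching in the source; `CountingEncoder`, the free function `transform`, and the toy concept matrices are all hypothetical stand-ins rather than Turftopic objects.

```python
import numpy as np


class CountingEncoder:
    """Hypothetical encoder that records how often it is called."""

    def __init__(self):
        self.calls = 0

    def encode(self, sentences):
        self.calls += 1
        rng = np.random.default_rng(42)
        return rng.normal(size=(len(sentences), 8))


def transform(encoder, concept_matrix, raw_documents=None, embeddings=None):
    # Mirrors the branching in fit_transform
    if raw_documents is None and embeddings is None:
        raise ValueError("Either embeddings or raw_documents has to be passed.")
    if embeddings is None:
        embeddings = encoder.encode(raw_documents)
    return embeddings @ concept_matrix.T


encoder = CountingEncoder()
documents = ["first document", "second document", "third document"]
sentiment_concepts = np.eye(8)[:1]  # one toy concept vector
finance_concepts = np.eye(8)[1:3]   # two toy concept vectors

embeddings = encoder.encode(documents)  # encode once
sentiment = transform(encoder, sentiment_concepts, embeddings=embeddings)
finance = transform(encoder, finance_concepts, embeddings=embeddings)
print(encoder.calls, sentiment.shape, finance.shape)  # 1 (3, 1) (3, 2)
```

With a real transformer, encoding usually dominates the runtime, so caching embeddings this way keeps repeated projections cheap. The embeddings must come from the same encoder that produced the concept vectors.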

Source code in turftopic/models/cvp.py

def transform(self, raw_documents=None, embeddings=None):
    """Project documents onto the concept vectors.

    Parameters
    ----------
    raw_documents: list[str] or None
        List of documents to project to the concept vectors.
    embeddings: ndarray of shape (n_documents, n_dimensions)
        Document embeddings (have to be created with the same encoder as the concept vectors).

    Returns
    -------
    document_concept_matrix: ndarray of shape (n_documents, n_concepts)
        Prevalence of each concept in each document.
    """
    return self.fit_transform(raw_documents, embeddings=embeddings)