
Semantic Signal Separation (S³)

Semantic Signal Separation tries to recover the dimensions or axes along which most of the semantic variation can be explained. A topic in S³ is a dimension of semantics, or a "semantic signal". This allows the model to recover more nuanced topical content in documents, but it is not optimal when you expect topics to be groupings of documents.

Figure: PCA and ICA Recovering Underlying Signals (figure from scikit-learn's documentation)

The Model

1. Semantic Signal Decomposition

S³ finds semantic signals in the embedding matrix by decomposing it either with Independent Component Analysis (the default) or with Principal Component Analysis. The difference between the two is that PCA finds maximally uncorrelated (orthogonal) components, while ICA recovers maximally independent signals.

To use one or the other, set the objective parameter of the model:

from turftopic import SemanticSignalSeparation

# Uses ICA
model = SemanticSignalSeparation(10, objective="independence")

# Uses PCA
model = SemanticSignalSeparation(10, objective="orthogonality")

My anecdotal experience indicates that ICA generally gives better results, but feel free to experiment with the two options.

Turftopic uses the FastICA and PCA implementations from scikit-learn in the background.
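
If you are curious what the decomposition step boils down to, here is a minimal sketch of it using scikit-learn directly. The variable names are purely illustrative; in practice you would let SemanticSignalSeparation do all of this for you.

from sentence_transformers import SentenceTransformer
from sklearn.decomposition import FastICA, PCA

corpus: list[str] = ["some text", "more text", ...]

# Encode the documents into an (n_documents, n_dimensions) embedding matrix
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = encoder.encode(corpus)

# objective="independence": maximally independent signals
ica = FastICA(n_components=10)
doc_topic = ica.fit_transform(embeddings)  # (n_documents, n_topics)

# objective="orthogonality": maximally uncorrelated (orthogonal) components
pca = PCA(n_components=10)
doc_topic = pca.fit_transform(embeddings)  # (n_documents, n_topics)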

2. Term Importance Estimation: Recovering Signal Strength for the Vocabulary

To estimate the importance of terms for each component, S³ embeds all terms in the vocabulary with the same encoder as the documents and transforms the vocabulary embeddings with the fitted decomposition. The resulting signal matrix is then transposed to obtain a topic-term matrix.
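
Continuing the sketch from the previous section (again with illustrative variable names rather than the Turftopic API), the estimation looks roughly like this:

from sklearn.feature_extraction.text import CountVectorizer

# Extract the vocabulary and embed it with the same encoder as the documents
vocab = CountVectorizer().fit(corpus).get_feature_names_out()
vocab_embeddings = encoder.encode(vocab)       # (n_terms, n_dimensions)

# Project the vocabulary embeddings onto the fitted components
vocab_topic = ica.transform(vocab_embeddings)  # (n_terms, n_topics)
components = vocab_topic.T                     # topic-term matrix: (n_topics, n_terms)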

Comparison to Classical Models

Among contextually sensitive models, S³ is probably the closest you can get to classical matrix decomposition approaches, such as NMF or Latent Semantic Analysis. The conceptualization is very similar to that of these models, but instead of recovering factors of word use, S³ recovers dimensions in a continuous semantic space.

Most of the intuitions you have about LSA will also apply to S³, but it might give more surprising results, as embedding models can learn representations of semantics that differ from how humans would organize them.

S³ is also considerably more robust to stop words, meaning that you won't have to do extensive preprocessing.

Interpretation

S³ is one of the trickier models to interpret because of the way it conceptualizes topics. Unlike with many other models, the fact that a word ranks very low for a topic is also useful information for interpretation. In other words, both ends of term importance matter for S³: the words that rank highest and the words that rank lowest.
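
Note that print_topics() displays the lowest-ranking (negative) terms by default (show_negative=True). If you want a quick look at both ends of a single component without plotting, a minimal sketch like the one below works; it relies only on the components_ attribute and the fitted vectorizer visible in the source code at the bottom of this page, and the topic index is purely illustrative.

import numpy as np

# Assumes `model` is a SemanticSignalSeparation instance that has already been fitted
vocab = model.vectorizer.get_feature_names_out()
topic_id = 4                                    # illustrative component index
order = np.argsort(model.components_[topic_id])

print("Lowest-ranking terms: ", vocab[order[:10]])
print("Highest-ranking terms:", vocab[order[-10:]])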

To investigate these relations, we recommend that you use Word Maps from topicwizard. Word maps allow you to display the distribution of all words in the vocabulary on two given topic axes.

pip install topic-wizard

from turftopic import SemanticSignalSeparation
from topicwizard import figures

model = SemanticSignalSeparation(10)
# chatgpt_tweets is your corpus: a list of documents (str)
topic_data = model.prepare_topic_data(chatgpt_tweets)

figures.word_map(
    topic_data,
    topic_axes=(
        "9_api_apis_register_automatedsarcasmgenerator",
        "4_study_studying_assessments_exams",
    ),
)
Figure: Word Map with two Discovered Semantic Components as Axes

Considerations

Strengths

  • Nuanced Content: Documents are assumed to contain multiple topics, so the model can work on corpora of longer texts that do not necessarily group in semantic space by topic.
  • Efficiency: FastICA is called fast for a reason. S³ is one of the most computationally efficient models in Turftopic.
  • Novel Descriptions: S³ tends to discover topics that no other models do. This is due to its interpretation of what a topic is.
  • High Quality: Topic descriptions tend to be high quality and easily interpretable.

Weaknesses

  • Noise Components: The model tends to find components that contain only noise. This is typical of other applications of ICA as well, where it is frequently used precisely for noise removal. We are working on automated solutions to detect and flag these components.
  • Sometimes Unintuitive: Neural embedding models might have a different mapping of the semantic space than humans. Sometimes S³ uncovers unintuitive dimensions of meaning as a result of this.
  • Moderate Scalability: The model cannot be fitted in an online fashion. It is reasonably scalable, but for very large corpora you might want to consider using a different model.

API Reference

turftopic.models.decomp.SemanticSignalSeparation

Bases: ContextualModel

Separates the embedding matrix into 'semantic signals' with component analysis methods. Topics are assumed to be dimensions of semantics.

from turftopic import SemanticSignalSeparation

corpus: list[str] = ["some text", "more text", ...]

model = SemanticSignalSeparation(10).fit(corpus)
model.print_topics()

Parameters:

n_components: int, default 10
    Number of topics.
encoder: Union[Encoder, str], default 'sentence-transformers/all-MiniLM-L6-v2'
    Model to encode documents/terms, all-MiniLM-L6-v2 is the default.
vectorizer: Optional[CountVectorizer], default None
    Vectorizer used for term extraction. Can be used to prune or filter the vocabulary.
decomposition: Optional[TransformerMixin], default None
    Custom decomposition method to use. Can be an instance of FastICA or PCA, or basically any dimensionality reduction method. Has to have fit_transform and fit methods. If not specified, FastICA is used.
max_iter: int, default 200
    Maximum number of iterations for ICA.
random_state: Optional[int], default None
    Random state to use so that results are exactly reproducible.

Source code in turftopic/models/decomp.py
class SemanticSignalSeparation(ContextualModel):
    """Separates the embedding matrix into 'semantic signals' with
    component analysis methods.
    Topics are assumed to be dimensions of semantics.

    ```python
    from turftopic import SemanticSignalSeparation

    corpus: list[str] = ["some text", "more text", ...]

    model = SemanticSignalSeparation(10).fit(corpus)
    model.print_topics()
    ```

    Parameters
    ----------
    n_components: int, default 10
        Number of topics.
    encoder: str or SentenceTransformer
        Model to encode documents/terms, all-MiniLM-L6-v2 is the default.
    vectorizer: CountVectorizer, default None
        Vectorizer used for term extraction.
        Can be used to prune or filter the vocabulary.
    decomposition: TransformerMixin, default None
        Custom decomposition method to use.
        Can be an instance of FastICA or PCA, or basically any dimensionality
        reduction method. Has to have `fit_transform` and `fit` methods.
        If not specified, FastICA is used.
    max_iter: int, default 200
        Maximum number of iterations for ICA.
    random_state: int, default None
        Random state to use so that results are exactly reproducible.
    """

    def __init__(
        self,
        n_components: int = 10,
        encoder: Union[
            Encoder, str
        ] = "sentence-transformers/all-MiniLM-L6-v2",
        vectorizer: Optional[CountVectorizer] = None,
        decomposition: Optional[TransformerMixin] = None,
        max_iter: int = 200,
        random_state: Optional[int] = None,
    ):
        self.n_components = n_components
        self.encoder = encoder
        if isinstance(encoder, str):
            self.encoder_ = SentenceTransformer(encoder)
        else:
            self.encoder_ = encoder
        if vectorizer is None:
            self.vectorizer = default_vectorizer()
        else:
            self.vectorizer = vectorizer
        self.max_iter = max_iter
        self.random_state = random_state
        if decomposition is None:
            self.decomposition = FastICA(
                n_components, max_iter=max_iter, random_state=random_state
            )
        else:
            self.decomposition = decomposition

    def fit_transform(
        self, raw_documents, y=None, embeddings: Optional[np.ndarray] = None
    ) -> np.ndarray:
        console = Console()
        with console.status("Fitting model") as status:
            if embeddings is None:
                status.update("Encoding documents")
                embeddings = self.encoder_.encode(raw_documents)
                console.log("Documents encoded.")
            status.update("Decomposing embeddings")
            doc_topic = self.decomposition.fit_transform(embeddings)
            console.log("Decomposition done.")
            status.update("Extracting terms.")
            vocab = self.vectorizer.fit(raw_documents).get_feature_names_out()
            console.log("Term extraction done.")
            status.update("Encoding vocabulary")
            vocab_embeddings = self.encoder_.encode(vocab)
            console.log("Vocabulary encoded.")
            status.update("Estimating term importances")
            vocab_topic = self.decomposition.transform(vocab_embeddings)
            self.components_ = vocab_topic.T
            console.log("Model fitting done.")
        return doc_topic

    def transform(
        self, raw_documents, embeddings: Optional[np.ndarray] = None
    ) -> np.ndarray:
        """Infers topic importances for new documents based on a fitted model.

        Parameters
        ----------
        raw_documents: iterable of str
            Documents to infer topic importances for.
        embeddings: ndarray of shape (n_documents, n_dimensions), optional
            Precomputed document encodings.

        Returns
        -------
        ndarray of shape (n_documents, n_topics)
            Document-topic matrix.
        """
        if embeddings is None:
            embeddings = self.encoder_.encode(raw_documents)
        return self.decomposition.transform(embeddings)

    def print_topics(
        self,
        top_k: int = 5,
        show_scores: bool = False,
        show_negative: bool = True,
    ):
        super().print_topics(top_k, show_scores, show_negative)

    def export_topics(
        self,
        top_k: int = 5,
        show_scores: bool = False,
        show_negative: bool = True,
        format: str = "csv",
    ) -> str:
        return super().export_topics(top_k, show_scores, show_negative, format)

    def print_representative_documents(
        self,
        topic_id,
        raw_documents,
        document_topic_matrix=None,
        top_k=5,
        show_negative: bool = True,
    ):
        super().print_representative_documents(
            topic_id,
            raw_documents,
            document_topic_matrix,
            top_k,
            show_negative,
        )

    def export_representative_documents(
        self,
        topic_id,
        raw_documents,
        document_topic_matrix=None,
        top_k=5,
        show_negative: bool = True,
        format: str = "csv",
    ):
        return super().export_representative_documents(
            topic_id,
            raw_documents,
            document_topic_matrix,
            top_k,
            show_negative,
            format,
        )

transform(raw_documents, embeddings=None)

Infers topic importances for new documents based on a fitted model.

Parameters:

raw_documents (required)
    Documents to infer topic importances for.
embeddings: Optional[ndarray], default None
    Precomputed document encodings.

Returns:

ndarray of shape (n_documents, n_topics)
    Document-topic matrix.
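
For illustration, a minimal usage sketch (assuming model has already been fitted on a corpus; the document list here is hypothetical):

new_docs = ["some new text", "even newer text"]

# Let the model encode the documents itself
doc_topic = model.transform(new_docs)  # shape: (n_documents, n_topics)

# Or pass precomputed embeddings
new_embeddings = model.encoder_.encode(new_docs)
doc_topic = model.transform(new_docs, embeddings=new_embeddings)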

Source code in turftopic/models/decomp.py
def transform(
    self, raw_documents, embeddings: Optional[np.ndarray] = None
) -> np.ndarray:
    """Infers topic importances for new documents based on a fitted model.

    Parameters
    ----------
    raw_documents: iterable of str
        Documents to infer topic importances for.
    embeddings: ndarray of shape (n_documents, n_dimensions), optional
        Precomputed document encodings.

    Returns
    -------
    ndarray of shape (n_documents, n_topics)
        Document-topic matrix.
    """
    if embeddings is None:
        embeddings = self.encoder_.encode(raw_documents)
    return self.decomposition.transform(embeddings)