Semantic Signal Separation (S³)
Semantic Signal Separation recovers the dimensions (axes) along which most of the semantic variation in a corpus can be explained. A topic in S³ is a dimension of semantics, or a "semantic signal". This lets the model recover more nuanced topical content in documents, but it is not the best choice when you expect topics to be groupings of documents.
The Model
1. Semantic Signal Decomposition
S³ finds semantic signals in the embedding matrix by decomposing it either with Independent Component Analysis (default) or with Principal Component Analysis. The difference between these two is that PCA finds maximally uncorrelated (orthogonal) components, while ICA recovers maximally independent signals.
To use one or the other, set the objective parameter of the model:
from turftopic import SemanticSignalSeparation
# Uses ICA
model = SemanticSignalSeparation(10, objective="independence")
# Uses PCA
model = SemanticSignalSeparation(10, objective="orthogonality")
My anecdotal experience indicates that ICA generally gives better results, but feel free to experiment with the two options.
Turftopic uses the FastICA and PCA implementations from scikit-learn in the background.
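To make the difference between the two objectives concrete, here is a minimal sketch that calls scikit-learn's PCA and FastICA directly on a synthetic matrix. The data is an assumption for illustration only: three non-Gaussian signals mixed into 16 dimensions stand in for real document embeddings, and this is not a reproduction of Turftopic's internals.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)

# Synthetic stand-in for an embedding matrix (not real embeddings):
# 3 non-Gaussian latent signals mixed linearly into 16 dimensions.
sources = rng.laplace(size=(500, 3))
mixing = rng.normal(size=(3, 16))
embeddings = sources @ mixing

# Orthogonality: PCA components are orthonormal directions in embedding space.
pca = PCA(n_components=3, random_state=0)
pca_signals = pca.fit_transform(embeddings)

# Independence: FastICA looks for signals that are as statistically
# independent as possible; its components need not be orthogonal.
ica = FastICA(n_components=3, random_state=0, max_iter=1000)
ica_signals = ica.fit_transform(embeddings)

# Gram matrices of the component vectors; PCA's is the identity matrix.
pca_gram = pca.components_ @ pca.components_.T
ica_gram = ica.components_ @ ica.components_.T
print(np.round(pca_gram, 2))
print(np.round(ica_gram, 2))
```

Both decompositions return one row per document and one column per signal; they differ in the constraint placed on the recovered axes.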
2. Term Importance Estimation: Recovering Signal Strength for the Vocabulary
To estimate the importance of terms for each component, S³ embeds all terms with the same encoder as the documents, and decomposes the vocabulary embeddings with the fitted components. The decomposed signals' matrix is then transposed to get a topic-term matrix.
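The step above can be sketched with NumPy alone. This is an illustration under stated assumptions, not Turftopic's implementation: random arrays stand in for encoder outputs, and a PCA computed through SVD stands in for the fitted decomposition (the default in S³ is ICA). All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_terms, dim, n_topics = 100, 30, 16, 4

# Random stand-ins for encoder outputs; a real run would encode documents
# and vocabulary terms with the same sentence encoder.
doc_embeddings = rng.normal(size=(n_docs, dim))
term_embeddings = rng.normal(size=(n_terms, dim))

# Fit the decomposition on the documents (PCA via SVD as a stand-in).
mean = doc_embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(doc_embeddings - mean, full_matrices=False)
components = vt[:n_topics]  # shape: (n_topics, dim)

# Decompose the vocabulary embeddings with the components fitted on documents.
term_signals = (term_embeddings - mean) @ components.T  # (n_terms, n_topics)

# Transpose so that each row is a topic and each column is a term.
topic_term_matrix = term_signals.T
print(topic_term_matrix.shape)  # (4, 30)
```

Each row of the resulting matrix holds the signal strength of every vocabulary term along one semantic axis, which is what the topic descriptions are read from.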
Comparison to Classical Models
S³ is potentially the closest you can get with contextually sensitive models to classical matrix decomposition approaches, such as NMF or Latent Semantic Analysis. The conceptualization is very similar to these models, but instead of recovering factors of word use, S³ recovers dimensions in a continuous semantic space.
Most of the intuitions you have about LSA will also apply to S³, but it might give more surprising results, as embedding models can learn efficient representations of semantics that differ from the ones humans use.
S³ is also way more robust to stop words, meaning that you won't have to do extensive preprocessing.
Interpretation
S³ is one of the trickier models to interpret due to the way it conceptualizes topics. Unlike many other models, the fact that a word ranks very low for a topic is also useful information for interpretation's sake. In other words, both ends of term importance are important for S³, words that rank highest, and words that rank lowest.
To investigate these relations, we recommend that you use Word Maps from topicwizard. Word maps allow you to display the distribution of all words in the vocabulary on two given topic axes.
pip install topic-wizard
from turftopic import SemanticSignalSeparation
from topicwizard import figures

# chatgpt_tweets is the corpus being analysed (a list of document strings)
model = SemanticSignalSeparation(10)
topic_data = model.prepare_topic_data(chatgpt_tweets)

# Plot every word in the vocabulary along two topic axes
figures.word_map(
    topic_data,
    topic_axes=(
        "9_api_apis_register_automatedsarcasmgenerator",
        "4_study_studying_assessments_exams",
    ),
)
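As a lightweight complement to a word map, both ends of a single topic can also be listed directly. The sketch below is plain Python with made-up importances for one hypothetical topic; the helper extreme_terms is illustrative and not part of Turftopic or topicwizard.

```python
def extreme_terms(importances, vocab, top_k=3):
    """Return the top_k highest- and lowest-ranking terms for one topic."""
    ranked = sorted(zip(vocab, importances), key=lambda pair: pair[1], reverse=True)
    highest = [term for term, _ in ranked[:top_k]]
    # Reverse the tail so the most negative term comes first.
    lowest = [term for term, _ in ranked[-top_k:][::-1]]
    return highest, lowest


# Made-up importances for a single hypothetical topic.
vocab = ["api", "register", "exam", "study", "the", "tweet"]
importances = [2.1, 1.4, -1.9, -2.3, 0.05, 0.2]

highest, lowest = extreme_terms(importances, vocab, top_k=2)
print("highest:", highest)  # highest: ['api', 'register']
print("lowest:", lowest)    # lowest: ['study', 'exam']
```

Reading the two lists together gives the poles of the axis: in this toy example the topic would separate API-related vocabulary from study-related vocabulary.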
Considerations
Strengths
- Nuanced Content: Documents are assumed to contain multiple topics and the model can therefore work on corpora where texts are longer and might not group in semantic space based on topic.
- Efficiency: FastICA is called fast for a reason. S³ is one of the most computationally efficient models in Turftopic.
- Novel Descriptions: S³ tends to discover topics that no other models do. This is due to its interpretation of what a topic is.
- High Quality: Topic descriptions tend to be high quality and easily interpretable.
Weaknesses
- Noise Components: The model tends to find components in corpora that only contain noise. This is typical in other applications of ICA as well, and it is frequently used for noise removal in other disciplines. We are working on automated solutions to detect and flag these components.
- Sometimes Unintuitive: Neural embedding models might have a different mapping of the semantic space than humans. Sometimes S³ uncovers unintuitive dimensions of meaning as a result of this.
- Moderate Scalability: The model cannot be fitted in an online fashion. It is reasonably scalable, but for very large corpora you might want to consider using a different model.
API Reference
turftopic.models.decomp.SemanticSignalSeparation
Bases: ContextualModel
Separates the embedding matrix into 'semantic signals' with component analysis methods. Topics are assumed to be dimensions of semantics.
from turftopic import SemanticSignalSeparation
corpus: list[str] = ["some text", "more text", ...]
model = SemanticSignalSeparation(10).fit(corpus)
model.print_topics()
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_components | int | Number of topics. | 10 |
encoder | Union[Encoder, str] | Model to encode documents/terms, all-MiniLM-L6-v2 is the default. | 'sentence-transformers/all-MiniLM-L6-v2' |
vectorizer | Optional[CountVectorizer] | Vectorizer used for term extraction. Can be used to prune or filter the vocabulary. | None |
decomposition | Optional[TransformerMixin] | Custom decomposition method to use. Can be an instance of FastICA or PCA, or basically any dimensionality reduction method. Has to have | None |
max_iter | int | Maximum number of iterations for ICA. | 200 |
random_state | Optional[int] | Random state to use so that results are exactly reproducible. | None |
Source code in turftopic/models/decomp.py
transform(raw_documents, embeddings=None)
Infers topic importances for new documents based on a fitted model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
raw_documents | | Documents to infer topic importances for. | required |
embeddings | Optional[ndarray] | Precomputed document encodings. | None |
Returns:
Type | Description |
---|---|
ndarray of shape (n_documents, n_topics) | Document-topic matrix. |
Source code in turftopic/models/decomp.py