Semantic Signal Separation (\(S^3\))

Semantic Signal Separation tries to recover the axes along which most of the semantic variation in a corpus can be explained. A topic in \(S^3\) is an axis of semantics in the corpus. This lets the model recover more nuanced topical content in documents, but it is not optimal when you expect topics to be groupings of documents.

\(S^3\) is one of the fastest topic models out there, even rivalling vanilla NMF, when not accounting for embedding time. It also typically produces very high quality topics, and our evaluations indicate that it performs significantly better when no preprocessing is applied to texts.

How does \(S^3\) work?

Step 1: Document-embedding Decomposition

The first step is to decompose the embedding matrix using ICA. This step discovers the underlying semantic axes as latent independent components in the embeddings.

See formula

Let the encodings of documents in the corpus be \(X\).
Decompose \(X\) using FastICA: \(X = SA^T\), where \(A\) is the mixing matrix and \(S\) is the document-topic matrix (one row per document).
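As a rough illustration of this step, the sketch below runs scikit-learn's FastICA on synthetic data that stands in for document embeddings. The synthetic data and the variable names (`X`, `doc_topic`, `A`, `C`) are assumptions made for the example; in Turftopic the embeddings come from the encoder model.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)

# Synthetic stand-in for document embeddings: 500 "documents" in 16 dimensions,
# produced by mixing 4 non-Gaussian latent signals and adding a little noise.
n_docs, n_dims, n_topics = 500, 16, 4
sources = rng.laplace(size=(n_docs, n_topics))
mixing = rng.normal(size=(n_dims, n_topics))
X = sources @ mixing.T + 0.01 * rng.normal(size=(n_docs, n_dims))

ica = FastICA(n_components=n_topics, max_iter=200, random_state=0)
doc_topic = ica.fit_transform(X)  # one row per document, one column per axis

A = ica.mixing_      # mixing matrix, shape (n_dims, n_topics)
C = ica.components_  # unmixing matrix, shape (n_topics, n_dims)

print(doc_topic.shape, A.shape, C.shape)
```

Each column of `doc_topic` holds the documents' positions on one recovered axis, and the values can be negative as well as positive.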

Step 2: Term Importance Estimation

Term importances for each topic are calculated by encoding the entire vocabulary of the corpus using the same embedding model, then recovering the strength of each latent component in the word embedding matrix. The strength of the components in the words will be interpreted as the words' importance in a given topic.

Visual representation of term importance approaches in S³

See formula

Let the matrix of word encodings be \(V\).
Calculate the pseudo-inverse of the mixing matrix \(C = A^{+}\), where \(C\) is the unmixing matrix.
Project word embeddings onto the semantic axes by multiplying them by the unmixing matrix: \(W = VC^T\). \(W^T\) is then the topic-term matrix (model.components_).
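The projection can be sketched with plain NumPy. The mixing matrix and word embeddings below are random stand-ins, so only the shapes and the algebra are meaningful; all variable names are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_dims, n_topics = 6, 8, 3

A = rng.normal(size=(n_dims, n_topics))  # stand-in mixing matrix
C = np.linalg.pinv(A)                    # unmixing matrix, shape (n_topics, n_dims)

V = rng.normal(size=(n_words, n_dims))   # stand-in word embeddings
W = V @ C.T                              # word positions on the axes, (n_words, n_topics)
topic_term = W.T                         # topic-term matrix, (n_topics, n_words)

print(W.shape, topic_term.shape)
```

In the source shown in the API reference, this step is delegated to the fitted decomposition's `transform` method. With FastICA that method also subtracts the mean of the training embeddings first, which this sketch leaves out.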

There are three distinct methods to calculate term importances from word projections:

Choose a word importance method

Axial

from turftopic import SemanticSignalSeparation

model = SemanticSignalSeparation(n_components=10, feature_importance="axial")

Axial word importances are defined as the words' positions on the semantic axes. This approach selects highly relevant words for topic descriptions, but topic descriptions might share words if a word scores high on multiple axes.

The importance of word \(j\) for topic \(t\) is: \(\beta_{tj} = W_{jt}\)

Angular

from turftopic import SemanticSignalSeparation

model = SemanticSignalSeparation(n_components=10, feature_importance="angular")

Angular word importances are calculated by taking the cosine of the angle between projected word vectors and semantic axes. This makes axis descriptions very distinct and specific to the given axis, but they might include words that are not as relevant in the corpus.

\(\beta_{tj} = \cos(\Theta) = \frac{W_{jt}}{||W_j||}\)

Combined (default)

from turftopic import SemanticSignalSeparation

model = SemanticSignalSeparation(n_components=10, feature_importance="combined")

Combined word importance is a combination of axial and angular term importance, and is recommended as it balances the two approaches' strengths and weaknesses.

\(\beta_{tj} = \frac{(W_{jt})^3}{||W_j||}\)
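To make the difference between the three formulas concrete, here is a small NumPy sketch on a hand-made projection matrix. The four word vectors are hypothetical and were picked only to show how the rankings diverge.

```python
import numpy as np

# Hypothetical projections of 4 words onto 2 semantic axes (rows are words).
W = np.array([
    [3.0, 0.0],   # word 0: far along axis 0, exactly on the axis
    [3.0, 3.0],   # word 1: far along axis 0, but equally far along axis 1
    [0.5, 0.0],   # word 2: exactly on axis 0, but close to the origin
    [-2.0, 1.0],  # word 3: on the negative side of axis 0
])
norms = np.linalg.norm(W, axis=1, keepdims=True)

axial = W.T                  # beta_tj = W_jt
angular = (W / norms).T      # beta_tj = W_jt / ||W_j||
combined = (W**3 / norms).T  # beta_tj = W_jt^3 / ||W_j||

for name, beta in [("axial", axial), ("angular", angular), ("combined", combined)]:
    print(name, np.round(beta[0], 2))
```

On axis 0, axial importance cannot tell word 0 from word 1, and angular importance cannot tell word 0 from word 2. The combined score ranks word 0 first, then word 1, then word 2, and it keeps the sign of the axial position because the exponent is odd.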

Dynamic Topic Modeling

\(S^3\) can also be used as a dynamic topic model. Temporally changing components are found using the following steps:

1. Fit a global \(S^3\) model over the whole corpus.
2. Estimate an unmixing matrix for each time slice by fitting a linear regression from the embeddings in the time slice to the rows of the document-topic matrix that the global model estimated for that slice.
3. Estimate term importances for each time slice the same way as in the global model.
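The three steps can be sketched in NumPy as follows. This is a simplified illustration, not the library's implementation: the global unmixing matrix is a random stand-in, and `np.linalg.lstsq` (least squares without an intercept) is used in place of the scikit-learn `LinearRegression` that appears in the source further down.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_dims, n_topics, n_words, n_bins = 300, 8, 3, 5, 3

X = rng.normal(size=(n_docs, n_dims))    # stand-in document embeddings
C = rng.normal(size=(n_topics, n_dims))  # stand-in global unmixing matrix
V = rng.normal(size=(n_words, n_dims))   # stand-in word embeddings
time_labels = rng.integers(0, n_bins, size=n_docs)  # time-slice index per document

# Step 1: document-topic matrix from the (here simulated) global model.
X_centered = X - X.mean(axis=0)
doc_topic = X_centered @ C.T

temporal_components = np.empty((n_bins, n_topics, n_words))
for t in range(n_bins):
    in_slice = time_labels == t
    # Step 2: least-squares map from slice embeddings to slice topic scores.
    coef, *_ = np.linalg.lstsq(
        X_centered[in_slice], doc_topic[in_slice], rcond=None
    )
    # Step 3: project word embeddings with the slice-specific unmixing matrix.
    temporal_components[t] = (V @ coef).T

print(temporal_components.shape)
```

Because the simulated topic scores are an exact linear function of the embeddings and every slice holds more documents than there are embedding dimensions, each slice recovers the same matrix here. The sketch is meant to show the mechanics and array shapes only.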

from datetime import datetime
from turftopic import SemanticSignalSeparation

ts: list[datetime] = [datetime(year=2018, month=2, day=12), ...]
corpus: list[str] = ["First document", ...]

model = SemanticSignalSeparation(10).fit_dynamic(corpus, timestamps=ts, bins=10)
model.plot_topics_over_time()

Info

Topics over time in \(S^3\) are treated slightly differently from most other models. This is because topic importances in \(S^3\) are not proportions and can dip below zero. In the time slices where a topic is below zero, its negative definition is displayed.

Topics over time in a dynamic Semantic Signal Separation model.
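A toy illustration of that sign rule, assuming (as in the `fit_transform_dynamic` source below) that a topic's importance in a time slice is the mean of its document-topic scores in that slice. The scores and bin labels are made up for the example.

```python
import numpy as np

# Hypothetical scores of six documents on one topic, and each document's time bin.
doc_topic = np.array([0.8, 0.4, -0.2, -0.9, -0.5, 0.1])
time_labels = np.array([0, 0, 1, 1, 2, 2])

sides = []
for t in np.unique(time_labels):
    importance = doc_topic[time_labels == t].mean()
    # A slice with a negative mean is described by the topic's negative terms.
    side = "positive" if importance >= 0 else "negative"
    sides.append(side)
    print(f"bin {t}: importance={importance:+.2f}, show {side} terms")
```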

Model Refitting

Unlike most other models in Turftopic, \(S^3\) can be refit using different parameters and random seeds without needing to initialize the model from scratch. This makes \(S^3\) incredibly convenient for exploring different numbers of topics, or adjusting the number of iterations.

Refitting the model takes a fraction of the time of initializing a new one and fitting it, as the vocabulary doesn't have to be learned or encoded by the model again.

from turftopic import SemanticSignalSeparation

model = SemanticSignalSeparation(5, random_state=42)
model.fit(corpus)

print(len(model.topic_names))
# 5

model.refit(n_components=10, random_state=30)
print(len(model.topic_names))
# 10
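The reason refitting is cheap can be sketched outside Turftopic: once document and vocabulary embeddings are cached, only the decomposition has to be rerun. The helper `refit` below is hypothetical and only mirrors the idea; the embeddings are synthetic stand-ins.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
# Stand-ins for cached document and vocabulary embeddings (32 dimensions).
doc_embeddings = (
    rng.laplace(size=(400, 6)) @ rng.normal(size=(6, 32))
    + 0.05 * rng.normal(size=(400, 32))
)
vocab_embeddings = rng.normal(size=(50, 32))


def refit(n_components: int, random_state: int):
    """Rerun only the decomposition on cached embeddings (hypothetical helper)."""
    ica = FastICA(
        n_components=n_components, max_iter=200, random_state=random_state
    )
    doc_topic = ica.fit_transform(doc_embeddings)
    axial_components = ica.transform(vocab_embeddings).T
    return doc_topic, axial_components


for k in (3, 5):
    doc_topic, components = refit(k, random_state=42)
    print(k, doc_topic.shape, components.shape)
```

No text is encoded inside the loop, which is where most of the time goes when fitting a model from scratch.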

Interpretation

Negative terms

Terms that rank lowest on a topic are also meaningful in \(S^3\). When interpreting a semantic axis, you should consider both ends of the axis. For this reason, when you print or export topics from \(S^3\), the lowest-ranking terms are shown along with the highest-ranking ones.

Here's an example on ArXiv ML papers:

from turftopic import SemanticSignalSeparation
from sklearn.feature_extraction.text import CountVectorizer

model = SemanticSignalSeparation(5, vectorizer=CountVectorizer(), random_state=42)
model.fit(corpus)

model.print_topics(top_k=5)

Topic	Positive	Negative
0	clustering, histograms, clusterings, histogram, classifying	reinforcement, exploration, planning, tactics, reinforce
1	textual, pagerank, litigants, marginalizing, entailment	matlab, waveforms, microcontroller, accelerometers, microcontrollers
2	sparsestmax, denoiseing, denoising, minimizers, minimizes	automation, affective, chatbots, questionnaire, attitudes
3	rebmigraph, subgraph, subgraphs, graphsage, graph	adversarial, adversarially, adversarialization, adversary, security
4	clustering, estimations, algorithm, dbscan, estimation	cnn, deepmind, deeplabv3, convnet, deepseenet
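The two columns can be read off a single row of the topic-term matrix by sorting it in both directions, in the same way `plot_topics_with_images` does in the source below. The vocabulary and importances in this sketch are made up.

```python
import numpy as np

vocab = np.array(["graph", "subgraph", "security", "adversarial", "image", "cnn"])
# Hypothetical importances of each term on one semantic axis.
component = np.array([2.1, 1.7, -1.9, -2.4, 0.1, -0.2])

top_k = 2
positive = vocab[np.argsort(-component)[:top_k]]  # highest-ranking terms
negative = vocab[np.argsort(component)[:top_k]]   # lowest-ranking terms

print("Positive:", ", ".join(positive))
print("Negative:", ", ".join(negative))
```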

Concept Compass

If you want to gain a deeper understanding of terms' relation to axes, you can produce a concept compass. This involves plotting terms in a corpus along two semantic axes.

In order to use the compass in Turftopic you will need to have plotly installed:

pip install plotly

You can display a compass based on a fitted model like so:

fig = model.plot_concept_compass(topic_x=1, topic_y=4)
fig.show()

Concept Compass of ArXiv ML Papers along two semantic axes.

Image Compass

In multimodal contexts, you can also plot images along two chosen axes by using plot_image_compass().

model = SemanticSignalSeparation(10)
model.fit_multimodal(corpus, images=images)

fig = model.plot_image_compass(topic_x=0, topic_y=1)
fig.show()

Image Compass of IKEA furniture along two semantic axes

API Reference

`turftopic.models.decomp.SemanticSignalSeparation`

Bases: ContextualModel, DynamicTopicModel, MultimodalModel

Separates the embedding matrix into 'semantic signals' with component analysis methods. Topics are assumed to be dimensions of semantics.

from turftopic import SemanticSignalSeparation

corpus: list[str] = ["some text", "more text", ...]

model = SemanticSignalSeparation(10).fit(corpus)
model.print_topics()

Parameters:

Name	Type	Description	Default
`n_components`	`int`	Number of topics.	`10`
`encoder`	`Union[Encoder, str, MultimodalEncoder]`	Model to encode documents/terms, all-MiniLM-L6-v2 is the default.	`'sentence-transformers/all-MiniLM-L6-v2'`
`vectorizer`	`Optional[CountVectorizer]`	Vectorizer used for term extraction. Can be used to prune or filter the vocabulary.	`None`
`decomposition`	`Optional[TransformerMixin]`	Custom decomposition method to use. Can be an instance of FastICA or PCA, or basically any dimensionality reduction method. Has to have `fit_transform` and `fit` methods. If not specified, FastICA is used.	`None`
`max_iter`	`int`	Maximum number of iterations for ICA.	`200`
`feature_importance`	`Literal['axial', 'angular', 'combined']`	Defines whether the word's position on an axis ('axial'), its angle to the axis ('angular') or their combination ('combined') should determine the word's importance for a topic.	`'combined'`
`random_state`	`Optional[int]`	Random state to use so that results are exactly reproducible.	`None`

Source code in turftopic/models/decomp.py

class SemanticSignalSeparation(
    ContextualModel, DynamicTopicModel, MultimodalModel
):
    """Separates the embedding matrix into 'semantic signals' with
    component analysis methods.
    Topics are assumed to be dimensions of semantics.

    ```python
    from turftopic import SemanticSignalSeparation

    corpus: list[str] = ["some text", "more text", ...]

    model = SemanticSignalSeparation(10).fit(corpus)
    model.print_topics()
    ```

    Parameters
    ----------
    n_components: int, default 10
        Number of topics.
    encoder: str or SentenceTransformer
        Model to encode documents/terms, all-MiniLM-L6-v2 is the default.
    vectorizer: CountVectorizer, default None
        Vectorizer used for term extraction.
        Can be used to prune or filter the vocabulary.
    decomposition: TransformerMixin, default None
        Custom decomposition method to use.
        Can be an instance of FastICA or PCA, or basically any dimensionality
        reduction method. Has to have `fit_transform` and `fit` methods.
        If not specified, FastICA is used.
    max_iter: int, default 200
        Maximum number of iterations for ICA.
    feature_importance: "axial", "angular" or "combined", default "combined"
        Defines whether the word's position on an axis ('axial'), its angle to the axis ('angular')
        or their combination ('combined') should determine the word's importance for a topic.
    random_state: int, default None
        Random state to use so that results are exactly reproducible.
    """

    def __init__(
        self,
        n_components: int = 10,
        encoder: Union[
            Encoder, str, MultimodalEncoder
        ] = "sentence-transformers/all-MiniLM-L6-v2",
        vectorizer: Optional[CountVectorizer] = None,
        decomposition: Optional[TransformerMixin] = None,
        max_iter: int = 200,
        feature_importance: Literal[
            "axial", "angular", "combined"
        ] = "combined",
        random_state: Optional[int] = None,
    ):
        self.n_components = n_components
        self.encoder = encoder
        self.feature_importance = feature_importance
        if isinstance(encoder, str):
            self.encoder_ = SentenceTransformer(encoder)
        else:
            self.encoder_ = encoder
        self.validate_encoder()
        if vectorizer is None:
            self.vectorizer = default_vectorizer()
        else:
            self.vectorizer = vectorizer
        self.max_iter = max_iter
        self.random_state = random_state
        if decomposition is None:
            self.decomposition = FastICA(
                n_components, max_iter=max_iter, random_state=random_state
            )
        else:
            self.decomposition = decomposition

    def estimate_components(
        self, feature_importance: Literal["axial", "angular", "combined"]
    ) -> np.ndarray:
        """Reestimates components based on the chosen feature_importance method."""
        if feature_importance == "axial":
            self.components_ = self.axial_components_
        elif feature_importance == "angular":
            self.components_ = self.angular_components_
        elif feature_importance == "combined":
            self.components_ = (
                np.square(self.axial_components_) * self.angular_components_
            )
        if hasattr(self, "axial_temporal_components_"):
            if feature_importance == "axial":
                self.temporal_components_ = self.axial_temporal_components_
            elif feature_importance == "angular":
                self.temporal_components_ = self.angular_temporal_components_
            elif feature_importance == "combined":
                self.temporal_components_ = (
                    np.square(self.axial_temporal_components_)
                    * self.angular_temporal_components_
                )
        return self.components_

    @property
    def has_negative_side(self) -> bool:
        return True

    def fit_transform(
        self, raw_documents, y=None, embeddings: Optional[np.ndarray] = None
    ) -> np.ndarray:
        console = Console()
        self.embeddings = embeddings
        with console.status("Fitting model") as status:
            if self.embeddings is None:
                status.update("Encoding documents")
                self.embeddings = self.encoder_.encode(raw_documents)
                console.log("Documents encoded.")
            status.update("Decomposing embeddings")
            if isinstance(self.decomposition, FastICA) and (y is not None):
                warnings.warn(
                    "y is specified but decomposition method is FastICA, which can't use labels. y will be ignored. Use a metric learning method for semi-supervised S^3."
                )
            doc_topic = self.decomposition.fit_transform(self.embeddings, y=y)
            console.log("Decomposition done.")
            status.update("Extracting terms.")
            vocab = self.vectorizer.fit(raw_documents).get_feature_names_out()
            console.log("Term extraction done.")
            status.update("Encoding vocabulary")
            self.vocab_embeddings = self.encoder_.encode(vocab)
            if self.vocab_embeddings.shape[1] != self.embeddings.shape[1]:
                raise ValueError(
                    NOT_MATCHING_ERROR.format(
                        n_dims=self.embeddings.shape[1],
                        n_word_dims=self.vocab_embeddings.shape[1],
                    )
                )
            console.log("Vocabulary encoded.")
            status.update("Estimating term importances")
            vocab_topic = self.decomposition.transform(self.vocab_embeddings)
            self.axial_components_ = vocab_topic.T
            if self.feature_importance == "axial":
                self.components_ = self.axial_components_
            elif self.feature_importance == "angular":
                self.components_ = self.angular_components_
            elif self.feature_importance == "combined":
                self.components_ = (
                    np.square(self.axial_components_)
                    * self.angular_components_
                )
            console.log("Model fitting done.")
        return doc_topic

    def fit_transform_multimodal(
        self,
        raw_documents: list[str],
        images: list[ImageRepr],
        y=None,
        embeddings: Optional[MultimodalEmbeddings] = None,
    ) -> np.ndarray:
        self.validate_embeddings(embeddings)
        console = Console()
        self.images = images
        self.multimodal_embeddings = embeddings
        with console.status("Fitting model") as status:
            if self.multimodal_embeddings is None:
                status.update("Encoding documents")
                self.multimodal_embeddings = self.encode_multimodal(
                    raw_documents, images
                )
                console.log("Documents encoded.")
            self.embeddings = self.multimodal_embeddings["document_embeddings"]
            status.update("Decomposing embeddings")
            if isinstance(self.decomposition, FastICA) and (y is not None):
                warnings.warn(
                    "Supervisory signal is specified but decomposition method is FastICA. y will be ignored. Use a metric learning method for supervised S^3."
                )
            doc_topic = self.decomposition.fit_transform(self.embeddings, y=y)
            console.log("Decomposition done.")
            status.update("Extracting terms.")
            vocab = self.vectorizer.fit(raw_documents).get_feature_names_out()
            console.log("Term extraction done.")
            status.update("Encoding vocabulary")
            self.vocab_embeddings = self.encode_documents(vocab)
            if self.vocab_embeddings.shape[1] != self.embeddings.shape[1]:
                raise ValueError(
                    NOT_MATCHING_ERROR.format(
                        n_dims=self.embeddings.shape[1],
                        n_word_dims=self.vocab_embeddings.shape[1],
                    )
                )
            console.log("Vocabulary encoded.")
            status.update("Estimating term importances")
            vocab_topic = self.decomposition.transform(self.vocab_embeddings)
            self.axial_components_ = vocab_topic.T
            if self.feature_importance == "axial":
                self.components_ = self.axial_components_
            elif self.feature_importance == "angular":
                self.components_ = self.angular_components_
            elif self.feature_importance == "combined":
                self.components_ = (
                    np.square(self.axial_components_)
                    * self.angular_components_
                )
            console.log("Model fitting done.")
            status.update("Transforming images")
            self.image_topic_matrix = self.transform(
                [], embeddings=self.multimodal_embeddings["image_embeddings"]
            )
            self.top_images = self.collect_top_images(
                images, self.image_topic_matrix
            )
            self.negative_images = self.collect_top_images(
                images, self.image_topic_matrix, negative=True
            )
            console.log("Images transformed")
        return doc_topic

    def plot_topics_with_images(self, n_columns: int = 3, grid_size: int = 4):
        try:
            import plotly.graph_objects as go
        except (ImportError, ModuleNotFoundError) as e:
            raise ModuleNotFoundError(
                "Please install plotly if you intend to use plots in Turftopic."
            ) from e
        fig = go.Figure()
        width, height = 1200, 1200
        scale_factor = 0.25
        w, h = width * scale_factor, height * scale_factor
        padding = 10
        figure_height = (h + padding) * self.n_components
        figure_width = (w + padding) * 2
        fig = fig.add_trace(
            go.Scatter(
                x=[0, figure_width],
                y=[0, figure_height],
                mode="markers",
                marker_opacity=0,
            )
        )
        vocab = self.get_vocab()
        for i, component in enumerate(self.components_):
            positive = vocab[np.argsort(-component)[:7]]
            negative = vocab[np.argsort(component)[:7]]
            pos_image = self._image_grid(
                self.top_images[i],
                (width, height),
                grid_size=(grid_size, grid_size),
            )
            neg_image = self._image_grid(
                self.negative_images[i],
                (width, height),
                grid_size=(grid_size, grid_size),
            )
            x0 = 0
            y0 = (h + padding) * (self.n_components - i)
            fig = fig.add_layout_image(
                dict(
                    x=x0,
                    sizex=w,
                    y=y0,
                    sizey=h,
                    xref="x",
                    yref="y",
                    opacity=1.0,
                    layer="below",
                    sizing="stretch",
                    source=pos_image,
                ),
            )
            fig.add_annotation(
                x=(w / 2),
                y=(h + padding) * (self.n_components - i) - (h / 2),
                text="<b> " + "<br> ".join(positive),
                font=dict(
                    size=16,
                    family="Times New Roman",
                    color="white",
                ),
                bgcolor="rgba(0,0,255, 0.5)",
            )
            x0 = (w + padding) * 1
            fig = fig.add_layout_image(
                dict(
                    x=x0,
                    sizex=w,
                    y=y0,
                    sizey=h,
                    xref="x",
                    yref="y",
                    opacity=1.0,
                    layer="below",
                    sizing="stretch",
                    source=neg_image,
                ),
            )
            fig.add_annotation(
                x=(w + padding) + (w / 2),
                y=(h + padding) * (self.n_components - i) - (h / 2),
                text="<b> " + "<br> ".join(negative),
                font=dict(
                    size=16,
                    family="Times New Roman",
                    color="white",
                ),
                bgcolor="rgba(255,0,0, 0.5)",
            )
        fig = fig.update_xaxes(visible=False, range=[0, figure_width])
        fig = fig.update_yaxes(
            visible=False,
            range=[0, figure_height],
            # the scaleanchor attribute ensures that the aspect ratio stays constant
            scaleanchor="x",
        )
        fig = fig.update_layout(
            width=figure_width,
            height=figure_height,
            margin={"l": 0, "r": 0, "t": 0, "b": 0},
        )
        return fig

    def _rename_automatic(self, namer: TopicNamer) -> list[str]:
        """Names topics with a topic namer in the model.

        Parameters
        ----------
        namer: TopicNamer
            A Topic namer model to name topics with.

        Returns
        -------
        list[str]
            List of topic names.
        """
        positive_names = namer.name_topics(self._top_terms())
        negative_names = namer.name_topics(self._top_terms(positive=False))
        names = []
        for positive, negative in zip(positive_names, negative_names):
            names.append(f"{positive}/{negative}")
        self.topic_names_ = names
        return self.topic_names_

    def refit_transform(
        self,
        n_components: Optional[int] = None,
        max_iter: Optional[int] = None,
        random_state: Optional[int] = None,
    ):
        """Refits model with the given parameters.
        This is significantly faster than fitting a new model from scratch.

        Parameters
        ----------
        n_components: int, default None
            Number of topics.
        max_iter: int, default None
            Maximum number of iterations for ICA.
        random_state: int, default None
            Random state to use so that results are exactly reproducible.

        Returns
        -------
        ndarray of shape (n_documents, n_topics)
            Document-topic matrix.
        """
        self.topic_names_ = None
        # Fall back to the current setting when n_components is left as None.
        n_components = (
            n_components if n_components is not None else self.n_components
        )
        self.n_components = n_components
        max_iter = max_iter if max_iter is not None else self.max_iter
        random_state = (
            random_state if random_state is not None else self.random_state
        )
        self.decomposition = FastICA(
            n_components, max_iter=max_iter, random_state=random_state
        )
        console = Console()
        with console.status("Refitting model") as status:
            status.update("Decomposing embeddings")
            doc_topic = self.decomposition.fit_transform(self.embeddings)
            console.log("Decomposition done.")
            status.update("Estimating term importances")
            vocab_topic = self.decomposition.transform(self.vocab_embeddings)
            self.axial_components_ = vocab_topic.T
            if self.feature_importance == "axial":
                self.components_ = self.axial_components_
            elif self.feature_importance == "angular":
                self.components_ = self.angular_components_
            elif self.feature_importance == "combined":
                self.components_ = (
                    np.square(self.axial_components_)
                    * self.angular_components_
                )
            console.log("Model fitting done.")
        return doc_topic

    def fit_transform_dynamic(
        self,
        raw_documents,
        timestamps: list[datetime],
        embeddings: Optional[np.ndarray] = None,
        bins: Union[int, list[datetime]] = 10,
    ) -> np.ndarray:
        document_topic_matrix = self.fit_transform(
            raw_documents, embeddings=embeddings
        )
        time_labels, self.time_bin_edges = self.bin_timestamps(
            timestamps, bins
        )
        n_comp, n_vocab = self.components_.shape
        n_bins = len(self.time_bin_edges) - 1
        self.axial_temporal_components_ = np.full(
            (n_bins, n_comp, n_vocab),
            np.nan,
            dtype=self.components_.dtype,
        )
        self.temporal_importance_ = np.zeros((n_bins, n_comp))
        whitened_embeddings = np.copy(self.embeddings)
        if getattr(self.decomposition, "whiten", False):
            whitened_embeddings -= self.decomposition.mean_
        # doc_topic = np.dot(X, self.components_.T)
        for i_timebin in np.unique(time_labels):
            topic_importances = document_topic_matrix[
                time_labels == i_timebin
            ].mean(axis=0)
            self.temporal_importance_[i_timebin, :] = topic_importances
            t_doc_topic = document_topic_matrix[time_labels == i_timebin]
            t_embeddings = whitened_embeddings[time_labels == i_timebin]
            linreg = LinearRegression().fit(t_embeddings, t_doc_topic)
            self.axial_temporal_components_[i_timebin, :, :] = np.dot(
                self.vocab_embeddings, linreg.coef_.T
            ).T
        self.estimate_components(self.feature_importance)
        return document_topic_matrix

    def refit_transform_dynamic(
        self,
        timestamps: list[datetime],
        bins: Union[int, list[datetime]] = 10,
        n_components: Optional[int] = None,
        max_iter: Optional[int] = None,
        random_state: Optional[int] = None,
    ):
        """Refits $S^3$ to be a dynamic model."""
        document_topic_matrix = self.refit_transform(
            n_components=n_components,
            max_iter=max_iter,
            random_state=random_state,
        )
        time_labels, self.time_bin_edges = self.bin_timestamps(
            timestamps, bins
        )
        n_comp, n_vocab = self.components_.shape
        n_bins = len(self.time_bin_edges) - 1
        self.axial_temporal_components_ = np.full(
            (n_bins, n_comp, n_vocab),
            np.nan,
            dtype=self.components_.dtype,
        )
        self.temporal_importance_ = np.zeros((n_bins, n_comp))
        whitened_embeddings = np.copy(self.embeddings)
        if getattr(self.decomposition, "whiten", False):
            whitened_embeddings -= self.decomposition.mean_
        # doc_topic = np.dot(X, self.components_.T)
        for i_timebin in np.unique(time_labels):
            topic_importances = document_topic_matrix[
                time_labels == i_timebin
            ].mean(axis=0)
            self.temporal_importance_[i_timebin, :] = topic_importances
            t_doc_topic = document_topic_matrix[time_labels == i_timebin]
            t_embeddings = whitened_embeddings[time_labels == i_timebin]
            linreg = LinearRegression().fit(t_embeddings, t_doc_topic)
            self.axial_temporal_components_[i_timebin, :, :] = np.dot(
                self.vocab_embeddings, linreg.coef_.T
            ).T
        self.estimate_components(self.feature_importance)
        return document_topic_matrix

    def refit(
        self,
        n_components: Optional[int] = None,
        max_iter: Optional[int] = None,
        random_state: Optional[int] = None,
    ):
        """Refits model with the given parameters.
        This is significantly faster than fitting a new model from scratch.

        Parameters
        ----------
        n_components: int, default None
            Number of topics.
        max_iter: int, default None
            Maximum number of iterations for ICA.
        random_state: int, default None
            Random state to use so that results are exactly reproducible.

        Returns
        -------
        Refitted model.
        """
        self.refit_transform(n_components, max_iter, random_state)
        return self

    @property
    def angular_components_(self):
        """Reweights words based on their angle in ICA-space to the axis
        base vectors.
        """
        if not hasattr(self, "axial_components_"):
            raise NotFittedError("Model has not been fitted yet.")
        word_vectors = self.axial_components_.T
        n_topics = self.axial_components_.shape[0]
        axis_vectors = np.eye(n_topics)
        cosine_components = cosine_similarity(axis_vectors, word_vectors)
        return cosine_components

    @property
    def angular_temporal_components_(self):
        """Reweights words based on their angle in ICA-space to the axis
        base vectors in a dynamic model.
        """
        if not hasattr(self, "axial_temporal_components_"):
            raise NotFittedError("Model has not been fitted dynamically.")
        components = []
        for axial_components in self.axial_temporal_components_:
            word_vectors = axial_components.T
            n_topics = axial_components.shape[0]
            axis_vectors = np.eye(n_topics)
            cosine_components = cosine_similarity(axis_vectors, word_vectors)
            components.append(cosine_components)
        return np.stack(components)

    def transform(
        self, raw_documents, embeddings: Optional[np.ndarray] = None
    ) -> np.ndarray:
        """Infers topic importances for new documents based on a fitted model.

        Parameters
        ----------
        raw_documents: iterable of str
            Documents to infer topic importances for.
        embeddings: ndarray of shape (n_documents, n_dimensions), optional
            Precomputed document encodings.

        Returns
        -------
        ndarray of shape (n_documents, n_topics)
            Document-topic matrix.
        """
        if embeddings is None:
            embeddings = self.encoder_.encode(raw_documents)
        return self.decomposition.transform(embeddings)

    def print_topics(
        self,
        top_k: int = 5,
        show_scores: bool = False,
        show_negative: bool = True,
    ):
        super().print_topics(top_k, show_scores, show_negative)

    def export_topics(
        self,
        top_k: int = 5,
        show_scores: bool = False,
        show_negative: bool = True,
        format: str = "csv",
    ) -> str:
        return super().export_topics(top_k, show_scores, show_negative, format)

    def print_representative_documents(
        self,
        topic_id,
        raw_documents,
        document_topic_matrix=None,
        top_k=5,
        show_negative: bool = True,
    ):
        super().print_representative_documents(
            topic_id,
            raw_documents,
            document_topic_matrix,
            top_k,
            show_negative,
        )

    def export_representative_documents(
        self,
        topic_id,
        raw_documents,
        document_topic_matrix=None,
        top_k=5,
        show_negative: bool = True,
        format: str = "csv",
    ):
        return super().export_representative_documents(
            topic_id,
            raw_documents,
            document_topic_matrix,
            top_k,
            show_negative,
            format,
        )

    def concept_compass(
        self, topic_x: Union[int, str], topic_y: Union[str, int]
    ):
        """[DEPRECATED] will be removed in version 1.0.0.
        See plot_concept_compass().
        """
        warnings.warn(
            "concept_compass() is deprecated and will be removed in version 1.0.0. Use plot_concept_compass() instead."
        )
        return self.plot_concept_compass(topic_x, topic_y)

    def plot_concept_compass(
        self, topic_x: Union[int, str], topic_y: Union[str, int]
    ):
        """Display a compass of concepts along two semantic axes.
        In order for the plot to be concise and readable, terms are randomly selected on
        a grid of the two topics.

        Parameters
        ----------
        topic_x: int or str
            Index or name of the topic to display on the X axis.
        topic_y: int or str
            Index or name of the topic to display on the Y axis.

        Returns
        -------
        go.Figure
            Plotly interactive plot of the concept compass.
        """
        try:
            import plotly.express as px
        except (ImportError, ModuleNotFoundError) as e:
            raise ModuleNotFoundError(
                "Please install plotly if you intend to use plots in Turftopic."
            ) from e
        if isinstance(topic_x, str):
            try:
                topic_x = list(self.topic_names).index(topic_x)
            except ValueError as e:
                raise ValueError(
                    f"{topic_x} is not a valid topic name or index."
                ) from e
        if isinstance(topic_y, str):
            try:
                topic_y = list(self.topic_names).index(topic_y)
            except ValueError as e:
                raise ValueError(
                    f"{topic_y} is not a valid topic name or index."
                ) from e
        x = self.axial_components_[topic_x]
        y = self.axial_components_[topic_y]
        vocab = self.get_vocab()
        points = np.array(list(zip(x, y)))
        xx, yy = np.meshgrid(
            np.linspace(np.min(x), np.max(x), 20),
            np.linspace(np.min(y), np.max(y), 20),
        )
        coords = np.array(list(zip(np.ravel(xx), np.ravel(yy))))
        coords = coords + np.random.default_rng(0).normal(
            [0, 0], [0.1, 0.1], size=coords.shape
        )
        dist = euclidean_distances(coords, points)
        idxs = np.argmin(dist, axis=1)
        fig = px.scatter(
            x=x[idxs],
            y=y[idxs],
            text=vocab[idxs],
            template="plotly_white",
        )
        fig = fig.update_traces(
            mode="text", textfont_color="black", marker=dict(color="black")
        ).update_layout(
            xaxis_title=f"{self.topic_names[topic_x]}",
            yaxis_title=f"{self.topic_names[topic_y]}",
            font=dict(family="Roboto Mono"),
        )
        fig = fig.update_layout(
            font=dict(family="Roboto Mono", color="black", size=21),
            margin=dict(l=5, r=5, t=5, b=5),
        )
        fig = fig.add_hline(y=0, line_color="black", line_width=4)
        fig = fig.add_vline(x=0, line_color="black", line_width=4)
        return fig

    def plot_image_compass(
        self, topic_x: Union[str, int], topic_y: Union[str, int]
    ):
        try:
            import plotly.express as px
        except (ImportError, ModuleNotFoundError) as e:
            raise ModuleNotFoundError(
                "Please install plotly if you intend to use plots in Turftopic."
            ) from e
        top_images = getattr(self, "top_images", None)
        if top_images is None:
            raise ValueError(
                "Topic model is not multimodal. Can't plot image compass."
            )
        if isinstance(topic_x, str):
            try:
                topic_x = list(self.topic_names).index(topic_x)
            except ValueError as e:
                raise ValueError(
                    f"{topic_x} is not a valid topic name or index."
                ) from e
        if isinstance(topic_y, str):
            try:
                topic_y = list(self.topic_names).index(topic_y)
            except ValueError as e:
                raise ValueError(
                    f"{topic_y} is not a valid topic name or index."
                ) from e
        x = self.image_topic_matrix[:, topic_x]
        y = self.image_topic_matrix[:, topic_y]
        points = np.array(list(zip(x, y)))
        xx, yy = np.meshgrid(
            np.linspace(np.min(x), np.max(x), 8),
            np.linspace(np.min(y), np.max(y), 8),
        )
        coords = np.array(list(zip(np.ravel(xx), np.ravel(yy))))
        dist = euclidean_distances(coords, points)
        idxs = np.argmin(dist, axis=1)
        fig = px.scatter(
            x=x[idxs],
            y=y[idxs],
            template="plotly_white",
        )
        sizex = (max(x) - min(x)) / 10
        sizey = (max(y) - min(y)) / 10
        for i in np.unique(idxs):
            fig.add_layout_image(
                dict(
                    name=f"image{i}",
                    source=self.images[i],
                    x=x[i],
                    y=y[i],
                    xref="x",
                    yref="y",
                    xanchor="right",
                    yanchor="top",
                    layer="above",
                    sizex=sizex,
                    sizey=sizey,
                )
            )
        fig = fig.update_traces(
            mode="markers", textfont_color="black", marker=dict(opacity=0)
        ).update_layout(
            xaxis_title=f"{self.topic_names[topic_x]}",
            yaxis_title=f"{self.topic_names[topic_y]}",
            font=dict(family="Roboto Mono"),
        )
        fig = fig.update_layout(
            font=dict(family="Roboto Mono", color="black", size=21),
            margin=dict(l=5, r=5, t=5, b=5),
        )
        fig = fig.update_xaxes(range=[min(x), max(x)])
        fig = fig.update_yaxes(range=[min(y), max(y)])
        return fig

    def plot_topics_over_time(self, top_k: int = 6):
        try:
            import plotly.graph_objects as go
        except (ImportError, ModuleNotFoundError) as e:
            raise ModuleNotFoundError(
                "Please install plotly if you intend to use plots in Turftopic."
            ) from e
        fig = go.Figure()
        vocab = self.get_vocab()
        n_topics = self.temporal_components_.shape[1]
        try:
            topic_names = self.topic_names
        except AttributeError:
            topic_names = [f"Topic {i}" for i in range(n_topics)]
        for i_topic, topic_imp_t in enumerate(self.temporal_importance_.T):
            component_over_time = self.temporal_components_[:, i_topic, :]
            name_over_time = []
            for component, importance in zip(component_over_time, topic_imp_t):
                if importance < 0:
                    component = -component
                top = np.argpartition(-component, top_k)[:top_k]
                values = component[top]
                if np.all(values == 0) or np.all(np.isnan(values)):
                    name_over_time.append("<not present>")
                    continue
                top = top[np.argsort(-values)]
                name_over_time.append(", ".join(vocab[top]))
            times = self.time_bin_edges[:-1]
            fig.add_trace(
                go.Scatter(
                    x=times,
                    y=topic_imp_t,
                    mode="markers+lines",
                    text=name_over_time,
                    name=topic_names[i_topic],
                    hovertemplate="<b>%{text}</b>",
                    marker=dict(
                        line=dict(width=2, color="black"),
                        size=14,
                    ),
                    line=dict(width=3),
                )
            )
        fig.add_hline(y=0, line_dash="dash", opacity=0.5)
        fig.update_layout(
            template="plotly_white",
            hoverlabel=dict(font_size=16, bgcolor="white"),
            hovermode="x",
            font=dict(family="Roboto Mono"),
        )
        fig.update_xaxes(title="Time Slice Start")
        fig.update_yaxes(title="Topic Importance")
        return fig

    def _topics_over_time(
        self,
        top_k: int = 5,
        show_scores: bool = False,
        date_format: str = "%Y %m %d",
    ) -> list[list[str]]:
        temporal_components = self.temporal_components_
        slices = self.get_time_slices()
        slice_names = []
        for start_dt, end_dt in slices:
            start_str = start_dt.strftime(date_format)
            end_str = end_dt.strftime(date_format)
            slice_names.append(f"{start_str} - {end_str}")
        n_topics = self.temporal_components_.shape[1]
        try:
            topic_names = self.topic_names
        except AttributeError:
            topic_names = [f"Topic {i}" for i in range(n_topics)]
        columns = []
        rows = []
        columns.append("Time Slice")
        for topic in topic_names:
            columns.append(topic)
        for slice_name, components, weights in zip(
            slice_names, temporal_components, self.temporal_importance_
        ):
            fields = []
            fields.append(slice_name)
            vocab = self.get_vocab()
            for component, weight in zip(components, weights):
                if np.all(component == 0) or np.all(np.isnan(component)):
                    fields.append("Topic not present.")
                    continue
                if weight < 0:
                    component = -component
                top = np.argpartition(-component, top_k)[:top_k]
                importance = component[top]
                top = top[np.argsort(-importance)]
                # `top` is now sorted, so the zero mask must be computed in the same order.
                top = top[component[top] != 0]
                scores = component[top]
                words = vocab[top]
                if show_scores:
                    concat_words = ", ".join(
                        [
                            f"{word}({importance:.2f})"
                            for word, importance in zip(words, scores)
                        ]
                    )
                else:
                    concat_words = ", ".join([word for word in words])
                fields.append(concat_words)
            rows.append(fields)
        return [columns, *rows]
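
The top-term selection used in `plot_topics_over_time` and `_topics_over_time` can be illustrated in isolation. The following is a minimal sketch, not the library's code: `top_terms` and the toy `vocab` and `component` arrays are hypothetical names introduced here. It assumes `top_k` is smaller than the vocabulary size, which `np.argpartition` requires.

```python
import numpy as np


def top_terms(component, vocab, top_k=5, negate=False):
    """Return the top_k highest-scoring terms, best first (hypothetical helper)."""
    if negate:
        # A topic with negative importance in a time slice is read from its negative end.
        component = -component
    # argpartition finds the top_k largest entries without sorting the whole vocabulary.
    top = np.argpartition(-component, top_k)[:top_k]
    # Sort only the selected entries by descending score.
    top = top[np.argsort(-component[top])]
    # Drop terms that carry no weight, using a mask aligned with the sorted order.
    top = top[component[top] != 0]
    return vocab[top].tolist()


vocab = np.array(["cat", "dog", "tax", "law", "sun", "rain"])
component = np.array([0.1, 0.9, -0.8, 0.0, 0.5, -0.2])

positive_terms = top_terms(component, vocab, top_k=3)
negative_terms = top_terms(component, vocab, top_k=3, negate=True)
```

In the second call the zero-weight term `law` falls within the top three after negation and is then filtered out, so fewer than `top_k` terms are returned.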

`angular_components_` `property`

Reweights words based on their angle in ICA-space to the axis base vectors.

`angular_temporal_components_` `property`

Reweights words based on their angle in ICA-space to the axis base vectors in a dynamic model.

`concept_compass(topic_x, topic_y)`

[DEPRECATED] will be removed in version 1.0.0. See plot_concept_compass().

Source code in turftopic/models/decomp.py

def concept_compass(
    self, topic_x: Union[int, str], topic_y: Union[str, int]
):
    """[DEPRECATED] will be removed in version 1.0.0.
    See plot_concept_compass().
    """
    warnings.warn(
        "concept_compass() is deprecated and will be removed in version 1.0.0. Use plot_concept_compass() instead."
    )
    return self.plot_concept_compass(topic_x, topic_y)

`estimate_components(feature_importance)`

Reestimates components based on the chosen feature_importance method.

Source code in turftopic/models/decomp.py

def estimate_components(
    self, feature_importance: Literal["axial", "angular", "combined"]
) -> np.ndarray:
    """Reestimates components based on the chosen feature_importance method."""
    if feature_importance == "axial":
        self.components_ = self.axial_components_
    elif feature_importance == "angular":
        self.components_ = self.angular_components_
    elif feature_importance == "combined":
        self.components_ = (
            np.square(self.axial_components_) * self.angular_components_
        )
    if hasattr(self, "axial_temporal_components_"):
        if feature_importance == "axial":
            self.temporal_components_ = self.axial_temporal_components_
        elif feature_importance == "angular":
            self.temporal_components_ = self.angular_temporal_components_
        elif feature_importance == "combined":
            self.temporal_components_ = (
                np.square(self.axial_temporal_components_)
                * self.angular_temporal_components_
            )
    return self.components_
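
To make the three options concrete, here is a minimal NumPy sketch under stated assumptions. The functions `angular_from_axial` and `estimate` are hypothetical and not part of Turftopic. The angular weights are assumed to be the cosine of the angle between each word's projection and each axis base vector, which is one reading of the property description above; the library's actual computation may differ. Unlike the method above, the sketch raises on an unknown option instead of leaving the components unchanged.

```python
import numpy as np


def angular_from_axial(axial: np.ndarray) -> np.ndarray:
    """Assumed angular weighting: cosine between each word vector and each axis.

    `axial` has shape (n_topics, n_vocab); column j is word j in ICA-space.
    """
    norms = np.linalg.norm(axial, axis=0, keepdims=True)
    # Avoid division by zero for words projected onto the origin.
    norms = np.where(norms == 0, 1.0, norms)
    # Axis base vectors are unit vectors, so the cosine is coordinate / length.
    return axial / norms


def estimate(axial: np.ndarray, feature_importance: str) -> np.ndarray:
    """Pure-function analogue of the branching in estimate_components."""
    angular = angular_from_axial(axial)
    if feature_importance == "axial":
        return axial
    if feature_importance == "angular":
        return angular
    if feature_importance == "combined":
        # Squared axial magnitude, with the sign carried by the angular term.
        return np.square(axial) * angular
    raise ValueError(f"Unknown feature_importance: {feature_importance!r}")


# Toy projection: 2 topics, 3 words.
axial = np.array([[3.0, 0.0, 1.0],
                  [4.0, 2.0, -1.0]])
combined = estimate(axial, "combined")
```

Because the axial values are squared, the combined score takes its sign entirely from the angular term, so words on the negative end of an axis remain negative.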

`plot_concept_compass(topic_x, topic_y)`

Display a compass of concepts along two semantic axes. In order for the plot to be concise and readable, terms are randomly selected on a grid of the two topics.

Parameters:

Name	Type	Description	Default
`topic_x`	`Union[int, str]`	Index or name of the topic to display on the X axis.	required
`topic_y`	`Union[str, int]`	Index or name of the topic to display on the Y axis.	required

Returns:

Type	Description
`Figure`	Plotly interactive plot of the concept compass.

Source code in turftopic/models/decomp.py

def plot_concept_compass(
    self, topic_x: Union[int, str], topic_y: Union[str, int]
):
    """Display a compass of concepts along two semantic axes.
    In order for the plot to be concise and readable, terms are randomly selected on
    a grid of the two topics.

    Parameters
    ----------
    topic_x: int or str
        Index or name of the topic to display on the X axis.
    topic_y: int or str
        Index or name of the topic to display on the Y axis.

    Returns
    -------
    go.Figure
        Plotly interactive plot of the concept compass.
    """
    try:
        import plotly.express as px
    except (ImportError, ModuleNotFoundError) as e:
        raise ModuleNotFoundError(
            "Please install plotly if you intend to use plots in Turftopic."
        ) from e
    if isinstance(topic_x, str):
        try:
            topic_x = list(self.topic_names).index(topic_x)
        except ValueError as e:
            raise ValueError(
                f"{topic_x} is not a valid topic name or index."
            ) from e
    if isinstance(topic_y, str):
        try:
            topic_y = list(self.topic_names).index(topic_y)
        except ValueError as e:
            raise ValueError(
                f"{topic_y} is not a valid topic name or index."
            ) from e
    x = self.axial_components_[topic_x]
    y = self.axial_components_[topic_y]
    vocab = self.get_vocab()
    points = np.array(list(zip(x, y)))
    xx, yy = np.meshgrid(
        np.linspace(np.min(x), np.max(x), 20),
        np.linspace(np.min(y), np.max(y), 20),
    )
    coords = np.array(list(zip(np.ravel(xx), np.ravel(yy))))
    coords = coords + np.random.default_rng(0).normal(
        [0, 0], [0.1, 0.1], size=coords.shape
    )
    dist = euclidean_distances(coords, points)
    idxs = np.argmin(dist, axis=1)
    fig = px.scatter(
        x=x[idxs],
        y=y[idxs],
        text=vocab[idxs],
        template="plotly_white",
    )
    fig = fig.update_traces(
        mode="text", textfont_color="black", marker=dict(color="black")
    ).update_layout(
        xaxis_title=f"{self.topic_names[topic_x]}",
        yaxis_title=f"{self.topic_names[topic_y]}",
        font=dict(family="Roboto Mono"),
    )
    fig = fig.update_layout(
        font=dict(family="Roboto Mono", color="black", size=21),
        margin=dict(l=5, r=5, t=5, b=5),
    )
    fig = fig.add_hline(y=0, line_color="black", line_width=4)
    fig = fig.add_vline(x=0, line_color="black", line_width=4)
    return fig
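
The term selection behind the compass can be tried without Plotly. The sketch below is an illustration, not the library's API: `select_terms_on_grid` is a hypothetical helper and the random coordinates stand in for two rows of `axial_components_`. It follows the same steps as the source above (grid, seeded jitter, nearest term per grid point), but computes distances with NumPy broadcasting instead of `euclidean_distances`.

```python
import numpy as np


def select_terms_on_grid(x, y, n_steps=20, jitter=0.1, seed=0):
    """Pick, for each jittered grid point, the index of the nearest term."""
    points = np.column_stack([x, y])
    # Regular grid spanning the range of both topic axes.
    xx, yy = np.meshgrid(
        np.linspace(x.min(), x.max(), n_steps),
        np.linspace(y.min(), y.max(), n_steps),
    )
    coords = np.column_stack([xx.ravel(), yy.ravel()])
    # Seeded Gaussian jitter keeps the selection reproducible.
    coords = coords + np.random.default_rng(seed).normal(
        0.0, jitter, size=coords.shape
    )
    # Pairwise Euclidean distances, shape (n_grid_points, n_terms).
    diff = coords[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return np.argmin(dist, axis=1)


# Stand-in term coordinates on two topics (assumed data, 500 terms).
rng = np.random.default_rng(42)
x = rng.normal(size=500)
y = rng.normal(size=500)
idxs = select_terms_on_grid(x, y)
```

The same term can be nearest to several grid points, so `idxs` may contain repeated indices; `np.unique(idxs)` gives the distinct terms that would appear on the plot.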

`refit(n_components=None, max_iter=None, random_state=None)`

Refits the model with the given parameters, reusing the document and vocabulary embeddings stored on the fitted model. This is significantly faster than fitting a new model from scratch.

Parameters:

Name	Type	Description	Default
`n_components`	`Optional[int]`	Number of topics.	`None`
`max_iter`	`Optional[int]`	Maximum number of iterations for ICA.	`None`
`random_state`	`Optional[int]`	Random state to use so that results are exactly reproducible.	`None`

Returns:

Type	Description
`SemanticSignalSeparation`	Refitted model.

Source code in turftopic/models/decomp.py

def refit(
    self,
    n_components: Optional[int] = None,
    max_iter: Optional[int] = None,
    random_state: Optional[int] = None,
):
    """Refits model with the given parameters.
    This is significantly faster than fitting a new model from scratch.

    Parameters
    ----------
    n_components: int, default None
        Number of topics.
    max_iter: int, default None
        Maximum number of iterations for ICA.
    random_state: int, default None
        Random state to use so that results are exactly reproducible.

    Returns
    -------
    Refitted model.
    """
    self.refit_transform(n_components, max_iter, random_state)
    return self

`refit_transform(n_components=None, max_iter=None, random_state=None)`

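Refits the model with the given parameters, reusing the document and vocabulary embeddings stored on the fitted model, and returns the new document-topic matrix. This is significantly faster than fitting a new model from scratch.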

Parameters:

Name	Type	Description	Default
`n_components`	`Optional[int]`	Number of topics.	`None`
`max_iter`	`Optional[int]`	Maximum number of iterations for ICA.	`None`
`random_state`	`Optional[int]`	Random state to use so that results are exactly reproducible.	`None`

Returns:

Type	Description
`ndarray of shape (n_documents, n_topics)`	Document-topic matrix.

Source code in turftopic/models/decomp.py

def refit_transform(
    self,
    n_components: Optional[int] = None,
    max_iter: Optional[int] = None,
    random_state: Optional[int] = None,
):
    """Refits model with the given parameters.
    This is significantly faster than fitting a new model from scratch.

    Parameters
    ----------
    n_components: int, default None
        Number of topics.
    max_iter: int, default None
        Maximum number of iterations for ICA.
    random_state: int, default None
        Random state to use so that results are exactly reproducible.

    Returns
    -------
    ndarray of shape (n_documents, n_topics)
        Document-topic matrix.
    """
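    # Resolve the fallback before storing, so that omitting n_components
    # keeps the current number of topics instead of overwriting it with None.
    n_components = (
        n_components if n_components is not None else self.n_components
    )
    self.n_components = n_components
    self.topic_names_ = None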
    max_iter = max_iter if max_iter is not None else self.max_iter
    random_state = (
        random_state if random_state is not None else self.random_state
    )
    self.decomposition = FastICA(
        n_components, max_iter=max_iter, random_state=random_state
    )
    console = Console()
    with console.status("Refitting model") as status:
        status.update("Decomposing embeddings")
        doc_topic = self.decomposition.fit_transform(self.embeddings)
        console.log("Decomposition done.")
        status.update("Estimating term importances")
        vocab_topic = self.decomposition.transform(self.vocab_embeddings)
        self.axial_components_ = vocab_topic.T
        if self.feature_importance == "axial":
            self.components_ = self.axial_components_
        elif self.feature_importance == "angular":
            self.components_ = self.angular_components_
        elif self.feature_importance == "combined":
            self.components_ = (
                np.square(self.axial_components_)
                * self.angular_components_
            )
        console.log("Model fitting done.")
    return doc_topic

`refit_transform_dynamic(timestamps, bins=10, n_components=None, max_iter=None, random_state=None)`

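Refits \(S^3\) as a dynamic model: documents are assigned to time bins by their timestamps, and topic importances and term importances are then estimated separately for each bin.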

Source code in turftopic/models/decomp.py

def refit_transform_dynamic(
    self,
    timestamps: list[datetime],
    bins: Union[int, list[datetime]] = 10,
    n_components: Optional[int] = None,
    max_iter: Optional[int] = None,
    random_state: Optional[int] = None,
):
    """Refits $S^3$ to be a dynamic model."""
    document_topic_matrix = self.refit_transform(
        n_components=n_components,
        max_iter=max_iter,
        random_state=random_state,
    )
    time_labels, self.time_bin_edges = self.bin_timestamps(
        timestamps, bins
    )
    n_comp, n_vocab = self.components_.shape
    n_bins = len(self.time_bin_edges) - 1
    self.axial_temporal_components_ = np.full(
        (n_bins, n_comp, n_vocab),
        np.nan,
        dtype=self.components_.dtype,
    )
    self.temporal_importance_ = np.zeros((n_bins, n_comp))
    whitened_embeddings = np.copy(self.embeddings)
    if getattr(self.decomposition, "whiten"):
        whitened_embeddings -= self.decomposition.mean_
    # doc_topic = np.dot(X, self.components_.T)
    for i_timebin in np.unique(time_labels):
        topic_importances = document_topic_matrix[
            time_labels == i_timebin
        ].mean(axis=0)
        self.temporal_importance_[i_timebin, :] = topic_importances
        t_doc_topic = document_topic_matrix[time_labels == i_timebin]
        t_embeddings = whitened_embeddings[time_labels == i_timebin]
        linreg = LinearRegression().fit(t_embeddings, t_doc_topic)
        self.axial_temporal_components_[i_timebin, :, :] = np.dot(
            self.vocab_embeddings, linreg.coef_.T
        ).T
    self.estimate_components(self.feature_importance)
    return document_topic_matrix
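
The per-bin estimation in the loop above can be sketched with NumPy alone. Everything here is stand-in data under stated assumptions: the random `unmixing` matrix plays the role of a fitted ICA, the embeddings are treated as already centred, and `np.linalg.lstsq` with an explicit intercept column replaces scikit-learn's `LinearRegression`. It is an illustration of the idea, not the library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_dims, n_topics, n_vocab, n_bins = 60, 8, 3, 12, 4

# Stand-in data (assumptions): centred document embeddings, a linear unmixing,
# vocabulary embeddings, and one time-bin label per document.
embeddings = rng.normal(size=(n_docs, n_dims))
unmixing = rng.normal(size=(n_topics, n_dims))
doc_topic = embeddings @ unmixing.T
vocab_embeddings = rng.normal(size=(n_vocab, n_dims))
time_labels = np.repeat(np.arange(n_bins), n_docs // n_bins)

temporal_importance = np.zeros((n_bins, n_topics))
axial_temporal = np.full((n_bins, n_topics, n_vocab), np.nan)

for t in np.unique(time_labels):
    mask = time_labels == t
    # Topic importance in a bin is the mean document-topic score of its documents.
    temporal_importance[t] = doc_topic[mask].mean(axis=0)
    # Regress topic scores on embeddings within the bin (with an intercept column).
    design = np.column_stack([embeddings[mask], np.ones(mask.sum())])
    coef, *_ = np.linalg.lstsq(design, doc_topic[mask], rcond=None)
    weights = coef[:-1]  # shape (n_dims, n_topics); the intercept row is dropped
    # Project the vocabulary with the bin-specific weights.
    axial_temporal[t] = (vocab_embeddings @ weights).T
```

In this toy setup the topic scores are an exact linear function of the embeddings, so every bin recovers the same projection. With real data, the regression is fitted only on the documents of each bin, which is what lets term importances differ between time slices.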

`transform(raw_documents, embeddings=None)`

Infers topic importances for new documents based on a fitted model.

Parameters:

Name	Type	Description	Default
`raw_documents`	`Iterable[str]`	Documents to infer topic importances for.	required
`embeddings`	`Optional[ndarray]`	Precomputed document encodings.	`None`

Returns:

Type	Description
`ndarray of shape (n_documents, n_topics)`	Document-topic matrix.

Source code in turftopic/models/decomp.py

def transform(
    self, raw_documents, embeddings: Optional[np.ndarray] = None
) -> np.ndarray:
    """Infers topic importances for new documents based on a fitted model.

    Parameters
    ----------
    raw_documents: iterable of str
        Documents to infer topic importances for.
    embeddings: ndarray of shape (n_documents, n_dimensions), optional
        Precomputed document encodings.

    Returns
    -------
    ndarray of shape (n_documents, n_topics)
        Document-topic matrix.
    """
    if embeddings is None:
        embeddings = self.encoder_.encode(raw_documents)
    return self.decomposition.transform(embeddings)

