Topeax

Topeax is a probabilistic topic model based on the Peax clustering model, which finds topics based on peaks in point density in the embedding space. The model can recover the number of topics automatically.

The following example fits a Topeax model on the BBC News corpus and plots the steps of the algorithm, so that you can inspect how the documents were clustered and why:

# pip install datasets plotly
from datasets import load_dataset
from turftopic import Topeax

ds = load_dataset("gopalkalpande/bbc-news-summary", split="train")
corpus = list(ds["Summaries"])
topeax = Topeax(random_state=42)
doc_topic = topeax.fit_transform(corpus)

fig = topeax.plot_steps(hover_text=[text[:200] for text in corpus])
fig.show()

Figure 1: Steps in a Topeax model fitted on BBC News displayed on an interactive graph.

topeax.print_topics()

Topic ID	Highest Ranking
0	mobile, microsoft, digital, technology, broadband, phones, devices, internet, mobiles, computer
1	economy, growth, economic, deficit, prices, gdp, inflation, currency, rates, exports
2	profits, shareholders, shares, takeover, shareholder, company, profit, merger, investors, financial
3	film, actor, oscar, films, actress, oscars, bafta, movie, awards, actors
4	band, album, song, singer, concert, rock, songs, rapper, rap, grammy
5	tory, blair, labour, ukip, mps, minister, election, tories, mr, ministers
6	olympic, tennis, iaaf, federer, wimbledon, doping, roddick, champion, athletics, olympics
7	rugby, liverpool, england, mourinho, chelsea, premiership, arsenal, gerrard, hodgson, gareth

How does Topeax work?

The Topeax algorithm, like clustering topic models, consists of two consecutive steps. The first discovers the underlying clusters in the data; the second estimates term importance scores for each topic in the corpus.

Figure 2: Schematic overview of the steps of the Peax clustering algorithm

1. Clustering

Document embeddings are first projected into two-dimensional space using t-SNE. To identify clusters, we calculate a Kernel Density Estimate (KDE) over the reduced embedding space, then find local maxima in the KDE by grid approximation. These local maxima (peaks) are taken to be the cluster means. Cluster density is then approximated with a Gaussian mixture, where the means are fixed to the density peaks and the remaining parameters are fitted with expectation-maximization (see Figure 2). Documents are then assigned to the component with the highest responsibility:

\[\hat{z}_d = \arg\max_k r_{kd}, \quad r_{kd} = p(z_d = k \mid \hat{x}_d)\]

where \(\hat{z}_d\) is the estimated cluster label for document \(d\), \(z_d\) is its latent cluster label, \(r_{kd}\) is the responsibility of component \(k\) for document \(d\) and \(\hat{x}_d\) is the 2D embedding of document \(d\).
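
The assignment rule can be sketched in a few lines of NumPy. This is only an illustration, not Turftopic's implementation: it assumes the peaks have already been found, it uses isotropic components with a shared variance instead of covariances fitted by expectation-maximization, and the names `responsibilities`, `peaks` and `points` are hypothetical.

```python
import numpy as np


def responsibilities(points, means, weights, variance=1.0):
    """Posterior p(z_d = k | x_d) for isotropic Gaussian components (simplifying assumption)."""
    # Squared distance from every point to every mean: shape (n_points, n_components)
    sq_dist = ((points[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    # Log of weight * Gaussian density, dropping constants shared by all components
    log_joint = np.log(weights)[None, :] - sq_dist / (2 * variance)
    # Normalise across components in a numerically stable way
    log_joint -= log_joint.max(axis=1, keepdims=True)
    joint = np.exp(log_joint)
    return joint / joint.sum(axis=1, keepdims=True)


# Two made-up density peaks in the 2D space, used as fixed component means
peaks = np.array([[0.0, 0.0], [5.0, 5.0]])
weights = np.array([0.5, 0.5])
points = np.array([[0.2, -0.1], [4.8, 5.3], [1.0, 1.0]])

r = responsibilities(points, peaks, weights)
# Hard assignment: each point goes to the component with the highest responsibility
labels = r.argmax(axis=1)
```

Each row of `r` sums to one, so it can be read as a soft cluster membership, while `labels` holds the hard assignment.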

2. Term Importance Estimation

Topeax uses a combined semantic-lexical term importance, which is the geometric mean of the NPMI method (see Clustering Topic Models for more detail) and a slightly modified centroid-based method. The modified centroids are calculated like so:

\[t_k = \frac{\sum_d r_{kd} \cdot x_d}{\sum_d r_{kd}}\]

where \(t_k\) is the embedding of topic \(k\) and \(x_d\) is the embedding of document \(d\).
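
The two computations involved, the responsibility-weighted centroid and the geometric mean of the rescaled scores, can be sketched with NumPy as follows. All array values are made up for illustration, and the `lexical` matrix is a stand-in for NPMI scores, which are not computed here.

```python
import numpy as np

# Toy inputs (all values are made up for illustration)
doc_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # x_d, shape (n_docs, dim)
resp = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # r_kd, shape (n_docs, n_topics)
term_embeddings = np.array([[1.0, 0.0], [0.0, 1.0]])  # shape (n_terms, dim)
lexical = np.array([[0.6, -0.2], [-0.2, 0.6]])  # stand-in for NPMI, values in [-1, 1]

# t_k = sum_d r_kd * x_d / sum_d r_kd
topic_embeddings = (resp.T @ doc_embeddings) / resp.sum(axis=0)[:, None]


def l2_normalize(a):
    """Scale every row to unit length."""
    return a / np.linalg.norm(a, axis=1, keepdims=True)


# Cosine similarity between topic and term embeddings, values in [-1, 1]
semantic = l2_normalize(topic_embeddings) @ l2_normalize(term_embeddings).T

# Rescale both scores to [0, 1], then take their geometric mean
importance = np.sqrt(((1 + lexical) / 2) * ((1 + semantic) / 2))
```

Because the result is a geometric mean, a term only ranks highly for a topic when both its lexical and its semantic score are high.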

Visualization

Topeax has a number of plots available that can aid you when interpreting your results:

Density Plots

One can plot the kernel density estimate on both a 2D and a 3D plot.

topeax.plot_density()

Figure 3: Density contour plot of the Topeax model.

topeax.plot_density_3d()

Figure 4: 3D density surface of the Topeax model.

Component Plots

You can also create a plot over the mixture components/clusters found by the model.

topeax.plot_components()

Figure 5: Gaussian components estimated for the model.

You can also create a datamapplot figure similar to clustering models:

# pip install turftopic[datamapplot]
topeax.plot_components_datamapplot()

Figure 6: Datapoints colored by mixture components on a datamapplot.

API Reference

`turftopic.models.topeax.Topeax`

Bases: GMM

Topic model based on the Peax clustering algorithm. The algorithm discovers the number of topics automatically, and is based on GMM.

Parameters:

Name	Type	Description	Default
`encoder`	`Union[Encoder, str, MultimodalEncoder]`	Model to encode documents/terms, all-MiniLM-L6-v2 is the default.	`'sentence-transformers/all-MiniLM-L6-v2'`
`vectorizer`	`Optional[CountVectorizer]`	Vectorizer used for term extraction. Can be used to prune or filter the vocabulary.	`None`
`perplexity`	`int`	Number of neighbours to take into account when running TSNE.	`50`
`random_state`	`Optional[int]`	Random state to use so that results are exactly reproducible.	`None`

Source code in turftopic/models/topeax.py

class Topeax(GMM):
    """Topic model based on the Peax clustering algorithm.
    The algorithm discovers the number of topics automatically, and is based on GMM.

    Parameters
    ----------
    encoder: str or SentenceTransformer
        Model to encode documents/terms, all-MiniLM-L6-v2 is the default.
    vectorizer: CountVectorizer, default None
        Vectorizer used for term extraction.
        Can be used to prune or filter the vocabulary.
    perplexity: int, default 50
        Number of neighbours to take into account when running TSNE.
    random_state: int, default None
        Random state to use so that results are exactly reproducible.

    """

    def __init__(
        self,
        encoder: Union[
            Encoder, str, MultimodalEncoder
        ] = "sentence-transformers/all-MiniLM-L6-v2",
        vectorizer: Optional[CountVectorizer] = None,
        perplexity: int = 50,
        random_state: Optional[int] = None,
    ):
        dimensionality_reduction = TSNE(
            2,
            metric="cosine",
            perplexity=perplexity,
            random_state=random_state,
        )
        self.perplexity = perplexity
        super().__init__(
            n_components=0,
            encoder=encoder,
            vectorizer=vectorizer,
            dimensionality_reduction=dimensionality_reduction,
            random_state=random_state,
        )

    def estimate_components(
        self,
        feature_importance: Optional[LexicalWordImportance] = None,
        doc_topic_matrix=None,
        doc_term_matrix=None,
    ) -> np.ndarray:
        doc_topic_matrix = (
            doc_topic_matrix
            if doc_topic_matrix is not None
            else self.doc_topic_matrix
        )
        doc_term_matrix = (
            doc_term_matrix
            if doc_term_matrix is not None
            else self.doc_term_matrix
        )
        lexical_components = super().estimate_components(
            "npmi", doc_topic_matrix, doc_term_matrix
        )
        vocab = self.get_vocab()
        if getattr(self, "vocab_embeddings", None) is None or (
            self.vocab_embeddings.shape[0] != vocab.shape[0]
        ):
            self.vocab_embeddings = self.encode_documents(vocab)
        topic_embeddings = []
        for weight in doc_topic_matrix.T:
            topic_embeddings.append(
                np.average(self.embeddings, axis=0, weights=weight)
            )
        self.topic_embeddings = np.stack(topic_embeddings)
        semantic_components = cosine_similarity(
            self.topic_embeddings, self.vocab_embeddings
        )
        # Transforming to positive values from 0 to 1
        # Then taking geometric average of the two values
        self.components_ = np.sqrt(
            ((1 + lexical_components) / 2) * ((1 + semantic_components) / 2)
        )
        return self.components_

    def _init_model(self, n_components: int):
        mixture = Peax()
        return mixture

    def plot_steps(self, hover_text=None):
        try:
            import plotly.express as px
            from plotly.subplots import make_subplots
        except (ImportError, ModuleNotFoundError) as e:
            raise ModuleNotFoundError(
                "Please install plotly if you intend to use plots in Turftopic."
            ) from e
        dens_3d = self.plot_density_3d()
        component_plot = self.plot_components(
            show_points=True, hover_text=hover_text
        )
        points_plot = px.scatter(
            x=self.reduced_embeddings[:, 0],
            y=self.reduced_embeddings[:, 1],
            template="plotly_white",
        )
        points_plot = points_plot.update_layout(
            margin=dict(l=0, r=0, b=0, t=0),
        )
        points_plot = points_plot.update_traces(
            marker=dict(
                color="#B7B7FF",
                size=6,
                opacity=0.5,
                line=dict(color="#01014B", width=2),
            )
        )
        colormap = {
            name: color
            for name, color in zip(
                self.topic_names, px.colors.qualitative.Dark24
            )
        }
        bar = px.bar(
            y=self.topic_names,
            x=self.weights_,
            template="plotly_white",
            color_discrete_map=colormap,
            color=self.topic_names,
            text=[f"{p:.2f}" for p in self.weights_],
        )
        bar = bar.update_traces(
            marker_line_color="black",
            marker_line_width=1.5,
            opacity=0.8,
        )

        def update_annotation(a):
            name = a.text.removeprefix("<b>").split("<")[0]
            return a.update(
                # text=name,
                font=dict(size=8, color=colormap[name]),
                arrowsize=1,
                arrowhead=1,
                arrowwidth=1,
                bgcolor=None,
                opacity=0.7,
                # bgcolor=colormap[name],
                bordercolor=colormap[name],
                borderwidth=0,
            )

        fig = make_subplots(
            horizontal_spacing=0.0,
            vertical_spacing=0.1,
            rows=2,
            cols=2,
            subplot_titles=[
                "t-SNE Embeddings",
                "Peaks in Kernel Density Estimate",
                "Gaussian Mixture Approximation",
                "Component Probabilities",
            ],
            specs=[
                [
                    {"type": "xy"},
                    {"type": "surface"},
                ],
                [
                    {"type": "xy"},
                    {"type": "bar"},
                ],
            ],
        )
        for i, sub in enumerate([points_plot, dens_3d, component_plot, bar]):
            row = i // 2
            col = i % 2
            for trace in sub.data:
                fig.add_trace(trace, row=row + 1, col=col + 1)
            for shape in sub.layout.shapes:
                fig.add_shape(shape, row=row + 1, col=col + 1)
        fig = fig.update_layout(
            template="plotly_white",
            font=dict(family="Merriweather", size=14, color="black"),
            width=1200,
            height=800,
            autosize=False,
            margin=dict(r=0, l=0, t=40, b=0),
        )
        fig = fig.update_scenes(
            annotations=[
                update_annotation(annotation)
                for annotation in dens_3d.layout.scene.annotations
            ],
            col=2,
            row=1,
        )
        fig = fig.for_each_annotation(lambda a: a.update(yshift=0))
        fig = fig.update_yaxes(visible=False, row=2, col=2)
        fig = fig.update_xaxes(
            title=dict(text="$P(z)$", font=dict(size=16)), row=2, col=2
        )
        return fig

`turftopic.models.topeax.Peax`

Bases: ClusterMixin, BaseEstimator

Clustering model based on density peaks.

Parameters:

Name	Type	Description	Default
`random_state`	`Optional[int]`	Random seed to use for fitting gaussian mixture to peaks.	`None`

Source code in turftopic/models/topeax.py

class Peax(ClusterMixin, BaseEstimator):
    """Clustering model based on density peaks.

    Parameters
    ----------
    random_state: int, default None
        Random seed to use for fitting gaussian mixture to peaks.
    """

    def __init__(self, random_state: Optional[int] = None):
        self.random_state = random_state

    def fit(self, X, y=None):
        self.X_range = np.min(X), np.max(X)
        self.density = gaussian_kde(X.T, "scott")
        coord = np.linspace(*self.X_range, num=100)
        z = []
        for yval in coord:
            points = np.stack([coord, np.full(coord.shape, yval)]).T
            prob = np.exp(self.density.logpdf(points.T))
            z.append(prob)
        z = np.stack(z)
        peaks = detect_peaks(z.T)
        peak_ind = np.nonzero(peaks)
        peak_pos = np.stack([coord[peak_ind[0]], coord[peak_ind[1]]]).T
        weights = self.density.pdf(peak_pos.T)
        weights = weights / weights.sum()
        self.gmm_ = FixedMeanGaussianMixture(
            peak_pos.shape[0],
            means_init=peak_pos,
            weights_init=weights,
            random_state=self.random_state,
        )
        self.labels_ = self.gmm_.fit_predict(X)
        # Checking whether there are close to zero components
        is_zero = np.isclose(self.gmm_.weights_, 0)
        n_zero = np.sum(is_zero)
        if n_zero > 0:
            print(
                f"{n_zero} components have zero weight, removing them and refitting."
            )
        peak_pos = peak_pos[~is_zero]
        weights = self.gmm_.weights_[~is_zero]
        weights = weights / weights.sum()
        self.gmm_ = FixedMeanGaussianMixture(
            peak_pos.shape[0],
            means_init=peak_pos,
            weights_init=weights,
            random_state=self.random_state,
        )
        self.labels_ = self.gmm_.fit_predict(X)
        self.classes_ = np.sort(np.unique(self.labels_))
        self.means_ = self.gmm_.means_
        self.weights_ = self.gmm_.weights_
        self.covariances_ = self.gmm_.covariances_
        return self.labels_

    @property
    def n_components(self) -> int:
        return self.gmm_.n_components

    def predict_proba(self, X):
        return self.gmm_.predict_proba(X)

    def score_samples(self, X):
        return self.density.logpdf(X.T)

    def score(self, X):
        return np.mean(self.score_samples(X))
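
`Peax.fit` calls a `detect_peaks` helper that is not part of the listing above. The following is a hypothetical stand-in rather than the library's implementation: it assumes that a peak is a grid cell with positive density that is no smaller than any of its eight neighbours. The function name `detect_peaks_sketch` and the toy density are made up for illustration.

```python
import numpy as np


def detect_peaks_sketch(z):
    """Boolean mask of grid cells that are local maxima of a 2D float array."""
    # Pad with -inf so that border cells can be compared against 8 neighbours too
    padded = np.pad(z, 1, mode="constant", constant_values=-np.inf)
    n_rows, n_cols = z.shape
    is_peak = z > 0
    for d_row in (-1, 0, 1):
        for d_col in (-1, 0, 1):
            if d_row == 0 and d_col == 0:
                continue
            # View of the grid shifted by (d_row, d_col)
            neighbour = padded[
                1 + d_row : 1 + d_row + n_rows,
                1 + d_col : 1 + d_col + n_cols,
            ]
            is_peak &= z >= neighbour
    return is_peak


# Toy density with two bumps, evaluated on a regular grid (first axis is x)
coord = np.linspace(-5, 5, 51)
xx, yy = np.meshgrid(coord, coord, indexing="ij")
density = np.exp(-((xx + 2) ** 2 + (yy + 2) ** 2)) + np.exp(
    -((xx - 2) ** 2 + (yy - 2) ** 2)
)

peaks = detect_peaks_sketch(density)
rows, cols = np.nonzero(peaks)
# Map grid indices back to coordinates, as done for peak_pos in Peax.fit
peak_pos = np.stack([coord[rows], coord[cols]]).T
```

With this toy density the mask marks one cell per bump, and `peak_pos` holds the grid coordinates closest to the two bump centres. Note that a perfectly flat plateau would mark every cell on it as a peak under this definition, which a production implementation would need to handle.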