
Topeax

Topeax is a probabilistic topic model built on the Peax clustering model, which finds topics as peaks in point density in the embedding space. The model can recover the number of topics automatically.

In the following example, I run a Topeax model on the BBC News corpus and plot the steps of the algorithm to inspect how the documents have been clustered and why:

# pip install datasets plotly
from datasets import load_dataset
from turftopic import Topeax

ds = load_dataset("gopalkalpande/bbc-news-summary", split="train")
corpus = list(ds["Summaries"])

topeax = Topeax(random_state=42)
doc_topic = topeax.fit_transform(corpus)

fig = topeax.plot_steps(hover_text=[text[:200] for text in corpus])
fig.show()
Figure 1: Steps in a Topeax model fitted on BBC News displayed on an interactive graph.
topeax.print_topics()
| Topic ID | Highest Ranking |
| --- | --- |
| 0 | mobile, microsoft, digital, technology, broadband, phones, devices, internet, mobiles, computer |
| 1 | economy, growth, economic, deficit, prices, gdp, inflation, currency, rates, exports |
| 2 | profits, shareholders, shares, takeover, shareholder, company, profit, merger, investors, financial |
| 3 | film, actor, oscar, films, actress, oscars, bafta, movie, awards, actors |
| 4 | band, album, song, singer, concert, rock, songs, rapper, rap, grammy |
| 5 | tory, blair, labour, ukip, mps, minister, election, tories, mr, ministers |
| 6 | olympic, tennis, iaaf, federer, wimbledon, doping, roddick, champion, athletics, olympics |
| 7 | rugby, liverpool, england, mourinho, chelsea, premiership, arsenal, gerrard, hodgson, gareth |

How does Topeax work?

The Topeax algorithm, similarly to other clustering topic models, consists of two consecutive steps: one discovers the underlying clusters in the data, while the other estimates term importance scores for each topic in the corpus.


Figure 2: Schematic overview of the steps of the Peax clustering algorithm

1. Clustering

Document embeddings first get projected into two-dimensional space using t-SNE. To identify clusters, we first calculate a kernel density estimate (KDE) over the embedding space, then find local maxima in the KDE by grid approximation. The local maxima (peaks) we discover are assumed to be cluster means. Cluster density is then approximated with a Gaussian mixture, in which the means are fixed to the density peaks while the rest of the parameters are fitted with expectation-maximization (see Figure 2). Documents are then assigned to the component with the highest responsibility:

\[\hat{z}_d = \arg\max_k \; r_{kd}, \qquad r_{kd} = p(z_k = 1 \mid \hat{x}_d)\]

where \(\hat{z}_d\) is the cluster label of document \(d\), \(r_{kd}\) is the responsibility of component \(k\) for document \(d\), and \(\hat{x}_d\) is the 2D embedding of document \(d\).
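
To make the assignment rule concrete, here is a minimal sketch using scikit-learn's plain GaussianMixture in place of Topeax's fixed-mean mixture; the data and component count are illustrative stand-ins, not the library's internals:

import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in for 2D t-SNE document embeddings (illustrative data)
rng = np.random.default_rng(42)
X_hat = rng.normal(size=(500, 2))

# In Topeax the means are fixed to KDE peaks; a plain GMM suffices
# to illustrate the assignment rule itself
gmm = GaussianMixture(n_components=3, random_state=42).fit(X_hat)

# r_{kd} = p(z_k = 1 | x_hat_d): responsibility of component k for document d
responsibilities = gmm.predict_proba(X_hat)  # shape: (n_docs, n_components)

# z_hat_d = argmax_k r_{kd}: hard cluster assignment per document
labels = responsibilities.argmax(axis=1)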

2. Term Importance Estimation

Topeax uses a combined semantic-lexical term importance score: the geometric mean of scores from the NPMI method (see Clustering Topic Models for more detail) and from a slightly modified centroid-based method. The modified centroids are calculated as follows:

\[t_k = \frac{\sum_d r_{kd} \cdot x_d}{\sum_d r_{kd}}\]

where \(t_k\) is the embedding of topic \(k\) and \(x_d\) is the embedding of document \(d\).
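
The combination can be sketched in a few lines of NumPy. All array names and shapes below are illustrative stand-ins, not the library's internals:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative shapes: 500 documents, 4 topics, 1000 terms, 384-dim embeddings
rng = np.random.default_rng(0)
doc_topic = rng.dirichlet(np.ones(4), size=500)     # responsibilities r_{kd}
doc_embeddings = rng.normal(size=(500, 384))        # document embeddings x_d
vocab_embeddings = rng.normal(size=(1000, 384))     # term embeddings
npmi_scores = rng.uniform(-1, 1, size=(4, 1000))    # lexical (NPMI) components

# t_k = sum_d r_{kd} x_d / sum_d r_{kd}: responsibility-weighted centroids
topic_embeddings = (doc_topic.T @ doc_embeddings) / doc_topic.sum(axis=0)[:, None]

# Semantic components: cosine similarity of topic and term embeddings
semantic = cosine_similarity(topic_embeddings, vocab_embeddings)

# Shift both score sets from [-1, 1] to [0, 1], then take their geometric mean
components = np.sqrt(((1 + npmi_scores) / 2) * ((1 + semantic) / 2))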

Visualization

Topeax has a number of plots available that can aid you in interpreting your results:

Density Plots

You can plot the kernel density estimate as both a 2D contour and a 3D surface.

topeax.plot_density()
Figure 3: Density contour plot of the Topeax model.
topeax.plot_density_3d()
Figure 4: 3D density surface of the Topeax model.

Component Plots

You can also create a plot over the mixture components/clusters found by the model.

topeax.plot_components()
Figure 5: Gaussian components estimated for the model.

You can also create a datamapplot figure similar to clustering models:

# pip install turftopic[datamapplot]
topeax.plot_components_datamapplot()
Figure 6: Datapoints colored by mixture components on a datamapplot.

API Reference

turftopic.models.topeax.Topeax

Bases: GMM

Topic model based on the Peax clustering algorithm. The algorithm discovers the number of topics automatically, and is based on GMM.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| encoder | Union[Encoder, str, MultimodalEncoder] | Model to encode documents/terms; all-MiniLM-L6-v2 is the default. | 'sentence-transformers/all-MiniLM-L6-v2' |
| vectorizer | Optional[CountVectorizer] | Vectorizer used for term extraction. Can be used to prune or filter the vocabulary. | None |
| perplexity | int | Number of neighbours to take into account when running t-SNE. | 50 |
| random_state | Optional[int] | Random state to use so that results are exactly reproducible. | None |
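
For instance, a model with a pruned vocabulary and a lower t-SNE perplexity might be constructed like this (the parameter values are purely illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from turftopic import Topeax

model = Topeax(
    vectorizer=CountVectorizer(stop_words="english", min_df=5),
    perplexity=30,
    random_state=42,
)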
Source code in turftopic/models/topeax.py
class Topeax(GMM):
    """Topic model based on the Peax clustering algorithm.
    The algorithm discovers the number of topics automatically, and is based on GMM.

    Parameters
    ----------
    encoder: str or SentenceTransformer
        Model to encode documents/terms, all-MiniLM-L6-v2 is the default.
    vectorizer: CountVectorizer, default None
        Vectorizer used for term extraction.
        Can be used to prune or filter the vocabulary.
    perplexity: int, default 50
        Number of neighbours to take into account when running TSNE.
    random_state: int, default None
        Random state to use so that results are exactly reproducible.

    """

    def __init__(
        self,
        encoder: Union[
            Encoder, str, MultimodalEncoder
        ] = "sentence-transformers/all-MiniLM-L6-v2",
        vectorizer: Optional[CountVectorizer] = None,
        perplexity: int = 50,
        random_state: Optional[int] = None,
    ):
        dimensionality_reduction = TSNE(
            2,
            metric="cosine",
            perplexity=perplexity,
            random_state=random_state,
        )
        self.perplexity = perplexity
        super().__init__(
            n_components=0,
            encoder=encoder,
            vectorizer=vectorizer,
            dimensionality_reduction=dimensionality_reduction,
            random_state=random_state,
        )

    def estimate_components(
        self,
        feature_importance: Optional[LexicalWordImportance] = None,
        doc_topic_matrix=None,
        doc_term_matrix=None,
    ) -> np.ndarray:
        doc_topic_matrix = (
            doc_topic_matrix
            if doc_topic_matrix is not None
            else self.doc_topic_matrix
        )
        doc_term_matrix = (
            doc_term_matrix
            if doc_term_matrix is not None
            else self.doc_term_matrix
        )
        lexical_components = super().estimate_components(
            "npmi", doc_topic_matrix, doc_term_matrix
        )
        vocab = self.get_vocab()
        if getattr(self, "vocab_embeddings", None) is None or (
            self.vocab_embeddings.shape[0] != vocab.shape[0]
        ):
            self.vocab_embeddings = self.encode_documents(vocab)
        topic_embeddings = []
        for weight in doc_topic_matrix.T:
            topic_embeddings.append(
                np.average(self.embeddings, axis=0, weights=weight)
            )
        self.topic_embeddings = np.stack(topic_embeddings)
        semantic_components = cosine_similarity(
            self.topic_embeddings, self.vocab_embeddings
        )
        # Transforming to positive values from 0 to 1
        # Then taking geometric average of the two values
        self.components_ = np.sqrt(
            ((1 + lexical_components) / 2) * ((1 + semantic_components) / 2)
        )
        return self.components_

    def _init_model(self, n_components: int):
        mixture = Peax()
        return mixture

    def plot_steps(self, hover_text=None):
        try:
            import plotly.express as px
            from plotly.subplots import make_subplots
        except (ImportError, ModuleNotFoundError) as e:
            raise ModuleNotFoundError(
                "Please install plotly if you intend to use plots in Turftopic."
            ) from e
        dens_3d = self.plot_density_3d()
        component_plot = self.plot_components(
            show_points=True, hover_text=hover_text
        )
        points_plot = px.scatter(
            x=self.reduced_embeddings[:, 0],
            y=self.reduced_embeddings[:, 1],
            template="plotly_white",
        )
        points_plot = points_plot.update_layout(
            margin=dict(l=0, r=0, b=0, t=0),
        )
        points_plot = points_plot.update_traces(
            marker=dict(
                color="#B7B7FF",
                size=6,
                opacity=0.5,
                line=dict(color="#01014B", width=2),
            )
        )
        colormap = {
            name: color
            for name, color in zip(
                self.topic_names, px.colors.qualitative.Dark24
            )
        }
        bar = px.bar(
            y=self.topic_names,
            x=self.weights_,
            template="plotly_white",
            color_discrete_map=colormap,
            color=self.topic_names,
            text=[f"{p:.2f}" for p in self.weights_],
        )
        bar = bar.update_traces(
            marker_line_color="black",
            marker_line_width=1.5,
            opacity=0.8,
        )

        def update_annotation(a):
            name = a.text.removeprefix("<b>").split("<")[0]
            return a.update(
                # text=name,
                font=dict(size=8, color=colormap[name]),
                arrowsize=1,
                arrowhead=1,
                arrowwidth=1,
                bgcolor=None,
                opacity=0.7,
                # bgcolor=colormap[name],
                bordercolor=colormap[name],
                borderwidth=0,
            )

        fig = make_subplots(
            horizontal_spacing=0.0,
            vertical_spacing=0.1,
            rows=2,
            cols=2,
            subplot_titles=[
                "t-SN Embeddings",
                "Peaks in Kernel Density Estimate",
                "Gaussian Mixture Approximation",
                "Component Probabilities",
            ],
            specs=[
                [
                    {"type": "xy"},
                    {"type": "surface"},
                ],
                [
                    {"type": "xy"},
                    {"type": "bar"},
                ],
            ],
        )
        for i, sub in enumerate([points_plot, dens_3d, component_plot, bar]):
            row = i // 2
            col = i % 2
            for trace in sub.data:
                fig.add_trace(trace, row=row + 1, col=col + 1)
            for shape in sub.layout.shapes:
                fig.add_shape(shape, row=row + 1, col=col + 1)
        fig = fig.update_layout(
            template="plotly_white",
            font=dict(family="Merriweather", size=14, color="black"),
            width=1200,
            height=800,
            autosize=False,
            margin=dict(r=0, l=0, t=40, b=0),
        )
        fig = fig.update_scenes(
            annotations=[
                update_annotation(annotation)
                for annotation in dens_3d.layout.scene.annotations
            ],
            col=2,
            row=1,
        )
        fig = fig.for_each_annotation(lambda a: a.update(yshift=0))
        fig = fig.update_yaxes(visible=False, row=2, col=2)
        fig = fig.update_xaxes(
            title=dict(text="$P(z)$", font=dict(size=16)), row=2, col=2
        )
        return fig

turftopic.models.topeax.Peax

Bases: ClusterMixin, BaseEstimator

Clustering model based on density peaks.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| random_state | Optional[int] | Random seed to use for fitting a Gaussian mixture to the peaks. | None |
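
Since Peax follows the scikit-learn clusterer interface, it can in principle be used on its own; here is a minimal sketch on synthetic 2D data, assuming the import path shown above:

import numpy as np
from sklearn.datasets import make_blobs
from turftopic.models.topeax import Peax

# Synthetic 2D point cloud with three density peaks
X, _ = make_blobs(n_samples=600, centers=3, random_state=42)

peax = Peax(random_state=42)
labels = peax.fit_predict(X)   # fit_predict is provided by ClusterMixin
print(peax.n_components)       # number of peaks the model discovered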
Source code in turftopic/models/topeax.py
class Peax(ClusterMixin, BaseEstimator):
    """Clustering model based on density peaks.

    Parameters
    ----------
    random_state: int, default None
        Random seed to use for fitting gaussian mixture to peaks.
    """

    def __init__(self, random_state: Optional[int] = None):
        self.random_state = random_state

    def fit(self, X, y=None):
        self.X_range = np.min(X), np.max(X)
        self.density = gaussian_kde(X.T, "scott")
        coord = np.linspace(*self.X_range, num=100)
        z = []
        for yval in coord:
            points = np.stack([coord, np.full(coord.shape, yval)]).T
            prob = np.exp(self.density.logpdf(points.T))
            z.append(prob)
        z = np.stack(z)
        peaks = detect_peaks(z.T)
        peak_ind = np.nonzero(peaks)
        peak_pos = np.stack([coord[peak_ind[0]], coord[peak_ind[1]]]).T
        weights = self.density.pdf(peak_pos.T)
        weights = weights / weights.sum()
        self.gmm_ = FixedMeanGaussianMixture(
            peak_pos.shape[0],
            means_init=peak_pos,
            weights_init=weights,
            random_state=self.random_state,
        )
        self.labels_ = self.gmm_.fit_predict(X)
        # Checking whether there are close to zero components
        is_zero = np.isclose(self.gmm_.weights_, 0)
        n_zero = np.sum(is_zero)
        if n_zero > 0:
            print(
                f"{n_zero} components have zero weight, removing them and refitting."
            )
        peak_pos = peak_pos[~is_zero]
        weights = self.gmm_.weights_[~is_zero]
        weights = weights / weights.sum()
        self.gmm_ = FixedMeanGaussianMixture(
            peak_pos.shape[0],
            means_init=peak_pos,
            weights_init=weights,
            random_state=self.random_state,
        )
        self.labels_ = self.gmm_.fit_predict(X)
        self.classes_ = np.sort(np.unique(self.labels_))
        self.means_ = self.gmm_.means_
        self.weights_ = self.gmm_.weights_
        self.covariances_ = self.gmm_.covariances_
        return self.labels_

    @property
    def n_components(self) -> int:
        return self.gmm_.n_components

    def predict_proba(self, X):
        return self.gmm_.predict_proba(X)

    def score_samples(self, X):
        return self.density.logpdf(X.T)

    def score(self, X):
        return np.mean(self.score_samples(X))