Variational Autoencoding Topic Models

Topic models based on variational autoencoding are generative models built on ProdLDA (Srivastava & Sutton, 2017), enhanced with contextual representations.

Figure: Pseudo-plate notation of autoencoding topic models

You will also hear people refer to these models as CTMs, or Contextualized Topic Models. This name can be confusing: technically, all models in Turftopic are contextualized, but most of them do not use autoencoding variational inference. We therefore stick to calling these models autoencoding topic models.

You will need to install Turftopic with Pyro to be able to use these models:

pip install turftopic[pyro-ppl]
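
Once installed, an optional sanity check is to import both packages and print Pyro's version; this is just a quick way to confirm the extra dependency is available, nothing in Turftopic requires it:

import pyro
import turftopic

print("pyro", pyro.__version__)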

The Model

Autoencoding topic models are generative models over word content in documents, similar to classical generative topic models such as Latent Dirichlet Allocation (LDA). This means that we have a probabilistic description of how the words in documents are generated based on latent representations (topic proportions).

Where these models differ from LDA is that they (a rough sketch follows this list):

  1. Use a logistic normal distribution for topic proportions instead of a Dirichlet.
  2. Model the words in a document with a product of experts, rather than drawing a topic label for each word.
  3. Use amortized variational inference: an artificial neural network (the encoder network) learns a mapping from input representations to the parameters of the topic proportions, instead of sampling the posterior.
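
To make this generative story concrete, here is a minimal, illustrative sketch of ProdLDA-style sampling in plain NumPy. This is not Turftopic's internal code, and all variable names are ours:

import numpy as np

rng = np.random.default_rng(0)
n_topics, vocab_size, doc_len = 10, 5000, 120

# Topic proportions come from a logistic normal: a Gaussian draw pushed through a softmax
z = rng.normal(loc=np.zeros(n_topics), scale=np.ones(n_topics))
theta = np.exp(z - z.max())
theta /= theta.sum()

# beta holds unnormalized topic-word weights; mixing happens in logit space
# (product of experts), and a single softmax yields one word distribution per document
beta = rng.normal(size=(n_topics, vocab_size))
logits = theta @ beta
word_dist = np.exp(logits - logits.max())
word_dist /= word_dist.sum()

# Every word in the document is drawn from this shared distribution,
# so no per-word topic label is ever sampled
word_ids = rng.choice(vocab_size, size=doc_len, p=word_dist)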

Note that term importance estimation is built into the model, rather than being performed as a separate post-hoc step.

Depending on what the input of the encoder network is, we are either talking about a ZeroShotTM or a CombinedTM. ZeroShotTM (the default) uses only the contextual embeddings as input, while CombinedTM concatenates them with Bag-of-Words representations.

You can choose either by setting the combined parameter of the model:

from turftopic import AutoEncodingTopicModel

# ZeroShotTM: the encoder only receives the contextual embeddings
zeroshot_tm = AutoEncodingTopicModel(10, combined=False)

# CombinedTM: the embeddings are concatenated with Bag-of-Words counts
combined_tm = AutoEncodingTopicModel(10, combined=True)

Comparison with the CTM Package

The main difference is in the implementation. CTM implements inference from scratch in Torch, whereas Turftopic uses a third-party inference engine (and probabilistic programming language) called Pyro. This has a number of implications, most notably:

  • Default hyperparameters are different, so you might get different results with the two packages.
  • Turftopic's inference is more stable and less likely to fail due to numerical issues. This is simply because Pyro is a thoroughly tested and widely used engine, and a more reliable choice than hand-written inference code.
  • Inference in CTM might be faster, as its implementation is purpose-built for this model rather than relying on a general-purpose engine like Pyro.

Turftopic's implementation, similarly to its clustering models, might not include some model-specific utilities that the CTM package offers.

API Reference

turftopic.models.ctm.AutoEncodingTopicModel

Bases: ContextualModel

Variational autoencoding topic models with contextualized representations (CTM). Uses amortized variational inference with neural networks to estimate posterior for ProdLDA.

from turftopic import AutoEncodingTopicModel

corpus: list[str] = ["some text", "more text", ...]

model = AutoEncodingTopicModel(10, combined=False).fit(corpus)
model.print_topics()

Parameters:

n_components: int (required)
    Number of topics.
encoder: Union[Encoder, SentenceTransformer] (default 'sentence-transformers/all-MiniLM-L6-v2')
    Model to encode documents/terms, all-MiniLM-L6-v2 is the default.
vectorizer: Optional[CountVectorizer] (default None)
    Vectorizer used for term extraction. Can be used to prune or filter the vocabulary.
combined: bool (default False)
    Indicates whether encoder inputs should be combined with bow representations. When False the model is equivalent to ZeroShotTM, when True it is CombinedTM.
dropout_rate: float (default 0.1)
    Dropout in the encoder layers.
hidden: int (default 100)
    Size of hidden layers in the encoder network.
batch_size: int (default 42)
    Batch size when training the network.
learning_rate: float (default 0.01)
    Learning rate for the optimizer.
n_epochs: int (default 50)
    Number of epochs to run during training.
random_state: Optional[int] (default None)
    Random state to use so that results are exactly reproducible.
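
As a usage sketch, the parameters above can be combined like this; the CountVectorizer settings and the hyperparameter values are illustrative choices, not recommendations:

from sklearn.feature_extraction.text import CountVectorizer
from turftopic import AutoEncodingTopicModel

model = AutoEncodingTopicModel(
    n_components=20,
    encoder="sentence-transformers/all-MiniLM-L6-v2",
    vectorizer=CountVectorizer(min_df=10, stop_words="english"),
    combined=True,  # CombinedTM: concatenate embeddings with Bag-of-Words counts
    dropout_rate=0.1,
    hidden=100,
    batch_size=42,
    learning_rate=1e-2,
    n_epochs=50,
    random_state=42,
)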
Source code in turftopic/models/ctm.py
class AutoEncodingTopicModel(ContextualModel):
    """Variational autoencoding topic models
    with contextualized representations (CTM).
    Uses amortized variational inference with neural networks
    to estimate posterior for ProdLDA.

    ```python
    from turftopic import AutoEncodingTopicModel

    corpus: list[str] = ["some text", "more text", ...]

    model = AutoEncodingTopicModel(10, combined=False).fit(corpus)
    model.print_topics()
    ```

    Parameters
    ----------
    n_components: int
        Number of topics.
    encoder: str or SentenceTransformer
        Model to encode documents/terms, all-MiniLM-L6-v2 is the default.
    vectorizer: CountVectorizer, default None
        Vectorizer used for term extraction.
        Can be used to prune or filter the vocabulary.
    combined: bool, default False
        Indicates whether encoder inputs should be combined
        with bow representations.
        When False the model is equivalent to ZeroShotTM,
        when True it is CombinedTM.
    dropout_rate: float, default 0.1
        Dropout in the encoder layers.
    hidden: int, default 100
        Size of hidden layers in the encoder network.
    batch_size: int, default 42
        Batch size when training the network.
    learning_rate: float, default 1e-2
        Learning rate for the optimizer.
    n_epochs: int, default 50
        Number of epochs to run during training.
    random_state: int, default None
        Random state to use so that results are exactly reproducible.
    """

    def __init__(
        self,
        n_components: int,
        encoder: Union[
            Encoder, SentenceTransformer
        ] = "sentence-transformers/all-MiniLM-L6-v2",
        vectorizer: Optional[CountVectorizer] = None,
        combined: bool = False,
        dropout_rate: float = 0.1,
        hidden: int = 100,
        batch_size: int = 42,
        learning_rate: float = 1e-2,
        n_epochs: int = 50,
        random_state: Optional[int] = None,
    ):
        self.n_components = n_components
        self.random_state = random_state
        self.encoder = encoder
        if isinstance(encoder, str):
            self.encoder_ = SentenceTransformer(encoder)
        else:
            self.encoder_ = encoder
        if vectorizer is None:
            self.vectorizer = default_vectorizer()
        else:
            self.vectorizer = vectorizer
        self.combined = combined
        self.dropout_rate = dropout_rate
        self.batch_size = batch_size
        self.n_epochs = n_epochs
        self.learning_rate = learning_rate
        self.hidden = hidden

    def transform(
        self, raw_documents, embeddings: Optional[np.ndarray] = None
    ):
        """Infers topic importances for new documents based on a fitted model.

        Parameters
        ----------
        raw_documents: iterable of str
            Documents to infer topic proportions for.
        embeddings: ndarray of shape (n_documents, n_dimensions), optional
            Precomputed document encodings.

        Returns
        -------
        ndarray of shape (n_documents, n_topics)
            Document-topic matrix.
        """
        if embeddings is None:
            embeddings = self.encoder_.encode(raw_documents)
        if self.combined:
            bow = self.vectorizer.fit_transform(raw_documents)
            contextual_embeddings = np.concatenate(
                (embeddings, bow.toarray()), axis=1
            )
        else:
            contextual_embeddings = embeddings
        contextual_embeddings = torch.tensor(contextual_embeddings).float()
        loc, scale = self.model.encoder(contextual_embeddings)
        prob = torch.softmax(loc, dim=-1)
        return prob.cpu().data.numpy()

    def fit(
        self, raw_documents, y=None, embeddings: Optional[np.ndarray] = None
    ):
        console = Console()
        with console.status("Fitting model") as status:
            if embeddings is None:
                status.update("Encoding documents")
                embeddings = self.encoder_.encode(raw_documents)
                console.log("Documents encoded.")
            status.update("Extracting terms.")
            document_term_matrix = self.vectorizer.fit_transform(raw_documents)
            console.log("Term extraction done.")
            seed = self.random_state or random.randint(0, 10_000)
            torch.manual_seed(seed)
            pyro.set_rng_seed(seed)
            device = torch.device("cpu")
            pyro.clear_param_store()
            contextualized_size = embeddings.shape[1]
            if self.combined:
                contextualized_size = (
                    contextualized_size + document_term_matrix.shape[1]
                )
            self.model = Model(
                vocab_size=document_term_matrix.shape[1],
                contextualized_size=contextualized_size,
                num_topics=self.n_components,
                hidden=self.hidden,
                dropout=self.dropout_rate,
            )
            self.model.to(device)
            optimizer = pyro.optim.Adam({"lr": self.learning_rate})
            svi = SVI(
                self.model.model,
                self.model.guide,
                optimizer,
                loss=TraceMeanField_ELBO(),
            )
            num_batches = int(
                math.ceil(document_term_matrix.shape[0] / self.batch_size)
            )

            status.update(f"Fitting model. Epoch [0/{self.n_epochs}]")
            for epoch in range(self.n_epochs):
                running_loss = 0.0
                for i in range(num_batches):
                    batch_bow = np.atleast_2d(
                        document_term_matrix[
                            i * self.batch_size : (i + 1) * self.batch_size, :
                        ].toarray()
                    )
                    # Skipping batches that are smaller than 2
                    if batch_bow.shape[0] < 2:
                        continue
                    batch_contextualized = np.atleast_2d(
                        embeddings[
                            i * self.batch_size : (i + 1) * self.batch_size, :
                        ]
                    )
                    if self.combined:
                        batch_contextualized = np.concatenate(
                            (batch_contextualized, batch_bow), axis=1
                        )
                    batch_contextualized = (
                        torch.tensor(batch_contextualized).float().to(device)
                    )
                    batch_bow = torch.tensor(batch_bow).float().to(device)
                    loss = svi.step(batch_bow, batch_contextualized)
                    running_loss += loss / batch_bow.size(0)
                status.update(
                    f"Fitting model. Epoch [{epoch}/{self.n_epochs}], Loss [{running_loss}]"
                )
            self.components_ = self.model.beta()
            console.log("Model fitting done.")
        return self

    def fit_transform(
        self, raw_documents, y=None, embeddings: Optional[np.ndarray] = None
    ) -> np.ndarray:
        return self.fit(raw_documents, y, embeddings).transform(
            raw_documents, embeddings
        )

transform(raw_documents, embeddings=None)

Infers topic importances for new documents based on a fitted model.

Parameters:

raw_documents: iterable of str (required)
    Documents to infer topic proportions for.
embeddings: Optional[ndarray] (default None)
    Precomputed document encodings.

Returns:

ndarray of shape (n_documents, n_topics)
    Document-topic matrix.
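
A minimal usage sketch, assuming model is an already fitted AutoEncodingTopicModel with combined=False and the documents are placeholders:

new_docs = ["a previously unseen document", "another new document"]
doc_topic = model.transform(new_docs)
print(doc_topic.shape)  # one row of topic proportions per document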

Source code in turftopic/models/ctm.py
def transform(
    self, raw_documents, embeddings: Optional[np.ndarray] = None
):
    """Infers topic importances for new documents based on a fitted model.

    Parameters
    ----------
    raw_documents: iterable of str
        Documents to infer topic proportions for.
    embeddings: ndarray of shape (n_documents, n_dimensions), optional
        Precomputed document encodings.

    Returns
    -------
    ndarray of shape (n_documents, n_topics)
        Document-topic matrix.
    """
    if embeddings is None:
        embeddings = self.encoder_.encode(raw_documents)
    if self.combined:
        bow = self.vectorizer.fit_transform(raw_documents)
        contextual_embeddings = np.concatenate(
            (embeddings, bow.toarray()), axis=1
        )
    else:
        contextual_embeddings = embeddings
    contextual_embeddings = torch.tensor(contextual_embeddings).float()
    loc, scale = self.model.encoder(contextual_embeddings)
    prob = torch.softmax(loc, dim=-1)
    return prob.cpu().data.numpy()