Variational Autoencoding Topic Models

Topic models based on variational autoencoding are generative models built on ProdLDA (Srivastava & Sutton, 2017), enhanced with contextual representations.

Figure: Pseudo-plate notation of autoencoding topic models

You will also hear people refer to these models as CTMs, or Contextualized Topic Models. This name can be confusing: technically, all models in Turftopic are contextualized, but most of them do not use autoencoding variational inference. We therefore stick to calling these models autoencoding topic models.

You will need to install Turftopic with Pyro to be able to use these models:

pip install turftopic[pyro-ppl]
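
Once installed, an optional sanity check is to import both packages and print Pyro's version; this is just a quick way to confirm the extra dependency is available, nothing in Turftopic requires it:

import pyro
import turftopic

print("pyro", pyro.__version__)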

The Model

Autoencoding topic models are generative models over word content in documents, similar to classical generative topic models such as Latent Dirichlet Allocation (LDA). This means that we have a probabilistic description of how the words in documents are generated based on latent representations (topic proportions).

Where these models differ from LDA is that they (a rough sketch follows this list):

  1. Use a logistic normal distribution for topic proportions instead of a Dirichlet.
  2. Model the words in a document with a product of experts, rather than drawing a topic label for each word.
  3. Use amortized variational inference: an artificial neural network (the encoder network) learns a mapping from input representations to the parameters of the topic proportions, instead of sampling the posterior.
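
To make this generative story concrete, here is a minimal, illustrative sketch of ProdLDA-style sampling in plain NumPy. This is not Turftopic's internal code, and all variable names are ours:

import numpy as np

rng = np.random.default_rng(0)
n_topics, vocab_size, doc_len = 10, 5000, 120

# Topic proportions come from a logistic normal: a Gaussian draw pushed through a softmax
z = rng.normal(loc=np.zeros(n_topics), scale=np.ones(n_topics))
theta = np.exp(z - z.max())
theta /= theta.sum()

# beta holds unnormalized topic-word weights; mixing happens in logit space
# (product of experts), and a single softmax yields one word distribution per document
beta = rng.normal(size=(n_topics, vocab_size))
logits = theta @ beta
word_dist = np.exp(logits - logits.max())
word_dist /= word_dist.sum()

# Every word in the document is drawn from this shared distribution,
# so no per-word topic label is ever sampled
word_ids = rng.choice(vocab_size, size=doc_len, p=word_dist)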

Note that term importance estimation is built into the model, rather than being performed as a separate post-hoc step.

Depending on what the input of the encoder network is, we are either talking about a ZeroShotTM or a CombinedTM. ZeroShotTM (the default) uses only the contextual embeddings as input, while CombinedTM concatenates them with Bag-of-Words representations.

You can choose either by setting the combined parameter of the model:

from turftopic import AutoEncodingTopicModel

# ZeroShotTM: the encoder only receives the contextual embeddings
zeroshot_tm = AutoEncodingTopicModel(10, combined=False)

# CombinedTM: the embeddings are concatenated with Bag-of-Words counts
combined_tm = AutoEncodingTopicModel(10, combined=True)

Comparison with the CTM Package

The main difference is in the implementation. CTM implements inference from scratch in Torch, whereas Turftopic uses a third-party inference engine (and probabilistic programming language) called Pyro. This has a number of implications, most notably:

  • Default hyperparameters are different, so you might get different results with the two packages.
  • Turftopic's inference is more stable and less likely to fail due to numerical issues. This is simply because Pyro is a thoroughly tested and widely used engine, and a more reliable choice than hand-written inference code.
  • Inference in CTM might be faster, as its implementation is purpose-built for this model rather than relying on a general-purpose engine like Pyro.

Turftopic's implementation, similarly to its clustering models, might not include some model-specific utilities that the CTM package offers.

API Reference

turftopic.models.ctm.AutoEncodingTopicModel

Bases: ContextualModel

Variational autoencoding topic models with contextualized representations (CTM). Uses amortized variational inference with neural networks to estimate posterior for ProdLDA.

from turftopic import AutoEncodingTopicModel

corpus: list[str] = ["some text", "more text", ...]

model = AutoEncodingTopicModel(10, combined=False).fit(corpus)
model.print_topics()

Parameters:

n_components: int (required)
    Number of topics.
encoder: Union[Encoder, SentenceTransformer] (default 'sentence-transformers/all-MiniLM-L6-v2')
    Model to encode documents/terms, all-MiniLM-L6-v2 is the default.
vectorizer: Optional[CountVectorizer] (default None)
    Vectorizer used for term extraction. Can be used to prune or filter the vocabulary.
combined: bool (default False)
    Indicates whether encoder inputs should be combined with bow representations. When False the model is equivalent to ZeroShotTM, when True it is CombinedTM.
dropout_rate: float (default 0.1)
    Dropout in the encoder layers.
hidden: int (default 100)
    Size of hidden layers in the encoder network.
batch_size: int (default 42)
    Batch size when training the network.
learning_rate: float (default 0.01)
    Learning rate for the optimizer.
n_epochs: int (default 50)
    Number of epochs to run during training.
random_state: Optional[int] (default None)
    Random state to use so that results are exactly reproducible.
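
As a usage sketch, the parameters above can be combined like this; the CountVectorizer settings and the hyperparameter values are illustrative choices, not recommendations:

from sklearn.feature_extraction.text import CountVectorizer
from turftopic import AutoEncodingTopicModel

model = AutoEncodingTopicModel(
    n_components=20,
    encoder="sentence-transformers/all-MiniLM-L6-v2",
    vectorizer=CountVectorizer(min_df=10, stop_words="english"),
    combined=True,  # CombinedTM: concatenate embeddings with Bag-of-Words counts
    dropout_rate=0.1,
    hidden=100,
    batch_size=42,
    learning_rate=1e-2,
    n_epochs=50,
    random_state=42,
)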
Source code in turftopic/models/ctm.py
class AutoEncodingTopicModel(ContextualModel):
    """Variational autoencoding topic models
    with contextualized representations (CTM).
    Uses amortized variational inference with neural networks
    to estimate posterior for ProdLDA.

    ```python
    from turftopic import AutoEncodingTopicModel

    corpus: list[str] = ["some text", "more text", ...]

    model = AutoEncodingTopicModel(10, combined=False).fit(corpus)
    model.print_topics()
    ```

    Parameters
    ----------
    n_components: int
        Number of topics.
    encoder: str or SentenceTransformer
        Model to encode documents/terms, all-MiniLM-L6-v2 is the default.
    vectorizer: CountVectorizer, default None
        Vectorizer used for term extraction.
        Can be used to prune or filter the vocabulary.
    combined: bool, default False
        Indicates whether encoder inputs should be combined
        with bow representations.
        When False the model is equivalent to ZeroShotTM,
        when True it is CombinedTM.
    dropout_rate: float, default 0.1
        Dropout in the encoder layers.
    hidden: int, default 100
        Size of hidden layers in the encoder network.
    batch_size: int, default 42
        Batch size when training the network.
    learning_rate: float, default 1e-2
        Learning rate for the optimizer.
    n_epochs: int, default 50
        Number of epochs to run during training.
    random_state: int, default None
        Random state to use so that results are exactly reproducible.
    """

    def __init__(
        self,
        n_components: int,
        encoder: Union[
            Encoder, SentenceTransformer
        ] = "sentence-transformers/all-MiniLM-L6-v2",
        vectorizer: Optional[CountVectorizer] = None,
        combined: bool = False,
        dropout_rate: float = 0.1,
        hidden: int = 100,
        batch_size: int = 42,
        learning_rate: float = 1e-2,
        n_epochs: int = 50,
        random_state: Optional[int] = None,
    ):
        self.n_components = n_components
        self.random_state = random_state
        self.encoder = encoder
        if isinstance(encoder, str):
            self.encoder_ = SentenceTransformer(encoder)
        else:
            self.encoder_ = encoder
        if vectorizer is None:
            self.vectorizer = default_vectorizer()
        else:
            self.vectorizer = vectorizer
        self.combined = combined
        self.dropout_rate = dropout_rate
        self.batch_size = batch_size
        self.n_epochs = n_epochs
        self.learning_rate = learning_rate
        self.hidden = hidden

    def transform(
        self, raw_documents, embeddings: Optional[np.ndarray] = None
    ):
        """Infers topic importances for new documents based on a fitted model.

        Parameters
        ----------
        raw_documents: iterable of str
            Documents to infer topic proportions for.
        embeddings: ndarray of shape (n_documents, n_dimensions), optional
            Precomputed document encodings.

        Returns
        -------
        ndarray of shape (n_documents, n_topics)
            Document-topic matrix.
        """
        if embeddings is None:
            embeddings = self.encoder_.encode(raw_documents)
        if self.combined:
            bow = self.vectorizer.fit_transform(raw_documents)
            contextual_embeddings = np.concatenate(
                (embeddings, bow.toarray()), axis=1
            )
        else:
            contextual_embeddings = embeddings
        contextual_embeddings = torch.tensor(contextual_embeddings).float()
        loc, scale = self.model.encoder(contextual_embeddings)
        prob = torch.softmax(loc, dim=-1)
        return prob.cpu().data.numpy()

    def fit(
        self, raw_documents, y=None, embeddings: Optional[np.ndarray] = None
    ):
        console = Console()
        with console.status("Fitting model") as status:
            if embeddings is None:
                status.update("Encoding documents")
                embeddings = self.encoder_.encode(raw_documents)
                console.log("Documents encoded.")
            status.update("Extracting terms.")
            document_term_matrix = self.vectorizer.fit_transform(raw_documents)
            console.log("Term extraction done.")
            seed = self.random_state or random.randint(0, 10_000)
            torch.manual_seed(seed)
            pyro.set_rng_seed(seed)
            device = torch.device("cpu")
            pyro.clear_param_store()
            contextualized_size = embeddings.shape[1]
            if self.combined:
                contextualized_size = (
                    contextualized_size + document_term_matrix.shape[1]
                )
            self.model = Model(
                vocab_size=document_term_matrix.shape[1],
                contextualized_size=contextualized_size,
                num_topics=self.n_components,
                hidden=self.hidden,
                dropout=self.dropout_rate,
            )
            self.model.to(device)
            optimizer = pyro.optim.Adam({"lr": self.learning_rate})
            svi = SVI(
                self.model.model,
                self.model.guide,
                optimizer,
                loss=TraceMeanField_ELBO(),
            )
            num_batches = int(
                math.ceil(document_term_matrix.shape[0] / self.batch_size)
            )

            status.update(f"Fitting model. Epoch [0/{self.n_epochs}]")
            for epoch in range(self.n_epochs):
                running_loss = 0.0
                for i in range(num_batches):
                    batch_bow = np.atleast_2d(
                        document_term_matrix[
                            i * self.batch_size : (i + 1) * self.batch_size, :
                        ].toarray()
                    )
                    # Skipping batches that are smaller than 2
                    if batch_bow.shape[0] < 2:
                        continue
                    batch_contextualized = np.atleast_2d(
                        embeddings[
                            i * self.batch_size : (i + 1) * self.batch_size, :
                        ]
                    )
                    if self.combined:
                        batch_contextualized = np.concatenate(
                            (batch_contextualized, batch_bow), axis=1
                        )
                    batch_contextualized = (
                        torch.tensor(batch_contextualized).float().to(device)
                    )
                    batch_bow = torch.tensor(batch_bow).float().to(device)
                    loss = svi.step(batch_bow, batch_contextualized)
                    running_loss += loss / batch_bow.size(0)
                status.update(
                    f"Fitting model. Epoch [{epoch}/{self.n_epochs}], Loss [{running_loss}]"
                )
            self.components_ = self.model.beta()
            console.log("Model fitting done.")
        return self

    def fit_transform(
        self, raw_documents, y=None, embeddings: Optional[np.ndarray] = None
    ) -> np.ndarray:
        return self.fit(raw_documents, y, embeddings).transform(
            raw_documents, embeddings
        )

transform(raw_documents, embeddings=None)

Infers topic importances for new documents based on a fitted model.

Parameters:

raw_documents: iterable of str (required)
    Documents to infer topic proportions for.
embeddings: Optional[ndarray] (default None)
    Precomputed document encodings.

Returns:

ndarray of shape (n_documents, n_topics)
    Document-topic matrix.
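
A minimal usage sketch, assuming model is an already fitted AutoEncodingTopicModel with combined=False and the documents are placeholders:

new_docs = ["a previously unseen document", "another new document"]
doc_topic = model.transform(new_docs)
print(doc_topic.shape)  # one row of topic proportions per document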

Source code in turftopic/models/ctm.py
def transform(
    self, raw_documents, embeddings: Optional[np.ndarray] = None
):
    """Infers topic importances for new documents based on a fitted model.

    Parameters
    ----------
    raw_documents: iterable of str
        Documents to infer topic proportions for.
    embeddings: ndarray of shape (n_documents, n_dimensions), optional
        Precomputed document encodings.

    Returns
    -------
    ndarray of shape (n_documents, n_topics)
        Document-topic matrix.
    """
    if embeddings is None:
        embeddings = self.encoder_.encode(raw_documents)
    if self.combined:
        bow = self.vectorizer.fit_transform(raw_documents)
        contextual_embeddings = np.concatenate(
            (embeddings, bow.toarray()), axis=1
        )
    else:
        contextual_embeddings = embeddings
    contextual_embeddings = torch.tensor(contextual_embeddings).float()
    loc, scale = self.model.encoder(contextual_embeddings)
    prob = torch.softmax(loc, dim=-1)
    return prob.cpu().data.numpy()