Classical Topic Models#

topicwizard uses the idea of a topic pipeline as its main abstraction for understanding classical topic models.

Classical topic models (i.e. ones that make the bag-of-words assumption) in scikit-learn typically consist of a vectorizer, an optional weighting step, and a topic model that operates only on the BoW representations.

One can either use a regular sklearn Pipeline, or topicwizard's own abstraction, TopicPipeline.

In the app, all pipelines get turned into TopicPipelines; functionally there is no difference between the two.

Vectorizer#

The vectorizer is the component that turns texts into bag-of-words vectors. A sensible default is scikit-learn’s CountVectorizer, which is reliable and highly customizable.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
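
Because CountVectorizer is so customizable, you can, for example, strip stop words and filter out overly rare or overly frequent terms. A quick sketch (the cutoffs here are illustrative, not recommendations):

from sklearn.feature_extraction.text import CountVectorizer

# Drop English stop words, terms occurring in fewer than 10 documents,
# and terms occurring in more than half of all documents
vectorizer = CountVectorizer(stop_words="english", min_df=10, max_df=0.5)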

Topic Model#

topicwizard assumes that topic models are some sort of decomposition model that can turn bag-of-words, or similar representations, into a decomposed signal of topics. Anything that turns a document-term matrix into a document-topic matrix is considered a topic model by topicwizard. We additionally require that all models have a “.components_” attribute, which is a topic-term importance matrix. Good examples of this are Non-negative Matrix Factorization or Latent Dirichlet Allocation from scikit-learn. Some third-party libraries, such as tweetopic, also come with sklearn-compatible components.

# LDA for long texts
from sklearn.decomposition import LatentDirichletAllocation

model = LatentDirichletAllocation(n_components=10)

# You can use NMF too
from sklearn.decomposition import NMF

model = NMF(n_components=10)

# Or tweetopic's DMM for short texts
# pip install tweetopic

from tweetopic import DMM

model = DMM(n_components=10)
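
Since this interface is all topicwizard relies on, any estimator that exposes transform() and a .components_ attribute can act as a topic model. A minimal, hypothetical sketch built on scikit-learn's TruncatedSVD (not part of topicwizard, just an illustration of the contract):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import TruncatedSVD

class CustomTopicModel(TransformerMixin, BaseEstimator):
    """Hypothetical topic model satisfying topicwizard's expectations."""

    def __init__(self, n_components=10):
        self.n_components = n_components

    def fit(self, X, y=None):
        # X is a document-term matrix
        self.svd_ = TruncatedSVD(n_components=self.n_components).fit(X)
        # Topic-term importance matrix, required by topicwizard
        self.components_ = self.svd_.components_
        return self

    def transform(self, X):
        # Turns document-term matrices into document-topic matrices
        return self.svd_.transform(X)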

Pipeline#

You can string these components together into a pipeline, and can even add additional transformations in the middle.

from sklearn.pipeline import make_pipeline

topic_pipeline = make_pipeline(vectorizer, model)

(Figure: schematic overview of the pipeline.)
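
The optional weighting step mentioned earlier can, for instance, be a tf-idf weighting inserted between the vectorizer and the model. A minimal sketch (tf-idf pairs well with NMF; for LDA you would typically stick to raw counts):

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline

# BoW counts -> tf-idf weights -> topic model
topic_pipeline = make_pipeline(vectorizer, TfidfTransformer(), model)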

TopicPipeline#

TopicPipeline is a subclass of scikit-learn's Pipeline and is, for the most part, functionally identical to a regular Pipeline. We recommend using TopicPipeline instead of a regular pipeline, as it is more convenient in downstream tasks and model interpretation.

from topicwizard.pipeline import make_topic_pipeline

topic_pipeline = make_topic_pipeline(vectorizer, model)

You can also convert an existing Pipeline to a TopicPipeline:

from topicwizard.pipeline import TopicPipeline

topic_pipeline = TopicPipeline.from_pipeline(pipeline)

Named Outputs#

TopicPipelines infer topic names automatically upon fitting. This can be useful if you intend to use these names further down a pipeline, for example:

topic_pipeline.fit(texts)
print(topic_pipeline.get_feature_names_out())
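
The inferred names combine each topic's index with its most important terms. The exact names depend on your corpus and model; the output could hypothetically look something like this:

# Hypothetical output:
# ['0_market_stock_price_shares', '1_vaccine_pandemic_virus_covid', ...]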

Freezing Components#

If you intend to use the topics in a pipeline downstream, you might want to first train a topic model, interpret the topics with topicwizard, and then train downstream components. In these cases you can freeze the vectorizer and the topic model, so that they do not change when you call fit() or partial_fit() on an outer pipeline.

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

import topicwizard
from topicwizard.pipeline import make_topic_pipeline

topic_pipeline = make_topic_pipeline(vectorizer, model).fit(texts)

# Investigate topics
topicwizard.visualize(topic_pipeline)

# Freezing topic pipeline
topic_pipeline.freeze = True
# Constructing classification pipeline
cls_pipeline = make_pipeline(topic_pipeline, LogisticRegression())
cls_pipeline.fit(X, y)

Output as DataFrame#

Scikit-learn pipelines and components can now output pandas DataFrames instead of matrices when asked to. The issue is that vectorizers do not play well with this feature, since they have sparse outputs, and pandas cannot deal with sparse matrices.

TopicPipeline allows you to get DataFrame output from your topic pipeline, either by passing a parameter or by using scikit-learn's set_output API.

# Set a parameter
pipeline = make_topic_pipeline(vectorizer, model, pandas_out=True)

# Or use set_output API
pipeline = make_topic_pipeline(vectorizer, model).set_output(transform="pandas")

This is insanely useful when you are trying to investigate the topic content of individual documents. You can, for example, display a heatmap of topics in a set of documents like so:

import plotly.express as px

# Assumes `pipeline` has already been fitted on a corpus
texts = [
    "Coronavirus killed 50000 people today.",
    "Donald Trump's presidential campaign is going very well.",
    "Protests against police brutality have been going on all around the US.",
]
topic_df = pipeline.transform(texts)
topic_df.index = texts
px.imshow(topic_df).show()

Alternatively, you can use human-learn to create rule-based components around your topic model.

Here’s an example of how you could construct a classification pipeline for detecting which documents are about Covid, using a topic model we train and investigate. These kinds of pipelines can be very useful when you do not have labelled data but would still like to filter or label texts.

# Install human-learn from PyPI
# pip install human-learn

from hulearn.classification import FunctionClassifier
from sklearn.pipeline import make_pipeline

import topicwizard
from topicwizard.pipeline import make_topic_pipeline

# DataFrame output lets the rule below look up topics by name
topic_pipeline = make_topic_pipeline(vectorizer, model, pandas_out=True).fit(texts)

# Investigate topics
topicwizard.visualize(topic_pipeline)

# Creating rule for classifying something as a corona document
def corona_rule(df, threshold=0.5):
    is_about_corona = df["11_vaccine_pandemic_virus_coronavirus"] > threshold
    return is_about_corona.astype(int)

# Freezing topic pipeline
topic_pipeline.freeze = True
classifier = FunctionClassifier(corona_rule)
cls_pipeline = make_pipeline(topic_pipeline, classifier)
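
A minimal sketch of applying the rule, assuming the pipeline has been fitted as above (new_texts is a hypothetical list of unseen documents):

# Topic importances for unseen documents, as a DataFrame with named columns
new_texts = ["New coronavirus variant discovered."]
doc_topic = topic_pipeline.transform(new_texts)
# 1 where the Covid topic exceeds the threshold, 0 otherwise
print(corona_rule(doc_topic))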

Pseudoprobabilities#

TopicPipeline can be instructed to normalize document-topic importances as if they were probabilities. This is useful if you want to treat importances as probabilities in calculations, or when specifying thresholds.

pipeline = make_topic_pipeline(vectorizer, model, norm_row=True)
# Or set it to false if you want to turn it off
pipeline = make_topic_pipeline(vectorizer, model, norm_row=False)
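
With norm_row=True, every document's topic importances sum to one. A quick sketch to check this on a fitted pipeline:

import numpy as np

pipeline = make_topic_pipeline(vectorizer, model, norm_row=True).fit(texts)
doc_topic = pipeline.transform(texts)
# Each row sums to 1 and can be read as a topic distribution
assert np.allclose(np.asarray(doc_topic).sum(axis=1), 1.0)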

Validation#

Unlike a regular scikit-learn pipeline, TopicPipeline validates that the passed components are appropriate for use as a topic model in topicwizard.

API Reference#

class topicwizard.pipeline.TopicPipeline(steps: List[Tuple[str, BaseEstimator]], *, memory=None, verbose=False, pandas_out=False, norm_row=False, freeze=False)#

Scikit-learn compatible topic pipeline. It assigns topic names to the output, can return DataFrames and validates models and vectorizers.

Parameters:
  • steps (list of tuple of str and BaseEstimator) – Estimators in the pipeline. The first one has to be an sklearn compatible vectorizer, the last one has to be an sklearn compatible topic model.

  • memory (str or object with the joblib.Memory interface, default None) – Used to cache the fitted transformers of the pipeline. The last step will never be cached, even if it is a transformer. By default, no caching is performed. If a string is given, it is the path to the caching directory. Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance given to the pipeline cannot be inspected directly. Use the attribute named_steps or steps to inspect estimators within the pipeline. Caching the transformers is advantageous when fitting is time consuming.

  • verbose (bool, default False) – If True, the time elapsed while fitting each step will be printed as it is completed.

  • pandas_out (bool, default False) – If True, transform() will return a DataFrame.

  • norm_row (bool, default False) – If True, every row in transform() will be sum-normalized so that it can be interpreted as a probability distribution.

  • freeze (bool, default False) – If True, components of the pipeline will not be fitted when fit() is called. This is good for downstream uses of the topic model.

topic_names#

Inferred names of topics. Can be changed.

Type: list of str or None

vectorizer_#

Vectorizer model in the pipeline.

Type: BaseEstimator

topic_model_#

Topic model in the pipeline.

Type: BaseEstimator

fit(X: Iterable[str], y=None)#

Fits the pipeline, infers topic names and validates that the individual estimators are indeed a vectorizer and a topic model.

Parameters:
  • X (iterable of str) – Texts to fit the model on.

  • y (None) – Ignored, exists for compatibility.

Returns: Fitted pipeline.

Return type: self

partial_fit(X, y=None, classes=None, **kwargs)#

Fits the pipeline on a batch, infers topic names and validates that the individual estimators are indeed a vectorizer and a topic model.

Parameters:
  • X (iterable of str) – Texts to fit the model on.

  • y (None) – Ignored, exists for compatibility.

Returns: Fitted pipeline.

Return type: self

transform(X: Iterable[str])#

Turns texts into a document-topic matrix.

Parameters:
  • X (iterable of str) – List of documents.

Returns: Document-topic importance matrix.

Return type: array or DataFrame of shape (n_documents, n_topics)

get_feature_names_out()#

Returns names of topics.

fit_transform(X: Iterable[str], y=None)#

Fits the pipeline, infers topic names and validates that the individual estimators are indeed a vectorizer and a topic model. Then turns texts into a document-topic matrix.

Parameters:
  • X (iterable of str) – Texts to fit the model on.

  • y (None) – Ignored, exists for compatibility.

Returns: Document-topic importance matrix.

Return type: array or DataFrame of shape (n_documents, n_topics)

set_output(transform=None)#

Sets the output container of the pipeline. Passing transform="pandas" makes transform() return DataFrames; any other value disables pandas output.

prepare_topic_data(corpus: List[str], document_representation: Literal['term', 'topic'] = 'term') → TopicData#

Prepares topic data for topicwizard's visualizations.
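
A hypothetical usage sketch, precomputing topic data from a fitted pipeline (corpus stands for your list of documents):

# Precompute topic data from a fitted TopicPipeline
topic_data = topic_pipeline.prepare_topic_data(corpus)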