Classical Topic Models#
topicwizard uses the idea of a topic pipeline as its main abstraction to understand classical topic models.
Classical topic models (aka. ones that make the bag-of-words assumption) in Scikit-learn typically consist of a vectorizer, optional weighting and a topic model that only operates on the BoW representations.
One can either use a regular sklearn Pipeline, or topicwizards own abstraction, TopicPipeline.
In the app, all pipelines get turned into TopicPipelines, functionally there is no difference between the two.
Vectorizer#
The vectorizer is a component that turns texts into bag-of-words vectors. A sensible default would be scikit-learn’s CountVectorizer, which makes this process rather customizable and is quite reliable.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
Topic Model#
topicwizard assumes that topic models are some sort of decomposition model, that can turn bag-of-words, or similar representations into a decomposed signal of topics. Anything that turns a document-term-matrix into a document-topic matrix is considered a topic model by topicwizard. We additionally require that all models have to have a “.components_” attribute, which is a topic-term importance matrix. Good examples of of this are Non-negative Matrix Factorization, or Latent Dirichlet Allocation from scikit-learn. Some thrid-party libraries, such as tweetopic also come with sklearn-compatible components.
# LDA for long texts
from sklearn.decomposition import LatentDirichletAllocation
model = LatentDirichletAllocation(n_components=10)
# You can use NMF too
from sklearn.decomposition import NMF
model = NMF(n_components=10)
# Or tweetopic's DMM for short texts
# pip install tweetopic
from tweetopic import DMM
model = DMM(n_components=10)
Pipeline#
You can string these components together into a pipeline, and can even add additional transformations in the middle.
from sklearn.pipeline import make_pipeline
topic_pipeline = make_pipeline(vectorizer, model)
TopicPipeline#
TopicPipeline is a subclass of scikit-learn Pipelines, and for the most part is functionally identical to using a regular Pipeline. We recommend that you use TopicPipeline instead of a regular pipeline as it is more convenient to use in downstream tasks and model interpretation.
from topicwizard.pipeline import make_topic_pipeline
topic_pipeline = make_topic_pipeline(vectorizer, model)
You can also convert an already existent Pipeline to a TopicPipeline:
from topicwizard.pipeline import TopicPipeline
topic_pipeline = TopicPipeline.from_pipeline(pipeline)
Named Outputs#
Topic Pipelines do automatic topic name inference upon fitting, this can be useful if you intend to use these names further down a pipeline for example:
topic_pipeline.fit(texts)
print(topic_pipeline.get_feature_names_out())
Freezing Components#
If you intend to use the topics in a pipeline downstream, you might want to first train a topic model, interpret the topics with topicwizard, and then train downstream components. In these cases you can freeze the vectorizer and the topic model, so that they do not change when you call fit() or partial_fit() on an outer pipeline.
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
topic_pipeline = make_topic_pipeline(vectorizer, model).fit(texts)
# Investigate topics
topicwizard.visualize(topic_pipeline)
# Freezing topic pipeline
topic_pipeline.freeze = True
# Constructing classification pipeline
cls_pipeline = make_pipeline(topic_pipeline, LogisticRegression())
cls_pipeline.fit(X, y)
Output as DataFrame#
Scikit-learn pipelines and components can now output pandas DataFrames instead of matrices when asked to. The issue is that vectorizers do not play very well with this dynamic, since they have sparse outputs, and pandas cannot deal with sparse matrices.
TopicPipeline allows you to set DataFrames to be the output of your topic pipeline either by providing a parameter or by using the set_output API in scikit-learn.
# Set a parameter
pipeline = make_topic_pipeline(vectorizer, model, pandas_out=True)
# Or use set_output API
pipeline = make_topic_pipeline(vectorizer, model).set_output(transform="pandas")
This is insanely useful when you are trying to investigate the topic content of individual documents. You can for example display a heatmap of topics in a set of documents as such:
import plotly.express as px
texts = [
"Coronavirus killed 50000 people today.",
"Donald Trump's presidential campaing is going very well",
"Protests against police brutality have been going on all around the US.",
]
topic_df = pipeline.transform(texts)
topic_df.index = texts
px.imshow(topic_df).show()
Alternatively you can use human-learn to create rule based components around your topic model.
Here’s an example of how you could construct a classification pipeline for seeing which document is about Covid using a topic model we train and investigate. These kind of pipelines can be very useful when you do not have labelled data but would still like to filter or label texts.
# Install human-learn from PyPI
# pip install human-learn
from hulearn.classification import FunctionClassifier
from sklearn.pipeline import make_pipeline
topic_pipeline = make_topic_pipeline(vectorizer, model).fit(texts)
# Investigate topics
topicwizard.visualize(topic_pipeline)
# Creating rule for classifying something as a corona document
def corona_rule(df, threshold=0.5):
is_about_corona = df["11_vaccine_pandemic_virus_coronavirus"] > threshold
return is_about_corona.astype(int)
# Freezing topic pipeline
topic_pipeline.freeze = True
classifier = FunctionClassifier(corona_rule)
cls_pipeline = make_pipeline(topic_pipeline, classifier)
Pseudoprobabilites#
TopicPipeline can be instructed to normalize document-topic importances as if they were probabilites. This is useful if you want to treat importances as probabilities in calculations, or when specifying thresholds.
pipeline = make_topic_pipeline(vectorizer, model, norm_row=True)
# Or set it to false if you want to turn it off
pipeline = make_topic_pipeline(vectorizer, model, norm_row=False)
Validation#
TopicPipeline validates whether the passed components are appropriate to use as a topic model in topicwizard unlike a regular scikit-learn pipeline.
API Reference#
- class topicwizard.pipeline.TopicPipeline(steps: List[Tuple[str, BaseEstimator]], *, memory=None, verbose=False, pandas_out=False, norm_row=False, freeze=False)#
Scikit-learn compatible topic pipeline. It assigns topic names to the output, can return DataFrames and validates models and vectorizers.
- Parameters:
steps (
list
oftuple
ofstr
andBaseEstimator
) – Estimators in the pipeline. The first one has to be an sklearn compatible vectorizer, the last one has to be an sklearn compatible topic model.memory (
str
orobject with the joblib.Memory interface
, defaultNone
) – Used to cache the fitted transformers of the pipeline. The last step will never be cached, even if it is a transformer. By default, no caching is performed. If a string is given, it is the path to the caching directory. Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance given to the pipeline cannot be inspected directly. Use the attributenamed_steps
orsteps
to inspect estimators within the pipeline. Caching the transformers is advantageous when fitting is time consuming.verbose (
bool
, defaultFalse
) – If True, the time elapsed while fitting each step will be printed as it is completed.pandas_out (
bool
, defaultFalse
) – If True, transform() will return a DataFrame.norm_row (
bool
, defaultTrue
) – If True, every row in transform() will be sum-normalized so that they can be interpreted as probabilities.freeze (
bool
, defaultFalse
) – If True, components of the pipeline will not be fitted when fit() is called. This is good for downstream uses of the topic model.
- topic_names#
Inferred names of topics. Can be changed.
- Type:
list
ofstr
orNone
- vectorizer_#
Vectorizer model in the pipeline.
- Type:
BaseEstimator
- topic_model_#
Topic model in the pipeline.
- Type:
BaseEstimator
- fit(X: Iterable[str], y=None)#
Fits the pipeline, infers topic names and validates that the individual estimators are indeed a vectorizer and a topic model.
- Parameters:
X (
iterable
ofstr
) – Texts to fit the model on.y (
None
) – Ignored, exists for compatibility.
- Returns:
Fitted pipeline.
- Return type:
self
- partial_fit(X, y=None, classes=None, **kwargs)#
Fits the pipeline on a batch, infers topic names and validates that the individual estimators are indeed a vectorizer and a topic model.
- Parameters:
X (
iterable
ofstr
) – Texts to fit the model on.y (
None
) – Ignored, exists for compatibility.
- Returns:
Fitted pipeline.
- Return type:
self
- transform(X: Iterable[str])#
Turns texts into a document-topic matrix.
- Parameters:
X (
iterable
ofstr
) – List of documents.- Returns:
Document-topic importance matrix.
- Return type:
array
orDataFrame
ofshape (n_documents
,n_topics)
- get_feature_names_out()#
Returns names of topics.
- fit_transform(X: Iterable[str], y=None)#
Fits the pipeline, infers topic names and validates that the individual estimators are indeed a vectorizer and a topic model. Then turns texts into a document-topic matrix.
- Parameters:
X (
iterable
ofstr
) – Texts to fit the model on.y (
None
) – Ignored, exists for compatibility.
- Returns:
Document-topic importance matrix.
- Return type:
array
orDataFrame
ofshape (n_documents
,n_topics)
- set_output(transform=None)#
You can set the output of the pipeline to be a pandas dataframe. If you pass ‘pandas’ it will do this, otherwise it will disable pandas output.