TopicData
While Turftopic provides a fully sklearn-compatible interface for training and using topic models, this is not always optimal, especially when you have to visualize models or save more information about inference than would be practical to keep in a model object.
We have thus added an abstraction borrowed from topicwizard called TopicData.
Producing TopicData
Every model has methods with which you can produce this object:
Prepare TopicData objects
topic_data = model.prepare_topic_data(corpus)
# print to see what attributes are available
print(topic_data)
TopicData
├── corpus (1000)
├── vocab (1746,)
├── document_term_matrix (1000, 1746)
├── topic_term_matrix (10, 1746)
├── document_topic_matrix (1000, 10)
├── document_representation (1000, 384)
├── transform
├── topic_names (10)
├── has_negative_side
└── hierarchy
Models that support dynamic topic modeling also have a prepare_dynamic_topic_data() method, which includes dynamic topics in the resulting TopicData object.
import datetime
timestamps: list[datetime.datetime] = [...]
topic_data = model.prepare_dynamic_topic_data(corpus, timestamps=timestamps)
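Dynamic topic data is organised around time bins, whose edges are stored in time_bin_edges. As a rough sketch of how timestamps relate to such edges, the snippet below assigns each timestamp to a bin. The assign_time_bins helper is hypothetical, and it assumes half-open bins with the final edge included in the last bin; Turftopic's own binning may differ.

```python
from bisect import bisect_right
from datetime import datetime


def assign_time_bins(timestamps, time_bin_edges):
    """Map each timestamp to the index of the bin it falls into (illustrative only)."""
    last_bin = len(time_bin_edges) - 2
    bins = []
    for ts in timestamps:
        # Index of the right-most edge that is <= ts
        idx = bisect_right(time_bin_edges, ts) - 1
        # Clamp so that out-of-range values land in the first or last bin
        bins.append(min(max(idx, 0), last_bin))
    return bins


# Toy data: three edges define two yearly bins
time_bin_edges = [datetime(2020, 1, 1), datetime(2021, 1, 1), datetime(2022, 1, 1)]
timestamps = [datetime(2020, 6, 1), datetime(2021, 1, 1), datetime(2021, 12, 31)]
bin_ids = assign_time_bins(timestamps, time_bin_edges)
print(bin_ids)
```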
Using TopicData
TopicData is a dict-like object and can, for all intents and purposes, be used as a Python dictionary. For convenience, you can also access its entries with dot syntax:
# They are the same
assert topic_data["document_term_matrix"].shape == topic_data.document_term_matrix.shape
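To make the dual access pattern concrete, here is a minimal sketch of a read-only mapping that also exposes its entries as attributes. MiniTopicData is a hypothetical stand-in written purely for illustration; it is not Turftopic's implementation, and the real TopicData carries many more fields.

```python
from collections.abc import Mapping


class MiniTopicData(Mapping):
    """Hypothetical stand-in for illustration; not Turftopic's TopicData."""

    def __init__(self, **fields):
        self._fields = dict(fields)

    def __getitem__(self, key):
        return self._fields[key]

    def __iter__(self):
        return iter(self._fields)

    def __len__(self):
        return len(self._fields)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails,
        # so fall back to the stored fields.
        try:
            return self.__dict__["_fields"][name]
        except KeyError:
            raise AttributeError(name) from None


data = MiniTopicData(vocab=["cat", "dog"], topic_names=["0_pets"])
# Key access and attribute access return the same object
assert data["vocab"] is data.vocab
# Mapping supplies keys(), items(), and friends for free
print(list(data.keys()))
```

Subclassing Mapping means only three methods need to be written by hand, while membership tests, iteration helpers, and equality come from the abstract base class.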
Much like with models, you can pretty-print information about topic models based on the TopicData object. Since it contains more information on inference than the model object itself, you sometimes have to pass fewer parameters than if you called the same method on the model:
model.print_representative_documents(0, corpus, document_topic_matrix)
# This is simpler with TopicData, since you only have to pass the topic ID
topic_data.print_representative_documents(0)
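The shorter call is possible because TopicData already holds the corpus and the document-topic matrix. As a rough sketch of the idea, and assuming that representative documents are simply those with the highest importance for a topic, the lookup could be written as below. The top_documents helper and the toy data are hypothetical, and Turftopic's own ranking may differ.

```python
def top_documents(corpus, document_topic_matrix, topic_id, top_k=2):
    """Return the top_k documents with the highest importance for topic_id."""
    scores = [row[topic_id] for row in document_topic_matrix]
    # Sort document indices by descending topic importance
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return [corpus[i] for i in ranked[:top_k]]


# Toy corpus with two topics (columns) and four documents (rows)
corpus = ["cats purr", "stocks fell", "dogs bark", "bonds rallied"]
document_topic_matrix = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.7, 0.3],
    [0.1, 0.9],
]
print(top_documents(corpus, document_topic_matrix, topic_id=0))
```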
When producing figures, TopicData also gives you shorthands for accessing the topicwizard web app and Figures API:
topic_data.figures.topic_map()
See our guide on Model Interpretation for more info.
API Reference
turftopic.data.TopicData
Bases: Mapping, TopicContainer
Contains data about topic inference on a corpus. Can be used with multiple convenience and interpretation utilities.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
vocab | ndarray | Array of all words in the vocabulary of the topic model. | required |
document_term_matrix | ndarray | Bag-of-words document representations. Elements of the matrix are word importances/frequencies for given documents. | required |
document_topic_matrix | ndarray | Topic importances for each document. | required |
topic_term_matrix | ndarray | Importances of each term for each topic in a matrix. | required |
document_representation | ndarray | Embedded representations for documents. Can also be a sparse BoW matrix for classical models. | required |
topic_names | Optional[list[str]] | Names or topic descriptions inferred for topics by the model. | None |
classes | Optional[ndarray] | Topic IDs that might be different from 0-n_topics (for instance, if you have an outlier topic, which is labelled -1). | None |
corpus | Optional[list[str]] | The corpus on which inference was run. Can be None. | None |
transform | Optional[Callable] | Function that transforms documents to document-topic matrices. Can be None in the case of transductive models. | None |
time_bin_edges | Optional[list[datetime]] | Edges of the time bins in a dynamic topic model. | None |
temporal_components | Optional[ndarray] | Topic-term importances over time. Only relevant for dynamic topic models. | None |
temporal_importance | Optional[ndarray] | Topic strength signal over time. Only relevant for dynamic topic models. | None |
has_negative_side | bool | Indicates whether the topic model's components are supposed to be interpreted in both directions, e.g. in SemanticSignalSeparation, one is supposed to look at the highest, but also the lowest ranking words. This is in contrast to KeyNMF, for instance, where only positive word importance should be considered. | False |
hierarchy | Optional[TopicNode] | Optional topic hierarchy for models that support hierarchical topic modeling. | None |
Source code in turftopic/data.py
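As an illustration of what has_negative_side implies for interpretation, the sketch below reads one row of a topic-term matrix in one or both directions. The describe_topic helper and the toy numbers are hypothetical; they are not part of Turftopic.

```python
def describe_topic(vocab, topic_term_row, has_negative_side=False, top_k=2):
    """List the highest ranking words and, if requested, the lowest ranking ones."""
    # Term indices sorted by descending importance
    order = sorted(range(len(vocab)), key=lambda i: topic_term_row[i], reverse=True)
    description = {"positive": [vocab[i] for i in order[:top_k]]}
    if has_negative_side:
        # Most negative term first
        description["negative"] = [vocab[i] for i in reversed(order[-top_k:])]
    return description


vocab = ["war", "peace", "treaty", "attack", "weather"]
topic_term_row = [0.8, -0.7, -0.5, 0.6, 0.0]

print(describe_topic(vocab, topic_term_row, has_negative_side=True))
print(describe_topic(vocab, topic_term_row, has_negative_side=False))
```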
figures property
Container object for topicwizard figures that can be generated from this TopicData object. You can use any of the interactive figures from the Figures API in topicwizard.
For instance:
topic_data.figures.topic_barcharts()
# or
topic_data.figures.topic_wordclouds()
from_disk(path) classmethod
Loads TopicData object from disk with Joblib.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str \| Path | Path to load the data from, e.g. "topic_data.joblib" | required |
to_disk(path)
Saves TopicData object to disk.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str \| Path | Path to save the data to, e.g. "topic_data.joblib" | required |
visualize_topicwizard(**kwargs)
Opens the topicwizard web app with which you can interactively investigate your model. See topicwizard's documentation for more detail.