TopicData
While Turftopic provides a fully sklearn-compatible interface for training and using topic models, this is not always optimal, especially when you need to visualize models or save more information about inference than would be practical to keep in a model object.
We have thus added an abstraction borrowed from topicwizard called TopicData.
Producing TopicData
Every model has methods for producing this object:
Prepare TopicData objects
topic_data = model.prepare_topic_data(corpus)
# print to see what attributes are available
print(topic_data)
TopicData
├── corpus (1000)
├── vocab (1746,)
├── document_term_matrix (1000, 1746)
├── topic_term_matrix (10, 1746)
├── document_topic_matrix (1000, 10)
├── document_representation (1000, 384)
├── transform
├── topic_names (10)
├── has_negative_side
└── hierarchy
Models that support dynamic topic modeling also have a prepare_dynamic_topic_data method, which includes dynamic topics in the resulting TopicData object.
import datetime
timestamps: list[datetime.datetime] = [...]
topic_data = model.prepare_dynamic_topic_data(corpus, timestamps=timestamps)
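To give an intuition for what time bins mean in a dynamic model, here is a minimal sketch that assigns timestamps to bins given a list of bin edges. The helper `assign_time_bins` is hypothetical, and the half-open bin convention it uses is an assumption for illustration; Turftopic's own binning may differ.

```python
import bisect
from datetime import datetime


def assign_time_bins(timestamps, time_bin_edges):
    """Return the index of the time bin each timestamp falls into.

    Assumes sorted edges and half-open bins [edge_i, edge_i+1); values on
    or beyond the outer edges are clamped into the first or last bin.
    """
    last_bin = len(time_bin_edges) - 2
    bins = []
    for ts in timestamps:
        index = bisect.bisect_right(time_bin_edges, ts) - 1
        bins.append(min(max(index, 0), last_bin))
    return bins


# Three edges define two bins: 2020 and 2021.
edges = [datetime(2020, 1, 1), datetime(2021, 1, 1), datetime(2022, 1, 1)]
stamps = [
    datetime(2020, 6, 1),
    datetime(2021, 1, 1),
    datetime(2021, 12, 31),
    datetime(2022, 1, 1),
]
bin_indices = assign_time_bins(stamps, edges)
print(bin_indices)
```

With n + 1 edges there are n bins, so per-bin quantities such as topic strength over time would have one entry per bin under this convention.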
Using TopicData
TopicData is a dict-like object and can, for all intents and purposes, be used as a Python dictionary. For convenience, you can also access its attributes with dot syntax:
# They are the same
assert topic_data["document_term_matrix"].shape == topic_data.document_term_matrix.shape
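The dict-plus-attribute behaviour can be sketched with a small read-only mapping. `DotMapping` below is a hypothetical class written for illustration, not Turftopic's implementation; it only shows how key access and dot access can refer to the same underlying values.

```python
from collections.abc import Mapping


class DotMapping(Mapping):
    """Hypothetical read-only mapping whose keys double as attributes."""

    def __init__(self, **fields):
        self._fields = dict(fields)

    def __getitem__(self, key):
        return self._fields[key]

    def __iter__(self):
        return iter(self._fields)

    def __len__(self):
        return len(self._fields)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._fields[name]
        except KeyError:
            raise AttributeError(name) from None


data = DotMapping(vocab=["cat", "dog"], topic_names=["0_pets"])
print(data["vocab"], data.vocab)
print(list(data))
```

Because the class implements the Mapping protocol, it also works with `dict(data)`, `len(data)`, `in` checks and `**data` unpacking.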
Much like with models, you can pretty-print information about topic models from the TopicData object. Since it contains more information about inference than the model object itself, you often have to pass fewer parameters than when calling the same method on the model:
model.print_representative_documents(0, corpus, document_topic_matrix)
# This is simpler with TopicData, since you only have to pass the topic ID
topic_data.print_representative_documents(0)
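The reason the topic ID is enough can be seen in a small sketch: once the corpus and the document-topic matrix travel together, a helper can look up everything else itself. The function `representative_documents` and the toy data are hypothetical, and ranking by highest document-topic importance is an assumption, not necessarily how Turftopic selects documents.

```python
def representative_documents(topic_id, corpus, document_topic_matrix, top_k=3):
    """Return the top_k documents with the highest importance for topic_id."""
    scores = [row[topic_id] for row in document_topic_matrix]
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return [corpus[i] for i in ranked[:top_k]]


# Toy inference results: four documents, two topics.
bundle = {
    "corpus": ["cats purr", "stocks fell", "dogs bark", "rates rose"],
    "document_topic_matrix": [
        [0.9, 0.1],
        [0.2, 0.8],
        [0.7, 0.3],
        [0.1, 0.9],
    ],
}

# With the bundle at hand, the caller only supplies the topic ID.
top_docs = representative_documents(0, **bundle, top_k=2)
print(top_docs)
```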
When producing figures, TopicData also gives you shorthands for accessing the topicwizard web app and Figures API:
topic_data.figures.topic_map()
See our guide on Model Interpretation for more info.
API Reference
turftopic.data.TopicData
Bases: Mapping, TopicContainer
Contains data about topic inference on a corpus. Can be used with multiple convenience and interpretation utilities.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
vocab | ndarray | Array of all words in the vocabulary of the topic model. | required |
document_term_matrix | ndarray | Bag-of-words document representations. Elements of the matrix are word importances/frequencies for given documents. | required |
document_topic_matrix | ndarray | Topic importances for each document. | required |
topic_term_matrix | ndarray | Importances of each term for each topic in a matrix. | required |
document_representation | ndarray | Embedded representations for documents. Can also be a sparse BoW matrix for classical models. | required |
topic_names | Optional[list[str]] | Names or topic descriptions inferred for topics by the model. | None |
classes | Optional[ndarray] | Topic IDs that might be different from 0-n_topics. (For instance if you have an outlier topic, which is labelled -1) | None |
corpus | Optional[list[str]] | The corpus on which inference was run. Can be None. | None |
transform | Optional[Callable] | Function that transforms documents to document-topic matrices. Can be None in the case of transductive models. | None |
time_bin_edges | Optional[list[datetime]] | Edges of the time bins in a dynamic topic model. | None |
temporal_components | Optional[ndarray] | Topic-term importances over time. Only relevant for dynamic topic models. | None |
temporal_importance | Optional[ndarray] | Topic strength signal over time. Only relevant for dynamic topic models. | None |
has_negative_side | bool | Indicates whether the topic model's components are supposed to be interpreted in both directions. e.g. in SemanticSignalSeparation, one is supposed to look at highest, but also lowest ranking words. This is in contrast to KeyNMF for instance, where only positive word importance should be considered. | False |
hierarchy | Optional[TopicNode] | Optional topic hierarchy for models that support hierarchical topic modeling. | None |
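The effect of has_negative_side on interpretation can be illustrated with a short sketch that reads one row of a topic-term matrix. The helper `top_terms` and the toy values are hypothetical; the sketch only assumes that higher values mean stronger positive association and lower values stronger negative association.

```python
def top_terms(vocab, topic_row, k=2, has_negative_side=False):
    """List the k highest-ranking terms and, optionally, the k lowest."""
    order = sorted(range(len(vocab)), key=lambda i: topic_row[i], reverse=True)
    result = {"positive": [vocab[i] for i in order[:k]]}
    if has_negative_side:
        # Walk the ranking from the bottom to get the lowest-ranking terms.
        result["negative"] = [vocab[i] for i in order[::-1][:k]]
    return result


vocab = ["goal", "match", "tax", "vote"]
topic_row = [0.8, 0.5, -0.6, -0.9]  # one row of a topic-term matrix

one_sided = top_terms(vocab, topic_row)
two_sided = top_terms(vocab, topic_row, has_negative_side=True)
print(one_sided)
print(two_sided)
```

For a model whose components only carry meaning in the positive direction, the "negative" list would not be informative, which is why the flag defaults to False.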
Source code in turftopic/data.py
figures property
Container object for topicwizard figures that can be generated from this TopicData object. You can use any of the interactive figures from the Figures API in topicwizard.
For instance:
topic_data.figures.topic_barcharts()
# or
topic_data.figures.topic_wordclouds()
from_disk(path) classmethod
Loads TopicData object from disk with Joblib.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str \| Path | Path to load the data from, e.g. "topic_data.joblib" | required |
Source code in turftopic/data.py
to_disk(path)
Saves TopicData object to disk.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str \| Path | Path to save the data to, e.g. "topic_data.joblib" | required |
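The save and load pair follows a common round-trip pattern: an instance method writes the object, and a classmethod reconstructs it. The sketch below shows that pattern with a hypothetical `SavableBundle` class. Turftopic uses Joblib according to the reference above; this sketch swaps in the standard library's pickle to stay dependency-free, so it is not the library's implementation.

```python
import pickle
import tempfile
from pathlib import Path


class SavableBundle:
    """Hypothetical stand-in for an object with to_disk/from_disk methods."""

    def __init__(self, vocab, topic_names):
        self.vocab = vocab
        self.topic_names = topic_names

    def to_disk(self, path):
        # Store plain fields rather than the instance itself.
        state = {"vocab": self.vocab, "topic_names": self.topic_names}
        with open(path, "wb") as out_file:
            pickle.dump(state, out_file)

    @classmethod
    def from_disk(cls, path):
        with open(path, "rb") as in_file:
            state = pickle.load(in_file)
        return cls(**state)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "bundle.pkl"
    SavableBundle(["cat", "dog"], ["0_pets"]).to_disk(path)
    restored = SavableBundle.from_disk(path)

print(restored.vocab, restored.topic_names)
```

As with any pickle-based format, files should only be loaded from sources you trust.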
Source code in turftopic/data.py
visualize_topicwizard(**kwargs)
Opens the topicwizard web app with which you can interactively investigate your model. See topicwizard's documentation for more detail.
Source code in turftopic/data.py