Hierarchical Topic Modeling
You might expect some topics in your corpus to form a hierarchy.
Some models in Turftopic let you investigate these hierarchical relations and build a taxonomy of topics in a corpus.
Models that can capture hierarchical relations expose a hierarchy property, which you can manipulate, print and visualize:
from turftopic import ClusteringTopicModel
model = ClusteringTopicModel(n_reduce_to=10).fit(corpus)
# We cut at level 3 for plotting, since the hierarchy is very deep
model.hierarchy.cut(3).plot_tree()
(Interactive tree plot: drag and click to zoom, hover to see word importance.)
1. Divisive/Top-down Hierarchical Modeling
In divisive modeling, you start from larger structures higher up in the hierarchy and divide topics into smaller subtopics on demand.
This is how hierarchical modeling works in KeyNMF, which does not discover a topic hierarchy by default, but lets you divide topics into as many subtopics as you see fit.
As a demonstration, let's load a corpus that we know to have hierarchical themes.
from sklearn.datasets import fetch_20newsgroups
corpus = fetch_20newsgroups(
    subset="all",
    remove=("headers", "footers", "quotes"),
    categories=[
        "comp.os.ms-windows.misc",
        "comp.sys.ibm.pc.hardware",
        "talk.religion.misc",
        "alt.atheism",
    ],
).data
In this case, we have two base themes: computers and religion.
Let us fit a KeyNMF model with two topics to see if it finds them.
from turftopic import KeyNMF
model = KeyNMF(2, top_n=15, random_state=42).fit(corpus)
model.print_topics()
Topic ID | Highest Ranking
0 | windows, dos, os, disk, card, drivers, file, pc, files, microsoft
1 | atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs
The results confirm our intuition. Topic 0 seems to revolve around IT, while Topic 1 revolves around atheism and religion.
We can already suspect, however, that more granular topics could be discovered in this corpus.
For instance, Topic 0 contains terms related to operating systems, like windows and dos, but also to hardware components, like disk and card.
We can access the topic hierarchy at this stage through the model's hierarchy property.
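For instance, printing the hierarchy shows the tree in its current state:
print(model.hierarchy)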
Root
├── 0: windows, dos, os, disk, card, drivers, file, pc, files, microsoft
└── 1: atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs
There isn't much to see yet: the hierarchy is flat, containing only the two topics we discovered, and we are at the root level.
We can dissect these topics by adding a level to the hierarchy.
Let us add three subtopics to each topic on the root level.
model.hierarchy.divide_children(n_subtopics=3)
Root
├── 0: windows, dos, os, disk, card, drivers, file, pc, files, microsoft
│ ├── 0.0: dos, file, disk, files, program, windows, disks, shareware, norton, memory
│ ├── 0.1: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform
│ └── 0.2: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati
...
As you can see, the model managed to identify meaningful subtopics of the two larger topics we found earlier.
Topic 0 was divided into a topic mostly concerned with dos and windows, a topic on operating systems in general, and one about hardware.
You can also divide individual topics into a number of subtopics using the divide() method.
Let us divide Topic 0.0 into 5 subtopics.
model.hierarchy[0][0].divide(5)
model.hierarchy
Root
├── 0: windows, dos, os, disk, card, drivers, file, pc, files, microsoft
│ ├── 0.0: dos, file, disk, files, program, windows, disks, shareware, norton, memory
│ │ ├── 0.0.1: file, files, ftp, bmp, program, windows, shareware, directory, bitmap, zip
│ │ ├── 0.0.2: os, windows, unix, microsoft, crash, apps, crashes, nt, pc, operating
...
2. Agglomerative/Bottom-up Hierarchical Modeling
In other models, hierarchies arise by starting from smaller, more specific topics and merging them based on their similarity until a desired number of top-level topics is obtained.
This is the approach taken in clustering topic models like BERTopic and Top2Vec.
Clustering models typically find a lot of topics, and it can help interpretation to merge them until you end up with 10-20 top-level topics.
You can either do this at fitting time by setting n_reduce_to on initialization, or manually with reduce_topics() after fitting (see the sketch below).
For more details, check our guide on Clustering models.
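As a minimal sketch of the manual route (assuming reduce_topics() is available on the fitted model as described above; the keyword name mirrors the n_reduce_to constructor argument and is an assumption here):

from turftopic import ClusteringTopicModel

model = ClusteringTopicModel().fit(corpus)
# Merge topics after fitting until only 10 top-level topics remain
model.reduce_topics(n_reduce_to=10)
print(model.hierarchy)

The example below instead passes n_reduce_to at initialization, so the reduction happens automatically during fitting: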
from turftopic import ClusteringTopicModel
model = ClusteringTopicModel(
    n_reduce_to=10,
    feature_importance="centroid",
    reduction_method="smallest",
    reduction_topic_representation="centroid",
    reduction_distance_metric="cosine",
)
model.fit(corpus)
print(model.hierarchy)
Root:
├── -1: documented, obsolete, et4000, concerns, dubious, embedded, hardware, xfree86, alternative, seeking
├── 20: hitter, pitching, batting, hitters, pitchers, fielder, shortstop, inning, baseman, pitcher
├── 284: nhl, goaltenders, canucks, sabres, hockey, bruins, puck, oilers, canadiens, flyers
│ ├── 242: sportschannel, espn, nbc, nhl, broadcasts, broadcasting, broadcast, mlb, cbs, cbc
│ │ ├── 171: stadium, tickets, mlb, ticket, sportschannel, mets, inning, nationals, schedule, cubs
│ │ │ └── ...
│ │ └── 21: sportschannel, nbc, espn, nhl, broadcasting, broadcasts, broadcast, hockey, cbc, cbs
│ └── 236: nhl, goaltenders, canucks, sabres, puck, oilers, andreychuk, bruins, goaltender, leafs
...
API reference
turftopic.hierarchical.TopicNode
dataclass
Node for a topic in a topic hierarchy.
Parameters:
Name | Type | Description | Default
model | ContextualModel | Underlying topic model, which the hierarchy is based on. | required
path | tuple[int] | Path that leads to this node from the root of the tree. | ()
word_importance | Optional[ndarray] | Importance of each word in the vocabulary for given topic. | None
document_topic_vector | Optional[ndarray] | Importance of the topic in all documents in the corpus. | None
children | Optional[list[TopicNode]] | List of subtopics within this topic. | None
Source code in turftopic/hierarchical.py
@dataclass
class TopicNode:
"""Node for a topic in a topic hierarchy.
Parameters
----------
model: ContextualModel
Underlying topic model, which the hierarchy is based on.
path: tuple[int], default ()
Path that leads to this node from the root of the tree.
word_importance: ndarray of shape (n_vocab), default None
Importance of each word in the vocabulary for given topic.
document_topic_vector: ndarray of shape (n_documents), default None
Importance of the topic in all documents in the corpus.
children: list[TopicNode], default None
List of subtopics within this topic.
"""
model: ContextualModel
path: tuple[int] = ()
word_importance: Optional[np.ndarray] = None
document_topic_vector: Optional[np.ndarray] = None
children: Optional[list[TopicNode]] = None
def _path_str(self):
return ".".join([str(level_id) for level_id in self.path])
@property
def classes_(self):
if self.children is None:
raise AttributeError("TopicNode doesn't have children.")
return np.array([child.path[-1] for child in self.children])
@property
def components_(self):
if self.children is None:
raise AttributeError("TopicNode doesn't have children.")
return np.stack([child.word_importance for child in self.children])
@classmethod
def create_root(
cls,
model: ContextualModel,
components: np.ndarray,
document_topic_matrix: np.ndarray,
) -> TopicNode:
"""Creates root node from a topic models' components and topic importances in documents."""
children = []
n_components = components.shape[0]
classes = getattr(model, "classes_", None)
if classes is None:
classes = np.arange(n_components)
for topic_id, comp, doc_top in zip(
classes, components, document_topic_matrix.T
):
children.append(
cls(
model,
path=(topic_id,),
word_importance=comp,
document_topic_vector=doc_top,
children=None,
)
)
return cls(
model,
path=(),
word_importance=None,
document_topic_vector=None,
children=children,
)
@property
def level(self) -> int:
"""Indicates how deep down the hierarchy the topic is."""
return len(self.path)
def get_words(self, top_k: int = 10) -> list[tuple[str, float]]:
"""Returns top words and words importances for the topic.
Parameters
----------
top_k: int, default 10
Number of top words to return.
Returns
-------
list[tuple[str, float]]
List of word, importance pairs.
"""
if self.word_importance is None:
return []
vocab = self.model.get_vocab()
most_important = np.argsort(-self.word_importance)[:top_k]
words = vocab[most_important]
imp = self.word_importance[most_important]
return list(zip(words, imp))
@property
def description(self) -> str:
"""Returns a high level description of the topic with its path in the tree
and top words."""
if not len(self.path):
path = "Root"
else:
path = str(
self.path[-1]
) # ".".join([str(idx) for idx in self.path])
words = []
for word, imp in self.get_words(top_k=10):
words.append(word)
concat_words = ", ".join(words)
color = COLOR_PER_LEVEL[min(self.level, len(COLOR_PER_LEVEL) - 1)]
stylized = f"[{color} bold]{path}[/]: [italic]{concat_words}[/]"
console = Console()
with console.capture() as capture:
console.print(stylized, end="")
return capture.get()
@property
def _simple_desc(self) -> str:
if not len(self.path):
path = "Root"
else:
path = str(
self.path[-1]
) # ".".join([str(idx) for idx in self.path])
words = []
for word, imp in self.get_words(top_k=5):
words.append(word)
concat_words = ", ".join(words)
return f"{path}: {concat_words}"
def _build_tree(
self,
tree: Tree = None,
top_k: int = 10,
max_depth: Optional[int] = None,
) -> Tree:
if tree is None:
tree = Tree(self.description)
else:
tree = tree.add(self.description)
out_of_depth = (max_depth is not None) and (self.level >= max_depth)
if out_of_depth:
if self.children is not None:
tree.add("...")
return tree
if self.children is not None:
for child in self.children:
child._build_tree(tree, max_depth=max_depth)
return tree
def print_tree(
self,
top_k: int = 10,
max_depth: Optional[int] = None,
) -> None:
"""Print hierarchy in tree form.
Parameters
----------
top_k: int, default 10
Number of words to print for each topic.
max_depth: int, default None
Maximum depth at which topics should be printed in the hierarchy.
If None, the entire hierarchy is printed.
"""
tree = self._build_tree(top_k=top_k, max_depth=max_depth)
console = Console()
console.print(tree)
def __str__(self):
tree = self._build_tree(top_k=10, max_depth=3)
console = Console()
with console.capture() as capture:
console.print(tree)
return capture.get()
def __repr__(self):
return str(self)
def __getitem__(self, id_or_path: int):
if self.children is None:
raise IndexError(
"Current node is a leaf and does not have children."
)
mapping = {
topic_class: i_topic
for i_topic, topic_class in enumerate(self.classes_)
}
return self.children[mapping[id_or_path]]
def __iter__(self):
return iter(self.children)
def plot_tree(self):
"""Plots hierarchy as an interactive tree in Plotly."""
return _tree_plot(self)
def _append_path(self, path_prefix: int):
self.path = (path_prefix, *self.path)
if self.children is not None:
for child in self.children:
child._append_path(path_prefix)
def copy(self, deep: bool = True) -> TopicNode:
"""Creates a copy of the given node.
Parameters
----------
deep: bool, default True
Indicates whether the copy should be deep or shallow.
Deep copies are done recursively, while shallow copies only
contain references to the original children.
Returns
-------
Copy of original hierarchy.
"""
if (self.children is None) or (not deep):
return type(self)(
model=self.model,
path=self.path,
children=self.children,
word_importance=self.word_importance,
document_topic_vector=self.document_topic_vector,
)
else:
children = [child.copy(deep=True) for child in self.children]
return type(self)(
model=self.model,
path=self.path,
children=children,
word_importance=self.word_importance,
document_topic_vector=self.document_topic_vector,
)
def cut(self, max_depth: int) -> TopicNode:
"""Cuts hierarchy at a given depth, returns copy of the hierarchy with levels beyond max_depth removed.
Parameters
----------
max_depth: int
Maximum level of nodes to keep.
Returns
-------
TopicNode
Hierarchy cut at the given level.
Contains a deep copy of the original nodes.
"""
if (self.level >= max_depth) or (not self.children):
return type(self)(
model=self.model,
path=self.path,
children=None,
word_importance=self.word_importance,
document_topic_vector=self.document_topic_vector,
)
else:
children = [child.cut(max_depth) for child in self.children]
return type(self)(
model=self.model,
path=self.path,
children=children,
word_importance=self.word_importance,
document_topic_vector=self.document_topic_vector,
)
def collect_leaves(self) -> list[TopicNode]:
def _collect_leaves(node: TopicNode, leaves: list[TopicNode]):
if not node.children:
leaves.append(node.copy(deep=False))
else:
for child in node.children:
_collect_leaves(child, leaves)
leaves = []
_collect_leaves(self, leaves)
return leaves
def flatten(self) -> TopicNode:
"""Returns new hierarchy with only the leaves of the tree.
Returns
-------
TopicNode
Root node containing all leaves in a hierarchy.
Copies of the original nodes.
"""
leaves = self.collect_leaves()
ids = [leaf.path[-1] for leaf in leaves]
# If the IDs are not unique, we label them from 0 to N
if len(set(ids)) != len(ids):
current = 0
new_ids = []
for node_id in ids:
if node_id != -1:
new_ids.append(current)
current += 1
else:
new_ids.append(-1)
ids = new_ids
for leaf_id, leaf in zip(ids, leaves):
leaf.path = (*self.path, leaf_id)
return type(self)(
model=self.model,
path=self.path,
word_importance=self.word_importance,
document_topic_vector=self.document_topic_vector,
children=leaves,
)
description: str (property)
Returns a high level description of the topic with its path in the tree and top words.
level: int (property)
Indicates how deep down the hierarchy the topic is.
copy(deep=True)
Creates a copy of the given node.
Parameters:
Name | Type | Description | Default
deep | bool | Indicates whether the copy should be deep or shallow. Deep copies are done recursively, while shallow copies only contain references to the original children. | True

Returns:
Type | Description
TopicNode | Copy of original hierarchy.
Source code in turftopic/hierarchical.py
def copy(self, deep: bool = True) -> TopicNode:
"""Creates a copy of the given node.
Parameters
----------
deep: bool, default True
Indicates whether the copy should be deep or shallow.
Deep copies are done recursively, while shallow copies only
contain references to the original children.
Returns
-------
Copy of original hierarchy.
"""
if (self.children is None) or (not deep):
return type(self)(
model=self.model,
path=self.path,
children=self.children,
word_importance=self.word_importance,
document_topic_vector=self.document_topic_vector,
)
else:
children = [child.copy(deep=True) for child in self.children]
return type(self)(
model=self.model,
path=self.path,
children=children,
word_importance=self.word_importance,
document_topic_vector=self.document_topic_vector,
)
create_root(model, components, document_topic_matrix)
classmethod
Creates a root node from a topic model's components and topic importances in documents.
Source code in turftopic/hierarchical.py
@classmethod
def create_root(
cls,
model: ContextualModel,
components: np.ndarray,
document_topic_matrix: np.ndarray,
) -> TopicNode:
"""Creates root node from a topic models' components and topic importances in documents."""
children = []
n_components = components.shape[0]
classes = getattr(model, "classes_", None)
if classes is None:
classes = np.arange(n_components)
for topic_id, comp, doc_top in zip(
classes, components, document_topic_matrix.T
):
children.append(
cls(
model,
path=(topic_id,),
word_importance=comp,
document_topic_vector=doc_top,
children=None,
)
)
return cls(
model,
path=(),
word_importance=None,
document_topic_vector=None,
children=children,
)
cut(max_depth)
Cuts hierarchy at a given depth, returns copy of the hierarchy with levels beyond max_depth removed.
Parameters:
Name | Type | Description | Default
max_depth | int | Maximum level of nodes to keep. | required

Returns:
Type | Description
TopicNode | Hierarchy cut at the given level. Contains a deep copy of the original nodes.
Source code in turftopic/hierarchical.py
def cut(self, max_depth: int) -> TopicNode:
"""Cuts hierarchy at a given depth, returns copy of the hierarchy with levels beyond max_depth removed.
Parameters
----------
max_depth: int
Maximum level of nodes to keep.
Returns
-------
TopicNode
Hierarchy cut at the given level.
Contains a deep copy of the original nodes.
"""
if (self.level >= max_depth) or (not self.children):
return type(self)(
model=self.model,
path=self.path,
children=None,
word_importance=self.word_importance,
document_topic_vector=self.document_topic_vector,
)
else:
children = [child.cut(max_depth) for child in self.children]
return type(self)(
model=self.model,
path=self.path,
children=children,
word_importance=self.word_importance,
document_topic_vector=self.document_topic_vector,
)
flatten()
Returns new hierarchy with only the leaves of the tree.
Returns:
Type | Description
TopicNode | Root node containing all leaves in a hierarchy. Copies of the original nodes.
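For example, to work with only the most specific topics in a divided hierarchy, you can flatten it (a minimal sketch):

flat = model.hierarchy.flatten()   # root node whose children are copies of all leaf topics
flat.print_tree()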
Source code in turftopic/hierarchical.py
def flatten(self) -> TopicNode:
"""Returns new hierarchy with only the leaves of the tree.
Returns
-------
TopicNode
Root node containing all leaves in a hierarchy.
Copies of the original nodes.
"""
leaves = self.collect_leaves()
ids = [leaf.path[-1] for leaf in leaves]
# If the IDs are not unique, we label them from 0 to N
if len(set(ids)) != len(ids):
current = 0
new_ids = []
for node_id in ids:
if node_id != -1:
new_ids.append(current)
current += 1
else:
new_ids.append(-1)
ids = new_ids
for leaf_id, leaf in zip(ids, leaves):
leaf.path = (*self.path, leaf_id)
return type(self)(
model=self.model,
path=self.path,
word_importance=self.word_importance,
document_topic_vector=self.document_topic_vector,
children=leaves,
)
get_words(top_k=10)
Returns the top words and word importances for the topic.
Parameters:
Name | Type | Description | Default
top_k | int | Number of top words to return. | 10

Returns:
Type | Description
list[tuple[str, float]] | List of word, importance pairs.
Source code in turftopic/hierarchical.py
def get_words(self, top_k: int = 10) -> list[tuple[str, float]]:
"""Returns top words and words importances for the topic.
Parameters
----------
top_k: int, default 10
Number of top words to return.
Returns
-------
list[tuple[str, float]]
List of word, importance pairs.
"""
if self.word_importance is None:
return []
vocab = self.model.get_vocab()
most_important = np.argsort(-self.word_importance)[:top_k]
words = vocab[most_important]
imp = self.word_importance[most_important]
return list(zip(words, imp))
plot_tree()
Plots hierarchy as an interactive tree in Plotly.
Source code in turftopic/hierarchical.py
def plot_tree(self):
"""Plots hierarchy as an interactive tree in Plotly."""
return _tree_plot(self)
print_tree(top_k=10, max_depth=None)
Print hierarchy in tree form.
Parameters:
Name | Type | Description | Default
top_k | int | Number of words to print for each topic. | 10
max_depth | Optional[int] | Maximum depth at which topics should be printed in the hierarchy. If None, the entire hierarchy is printed. | None
Source code in turftopic/hierarchical.py
def print_tree(
self,
top_k: int = 10,
max_depth: Optional[int] = None,
) -> None:
"""Print hierarchy in tree form.
Parameters
----------
top_k: int, default 10
Number of words to print for each topic.
max_depth: int, default None
Maximum depth at which topics should be printed in the hierarchy.
If None, the entire hierarchy is printed.
"""
tree = self._build_tree(top_k=top_k, max_depth=max_depth)
console = Console()
console.print(tree)
turftopic.hierarchical.DivisibleTopicNode
dataclass
Bases: TopicNode
Node for a topic in a topic hierarchy that can be subdivided.
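A short sketch of the divisive workflow on a model such as KeyNMF, continuing the example from the divisive section above (topic IDs are illustrative):

node = model.hierarchy[0]
node.divide(3)       # split this topic into 3 subtopics
node.clear()         # remove the subtopics again
model.hierarchy.divide_children(n_subtopics=3)  # split every root-level topic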
Source code in turftopic/hierarchical.py
@dataclass
class DivisibleTopicNode(TopicNode):
"""Node for a topic in a topic hierarchy that can be subdivided."""
def clear(self):
"""Deletes children of the given node."""
self.children = None
return self
def divide(self, n_subtopics: int, **kwargs):
"""Divides current node into smaller subtopics.
Only works when the underlying model is a divisive hierarchical model.
Parameters
----------
n_subtopics: int
Number of topics to divide the topic into.
"""
try:
self.children = self.model.divide_topic(
node=self, n_subtopics=n_subtopics, **kwargs
)
except AttributeError as e:
raise AttributeError(
"Looks like your model is not a divisive hierarchical model."
) from e
return self
def divide_children(self, n_subtopics: int, **kwargs):
"""Divides all children of the current node to smaller topics.
Only works when the underlying model is a divisive hierarchical model.
Parameters
----------
n_subtopics: int
Number of topics to divide the topics into.
"""
if self.children is None:
raise ValueError(
"Current Node is a leaf, children can't be subdivided."
)
for child in self.children:
child.divide(n_subtopics, **kwargs)
return self
def __str__(self):
tree = self._build_tree(top_k=10, max_depth=3)
console = Console()
with console.capture() as capture:
console.print(tree)
return capture.get()
def __repr__(self):
return str(self)
clear()
Deletes children of the given node.
Source code in turftopic/hierarchical.py
def clear(self):
"""Deletes children of the given node."""
self.children = None
return self
divide(n_subtopics, **kwargs)
Divides current node into smaller subtopics.
Only works when the underlying model is a divisive hierarchical model.
Parameters:
Name | Type | Description | Default
n_subtopics | int | Number of topics to divide the topic into. | required
Source code in turftopic/hierarchical.py
def divide(self, n_subtopics: int, **kwargs):
"""Divides current node into smaller subtopics.
Only works when the underlying model is a divisive hierarchical model.
Parameters
----------
n_subtopics: int
Number of topics to divide the topic into.
"""
try:
self.children = self.model.divide_topic(
node=self, n_subtopics=n_subtopics, **kwargs
)
except AttributeError as e:
raise AttributeError(
"Looks like your model is not a divisive hierarchical model."
) from e
return self
divide_children(n_subtopics, **kwargs)
Divides all children of the current node into smaller topics.
Only works when the underlying model is a divisive hierarchical model.
Parameters:
Name | Type | Description | Default
n_subtopics | int | Number of topics to divide the topics into. | required
Source code in turftopic/hierarchical.py
def divide_children(self, n_subtopics: int, **kwargs):
"""Divides all children of the current node to smaller topics.
Only works when the underlying model is a divisive hierarchical model.
Parameters
----------
n_subtopics: int
Number of topics to divide the topics into.
"""
if self.children is None:
raise ValueError(
"Current Node is a leaf, children can't be subdivided."
)
for child in self.children:
child.divide(n_subtopics, **kwargs)
return self
turftopic.models._hierarchical_clusters.ClusterNode
Bases: TopicNode
Hierarchical Topic Node for clustering models.
Supports merging topics based on a hierarchical merging strategy.
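A hedged sketch of merging on a clustering model's hierarchy, assuming that model.hierarchy is a ClusterNode as this API reference suggests (the topic IDs are illustrative):

from turftopic import ClusteringTopicModel

model = ClusteringTopicModel().fit(corpus)
# Merge two specific topics into one joint topic
model.hierarchy.join_topics([20, 21])
# Or merge agglomeratively until 10 top-level topics remain
model.hierarchy.reduce_topics(n_reduce_to=10, method="average", metric="cosine")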
Source code in turftopic/models/_hierarchical_clusters.py
class ClusterNode(TopicNode):
"""Hierarchical Topic Node for clustering models.
Supports merging topics based on a hierarchical merging strategy."""
@classmethod
def create_root(cls, model: ContextualModel, labels: np.ndarray):
"""Creates root node from a topic models' components and topic importances in documents."""
classes = np.sort(np.unique(labels))
document_topic_matrix = safe_binarize(labels, classes=classes)
children = []
for topic_id, doc_top in zip(classes, document_topic_matrix.T):
children.append(
cls(
model,
path=(topic_id,),
document_topic_vector=doc_top,
children=None,
)
)
res = cls(
model,
path=(),
word_importance=None,
document_topic_vector=None,
children=children,
)
res.estimate_components()
return res
def join_topics(
self, to_join: Sequence[int], joint_id: Optional[int] = None
):
"""Joins a number of topics into a new topic with a given ID.
Parameters
----------
to_join: Sequence of int
Children in the hierarchy to join (IDs indicate the last element of the path).
joint_id: int, default None
ID to give to the joint topic. By default, this will be the topic with the smallest ID.
"""
if self.children is None:
raise TypeError("Node doesn't have children, can't merge.")
if len(set(to_join)) < len(to_join):
raise ValueError(
f"You can't join a cluster with itself: {to_join}"
)
if joint_id is None:
joint_id = min(to_join)
children = [self[i] for i in to_join]
joint_membership = np.stack(
[child.document_topic_vector for child in children]
)
joint_membership = np.sum(joint_membership, axis=0)
child_ids = [child.path[-1] for child in children]
joint_node = TopicNode(
model=self.model,
children=children,
document_topic_vector=joint_membership,
path=(*self.path, joint_id),
)
for child in joint_node:
child._append_path(joint_id)
self.children = [
child for child in self.children if child.path[-1] not in child_ids
] + [joint_node]
component_map = self._estimate_children_components()
for child in self.children:
child.word_importance = component_map[child.path[-1]]
def estimate_components(self) -> np.ndarray:
component_map = self._estimate_children_components()
for child in self.children:
child.word_importance = component_map[child.path[-1]]
return self.components_
@property
def labels_(self) -> np.ndarray:
topic_document_membership = np.stack(
[child.document_topic_vector for child in self.children]
)
labels = np.argmax(topic_document_membership, axis=0)
strength = np.max(topic_document_membership, axis=0)
# documents that are not in this part of the hierarchy are treated as outliers
labels[strength == 0] = -1
return np.array(
[self.children[label].path[-1] for label in labels if label != -1]
)
def _estimate_children_components(self) -> dict[int, np.ndarray]:
"""Estimates feature importances based on a fitted clustering."""
clusters = np.unique(self.labels_)
classes = np.sort(clusters)
labels = self.labels_
topic_vectors = self.model._calculate_topic_vectors(
classes=classes, labels=labels
)
document_topic_matrix = safe_binarize(labels, classes=classes)
if self.model.feature_importance == "soft-c-tf-idf":
components = soft_ctf_idf(
document_topic_matrix, self.model.doc_term_matrix
) # type: ignore
elif self.model.feature_importance == "centroid":
if not hasattr(self.model, "vocab_embeddings"):
self.model.vocab_embeddings = self.model.encode_documents(
self.model.vectorizer.get_feature_names_out()
) # type: ignore
if (
self.model.vocab_embeddings.shape[1]
!= topic_vectors.shape[1]
):
raise ValueError(
NOT_MATCHING_ERROR.format(
n_dims=topic_vectors.shape[1],
n_word_dims=self.model.vocab_embeddings.shape[1],
)
)
components = cluster_centroid_distance(
topic_vectors,
self.model.vocab_embeddings,
)
elif self.model.feature_importance == "bayes":
components = bayes_rule(
document_topic_matrix, self.model.doc_term_matrix
)
else:
components = ctf_idf(
document_topic_matrix, self.model.doc_term_matrix
)
return dict(zip(classes, components))
def _merge_clusters(self, linkage_matrix: np.ndarray):
classes = self.classes_
max_class = len(classes[classes != -1])
for i_cluster, (left, right, *_) in enumerate(linkage_matrix):
self.join_topics(
[int(left), int(right)], int(max_class + i_cluster)
)
def _calculate_linkage(
self, n_reduce_to: int, method: str = "average", metric: str = "cosine"
) -> np.ndarray:
if method not in VALID_LINKAGE_METHODS:
raise ValueError(
f"Linkage method has to be one of: {VALID_LINKAGE_METHODS}, but got {method} instead."
)
classes = self.classes_
labels = self.labels_
topic_sizes = np.array([np.sum(labels == label) for label in classes])
topic_representations = self.model.topic_representations
if method == "smallest":
return smallest_linkage(
n_reduce_to=n_reduce_to,
topic_vectors=topic_representations,
topic_sizes=topic_sizes,
classes=classes,
metric=metric,
)
else:
n_classes = len(classes[classes != -1])
topic_vectors = topic_representations[classes != -1]
n_reductions = n_classes - n_reduce_to
return linkage(topic_vectors, method=method, metric=metric)[
:n_reductions
]
def reduce_topics(
self, n_reduce_to: int, method: str = "average", metric: str = "cosine"
):
n_topics = np.sum(self.classes_ != -1)
if n_topics <= n_reduce_to:
warnings.warn(
f"Number of clusters is already {n_topics} <= {n_reduce_to}, nothing to do."
)
return
linkage_matrix = self._calculate_linkage(
n_reduce_to, method=method, metric=metric
)
self.linkage_matrix_ = linkage_matrix
self._merge_clusters(linkage_matrix)
create_root(model, labels)
classmethod
Creates a root node from a topic model's components and topic importances in documents.
Source code in turftopic/models/_hierarchical_clusters.py
@classmethod
def create_root(cls, model: ContextualModel, labels: np.ndarray):
"""Creates root node from a topic models' components and topic importances in documents."""
classes = np.sort(np.unique(labels))
document_topic_matrix = safe_binarize(labels, classes=classes)
children = []
for topic_id, doc_top in zip(classes, document_topic_matrix.T):
children.append(
cls(
model,
path=(topic_id,),
document_topic_vector=doc_top,
children=None,
)
)
res = cls(
model,
path=(),
word_importance=None,
document_topic_vector=None,
children=children,
)
res.estimate_components()
return res
join_topics(to_join, joint_id=None)
Joins a number of topics into a new topic with a given ID.
Parameters:
Name | Type | Description | Default
to_join | Sequence[int] | Children in the hierarchy to join (IDs indicate the last element of the path). | required
joint_id | Optional[int] | ID to give to the joint topic. By default, this will be the topic with the smallest ID. | None
Source code in turftopic/models/_hierarchical_clusters.py
def join_topics(
self, to_join: Sequence[int], joint_id: Optional[int] = None
):
"""Joins a number of topics into a new topic with a given ID.
Parameters
----------
to_join: Sequence of int
Children in the hierarchy to join (IDs indicate the last element of the path).
joint_id: int, default None
ID to give to the joint topic. By default, this will be the topic with the smallest ID.
"""
if self.children is None:
raise TypeError("Node doesn't have children, can't merge.")
if len(set(to_join)) < len(to_join):
raise ValueError(
f"You can't join a cluster with itself: {to_join}"
)
if joint_id is None:
joint_id = min(to_join)
children = [self[i] for i in to_join]
joint_membership = np.stack(
[child.document_topic_vector for child in children]
)
joint_membership = np.sum(joint_membership, axis=0)
child_ids = [child.path[-1] for child in children]
joint_node = TopicNode(
model=self.model,
children=children,
document_topic_vector=joint_membership,
path=(*self.path, joint_id),
)
for child in joint_node:
child._append_path(joint_id)
self.children = [
child for child in self.children if child.path[-1] not in child_ids
] + [joint_node]
component_map = self._estimate_children_components()
for child in self.children:
child.word_importance = component_map[child.path[-1]]