
Hierarchical Topic Modeling

Some of the topics in a corpus may naturally belong to a hierarchy. Several models in Turftopic allow you to investigate these hierarchical relations and build a taxonomy of topics in a corpus.

Models in Turftopic that can model hierarchical relations have a hierarchy property, which you can manipulate, print, and visualize:

from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(n_reduce_to=10).fit(corpus)
# We cut at level 3 for plotting, since the hierarchy is very deep
model.hierarchy.cut(3).plot_tree()

Drag and click to zoom, hover to see word importance

1. Divisive/Top-down Hierarchical Modeling

In divisive modeling, you start from larger structures higher up in the hierarchy and divide topics into smaller subtopics on demand. This is how hierarchical modeling works in KeyNMF, which does not discover a topic hierarchy by default, but lets you divide topics into as many subtopics as you see fit.
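To build intuition for what dividing a topic means, here is a rough, hypothetical sketch of the general idea using scikit-learn's NMF: fit a coarse model, then re-fit a smaller model on only the documents dominated by one topic. This is not KeyNMF's actual implementation, just an illustration of the divisive principle.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus with two coarse themes, each containing finer structure
docs = [
    "windows dos file disk",
    "disk drive card memory",
    "windows os microsoft dos",
    "card monitor vga drivers",
    "god religion belief church",
    "atheism atheist religion belief",
]
X = TfidfVectorizer().fit_transform(docs)

# Top-level model with two topics
top = NMF(n_components=2, init="nndsvda", random_state=0).fit(X)
doc_topic = top.transform(X)

# Divide the larger topic: re-fit a smaller NMF on the documents
# that are dominated by it
assignments = doc_topic.argmax(axis=1)
dominant = np.bincount(assignments).argmax()
sub = NMF(n_components=2, init="random", random_state=0).fit(
    X[assignments == dominant]
)
print(sub.components_.shape)  # (2, vocabulary size): word importances per subtopic
```

Each row of `sub.components_` plays the role of a subtopic's word-importance vector, analogous to the nodes KeyNMF attaches to the hierarchy when you divide a topic.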

As a demonstration, let's load a corpus that we know to have hierarchical themes.

from sklearn.datasets import fetch_20newsgroups

corpus = fetch_20newsgroups(
    subset="all",
    remove=("headers", "footers", "quotes"),
    categories=[
        "comp.os.ms-windows.misc",
        "comp.sys.ibm.pc.hardware",
        "talk.religion.misc",
        "alt.atheism",
    ],
).data

In this case, we have two base themes: computers and religion. Let us fit a KeyNMF model with two topics to see whether the model finds them.

from turftopic import KeyNMF

model = KeyNMF(2, top_n=15, random_state=42).fit(corpus)
model.print_topics()
| Topic ID | Highest Ranking |
|---|---|
| 0 | windows, dos, os, disk, card, drivers, file, pc, files, microsoft |
| 1 | atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs |

The results confirm our intuition. Topic 0 seems to revolve around IT, while Topic 1 revolves around atheism and religion. We can already suspect, however, that more granular topics could be discovered in this corpus. For instance, Topic 0 contains terms related to operating systems, like windows and dos, but also to components, like disk and card.

We can access the hierarchy of topics in the model at its current stage with the model's hierarchy property.

print(model.hierarchy)
Root
├── 0: windows, dos, os, disk, card, drivers, file, pc, files, microsoft
└── 1: atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs

There isn't much to see yet: the model contains a flat hierarchy of the two topics we discovered, and we are at the root level. We can dissect these topics by adding a level to the hierarchy.

Let us add 3 subtopics to each topic at the root level.

model.hierarchy.divide_children(n_subtopics=3)
Root
├── 0: windows, dos, os, disk, card, drivers, file, pc, files, microsoft
│ ├── 0.0: dos, file, disk, files, program, windows, disks, shareware, norton, memory
│ ├── 0.1: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform
│ └── 0.2: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati
...

As you can see, the model managed to identify meaningful subtopics of the two larger topics we found earlier. Topic 0 was divided into a topic mostly concerned with DOS and Windows, a topic on operating systems in general, and one about hardware.

You can also divide individual topics into a given number of subtopics using the divide() method. Let us divide Topic 0.0 into 5 subtopics.

model.hierarchy[0][0].divide(5)
model.hierarchy
Root
├── 0: windows, dos, os, disk, card, drivers, file, pc, files, microsoft
│ ├── 0.0: dos, file, disk, files, program, windows, disks, shareware, norton, memory
│ │ ├── 0.0.1: file, files, ftp, bmp, program, windows, shareware, directory, bitmap, zip
│ │ ├── 0.0.2: os, windows, unix, microsoft, crash, apps, crashes, nt, pc, operating
...
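Addresses like 0.0.1 work because each node stores its full path from the root, and indexing a node looks its children up by topic ID. The following is a minimal, hypothetical sketch of that indexing scheme (not Turftopic's own TopicNode class), just to show how chained lookups like `model.hierarchy[0][0]` resolve:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    path: tuple = ()  # e.g. (0, 0, 1) renders as "0.0.1"
    children: list = field(default_factory=list)

    def add_child(self, topic_id):
        # A child's path extends the parent's path by one ID
        child = Node(path=(*self.path, topic_id))
        self.children.append(child)
        return child

    def __getitem__(self, topic_id):
        # Look children up by the last element of their path,
        # mirroring how hierarchy[0][0] addresses Topic 0.0
        for child in self.children:
            if child.path[-1] == topic_id:
                return child
        raise KeyError(topic_id)

# Build a two-level toy hierarchy: 2 topics with 3 subtopics each
root = Node()
for i in range(2):
    parent = root.add_child(i)
    for j in range(3):
        parent.add_child(j)

print(".".join(map(str, root[0][1].path)))  # → 0.1
```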

2. Agglomerative/Bottom-up Hierarchical Modeling

In other models, hierarchies arise by starting from smaller, more specific topics and merging them based on their similarity until a desired number of top-level topics is obtained.

This is how it is done in clustering topic models like BERTopic and Top2Vec. Clustering models typically find a large number of topics, and merging them until you have 10-20 top-level topics can help with interpretation.
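The merging loop can be sketched roughly as follows. This is a simplified, hypothetical illustration of a "smallest" reduction strategy (repeatedly folding the smallest topic into its most similar neighbor by cosine similarity), not Turftopic's actual implementation:

```python
import numpy as np

def reduce_topics(topic_vectors, sizes, n_reduce_to):
    """Fold the smallest topic into its most similar (cosine) neighbor
    until only n_reduce_to topics remain."""
    vecs = [np.asarray(v, dtype=float) for v in topic_vectors]
    sizes = list(sizes)
    while len(vecs) > n_reduce_to:
        smallest = int(np.argmin(sizes))
        small = vecs[smallest]
        sims = np.array([
            small @ v / (np.linalg.norm(small) * np.linalg.norm(v))
            for v in vecs
        ])
        sims[smallest] = -np.inf  # don't merge a topic with itself
        target = int(np.argmax(sims))
        # Merge into a size-weighted centroid and pool the sizes
        total = sizes[smallest] + sizes[target]
        vecs[target] = (
            sizes[smallest] * small + sizes[target] * vecs[target]
        ) / total
        sizes[target] = total
        del vecs[smallest], sizes[smallest]
    return vecs, sizes

# Four toy topic vectors: two "computer-like", two "religion-like"
vecs, sizes = reduce_topics(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]],
    sizes=[5, 1, 5, 1],
    n_reduce_to=2,
)
print(sizes)  # → [6, 6]
```

Each merge step records which topics were joined, which is exactly the information a bottom-up hierarchy is built from: every merge becomes a parent node with the merged topics as children.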

You can either do this by default on a clustering model by setting n_reduce_to on initialization, or do it manually with reduce_topics(). For more details, check our guide on Clustering models.

from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(
    n_reduce_to=10,
    feature_importance="centroid",
    reduction_method="smallest",
    reduction_topic_representation="centroid",
    reduction_distance_metric="cosine",
)
model.fit(corpus)

print(model.hierarchy)
Root:
├── -1: documented, obsolete, et4000, concerns, dubious, embedded, hardware, xfree86, alternative, seeking
├── 20: hitter, pitching, batting, hitters, pitchers, fielder, shortstop, inning, baseman, pitcher
├── 284: nhl, goaltenders, canucks, sabres, hockey, bruins, puck, oilers, canadiens, flyers
│ ├── 242: sportschannel, espn, nbc, nhl, broadcasts, broadcasting, broadcast, mlb, cbs, cbc
│ │ ├── 171: stadium, tickets, mlb, ticket, sportschannel, mets, inning, nationals, schedule, cubs
│ │ │ └── ...
│ │ └── 21: sportschannel, nbc, espn, nhl, broadcasting, broadcasts, broadcast, hockey, cbc, cbs
│ └── 236: nhl, goaltenders, canucks, sabres, puck, oilers, andreychuk, bruins, goaltender, leafs
...
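When topics are merged in a clustering hierarchy, the joint topic's document membership is the sum of its children's document-topic vectors (this mirrors what ClusterNode.join_topics does in the API reference below). A minimal numpy illustration with hypothetical one-hot membership vectors:

```python
import numpy as np

# Hypothetical one-hot membership vectors over five documents
# for two topics about to be joined
topic_a = np.array([1, 0, 0, 1, 0])
topic_b = np.array([0, 0, 1, 0, 0])

# The joint topic is a member of every document either child covered
joint = topic_a + topic_b
print(joint.tolist())  # → [1, 0, 1, 1, 0]
```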

API reference

turftopic.hierarchical.TopicNode dataclass

Node for a topic in a topic hierarchy.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model | ContextualModel | Underlying topic model, which the hierarchy is based on. | required |
| path | tuple[int] | Path that leads to this node from the root of the tree. | () |
| word_importance | Optional[ndarray] | Importance of each word in the vocabulary for the given topic. | None |
| document_topic_vector | Optional[ndarray] | Importance of the topic in all documents in the corpus. | None |
| children | Optional[list[TopicNode]] | List of subtopics within this topic. | None |
Source code in turftopic/hierarchical.py
@dataclass
class TopicNode:
    """Node for a topic in a topic hierarchy.

    Parameters
    ----------
    model: ContextualModel
        Underlying topic model, which the hierarchy is based on.
    path: tuple[int], default ()
        Path that leads to this node from the root of the tree.
    word_importance: ndarray of shape (n_vocab), default None
        Importance of each word in the vocabulary for given topic.
    document_topic_vector: ndarray of shape (n_documents), default None
        Importance of the topic in all documents in the corpus.
    children: list[TopicNode], default None
        List of subtopics within this topic.
    """

    model: ContextualModel
    path: tuple[int] = ()
    word_importance: Optional[np.ndarray] = None
    document_topic_vector: Optional[np.ndarray] = None
    children: Optional[list[TopicNode]] = None

    def _path_str(self):
        return ".".join([str(level_id) for level_id in self.path])

    @property
    def classes_(self):
        if self.children is None:
            raise AttributeError("TopicNode doesn't have children.")
        return np.array([child.path[-1] for child in self.children])

    @property
    def components_(self):
        if self.children is None:
            raise AttributeError("TopicNode doesn't have children.")
        return np.stack([child.word_importance for child in self.children])

    @classmethod
    def create_root(
        cls,
        model: ContextualModel,
        components: np.ndarray,
        document_topic_matrix: np.ndarray,
    ) -> TopicNode:
        """Creates root node from a topic models' components and topic importances in documents."""
        children = []
        n_components = components.shape[0]
        classes = getattr(model, "classes_", None)
        if classes is None:
            classes = np.arange(n_components)
        for topic_id, comp, doc_top in zip(
            classes, components, document_topic_matrix.T
        ):
            children.append(
                cls(
                    model,
                    path=(topic_id,),
                    word_importance=comp,
                    document_topic_vector=doc_top,
                    children=None,
                )
            )
        return cls(
            model,
            path=(),
            word_importance=None,
            document_topic_vector=None,
            children=children,
        )

    @property
    def level(self) -> int:
        """Indicates how deep down the hierarchy the topic is."""
        return len(self.path)

    def get_words(self, top_k: int = 10) -> list[tuple[str, float]]:
        """Returns top words and words importances for the topic.

        Parameters
        ----------
        top_k: int, default 10
            Number of top words to return.

        Returns
        -------
        list[tuple[str, float]]
            List of word, importance pairs.
        """
        if self.word_importance is None:
            return []
        vocab = self.model.get_vocab()
        most_important = np.argsort(-self.word_importance)[:top_k]
        words = vocab[most_important]
        imp = self.word_importance[most_important]
        return list(zip(words, imp))

    @property
    def description(self) -> str:
        """Returns a high level description of the topic with its path in the tree
        and top words."""
        if not len(self.path):
            path = "Root"
        else:
            path = str(
                self.path[-1]
            )  # ".".join([str(idx) for idx in self.path])
        words = []
        for word, imp in self.get_words(top_k=10):
            words.append(word)
        concat_words = ", ".join(words)
        color = COLOR_PER_LEVEL[min(self.level, len(COLOR_PER_LEVEL) - 1)]
        stylized = f"[{color} bold]{path}[/]: [italic]{concat_words}[/]"
        console = Console()
        with console.capture() as capture:
            console.print(stylized, end="")
        return capture.get()

    @property
    def _simple_desc(self) -> str:
        if not len(self.path):
            path = "Root"
        else:
            path = str(
                self.path[-1]
            )  # ".".join([str(idx) for idx in self.path])
        words = []
        for word, imp in self.get_words(top_k=5):
            words.append(word)
        concat_words = ", ".join(words)
        return f"{path}: {concat_words}"

    def _build_tree(
        self,
        tree: Tree = None,
        top_k: int = 10,
        max_depth: Optional[int] = None,
    ) -> Tree:
        if tree is None:
            tree = Tree(self.description)
        else:
            tree = tree.add(self.description)
        out_of_depth = (max_depth is not None) and (self.level >= max_depth)
        if out_of_depth:
            if self.children is not None:
                tree.add("...")
            return tree
        if self.children is not None:
            for child in self.children:
                child._build_tree(tree, max_depth=max_depth)
        return tree

    def print_tree(
        self,
        top_k: int = 10,
        max_depth: Optional[int] = None,
    ) -> None:
        """Print hierarchy in tree form.

        Parameters
        ----------
        top_k: int, default 10
            Number of words to print for each topic.
        max_depth: int, default None
            Maximum depth at which topics should be printed in the hierarchy.
            If None, the entire hierarchy is printed.
        """
        tree = self._build_tree(top_k=top_k, max_depth=max_depth)
        console = Console()
        console.print(tree)

    def __str__(self):
        tree = self._build_tree(top_k=10, max_depth=3)
        console = Console()
        with console.capture() as capture:
            console.print(tree)
        return capture.get()

    def __repr__(self):
        return str(self)

    def __getitem__(self, id_or_path: int):
        if self.children is None:
            raise IndexError(
                "Current node is a leaf and does not have children."
            )
        mapping = {
            topic_class: i_topic
            for i_topic, topic_class in enumerate(self.classes_)
        }
        return self.children[mapping[id_or_path]]

    def __iter__(self):
        return iter(self.children)

    def plot_tree(self):
        """Plots hierarchy as an interactive tree in Plotly."""
        return _tree_plot(self)

    def _append_path(self, path_prefix: int):
        self.path = (path_prefix, *self.path)
        if self.children is not None:
            for child in self.children:
                child._append_path(path_prefix)

    def copy(self, deep: bool = True) -> TopicNode:
        """Creates a copy of the given node.

        Parameters
        ----------
        deep: bool, default True
            Indicates whether the copy should be deep or shallow.
            Deep copies are done recursively, while shallow copies only
            contain references to the original children.

        Returns
        -------
        Copy of original hierarchy.
        """
        if (self.children is None) or (not deep):
            return type(self)(
                model=self.model,
                path=self.path,
                children=self.children,
                word_importance=self.word_importance,
                document_topic_vector=self.document_topic_vector,
            )
        else:
            children = [child.copy(deep=True) for child in self.children]
            return type(self)(
                model=self.model,
                path=self.path,
                children=children,
                word_importance=self.word_importance,
                document_topic_vector=self.document_topic_vector,
            )

    def cut(self, max_depth: int) -> TopicNode:
        """Cuts hierarchy at a given depth, returns copy of the hierarchy with levels beyond max_depth removed.

        Parameters
        ----------
        max_depth: int
            Maximum level of nodes to keep.

        Returns
        -------
        TopicNode
            Hierarchy cut at the given level.
            Contains a deep copy of the original nodes.
        """
        if (self.level >= max_depth) or (not self.children):
            return type(self)(
                model=self.model,
                path=self.path,
                children=None,
                word_importance=self.word_importance,
                document_topic_vector=self.document_topic_vector,
            )
        else:
            children = [child.cut(max_depth) for child in self.children]
            return type(self)(
                model=self.model,
                path=self.path,
                children=children,
                word_importance=self.word_importance,
                document_topic_vector=self.document_topic_vector,
            )

    def collect_leaves(self) -> list[TopicNode]:
        def _collect_leaves(node: TopicNode, leaves: list[TopicNode]):
            if not node.children:
                leaves.append(node.copy(deep=False))
            else:
                for child in node.children:
                    _collect_leaves(child, leaves)

        leaves = []
        _collect_leaves(self, leaves)
        return leaves

    def flatten(self) -> TopicNode:
        """Returns new hierarchy with only the leaves of the tree.

        Returns
        -------
        TopicNode
            Root node containing all leaves in a hierarchy.
            Copies of the original nodes.
        """
        leaves = self.collect_leaves()
        ids = [leaf.path[-1] for leaf in leaves]
        # If the IDs are not unique, we label them from 0 to N
        if len(set(ids)) != len(ids):
            current = 0
            new_ids = []
            for node_id in ids:
                if node_id != -1:
                    new_ids.append(current)
                    current += 1
                else:
                    new_ids.append(-1)
            ids = new_ids
        for leaf_id, leaf in zip(ids, leaves):
            leaf.path = (*self.path, leaf_id)
        return type(self)(
            model=self.model,
            path=self.path,
            word_importance=self.word_importance,
            document_topic_vector=self.document_topic_vector,
            children=leaves,
        )

description property

Returns a high level description of the topic with its path in the tree and top words.

level property

Indicates how deep down the hierarchy the topic is.

copy(deep=True)

Creates a copy of the given node.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| deep | bool | Indicates whether the copy should be deep or shallow. Deep copies are done recursively, while shallow copies only contain references to the original children. | True |

Returns:

| Type | Description |
|---|---|
| TopicNode | Copy of original hierarchy. |

create_root(model, components, document_topic_matrix) classmethod

Creates root node from a topic models' components and topic importances in documents.


cut(max_depth)

Cuts hierarchy at a given depth, returns copy of the hierarchy with levels beyond max_depth removed.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| max_depth | int | Maximum level of nodes to keep. | required |

Returns:

| Type | Description |
|---|---|
| TopicNode | Hierarchy cut at the given level. Contains a deep copy of the original nodes. |


flatten()

Returns new hierarchy with only the leaves of the tree.

Returns:

| Type | Description |
|---|---|
| TopicNode | Root node containing all leaves in a hierarchy. Copies of the original nodes. |


get_words(top_k=10)

Returns top words and words importances for the topic.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| top_k | int | Number of top words to return. | 10 |

Returns:

| Type | Description |
|---|---|
| list[tuple[str, float]] | List of word, importance pairs. |


plot_tree()

Plots hierarchy as an interactive tree in Plotly.


print_tree(top_k=10, max_depth=None)

Print hierarchy in tree form.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| top_k | int | Number of words to print for each topic. | 10 |
| max_depth | Optional[int] | Maximum depth at which topics should be printed in the hierarchy. If None, the entire hierarchy is printed. | None |

turftopic.hierarchical.DivisibleTopicNode dataclass

Bases: TopicNode

Node for a topic in a topic hierarchy that can be subdivided.

Source code in turftopic/hierarchical.py
@dataclass
class DivisibleTopicNode(TopicNode):
    """Node for a topic in a topic hierarchy that can be subdivided."""

    def clear(self):
        """Deletes children of the given node."""
        self.children = None
        return self

    def divide(self, n_subtopics: int, **kwargs):
        """Divides current node into smaller subtopics.
        Only works when the underlying model is a divisive hierarchical model.

        Parameters
        ----------
        n_subtopics: int
            Number of topics to divide the topic into.
        """
        try:
            self.children = self.model.divide_topic(
                node=self, n_subtopics=n_subtopics, **kwargs
            )
        except AttributeError as e:
            raise AttributeError(
                "Looks like your model is not a divisive hierarchical model."
            ) from e
        return self

    def divide_children(self, n_subtopics: int, **kwargs):
        """Divides all children of the current node to smaller topics.
        Only works when the underlying model is a divisive hierarchical model.

        Parameters
        ----------
        n_subtopics: int
            Number of topics to divide the topics into.
        """
        if self.children is None:
            raise ValueError(
                "Current Node is a leaf, children can't be subdivided."
            )
        for child in self.children:
            child.divide(n_subtopics, **kwargs)
        return self

    def __str__(self):
        tree = self._build_tree(top_k=10, max_depth=3)
        console = Console()
        with console.capture() as capture:
            console.print(tree)
        return capture.get()

    def __repr__(self):
        return str(self)

clear()

Deletes children of the given node.


divide(n_subtopics, **kwargs)

Divides current node into smaller subtopics. Only works when the underlying model is a divisive hierarchical model.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| n_subtopics | int | Number of topics to divide the topic into. | required |

divide_children(n_subtopics, **kwargs)

Divides all children of the current node to smaller topics. Only works when the underlying model is a divisive hierarchical model.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| n_subtopics | int | Number of topics to divide the topics into. | required |

turftopic.models._hierarchical_clusters.ClusterNode dataclass

Bases: TopicNode

Hierarchical Topic Node for clustering models. Supports merging topics based on a hierarchical merging strategy.

Source code in turftopic/models/_hierarchical_clusters.py
class ClusterNode(TopicNode):
    """Hierarchical Topic Node for clustering models.
    Supports merging topics based on a hierarchical merging strategy."""

    @classmethod
    def create_root(cls, model: ContextualModel, labels: np.ndarray):
        """Creates root node from a topic models' components and topic importances in documents."""
        classes = np.sort(np.unique(labels))
        document_topic_matrix = label_binarize(labels, classes=classes)
        children = []
        for topic_id, doc_top in zip(classes, document_topic_matrix.T):
            children.append(
                cls(
                    model,
                    path=(topic_id,),
                    document_topic_vector=doc_top,
                    children=None,
                )
            )
        res = cls(
            model,
            path=(),
            word_importance=None,
            document_topic_vector=None,
            children=children,
        )
        res.estimate_components()
        return res

    def join_topics(
        self, to_join: Sequence[int], joint_id: Optional[int] = None
    ):
        """Joins a number of topics into a new topic with a given ID.

        Parameters
        ----------
        to_join: Sequence of int
            Children in the hierarchy to join (IDs indicate the last element of the path).
        joint_id: int, default None
            ID to give to the joint topic. By default, this will be the topic with the smallest ID.
        """
        if self.children is None:
            raise TypeError("Node doesn't have children, can't merge.")
        if len(set(to_join)) < len(to_join):
            raise ValueError(
                f"You can't join a cluster with itself: {to_join}"
            )
        if joint_id is None:
            joint_id = min(to_join)
        children = [self[i] for i in to_join]
        joint_membership = np.stack(
            [child.document_topic_vector for child in children]
        )
        joint_membership = np.sum(joint_membership, axis=0)
        child_ids = [child.path[-1] for child in children]
        joint_node = TopicNode(
            model=self.model,
            children=children,
            document_topic_vector=joint_membership,
            path=(*self.path, joint_id),
        )
        for child in joint_node:
            child._append_path(joint_id)
        self.children = [
            child for child in self.children if child.path[-1] not in child_ids
        ] + [joint_node]
        component_map = self._estimate_children_components()
        for child in self.children:
            child.word_importance = component_map[child.path[-1]]

    def estimate_components(self) -> np.ndarray:
        component_map = self._estimate_children_components()
        for child in self.children:
            child.word_importance = component_map[child.path[-1]]
        return self.components_

    @property
    def labels_(self) -> np.ndarray:
        topic_document_membership = np.stack(
            [child.document_topic_vector for child in self.children]
        )
        labels = np.argmax(topic_document_membership, axis=0)
        strength = np.max(topic_document_membership, axis=0)
        # documents that are not in this part of the hierarchy are treated as outliers
        labels[strength == 0] = -1
        return np.array(
            [self.children[label].path[-1] for label in labels if label != -1]
        )

    def _estimate_children_components(self) -> dict[int, np.ndarray]:
        """Estimates feature importances based on a fitted clustering."""
        clusters = np.unique(self.labels_)
        classes = np.sort(clusters)
        labels = self.labels_
        topic_vectors = self.model._calculate_topic_vectors(
            classes=classes, labels=labels
        )
        document_topic_matrix = label_binarize(labels, classes=classes)
        if self.model.feature_importance == "soft-c-tf-idf":
            components = soft_ctf_idf(
                document_topic_matrix, self.model.doc_term_matrix
            )  # type: ignore
        elif self.model.feature_importance == "centroid":
            if not hasattr(self.model, "vocab_embeddings"):
                self.model.vocab_embeddings = self.model.encoder_.encode(
                    self.model.vectorizer.get_feature_names_out()
                )  # type: ignore
                if (
                    self.model.vocab_embeddings.shape[1]
                    != topic_vectors.shape[1]
                ):
                    raise ValueError(
                        NOT_MATCHING_ERROR.format(
                            n_dims=topic_vectors.shape[1],
                            n_word_dims=self.model.vocab_embeddings.shape[1],
                        )
                    )
            components = cluster_centroid_distance(
                topic_vectors,
                self.model.vocab_embeddings,
            )
        elif self.model.feature_importance == "bayes":
            components = bayes_rule(
                document_topic_matrix, self.model.doc_term_matrix
            )
        else:
            components = ctf_idf(
                document_topic_matrix, self.model.doc_term_matrix
            )
        return dict(zip(classes, components))

    def _merge_clusters(self, linkage_matrix: np.ndarray):
        classes = self.classes_
        max_class = len(classes[classes != -1])
        for i_cluster, (left, right, *_) in enumerate(linkage_matrix):
            self.join_topics(
                [int(left), int(right)], int(max_class + i_cluster)
            )

    def _calculate_linkage(
        self, n_reduce_to: int, method: str = "average", metric: str = "cosine"
    ) -> np.ndarray:
        if method not in VALID_LINKAGE_METHODS:
            raise ValueError(
                f"Linkage method has to be one of: {VALID_LINKAGE_METHODS}, but got {method} instead."
            )
        classes = self.classes_
        labels = self.labels_
        topic_sizes = np.array([np.sum(labels == label) for label in classes])
        topic_representations = self.model.topic_representations
        if method == "smallest":
            return smallest_linkage(
                n_reduce_to=n_reduce_to,
                topic_vectors=topic_representations,
                topic_sizes=topic_sizes,
                classes=classes,
                metric=metric,
            )
        else:
            n_classes = len(classes[classes != -1])
            topic_vectors = topic_representations[classes != -1]
            n_reductions = n_classes - n_reduce_to
            return linkage(topic_vectors, method=method, metric=metric)[
                :n_reductions
            ]

    def reduce_topics(
        self, n_reduce_to: int, method: str = "average", metric: str = "cosine"
    ):
        n_topics = np.sum(self.classes_ != -1)
        if n_topics <= n_reduce_to:
            warnings.warn(
                f"Number of clusters is already {n_topics} <= {n_reduce_to}, nothing to do."
            )
            return
        linkage_matrix = self._calculate_linkage(
            n_reduce_to, method=method, metric=metric
        )
        self.linkage_matrix_ = linkage_matrix
        self._merge_clusters(linkage_matrix)

_estimate_children_components()

Estimates feature importances based on a fitted clustering.

Source code in turftopic/models/_hierarchical_clusters.py
def _estimate_children_components(self) -> dict[int, np.ndarray]:
    """Estimates feature importances based on a fitted clustering."""
    clusters = np.unique(self.labels_)
    classes = np.sort(clusters)
    labels = self.labels_
    topic_vectors = self.model._calculate_topic_vectors(
        classes=classes, labels=labels
    )
    document_topic_matrix = label_binarize(labels, classes=classes)
    if self.model.feature_importance == "soft-c-tf-idf":
        components = soft_ctf_idf(
            document_topic_matrix, self.model.doc_term_matrix
        )  # type: ignore
    elif self.model.feature_importance == "centroid":
        if not hasattr(self.model, "vocab_embeddings"):
            self.model.vocab_embeddings = self.model.encoder_.encode(
                self.model.vectorizer.get_feature_names_out()
            )  # type: ignore
            if (
                self.model.vocab_embeddings.shape[1]
                != topic_vectors.shape[1]
            ):
                raise ValueError(
                    NOT_MATCHING_ERROR.format(
                        n_dims=topic_vectors.shape[1],
                        n_word_dims=self.model.vocab_embeddings.shape[1],
                    )
                )
        components = cluster_centroid_distance(
            topic_vectors,
            self.model.vocab_embeddings,
        )
    elif self.model.feature_importance == "bayes":
        components = bayes_rule(
            document_topic_matrix, self.model.doc_term_matrix
        )
    else:
        components = ctf_idf(
            document_topic_matrix, self.model.doc_term_matrix
        )
    return dict(zip(classes, components))
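The method dispatches between several term-importance strategies (`soft-c-tf-idf`, `centroid`, `bayes`, and c-TF-IDF as the fallback). The count-based variants share a common first step: aggregating term counts per topic by multiplying the transposed document-topic matrix with the document-term matrix. A toy sketch of that aggregation with hypothetical matrices (not turftopic's exact c-TF-IDF formulas):

```python
import numpy as np

# Three documents assigned to two topics (binarized assignments)
document_topic_matrix = np.array([[1, 0],
                                  [1, 0],
                                  [0, 1]])
# Term counts for three documents over a three-word vocabulary
doc_term_matrix = np.array([[2, 0, 1],
                            [1, 1, 0],
                            [0, 3, 1]])

# Per-topic term counts: the shared starting point of the
# (soft) c-TF-IDF weighting schemes dispatched above
topic_term_counts = document_topic_matrix.T @ doc_term_matrix
print(topic_term_counts)
```

The full c-TF-IDF variants then reweight these counts by inverse term frequency across topics; the exact weighting is what distinguishes the strategies.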

create_root(model, labels) classmethod

Creates a root node from a topic model's components and topic importances in documents.

Source code in turftopic/models/_hierarchical_clusters.py
@classmethod
def create_root(cls, model: ContextualModel, labels: np.ndarray):
    """Creates root node from a topic models' components and topic importances in documents."""
    classes = np.sort(np.unique(labels))
    document_topic_matrix = label_binarize(labels, classes=classes)
    children = []
    for topic_id, doc_top in zip(classes, document_topic_matrix.T):
        children.append(
            cls(
                model,
                path=(topic_id,),
                document_topic_vector=doc_top,
                children=None,
            )
        )
    res = cls(
        model,
        path=(),
        word_importance=None,
        document_topic_vector=None,
        children=children,
    )
    res.estimate_components()
    return res

join_topics(to_join, joint_id=None)

Joins a number of topics into a new topic with a given ID.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `to_join` | `Sequence[int]` | Children in the hierarchy to join (IDs indicate the last element of the path). | required |
| `joint_id` | `Optional[int]` | ID to give to the joint topic. By default, this will be the topic with the smallest ID. | `None` |
Source code in turftopic/models/_hierarchical_clusters.py
def join_topics(
    self, to_join: Sequence[int], joint_id: Optional[int] = None
):
    """Joins a number of topics into a new topic with a given ID.

    Parameters
    ----------
    to_join: Sequence of int
        Children in the hierarchy to join (IDs indicate the last element of the path).
    joint_id: int, default None
        ID to give to the joint topic. By default, this will be the topic with the smallest ID.
    """
    if self.children is None:
        raise TypeError("Node doesn't have children, can't merge.")
    if len(set(to_join)) < len(to_join):
        raise ValueError(
            f"You can't join a cluster with itself: {to_join}"
        )
    if joint_id is None:
        joint_id = min(to_join)
    children = [self[i] for i in to_join]
    joint_membership = np.stack(
        [child.document_topic_vector for child in children]
    )
    joint_membership = np.sum(joint_membership, axis=0)
    child_ids = [child.path[-1] for child in children]
    joint_node = TopicNode(
        model=self.model,
        children=children,
        document_topic_vector=joint_membership,
        path=(*self.path, joint_id),
    )
    for child in joint_node:
        child._append_path(joint_id)
    self.children = [
        child for child in self.children if child.path[-1] not in child_ids
    ] + [joint_node]
    component_map = self._estimate_children_components()
    for child in self.children:
        child.word_importance = component_map[child.path[-1]]
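`join_topics` merges children by stacking and summing their binary document-topic vectors, so any document that belonged to one of the joined children belongs to the joint topic. A minimal sketch of that membership merge with hypothetical vectors (illustrative only, not the library's API):

```python
import numpy as np

# Binary membership vectors of two hypothetical children over five documents
child_a = np.array([1, 0, 0, 1, 0])
child_b = np.array([0, 0, 1, 0, 0])

# The merge performed inside join_topics: stack the children's
# vectors and sum across them to get the joint topic's membership
joint_membership = np.stack([child_a, child_b]).sum(axis=0)
print(joint_membership)
```

Because the children's memberships are disjoint binary assignments, the result stays binary and documents in neither child keep a membership of zero.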