Hierarchical Topic Modeling
You might expect some topics in your corpus to form a hierarchy.
Some models in Turftopic let you investigate these hierarchical relations and build a taxonomy of topics in a corpus.
Models that can capture hierarchical relations expose a hierarchy property, which you can manipulate, print and visualize:
from turftopic import ClusteringTopicModel
model = ClusteringTopicModel(n_reduce_to=10).fit(corpus)
# We cut at level 3 for plotting, since the hierarchy is very deep
model.hierarchy.cut(3).plot_tree()
(Interactive tree plot: drag and click to zoom, hover to see word importance.)
1. Divisive/Top-down Hierarchical Modeling
In divisive modeling, you start from larger structures higher up in the hierarchy and divide topics into smaller subtopics on demand.
This is how hierarchical modeling works in KeyNMF, which does not discover a topic hierarchy by default, but lets you divide topics into as many subtopics as you see fit.
As a demonstration, let's load a corpus that we know to have hierarchical themes.
from sklearn.datasets import fetch_20newsgroups
corpus = fetch_20newsgroups(
    subset="all",
    remove=("headers", "footers", "quotes"),
    categories=[
        "comp.os.ms-windows.misc",
        "comp.sys.ibm.pc.hardware",
        "talk.religion.misc",
        "alt.atheism",
    ],
).data
In this case, we have two base themes: computers and religion.
Let us fit a KeyNMF model with two topics to see if it finds them.
from turftopic import KeyNMF
model = KeyNMF(2, top_n=15, random_state=42).fit(corpus)
model.print_topics()
Topic ID | Highest Ranking
0 | windows, dos, os, disk, card, drivers, file, pc, files, microsoft
1 | atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs
The results confirm our intuition. Topic 0 seems to revolve around IT, while Topic 1 revolves around atheism and religion.
We can already suspect, however, that more granular topics could be discovered in this corpus.
For instance, Topic 0 contains terms related to operating systems, like windows and dos, but also to hardware components, like disk and card.
We can access the topic hierarchy at this stage through the model's hierarchy property.
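For instance, printing the hierarchy shows the tree in its current state:
print(model.hierarchy)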
Root
├── 0: windows, dos, os, disk, card, drivers, file, pc, files, microsoft
└── 1: atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs
There isn't much to see yet: the hierarchy is flat, containing only the two topics we discovered, and we are at the root level.
We can dissect these topics by adding a level to the hierarchy.
Let us add three subtopics to each topic on the root level.
model.hierarchy.divide_children(n_subtopics=3)
Root
├── 0: windows, dos, os, disk, card, drivers, file, pc, files, microsoft
│ ├── 0.0: dos, file, disk, files, program, windows, disks, shareware, norton, memory
│ ├── 0.1: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform
│ └── 0.2: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati
...
As you can see, the model managed to identify meaningful subtopics of the two larger topics we found earlier.
Topic 0 was divided into a topic mostly concerned with dos and windows, a topic on operating systems in general, and one about hardware.
You can also divide individual topics into a number of subtopics using the divide() method.
Let us divide Topic 0.0 into 5 subtopics.
model.hierarchy[0][0].divide(5)
model.hierarchy
Root
├── 0: windows, dos, os, disk, card, drivers, file, pc, files, microsoft
│ ├── 0.0: dos, file, disk, files, program, windows, disks, shareware, norton, memory
│ │ ├── 0.0.1: file, files, ftp, bmp, program, windows, shareware, directory, bitmap, zip
│ │ ├── 0.0.2: os, windows, unix, microsoft, crash, apps, crashes, nt, pc, operating
...
2. Agglomerative/Bottom-up Hierarchical Modeling
In other models, hierarchies arise by starting from smaller, more specific topics and merging them based on their similarity until a desired number of top-level topics is obtained.
This is the approach taken in clustering topic models like BERTopic and Top2Vec.
Clustering models typically find a lot of topics, and it can help interpretation to merge them until you end up with 10-20 top-level topics.
You can either do this at fitting time by setting n_reduce_to on initialization, or manually with reduce_topics() after fitting (see the sketch below).
For more details, check our guide on Clustering models.
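As a minimal sketch of the manual route (assuming reduce_topics() is available on the fitted model as described above; the keyword name mirrors the n_reduce_to constructor argument and is an assumption here):

from turftopic import ClusteringTopicModel

model = ClusteringTopicModel().fit(corpus)
# Merge topics after fitting until only 10 top-level topics remain
model.reduce_topics(n_reduce_to=10)
print(model.hierarchy)

The example below instead passes n_reduce_to at initialization, so the reduction happens automatically during fitting: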
from turftopic import ClusteringTopicModel
model = ClusteringTopicModel(
    n_reduce_to=10,
    feature_importance="centroid",
    reduction_method="smallest",
    reduction_topic_representation="centroid",
    reduction_distance_metric="cosine",
)
model.fit(corpus)
print(model.hierarchy)
Root:
├── -1: documented, obsolete, et4000, concerns, dubious, embedded, hardware, xfree86, alternative, seeking
├── 20: hitter, pitching, batting, hitters, pitchers, fielder, shortstop, inning, baseman, pitcher
├── 284: nhl, goaltenders, canucks, sabres, hockey, bruins, puck, oilers, canadiens, flyers
│ ├── 242: sportschannel, espn, nbc, nhl, broadcasts, broadcasting, broadcast, mlb, cbs, cbc
│ │ ├── 171: stadium, tickets, mlb, ticket, sportschannel, mets, inning, nationals, schedule, cubs
│ │ │ └── ...
│ │ └── 21: sportschannel, nbc, espn, nhl, broadcasting, broadcasts, broadcast, hockey, cbc, cbs
│ └── 236: nhl, goaltenders, canucks, sabres, puck, oilers, andreychuk, bruins, goaltender, leafs
...
API reference
turftopic.hierarchical.TopicNode
dataclass
Node for a topic in a topic hierarchy.
Parameters:
Name | Type | Description | Default
model | ContextualModel | Underlying topic model, which the hierarchy is based on. | required
path | tuple[int] | Path that leads to this node from the root of the tree. | ()
word_importance | Optional[ndarray] | Importance of each word in the vocabulary for given topic. | None
document_topic_vector | Optional[ndarray] | Importance of the topic in all documents in the corpus. | None
children | Optional[list[TopicNode]] | List of subtopics within this topic. | None
Source code in turftopic/hierarchical.py
@dataclass
class TopicNode:
"""Node for a topic in a topic hierarchy.
Parameters
----------
model: ContextualModel
Underlying topic model, which the hierarchy is based on.
path: tuple[int], default ()
Path that leads to this node from the root of the tree.
word_importance: ndarray of shape (n_vocab), default None
Importance of each word in the vocabulary for given topic.
document_topic_vector: ndarray of shape (n_documents), default None
Importance of the topic in all documents in the corpus.
children: list[TopicNode], default None
List of subtopics within this topic.
"""
model: ContextualModel
path: tuple[int] = ()
word_importance: Optional[np.ndarray] = None
document_topic_vector: Optional[np.ndarray] = None
children: Optional[list[TopicNode]] = None
def _path_str(self):
return ".".join([str(level_id) for level_id in self.path])
@property
def classes_(self):
if self.children is None:
raise AttributeError("TopicNode doesn't have children.")
return np.array([child.path[-1] for child in self.children])
@property
def components_(self):
if self.children is None:
raise AttributeError("TopicNode doesn't have children.")
return np.stack([child.word_importance for child in self.children])
@classmethod
def create_root(
cls,
model: ContextualModel,
components: np.ndarray,
document_topic_matrix: np.ndarray,
) -> TopicNode:
"""Creates root node from a topic models' components and topic importances in documents."""
children = []
n_components = components.shape[0]
classes = getattr(model, "classes_", None)
if classes is None:
classes = np.arange(n_components)
for topic_id, comp, doc_top in zip(
classes, components, document_topic_matrix.T
):
children.append(
cls(
model,
path=(topic_id,),
word_importance=comp,
document_topic_vector=doc_top,
children=None,
)
)
return cls(
model,
path=(),
word_importance=None,
document_topic_vector=None,
children=children,
)
@property
def level(self) -> int:
"""Indicates how deep down the hierarchy the topic is."""
return len(self.path)
def get_words(self, top_k: int = 10) -> list[tuple[str, float]]:
"""Returns top words and words importances for the topic.
Parameters
----------
top_k: int, default 10
Number of top words to return.
Returns
-------
list[tuple[str, float]]
List of word, importance pairs.
"""
if self.word_importance is None:
return []
vocab = self.model.get_vocab()
most_important = np.argsort(-self.word_importance)[:top_k]
words = vocab[most_important]
imp = self.word_importance[most_important]
return list(zip(words, imp))
@property
def description(self) -> str:
"""Returns a high level description of the topic with its path in the tree
and top words."""
if not len(self.path):
path = "Root"
else:
path = str(
self.path[-1]
) # ".".join([str(idx) for idx in self.path])
words = []
for word, imp in self.get_words(top_k=10):
words.append(word)
concat_words = ", ".join(words)
color = COLOR_PER_LEVEL[min(self.level, len(COLOR_PER_LEVEL) - 1)]
stylized = f"[{color} bold]{path}[/]: [italic]{concat_words}[/]"
console = Console()
with console.capture() as capture:
console.print(stylized, end="")
return capture.get()
@property
def _simple_desc(self) -> str:
if not len(self.path):
path = "Root"
else:
path = str(
self.path[-1]
) # ".".join([str(idx) for idx in self.path])
words = []
for word, imp in self.get_words(top_k=5):
words.append(word)
concat_words = ", ".join(words)
return f"{path}: {concat_words}"
def _build_tree(
self,
tree: Tree = None,
top_k: int = 10,
max_depth: Optional[int] = None,
) -> Tree:
if tree is None:
tree = Tree(self.description)
else:
tree = tree.add(self.description)
out_of_depth = (max_depth is not None) and (self.level >= max_depth)
if out_of_depth:
if self.children is not None:
tree.add("...")
return tree
if self.children is not None:
for child in self.children:
child._build_tree(tree, max_depth=max_depth)
return tree
def print_tree(
self,
top_k: int = 10,
max_depth: Optional[int] = None,
) -> None:
"""Print hierarchy in tree form.
Parameters
----------
top_k: int, default 10
Number of words to print for each topic.
max_depth: int, default None
Maximum depth at which topics should be printed in the hierarchy.
If None, the entire hierarchy is printed.
"""
tree = self._build_tree(top_k=top_k, max_depth=max_depth)
console = Console()
console.print(tree)
def __str__(self):
tree = self._build_tree(top_k=10, max_depth=3)
console = Console()
with console.capture() as capture:
console.print(tree)
return capture.get()
def __repr__(self):
return str(self)
def __getitem__(self, id_or_path: int):
if self.children is None:
raise IndexError(
"Current node is a leaf and does not have children."
)
mapping = {
topic_class: i_topic
for i_topic, topic_class in enumerate(self.classes_)
}
return self.children[mapping[id_or_path]]
def __iter__(self):
return iter(self.children)
def plot_tree(self):
"""Plots hierarchy as an interactive tree in Plotly."""
return _tree_plot(self)
def _append_path(self, path_prefix: int):
self.path = (path_prefix, *self.path)
if self.children is not None:
for child in self.children:
child._append_path(path_prefix)
def copy(self, deep: bool = True) -> TopicNode:
"""Creates a copy of the given node.
Parameters
----------
deep: bool, default True
Indicates whether the copy should be deep or shallow.
Deep copies are done recursively, while shallow copies only
contain references to the original children.
Returns
-------
Copy of original hierarchy.
"""
if (self.children is None) or (not deep):
return type(self)(
model=self.model,
path=self.path,
children=self.children,
word_importance=self.word_importance,
document_topic_vector=self.document_topic_vector,
)
else:
children = [child.copy(deep=True) for child in self.children]
return type(self)(
model=self.model,
path=self.path,
children=children,
word_importance=self.word_importance,
document_topic_vector=self.document_topic_vector,
)
def cut(self, max_depth: int) -> TopicNode:
"""Cuts hierarchy at a given depth, returns copy of the hierarchy with levels beyond max_depth removed.
Parameters
----------
max_depth: int
Maximum level of nodes to keep.
Returns
-------
TopicNode
Hierarchy cut at the given level.
Contains a deep copy of the original nodes.
"""
if (self.level >= max_depth) or (not self.children):
return type(self)(
model=self.model,
path=self.path,
children=None,
word_importance=self.word_importance,
document_topic_vector=self.document_topic_vector,
)
else:
children = [child.cut(max_depth) for child in self.children]
return type(self)(
model=self.model,
path=self.path,
children=children,
word_importance=self.word_importance,
document_topic_vector=self.document_topic_vector,
)
def collect_leaves(self) -> list[TopicNode]:
def _collect_leaves(node: TopicNode, leaves: list[TopicNode]):
if not node.children:
leaves.append(node.copy(deep=False))
else:
for child in node.children:
_collect_leaves(child, leaves)
leaves = []
_collect_leaves(self, leaves)
return leaves
def flatten(self) -> TopicNode:
"""Returns new hierarchy with only the leaves of the tree.
Returns
-------
TopicNode
Root node containing all leaves in a hierarchy.
Copies of the original nodes.
"""
leaves = self.collect_leaves()
ids = [leaf.path[-1] for leaf in leaves]
# If the IDs are not unique, we label them from 0 to N
if len(set(ids)) != len(ids):
current = 0
new_ids = []
for node_id in ids:
if node_id != -1:
new_ids.append(current)
current += 1
else:
new_ids.append(-1)
ids = new_ids
for leaf_id, leaf in zip(ids, leaves):
leaf.path = (*self.path, leaf_id)
return type(self)(
model=self.model,
path=self.path,
word_importance=self.word_importance,
document_topic_vector=self.document_topic_vector,
children=leaves,
)
description: str (property)
Returns a high level description of the topic with its path in the tree and top words.
level: int (property)
Indicates how deep down the hierarchy the topic is.
copy(deep=True)
Creates a copy of the given node.
Parameters:
Name | Type | Description | Default
deep | bool | Indicates whether the copy should be deep or shallow. Deep copies are done recursively, while shallow copies only contain references to the original children. | True

Returns:
Type | Description
TopicNode | Copy of original hierarchy.
Source code in turftopic/hierarchical.py
def copy(self, deep: bool = True) -> TopicNode:
"""Creates a copy of the given node.
Parameters
----------
deep: bool, default True
Indicates whether the copy should be deep or shallow.
Deep copies are done recursively, while shallow copies only
contain references to the original children.
Returns
-------
Copy of original hierarchy.
"""
if (self.children is None) or (not deep):
return type(self)(
model=self.model,
path=self.path,
children=self.children,
word_importance=self.word_importance,
document_topic_vector=self.document_topic_vector,
)
else:
children = [child.copy(deep=True) for child in self.children]
return type(self)(
model=self.model,
path=self.path,
children=children,
word_importance=self.word_importance,
document_topic_vector=self.document_topic_vector,
)
create_root(model, components, document_topic_matrix)
classmethod
Creates a root node from a topic model's components and topic importances in documents.
Source code in turftopic/hierarchical.py
@classmethod
def create_root(
cls,
model: ContextualModel,
components: np.ndarray,
document_topic_matrix: np.ndarray,
) -> TopicNode:
"""Creates root node from a topic models' components and topic importances in documents."""
children = []
n_components = components.shape[0]
classes = getattr(model, "classes_", None)
if classes is None:
classes = np.arange(n_components)
for topic_id, comp, doc_top in zip(
classes, components, document_topic_matrix.T
):
children.append(
cls(
model,
path=(topic_id,),
word_importance=comp,
document_topic_vector=doc_top,
children=None,
)
)
return cls(
model,
path=(),
word_importance=None,
document_topic_vector=None,
children=children,
)
cut(max_depth)
Cuts hierarchy at a given depth, returns copy of the hierarchy with levels beyond max_depth removed.
Parameters:
Name | Type | Description | Default
max_depth | int | Maximum level of nodes to keep. | required

Returns:
Type | Description
TopicNode | Hierarchy cut at the given level. Contains a deep copy of the original nodes.
Source code in turftopic/hierarchical.py
def cut(self, max_depth: int) -> TopicNode:
"""Cuts hierarchy at a given depth, returns copy of the hierarchy with levels beyond max_depth removed.
Parameters
----------
max_depth: int
Maximum level of nodes to keep.
Returns
-------
TopicNode
Hierarchy cut at the given level.
Contains a deep copy of the original nodes.
"""
if (self.level >= max_depth) or (not self.children):
return type(self)(
model=self.model,
path=self.path,
children=None,
word_importance=self.word_importance,
document_topic_vector=self.document_topic_vector,
)
else:
children = [child.cut(max_depth) for child in self.children]
return type(self)(
model=self.model,
path=self.path,
children=children,
word_importance=self.word_importance,
document_topic_vector=self.document_topic_vector,
)
flatten()
Returns new hierarchy with only the leaves of the tree.
Returns:
Type | Description
TopicNode | Root node containing all leaves in a hierarchy. Copies of the original nodes.
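For example, to work with only the most specific topics in a divided hierarchy, you can flatten it (a minimal sketch):

flat = model.hierarchy.flatten()   # root node whose children are copies of all leaf topics
flat.print_tree()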
Source code in turftopic/hierarchical.py
def flatten(self) -> TopicNode:
"""Returns new hierarchy with only the leaves of the tree.
Returns
-------
TopicNode
Root node containing all leaves in a hierarchy.
Copies of the original nodes.
"""
leaves = self.collect_leaves()
ids = [leaf.path[-1] for leaf in leaves]
# If the IDs are not unique, we label them from 0 to N
if len(set(ids)) != len(ids):
current = 0
new_ids = []
for node_id in ids:
if node_id != -1:
new_ids.append(current)
current += 1
else:
new_ids.append(-1)
ids = new_ids
for leaf_id, leaf in zip(ids, leaves):
leaf.path = (*self.path, leaf_id)
return type(self)(
model=self.model,
path=self.path,
word_importance=self.word_importance,
document_topic_vector=self.document_topic_vector,
children=leaves,
)
get_words(top_k=10)
Returns the top words and word importances for the topic.
Parameters:
Name | Type | Description | Default
top_k | int | Number of top words to return. | 10

Returns:
Type | Description
list[tuple[str, float]] | List of word, importance pairs.
Source code in turftopic/hierarchical.py
def get_words(self, top_k: int = 10) -> list[tuple[str, float]]:
"""Returns top words and words importances for the topic.
Parameters
----------
top_k: int, default 10
Number of top words to return.
Returns
-------
list[tuple[str, float]]
List of word, importance pairs.
"""
if self.word_importance is None:
return []
vocab = self.model.get_vocab()
most_important = np.argsort(-self.word_importance)[:top_k]
words = vocab[most_important]
imp = self.word_importance[most_important]
return list(zip(words, imp))
plot_tree()
Plots hierarchy as an interactive tree in Plotly.
Source code in turftopic/hierarchical.py
def plot_tree(self):
"""Plots hierarchy as an interactive tree in Plotly."""
return _tree_plot(self)
print_tree(top_k=10, max_depth=None)
Print hierarchy in tree form.
Parameters:
Name | Type | Description | Default
top_k | int | Number of words to print for each topic. | 10
max_depth | Optional[int] | Maximum depth at which topics should be printed in the hierarchy. If None, the entire hierarchy is printed. | None
Source code in turftopic/hierarchical.py
def print_tree(
self,
top_k: int = 10,
max_depth: Optional[int] = None,
) -> None:
"""Print hierarchy in tree form.
Parameters
----------
top_k: int, default 10
Number of words to print for each topic.
max_depth: int, default None
Maximum depth at which topics should be printed in the hierarchy.
If None, the entire hierarchy is printed.
"""
tree = self._build_tree(top_k=top_k, max_depth=max_depth)
console = Console()
console.print(tree)
turftopic.hierarchical.DivisibleTopicNode
dataclass
Bases: TopicNode
Node for a topic in a topic hierarchy that can be subdivided.
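A short sketch of the divisive workflow on a model such as KeyNMF, continuing the example from the divisive section above (topic IDs are illustrative):

node = model.hierarchy[0]
node.divide(3)       # split this topic into 3 subtopics
node.clear()         # remove the subtopics again
model.hierarchy.divide_children(n_subtopics=3)  # split every root-level topic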
Source code in turftopic/hierarchical.py
@dataclass
class DivisibleTopicNode(TopicNode):
"""Node for a topic in a topic hierarchy that can be subdivided."""
def clear(self):
"""Deletes children of the given node."""
self.children = None
return self
def divide(self, n_subtopics: int, **kwargs):
"""Divides current node into smaller subtopics.
Only works when the underlying model is a divisive hierarchical model.
Parameters
----------
n_subtopics: int
Number of topics to divide the topic into.
"""
try:
self.children = self.model.divide_topic(
node=self, n_subtopics=n_subtopics, **kwargs
)
except AttributeError as e:
raise AttributeError(
"Looks like your model is not a divisive hierarchical model."
) from e
return self
def divide_children(self, n_subtopics: int, **kwargs):
"""Divides all children of the current node to smaller topics.
Only works when the underlying model is a divisive hierarchical model.
Parameters
----------
n_subtopics: int
Number of topics to divide the topics into.
"""
if self.children is None:
raise ValueError(
"Current Node is a leaf, children can't be subdivided."
)
for child in self.children:
child.divide(n_subtopics, **kwargs)
return self
def __str__(self):
tree = self._build_tree(top_k=10, max_depth=3)
console = Console()
with console.capture() as capture:
console.print(tree)
return capture.get()
def __repr__(self):
return str(self)
clear()
Deletes children of the given node.
Source code in turftopic/hierarchical.py
def clear(self):
"""Deletes children of the given node."""
self.children = None
return self
divide(n_subtopics, **kwargs)
Divides current node into smaller subtopics.
Only works when the underlying model is a divisive hierarchical model.
Parameters:
Name | Type | Description | Default
n_subtopics | int | Number of topics to divide the topic into. | required
Source code in turftopic/hierarchical.py
def divide(self, n_subtopics: int, **kwargs):
"""Divides current node into smaller subtopics.
Only works when the underlying model is a divisive hierarchical model.
Parameters
----------
n_subtopics: int
Number of topics to divide the topic into.
"""
try:
self.children = self.model.divide_topic(
node=self, n_subtopics=n_subtopics, **kwargs
)
except AttributeError as e:
raise AttributeError(
"Looks like your model is not a divisive hierarchical model."
) from e
return self
divide_children(n_subtopics, **kwargs)
Divides all children of the current node into smaller topics.
Only works when the underlying model is a divisive hierarchical model.
Parameters:
Name | Type | Description | Default
n_subtopics | int | Number of topics to divide the topics into. | required
Source code in turftopic/hierarchical.py
def divide_children(self, n_subtopics: int, **kwargs):
"""Divides all children of the current node to smaller topics.
Only works when the underlying model is a divisive hierarchical model.
Parameters
----------
n_subtopics: int
Number of topics to divide the topics into.
"""
if self.children is None:
raise ValueError(
"Current Node is a leaf, children can't be subdivided."
)
for child in self.children:
child.divide(n_subtopics, **kwargs)
return self
turftopic.models._hierarchical_clusters.ClusterNode
Bases: TopicNode
Hierarchical Topic Node for clustering models.
Supports merging topics based on a hierarchical merging strategy.
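A hedged sketch of merging on a clustering model's hierarchy, assuming that model.hierarchy is a ClusterNode as this API reference suggests (the topic IDs are illustrative):

from turftopic import ClusteringTopicModel

model = ClusteringTopicModel().fit(corpus)
# Merge two specific topics into one joint topic
model.hierarchy.join_topics([20, 21])
# Or merge agglomeratively until 10 top-level topics remain
model.hierarchy.reduce_topics(n_reduce_to=10, method="average", metric="cosine")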
Source code in turftopic/models/_hierarchical_clusters.py
class ClusterNode(TopicNode):
"""Hierarchical Topic Node for clustering models.
Supports merging topics based on a hierarchical merging strategy."""
@classmethod
def create_root(cls, model: ContextualModel, labels: np.ndarray):
"""Creates root node from a topic models' components and topic importances in documents."""
classes = np.sort(np.unique(labels))
document_topic_matrix = safe_binarize(labels, classes=classes)
children = []
for topic_id, doc_top in zip(classes, document_topic_matrix.T):
children.append(
cls(
model,
path=(topic_id,),
document_topic_vector=doc_top,
children=None,
)
)
res = cls(
model,
path=(),
word_importance=None,
document_topic_vector=None,
children=children,
)
res.estimate_components()
return res
def join_topics(
self, to_join: Sequence[int], joint_id: Optional[int] = None
):
"""Joins a number of topics into a new topic with a given ID.
Parameters
----------
to_join: Sequence of int
Children in the hierarchy to join (IDs indicate the last element of the path).
joint_id: int, default None
ID to give to the joint topic. By default, this will be the topic with the smallest ID.
"""
if self.children is None:
raise TypeError("Node doesn't have children, can't merge.")
if len(set(to_join)) < len(to_join):
raise ValueError(
f"You can't join a cluster with itself: {to_join}"
)
if joint_id is None:
joint_id = min(to_join)
children = [self[i] for i in to_join]
joint_membership = np.stack(
[child.document_topic_vector for child in children]
)
joint_membership = np.sum(joint_membership, axis=0)
child_ids = [child.path[-1] for child in children]
joint_node = TopicNode(
model=self.model,
children=children,
document_topic_vector=joint_membership,
path=(*self.path, joint_id),
)
for child in joint_node:
child._append_path(joint_id)
self.children = [
child for child in self.children if child.path[-1] not in child_ids
] + [joint_node]
component_map = self._estimate_children_components()
for child in self.children:
child.word_importance = component_map[child.path[-1]]
def estimate_components(self) -> np.ndarray:
component_map = self._estimate_children_components()
for child in self.children:
child.word_importance = component_map[child.path[-1]]
return self.components_
@property
def labels_(self) -> np.ndarray:
topic_document_membership = np.stack(
[child.document_topic_vector for child in self.children]
)
labels = np.argmax(topic_document_membership, axis=0)
strength = np.max(topic_document_membership, axis=0)
# documents that are not in this part of the hierarchy are treated as outliers
labels[strength == 0] = -1
return np.array(
[self.children[label].path[-1] for label in labels if label != -1]
)
def _estimate_children_components(self) -> dict[int, np.ndarray]:
"""Estimates feature importances based on a fitted clustering."""
clusters = np.unique(self.labels_)
classes = np.sort(clusters)
labels = self.labels_
topic_vectors = self.model._calculate_topic_vectors(
classes=classes, labels=labels
)
document_topic_matrix = safe_binarize(labels, classes=classes)
if self.model.feature_importance == "soft-c-tf-idf":
components = soft_ctf_idf(
document_topic_matrix, self.model.doc_term_matrix
) # type: ignore
elif self.model.feature_importance == "centroid":
if not hasattr(self.model, "vocab_embeddings"):
self.model.vocab_embeddings = self.model.encode_documents(
self.model.vectorizer.get_feature_names_out()
) # type: ignore
if (
self.model.vocab_embeddings.shape[1]
!= topic_vectors.shape[1]
):
raise ValueError(
NOT_MATCHING_ERROR.format(
n_dims=topic_vectors.shape[1],
n_word_dims=self.model.vocab_embeddings.shape[1],
)
)
components = cluster_centroid_distance(
topic_vectors,
self.model.vocab_embeddings,
)
elif self.model.feature_importance == "bayes":
components = bayes_rule(
document_topic_matrix, self.model.doc_term_matrix
)
else:
components = ctf_idf(
document_topic_matrix, self.model.doc_term_matrix
)
return dict(zip(classes, components))
def _merge_clusters(self, linkage_matrix: np.ndarray):
classes = self.classes_
max_class = len(classes[classes != -1])
for i_cluster, (left, right, *_) in enumerate(linkage_matrix):
self.join_topics(
[int(left), int(right)], int(max_class + i_cluster)
)
def _calculate_linkage(
self, n_reduce_to: int, method: str = "average", metric: str = "cosine"
) -> np.ndarray:
if method not in VALID_LINKAGE_METHODS:
raise ValueError(
f"Linkage method has to be one of: {VALID_LINKAGE_METHODS}, but got {method} instead."
)
classes = self.classes_
labels = self.labels_
topic_sizes = np.array([np.sum(labels == label) for label in classes])
topic_representations = self.model.topic_representations
if method == "smallest":
return smallest_linkage(
n_reduce_to=n_reduce_to,
topic_vectors=topic_representations,
topic_sizes=topic_sizes,
classes=classes,
metric=metric,
)
else:
n_classes = len(classes[classes != -1])
topic_vectors = topic_representations[classes != -1]
n_reductions = n_classes - n_reduce_to
return linkage(topic_vectors, method=method, metric=metric)[
:n_reductions
]
def reduce_topics(
self, n_reduce_to: int, method: str = "average", metric: str = "cosine"
):
n_topics = np.sum(self.classes_ != -1)
if n_topics <= n_reduce_to:
warnings.warn(
f"Number of clusters is already {n_topics} <= {n_reduce_to}, nothing to do."
)
return
linkage_matrix = self._calculate_linkage(
n_reduce_to, method=method, metric=metric
)
self.linkage_matrix_ = linkage_matrix
self._merge_clusters(linkage_matrix)
create_root(model, labels)
classmethod
Creates a root node from a topic model's components and topic importances in documents.
Source code in turftopic/models/_hierarchical_clusters.py
@classmethod
def create_root(cls, model: ContextualModel, labels: np.ndarray):
"""Creates root node from a topic models' components and topic importances in documents."""
classes = np.sort(np.unique(labels))
document_topic_matrix = safe_binarize(labels, classes=classes)
children = []
for topic_id, doc_top in zip(classes, document_topic_matrix.T):
children.append(
cls(
model,
path=(topic_id,),
document_topic_vector=doc_top,
children=None,
)
)
res = cls(
model,
path=(),
word_importance=None,
document_topic_vector=None,
children=children,
)
res.estimate_components()
return res
join_topics(to_join, joint_id=None)
Joins a number of topics into a new topic with a given ID.
Parameters:
Name | Type | Description | Default
to_join | Sequence[int] | Children in the hierarchy to join (IDs indicate the last element of the path). | required
joint_id | Optional[int] | ID to give to the joint topic. By default, this will be the topic with the smallest ID. | None
Source code in turftopic/models/_hierarchical_clusters.py
def join_topics(
self, to_join: Sequence[int], joint_id: Optional[int] = None
):
"""Joins a number of topics into a new topic with a given ID.
Parameters
----------
to_join: Sequence of int
Children in the hierarchy to join (IDs indicate the last element of the path).
joint_id: int, default None
ID to give to the joint topic. By default, this will be the topic with the smallest ID.
"""
if self.children is None:
raise TypeError("Node doesn't have children, can't merge.")
if len(set(to_join)) < len(to_join):
raise ValueError(
f"You can't join a cluster with itself: {to_join}"
)
if joint_id is None:
joint_id = min(to_join)
children = [self[i] for i in to_join]
joint_membership = np.stack(
[child.document_topic_vector for child in children]
)
joint_membership = np.sum(joint_membership, axis=0)
child_ids = [child.path[-1] for child in children]
joint_node = TopicNode(
model=self.model,
children=children,
document_topic_vector=joint_membership,
path=(*self.path, joint_id),
)
for child in joint_node:
child._append_path(joint_id)
self.children = [
child for child in self.children if child.path[-1] not in child_ids
] + [joint_node]
component_map = self._estimate_children_components()
for child in self.children:
child.word_importance = component_map[child.path[-1]]