Concept Induction (BETA)

Concept induction is the idea that higher-level concepts in a corpus can be discovered and described in detail using Large Language Models (Lam et al. 2024). These concepts can also be discovered from particular angles, using seeds. The original study and the Lloom package use LLMs at every step, which requires substantial computational resources and aggressive down-sampling of the original corpus.

To address this scalability issue, we use a seeded topic model (KeyNMF) to discover the concepts, and only use LLMs to name and describe them. This gives results similar to Lloom's at a fraction of the cost.

In addition, we allow users to generate a Concept Browser programmatically, with which these concepts and their related documents can be explored.

Figure 1: Concepts discovered on the political ideologies dataset.

Example Usage

The example below uses a synthetically generated political ideologies dataset, which we examine from the following angles:

Taxation
Stance on immigration
Environmental policy

We use an OpenAI analyzer and KeyNMF, with the paraphrase-MiniLM-L12-v2 embedding model. The code runs in about ten minutes.

Install dependencies and set API Key:

pip install turftopic[openai] datasets
export OPENAI_API_KEY="sk-<your API key here>"

import numpy as np
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

from turftopic import KeyNMF, create_concept_browser
from turftopic.analyzers import OpenAIAnalyzer

# Loading the dataset from huggingface
ds = load_dataset("JyotiNayak/political_ideologies", split="train")
corpus = list(ds["statement"])

# Embedding all documents in the corpus
encoder = SentenceTransformer("paraphrase-MiniLM-L12-v2")
embeddings = encoder.encode(corpus, show_progress_bar=True)

# Running separate seeded KeyNMF models for each tab and saving them
seeds = ["Taxation", "Stance on immigration", "Environmental policy"]
models = []
doc_topic = []
for seed in seeds:
    model = KeyNMF(
        3, encoder=encoder, seed_phrase=seed, seed_exponent=2, random_state=42
    )
    doc_topic_matrix = model.fit_transform(corpus, embeddings=embeddings)
    doc_topic.append(doc_topic_matrix)
    models.append(model)

# Calculating topic sizes and collecting representative documents
top_documents = []
topic_sizes = []
for doc_topic_matrix in doc_topic:
    # We say that if a document has at least five percent of the max importance
    # then it contains the topic
    rescaled = doc_topic_matrix / doc_topic_matrix.max()
    sizes = (rescaled >= 0.05).sum(axis=0)
    topic_sizes.append(sizes)
    # Finding representative documents for each topic
    docs = []
    for doc_t in rescaled.T:
        # Extracting top 10 documents for each topic
        top = np.argsort(-doc_t)[:10]
        # Making sure only those documents get in,
        # that we have marked to contain the topic
        top = top[doc_t[top] >= 0.05]
        docs.append([corpus[i] for i in top])
    top_documents.append(docs)
topic_sizes = np.stack(topic_sizes)

# Running topic analysis on all models using GPT-5-Nano
analyzer = OpenAIAnalyzer()
analysis_results = []
for model, docs in zip(models, top_documents):
    res = analyzer.analyze_topics(
        keywords=model.get_top_words(), documents=docs
    )
    analysis_results.append(res)

# Creating the concept browser:
browser = create_concept_browser(
    seeds=seeds,
    topic_names=[res.topic_names for res in analysis_results],
    keywords=[model.get_top_words() for model in models],
    topic_descriptions=[res.topic_descriptions for res in analysis_results],
    topic_sizes=topic_sizes,
    top_documents=top_documents,
)
browser.show()

See Figure 1 for the results.
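
The thresholding step in the example above can be checked by hand on a small matrix. The sketch below applies the same rule (a document contains a topic if its importance is at least five percent of the largest value in the matrix) to a made-up 5 x 2 document-topic matrix. The matrix values, the toy corpus, and the names `threshold` and `top_k` are assumptions introduced for illustration; they are not output of KeyNMF.

```python
import numpy as np

# Toy document-topic matrix: 5 documents (rows) x 2 topics (columns).
# These values are made up for illustration only.
doc_topic_matrix = np.array(
    [
        [0.90, 0.00],
        [0.40, 0.02],
        [0.03, 0.60],
        [0.00, 0.30],
        [0.10, 0.00],
    ]
)
corpus = ["doc a", "doc b", "doc c", "doc d", "doc e"]

threshold = 0.05  # five percent of the maximum importance
top_k = 3  # the example above uses 10; 3 keeps the toy case readable

# Rescale by the global maximum so that importances lie in [0, 1]
rescaled = doc_topic_matrix / doc_topic_matrix.max()

# Topic size = number of documents at or above the threshold
sizes = (rescaled >= threshold).sum(axis=0)

top_documents = []
for doc_t in rescaled.T:
    # Indices of the highest-scoring documents for this topic
    top = np.argsort(-doc_t)[:top_k]
    # Drop documents that were not marked as containing the topic
    top = top[doc_t[top] >= threshold]
    top_documents.append([corpus[i] for i in top])

print(sizes.tolist())  # [3, 2]
print(top_documents)  # [['doc a', 'doc b', 'doc e'], ['doc c', 'doc d']]
```

Note that the second topic ends up with only two representative documents even though `top_k` is 3, because the third-ranked document falls below the threshold.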

API reference

`turftopic._concept_browser.create_browser(seeds, topic_names, keywords, topic_descriptions, topic_sizes, top_documents)`

Creates a concept browser figure with which you can investigate concepts related to different seeds.

Parameters:

Name	Type	Description	Default
`seeds`	`list[str]`	Seed phrases used for the analysis.	required
`topic_names`	`list[list[str]]`	Names of the topics for each of the seed phrases.	required
`keywords`	`list[list[list[str]]]`	Keywords for each of the topics for each seed.	required
`topic_descriptions`	`list[list[str]]`	Descriptions of the topics for each of the seed phrases.	required
`topic_sizes`	`ndarray`	Sizes of the topics for each seed, preferably number of documents.	required
`top_documents`	`list[list[str]]`	Top documents for each of the topics for each seed.	required

Returns:

Type	Description
`HTMLFigure`	Interactive HTML figure that you can either display or save.
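
Because the inputs are nested per seed and per topic, it can help to check that they line up before building the browser. The sketch below is a hypothetical helper, `check_browser_inputs`, which is not part of turftopic. It assumes the layout used in the usage example: one entry per seed, then one entry per topic, with `topic_sizes` as a 2D array of shape (number of seeds, number of topics). In that example, `top_documents` holds a list of documents for each topic of each seed. All values below are placeholders.

```python
import numpy as np


def check_browser_inputs(
    seeds, topic_names, keywords, topic_descriptions, topic_sizes, top_documents
):
    """Hypothetical helper (not part of turftopic): checks that the nested
    inputs agree on the number of seeds and the number of topics per seed."""
    sizes = np.asarray(topic_sizes)
    if sizes.ndim != 2:
        raise ValueError("topic_sizes should be 2D: (n_seeds, n_topics)")
    n_seeds, n_topics = sizes.shape
    if len(seeds) != n_seeds:
        raise ValueError("seeds and topic_sizes disagree on the number of seeds")
    per_seed = {
        "topic_names": topic_names,
        "keywords": keywords,
        "topic_descriptions": topic_descriptions,
        "top_documents": top_documents,
    }
    for name, value in per_seed.items():
        if len(value) != n_seeds:
            raise ValueError(f"{name} should have one entry per seed")
        for entry in value:
            if len(entry) != n_topics:
                raise ValueError(f"{name} should have one entry per topic")
    return n_seeds, n_topics


# Placeholder inputs: 2 seeds with 2 topics each
seeds = ["seed 0", "seed 1"]
topic_names = [["name 0-0", "name 0-1"], ["name 1-0", "name 1-1"]]
keywords = [[["w1", "w2"], ["w3", "w4"]], [["w5", "w6"], ["w7", "w8"]]]
topic_descriptions = [["desc 0-0", "desc 0-1"], ["desc 1-0", "desc 1-1"]]
topic_sizes = np.array([[12, 7], [9, 4]])
top_documents = [[["doc 1"], ["doc 2"]], [["doc 3"], ["doc 4"]]]

shape = check_browser_inputs(
    seeds, topic_names, keywords, topic_descriptions, topic_sizes, top_documents
)
print(shape)  # (2, 2)
```

The helper only checks lengths; it does not inspect the contents of the names, descriptions, or documents.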

Source code in turftopic/_concept_browser.py

def create_browser(
    seeds: list[str],
    topic_names: list[list[str]],
    keywords: list[list[list[str]]],
    topic_descriptions: list[list[str]],
    topic_sizes: np.ndarray,
    top_documents: list[list[str]],
) -> HTMLFigure:
    """Creates a concept browser figure with which you can investigate concepts related to different seeds.

    Parameters
    ----------
    seeds: list[str]
        Seed phrases used for the analysis.
    topic_names: list[list[str]]
        Names of the topics for each of the seed phrases.
    keywords: list[list[list[str]]]
        Keywords for each of the topics for each seed.
    topic_descriptions: list[list[str]]
        Descriptions of the topics for each of the seed phrases.
    topic_sizes: np.ndarray
        Sizes of the topics for each seed, preferably number of documents.
    top_documents: list[list[str]]
        Top documents for each of the topics for each seed.

    Returns
    -------
    HTMLFigure
        Interactive HTML figure that you can either display or save.
    """
    html = HTML_WRAPPER.format(
        style=STYLE,
        body_content=render_widget(
            seeds,
            topic_names,
            keywords,
            topic_descriptions,
            topic_sizes,
            top_documents,
        ),
    )
    return HTMLFigure(html)