Concept Induction (BETA)

Concept induction is the idea that higher-level concepts can be discovered in a corpus and described in detail using the power of Large Language Models (Lam et al. 2024). These high-level concepts can also be discovered from particular angles, using seeds. The original study and the accompanying Lloom package use LLMs at every step of the process, and therefore require substantial computational resources and aggressive down-sampling of the original corpus.

To address this scalability issue, we use a seeded topic model (KeyNMF) to discover the concepts, and only use an LLM to name and describe them. This yields results similar to Lloom's at a fraction of the cost.
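
To make the seeding idea concrete, here is a minimal sketch of a single seeded KeyNMF model (same interface as in the full example below; it assumes corpus is already a list of document strings and relies on KeyNMF's default encoder):

from turftopic import KeyNMF

# Minimal sketch: discover 3 topics biased towards a single angle ("seed")
model = KeyNMF(3, seed_phrase="Taxation")
model.fit(corpus)  # corpus: list of document strings
model.print_topics()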

In addition, we allow users to generate a Concept Browser programmatically, with which these concepts and their related documents can be explored.

Figure 1: Concepts discovered on the political ideologies dataset.

Example Usage

The example below uses a synthetically generated political ideologies dataset, which we examine from the following angles:

  • Taxation
  • Stance on immigration
  • Environmental policy

We use an OpenAI analyzer and KeyNMF, with the paraphrase-MiniLM-L12-v2 embedding model. The code runs in about ten minutes.

Install dependencies and set API Key:

pip install turftopic[openai] datasets
export OPENAI_API_KEY="sk-<your API key here>"
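
Alternatively, the key can be set from within Python before constructing the analyzer; this sketch assumes the analyzer picks up the standard OPENAI_API_KEY environment variable, as the shell export above implies:

import os

# Alternative to the shell export above; assumes the analyzer reads the
# standard OPENAI_API_KEY environment variable
os.environ["OPENAI_API_KEY"] = "sk-<your API key here>"

With the dependencies installed and the key set, the full example is as follows:
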
import numpy as np
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

from turftopic import KeyNMF, create_concept_browser
from turftopic.analyzers import OpenAIAnalyzer

# Loading the dataset from huggingface
ds = load_dataset("JyotiNayak/political_ideologies", split="train")
corpus = list(ds["statement"])

# Embedding all documents in the corpus
encoder = SentenceTransformer("paraphrase-MiniLM-L12-v2")
embeddings = encoder.encode(corpus, show_progress_bar=True)

# Fitting a separate seeded KeyNMF model for each seed (one browser tab each) and storing them
seeds = ["Taxation", "Stance on immigration", "Environmental policy"]
models = []
doc_topic = []
for seed in seeds:
    model = KeyNMF(
        3, encoder=encoder, seed_phrase=seed, seed_exponent=2, random_state=42
    )
    doc_topic_matrix = model.fit_transform(corpus, embeddings=embeddings)
    doc_topic.append(doc_topic_matrix)
    models.append(model)
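
Before moving on, it can be useful to sanity-check what each seeded model discovered. The sketch below uses print_topics, turftopic's standard topic inspection method:

# Optional: inspect the topics each seeded model discovered
for seed, model in zip(seeds, models):
    print(f"--- {seed} ---")
    model.print_topics()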

# Calculating topic sizes
top_documents = []
topic_sizes = []
for doc_topic_matrix in doc_topic:
    # We say that if a document has at least five percent of the max importance
    # then it contains the topic
    rescaled = doc_topic_matrix / doc_topic_matrix.max()
    sizes = (rescaled >= 0.05).sum(axis=0)
    topic_sizes.append(sizes)
    # Finding representative documents for each topic
    docs = []
    for doc_t in rescaled.T:
        # Extracting top 10 documents for each topic
        top = np.argsort(-doc_t)[:10]
        # Keeping only the documents that we have marked as containing the topic
        top = top[doc_t[top] >= 0.05]
        docs.append([corpus[i] for i in top])
    top_documents.append(docs)
topic_sizes = np.stack(topic_sizes)
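
As a quick check on the counts (plain NumPy, nothing turftopic-specific), you can print the per-seed size vectors; after np.stack, topic_sizes has one row per seed and one column per topic:

# Optional: how many documents contain each topic, per seed
for seed, sizes_per_topic in zip(seeds, topic_sizes):
    print(seed, sizes_per_topic)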

# Running topic analysis on all models using GPT-5-Nano
analyzer = OpenAIAnalyzer()
analysis_results = []
for model, docs in zip(models, top_documents):
    res = analyzer.analyze_topics(
        keywords=model.get_top_words(), documents=docs
    )
    analysis_results.append(res)
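
The analyzer results can be inspected directly; the topic_names and topic_descriptions attributes used here are the same ones passed to the browser below:

# Optional: print the LLM-generated topic names and descriptions per seed
for seed, res in zip(seeds, analysis_results):
    print(f"=== {seed} ===")
    for name, description in zip(res.topic_names, res.topic_descriptions):
        print(f"{name}: {description}")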

# Creating the concept browser:
browser = create_concept_browser(
    seeds=seeds,
    topic_names=[res.topic_names for res in analysis_results],
    keywords=[model.get_top_words() for model in models],
    topic_descriptions=[res.topic_descriptions for res in analysis_results],
    topic_sizes=topic_sizes,
    top_documents=top_documents,
)
browser.show()
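
The figure can also be exported instead of displayed. The exact saving API of HTMLFigure is not shown in this example, so the snippet below assumes a hypothetical write_html method; consult the HTMLFigure reference below for the actual method name:

# Saving the browser to a standalone HTML file.
# NOTE: write_html is an assumed method name, not a confirmed part of the API.
browser.write_html("concept_browser.html")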

See Figure 1 for the results.

API reference

turftopic._concept_browser.create_browser(seeds, topic_names, keywords, topic_descriptions, topic_sizes, top_documents)

Creates a concept browser figure with which you can investigate concepts related to different seeds.

Parameters:

seeds : list[str] (required)
    Seed phrases used for the analysis.
topic_names : list[list[str]] (required)
    Names of the topics for each of the seed phrases.
keywords : list[list[list[str]]] (required)
    Keywords for each of the topics for each seed.
topic_descriptions : list[list[str]] (required)
    Descriptions of the topics for each of the seed phrases.
topic_sizes : ndarray (required)
    Sizes of the topics for each seed, preferably number of documents.
top_documents : list[list[str]] (required)
    Top documents for each of the topics for each seed.

Returns:

HTMLFigure
    Interactive HTML figure that you can either display or save.

Source code in turftopic/_concept_browser.py
def create_browser(
    seeds: list[str],
    topic_names: list[list[str]],
    keywords: list[list[list[str]]],
    topic_descriptions: list[list[str]],
    topic_sizes: np.ndarray,
    top_documents: list[list[str]],
) -> HTMLFigure:
    """Creates a concept browser figure with which you can investigate concepts related to different seeds.

    Parameters
    ----------
    seeds: list[str]
        Seed phrases used for the analysis.
    topic_names: list[list[str]]
        Names of the topics for each of the seed phrases.
    keywords: list[list[list[str]]]
        Keywords for each of the topics for each seed.
    topic_descriptions: list[list[str]]
        Descriptions of the topics for each of the seed phrases.
    topic_sizes: np.ndarray
        Sizes of the topics for each seed, preferably number of documents.
    top_documents: list[list[str]]
        Top documents for each of the topics for each seed.

    Returns
    -------
    HTMLFigure
        Interactive HTML figure that you can either display or save.
    """
    html = HTML_WRAPPER.format(
        style=STYLE,
        body_content=render_widget(
            seeds,
            topic_names,
            keywords,
            topic_descriptions,
            topic_sizes,
            top_documents,
        ),
    )
    return HTMLFigure(html)