Concept Induction (BETA)
Concept induction is the idea that high-level concepts in a corpus can be discovered and described in detail using Large Language Models (Lam et al. 2024). These concepts can also be discovered from particular angles by using seeds. The original study and the Lloom package use LLMs at every step, which requires substantial computational resources and aggressive down-sampling of the original corpus.
To address this scalability issue, we use a seeded topic model (KeyNMF) to discover the concepts, and only use LLMs to name and describe them. This gives results similar to Lloom's at a fraction of the cost.
In addition, we allow users to generate a Concept Browser programmatically, with which these concepts and their related documents can be explored.
Example Usage
The example below uses a synthetically generated political ideologies dataset, which we examine from the following angles:
- Taxation
- Stance on immigration
- Environmental policy
We use an OpenAI analyzer and KeyNMF with the paraphrase-MiniLM-L12-v2 embedding model.
The code runs in about ten minutes.
Install dependencies and set API Key:
pip install turftopic[openai] datasets
export OPENAI_API_KEY="sk-<your API key here>"
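Because the full example takes several minutes and calls an external API, it can help to fail fast when the key is missing. The following is a minimal sketch, assuming the analyzer reads the key from the `OPENAI_API_KEY` environment variable exported above; `require_api_key` is a hypothetical helper and not part of turftopic.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise if it is missing or empty.

    Hypothetical helper for illustration; not part of turftopic.
    """
    # Fall back to the real process environment when no mapping is passed in
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; export it before running the example."
        )
    return key
```

Calling `require_api_key()` at the top of the script would stop the run before any embeddings are computed.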
import numpy as np
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from turftopic import KeyNMF, create_concept_browser
from turftopic.analyzers import OpenAIAnalyzer
# Loading the dataset from huggingface
ds = load_dataset("JyotiNayak/political_ideologies", split="train")
corpus = list(ds["statement"])
# Embedding all documents in the corpus
encoder = SentenceTransformer("paraphrase-MiniLM-L12-v2")
embeddings = encoder.encode(corpus, show_progress_bar=True)
# Running separate seeded KeyNMF models for each tab and saving them
seeds = ["Taxation", "Stance on immigration", "Environmental policy"]
models = []
doc_topic = []
for seed in seeds:
    model = KeyNMF(
        3, encoder=encoder, seed_phrase=seed, seed_exponent=2, random_state=42
    )
    doc_topic_matrix = model.fit_transform(corpus, embeddings=embeddings)
    doc_topic.append(doc_topic_matrix)
    models.append(model)
# Calculating topic sizes and collecting representative documents
top_documents = []
topic_sizes = []
for doc_topic_matrix in doc_topic:
    # We say that if a document has at least five percent of the max importance
    # (taken over the whole matrix), then it contains the topic
    rescaled = doc_topic_matrix / doc_topic_matrix.max()
    sizes = (rescaled >= 0.05).sum(axis=0)
    topic_sizes.append(sizes)
    # Finding representative documents for each topic
    docs = []
    for doc_t in rescaled.T:
        # Extracting top 10 documents for each topic
        top = np.argsort(-doc_t)[:10]
        # Making sure only those documents get in,
        # that we have marked to contain the topic
        top = top[doc_t[top] >= 0.05]
        docs.append([corpus[i] for i in top])
    top_documents.append(docs)
topic_sizes = np.stack(topic_sizes)
# Running topic analysis on all models using GPT-5-Nano
analyzer = OpenAIAnalyzer()
analysis_results = []
for model, docs in zip(models, top_documents):
    res = analyzer.analyze_topics(
        keywords=model.get_top_words(), documents=docs
    )
    analysis_results.append(res)
# Creating the concept browser:
browser = create_concept_browser(
    seeds=seeds,
    topic_names=[res.topic_names for res in analysis_results],
    keywords=[model.get_top_words() for model in models],
    topic_descriptions=[res.topic_descriptions for res in analysis_results],
    topic_sizes=topic_sizes,
    top_documents=top_documents,
)
browser.show()
See Figure 1 for the results
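The rule the example uses for topic sizes and representative documents can be restated on a toy matrix. The sketch below uses plain Python lists so that it runs without NumPy; `topic_sizes_and_top_docs` is a hypothetical name and not part of turftopic. As in the example, importances are rescaled by the maximum over the whole matrix rather than per topic.

```python
def topic_sizes_and_top_docs(doc_topic, corpus, threshold=0.05, top_k=10):
    """Count documents per topic and pick representative documents.

    doc_topic: list of rows, one per document, each a list of topic importances.
    Hypothetical helper for illustration; not part of turftopic.
    """
    # Rescale by the single largest importance in the whole matrix
    global_max = max(max(row) for row in doc_topic)
    rescaled = [[value / global_max for value in row] for row in doc_topic]
    n_topics = len(rescaled[0])
    sizes, top_docs = [], []
    for t in range(n_topics):
        column = [row[t] for row in rescaled]
        # A document "contains" the topic if it reaches the threshold
        sizes.append(sum(value >= threshold for value in column))
        # Rank documents by importance and keep at most top_k of them
        ranked = sorted(range(len(column)), key=lambda i: -column[i])[:top_k]
        # Drop ranked documents that were not marked as containing the topic
        top_docs.append([corpus[i] for i in ranked if column[i] >= threshold])
    return sizes, top_docs


toy_corpus = ["doc a", "doc b", "doc c", "doc d"]
toy_matrix = [
    [1.0, 0.0],
    [0.5, 0.02],
    [0.01, 0.8],
    [0.0, 0.04],
]
sizes, top_docs = topic_sizes_and_top_docs(toy_matrix, toy_corpus)
print(sizes)     # [2, 1]
print(top_docs)  # [['doc a', 'doc b'], ['doc c']]
```

With a global maximum, a topic whose importances are small across the board can end up with few or no documents above the threshold, which is worth keeping in mind when reading the sizes in the browser.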
API reference
turftopic._concept_browser.create_browser(seeds, topic_names, keywords, topic_descriptions, topic_sizes, top_documents)
Creates a concept browser figure with which you can investigate concepts related to different seeds.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
seeds | list[str] | Seed phrases used for the analysis. | required |
topic_names | list[list[str]] | Names of the topics for each of the seed phrases. | required |
keywords | list[list[list[str]]] | Keywords for each of the topics for each seed. | required |
topic_descriptions | list[list[str]] | Descriptions of the topics for each of the seed phrases. | required |
topic_sizes | ndarray | Sizes of the topics for each seed, preferably number of documents. | required |
top_documents | list[list[str]] | Top documents for each of the topics for each seed. | required |
Returns:
Type | Description |
---|---|
HTMLFigure | Interactive HTML figure that you can either display or save. |
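Since every argument is nested per seed and then per topic, a length mismatch is an easy mistake to make. The following is a rough sketch of a pre-flight check, assuming the nesting used in the example above (one entry per seed, and within each seed one entry per topic); `check_browser_inputs` is a hypothetical helper and not part of turftopic.

```python
def check_browser_inputs(
    seeds, topic_names, keywords, topic_descriptions, topic_sizes, top_documents
):
    """Raise ValueError if the per-seed or per-topic lengths disagree.

    Hypothetical helper for illustration; not part of turftopic.
    """
    per_seed = {
        "topic_names": topic_names,
        "keywords": keywords,
        "topic_descriptions": topic_descriptions,
        "topic_sizes": topic_sizes,
        "top_documents": top_documents,
    }
    # Every argument should have exactly one entry per seed
    for name, value in per_seed.items():
        if len(value) != len(seeds):
            raise ValueError(
                f"{name} has {len(value)} entries, expected one per seed ({len(seeds)})"
            )
    # Within a seed, every argument should have one entry per topic
    for i, seed in enumerate(seeds):
        n_topics = len(topic_names[i])
        for name, value in per_seed.items():
            if len(value[i]) != n_topics:
                raise ValueError(
                    f"{name}[{i}] has {len(value[i])} entries for seed {seed!r}, "
                    f"expected {n_topics}"
                )
```

The check only relies on `len`, so it accepts nested lists as well as a two-dimensional array for `topic_sizes`.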
Source code in turftopic/_concept_browser.py