
Topic Namers

Sometimes, especially when the number of topics grows large, it might be convenient to assign human-readable names to topics in an automated manner.

Turftopic allows you to accomplish this with a number of different topic namer models.

Large Language Models

Turftopic lets you utilise Large Language Models for generating human-readable topic names. This is done by instructing the language model to generate a topic name based on the keywords the topic model assigns as the most important for a given topic.

Running LLMs locally

You can use any LLM from the Hugging Face Hub to generate topic names on your own machine. The default in Turftopic is SmolLM, due to its small size and speed, but we recommend using larger LLMs for higher-quality topic names, especially in multilingual contexts.

from turftopic import KeyNMF
from turftopic.namers import LLMTopicNamer

model = KeyNMF(10).fit(corpus)

namer = LLMTopicNamer("HuggingFaceTB/SmolLM2-1.7B-Instruct")
model.rename_topics(namer)

model.print_topics()
| Topic ID | Topic Name | Highest Ranking |
| --- | --- | --- |
| 0 | Windows NT | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps |
| 1 | Theism vs. Atheism | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith |
| 2 | "486 Motherboard" | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance |
| 3 | Disk Drives | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
| 4 | Ethics | morality, moral, objective, immoral, morals, subjective, morally, society, animals, species |
| 5 | Christianity | christian, bible, christians, god, christianity, religion, jesus, faith, religious, biblical |
| 6 | modem-port-serial-connect-uart-pc-9600 | modem, port, serial, modems, ports, uart, pc, connect, fax, 9600 |
| 7 | "Graphics Card" | card, drivers, monitor, vga, driver, cards, ati, graphics, diamond, monitors |
| 8 | File Manager | file, files, ftp, bmp, windows, program, directory, bitmap, win3, zip |
| 9 | Printer and Fonts | printer, print, fonts, printing, font, printers, hp, driver, deskjet, prints |
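
If you have an accelerator available, swapping in a larger instruction-tuned model usually yields cleaner names (compare topics 2, 6 and 7 above). The sketch below is only an illustration: the Qwen model ID is an assumed example of a larger chat-tuned model on the Hub, and the device argument (documented in the API reference below) should match your hardware.

from turftopic.namers import LLMTopicNamer

# Assumed example of a larger chat-tuned model from the Hugging Face Hub
namer = LLMTopicNamer(
    "Qwen/Qwen2.5-7B-Instruct",  # swap in any instruction-finetuned LLM you prefer
    device="cuda",               # defaults to "cpu"
)
model.rename_topics(namer)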

Using OpenAI's LLMs

You might not have the computational resources to run a high-quality LLM locally. Luckily, Turftopic also lets you use OpenAI's chat models for topic naming!

Info

You will also need to install the openai Python package and make your API key available as the OPENAI_API_KEY environment variable.

pip install openai
export OPENAI_API_KEY="sk-<your key goes here>"

from turftopic.namers import OpenAITopicNamer

namer = OpenAITopicNamer("gpt-4o-mini")
model.rename_topics(namer)
model.print_topics()
| Topic ID | Topic Name | Highest Ranking |
| --- | --- | --- |
| 0 | Operating Systems and Software | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps |
| 1 | Atheism and Belief Systems | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith |
| 2 | Computer Architecture and Performance | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance |
| 3 | Storage Technologies | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
| 4 | Moral Philosophy and Ethics | morality, moral, objective, immoral, morals, subjective, morally, society, animals, species |
| 5 | Christian Faith and Beliefs | christian, bible, christians, god, christianity, religion, jesus, faith, religious, biblical |
| 6 | Serial Modem Connectivity | modem, port, serial, modems, ports, uart, pc, connect, fax, 9600 |
| 7 | Graphics Card Drivers | card, drivers, monitor, vga, driver, cards, ati, graphics, diamond, monitors |
| 8 | Windows File Management | file, files, ftp, bmp, windows, program, directory, bitmap, win3, zip |
| 9 | Printer Font Management | printer, print, fonts, printing, font, printers, hp, driver, deskjet, prints |

Prompting

Since these namers use chat-finetuned LLMs, you can freely define custom prompts for topic name generation:

from turftopic.namers import OpenAITopicNamer

system_prompt = """
You are a topic namer. When the user gives you a set of keywords, you respond with a name for the topic they describe.
You only respond briefly with the name of the topic, and nothing else.
"""

prompt_template = """
You will be tasked with naming a topic.
Based on the keywords, create a short label that best summarizes the topic.
Only respond with a short, human readable topic name and nothing else.

The topic is described by the following set of keywords: {keywords}.
"""

namer = OpenAITopicNamer("gpt-4o-mini", prompt_template=prompt_template, system_prompt=system_prompt)
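
The custom-prompt namer is used exactly like the default one; a minimal usage sketch, assuming model is the fitted KeyNMF model from above:

model.rename_topics(namer)
model.print_topics()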

N-gram Patterns

You can also name topics using the n-grams from the corpus that are semantically closest to the topic descriptions. This method typically results in lower-quality names, but it might be good enough for your use case.

from turftopic.namers import NgramTopicNamer

namer = NgramTopicNamer(corpus, encoder="all-MiniLM-L6-v2")
model.rename_topics(namer)
model.print_topics()
| Topic ID | Topic Name | Highest Ranking |
| --- | --- | --- |
| 0 | windows and dos | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps |
| 1 | many atheists out there | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith |
| 2 | hardware and software | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance |
| 3 | floppy disk drives and | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
| 4 | morality is subjective | morality, moral, objective, immoral, morals, subjective, morally, society, animals, species |
| 5 | the christian bible | christian, bible, christians, god, christianity, religion, jesus, faith, religious, biblical |
| 6 | the serial port | modem, port, serial, modems, ports, uart, pc, connect, fax, 9600 |
| 7 | the video card | card, drivers, monitor, vga, driver, cards, ati, graphics, diamond, monitors |
| 8 | the file manager | file, files, ftp, bmp, windows, program, directory, bitmap, win3, zip |
| 9 | the print manager | printer, print, fonts, printing, font, printers, hp, driver, deskjet, prints |
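
The n-gram vocabulary the namer picks from can be tuned through its constructor. The sketch below is illustrative only, with arbitrary parameter values; see the NgramTopicNamer reference below for the full set of options:

from turftopic.namers import NgramTopicNamer

namer = NgramTopicNamer(
    corpus,
    encoder="all-MiniLM-L6-v2",
    ngram_range=(2, 3),   # extract bigrams and trigrams instead of the default (3, 4)
    max_features=4000,    # keep only the 4000 most frequent n-grams
)
model.rename_topics(namer)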

API Reference

turftopic.namers.base.TopicNamer

Bases: ABC

Source code in turftopic/namers/base.py
class TopicNamer(ABC):
    @abstractmethod
    def name_topic(
        self,
        keywords: list[str],
    ) -> str:
        """Names one topics based on top descriptive terms.

        Parameters
        ----------
        keywords: list[str]
            Top K highest ranking terms on the topic.

        Returns
        -------
        str
            Topic name returned by the namer.
        """
        pass

    def name_topics(
        self,
        keywords: list[list[str]],
    ) -> list[str]:
        """Names all topics based on top descriptive terms.

        Parameters
        ----------
        keywords: list[list[str]]
            Top K highest ranking terms on the topics.

        Returns
        -------
        list[str]
            Topic names returned by the namer.
        """
        names = []
        for keys in track(keywords, description="Naming topics..."):
            names.append(self.name_topic(keys))
        return names
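
Since name_topic() is the only abstract method, you can implement your own naming strategy by subclassing TopicNamer; name_topics() is inherited. The following is a toy sketch, not part of the library, that simply joins the three highest-ranking keywords:

from turftopic.namers.base import TopicNamer


class JoinTopicNamer(TopicNamer):
    """Toy namer that concatenates the three highest-ranking keywords."""

    def name_topic(self, keywords: list[str]) -> str:
        return " / ".join(keywords[:3])


namer = JoinTopicNamer()
model.rename_topics(namer)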

name_topic(keywords) abstractmethod

Names one topic based on top descriptive terms.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| keywords | list[str] | Top K highest ranking terms on the topic. | required |

Returns:

| Type | Description |
| --- | --- |
| str | Topic name returned by the namer. |

Source code in turftopic/namers/base.py
@abstractmethod
def name_topic(
    self,
    keywords: list[str],
) -> str:
    """Names one topics based on top descriptive terms.

    Parameters
    ----------
    keywords: list[str]
        Top K highest ranking terms on the topic.

    Returns
    -------
    str
        Topic name returned by the namer.
    """
    pass

name_topics(keywords)

Names all topics based on top descriptive terms.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| keywords | list[list[str]] | Top K highest ranking terms on the topics. | required |

Returns:

| Type | Description |
| --- | --- |
| list[str] | Topic names returned by the namer. |

Source code in turftopic/namers/base.py
def name_topics(
    self,
    keywords: list[list[str]],
) -> list[str]:
    """Names all topics based on top descriptive terms.

    Parameters
    ----------
    keywords: list[list[str]]
        Top K highest ranking terms on the topics.

    Returns
    -------
    list[str]
        Topic names returned by the namer.
    """
    names = []
    for keys in track(keywords, description="Naming topics..."):
        names.append(self.name_topic(keys))
    return names

turftopic.namers.hf_transformers.LLMTopicNamer

Bases: TopicNamer

Name topics with an instruction-finetuned LLM, e.g. Zephyr-7b-beta

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model_name | str | Model to load from the Hugging Face Hub. | 'HuggingFaceTB/SmolLM2-1.7B-Instruct' |
| prompt_template | str | Prompt template to use when no negative terms are specified. | DEFAULT_PROMPT |
| system_prompt | str | System prompt to use for the language model. | DEFAULT_SYSTEM_PROMPT |
| device | str | Device to run the model on. | 'cpu' |
Source code in turftopic/namers/hf_transformers.py
class LLMTopicNamer(TopicNamer):
    """Name topics with an instruction-finetuned LLM, e.g. Zephyr-7b-beta

    Parameters
    ----------
    model_name: str, default 'HuggingFaceTB/SmolLM2-1.7B-Instruct'
        Model to load from :hugs: Hub.
    prompt_template: str
        Prompt template to use when no negative terms are specified.
    system_prompt: str
        System prompt to use for the language model.
    device: str, default 'cpu'
        Device to run the model on.
    """

    def __init__(
        self,
        model_name: str = "HuggingFaceTB/SmolLM2-1.7B-Instruct",
        prompt_template: str = DEFAULT_PROMPT,
        system_prompt: str = DEFAULT_SYSTEM_PROMPT,
        device: str = "cpu",
    ):
        self.model_name = model_name
        self.prompt_template = prompt_template
        self.system_prompt = system_prompt
        self.device = device
        self.pipe = pipeline(
            "text-generation", self.model_name, device=self.device
        )

    def name_topic(
        self,
        keywords: list[str],
    ) -> str:
        prompt = self.prompt_template.format(keywords=", ".join(keywords))
        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": prompt},
        ]
        response = self.pipe(messages, max_new_tokens=24)[0]["generated_text"][
            -1
        ]
        label = response["content"]
        return label
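
As the signature above shows, a namer can also be called directly on a single list of keywords, which is handy for quick experiments. A small sketch, reusing keywords from topic 6 of the earlier example:

from turftopic.namers import LLMTopicNamer

namer = LLMTopicNamer("HuggingFaceTB/SmolLM2-1.7B-Instruct")
print(namer.name_topic(["modem", "port", "serial", "uart", "fax", "9600"]))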

turftopic.namers.openai.OpenAITopicNamer

Bases: TopicNamer

Name topics with an OpenAI model.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model_name | str | OpenAI model to use. | 'gpt-4o-mini' |
| prompt_template | str | Prompt template to use when no negative terms are specified. | DEFAULT_PROMPT |
| system_prompt | str | System prompt to use for the language model. | DEFAULT_SYSTEM_PROMPT |
Source code in turftopic/namers/openai.py
class OpenAITopicNamer(TopicNamer):
    """Name topics with an OpenAI model.

    Parameters
    ----------
    model_name: str, default 'gpt-4o-mini'
        OpenAI model to use.
    prompt_template: str
        Prompt template to use when no negative terms are specified.
    system_prompt: str
        System prompt to use for the language model.
    """

    def __init__(
        self,
        model_name: str = "gpt-4o-mini",
        prompt_template: str = DEFAULT_PROMPT,
        system_prompt: str = DEFAULT_SYSTEM_PROMPT,
    ):
        self.client = openai.OpenAI()
        self.model_name = model_name
        self.prompt_template = prompt_template
        self.system_prompt = system_prompt

    def name_topic(
        self,
        keywords: list[str],
    ) -> str:
        prompt = self.prompt_template.format(keywords=", ".join(keywords))
        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": prompt},
        ]
        response = self.client.chat.completions.create(
            messages=messages,
            model=self.model_name,
        )
        return response.choices[0].message.content

turftopic.namers.ngram.NgramTopicNamer

Bases: TopicNamer

Retrieves the n-grams from a corpus that are most similar to the topic descriptions according to an encoder model; these are then assigned as topic names.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| corpus | Iterable[str] | Corpus to take n-grams from. | required |
| encoder | Union[SentenceTransformer, str] | Model to encode documents/terms; all-MiniLM-L6-v2 is the default. | 'sentence-transformers/all-MiniLM-L6-v2' |
| ngram_range | tuple[int, int] | The lower and upper boundary of the range of n-values for different word n-grams to be extracted. | (3, 4) |
| max_features | Optional[int] | Top n-grams to keep; if None, all are kept. | 8000 |
| vectorizer | Optional[CountVectorizer] | Vectorizer used for n-gram extraction. Can be used to prune or filter the vocabulary. | None |
Source code in turftopic/namers/ngram.py
class NgramTopicNamer(TopicNamer):
    """Retrieves the most similar n-grams from a corpus using an encoder model
    to the topic descriptions, these will be assigned as topic names.

    Parameters
    ----------
    corpus: Iterable[str]
        Corpus to take n-grams from.
    encoder: str or Encoder, default 'all-MiniLM-L6-v2'
        Model to encode documents/terms, all-MiniLM-L6-v2 is the default.
    ngram_range: tuple[int, int], default (3, 4)
        The lower and upper boundary of the range of n-values for different word n-grams to be extracted.
    max_features: Optional[int], default 8000
        Top n-grams to keep, if None, all are kept.
    vectorizer: CountVectorizer, default None
        Vectorizer used for n-gram extraction.
        Can be used to prune or filter the vocabulary.
    """

    def __init__(
        self,
        corpus: Iterable[str],
        encoder: Union[
            SentenceTransformer, str
        ] = "sentence-transformers/all-MiniLM-L6-v2",
        ngram_range: tuple[int, int] = (3, 4),
        max_features: Optional[int] = 8000,
        vectorizer: Optional[CountVectorizer] = None,
    ):
        if isinstance(encoder, str):
            self.encoder_ = SentenceTransformer(encoder)
        else:
            self.encoder_ = encoder
        if vectorizer is None:
            self.vectorizer = CountVectorizer(
                ngram_range=ngram_range,
                max_features=max_features,
            )
        else:
            self.vectorizer = vectorizer
        console = Console()
        with console.status("Fitting namer") as status:
            status.update("Collecting n-grams")
            self.vectorizer.fit(corpus)
            self.ngrams = self.vectorizer.get_feature_names_out()
            console.log("N-grams learned")
            status.update("Encoding n-grams")
            if self.is_encoder_promptable:
                self.ngram_embeddings = self.encoder_.encode(
                    self.ngrams, prompt_name="passage"
                )
            else:
                self.ngram_embeddings = self.encoder_.encode(self.ngrams)
            console.log("N-grams encoded")

    @property
    def is_encoder_promptable(self) -> bool:
        prompts = getattr(self.encoder_, "prompts", None)
        if prompts is None:
            return False
        if ("query" in prompts) and ("passage" in prompts):
            return True

    def name_topic(
        self,
        keywords: list[str],
    ) -> str:
        query = ", ".join(keywords)
        if self.is_encoder_promptable:
            query_embedding = self.encoder_.encode(
                [query], prompt_name="query"
            )
        else:
            query_embedding = self.encoder_.encode([query])
        similarities = cosine_similarity(
            query_embedding, self.ngram_embeddings
        )
        similarities = np.ravel(similarities)
        name = self.ngrams[np.argmax(similarities)]
        return name