
Topic Namers

Sometimes, especially when the number of topics grows large, it might be convenient to assign human-readable names to topics in an automated manner.

Turftopic allows you to accomplish this with a number of different topic namer models.

Large Language Models

Turftopic lets you utilise Large Language Models for generating human-readable topic names. This is done by instructing the language model to generate a topic name based on the keywords the topic model assigns as the most important for a given topic.

Running LLMs locally

You can use any LLM from the Hugging Face Hub to generate topic names on your own machine. The default in Turftopic is SmolLM, due to its small size and speed, but we recommend using larger LLMs for higher-quality topic names, especially in multilingual contexts.

from turftopic import KeyNMF
from turftopic.namers import LLMTopicNamer

model = KeyNMF(10).fit(corpus)

namer = LLMTopicNamer("HuggingFaceTB/SmolLM2-1.7B-Instruct")
model.rename_topics(namer)

model.print_topics()
| Topic ID | Topic Name | Highest Ranking |
| --- | --- | --- |
| 0 | Windows NT | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps |
| 1 | Theism vs. Atheism | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith |
| 2 | "486 Motherboard" | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance |
| 3 | Disk Drives | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
| 4 | Ethics | morality, moral, objective, immoral, morals, subjective, morally, society, animals, species |
| 5 | Christianity | christian, bible, christians, god, christianity, religion, jesus, faith, religious, biblical |
| 6 | modem-port-serial-connect-uart-pc-9600 | modem, port, serial, modems, ports, uart, pc, connect, fax, 9600 |
| 7 | "Graphics Card" | card, drivers, monitor, vga, driver, cards, ati, graphics, diamond, monitors |
| 8 | File Manager | file, files, ftp, bmp, windows, program, directory, bitmap, win3, zip |
| 9 | Printer and Fonts | printer, print, fonts, printing, font, printers, hp, driver, deskjet, prints |
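
If you have an accelerator available, swapping in a larger instruction-tuned model usually yields cleaner names (compare topics 2, 6 and 7 above). The sketch below is only an illustration: the Qwen model ID is an assumed example of a larger chat-tuned model on the Hub, and the device argument (documented in the API reference below) should match your hardware.

from turftopic.namers import LLMTopicNamer

# Assumed example of a larger chat-tuned model from the Hugging Face Hub
namer = LLMTopicNamer(
    "Qwen/Qwen2.5-7B-Instruct",  # swap in any instruction-finetuned LLM you prefer
    device="cuda",               # defaults to "cpu"
)
model.rename_topics(namer)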

Using OpenAI's LLMs

You might not have the computational resources to run a high-quality LLM locally. Luckily, Turftopic also lets you use OpenAI's chat models for topic naming!

Info

You will also need to install the openai Python package and make your API key available as the OPENAI_API_KEY environment variable.

pip install openai
export OPENAI_API_KEY="sk-<your key goes here>"

from turftopic.namers import OpenAITopicNamer

namer = OpenAITopicNamer("gpt-4o-mini")
model.rename_topics(namer)
model.print_topics()
| Topic ID | Topic Name | Highest Ranking |
| --- | --- | --- |
| 0 | Operating Systems and Software | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps |
| 1 | Atheism and Belief Systems | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith |
| 2 | Computer Architecture and Performance | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance |
| 3 | Storage Technologies | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
| 4 | Moral Philosophy and Ethics | morality, moral, objective, immoral, morals, subjective, morally, society, animals, species |
| 5 | Christian Faith and Beliefs | christian, bible, christians, god, christianity, religion, jesus, faith, religious, biblical |
| 6 | Serial Modem Connectivity | modem, port, serial, modems, ports, uart, pc, connect, fax, 9600 |
| 7 | Graphics Card Drivers | card, drivers, monitor, vga, driver, cards, ati, graphics, diamond, monitors |
| 8 | Windows File Management | file, files, ftp, bmp, windows, program, directory, bitmap, win3, zip |
| 9 | Printer Font Management | printer, print, fonts, printing, font, printers, hp, driver, deskjet, prints |

Prompting

Since these namers use chat-finetuned LLMs, you can freely define custom prompts for topic name generation:

from turftopic.namers import OpenAITopicNamer

system_prompt = """
You are a topic namer. When the user gives you a set of keywords, you respond with a name for the topic they describe.
You only respond briefly with the name of the topic, and nothing else.
"""

prompt_template = """
You will be tasked with naming a topic.
Based on the keywords, create a short label that best summarizes the topic.
Only respond with a short, human readable topic name and nothing else.

The topic is described by the following set of keywords: {keywords}.
"""

namer = OpenAITopicNamer("gpt-4o-mini", prompt_template=prompt_template, system_prompt=system_prompt)
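
The custom-prompt namer is used exactly like the default one; a minimal usage sketch, assuming model is the fitted KeyNMF model from above:

model.rename_topics(namer)
model.print_topics()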

N-gram Patterns

You can also name topics using the n-grams from the corpus that are semantically closest to the topic descriptions. This method typically results in lower-quality names, but it might be good enough for your use case.

from turftopic.namers import NgramTopicNamer

namer = NgramTopicNamer(corpus, encoder="all-MiniLM-L6-v2")
model.rename_topics(namer)
model.print_topics()
| Topic ID | Topic Name | Highest Ranking |
| --- | --- | --- |
| 0 | windows and dos | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps |
| 1 | many atheists out there | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith |
| 2 | hardware and software | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance |
| 3 | floppy disk drives and | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
| 4 | morality is subjective | morality, moral, objective, immoral, morals, subjective, morally, society, animals, species |
| 5 | the christian bible | christian, bible, christians, god, christianity, religion, jesus, faith, religious, biblical |
| 6 | the serial port | modem, port, serial, modems, ports, uart, pc, connect, fax, 9600 |
| 7 | the video card | card, drivers, monitor, vga, driver, cards, ati, graphics, diamond, monitors |
| 8 | the file manager | file, files, ftp, bmp, windows, program, directory, bitmap, win3, zip |
| 9 | the print manager | printer, print, fonts, printing, font, printers, hp, driver, deskjet, prints |
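
The n-gram vocabulary the namer picks from can be tuned through its constructor. The sketch below is illustrative only, with arbitrary parameter values; see the NgramTopicNamer reference below for the full set of options:

from turftopic.namers import NgramTopicNamer

namer = NgramTopicNamer(
    corpus,
    encoder="all-MiniLM-L6-v2",
    ngram_range=(2, 3),   # extract bigrams and trigrams instead of the default (3, 4)
    max_features=4000,    # keep only the 4000 most frequent n-grams
)
model.rename_topics(namer)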

API Reference

turftopic.namers.base.TopicNamer

Bases: ABC

Source code in turftopic/namers/base.py
class TopicNamer(ABC):
    @abstractmethod
    def name_topic(
        self,
        keywords: list[str],
    ) -> str:
        """Names one topics based on top descriptive terms.

        Parameters
        ----------
        keywords: list[str]
            Top K highest ranking terms on the topic.

        Returns
        -------
        str
            Topic name returned by the namer.
        """
        pass

    def name_topics(
        self,
        keywords: list[list[str]],
    ) -> list[str]:
        """Names all topics based on top descriptive terms.

        Parameters
        ----------
        keywords: list[list[str]]
            Top K highest ranking terms on the topics.

        Returns
        -------
        list[str]
            Topic names returned by the namer.
        """
        names = []
        for keys in track(keywords, description="Naming topics..."):
            names.append(self.name_topic(keys))
        return names
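
Since name_topic() is the only abstract method, you can implement your own naming strategy by subclassing TopicNamer; name_topics() is inherited. The following is a toy sketch, not part of the library, that simply joins the three highest-ranking keywords:

from turftopic.namers.base import TopicNamer


class JoinTopicNamer(TopicNamer):
    """Toy namer that concatenates the three highest-ranking keywords."""

    def name_topic(self, keywords: list[str]) -> str:
        return " / ".join(keywords[:3])


namer = JoinTopicNamer()
model.rename_topics(namer)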

name_topic(keywords) abstractmethod

Names one topic based on top descriptive terms.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| keywords | list[str] | Top K highest ranking terms on the topic. | required |

Returns:

| Type | Description |
| --- | --- |
| str | Topic name returned by the namer. |

Source code in turftopic/namers/base.py
@abstractmethod
def name_topic(
    self,
    keywords: list[str],
) -> str:
    """Names one topics based on top descriptive terms.

    Parameters
    ----------
    keywords: list[str]
        Top K highest ranking terms on the topic.

    Returns
    -------
    str
        Topic name returned by the namer.
    """
    pass

name_topics(keywords)

Names all topics based on top descriptive terms.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| keywords | list[list[str]] | Top K highest ranking terms on the topics. | required |

Returns:

| Type | Description |
| --- | --- |
| list[str] | Topic names returned by the namer. |

Source code in turftopic/namers/base.py
def name_topics(
    self,
    keywords: list[list[str]],
) -> list[str]:
    """Names all topics based on top descriptive terms.

    Parameters
    ----------
    keywords: list[list[str]]
        Top K highest ranking terms on the topics.

    Returns
    -------
    list[str]
        Topic names returned by the namer.
    """
    names = []
    for keys in track(keywords, description="Naming topics..."):
        names.append(self.name_topic(keys))
    return names

turftopic.namers.hf_transformers.LLMTopicNamer

Bases: TopicNamer

Name topics with an instruction-finetuned LLM, e.g. Zephyr-7b-beta

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model_name | str | Model to load from the Hugging Face Hub. | 'HuggingFaceTB/SmolLM2-1.7B-Instruct' |
| prompt_template | str | Prompt template to use when no negative terms are specified. | DEFAULT_PROMPT |
| system_prompt | str | System prompt to use for the language model. | DEFAULT_SYSTEM_PROMPT |
| device | str | Device to run the model on. | 'cpu' |
Source code in turftopic/namers/hf_transformers.py
class LLMTopicNamer(TopicNamer):
    """Name topics with an instruction-finetuned LLM, e.g. Zephyr-7b-beta

    Parameters
    ----------
    model_name: str, default 'HuggingFaceTB/SmolLM2-1.7B-Instruct'
        Model to load from :hugs: Hub.
    prompt_template: str
        Prompt template to use when no negative terms are specified.
    system_prompt: str
        System prompt to use for the language model.
    device: str, default 'cpu'
        Device to run the model on.
    """

    def __init__(
        self,
        model_name: str = "HuggingFaceTB/SmolLM2-1.7B-Instruct",
        prompt_template: str = DEFAULT_PROMPT,
        system_prompt: str = DEFAULT_SYSTEM_PROMPT,
        device: str = "cpu",
    ):
        self.model_name = model_name
        self.prompt_template = prompt_template
        self.system_prompt = system_prompt
        self.device = device
        self.pipe = pipeline(
            "text-generation", self.model_name, device=self.device
        )

    def name_topic(
        self,
        keywords: list[str],
    ) -> str:
        prompt = self.prompt_template.format(keywords=", ".join(keywords))
        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": prompt},
        ]
        response = self.pipe(messages, max_new_tokens=24)[0]["generated_text"][
            -1
        ]
        label = response["content"]
        return label
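
As the signature above shows, a namer can also be called directly on a single list of keywords, which is handy for quick experiments. A small sketch, reusing keywords from topic 6 of the earlier example:

from turftopic.namers import LLMTopicNamer

namer = LLMTopicNamer("HuggingFaceTB/SmolLM2-1.7B-Instruct")
print(namer.name_topic(["modem", "port", "serial", "uart", "fax", "9600"]))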

turftopic.namers.openai.OpenAITopicNamer

Bases: TopicNamer

Name topics with an OpenAI model.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model_name | str | OpenAI model to use. | 'gpt-4o-mini' |
| prompt_template | str | Prompt template to use when no negative terms are specified. | DEFAULT_PROMPT |
| system_prompt | str | System prompt to use for the language model. | DEFAULT_SYSTEM_PROMPT |
Source code in turftopic/namers/openai.py
class OpenAITopicNamer(TopicNamer):
    """Name topics with an OpenAI model.

    Parameters
    ----------
    model_name: str, default 'gpt-4o-mini'
        OpenAI model to use.
    prompt_template: str
        Prompt template to use when no negative terms are specified.
    system_prompt: str
        System prompt to use for the language model.
    """

    def __init__(
        self,
        model_name: str = "gpt-4o-mini",
        prompt_template: str = DEFAULT_PROMPT,
        system_prompt: str = DEFAULT_SYSTEM_PROMPT,
    ):
        self.client = openai.OpenAI()
        self.model_name = model_name
        self.prompt_template = prompt_template
        self.system_prompt = system_prompt

    def name_topic(
        self,
        keywords: list[str],
    ) -> str:
        prompt = self.prompt_template.format(keywords=", ".join(keywords))
        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": prompt},
        ]
        response = self.client.chat.completions.create(
            messages=messages,
            model=self.model_name,
        )
        return response.choices[0].message.content

turftopic.namers.ngram.NgramTopicNamer

Bases: TopicNamer

Retrieves the n-grams from a corpus that are most similar to the topic descriptions according to an encoder model; these are then assigned as topic names.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| corpus | Iterable[str] | Corpus to take n-grams from. | required |
| encoder | Union[SentenceTransformer, str] | Model to encode documents/terms; all-MiniLM-L6-v2 is the default. | 'sentence-transformers/all-MiniLM-L6-v2' |
| ngram_range | tuple[int, int] | The lower and upper boundary of the range of n-values for different word n-grams to be extracted. | (3, 4) |
| max_features | Optional[int] | Top n-grams to keep; if None, all are kept. | 8000 |
| vectorizer | Optional[CountVectorizer] | Vectorizer used for n-gram extraction. Can be used to prune or filter the vocabulary. | None |
Source code in turftopic/namers/ngram.py
class NgramTopicNamer(TopicNamer):
    """Retrieves the most similar n-grams from a corpus using an encoder model
    to the topic descriptions, these will be assigned as topic names.

    Parameters
    ----------
    corpus: Iterable[str]
        Corpus to take n-grams from.
    encoder: str or Encoder, default 'all-MiniLM-L6-v2'
        Model to encode documents/terms, all-MiniLM-L6-v2 is the default.
    ngram_range: tuple[int, int], default (3, 4)
        The lower and upper boundary of the range of n-values for different word n-grams to be extracted.
    max_features: Optional[int], default 8000
        Top n-grams to keep, if None, all are kept.
    vectorizer: CountVectorizer, default None
        Vectorizer used for n-gram extraction.
        Can be used to prune or filter the vocabulary.
    """

    def __init__(
        self,
        corpus: Iterable[str],
        encoder: Union[
            SentenceTransformer, str
        ] = "sentence-transformers/all-MiniLM-L6-v2",
        ngram_range: tuple[int, int] = (3, 4),
        max_features: Optional[int] = 8000,
        vectorizer: Optional[CountVectorizer] = None,
    ):
        if isinstance(encoder, str):
            self.encoder_ = SentenceTransformer(encoder)
        else:
            self.encoder_ = encoder
        if vectorizer is None:
            self.vectorizer = CountVectorizer(
                ngram_range=ngram_range,
                max_features=max_features,
            )
        else:
            self.vectorizer = vectorizer
        console = Console()
        with console.status("Fitting namer") as status:
            status.update("Collecting n-grams")
            self.vectorizer.fit(corpus)
            self.ngrams = self.vectorizer.get_feature_names_out()
            console.log("N-grams learned")
            status.update("Encoding n-grams")
            if self.is_encoder_promptable:
                self.ngram_embeddings = self.encoder_.encode(
                    self.ngrams, prompt_name="passage"
                )
            else:
                self.ngram_embeddings = self.encoder_.encode(self.ngrams)
            console.log("N-grams encoded")

    @property
    def is_encoder_promptable(self) -> bool:
        prompts = getattr(self.encoder_, "prompts", None)
        if prompts is None:
            return False
        if ("query" in prompts) and ("passage" in prompts):
            return True

    def name_topic(
        self,
        keywords: list[str],
    ) -> str:
        query = ", ".join(keywords)
        if self.is_encoder_promptable:
            query_embedding = self.encoder_.encode(
                [query], prompt_name="query"
            )
        else:
            query_embedding = self.encoder_.encode([query])
        similarities = cosine_similarity(
            query_embedding, self.ngram_embeddings
        )
        similarities = np.ravel(similarities)
        name = self.ngrams[np.argmax(similarities)]
        return name