Topic Namers
Sometimes, especially when the number of topics grows large, it might be convenient to assign human-readable names to topics in an automated manner.
Turftopic allows you to accomplish this with a number of different topic namer models.
Large Language Models
Turftopic lets you utilise Large Language Models for generating human-readable topic names. This is done by instructing the language model to generate a topic name based on the keywords the topic model assigns as the most important for a given topic.
Running LLMs locally
You can use any LLM from the HuggingFace Hub to generate topic names on your own machine. The default in Turftopic is SmolLM, due to it's small size and speed, but we recommend using larger LLMs for higher quality topic names, especially in multilingual contexts.
from turftopic import KeyNMF
from turftopic.namers import LLMTopicNamer
model = KeyNMF(10).fit(corpus)
namer = LLMTopicNamer("HuggingFaceTB/SmolLM2-1.7B-Instruct")
model.rename_topics(namer)
model.print_topics()
Topic ID | Topic Name | Highest Ranking |
---|---|---|
0 | Windows NT | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps |
1 | Theism vs. Atheism | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith |
2 | "486 Motherboard" | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance |
3 | Disk Drives | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
4 | Ethics | morality, moral, objective, immoral, morals, subjective, morally, society, animals, species |
5 | Christianity | christian, bible, christians, god, christianity, religion, jesus, faith, religious, biblical |
6 | modem-port-serial-connect-uart-pc-9600 | modem, port, serial, modems, ports, uart, pc, connect, fax, 9600 |
7 | "Graphics Card" | card, drivers, monitor, vga, driver, cards, ati, graphics, diamond, monitors |
8 | File Manager | file, files, ftp, bmp, windows, program, directory, bitmap, win3, zip |
9 | Printer and Fonts | printer, print, fonts, printing, font, printers, hp, driver, deskjet, prints |
Using OpenAI's LLMs
You might not have the computational resources to run a high-quality LLM locally. Luckily Turftopic allows you to use OpenAI's chat models for topic naming too!
Info
You will also need to install the openai
Python package.
pip install openai
export OPENAI_API_KEY="sk-<your key goes here>"
from turftopic.namers import OpenAITopicNamer
namer = OpenAITopicNamer("gpt-4o-mini")
model.rename_topics(namer)
model.print_topics()
Topic ID | Topic Name | Highest Ranking |
---|---|---|
0 | Operating Systems and Software | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps |
1 | Atheism and Belief Systems | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith |
2 | Computer Architecture and Performance | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance |
3 | Storage Technologies | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
4 | Moral Philosophy and Ethics | morality, moral, objective, immoral, morals, subjective, morally, society, animals, species |
5 | Christian Faith and Beliefs | christian, bible, christians, god, christianity, religion, jesus, faith, religious, biblical |
6 | Serial Modem Connectivity | modem, port, serial, modems, ports, uart, pc, connect, fax, 9600 |
7 | Graphics Card Drivers | card, drivers, monitor, vga, driver, cards, ati, graphics, diamond, monitors |
8 | Windows File Management | file, files, ftp, bmp, windows, program, directory, bitmap, win3, zip |
9 | Printer Font Management | printer, print, fonts, printing, font, printers, hp, driver, deskjet, prints |
Prompting
Since these namers use chat-finetuned LLMs you can freely define custom prompts for topic name generation:
from turftopic.namers import OpenAITopicNamer
system_prompt = """
You are a topic namer. When the user gives you a set of keywords, you respond with a name for the topic they describe.
You only repond briefly with the name of the topic, and nothing else.
"""
prompt_template = """
You will be tasked with naming a topic.
Based on the keywords, create a short label that best summarizes the topics.
Only respond with a short, human readable topic name and nothing else.
The topic is described by the following set of keywords: {keywords}.
"""
namer = OpenAITopicNamer("gpt-4o-mini", prompt_template=prompt_template, system_prompt=system_prompt)
N-gram Patterns
You can also name topics based on the semantically closest n-grams from the corpus to the topic descriptions. This method typically results in lower quality names, but might be good enough for your use case.
from turftopic.namers import NgramTopicNamer
namer = NgramTopicNamer(corpus, encoder="all-MiniLM-L6-v2")
model.rename_topics(namer)
model.print_topics()
Topic ID | Topic Name | Highest Ranking |
---|---|---|
0 | windows and dos | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps |
1 | many atheists out there | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith |
2 | hardware and software | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance |
3 | floppy disk drives and | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
4 | morality is subjective | morality, moral, objective, immoral, morals, subjective, morally, society, animals, species |
5 | the christian bible | christian, bible, christians, god, christianity, religion, jesus, faith, religious, biblical |
6 | the serial port | modem, port, serial, modems, ports, uart, pc, connect, fax, 9600 |
7 | the video card | card, drivers, monitor, vga, driver, cards, ati, graphics, diamond, monitors |
8 | the file manager | file, files, ftp, bmp, windows, program, directory, bitmap, win3, zip |
9 | the print manager | printer, print, fonts, printing, font, printers, hp, driver, deskjet, prints |
API Reference
turftopic.namers.base.TopicNamer
Bases: ABC
Source code in turftopic/namers/base.py
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 |
|
name_topic(keywords)
abstractmethod
Names one topics based on top descriptive terms.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
keywords
|
list[str]
|
Top K highest ranking terms on the topic. |
required |
Returns:
Type | Description |
---|---|
str
|
Topic name returned by the namer. |
Source code in turftopic/namers/base.py
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 |
|
name_topics(keywords)
Names all topics based on top descriptive terms.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
keywords
|
list[list[str]]
|
Top K highest ranking terms on the topics. |
required |
Returns:
Type | Description |
---|---|
list[str]
|
Topic names returned by the namer. |
Source code in turftopic/namers/base.py
40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 |
|
turftopic.namers.hf_transformers.LLMTopicNamer
Bases: TopicNamer
Name topics with an instruction-finetuned LLM, e.g. Zephyr-7b-beta
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model_name
|
str
|
Model to load from :hugs: Hub. |
'HuggingFaceTB/SmolLM2-1.7B-Instruct'
|
prompt_template
|
str
|
Prompt template to use when no negative terms are specified. |
DEFAULT_PROMPT
|
system_prompt
|
str
|
System prompt to use for the language model. |
DEFAULT_SYSTEM_PROMPT
|
device
|
str
|
Device to run the model on. |
'cpu'
|
Source code in turftopic/namers/hf_transformers.py
7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 |
|
turftopic.namers.openai.OpenAITopicNamer
Bases: TopicNamer
Name topics with an OpenAI model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model_name
|
str
|
OpenAI model to use. |
'gpt-4o-mini'
|
prompt_template
|
str
|
Prompt template to use when no negative terms are specified. |
DEFAULT_PROMPT
|
system_prompt
|
str
|
System prompt to use for the language model. |
DEFAULT_SYSTEM_PROMPT
|
Source code in turftopic/namers/openai.py
7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 |
|
turftopic.namers.ngram.NgramTopicNamer
Bases: TopicNamer
Retrieves the most similar n-grams from a corpus using an encoder model to the topic descriptions, these will be assigned as topic names.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
corpus
|
Iterable[str]
|
Corpus to take n-grams from. |
required |
encoder
|
Union[SentenceTransformer, str]
|
Model to encode documents/terms, all-MiniLM-L6-v2 is the default. |
'sentence-transformers/all-MiniLM-L6-v2'
|
ngram_range
|
tuple[int, int]
|
The lower and upper boundary of the range of n-values for different word n-grams to be extracted. |
(3, 4)
|
max_features
|
Optional[int]
|
Top n-grams to keep, if None, all are kept. |
8000
|
vectorizer
|
Optional[CountVectorizer]
|
Vectorizer used for n-gram extraction. Can be used to prune or filter the vocabulary. |
None
|
Source code in turftopic/namers/ngram.py
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 |
|