Modifying and finetuning models
Some models in Turftopic can be modified after they have been fitted. This lets you adapt pretrained topic models to your specific use case.
Naming/renaming topics
Topics can be freely renamed in all topic models. This can be beneficial when interpreting models, as it allows you to assign labels to the topics you've already looked at.
from turftopic import SemanticSignalSeparation
model = SemanticSignalSeparation(10).fit(corpus)
# you can specify a dict mapping IDs to names
model.rename_topics({0: "New name for topic 0", 5: "New name for topic 5"})
# or a list of topic names
model.rename_topics([f"Topic {i}" for i in range(10)])
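To make the difference between the two argument forms concrete, here is a minimal pure-Python sketch. It is not Turftopic's implementation: `apply_renames` is a hypothetical helper, and it assumes that topic IDs coincide with positions in the list of names. A dict changes only the topics it mentions, while a list replaces every name at once.

```python
def apply_renames(topic_names, names):
    """Return a new list of topic names after applying `names`.

    Hypothetical helper for illustration only.
    `names` is either a dict mapping a topic's position to its new name,
    or a sequence that provides one name per topic.
    """
    if isinstance(names, dict):
        updated = list(topic_names)
        for topic_id, new_name in names.items():
            updated[topic_id] = new_name  # untouched topics keep their old name
        return updated
    names = list(names)
    if len(names) != len(topic_names):
        raise ValueError("Expected exactly one name per topic.")
    return names


current = [f"topic_{i}" for i in range(3)]
partial = apply_renames(current, {0: "Sports", 2: "Religion"})
full = apply_renames(current, ["A", "B", "C"])
print(partial)  # ['Sports', 'topic_1', 'Religion']
print(full)     # ['A', 'B', 'C']
```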
You can also automatically name topics with a topic namer model.
from turftopic.namers import LLMTopicNamer
namer = LLMTopicNamer("HuggingFaceTB/SmolLM2-1.7B-Instruct")
model.rename_topics(namer)
Changing the number of topics
Several models allow you to change the number of topics after fitting.
Refitting \(S^3\) with a different number of topics
\(S^3\) models store all the information needed to refit them with a different number of topics, iterations or random seed. Refitting is fast and lets you explore the semantics of a corpus at multiple levels of detail. Moreover, any model you load from a third party can be refitted at will.
from turftopic import load_model
model = load_model("hf_user/some_s3_model")
print(type(model))
# turftopic.models.decomp.SemanticSignalSeparation
print(len(model.topic_names))
# 10
model.refit(n_components=20, random_seed=42)
print(len(model.topic_names))
# 20
Merging topics in clustering models
Clustering models are very flexible in this regard, as they allow you to merge clusters after the model has been fitted.
Manual topic merging
You can merge topics manually in a clustering model by using the join_topics() method:
from turftopic import ClusteringTopicModel
model = ClusteringTopicModel().fit(corpus)
# This will join topics 0, 5 and 4 into topic 0
model.join_topics([0, 5, 4])
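In terms of document labels, joining topics amounts to relabelling. The sketch below is a plain-Python illustration and not the library's code: `join_labels` is a hypothetical helper, and it assumes, following the comment above, that the first ID in the list is the one that survives. Whether the library renumbers the remaining topics afterwards is not covered here.

```python
def join_labels(labels, to_join):
    """Relabel every document assigned to a topic in `to_join`
    with the first topic ID in `to_join`. Hypothetical helper."""
    target = to_join[0]
    merged = set(to_join)
    return [target if label in merged else label for label in labels]


labels = [0, 5, 4, 2, 5, 1, 0]
joined = join_labels(labels, [0, 5, 4])
print(joined)  # [0, 0, 0, 2, 0, 1, 0]
```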
Hierarchical merging
You can also merge clusters automatically into a desired number of topics. This can be done with the reduce_topics() method:
Info
For more info on topic merging methods, check out this page
model = ClusteringTopicModel().fit(corpus)
model.reduce_topics(n_reduce_to=20, reduction_method="smallest")
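One plausible reading of a "smallest"-style reduction is to repeatedly fold the smallest cluster into its most similar neighbour until the desired number of clusters remains. The sketch below illustrates that idea in plain Python. It is an assumption about the general technique, not Turftopic's implementation: `reduce_smallest`, the use of cosine similarity between centroids, and the tie-breaking rule are all choices made for this example, and outlier handling is ignored.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def reduce_smallest(labels, embeddings, n_reduce_to):
    """Merge the smallest cluster into its nearest neighbour (by centroid
    cosine similarity) until only `n_reduce_to` clusters remain."""
    if n_reduce_to < 1:
        raise ValueError("n_reduce_to must be at least 1.")
    labels = list(labels)
    while len(set(labels)) > n_reduce_to:
        sizes = Counter(labels)
        # Smallest cluster first; ties are broken by the lowest ID.
        smallest = min(sizes, key=lambda k: (sizes[k], k))
        centroids = {
            k: centroid([e for e, l in zip(embeddings, labels) if l == k])
            for k in sizes
        }
        nearest = max(
            (k for k in sizes if k != smallest),
            key=lambda k: cosine(centroids[smallest], centroids[k]),
        )
        labels = [nearest if l == smallest else l for l in labels]
    return labels


# Toy 2D "embeddings": cluster 2 has a single document that sits next to cluster 1.
doc_labels = [0, 0, 0, 1, 1, 2]
doc_embeddings = [(1.0, 0.0), (0.9, 0.1), (1.0, 0.1), (0.0, 1.0), (0.1, 0.9), (0.2, 1.0)]
reduced = reduce_smallest(doc_labels, doc_embeddings, n_reduce_to=2)
print(reduced)  # [0, 0, 0, 1, 1, 1]
```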
Finetuning models on a new corpus
Currently, KeyNMF is the only model that can be finetuned on a new corpus.
To do so, call the partial_fit() method on texts the model hasn't seen before:
from turftopic import load_model
model = load_model("pretrained_keynmf_model")
print(type(model))
# turftopic.models.keynmf.KeyNMF
new_corpus: list[str] = [...]
# Finetune the model to the new corpus
model.partial_fit(new_corpus)
model.to_disk("finetuned_model/")
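If the new corpus is too large to pass in a single call, one option is to feed it to the model in batches. The sketch below only shows the batching loop. It assumes that partial_fit() may be called repeatedly on successive batches, which you should verify for your version of the library. `batched` is a helper written for this example, and `RecordingModel` is a hypothetical stand-in that merely records batch sizes so the snippet runs without a pretrained model.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(items: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1.")
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


class RecordingModel:
    """Hypothetical stand-in for a model exposing partial_fit()."""

    def __init__(self) -> None:
        self.batch_sizes: list[int] = []

    def partial_fit(self, texts: list[str]) -> "RecordingModel":
        self.batch_sizes.append(len(texts))
        return self


new_corpus = [f"document {i}" for i in range(5)]
stand_in = RecordingModel()
for batch in batched(new_corpus, batch_size=2):
    stand_in.partial_fit(batch)  # with a real model: model.partial_fit(batch)
print(stand_in.batch_sizes)  # [2, 2, 1]
```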
Re-estimating word importance
Both \(S^3\) and clustering models come with multiple ways of estimating the importance of words for topics. Since both of these models use post-hoc measures, these scores can be calculated without fitting a new model or refitting an old one. This allows you to try different feature importance measures on the same model (same underlying clusters or axes).
Here's an example with \(S^3\):
from turftopic import SemanticSignalSeparation
model = SemanticSignalSeparation(5, feature_importance="combined").fit(corpus)
model.print_topics()
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Topic ID ┃ Highest Ranking ┃ Lowest Ranking ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 0 │ hypocrisy, hypocritical, fallacy, debated, skeptics │ xfree86, emulator, codes, 9600, cd300 │
├──────────┼──────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────┤
│ 1 │ spectrometer, dblspace, statistically, nutritional, makefile │ uh, um, yeah, hm, oh │
├──────────┼──────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────┤
│ 2 │ bullpen, goaltenders, pitchers, goaltender, pitching │ intel, nsa, spying, encrypt, terrorism │
├──────────┼──────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────┤
│ 3 │ espionage, wiretapping, cia, fbi, wiretaps │ agnosticism, agnostic, upgrading, affordable, cheaper │
├──────────┼──────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────┤
│ 4 │ affordable, dealers, warrants, handguns, dealership │ semitic, theologians, judaism, persecuted, pagan │
└──────────┴──────────────────────────────────────────────────────────────┴───────────────────────────────────────────────────────┘
model.estimate_components("angular")
model.print_topics()
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Topic ID ┃ Highest Ranking ┃ Lowest Ranking ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 0 │ hypocritical, debated, hypotheses, misconceptions, fallacy │ diagnostics, win31, modems, cd300, gd3004 │
├──────────┼──────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────┤
│ 1 │ spectrometer, dblspace, statistically, makefile, nutritional │ ye, sub, naked, experiences, uh │
├──────────┼──────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────┤
│ 2 │ bullpen, puckett, hitters, clemens, jenks │ encryption, encrypt, intel, cryptosystem, cryptosystems │
├──────────┼──────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────┤
│ 3 │ journalists, cdc, chlorine, npr, briefing │ values, ratios, upgrading, calculations, inherit │
├──────────┼──────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────┤
│ 4 │ handguns, warrants, warranty, reliability, handgun │ nutritional, metabolism, deuteronomy, pathology, hormone │
└──────────┴──────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────┘
And one with clustering models:
Info
Remember, these are the same underlying clusters, just described in two different ways. For further details, check out this page
from turftopic import ClusteringTopicModel
model = ClusteringTopicModel(n_reduce_to=5, feature_importance="soft-c-tf-idf").fit(corpus)
model.print_topics()
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Topic ID ┃ Highest Ranking ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ -1 │ like, just, don, use, does, know, time, good, people, edu │
├──────────┼────────────────────────────────────────────────────────────────────────────┤
│ 0 │ people, said, god, president, mr, think, going, say, did, myers │
├──────────┼────────────────────────────────────────────────────────────────────────────┤
│ 1 │ max, g9v, b8f, a86, pl, 00, 145, 1d9, dos, 34u │
├──────────┼────────────────────────────────────────────────────────────────────────────┤
│ 2 │ msg, cancer, food, battery, water, candida, medical, vitamin, yeast, diet │
├──────────┼────────────────────────────────────────────────────────────────────────────┤
│ 3 │ 25, 55, pit, det, pts, la, bos, 03, 10, 11 │
├──────────┼────────────────────────────────────────────────────────────────────────────┤
│ 4 │ insurance, car, dog, radar, health, bike, helmet, private, detector, speed │
└──────────┴────────────────────────────────────────────────────────────────────────────┘
model.estimate_components(feature_importance="centroid")
model.print_topics()
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Topic ID ┃ Highest Ranking ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ -1 │ documented, concerns, dubious, obsolete, concern, alternative, et4000, complaints, cx, discussed │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 0 │ persecutions, persecution, condemning, condemnation, fundamentalists, persecuted, fundamentalism, │
│ │ theology, advocating, fundamentalist │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 1 │ xfree86, pcx, emulation, microsoft, hardware, emulator, x11r5, netware, workstations, chipset │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 2 │ contamination, fungal, precautions, harmful, poisoning, chemicals, treatments, toxicity, dangers, │
│ │ prevention │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 3 │ nhl, bullpen, goaltenders, standings, sabres, canucks, braves, mlb, flyers, playoffs │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 4 │ automotive, vehicle, vehicles, speeding, automobile, automobiles, driving, motorcycling, │
│ │ motorcycles, highways │
└──────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────┘
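The idea that the clusters stay fixed while only their descriptions change can be illustrated without any topic modelling library. The toy sketch below keeps the document labels constant and ranks words in two ways: by raw in-cluster count, and by count weighted with the log of inverse cluster frequency. These two scoring rules are stand-ins chosen for this example; they are not the library's "soft-c-tf-idf" or "centroid" measures, and all names in the snippet are hypothetical.

```python
import math
from collections import Counter

docs = [
    "the team won the game",
    "the team lost again",
    "the church held the service",
    "the church was full",
]
labels = [0, 0, 1, 1]  # fixed cluster assignment, never recomputed below


def cluster_word_counts(docs, labels):
    """Count word occurrences separately for each cluster."""
    counts = {}
    for doc, label in zip(docs, labels):
        counts.setdefault(label, Counter()).update(doc.split())
    return counts


def top_words(scores_per_cluster, k=1):
    """Top-k words per cluster; ties are broken alphabetically."""
    return {
        label: [w for w, _ in sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]]
        for label, scores in scores_per_cluster.items()
    }


counts = cluster_word_counts(docs, labels)

# Measure 1: raw count of the word inside the cluster.
raw_scores = {label: dict(c) for label, c in counts.items()}

# Measure 2: count weighted by how few clusters contain the word.
n_clusters = len(counts)
clusters_containing = Counter(word for c in counts.values() for word in c)
weighted_scores = {
    label: {w: n * math.log(n_clusters / clusters_containing[w]) for w, n in c.items()}
    for label, c in counts.items()
}

by_count = top_words(raw_scores)
by_weight = top_words(weighted_scores)
print(by_count)   # {0: ['the'], 1: ['the']}
print(by_weight)  # {0: ['team'], 1: ['church']}
```

The labels are identical in both cases; only the ranking rule differs, which is why the descriptions can be recomputed without refitting.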