Topic discovery by factoring transformer embeddings
Older topic models typically discover topics by factorizing or clustering bag-of-words matrices, using methods such as LDA or NMF.
Info
See Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation in the scikit-learn docs or Topic modelling for short texts for more information on BoW topic modelling.
In contrast, modern topic models like BERTopic cluster embeddings from sentence-transformers. While clustering these embeddings is easy, factorizing them into a nonnegative topic space is not trivial: the embeddings themselves are unbounded and can take on both positive and negative values.
In this tutorial we will look at how you can achieve this using Semi-Nonnegative Matrix Factorization.
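To see concretely why plain NMF will not do here, consider this minimal sketch, which uses random normal data as a stand-in for transformer embeddings; scikit-learn's NMF rejects any input with negative entries:

```python
import numpy as np
from sklearn.decomposition import NMF

# Stand-in for transformer embeddings: unbounded, sign-mixed values
X = np.random.default_rng(0).normal(size=(20, 8))

try:
    NMF(n_components=3).fit(X)
except ValueError as err:
    print(err)  # NMF requires nonnegative input
```

Semi-Nonnegative Matrix Factorization lifts exactly this restriction: the data and one factor may be unbounded, while the other factor is constrained to be nonnegative.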
Tip
This model is called SensTopic and is implemented in the Turftopic Python library with more complete functionality. If you intend to use this model in practice, you should probably use that implementation; both are based on the same SNMF model. This tutorial is here strictly to demonstrate how you can produce nonnegative factors over unbounded data using SNMF.
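To make the mechanics transparent before we use a library implementation, here is a minimal from-scratch sketch of Semi-NMF following the multiplicative updates of Ding, Li and Jordan (2010). It factorizes a sign-mixed matrix X as W @ H, where W is nonnegative and H is unconstrained. The function name and variable names are ours, not part of any library:

```python
import numpy as np

def semi_nmf(X, k, n_iter=300, seed=0):
    """Semi-NMF: X ≈ W @ H, with W >= 0 and H unconstrained.

    X may contain negative values; only W is constrained.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k))      # nonnegative initialization
    pos = lambda A: (np.abs(A) + A) / 2  # elementwise positive part
    neg = lambda A: (np.abs(A) - A) / 2  # elementwise negative part
    for _ in range(n_iter):
        # The unconstrained factor is the exact least-squares solution
        H = np.linalg.lstsq(W, X, rcond=None)[0]
        # Multiplicative update keeps W nonnegative (Ding et al., 2010)
        XHt = X @ H.T
        HHt = H @ H.T
        num = pos(XHt) + W @ neg(HHt)
        den = neg(XHt) + W @ pos(HHt) + 1e-9
        W *= np.sqrt(num / den)
    return W, H
```

Splitting the update into positive and negative parts is what lets the nonnegative factor absorb structure from sign-mixed data: W only ever gets multiplied by nonnegative ratios, so it stays in the positive orthant while H is free to take any sign.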
Data loading
We will use a subset of the 20 Newsgroups dataset from scikit-learn for this tutorial:
from sklearn.datasets import fetch_20newsgroups
corpus = fetch_20newsgroups(
    subset="all",
    remove=("headers", "footers", "quotes"),
    categories=[
        "comp.graphics",
        "comp.os.ms-windows.misc",
        "comp.sys.ibm.pc.hardware",
        "comp.sys.mac.hardware",
        "comp.windows.x",
    ],
).data
Term extraction
To estimate keyword importance for topics, we need to extract all terms in the corpus. We will do this by taking the vocabulary of a fitted CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=10)
vectorizer.fit(corpus)
vocab = vectorizer.get_feature_names_out()
Producing transformer embeddings
from sentence_transformers import SentenceTransformer
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(corpus, show_progress_bar=True)
vocab_embeddings = encoder.encode(vocab, show_progress_bar=True)
Factorizing embeddings with SNMF
We can now factorize the document embeddings to obtain document-topic weights, project the vocabulary embeddings into the same topic space to get topic-word weights, and print the highest-ranking words for each topic:
import numpy as np

from noloox.decomposition import SNMF

model = SNMF(n_components=10, sparsity=1.0)
doc_topic_matrix = model.fit_transform(embeddings)
topic_word_matrix = model.transform(vocab_embeddings).T

for i, comp in enumerate(topic_word_matrix):
    top = np.argsort(-comp)[:10]
    print(f"Topic {i}:", ", ".join(vocab[top]))
| Topic ID | Top Words |
|---|---|
| Topic 0 | modem, modems, connecting, telnet, ports, port, connect, ethernet, connects, connection |
| Topic 1 | processors, processor, cpus, cpu, performance, benchmarks, pentium, cheaper, intel, efficient |
| Topic 2 | vga, monitors, monitor, 640x480, displays, resolution, resolutions, lcd, 1280x1024, screen |
| Topic 3 | uh, um, so, em, yeah, er, ah, but, oh, and |
| Topic 4 | windows, win3, os, microsoft, openwin, win31, openwindows, netware, ms, executables |
| Topic 5 | printing, printers, printer, prints, print, laserwriter, printed, laserjet, ink, deskjet |
| Topic 6 | hdd, disk, harddisk, disks, drives, fdisk, seagate, hd, drive, partition |
| Topic 7 | xcreatewindow, xtwindow, xtrealizewidget, xterminal, xterm, xtpointer, xserver, xservers, xt, xdm |
| Topic 8 | bitmaps, colormaps, bitmap, imagemagick, colormap, animation, animations, imagewriter, xputimage, photoshop |
| Topic 9 | archived, mailed, published, contact, subscription, contacted, archive, publications, incorporated, email |
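Because the document-topic matrix is nonnegative, assigning each document its dominant topic is a simple argmax over rows. Here is a toy sketch with a made-up weight matrix; in the tutorial, the real matrix is the one returned by fit_transform above:

```python
import numpy as np

# Toy document-topic weights for 4 documents over 3 topics
doc_topic_matrix = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.7],
    [0.3, 0.5, 0.1],
    [0.0, 0.0, 0.4],
])

# Each document's dominant topic is the column with the largest weight
dominant_topic = doc_topic_matrix.argmax(axis=1)
print(dominant_topic)  # [0 2 1 2]
```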