noloox is a Python library of reference implementations of useful unsupervised learning algorithms that you probably won't find elsewhere.

What noloox is:

  • A collection of unsupervised machine learning algorithms
  • A scikit-learn compatible library
  • An educational resource containing worked examples and reference implementations

What noloox isn't:

  • The most feature-complete or efficient implementation of these algorithms
  • A replacement for scikit-learn
  • An all-in-one machine learning framework
  • A library for complete Bayesian inference. For that, use a probabilistic programming language (PPL) such as NumPyro, PyMC or Stan.

Basic usage

Install noloox from PyPI:

pip install noloox

Then you can load models from the library and use them the same way you would use scikit-learn.

from noloox.mixture import StudentsTMixture

# X is your data: an array-like of shape (n_samples, n_features), as in scikit-learn
model = StudentsTMixture(n_components=10)
cluster_labels = model.fit_predict(X)

Models

| Model | What do I use it for? | JAX or NumPy? | What algorithm? | Tutorial |
| --- | --- | --- | --- | --- |
| Peax | Cluster 2D data where the number of clusters is unknown. | NumPy | Expectation-Maximization | Finding the number of clusters in the data |
| SNMF | Factor data, where you expect the factors to be non-negative, but the data is unbounded | JAX | Iterative updates | Topic discovery by factoring transformer embeddings |
| WNMF | NMF, but you don't want to weight all observations equally. | NumPy | Iterative updates | - |
| StudentsTMixture/CauchyMixture | Cluster continuous data in a way that is robust to outliers. | JAX | Expectation-Maximization | Outlier-Robust Clustering |
| DirichletMultinomialMixture | Cluster count data/Short-text topic modelling | JAX | Collapsed Gibbs Sampling | Topic modelling for short texts and Clustering Count Data |
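
To make the WNMF row more concrete: one common way to fit an NMF in which observations carry unequal weights is with multiplicative updates. The NumPy sketch below illustrates that idea, assuming non-negative data and non-negative weights. The function name `weighted_nmf` and all of its parameters are hypothetical, and this is not necessarily how noloox implements WNMF.

```python
import numpy as np


def weighted_nmf(X, weights, n_components, n_iter=200, eps=1e-9, seed=0):
    """Illustrative weighted NMF via multiplicative updates (not noloox's code).

    Approximately minimises sum(weights * (X - U @ V) ** 2) subject to
    U >= 0 and V >= 0. Assumes X >= 0 and weights >= 0.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    # Strictly positive initialisation keeps the updates well defined.
    U = rng.uniform(0.1, 1.0, size=(n_samples, n_components))
    V = rng.uniform(0.1, 1.0, size=(n_components, n_features))
    WX = weights * X
    losses = []
    for _ in range(n_iter):
        # Each factor is rescaled elementwise, so it stays non-negative.
        U *= (WX @ V.T) / ((weights * (U @ V)) @ V.T + eps)
        V *= (U.T @ WX) / (U.T @ (weights * (U @ V)) + eps)
        losses.append(float(np.sum(weights * (X - U @ V) ** 2)))
    return U, V, losses


# Usage: a weight of 0 makes the fit ignore that entry (e.g. a missing value).
rng = np.random.default_rng(42)
X_demo = rng.uniform(size=(30, 4)) @ rng.uniform(size=(4, 12))
mask = (rng.uniform(size=X_demo.shape) > 0.2).astype(float)
U_demo, V_demo, loss_curve = weighted_nmf(X_demo, mask, n_components=4)
```

Setting a weight to zero removes that entry from the objective entirely, which is why a 0/1 mask is a convenient way to handle missing values in this kind of sketch.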

Our philosophy and goals

  • Keep implementations simple and minimal, with minimal dependencies.
  • Everything should be implemented in either NumPy or JAX, preferably in JAX wherever possible.
  • Library structure should match sklearn standards, and all algorithms should be drop-in replacements for scikit-learn equivalents.
  • Under these restrictions, algorithms should be as fast as humanly possible
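
As a rough illustration of what "matching sklearn standards" tends to mean in practice, the sketch below shows a toy clusterer that follows the usual scikit-learn conventions: the constructor only stores hyperparameters, `fit` returns `self`, learned attributes end in an underscore, and `fit_predict` is provided. The class `ToyKMeans` is hypothetical and not part of noloox; a real estimator would typically also subclass `sklearn.base.BaseEstimator` and `sklearn.base.ClusterMixin`, which is left out here to keep the example dependent on NumPy alone.

```python
import numpy as np


class ToyKMeans:
    """Hypothetical estimator following scikit-learn conventions (not in noloox)."""

    def __init__(self, n_components=2, max_iter=50, random_state=0):
        # The constructor only stores hyperparameters; no computation happens here.
        self.n_components = n_components
        self.max_iter = max_iter
        self.random_state = random_state

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(self.random_state)
        # Initialise the centres from randomly chosen data points.
        idx = rng.choice(X.shape[0], size=self.n_components, replace=False)
        centers = X[idx].copy()
        for _ in range(self.max_iter):
            labels = self._assign(X, centers)
            for k in range(self.n_components):
                members = X[labels == k]
                if len(members) > 0:  # keep the old centre if a cluster is empty
                    centers[k] = members.mean(axis=0)
        # Learned state is stored in attributes with a trailing underscore.
        self.cluster_centers_ = centers
        self.labels_ = self._assign(X, centers)
        return self  # returning self allows chained calls

    def predict(self, X):
        return self._assign(np.asarray(X, dtype=float), self.cluster_centers_)

    def fit_predict(self, X, y=None):
        return self.fit(X).labels_

    @staticmethod
    def _assign(X, centers):
        # Squared Euclidean distance from every point to every centre.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        return d2.argmin(axis=1)
```

Because the interface mirrors scikit-learn, code written against `fit`, `predict` and `fit_predict` should not need to change when one estimator is swapped for another.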

The wishlist:

There are a number of algorithms that would be nice to implement in the library. Contributions are very welcome.

  • ProdLDA, and amortized ProdLDA (CTMs) (without Flax)
  • Parametric-TSNE, possibly also Multi-scale Parametric-TSNE
  • DiRE
  • Infinite NMF
  • Latent Dirichlet Allocation with Gibbs Sampling
  • Gaussian LDA