noloox is a Python library of reference implementations of useful unsupervised learning algorithms that you probably won't find elsewhere.
What noloox is:
- A collection of unsupervised machine learning algorithms
- A scikit-learn compatible library
- An educational resource containing worked examples and reference implementations
What noloox isn't:
- The most feature-complete or efficient implementation of these algorithms
- A replacement for scikit-learn
- An all-in-one machine learning framework
- A library for full Bayesian inference. For that, use a probabilistic programming language (PPL) such as NumPyro, PyMC, or Stan.
Basic usage
Install noloox from PyPI:
pip install noloox
Then you can load models from the library and use them the same way you would use scikit-learn.
from noloox.mixture import StudentsTMixture
# X is your data: an array of shape (n_samples, n_features)
model = StudentsTMixture(n_components=10)
cluster_labels = model.fit_predict(X)
Models
| Model | What do I use it for? | JAX or NumPy? | What algorithm? | Tutorial |
|---|---|---|---|---|
| Peax | Cluster 2D data where the number of clusters is unknown. | NumPy | Expectation-Maximization | Finding the number of clusters in the data |
| SNMF | Factorize data when you expect the factors to be non-negative but the data itself is unbounded. | JAX | Iterative updates | Topic discovery by factoring transformer embeddings |
| WNMF | NMF when observations should not all be weighted equally. | NumPy | Iterative updates | - |
| StudentsTMixture/CauchyMixture | Cluster continuous data in a way that is robust to outliers. | JAX | Expectation-Maximization | Outlier-Robust Clustering |
| DirichletMultinomialMixture | Cluster count data; short-text topic modelling. | JAX | Collapsed Gibbs Sampling | Topic modelling for short texts and Clustering Count Data |
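
To give some intuition for why a Student's t model is described above as robust to outliers, here is a minimal sketch in plain Python. It is not noloox's implementation: the function `t_location` and its parameters are hypothetical, and it only estimates the location of a single 1D Student's t distribution with fixed degrees of freedom and scale, using the standard EM weighting in which points far from the current estimate receive small weights.

```python
from statistics import fmean


def t_location(xs, nu=3.0, scale=1.0, n_iter=50):
    """EM estimate of the location of a 1D Student's t distribution.

    Hypothetical helper for illustration only; nu (degrees of freedom)
    and scale are held fixed rather than estimated.
    """
    mu = fmean(xs)  # start from the ordinary mean
    for _ in range(n_iter):
        # E-step: each point gets a weight that shrinks as its squared
        # standardized distance from mu grows, so outliers count for little.
        weights = [(nu + 1.0) / (nu + ((x - mu) / scale) ** 2) for x in xs]
        # M-step: the new location is the weighted mean of the data.
        mu = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
    return mu


# Five well-behaved points around zero plus one extreme outlier.
data = [-1.0, -0.5, 0.0, 0.5, 1.0, 100.0]

plain_mean = fmean(data)            # dragged far towards the outlier
robust_location = t_location(data)  # stays close to the bulk of the data

print(f"plain mean:      {plain_mean:.3f}")
print(f"robust location: {robust_location:.3f}")
```

A Gaussian model corresponds to giving every point the same weight, which is why a single extreme value can pull its mean so far. A mixture model applies the same kind of down-weighting within each component.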
Our philosophy and goals
- Keep implementations simple and minimal, with minimal dependencies.
- Everything should be implemented in either NumPy or JAX, preferably in JAX wherever possible.
- Library structure should match scikit-learn standards, and all algorithms should be drop-in replacements for scikit-learn equivalents.
- Within these constraints, algorithms should be as fast as possible.
The noloox wishlist:
There are a number of algorithms that would be nice to implement in the library. Contributions are very welcome.
- ProdLDA, and amortized ProdLDA (CTMs) (without Flax)
- Parametric-TSNE, possibly also Multi-scale Parametric-TSNE
- DiRE
- Infinite NMF
- Latent Dirichlet Allocation with Gibbs Sampling
- Gaussian LDA