DirichletMultinomialMixture

noloox.mixture.DirichletMultinomialMixture

Bases: BaseEstimator, ClusterMixin, DensityMixin

Implementation of the Dirichlet Multinomial Mixture model, fitted with a Gibbs sampling solver.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| n_components | int | Number of mixture components in the model. | required |
| n_iter | int | Number of Gibbs sampling iterations during fitting. If the results are unsatisfactory, increase this number. | 50 |
| alpha | float | Willingness of a document to join an empty cluster. | 0.1 |
| beta | float | Willingness of a document to join a cluster that does not yet contain the document's terms. | 0.1 |
| random_state | Optional[int] | Random seed to use for reproducibility. | None |
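
To make the roles of alpha and beta concrete, the sketch below evaluates the cluster-assignment probability described in Yin and Wang (2014) in plain Python. It is an illustration under stated assumptions, not the library's implementation: the function name `log_cond_prob_sketch`, its argument layout, and the toy counts are all hypothetical.

```python
import math


def log_cond_prob_sketch(m_z, n_zw, n_z, doc, n_docs, alpha, beta):
    """Unnormalized log-probability of `doc` joining each cluster.

    Hypothetical helper following the sampling formula in Yin and Wang (2014).
    m_z[z]: number of documents in cluster z (excluding `doc`).
    n_zw[z][w]: count of word w in cluster z; n_z[z]: total words in cluster z.
    doc[w]: count of word w in the document; n_docs: corpus size.
    """
    n_clusters, n_vocab, doc_len = len(m_z), len(doc), sum(doc)
    log_probs = []
    for z in range(n_clusters):
        # Cluster popularity term: alpha keeps empty clusters reachable.
        lp = math.log(m_z[z] + alpha) - math.log(n_docs - 1 + n_clusters * alpha)
        # Word similarity term: beta smooths words the cluster has not seen.
        for w, count in enumerate(doc):
            for j in range(count):
                lp += math.log(n_zw[z][w] + beta + j)
        for i in range(doc_len):
            lp -= math.log(n_z[z] + n_vocab * beta + i)
        log_probs.append(lp)
    return log_probs


# Toy state: cluster 0 holds three documents, cluster 1 is empty.
doc = [2, 0]
for alpha in (0.1, 1.0):
    lp = log_cond_prob_sketch([3, 0], [[5, 1], [0, 0]], [6, 0], doc, 4, alpha, 0.1)
    # Log-odds of the empty cluster against the occupied one.
    print(f"alpha={alpha}: log-odds(empty vs. occupied) = {lp[1] - lp[0]:.3f}")
```

In this toy setting a larger alpha raises the log-odds of the empty cluster, and a larger beta does the same for a cluster that lacks the document's words, which matches the parameter descriptions above.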

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| components_ | array of shape (n_components, n_vocab) | Word counts per component: the number of times each word has been assigned to each component during fitting. |
| n_features_in_ | int | Number of vocabulary items seen during fitting. |
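
Since `components_` holds per-component word counts, a common way to inspect a fitted model is to list the highest-count words of each component. The sketch below does this in plain Python; `top_words`, the vocabulary, and the count matrix are hypothetical stand-ins, not values produced by the library.

```python
def top_words(components, vocab, n_top=3):
    """Returns the n_top highest-count words for every component."""
    result = []
    for counts in components:
        # Sort word indices by descending count for this component.
        order = sorted(range(len(counts)), key=lambda w: counts[w], reverse=True)
        result.append([vocab[w] for w in order[:n_top]])
    return result


# Hypothetical counts standing in for a fitted model's components_.
vocab = ["goal", "match", "vote", "party"]
components = [[9, 7, 0, 1], [0, 1, 8, 6]]
print(top_words(components, vocab, n_top=2))
```

With a real model, `components` would be the `components_` array and `vocab` the feature names of whatever vectorizer produced the input matrix.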

Source code in noloox/mixture/dmm.py
class DirichletMultinomialMixture(BaseEstimator, ClusterMixin, DensityMixin):
    """Dirichlet Multinomial Mixture model fitted with a Gibbs sampling solver.

    Parameters
    ----------
    n_components: int
        Number of mixture components in the model.
    n_iter: int, default 50
        Number of Gibbs sampling iterations during fitting.
        If the results are unsatisfactory, increase this number.
    alpha: float, default 0.1
        Willingness of a document to join an empty cluster.
    beta: float, default 0.1
        Willingness of a document to join a cluster that does not yet
        contain the document's terms.
    random_state: Optional[int], default None
        Random seed to use for reproducibility.

    Attributes
    ----------
    components_: array of shape (n_components, n_vocab)
        Word counts per component: the number of times each word has
        been assigned to each component during fitting.
    n_features_in_: int
        Number of vocabulary items seen during fitting.
    """

    def __init__(
        self,
        n_components: int,
        n_iter: int = 50,
        alpha: float = 0.1,
        beta: float = 0.1,
        random_state: Optional[int] = None,
    ):
        super().__init__()
        self.n_components = n_components
        self.n_iter = n_iter
        self.alpha = alpha
        self.beta = beta
        self.random_state = random_state

    def get_params(self, deep: bool = False) -> dict:
        """Get parameters for this estimator.

        Parameters
        ----------
        deep: bool, default False
            Ignored, exists for sklearn compatibility.

        Returns
        -------
        dict
            Parameter names mapped to their values.

        Note
        ----
        Exists for sklearn compatibility.
        """
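        # Include every constructor argument so the estimator can be rebuilt.
        return {
            "n_components": self.n_components,
            "n_iter": self.n_iter,
            "alpha": self.alpha,
            "beta": self.beta,
            "random_state": self.random_state,
        }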

    def fit_predict(self, X, y=None):
        """Fits the model using Gibbs sampling and returns the cluster labels.
        The algorithm is described in detail in Yin and Wang (2014).

        Parameters
        ----------
        X: array-like of shape (n_samples, n_features)
            Bag-of-words (document-term) matrix of the corpus.
        y: None
            Ignored, exists for sklearn compatibility.

        Returns
        -------
        array of shape (n_samples,)
            Cluster label for each document.

        """
        if issparse(X):
            warnings.warn(
                "Sparse arrays are not yet supported. Implicitly converting to dense array."
            )
            X = np.asarray(X.todense())
        if self.random_state is not None:
            random_key = jax.random.key(self.random_state)
        else:
            random_key = jax.random.key(random.randint(0, 1000))
        random_key, self.components_, self.labels_, self.m_z, self.n_z = fit_model(
            random_key,
            self.n_components,
            self.n_iter,
            self.alpha,
            self.beta,
            X,
        )
        self.weights_ = np.asarray(self.m_z) / np.sum(self.m_z)
        self.components_ = np.asarray(self.components_)
        D, V = X.shape
        self._predict_proba = jax.vmap(
            lambda x: softmax(
                log_cond_prob(
                    self.m_z,
                    self.components_,
                    self.n_z,
                    x,
                    D,
                    self.n_components,
                    V,
                    self.alpha,
                    self.beta,
                )
            ),
        )

        return self.labels_

    def predict_proba(self, X) -> np.ndarray:
        """Predicts probabilities for each document belonging to each
        component.

        Parameters
        ----------
        X: array-like of shape (n_samples, n_features)
            Document-term matrix.

        Returns
        -------
        array of shape (n_samples, n_components)
            Probabilities for each document belonging to each cluster.

        Raises
        ------
        NotFittedError
            If the model has not been fitted yet.
        """
        if not hasattr(self, "_predict_proba"):
            raise NotFittedError("Model not fitted yet, can't predict probabilities.")
        if issparse(X):
            warnings.warn(
                "Sparse arrays are not yet supported. Implicitly converting to dense array."
            )
            X = np.asarray(X.todense())
        p = self._predict_proba(X)
        return np.asarray(p)

    def fit(self, X, y=None):
        """Fits the model and returns the fitted estimator itself."""
        self.fit_predict(X, y)
        return self

    def transform(self, X) -> np.ndarray:
        """Alias for predict_proba()."""
        return self.predict_proba(X)

    def predict(self, X) -> np.ndarray:
        """Predicts cluster labels for a set of documents. Mainly exists for
        compatibility with density estimators in sklearn.

        Parameters
        ----------
        X: array-like of shape (n_samples, n_features)
            Document-term matrix.

        Returns
        -------
        array of shape (n_samples,)
            Cluster label for each document.

        Raises
        ------
        NotFittedError
            If the model has not been fitted yet.
        """
        return np.argmax(self.predict_proba(X), axis=1)

    def fit_transform(
        self,
        X,
        y=None,
    ) -> np.ndarray:
        """Fits the model, then transforms the given data.

        Parameters
        ----------
        X: array-like of shape (n_samples, n_features)
            Document-term matrix.
        y: None
            Ignored, exists for sklearn compatibility.

        Returns
        -------
        array of shape (n_samples, n_components)
            Probabilities for each document belonging to each cluster.
        """
        return self.fit(X).transform(X)

fit_predict(X, y=None)

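Fits the model using Gibbs sampling and returns the cluster label of each document. The algorithm is described in detail in Yin and Wang (2014).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| X | array-like of shape (n_samples, n_features) | Bag-of-words (document-term) matrix of the corpus. | required |
| y | None | Ignored, exists for sklearn compatibility. | None |

Returns:

| Type | Description |
| --- | --- |
| array of shape (n_samples,) | Cluster label for each document. |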
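
The input `X` has one row per document and one column per vocabulary item, and the source below converts sparse input to a dense array with a warning. As a minimal sketch of what such a matrix looks like, the plain-Python helper below builds one from tokenized documents; `bow_matrix` and the sample documents are hypothetical and only illustrate the expected layout.

```python
def bow_matrix(tokenized_docs):
    """Builds a dense document-term count matrix and its vocabulary."""
    # Sorted vocabulary gives every token a stable column index.
    vocab = sorted({token for doc in tokenized_docs for token in doc})
    index = {token: col for col, token in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in tokenized_docs]
    for row, doc in enumerate(tokenized_docs):
        for token in doc:
            matrix[row][index[token]] += 1
    return matrix, vocab


docs = [["cheap", "flights", "cheap"], ["flights", "delayed"]]
X, vocab = bow_matrix(docs)
print(vocab)
print(X)
```

In practice a vectorizer such as scikit-learn's `CountVectorizer` would typically produce this matrix; its output is sparse, so converting it with `.toarray()` before fitting avoids the implicit-conversion warning.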

Source code in noloox/mixture/dmm.py
def fit_predict(self, X, y=None):
    """Fits the model using Gibbs sampling and returns the cluster labels.
    The algorithm is described in detail in Yin and Wang (2014).

    Parameters
    ----------
    X: array-like of shape (n_samples, n_features)
        Bag-of-words (document-term) matrix of the corpus.
    y: None
        Ignored, exists for sklearn compatibility.

    Returns
    -------
    array of shape (n_samples,)
        Cluster label for each document.

    """
    if issparse(X):
        warnings.warn(
            "Sparse arrays are not yet supported. Implicitly converting to dense array."
        )
        X = np.asarray(X.todense())
    if self.random_state is not None:
        random_key = jax.random.key(self.random_state)
    else:
        random_key = jax.random.key(random.randint(0, 1000))
    random_key, self.components_, self.labels_, self.m_z, self.n_z = fit_model(
        random_key,
        self.n_components,
        self.n_iter,
        self.alpha,
        self.beta,
        X,
    )
    self.weights_ = np.asarray(self.m_z) / np.sum(self.m_z)
    self.components_ = np.asarray(self.components_)
    D, V = X.shape
    self._predict_proba = jax.vmap(
        lambda x: softmax(
            log_cond_prob(
                self.m_z,
                self.components_,
                self.n_z,
                x,
                D,
                self.n_components,
                V,
                self.alpha,
                self.beta,
            )
        ),
    )

    return self.labels_

fit_transform(X, y=None)

Fits the model, then transforms the given data.

Parameters:

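| Name | Type | Description | Default |
| --- | --- | --- | --- |
| X | array-like of shape (n_samples, n_features) | Document-term matrix. | required |
| y | None | Ignored, exists for sklearn compatibility. | None |

Returns:

| Type | Description |
| --- | --- |
| array of shape (n_samples, n_components) | Probabilities for each document belonging to each cluster. |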

Source code in noloox/mixture/dmm.py
def fit_transform(
    self,
    X,
    y=None,
) -> np.ndarray:
    """Fits the model, then transforms the given data.

    Parameters
    ----------
    X: array-like of shape (n_samples, n_features)
        Document-term matrix.
    y: None
        Ignored, exists for sklearn compatibility.

    Returns
    -------
    array of shape (n_samples, n_components)
        Probabilities for each document belonging to each cluster.
    """
    return self.fit(X).transform(X)

get_params(deep=False)

Get parameters for this estimator.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| deep | bool | Ignored, exists for sklearn compatibility. | False |

Returns:

| Type | Description |
| --- | --- |
| dict | Parameter names mapped to their values. |

Note

Exists for sklearn compatibility.
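
scikit-learn utilities such as `sklearn.base.clone` rebuild an estimator from the dictionary returned by `get_params`, so that dictionary should map onto the constructor arguments. The sketch below shows the round trip with a toy class; `ToyEstimator` and `rebuild` are hypothetical stand-ins, not part of the library.

```python
class ToyEstimator:
    """Hypothetical stand-in for an estimator, not the library's class."""

    def __init__(self, n_components, n_iter=50):
        self.n_components = n_components
        self.n_iter = n_iter

    def get_params(self, deep=False):
        # Keys mirror the constructor argument names exactly.
        return {"n_components": self.n_components, "n_iter": self.n_iter}


def rebuild(estimator):
    """Creates an unfitted copy from the constructor parameters alone."""
    return type(estimator)(**estimator.get_params())


original = ToyEstimator(n_components=10, n_iter=200)
copy = rebuild(original)
print(copy.get_params())
```

Any constructor argument missing from the dictionary would silently fall back to its default in the rebuilt copy.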

Source code in noloox/mixture/dmm.py
def get_params(self, deep: bool = False) -> dict:
    """Get parameters for this estimator.

    Parameters
    ----------
    deep: bool, default False
        Ignored, exists for sklearn compatibility.

    Returns
    -------
    dict
        Parameter names mapped to their values.

    Note
    ----
    Exists for sklearn compatibility.
    """
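    # Include every constructor argument so the estimator can be rebuilt.
    return {
        "n_components": self.n_components,
        "n_iter": self.n_iter,
        "alpha": self.alpha,
        "beta": self.beta,
        "random_state": self.random_state,
    }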

predict(X)

Predicts cluster labels for a set of documents. Mainly exists for compatibility with density estimators in sklearn.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| X | array-like of shape (n_samples, n_features) | Document-term matrix. | required |

Returns:

| Type | Description |
| --- | --- |
| array of shape (n_samples,) | Cluster label for each document. |

Raises:

| Type | Description |
| --- | --- |
| NotFittedError | If the model has not been fitted yet. |
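
The source below reduces the probability matrix to labels with an argmax over the component axis. A minimal plain-Python sketch of that step, using a hypothetical helper `hard_labels` and made-up probabilities:

```python
def hard_labels(probabilities):
    """Picks the most probable component per document (argmax over axis 1)."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]


# Made-up membership probabilities for two documents and three components.
probs = [[0.7, 0.2, 0.1], [0.05, 0.15, 0.8]]
print(hard_labels(probs))
```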

Source code in noloox/mixture/dmm.py
def predict(self, X) -> np.ndarray:
    """Predicts cluster labels for a set of documents. Mainly exists for
    compatibility with density estimators in sklearn.

    Parameters
    ----------
    X: array-like of shape (n_samples, n_features)
        Document-term matrix.

    Returns
    -------
    array of shape (n_samples,)
        Cluster label for each document.

    Raises
    ------
    NotFittedError
        If the model has not been fitted yet.
    """
    return np.argmax(self.predict_proba(X), axis=1)

predict_proba(X)

Predicts probabilities for each document belonging to each component.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| X | array-like of shape (n_samples, n_features) | Document-term matrix. | required |

Returns:

| Type | Description |
| --- | --- |
| array of shape (n_samples, n_components) | Probabilities for each document belonging to each cluster. |

Raises:

| Type | Description |
| --- | --- |
| NotFittedError | If the model has not been fitted yet. |
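
According to the class source, the probabilities come from applying a softmax to per-component log-probabilities for each document. The sketch below shows why the usual max-subtraction matters for that step, assuming log-probabilities that are very negative, as is plausible for longer documents; the `softmax` function here is a plain-Python stand-in, not the one the library imports.

```python
import math


def softmax(log_probs):
    """Turns log-probabilities into probabilities that sum to 1."""
    # Subtracting the maximum avoids underflow: exp(-1000) alone is 0.0.
    top = max(log_probs)
    exps = [math.exp(lp - top) for lp in log_probs]
    total = sum(exps)
    return [e / total for e in exps]


row = softmax([-1000.0, -1001.0, -1005.0])
print([round(p, 3) for p in row])
```

Without the shift, every exponential would underflow to zero and the normalization would divide by zero.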

Source code in noloox/mixture/dmm.py
def predict_proba(self, X) -> np.ndarray:
    """Predicts probabilities for each document belonging to each
    component.

    Parameters
    ----------
    X: array-like of shape (n_samples, n_features)
        Document-term matrix.

    Returns
    -------
    array of shape (n_samples, n_components)
        Probabilities for each document belonging to each cluster.

    Raises
    ------
    NotFittedError
        If the model has not been fitted yet.
    """
    if not hasattr(self, "_predict_proba"):
        raise NotFittedError("Model not fitted yet, can't predict probabilities.")
    if issparse(X):
        warnings.warn(
            "Sparse arrays are not yet supported. Implicitly converting to dense array."
        )
        X = np.asarray(X.todense())
    p = self._predict_proba(X)
    return np.asarray(p)

transform(X)

Alias for predict_proba().

Source code in noloox/mixture/dmm.py
def transform(self, X) -> np.ndarray:
    """Alias for predict_proba()."""
    return self.predict_proba(X)