StudentsTMixture

noloox.mixture.StudentsTMixture

Bases: BaseEstimator, ClusterMixin, DensityMixin

Student's t mixture model. Estimates a mixture of multivariate t-distributions over your data via the EM algorithm.

Parameters:

n_components : int, required
    The number of mixture components.
tol : float, default=1e-05
    The convergence threshold. EM iterations stop when the average gain in the lower bound falls below this threshold.
reg_covar : float, default=1e-06
    Non-negative regularization added to the diagonal of the covariance matrices, ensuring they stay positive definite.
max_iter : int, default=1000
    The maximum number of EM iterations to perform.
df : float, default=4.0
    Degrees of freedom for the t-distributions.
random_state : int, default=None
    Random state for reproducibility.

Attributes:

weights_ : array-like of shape (n_components,)
    The weight of each mixture component.
means_ : array-like of shape (n_components, n_features)
    The mean of each mixture component.
n_iter_ : int
    Number of EM steps used by the best fit to reach convergence.
converged_ : bool
    True if the best fit of EM converged, False otherwise.
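
A minimal usage sketch, assuming noloox is installed and importable from the module path above; the data here is synthetic and the hyperparameters are illustrative:

# Minimal usage sketch on synthetic two-blob data.
import numpy as np
from noloox.mixture import StudentsTMixture

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=-3.0, scale=1.0, size=(200, 2)),   # blob 1
    rng.normal(loc=+3.0, scale=1.0, size=(200, 2)),   # blob 2
])

model = StudentsTMixture(n_components=2, df=4.0, random_state=0)
labels = model.fit_predict(X)

print(model.converged_, model.n_iter_)  # convergence diagnostics
print(model.weights_)                   # mixing proportions (sum to 1)
print(model.means_)                     # (n_components, n_features) locations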

Source code in noloox/mixture/tmm.py
class StudentsTMixture(BaseEstimator, ClusterMixin, DensityMixin):
    """Student's T Mixture Model.
    This class allows you to estimate a mixture of multivariate t-distributions over your data.

    Parameters
    ----------
    n_components: int
        The number of mixture components.
    tol: float, default=1e-5
        The convergence threshold. EM iterations stop when the average gain in the lower bound falls below this threshold.
    reg_covar: float, default=1e-6
        Non-negative regularization added to the diagonal of the covariance matrices, ensuring they stay positive definite.
    max_iter: int, default=1000
        The maximum number of EM iterations to perform.
    df: float, default=4.0
        Degrees of freedom for the t-distributions.
    random_state: int, default=None
        Random state for reproducibility.

    Attributes
    ----------
    weights_: array-like of shape (n_components,)
        The weight of each mixture component.
    means_: array-like of shape (n_components, n_features)
        The mean of each mixture component.
    n_iter_: int
        Number of EM steps used by the best fit to reach convergence.
    converged_: bool
        True if the best fit of EM converged, False otherwise.
    """

    def __init__(
        self,
        n_components: int,
        tol: float = 1e-5,
        reg_covar: float = 1e-06,
        max_iter: int = 1000,
        df=4.0,
        random_state=None,
    ):
        super().__init__()
        self.start_df_ = float(df)
        self.n_components = n_components
        self.tol = tol
        self.df = df
        self.reg_covar = reg_covar
        self.max_iter = max_iter
        self.random_state = random_state

    def _init_params(self, X):
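        # Locations start at the k-means centroids, weights start uniform,
        # and every component shares the (diagonally regularized) empirical
        # covariance of the full dataset as its initial scale matrix.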
        loc_ = (
            KMeans(self.n_components, random_state=self.random_state)
            .fit(X)
            .cluster_centers_
        )
        mix_weights_ = np.ones(self.n_components) / self.n_components
        default_scale_matrix = np.cov(X, rowvar=False)
        default_scale_matrix.flat[:: X.shape[1] + 1] += self.reg_covar
        if len(default_scale_matrix.shape) < 2:
            default_scale_matrix = default_scale_matrix.reshape(-1, 1)
        scale_ = np.stack(
            [default_scale_matrix for i in range(self.n_components)], axis=-1
        )
        scale_cholesky_ = [
            np.linalg.cholesky(scale_[:, :, i]) for i in range(self.n_components)
        ]
        scale_cholesky_ = np.stack(scale_cholesky_, axis=-1)
        return loc_, scale_, mix_weights_, scale_cholesky_

    @staticmethod
    def e_step(X, df_, loc_, scale_cholesky_, mix_weights_, sq_maha_dist):
        return _e_step(X, df_, loc_, scale_cholesky_, mix_weights_, sq_maha_dist)

    @staticmethod
    def m_step(X, resp, E_gamma, scale_, scale_cholesky_, df_, reg_covar):
        return _m_step(X, resp, E_gamma, scale_, scale_cholesky_, df_, reg_covar)

    def fit(self, X, y=None):
        """Estimate model parameters with the EM algorithm.

        Parameters
        ----------
        X: array-like of shape (n_samples, n_features)
            List of n_features-dimensional data points. Each row corresponds to a single data point.

        y: Ignored
            Not used, present for API consistency by convention.

        Returns
        -------
        self: StudentsTMixture
            The fitted mixture.
        """
        self.df_ = np.full((self.n_components), self.df, dtype=np.float64)
        loc_, scale_, mix_weights_, scale_cholesky_ = self._init_params(X)
        lower_bound = -np.inf
        sq_maha_dist = np.empty((X.shape[0], self.n_components))
        self.converged_ = True

        @jax.jit
        def step(loc_, mix_weights_, scale_cholesky_, scale_, sq_maha_dist):
            resp, E_gamma, current_bound = self.e_step(
                X, self.df_, loc_, scale_cholesky_, mix_weights_, sq_maha_dist
            )
            mix_weights_, loc_, scale_, scale_cholesky_ = self.m_step(
                X,
                resp,
                E_gamma,
                scale_,
                scale_cholesky_,
                self.df_,
                self.reg_covar,
            )
            return (
                loc_,
                mix_weights_,
                scale_cholesky_,
                scale_,
                sq_maha_dist,
                current_bound,
            )

        for n_iter in trange(self.max_iter, desc="Running EM"):
            loc_, mix_weights_, scale_cholesky_, scale_, sq_maha_dist, current_bound = (
                step(
                    loc_,
                    mix_weights_,
                    scale_cholesky_,
                    scale_,
                    sq_maha_dist,
                )
            )
            change = current_bound - lower_bound
            if abs(change) < self.tol:
                break
            lower_bound = current_bound
        else:
            self.converged_ = False
            warnings.warn(
                "tDistributionMixture did not converge, consider increasing the number of iterations."
            )
        self.n_iter_ = n_iter
        self.weights_ = mix_weights_
        self.means_ = loc_
        self.scale_cholesky_ = scale_cholesky_
        self.scale_ = scale_
        return self

    def predict_proba(self, X):
        """Evaluate the components' density for each sample.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            List of n_features-dimensional data points. Each row
            corresponds to a single data point.

        Returns
        -------
        resp : array, shape (n_samples, n_components)
            Posterior probability of each Student's t component for each sample in X.
        """
        sq_maha_dist = sq_maha_distance(X, self.means_, self.scale_cholesky_)
        loglik = get_loglikelihood(
            X, sq_maha_dist, self.df_, self.scale_cholesky_, self.weights_
        )
        weighted_loglik = loglik + np.log(self.weights_)[np.newaxis, :]
        with np.errstate(under="ignore"):
            loglik = weighted_loglik - logsumexp(weighted_loglik, axis=1)[:, np.newaxis]
        return np.exp(loglik)

    def predict(self, X):
        """Predict the labels for the data samples in X using trained model.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            List of n_features-dimensional data points. Each row
            corresponds to a single data point.

        Returns
        -------
        labels : array, shape (n_samples,)
            Component labels.
        """
        return np.argmax(self.predict_proba(X), axis=1)

    def fit_predict(self, X, y=None):
        """Estimate model parameters using X and predict the labels for X.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            List of n_features-dimensional data points. Each row
            corresponds to a single data point.

        y : Ignored
            Not used, present for API consistency by convention.

        Returns
        -------
        labels : array, shape (n_samples,)
            Component labels.
        """
        return self.fit(X).predict(X)

    def aic(self, X):
        """Akaike information criterion for the current model on the input X.

        Parameters
        ----------
        X: array of shape (n_samples, n_dimensions)
            The input samples.

        Returns
        -------
        aic: float
            The lower the better.
        """
        self.check_model()
        x = self.check_inputs(X)
        n_params = self.get_num_parameters()
        score = self.score(x)
        return 2 * n_params - 2 * score * x.shape[0]

    def bic(self, X):
        """Bayesian information criterion for the current model on the input X.

        Parameters
        ----------
        X: array of shape (n_samples, n_dimensions)
            The input samples.

        Returns
        -------
        bic: float
            The lower the better.
        """
        self.check_model()
        x = self.check_inputs(X)
        score = self.score(x)
        n_params = self.get_num_parameters()
        return n_params * np.log(x.shape[0]) - 2 * score * x.shape[0]

    def score_samples(self, X):
        """Compute the log-likelihood of each sample.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            List of n_features-dimensional data points. Each row
            corresponds to a single data point.

        Returns
        -------
        log_prob : array, shape (n_samples,)
            Log-likelihood of each sample in `X` under the current model.
        """
        sq_maha_dist = sq_maha_distance(X, self.means_, self.scale_cholesky_)
        loglik = get_loglikelihood(
            X, sq_maha_dist, self.df_, self.scale_cholesky_, self.weights_
        )
        weighted_loglik = loglik + np.log(self.weights_)[np.newaxis, :]
        return logsumexp(weighted_loglik, axis=1)

    def score(self, X, y=None):
        """Compute the per-sample average log-likelihood of the given data X.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_dimensions)
            List of n_features-dimensional data points. Each row
            corresponds to a single data point.

        y : Ignored
            Not used, present for API consistency by convention.

        Returns
        -------
        log_likelihood : float
            Per-sample average log-likelihood of `X` under the Student's t mixture model.
        """
        return np.mean(self.score_samples(X))

aic(X)

Akaike information criterion for the current model on the input X.

Parameters:

X : array of shape (n_samples, n_dimensions), required
    The input samples.

Returns:

aic : float
    The lower the better.

Source code in noloox/mixture/tmm.py
def aic(self, X):
    """Akaike information criterion for the current model on the input X.

    Parameters
    ----------
    X: array of shape (n_samples, n_dimensions)
        The input samples.

    Returns
    -------
    aic: float
        The lower the better.
    """
    self.check_model()
    x = self.check_inputs(X)
    n_params = self.get_num_parameters()
    score = self.score(x)
    return 2 * n_params - 2 * score * x.shape[0]

bic(X)

Bayesian information criterion for the current model on the input X.

Parameters:

X : array of shape (n_samples, n_dimensions), required
    The input samples.

Returns:

bic : float
    The lower the better.

Source code in noloox/mixture/tmm.py
def bic(self, X):
    """Bayesian information criterion for the current model on the input X.

    Parameters
    ----------
    X: array of shape (n_samples, n_dimensions)
        The input samples.

    Returns
    -------
    bic: float
        The lower the better.
    """
    self.check_model()
    x = self.check_inputs(X)
    score = self.score(x, perform_model_checks=False)
    n_params = self.get_num_parameters()
    return n_params * np.log(x.shape[0]) - 2 * score * x.shape[0]
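
A hedged sketch of order selection with these criteria, shown for BIC (AIC is analogous); select_by_bic is a hypothetical helper, not part of noloox, and X stands for any (n_samples, n_features) array:

# Hypothetical helper: pick n_components by BIC (lower is better).
from noloox.mixture import StudentsTMixture

def select_by_bic(X, candidates=(1, 2, 3, 4, 5), random_state=0):
    scores = {}
    for k in candidates:
        model = StudentsTMixture(n_components=k, random_state=random_state).fit(X)
        scores[k] = model.bic(X)
    # Return the candidate with the smallest BIC, plus all scores.
    return min(scores, key=scores.get), scores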

fit(X, y=None)

Estimate model parameters with the EM algorithm.

Parameters:

X : array-like of shape (n_samples, n_features), required
    List of n_features-dimensional data points. Each row corresponds to a single data point.
y : Ignored, default=None
    Not used, present for API consistency by convention.

Returns:

self : StudentsTMixture
    The fitted mixture.

Source code in noloox/mixture/tmm.py
def fit(self, X, y=None):
    """Estimate model parameters with the EM algorithm.

    Parameters
    ----------
    X: array-like of shape (n_samples, n_features)
        List of n_features-dimensional data points. Each row corresponds to a single data point.

    y: Ignored
        Not used, present for API consistency by convention.

    Returns
    -------
    self: StudentsTMixture
        The fitted mixture.
    """
    self.df_ = np.full((self.n_components), self.df, dtype=np.float64)
    loc_, scale_, mix_weights_, scale_cholesky_ = self._init_params(X)
    lower_bound = -np.inf
    sq_maha_dist = np.empty((X.shape[0], self.n_components))
    self.converged_ = True

    @jax.jit
    def step(loc_, mix_weights_, scale_cholesky_, scale_, sq_maha_dist):
        resp, E_gamma, current_bound = self.e_step(
            X, self.df_, loc_, scale_cholesky_, mix_weights_, sq_maha_dist
        )
        mix_weights_, loc_, scale_, scale_cholesky_ = self.m_step(
            X,
            resp,
            E_gamma,
            scale_,
            scale_cholesky_,
            self.df_,
            self.reg_covar,
        )
        return (
            loc_,
            mix_weights_,
            scale_cholesky_,
            scale_,
            sq_maha_dist,
            current_bound,
        )

    for n_iter in trange(self.max_iter, desc="Running EM"):
        loc_, mix_weights_, scale_cholesky_, scale_, sq_maha_dist, current_bound = (
            step(
                loc_,
                mix_weights_,
                scale_cholesky_,
                scale_,
                sq_maha_dist,
            )
        )
        change = current_bound - lower_bound
        if abs(change) < self.tol:
            break
        lower_bound = current_bound
    else:
        self.converged_ = False
        warnings.warn(
            "tDistributionMixture did not converge, consider increasing the number of iterations."
        )
    self.n_iter_ = n_iter
    self.weights_ = mix_weights_
    self.means_ = loc_
    self.scale_cholesky_ = scale_cholesky_
    self.scale_ = scale_
    return self
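
A short sketch of the convergence diagnostics that fit sets, under the same assumptions as the earlier usage sketch (X is an (n_samples, n_features) array):

from noloox.mixture import StudentsTMixture

model = StudentsTMixture(n_components=3, max_iter=2000, tol=1e-6).fit(X)
if not model.converged_:
    # The loop exhausted max_iter; retry with a larger budget or a looser tol.
    print(f"EM stopped after {model.n_iter_} iterations without converging")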

fit_predict(X, y=None)

Estimate model parameters using X and predict the labels for X.

Parameters:

X : array-like of shape (n_samples, n_features), required
    List of n_features-dimensional data points. Each row corresponds to a single data point.
y : Ignored, default=None
    Not used, present for API consistency by convention.

Returns:

labels : array of shape (n_samples,)
    Component labels.

Source code in noloox/mixture/tmm.py
def fit_predict(self, X, y=None):
    """Estimate model parameters using X and predict the labels for X.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        List of n_features-dimensional data points. Each row
        corresponds to a single data point.

    y : Ignored
        Not used, present for API consistency by convention.

    Returns
    -------
    labels : array, shape (n_samples,)
        Component labels.
    """
    return self.fit(X).predict(X)
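
fit_predict is shorthand for chaining fit and predict on the same data; a quick illustration under the earlier assumptions:

from noloox.mixture import StudentsTMixture

# The one-liner and the two-step form produce the same labels by construction.
labels_a = StudentsTMixture(n_components=2, random_state=0).fit_predict(X)
labels_b = StudentsTMixture(n_components=2, random_state=0).fit(X).predict(X)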

predict(X)

Predict the labels for the data samples in X using the trained model.

Parameters:

X : array-like of shape (n_samples, n_features), required
    List of n_features-dimensional data points. Each row corresponds to a single data point.

Returns:

labels : array of shape (n_samples,)
    Component labels.

Source code in noloox/mixture/tmm.py
def predict(self, X):
    """Predict the labels for the data samples in X using trained model.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        List of n_features-dimensional data points. Each row
        corresponds to a single data point.

    Returns
    -------
    labels : array, shape (n_samples,)
        Component labels.
    """
    return np.argmax(self.predict_proba(X), axis=1)

predict_proba(X)

Predict the posterior probability of each component for each sample.

Parameters:

X : array-like of shape (n_samples, n_features), required
    List of n_features-dimensional data points. Each row corresponds to a single data point.

Returns:

resp : array of shape (n_samples, n_components)
    Posterior probability of each Student's t component for each sample in X.

Source code in noloox/mixture/tmm.py
def predict_proba(self, X):
    """Evaluate the components' density for each sample.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        List of n_features-dimensional data points. Each row
        corresponds to a single data point.

    Returns
    -------
    resp : array, shape (n_samples, n_components)
        Posterior probability of each Student's t component for each sample in X.
    """
    sq_maha_dist = sq_maha_distance(X, self.means_, self.scale_cholesky_)
    loglik = get_loglikelihood(
        X, sq_maha_dist, self.df_, self.scale_cholesky_, self.weights_
    )
    weighted_loglik = loglik + np.log(self.weights_)[np.newaxis, :]
    with np.errstate(under="ignore"):
        loglik = weighted_loglik - logsumexp(weighted_loglik, axis=1)[:, np.newaxis]
    return np.exp(loglik)
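
A quick sanity check, assuming a fitted model and data X as in the earlier sketches: the returned responsibilities are normalized posteriors, so each row sums to one and their argmax reproduces predict:

import numpy as np

resp = model.predict_proba(X)                  # (n_samples, n_components)
assert np.allclose(resp.sum(axis=1), 1.0)      # rows are normalized posteriors
assert np.array_equal(resp.argmax(axis=1), model.predict(X))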

score(X, y=None)

Compute the per-sample average log-likelihood of the given data X.

Parameters:

X : array-like of shape (n_samples, n_dimensions), required
    List of n_features-dimensional data points. Each row corresponds to a single data point.
y : Ignored, default=None
    Not used, present for API consistency by convention.

Returns:

log_likelihood : float
    Per-sample average log-likelihood of X under the Student's t mixture model.

Source code in noloox/mixture/tmm.py
def score(self, X, y=None):
    """Compute the per-sample average log-likelihood of the given data X.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_dimensions)
        List of n_features-dimensional data points. Each row
        corresponds to a single data point.

    y : Ignored
        Not used, present for API consistency by convention.

    Returns
    -------
    log_likelihood : float
        Per-sample average log-likelihood of `X` under the Student's t mixture model.
    """
    return np.mean(self.score_samples(X))
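
By construction, score is the mean of score_samples; a one-line check under the same assumptions as above:

import numpy as np

assert np.isclose(model.score(X), np.mean(model.score_samples(X)))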

score_samples(X)

Compute the log-likelihood of each sample.

Parameters:

X : array-like of shape (n_samples, n_features), required
    List of n_features-dimensional data points. Each row corresponds to a single data point.

Returns:

log_prob : array of shape (n_samples,)
    Log-likelihood of each sample in X under the current model.

Source code in noloox/mixture/tmm.py
def score_samples(self, X):
    """Compute the log-likelihood of each sample.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        List of n_features-dimensional data points. Each row
        corresponds to a single data point.

    Returns
    -------
    log_prob : array, shape (n_samples,)
        Log-likelihood of each sample in `X` under the current model.
    """
    sq_maha_dist = sq_maha_distance(X, self.means_, self.scale_cholesky_)
    loglik = get_loglikelihood(
        X, sq_maha_dist, self.df_, self.scale_cholesky_, self.weights_
    )
    weighted_loglik = loglik + np.log(self.weights_)[np.newaxis, :]
    return logsumexp(weighted_loglik, axis=1)
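
A common use of per-sample log-likelihoods is flagging outliers; a hedged sketch, assuming a fitted model and data X as above, where the 1% cutoff is an arbitrary illustrative choice:

import numpy as np

log_prob = model.score_samples(X)
threshold = np.percentile(log_prob, 1.0)  # arbitrary 1% cutoff
outliers = X[log_prob < threshold]        # samples the mixture finds least likely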