// Licensed to the Apache Software Foundation (ASF) under one or more
// contributor license agreements. See the NOTICE file distributed with
// this work for additional information regarding copyright ownership.
// The ASF licenses this file to You under the Apache License, Version 2.0
// (the "License"); you may not use this file except in compliance with
// the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
= Gaussian mixture (GMM)

A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters.

NOTE: You could think of mixture models as generalizing k-means clustering to incorporate information about the covariance structure of the data as well as the centers of the latent Gaussians.
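
In symbols, the definition above corresponds to the standard mixture density shown below. The notation (K components with weights π_k, means μ_k, and covariance matrices Σ_k) is introduced here for illustration only and is not part of the Ignite API; the block also assumes the page is rendered with STEM support.

[latexmath]
++++
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1
++++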

== Model

This algorithm represents a soft clustering model in which each cluster is a Gaussian distribution with its own mean and covariance matrix. Such a model predicts the cluster of a vector using the maximum likelihood principle.

The model assigns labels as follows:

[source, java]
----
// Train the GMM model on data stored in an Ignite cache.
GmmModel mdl = trainer.fit(
    ignite,
    dataCache,
    vectorizer
);

// Predict the cluster label for a single input vector.
double clusterLabel = mdl.predict(inputVector);
----
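
The maximum likelihood assignment described above can be illustrated with a minimal, self-contained sketch. It is not the Ignite implementation: it assumes one-dimensional Gaussians to avoid matrix algebra, and the class `GmmPredictSketch` and its methods are hypothetical names chosen for this example.

[source, java]
----
// Illustrative sketch only; this is not the Apache Ignite implementation.
// One-dimensional Gaussians are used so that no matrix algebra is needed.
final class GmmPredictSketch {
    /** Density of a 1-D Gaussian with the given mean and variance at point x. */
    static double gaussianPdf(double x, double mean, double variance) {
        double diff = x - mean;
        return Math.exp(-diff * diff / (2.0 * variance)) / Math.sqrt(2.0 * Math.PI * variance);
    }

    /** Soft assignment: posterior probability of each component given x. */
    static double[] responsibilities(double x, double[] weights, double[] means, double[] variances) {
        double[] resp = new double[weights.length];
        double total = 0.0;
        for (int k = 0; k < weights.length; k++) {
            resp[k] = weights[k] * gaussianPdf(x, means[k], variances[k]);
            total += resp[k];
        }
        for (int k = 0; k < resp.length; k++)
            resp[k] /= total;
        return resp;
    }

    /** Hard label: index of the component with the highest posterior probability. */
    static int predict(double x, double[] weights, double[] means, double[] variances) {
        double[] resp = responsibilities(x, weights, means, variances);
        int best = 0;
        for (int k = 1; k < resp.length; k++) {
            if (resp[k] > resp[best])
                best = k;
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical mixture: two equally weighted components centered at 0 and 10.
        double[] weights = {0.5, 0.5};
        double[] means = {0.0, 10.0};
        double[] variances = {1.0, 1.0};

        System.out.println(predict(1.0, weights, means, variances)); // prints 0
        System.out.println(predict(9.0, weights, means, variances)); // prints 1
    }
}
----

The `responsibilities` method shows the "soft" side of the model (a probability for every cluster), while `predict` reduces it to a single label by taking the most probable component.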

== Trainer

GMM is an unsupervised learning algorithm. The GMM trainer implements the expectation-maximization (EM) algorithm for fitting mixture-of-Gaussians models. It can compute the Bayesian Information Criterion to assess the number of clusters in the data.

Presently, Ignite ML supports the following parameters for the GMM clustering algorithm:

* `maxCountOfClusters` - the maximum number of possible clusters
* `maxCountOfIterations` - the maximum number of iterations; one of the two stopping criteria (the other is `epsilon`)
* `epsilon` - the convergence delta (the difference between the old and new centroid values)
* `countOfComponents` - the number of components
* `maxLikelihoodDivergence` - the maximum divergence between the maximum likelihood of a vector in the dataset and the likelihood of other vectors, used to identify anomalies
* `minElementsForNewCluster` - the minimum number of anomalies (in terms of `maxLikelihoodDivergence`) required to create a new cluster
* `minClusterProbability` - the minimum cluster probability

[source, java]
----
// Set up the trainer
GmmTrainer trainer = new GmmTrainer(COUNT_OF_COMPONENTS);

// Build the model
GmmModel mdl = trainer
    .withMaxCountIterations(MAX_COUNT_ITERATIONS)
    .withMaxCountOfClusters(MAX_AMOUNT_OF_CLUSTERS)
    .fit(ignite, dataCache, vectorizer);
----
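
To make the two stopping criteria more concrete, here is a minimal EM sketch for a one-dimensional mixture. It is an illustration under stated assumptions, not the Ignite trainer: the class `GmmEmSketch`, its `fit` method, the `Mixture` holder, the unit initial variances, and the variance floor of `1e-6` are all hypothetical choices made for this example. The loop stops either after `maxIterations` passes or when no mean moves by more than `epsilon`.

[source, java]
----
// Illustrative sketch only; this is not the Apache Ignite implementation.
// Fits a 1-D Gaussian mixture with EM using two stopping criteria.
final class GmmEmSketch {
    /** Fitted mixture parameters plus the number of EM iterations performed. */
    static final class Mixture {
        final double[] weights;
        final double[] means;
        final double[] variances;
        final int iterations;

        Mixture(double[] weights, double[] means, double[] variances, int iterations) {
            this.weights = weights;
            this.means = means;
            this.variances = variances;
            this.iterations = iterations;
        }
    }

    static Mixture fit(double[] data, double[] initialMeans, int maxIterations, double epsilon) {
        int n = data.length;
        int k = initialMeans.length;
        double[] means = initialMeans.clone();
        double[] weights = new double[k];
        double[] variances = new double[k];
        for (int j = 0; j < k; j++) {
            weights[j] = 1.0 / k; // assumption: start from equal component weights
            variances[j] = 1.0;   // assumption: start from unit variances
        }

        double[][] resp = new double[n][k];
        int iterations = 0;
        while (iterations < maxIterations) { // first stopping criterion
            iterations++;

            // E-step: posterior probability of each component for each point.
            for (int i = 0; i < n; i++) {
                double total = 0.0;
                for (int j = 0; j < k; j++) {
                    double diff = data[i] - means[j];
                    resp[i][j] = weights[j] * Math.exp(-diff * diff / (2.0 * variances[j]))
                        / Math.sqrt(2.0 * Math.PI * variances[j]);
                    total += resp[i][j];
                }
                for (int j = 0; j < k; j++)
                    resp[i][j] /= total;
            }

            // M-step: re-estimate weights, means, and variances.
            double maxShift = 0.0;
            for (int j = 0; j < k; j++) {
                double mass = 0.0;
                double weightedSum = 0.0;
                for (int i = 0; i < n; i++) {
                    mass += resp[i][j];
                    weightedSum += resp[i][j] * data[i];
                }
                double newMean = weightedSum / mass;

                double weightedSq = 0.0;
                for (int i = 0; i < n; i++) {
                    double diff = data[i] - newMean;
                    weightedSq += resp[i][j] * diff * diff;
                }

                maxShift = Math.max(maxShift, Math.abs(newMean - means[j]));
                means[j] = newMean;
                variances[j] = Math.max(weightedSq / mass, 1e-6); // floor avoids a zero variance
                weights[j] = mass / n;
            }

            // Second stopping criterion: the means have stopped moving.
            if (maxShift < epsilon)
                break;
        }
        return new Mixture(weights, means, variances, iterations);
    }

    public static void main(String[] args) {
        // Hypothetical data: two tight groups around 1.0 and 5.0.
        double[] data = {0.8, 0.9, 1.0, 1.1, 1.2, 4.8, 4.9, 5.0, 5.1, 5.2};
        Mixture m = fit(data, new double[] {0.0, 6.0}, 100, 1e-6);
        System.out.println("means: " + m.means[0] + ", " + m.means[1]);
        System.out.println("iterations: " + m.iterations);
    }
}
----

On well-separated data like this, the `epsilon` check usually ends the loop long before the iteration limit is reached; the limit mainly guards against slow convergence on overlapping clusters.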

== Example

To see how GMM clustering can be used in practice, try this https://github.com/apache/ignite/blob/master/examples/src/main/java/org/apache/ignite/examples/ml/clustering/GmmClusterizationExample.java[example], which is available on GitHub and delivered with every Apache Ignite distribution.