// Licensed to the Apache Software Foundation (ASF) under one or more
// contributor license agreements. See the NOTICE file distributed with
// this work for additional information regarding copyright ownership.
// The ASF licenses this file to You under the Apache License, Version 2.0
// (the "License"); you may not use this file except in compliance with
// the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
= K-Means Clustering

K-Means is one of the most commonly used clustering algorithms. It groups data points into a predefined number of clusters.

== Model

K-Means clustering aims to partition n observations into k clusters so that each observation belongs to the cluster with the nearest mean, which serves as the prototype of that cluster.

The model holds a vector of k centers and one of the distance metrics provided by the ML framework, such as Euclidean, Hamming, or Manhattan.

The model is trained and then assigns a cluster label to an input vector as follows:

[source, java]
----
// Train the model on the data stored in the cache
KMeansModel mdl = trainer.fit(
    ignite,
    dataCache,
    vectorizer
);

// Predict the cluster label for an input vector
double clusterLabel = mdl.predict(inputVector);
----
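
Conceptually, prediction with such a model reduces to finding the center nearest to the input. The following is a minimal plain-Java sketch of that idea, not Ignite's implementation: the class `NearestCenterSketch` and its methods are hypothetical names, centers are plain `double[]` arrays, and Euclidean distance is assumed as the metric. The returned label is simply the index of the closest center, and `main` shows a small usage example.

```java
// Illustrative sketch only; all names here are hypothetical and not part of the Ignite API.
class NearestCenterSketch {
    /** Euclidean distance between two points of equal dimension (assumed metric). */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Returns the index of the center closest to the point; the first center wins on ties. */
    static int predict(double[][] centers, double[] point) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centers.length; i++) {
            double dist = euclidean(centers[i], point);
            if (dist < bestDist) {
                bestDist = dist;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centers = {{0.0, 0.0}, {10.0, 10.0}};
        System.out.println(predict(centers, new double[] {1.0, 2.0})); // closest to center 0
        System.out.println(predict(centers, new double[] {9.0, 8.0})); // closest to center 1
    }
}
```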

== Trainer

KMeans is an unsupervised learning algorithm. It solves a clustering task, that is, it groups a set of objects so that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups.

KMeans is a parametrized iterative algorithm. On each iteration, it recalculates the means so that they become the centroids of the observations in each cluster.

Presently, Ignite supports the following parameters for the KMeans clustering algorithm:

* `k` - the number of clusters
* `maxIterations` - the maximum number of iterations; one of the two stopping criteria (the other is `epsilon`)
* `epsilon` - the convergence delta (the difference between the old and new centroid values)
* `distance` - one of the distance metrics provided by the ML framework, such as Euclidean, Hamming, or Manhattan
* `seed` - an initialization parameter that makes models reproducible (the trainer uses a random initialization step to pick the first centroids)

[source, java]
----
// Set up the trainer
KMeansTrainer trainer = new KMeansTrainer()
    .withDistance(new EuclideanDistance())
    .withK(AMOUNT_OF_CLUSTERS)
    .withMaxIterations(MAX_ITERATIONS)
    .withEpsilon(PRECISION);

// Build the model
KMeansModel mdl = trainer.fit(
    ignite,
    dataCache,
    vectorizer
);
----
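
To illustrate how `k`, `maxIterations`, `epsilon`, and `seed` interact, here is a plain-Java sketch of the classic K-Means loop (Lloyd's algorithm). It is an illustration under stated assumptions, not the code behind `KMeansTrainer`: the names `KMeansLoopSketch`, `randomInit`, and `fit` are hypothetical, the initial centroids are assumed to be randomly picked data points, Euclidean distance is assumed, and convergence is assumed to mean that every centroid moved by less than `epsilon`. Here `k` is the number of initial centroids, and the random initialization is kept separate from the loop so that the effect of the seed is easy to see.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sketch only; all names here are hypothetical and not part of the Ignite API.
class KMeansLoopSketch {
    /** Euclidean distance between two points of equal dimension (assumed metric). */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++)
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    /** Index of the centroid nearest to the point; the first centroid wins on ties. */
    static int nearest(double[][] centroids, double[] point) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++)
            if (distance(centroids[c], point) < distance(centroids[best], point))
                best = c;
        return best;
    }

    /** Picks k distinct data points as starting centroids; assumes k <= data.length. */
    static double[][] randomInit(double[][] data, int k, long seed) {
        Random rnd = new Random(seed); // same seed, same starting centroids
        int[] idx = new int[data.length];
        for (int i = 0; i < idx.length; i++)
            idx[i] = i;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            // Partial Fisher-Yates shuffle: swap a random remaining index into position c.
            int j = c + rnd.nextInt(idx.length - c);
            int tmp = idx[c];
            idx[c] = idx[j];
            idx[j] = tmp;
            centroids[c] = data[idx[c]].clone();
        }
        return centroids;
    }

    /** Repeats assignment and update steps until the stopping criteria are met. */
    static double[][] fit(double[][] data, double[][] initial, int maxIterations, double epsilon) {
        int k = initial.length;
        int dim = data[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++)
            centroids[c] = initial[c].clone();

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: accumulate each point into its nearest cluster.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : data) {
                int c = nearest(centroids, p);
                counts[c]++;
                for (int d = 0; d < dim; d++)
                    sums[c][d] += p[d];
            }
            // Update step: move each centroid to the mean of its cluster.
            double maxShift = 0.0;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0)
                    continue; // An empty cluster keeps its previous centroid.
                double[] updated = new double[dim];
                for (int d = 0; d < dim; d++)
                    updated[d] = sums[c][d] / counts[c];
                maxShift = Math.max(maxShift, distance(centroids[c], updated));
                centroids[c] = updated;
            }
            if (maxShift < epsilon)
                break; // Converged before reaching maxIterations.
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] data = {{1.0}, {2.0}, {3.0}, {10.0}, {11.0}, {12.0}};
        double[][] centroids = fit(data, randomInit(data, 2, 42L), 10, 1e-4);
        System.out.println(Arrays.deepToString(centroids));
    }
}
```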

== Example

To see how K-Means clustering can be used in practice, try this https://github.com/apache/ignite/blob/master/examples/src/main/java/org/apache/ignite/examples/ml/clustering/KMeansClusterizationExample.java[example^], which is available on GitHub and delivered with every Apache Ignite distribution.

The training dataset is a subset of the Iris dataset (the classes with labels 1 and 2, which form a linearly separable two-class dataset). It can be loaded from the https://archive.ics.uci.edu/ml/datasets/iris[UCI Machine Learning Repository].