// Licensed to the Apache Software Foundation (ASF) under one or more
// contributor license agreements.  See the NOTICE file distributed with
// this work for additional information regarding copyright ownership.
// The ASF licenses this file to You under the Apache License, Version 2.0
// (the "License"); you may not use this file except in compliance with
// the License.  You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
= K-Means Clustering

K-means is one of the most commonly used clustering algorithms. It groups data points into a predefined number of clusters.

== Model

K-Means clustering aims to partition n observations into k clusters so that each observation belongs to the cluster with the nearest mean, which serves as the prototype of the cluster.

The model holds a vector of k centers and one of the distance metrics provided by the ML framework, such as Euclidean, Hamming, or Manhattan.

The model predicts the cluster label for an input vector as follows:

[source, java]
----
// Train the model on the data stored in the cache
KMeansModel mdl = trainer.fit(
    ignite,
    dataCache,
    vectorizer
);

// Predict the cluster label for a new input vector
double clusterLabel = mdl.predict(inputVector);
----
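
To make the prediction step concrete, the following is a minimal plain-Java sketch of nearest-center assignment with Euclidean distance. It only illustrates the idea and is not Ignite's implementation: the class `NearestCenterSketch` and its methods are hypothetical names, and points are plain `double[]` arrays rather than Ignite vectors.

```java
/** Illustrative only: a hypothetical helper, not part of the Ignite API. */
final class NearestCenterSketch {
    /** Euclidean distance between two points of equal dimension. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Returns the index of the center that is closest to the point. */
    static int predict(double[][] centers, double[] point) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double dist = euclidean(centers[c], point);
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }
}
```

With centers `(0, 0)` and `(10, 10)`, `NearestCenterSketch.predict` assigns the point `(1, 2)` to cluster 0 and the point `(9, 8)` to cluster 1.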

== Trainer

KMeans is an unsupervised learning algorithm. It solves a clustering task: grouping a set of objects so that objects in the same group (called a cluster) are more similar, in some sense, to each other than to objects in other groups.

KMeans is a parametrized iterative algorithm. On each iteration, it recalculates the mean of every cluster as the centroid of the observations assigned to that cluster.

Presently, Ignite supports the following parameters for the KMeans clustering algorithm:

* `k` - the number of clusters
* `maxIterations` - one stopping criterion (the other is `epsilon`)
* `epsilon` - the convergence delta (the difference between the old and new centroid values)
* `distance` - one of the distance metrics provided by the ML framework, such as Euclidean, Hamming, or Manhattan
* `seed` - an initialization parameter that makes models reproducible (the trainer has a random initialization step that selects the first centroids)
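
The metrics named for the `distance` parameter can be sketched with their textbook definitions. The plain-Java class below is hypothetical and is not one of the Ignite distance classes, whose implementations may differ in details such as how Hamming distance compares floating-point values.

```java
/** Textbook distance definitions on plain arrays; hypothetical helper, not an Ignite class. */
final class DistanceSketch {
    /** Square root of the sum of squared coordinate differences. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Sum of absolute coordinate differences. */
    static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++)
            sum += Math.abs(a[i] - b[i]);
        return sum;
    }

    /** Number of coordinates at which the vectors differ (exact comparison is an assumption). */
    static double hamming(double[] a, double[] b) {
        int diff = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] != b[i])
                diff++;
        }
        return diff;
    }
}
```

For the vectors `(0, 0)` and `(3, 4)`, the Euclidean distance is 5 and the Manhattan distance is 7.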

[source, java]
----
// Set up the trainer
KMeansTrainer trainer = new KMeansTrainer()
    .withDistance(new EuclideanDistance())
    .withK(AMOUNT_OF_CLUSTERS)
    .withMaxIterations(MAX_ITERATIONS)
    .withEpsilon(PRECISION);

// Build the model
KMeansModel mdl = trainer.fit(
    ignite,
    dataCache,
    vectorizer
);
----
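
How the trainer parameters interact can be illustrated with a simplified, single-node version of the iteration (Lloyd's algorithm). The sketch below is not Ignite's distributed trainer. The class `KMeansSketch` is a hypothetical name, Euclidean distance is hard-coded, and stopping when the largest centroid shift falls below `epsilon` is an assumption about how the convergence delta is applied.

```java
import java.util.Random;

/** Simplified Lloyd's iteration on plain arrays; hypothetical sketch, not Ignite's trainer. */
final class KMeansSketch {
    static double[][] fit(double[][] data, int k, int maxIterations, double epsilon, long seed) {
        if (k < 1 || k > data.length)
            throw new IllegalArgumentException("k must be between 1 and the number of points");

        int dim = data[0].length;
        Random rnd = new Random(seed);

        // Random initialization: shuffle the indices with the seeded generator
        // and use the first k points as the initial centroids.
        int[] idx = new int[data.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        for (int i = idx.length - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);
            int tmp = idx[i]; idx[i] = idx[j]; idx[j] = tmp;
        }
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) centers[c] = data[idx[c]].clone();

        for (int iter = 0; iter < maxIterations; iter++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];

            // Assignment step: add each point to the running sum of its nearest centroid.
            for (double[] p : data) {
                int c = nearest(centers, p);
                counts[c]++;
                for (int j = 0; j < dim; j++) sums[c][j] += p[j];
            }

            // Update step: move each centroid to the mean of its assigned points.
            double maxShift = 0.0;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // an empty cluster keeps its old centroid
                double[] updated = new double[dim];
                for (int j = 0; j < dim; j++) updated[j] = sums[c][j] / counts[c];
                maxShift = Math.max(maxShift, euclidean(centers[c], updated));
                centers[c] = updated;
            }

            // Second stopping criterion: the centroids have effectively stopped moving.
            if (maxShift < epsilon) break;
        }
        return centers;
    }

    /** Index of the centroid closest to the point. */
    static int nearest(double[][] centers, double[] p) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (euclidean(centers[c], p) < euclidean(centers[best], p)) best = c;
        }
        return best;
    }

    /** Euclidean distance between two points of equal dimension. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }
}
```

Calling `KMeansSketch.fit` twice with the same data and the same `seed` returns identical centers, which is the reproducibility that the `seed` parameter is meant to provide.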

== Example

To see how K-Means clustering can be used in practice, try this https://github.com/apache/ignite/blob/master/examples/src/main/java/org/apache/ignite/examples/ml/clustering/KMeansClusterizationExample.java[example^], which is available on GitHub and delivered with every Apache Ignite distribution.

The training dataset is a subset of the Iris dataset (the classes with labels 1 and 2, which form a linearly separable two-class dataset). The Iris dataset can be loaded from the https://archive.ics.uci.edu/ml/datasets/iris[UCI Machine Learning Repository].