// Licensed to the Apache Software Foundation (ASF) under one or more
// contributor license agreements.  See the NOTICE file distributed with
// this work for additional information regarding copyright ownership.
// The ASF licenses this file to You under the Apache License, Version 2.0
// (the "License"); you may not use this file except in compliance with
// the License.  You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
= Split the dataset into test and train datasets

Data splitting divides the data stored in a cache into two parts: the training part, which is used to train the model, and the test part, which is used to estimate the model quality.

All `fit()` methods have a special parameter that passes a filter condition to each cache.

[NOTE]
====
Because dataset operations are distributed and lazy, the dataset split is a lazy operation too. It is defined as a filter condition that is applied to the initial cache to form both the train and test datasets.
====
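
To make the lazy split more concrete, here is a minimal plain-Java sketch of the idea. It is not Ignite's implementation. It assumes that each key is deterministically mapped to a value in the range [0, 1) and that a row belongs to the train part when that value is below the train fraction. The class `LazySplitSketch`, its methods, and the hash-based mapping are hypothetical and chosen only for illustration.

[source, java]
----
import java.util.function.BiPredicate;

/** Illustrative sketch only: a train/test split expressed as two lazy filters. */
final class LazySplitSketch {
    /** Maps a key to a pseudo-uniform value in [0, 1). Deterministic, so every node gets the same answer. */
    static double toUnitInterval(int key) {
        // Multiplicative hashing scrambles sequential keys (assumed mapping, not Ignite's).
        int h = key * 0x9E3779B9;
        h ^= h >>> 16;
        return (h & 0x7FFFFFFF) / (double) (1L << 31);
    }

    /** Accepts the rows whose key maps below the train fraction. */
    static BiPredicate<Integer, double[]> trainFilter(double trainSize) {
        return (key, row) -> toUnitInterval(key) < trainSize;
    }

    /** Accepts exactly the rows that the train filter rejects. */
    static BiPredicate<Integer, double[]> testFilter(double trainSize) {
        return trainFilter(trainSize).negate();
    }

    public static void main(String[] args) {
        BiPredicate<Integer, double[]> train = trainFilter(0.75);

        // Nothing is copied: the filter is evaluated only while the keys are scanned.
        int accepted = 0;
        for (int key = 0; key < 10_000; key++) {
            if (train.test(key, null))
                accepted++;
        }

        System.out.println("Share of keys accepted by the train filter: " + accepted / 10_000.0);
    }
}
----

Both filters are derived from the same deterministic mapping, so they are complementary, and no data is moved until something actually scans the rows.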

In the example below, the model is trained on only 75% of the initial dataset. The filter parameter value is the result of `split.getTrainFilter()`, which accepts or rejects each row of the initial dataset for use during training.

[source, java]
----
// Define the cache.
IgniteCache<Integer, Vector> dataCache = ...;

// Define the fraction of the initial dataset that forms the train sub-set.
// Explicit type arguments are needed: the diamond operator cannot infer them in a chained call.
TrainTestSplit<Integer, Vector> split = new TrainTestDatasetSplitter<Integer, Vector>().split(0.75);

// Train the model only on the rows accepted by the train filter.
IgniteModel<Vector, Double> mdl = trainer
  .fit(ignite, dataCache, split.getTrainFilter(), vectorizer);
----

`split.getTestFilter()` can be used to validate the model on the test data.
The example below works with the cache directly: it prints the predicted and actual regression values for the test sub-set of the initial dataset.

[source, java]
----
// Define the cache query and set the filter.
ScanQuery<Integer, Vector> qry = new ScanQuery<>();
qry.setFilter(split.getTestFilter());

try (QueryCursor<Cache.Entry<Integer, Vector>> observations = dataCache.query(qry)) {
    for (Cache.Entry<Integer, Vector> observation : observations) {
        Vector val = observation.getValue();
        Vector inputs = val.copyOfRange(1, val.size());
        double groundTruth = val.get(0);

        double prediction = mdl.predict(inputs);

        System.out.printf(">>> | %.4f\t\t| %.4f\t\t|\n", prediction, groundTruth);
    }
}
----
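
Instead of printing each pair, the same loop can accumulate a single quality estimate. The plain-Java sketch below shows one possible way to do that with the mean squared error; it does not use the Ignite API. It assumes the same row layout as above (label at index 0, features after it), and the class `TestSubsetEvaluationSketch`, the odd-key test filter, and the toy model are hypothetical.

[source, java]
----
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiPredicate;
import java.util.function.ToDoubleFunction;

/** Illustrative sketch only: estimating model quality on the rows accepted by a test filter. */
final class TestSubsetEvaluationSketch {
    /** Returns the mean squared error over the accepted rows, or NaN if no row is accepted. */
    static double meanSquaredError(Map<Integer, double[]> rows,
                                   BiPredicate<Integer, double[]> testFilter,
                                   ToDoubleFunction<double[]> model) {
        double sumSq = 0.0;
        int cnt = 0;

        for (Map.Entry<Integer, double[]> e : rows.entrySet()) {
            // Skip the rows that belong to the train part.
            if (!testFilter.test(e.getKey(), e.getValue()))
                continue;

            double[] row = e.getValue();
            double groundTruth = row[0];
            double[] inputs = Arrays.copyOfRange(row, 1, row.length);

            double err = model.applyAsDouble(inputs) - groundTruth;
            sumSq += err * err;
            cnt++;
        }

        return cnt == 0 ? Double.NaN : sumSq / cnt;
    }

    public static void main(String[] args) {
        // Toy data: each row is {label, feature}.
        Map<Integer, double[]> rows = new LinkedHashMap<>();
        rows.put(1, new double[] {2.0, 1.0});
        rows.put(2, new double[] {4.0, 2.0});
        rows.put(3, new double[] {7.0, 3.0});

        // Hypothetical test filter: rows with odd keys form the test part.
        BiPredicate<Integer, double[]> testFilter = (key, row) -> key % 2 == 1;

        // Hypothetical model: predicts twice the single feature.
        ToDoubleFunction<double[]> model = inputs -> 2.0 * inputs[0];

        System.out.printf("MSE on the test sub-set: %.4f%n", meanSquaredError(rows, testFilter, model));
    }
}
----

Only the rows accepted by the test filter contribute to the estimate, so the rows used for training do not make the metric look better than it is.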