// Licensed to the Apache Software Foundation (ASF) under one or more
// contributor license agreements.  See the NOTICE file distributed with
// this work for additional information regarding copyright ownership.
// The ASF licenses this file to You under the Apache License, Version 2.0
// (the "License"); you may not use this file except in compliance with
// the License.  You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
= Metrics System

:javaFile: {javaCodeDir}/ConfiguringMetrics.java

== Overview

This section explains the metrics system and how you can use it to monitor your cluster.
//the types of metrics and how to export them, but first let's explore the basic concepts of the metrics mechanism in Ignite.

Let's explore the basic concepts of the metrics system in Ignite.
First, there are the metrics themselves.
Each metric has a name and a return value.
The return value can be a simple value, such as `String`, `long`, or `double`, or it can represent a Java object.
Some metrics represent <<histograms>>.

Then there are different ways to export the metrics, which we call _exporters_.
To put it another way, exporters are the different ways you can access the metrics.
Each exporter always gives access to all available metrics.

Ignite includes the following exporters:

* JMX (default)
* SQL Views
* Log files
* OpenCensus

You can create a custom exporter by implementing the javadoc:org.apache.ignite.spi.metric.MetricExporterSpi[] interface.


== Metric Registries [[registry]]

Metrics are grouped into categories (called _registries_).
Each registry has a name.
The full name of a specific metric within the registry consists of the registry name followed by a dot, followed by the name of the metric: `<registry_name>.<metric_name>`.
For example, the registry for data storage metrics is called `io.datastorage`.
The metric that returns the storage size is called `io.datastorage.StorageSize`.
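As a small illustration of this convention, a full metric name can be split back into its registry and metric parts at the last dot. The helper below is hypothetical (it is not part of the Ignite API), and merely demonstrates the naming scheme:

```java
// Hypothetical helper illustrating the <registry_name>.<metric_name> convention.
public class MetricNames {
    /** Splits a full metric name into {registry, metric} at the last dot. */
    public static String[] split(String fullName) {
        int idx = fullName.lastIndexOf('.');
        return new String[] {fullName.substring(0, idx), fullName.substring(idx + 1)};
    }

    public static void main(String[] args) {
        String[] parts = split("io.datastorage.StorageSize");
        System.out.println(parts[0]); // io.datastorage
        System.out.println(parts[1]); // StorageSize
    }
}
```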

The list of all registries and the metrics they contain is described link:monitoring-metrics/new-metrics[here].

== Metric Exporters

To enable metrics, configure one or more metric exporters in the node configuration.
This is a node-specific configuration, which means it enables metrics only on the node where it is specified.

[tabs]
--
tab:XML[]
[source, xml]
----
include::code-snippets/xml/metrics.xml[tags=ignite-config;!discovery, indent=0]
----

tab:Java[]

[source, java]
----
include::{javaFile}[tags=new-metric-framework, indent=0]
----

tab:C#/.NET[]

tab:C++[unsupported]
--

The following sections describe the exporters available in Ignite by default.


=== JMX

`org.apache.ignite.spi.metric.jmx.JmxMetricExporterSpi` exposes metrics via JMX beans.

[tabs]
--
tab:Java[]
[source, java]
----
include::{javaFile}[tags=metrics-filter, indent=0]
----

tab:C#/.NET[]

tab:C++[unsupported]
--

[NOTE]
====
This exporter is enabled by default if nothing is set with `IgniteConfiguration.setMetricExporterSpi(...)`.
====
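For reference, explicitly configuring the JMX exporter in Spring XML might look like the following sketch (equivalent to calling `IgniteConfiguration.setMetricExporterSpi(new JmxMetricExporterSpi())`; adjust it to your configuration file):

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="metricExporterSpi">
        <list>
            <bean class="org.apache.ignite.spi.metric.jmx.JmxMetricExporterSpi"/>
        </list>
    </property>
</bean>
```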

You can disable the default JMX exporter as follows:

[tabs]
--
tab:Java[]
[source, java]
----
include::{javaFile}[tags=disable-default-jmx-exporter, indent=0]
----

tab:C#/.NET[]

tab:C++[unsupported]
--

==== Enabling JMX for Ignite

By default, the JMX automatic configuration is disabled.
To enable it, configure the following environment variables:

* For `control.sh`, configure the `CONTROL_JVM_OPTS` variable
* For `ignite.sh`, configure the `JVM_OPTS` variable

For example:

[source,shell]
----
JVM_OPTS="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=${JMX_PORT} \
-Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"
----
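With these options set, any standard JMX client can connect to the node. As an illustration (this is standard `javax.management` API, not Ignite-specific; the host and port here are assumptions for the example), the sketch below connects over RMI and lists all beans in the `org.apache` domain:

```java
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ListIgniteBeans {
    public static void main(String[] args) throws Exception {
        // Standard JMX-over-RMI URL; the port must match -Dcom.sun.management.jmxremote.port.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi");

        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();

            // "org.apache*:*" also matches domains with an appended classloader ID.
            Set<ObjectName> beans = conn.queryNames(new ObjectName("org.apache*:*"), null);

            for (ObjectName bean : beans)
                System.out.println(bean);
        }
    }
}
```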


// link:monitoring-metrics/configuring-metrics[Configuring Metrics]

==== Understanding MBean's ObjectName

Every JMX MBean has an https://docs.oracle.com/javase/8/docs/api/javax/management/ObjectName.html[ObjectName,window=_blank].
The ObjectName is used to identify the bean.
The ObjectName consists of a domain and a list of key properties, and can be represented as a string as follows:

   domain: key1 = value1 , key2 = value2

All Ignite metrics have the same domain: `org.apache.<classloaderId>`, where the classloader ID is optional (it is omitted if you set `IGNITE_MBEAN_APPEND_CLASS_LOADER_ID=false`). In addition, each MBean name has two key properties: `group` and `name`.
For example:

    org.apache:group=SPIs,name=TcpDiscoverySpi

This MBean provides various metrics related to node discovery.
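Programmatically, an ObjectName like the one above can be constructed and taken apart with the standard `javax.management` API:

```java
import javax.management.ObjectName;

public class ObjectNameDemo {
    public static void main(String[] args) throws Exception {
        // Parse the ObjectName of the discovery SPI MBean shown above.
        ObjectName name = new ObjectName("org.apache:group=SPIs,name=TcpDiscoverySpi");

        System.out.println(name.getDomain());             // org.apache
        System.out.println(name.getKeyProperty("group")); // SPIs
        System.out.println(name.getKeyProperty("name"));  // TcpDiscoverySpi
    }
}
```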

The MBean ObjectName can be used to identify the bean in UI tools like JConsole.
For example, JConsole displays MBeans in a tree-like structure where all beans are first grouped by domain and then by the 'group' property:

image::images/jconsole.png[]

{sp}+

=== SQL View

`SqlViewMetricExporterSpi` is enabled by default and exposes metrics via the `SYS.METRICS` view.
Each metric is displayed as a single record.
You can use any supported SQL tool to view the metrics:

[source, shell,subs="attributes"]
----
> select name, value from SYS.METRICS where name LIKE 'cache.myCache.%';
+-----------------------------------+--------------------------------+
|                NAME               |             VALUE              |
+-----------------------------------+--------------------------------+
| cache.myCache.CacheTxRollbacks    | 0                              |
| cache.myCache.OffHeapRemovals     | 0                              |
| cache.myCache.QueryCompleted      | 0                              |
| cache.myCache.QueryFailed         | 0                              |
| cache.myCache.EstimatedRebalancingKeys | 0                         |
| cache.myCache.CacheEvictions      | 0                              |
| cache.myCache.CommitTime          | [J@2eb66498                    |
....
----

=== Log

`org.apache.ignite.spi.metric.log.LogExporterSpi` prints the metrics to the log file at regular intervals (1 min by default) at INFO level.

[tabs]
--
tab:XML[]

[source, xml]
----
include::code-snippets/xml/metrics.xml[tags=!*;ignite-config;log-exporter, indent=0]
----


tab:Java[]

If you use programmatic configuration, you can change the print frequency as follows:

[source, java]
----
include::{javaFile}[tags=log-exporter, indent=0]
----

tab:C#/.NET[]
tab:C++[]
--

=== OpenCensus

`org.apache.ignite.spi.metric.opencensus.OpenCensusMetricExporterSpi` adds integration with the OpenCensus library.

To use the OpenCensus exporter:

. link:setup#enabling-modules[Enable the 'ignite-opencensus' module].
. Add `org.apache.ignite.spi.metric.opencensus.OpenCensusMetricExporterSpi` to the list of exporters in the node configuration.
. Configure OpenCensus StatsCollector to export to a specific system. See link:{githubUrl}/examples/src/main/java/org/apache/ignite/examples/opencensus/OpenCensusMetricsExporterExample.java[OpenCensusMetricsExporterExample.java] for an example and OpenCensus documentation for additional information.


Configuration parameters:

* `filter` - predicate that filters metrics.
* `period` - export period.
* `sendInstanceName` - if enabled, a tag with the Ignite instance name is added to each metric.
* `sendNodeId` - if enabled, a tag with the Ignite node id is added to each metric.
* `sendConsistentId` - if enabled, a tag with the Ignite node consistent id is added to each metric.


== Histograms

The metrics that represent histograms are available in the JMX exporter only.
Histogram metrics are exported as a set of values where each value corresponds to a specific bucket and is available through a separate JMX bean attribute.
The attribute names of a histogram metric have the following format:

```
{metric_name}_{low_bound}_{high_bound}
```

where

* `{metric_name}` - the name of the metric.
* `{low_bound}` - start of the bound. `0` for the first bound.
* `{high_bound}` - end of the bound. `inf` for the last bound.


Example of the metric names if the bounds are [10,100]:

* `histogram_0_10` - less than 10.
* `histogram_10_100` - between 10 and 100.
* `histogram_100_inf` - more than 100.
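The attribute names can be derived mechanically from the configured bounds. The sketch below is a hypothetical helper (not an Ignite API) that generates the JMX attribute names for a histogram metric:

```java
import java.util.ArrayList;
import java.util.List;

public class HistogramNames {
    /** Builds JMX attribute names for a histogram metric with the given bucket bounds. */
    public static List<String> attributeNames(String metricName, long[] bounds) {
        List<String> names = new ArrayList<>();
        long low = 0;

        for (long high : bounds) {
            names.add(metricName + "_" + low + "_" + high);
            low = high;
        }

        // The last bucket is unbounded from above.
        names.add(metricName + "_" + low + "_inf");

        return names;
    }

    public static void main(String[] args) {
        // Bounds [10, 100] produce the three names from the example above.
        System.out.println(attributeNames("histogram", new long[] {10, 100}));
        // [histogram_0_10, histogram_10_100, histogram_100_inf]
    }
}
```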

== Common Monitoring Tasks
=== Monitoring the Amount of Data

If you do not use link:persistence/native-persistence[Native persistence] (i.e., all your data is kept in memory), you will want to monitor RAM usage.
If you use Native persistence, you should monitor the size of the data storage on disk in addition to RAM.

The size of the data loaded into a node is available at different levels of aggregation. You can monitor:

* The total size of the data the node keeps on disk or in RAM. This amount is the sum of the size of each configured data region (in the simplest case, only the default data region) plus the sizes of the system data regions.
* The size of a specific link:memory-configuration/data-regions[data region] on that node. The data region size is the sum of the sizes of all cache groups.
* The size of a specific cache/cache group on that node, including the backup partitions.


==== Allocated Space vs. Actual Size of Data

There is no way to get the exact size of the data (neither in RAM nor on disk). Instead, there are two ways to estimate it.

You can get the size of the space _allocated_ for storing the data.
(The "space" here refers either to the space in RAM or on disk, depending on whether you use Native persistence.)
Space is allocated when the existing storage fills up and more entries need to be added.
However, when you remove entries from caches, the space is not deallocated.
It is reused when new entries need to be added to the storage on subsequent write operations. Therefore, the allocated size does not decrease when you remove entries from the caches.
The allocated size is available at the level of data storage, data region, and cache group metrics.
The metric is called `TotalAllocatedSize`.

You can also get an estimate of the actual size of data by multiplying the number of link:memory-centric-storage#data-pages[data pages] in use by the fill factor. The fill factor is the ratio of the size of data in a page to the page size, averaged over all pages. The number of pages in use and the fill factor are available at the level of data <<Data Region Size,region metrics>>.

Add up the estimated size of all data regions to get the estimated total amount of data on the node.
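The estimate above is simple arithmetic: pages in use times the fill factor times the page size. The sketch below assumes the default 4 KB page size (configurable via `DataStorageConfiguration`); the helper itself is hypothetical, the inputs map to the `TotalUsedPages` and `PagesFillFactor` metrics described below:

```java
public class DataSizeEstimate {
    /**
     * Estimated size of the actual data in a region, in bytes:
     * TotalUsedPages * PagesFillFactor * pageSize.
     */
    public static long estimate(long totalUsedPages, float pagesFillFactor, int pageSize) {
        return (long)(totalUsedPages * pagesFillFactor * pageSize);
    }

    public static void main(String[] args) {
        // E.g. 1,000,000 pages in use at 75% average fill with 4 KB pages.
        System.out.println(estimate(1_000_000L, 0.75f, 4096)); // 3072000000
    }
}
```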


:allocsize_note: Note that when Native persistence is disabled, this metric shows the total size of the allocated space in RAM.

==== Monitoring RAM Memory Usage
The amount of data in RAM can be monitored for each data region through the following metrics:

[{table_opts}]
|===
| Attribute | Type | Description | Scope

| PagesFillFactor | float | The average size of data in pages as a ratio of the page size. When Native persistence is enabled, this metric is applicable only to the persistent storage (i.e. pages on disk). | Node
| TotalUsedPages | long | The number of data pages that are currently in use. When Native persistence is enabled, this metric is applicable only to the persistent storage (i.e. pages on disk). | Node
| PhysicalMemoryPages | long | The number of allocated pages in RAM. | Node
| PhysicalMemorySize | long | The size of the allocated space in RAM in bytes. | Node
|===

If you have multiple data regions, add up the sizes of all data regions to get the total size of the data on the node.

==== Monitoring Storage Size

Persistent storage, when enabled, saves all application data on disk.
The total amount of data each node keeps on disk consists of the persistent storage (application data), the link:persistence/native-persistence#write-ahead-log[WAL files], and link:persistence/native-persistence#wal-archive[WAL Archive] files.

===== Persistent Storage Size
To monitor the size of the persistent storage on disk, use the following metrics:

[{table_opts}]
|===
| Attribute | Type | Description | Scope
| TotalAllocatedSize | long | The size of the space allocated on disk for the entire data storage (in bytes). {allocsize_note} | Node
| WalTotalSize | long | Total size of the WAL files in bytes, including the WAL archive files. | Node
| WalArchiveSegments | int | The number of WAL segments in the archive. | Node
|===

===== Data Region Size

Metrics collection for data regions is disabled by default. You can link:monitoring-metrics/configuring-metrics#enabling-data-region-metrics[enable it] in the configuration of each data region.

The size of the data region on a node comprises the size of all partitions (including backup partitions) that this node owns for all caches in that data region.

[{table_opts}]
|===
| Attribute | Type | Description | Scope

| TotalAllocatedSize | long | The size of the space allocated for this data region (in bytes). {allocsize_note} | Node
| PagesFillFactor | float | The average amount of data in pages as a ratio of the page size. | Node
| TotalUsedPages | long | The number of data pages that are currently in use. | Node
| PhysicalMemoryPages | long | The number of data pages in this data region held in RAM. | Node
| PhysicalMemorySize | long | The size of the allocated space in RAM in bytes. | Node
|===

===== Cache Group Size

If you don't use link:configuring-caches/cache-groups[cache groups], each cache is its own group.

[{table_opts}]
|===
| Attribute | Type | Description | Scope
| TotalAllocatedSize | long | The amount of space allocated for the cache group on this node. | Node
|===

=== Monitoring Checkpointing Operations
Checkpointing may slow down cluster operations.
You may want to monitor how much time each checkpoint operation takes, so that you can tune the properties that affect checkpointing.
You may also want to monitor the disk performance to see if the slow-down is caused by external reasons.

See link:persistence/persistence-tuning#pages-writes-throttling[Pages Writes Throttling] and link:persistence/persistence-tuning#adjusting-checkpointing-buffer-size[Checkpointing Buffer Size] for performance tips.

[{table_opts}]
|===
| Attribute | Type | Description | Scope
| DirtyPages | long | The number of pages in memory that have been changed but not yet synchronized to disk. They will be written to disk during the next checkpoint. | Node
| LastCheckpointDuration | long | The time in milliseconds it took to create the last checkpoint. | Node
| CheckpointBufferSize | long | The size of the checkpointing buffer. | Global
|===

=== Monitoring Rebalancing
link:data-rebalancing[Rebalancing] is the process of moving partitions between the cluster nodes so that the data is always distributed in a balanced manner. Rebalancing is triggered when a new node joins or an existing node leaves the cluster.

If you have multiple caches, they are rebalanced sequentially.
There are several metrics that you can use to monitor the progress of the rebalancing process for a specific cache.

In the metrics system, these are link:monitoring-metrics/new-metrics#caches[cache metrics]:
[{table_opts}]
|===
| Attribute | Type | Description | Scope
| RebalancingStartTime | long | The time when rebalancing of local partitions started for the cache, in milliseconds. This metric returns 0 if the local partitions do not participate in the rebalancing. | Node
| EstimatedRebalancingFinishTime | long | Expected time of completion of the rebalancing process. | Node
| KeysToRebalanceLeft | long | The number of keys on the node that remain to be rebalanced. You can monitor this metric to learn when the rebalancing process finishes. | Node
|===

=== Monitoring Topology
Topology refers to the set of nodes in a cluster. There are a number of metrics that expose information about the topology of the cluster. If the topology changes too frequently, or has a size that is different from what you expect, you may want to look into whether there are network problems.

[{table_opts}]
|===
| Attribute | Type | Description | Scope
| TotalServerNodes | long | The number of server nodes in the cluster. | Global
| TotalClientNodes | long | The number of client nodes in the cluster. | Global
| TotalBaselineNodes | long | The number of nodes that are registered in the link:clustering/baseline-topology[baseline topology]. When a node goes down, it remains registered in the baseline topology and you need to remove it manually. | Global
| ActiveBaselineNodes | long | The number of nodes that are currently active in the baseline topology. | Global
|===

[{table_opts}]
|===
| Attribute | Type | Description | Scope
| Coordinator | String | The node ID of the current coordinator node. | Global
| CoordinatorNodeFormatted|String a|
Detailed information about the coordinator node.
....
TcpDiscoveryNode [id=e07ad289-ff5b-4a73-b3d4-d323a661b6d4,
consistentId=fa65ff2b-e7e2-4367-96d9-fd0915529c25,
addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.25.4.200],
sockAddrs=[mymachine.local/172.25.4.200:47500,
/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500,
order=2, intOrder=2, lastExchangeTime=1568187777249, loc=false,
ver=8.7.5#20190520-sha1:d159cd7a, isClient=false]
....

| Global
|===

=== Monitoring Caches

See link:monitoring-metrics/new-metrics#caches[Cache metrics] in the new metrics system.

==== Monitoring Build and Rebuild Indexes

To get an estimate of how long it takes to rebuild cache indexes, you can use one of the metrics listed below:

. `IsIndexRebuildInProgress` - tells whether indexes are being built or rebuilt at the moment;
. `IndexBuildCountPartitionsLeft` - gives the remaining number of partitions (by cache group) for indexes to rebuild.

Note that the `IndexBuildCountPartitionsLeft` metric only gives an approximate estimate of how much index rebuilding is left.
For a more accurate estimate, use the `IndexRebuildKeysProcessed` cache metric:

* Use `IsIndexRebuildInProgress` to know whether the indexes are being rebuilt for the cache.

* Use `IndexRebuildKeysProcessed` to know the number of keys with rebuilt indexes. If the rebuilding is in progress, it gives the number of keys with indexes being rebuilt at the current moment. Otherwise, it gives the total number of keys with rebuilt indexes. The values are reset before the start of each rebuild.
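If you also know the total number of keys in the cache (for example, from the cache size metric), the two values give a rough progress percentage. The helper below is a hypothetical sketch of that arithmetic, not an Ignite API:

```java
public class RebuildProgress {
    /**
     * Rough index-rebuild progress in percent, computed from the
     * IndexRebuildKeysProcessed metric and the total number of keys.
     */
    public static double progress(long keysProcessed, long totalKeys) {
        if (totalKeys <= 0)
            return 0.0;

        return Math.min(100.0, 100.0 * keysProcessed / totalKeys);
    }

    public static void main(String[] args) {
        System.out.println(progress(250_000L, 1_000_000L)); // 25.0
    }
}
```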

=== Monitoring Transactions
Note that if a transaction spans multiple nodes (i.e., if the keys that are changed as a result of the transaction are located on multiple nodes), the counters increase on each node. For example, the `TransactionsCommittedNumber` counter increases on each node where the keys affected by the transaction are stored.

[{table_opts}]
|===
| Attribute | Type | Description | Scope
| LockedKeysNumber | long | The number of keys locked on the node. | Node
| TransactionsCommittedNumber | long | The number of transactions that have been committed on the node. | Node
| TransactionsRolledBackNumber | long | The number of transactions that were rolled back. | Node
| OwnerTransactionsNumber | long | The number of transactions initiated on the node. | Node
| TransactionsHoldingLockNumber | long | The number of open transactions that hold a lock on at least one key on the node. | Node
|===

=== Monitoring Snapshots

[{table_opts}]
|===
| Attribute | Type | Description | Scope
| LastSnapshotOperation |  | |
| LastSnapshotStartTime || |
| SnapshotInProgress | | |
|===

=== Monitoring Client Connections
Metrics related to JDBC/ODBC or thin client connections.

[{table_opts}]
|===
| Attribute | Type | Description | Scope
| Connections | java.util.List<String> a| A list of strings, each string containing information about a connection:

....
JdbcClient [id=4294967297, user=<anonymous>,
rmtAddr=127.0.0.1:39264, locAddr=127.0.0.1:10800]
....
| Node
|===


=== Monitoring Message Queues
When thread pool queues grow, it means that the node cannot keep up with the load, or that there was an error while processing messages in the queue.
Continuous growth of the queue size can lead to OOM errors.

==== Communication Message Queue
The queue of outgoing communication messages contains communication messages that are waiting to be sent to other nodes.
If the size is growing, it means there is a problem.

[{table_opts}]
|===
| Attribute | Type | Description | Scope
| OutboundMessagesQueueSize | int | The size of the queue of outgoing communication messages. | Node
|===

==== Discovery Messages Queue

The queue of discovery messages.

[{table_opts}]
|===
| Attribute | Type | Description | Scope
| MessageWorkerQueueSize | int | The size of the queue of discovery messages that are waiting to be sent to other nodes. | Node
| AvgMessageProcessingTime | long | Average message processing time. | Node
|===
