alert-autoconf

Форк
0

README.md

Alert-autoconf

Alerting auto configuration

This Utility uses a configuration file (alert.yaml), and automatically creates triggers and alerts.

Docker run: Apply create/change/delete commands for Triggers and Notifications

You need to be in the directory with the configuration file "./alerts.yaml"

docker run -v `pwd`:/conf registry.yourdomain.ru/alerting/alert:latest \
--user $LOGIN --password $PASSWORD \
--redis_token_storage "redis://$REDIS:6379/1"" \
--token "hands::$SERVICE::$ENV" \
--log-level DEBUG
--cluster $CLUSTER

$LOGIN - Moira account. It is best to use special "system" account
$PASSWORD - password
$REDIS - redis host for internal metadata
$SERVICE - the name of the service for which triggers and notifications will be created
$ENV - prod/staging/dev
$CLUSTER (optional) - We use this parameter when we have a new k8s cluster,
and we want to deploy previously prepared triggers for it

Validation config file (alert.yaml):

docker run -v `pwd`:/conf registry.yourdomain.ru/alerting/alert-validator:latest \
--log-level DEBUG
docker pull registry.yourdomain.ru/alerting/alert-validator:latest; \
find . -type f -iname 'alert.yaml' -print0 | \
xargs --null --max-args=1 --max-procs=1 --no-run-if-empty -I{} \
docker run -v `pwd`:/conf registry.yourdomain.ru/alerting/alert-validator:latest\
--log-level DEBUG \
--config "{}"

Tests run

docker run -ti --entrypoint=make registry.yourdomain.ru/alerting/alert:latest test

Config file describe

The configuration file consists of two required sections:

"triggers:" - configuring triggers
"alerting:" - setting of contacts for notification
---
version: "1"                  # version package "1" or "1.1"
prefix: "any-string-"         # a prefix is added for the names of triggers and tags
                              # (only for version: "1.1")

# Triggers section
triggers:
  - name: Trigger_1           # Trigger name
    targets:                  # Targets list
      - apps.services.my_service.api.query_rec_POST.request_time.200.count
      - apps.services.my_service.api.query_rec_POST.request_time.5*.count
    desc: Test description 1  
    dashboard: "http://grafana.yourdomain.ru/grafana/d/000000596/graphite-clickhouse?panelId=17&fullscreen"
                              # Link to a specific dashboard panel in Grafana
    is_pull_type: true        # Fetch data in Graphite without using Moira's internal storage. Flase by default.
    pending_interval:0        # The number of seconds that the trigger must spend in the new state 
                              # before Moira sends a notification on it
    warn_value: 15            # Execute mailing with the status' WARN
    error_value: 10           # Execute mailing with the status' ERROR 

                              # (If the value of the error_value field is greater than or equal to warn_value,
                              #  then the notification algorithm will be as follows: 
                              #  Execute 'Warning' or 'Error' - if the metric value is greater than the specified
                              # )

    ttl: 600                  # The number of seconds after which you need to make a mailing with the status from the 'ttl_state' field
    tags:                     # Subset of triggers
      - service-1
      - "cluster-{cluster}"   # Dynamic cluster name substitution (Optional!)
                              # (Strings that use {cluster} are recommended to be wrapped in quotes 
                              #  so that the yaml parser does not swear)
    ttl_state: OK             # OK | WARN | ERROR | NODATA | DEL
                              # All states are described here - http://moira.readthedocs.io/en/latest/user_guide/efficient.html

    parents:                  # List of parent triggers (optional)
      - name: "DC shutdown"   # The link to the parent is his name and a set of tags
        tags:
        - datacenters
        - global
      - name: "autoconf maintenance"
        tags:
        - autoconf

    expression: "t1 > 10 ? ERROR : (t2 > 4 ? WARN : OK)"
                              # Expression input.
                              # Supports govaluate and graphite syntax. (t1, t2, tN - in the order specified in targets)
                              # Described - http://moira.readthedocs.io/en/latest/user_guide/advanced.html

    time_start: "12:30"       # Start time of this trigger
    time_end: "23:59"         # End time of this trigger
    day_disable:              # Days in which triggers don't work (list)
      - Tue                   # Mon | Tue | Wed | Thu | Fri | Sat | Sun

  - name: Trigger_2
    targets:
      - movingAverage(apps.services.{cluster}.monitoring.modules.worker.*.handleMessage.time.count, 2)
    dashboard: grafana.yourdomain.ru/grafana/d/000000596/graphite-clickhouse?panelId=17&fullscreen
    is_pull_type: true
    ttl: 60
    ttl_state: ERROR
    expression: "t1 <= 0.3 ? ERROR : OK"
    tags:
      - sms
      - mail
      - slack


# Alerting section
alerting:
  # Simple notifications
  - tags:                            # List of tags used in trigger descriptions
      - service 1                    # It is enough to match one tag with a trigger tag
    time_start: "12:30"              # Application start time
    time_end: "23:50"                # Application end time
    day_disable:                     # Days in which this setting is ignored
      - Tue
    contacts:                        # List of available contacts
      - type: mail                   # type of notification. It is possible to specify the following values:
                                     # mail, send-sms, slack, jira
        value: qwe@qwe.qwe           # Contact value. The text in the field must be enclosed in quotation marks,
                                     # if special symbols #, @, +, etc. are used.

  # Notify with escalations
  - tags:
      - service_1
      - ERROR
    contacts:
      - type: slack
        value: #channel_service_1                 # Send a notification to the Slack channel of the service owners
      - type: send-sms
        value: +7123456789                        # Duty person service-1
    escalations:                                  # Escalation section. 
                                                  #  If it is described in the subscription, then all messages sent to Slack will have an "Acknoweledge" button
      - contacts:                                 # Escalation will be interrupted if the "Acknoweledge" button was pressed, 
          - type: jira                            #  or the trigger itself returned to the "OK" state during this time.
            value: monitoring_team_24x7
          - type: slack                           # Send a notification to the Slack team 24x7
            value: #channel_monitoring_team_24x7  
        offset_in_minutes: 5                      # Notification will be sent 5 minutes after the start

      - contacts:                     
          - type: send-sms             
            value: '+7123456789'                  # Team lead responsible for the service       
        offset_in_minutes: 10                     # Notifications will be sent 10 minutes after the start

Example configuration file:

valid_config.yaml

valid_config_blank.yaml

Использование cookies

Мы используем файлы cookie в соответствии с Политикой конфиденциальности и Политикой использования cookies.

Нажимая кнопку «Принимаю», Вы даете АО «СберТех» согласие на обработку Ваших персональных данных в целях совершенствования нашего веб-сайта и Сервиса GitVerse, а также повышения удобства их использования.

Запретить использование cookies Вы можете самостоятельно в настройках Вашего браузера.