MI Admin Guide

This document guides readers through setting up their own monitoring infrastructure consisting of Prometheus, Loki, Grafana and Alertmanager. The Grafana instance is also protected from unauthorized access by integrating it with the EFPF Keycloak.

Deployment and Configuration

Note: These instructions are specific to the EFPF project setup. The reader of this guide needs access to the EFPF internal repository. A generalized guideline is provided as part of the Linksmart Blog.

Deployment

Please follow the instructions mentioned in the README.md file of https://gitlab.fit.fraunhofer.de/efpf-pilots/monitoring-toolkit.

Adding a target to Prometheus

Once the exporters are set up following the instructions in the User guide, edit conf/prometheus.yml and add the following configuration.

  - job_name: cadvisor_vm1
    scrape_interval: 5s
    static_configs:
      - targets: ['vm1:8080']
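
If several hosts run the same exporter, the targets can also be grouped under a single job and tagged with extra labels. The hostnames and the label below are hypothetical and only illustrate the syntax; restart the Prometheus container or trigger a configuration reload afterwards so that the new targets are picked up.

  - job_name: cadvisor_all_vms
    scrape_interval: 5s
    static_configs:
      # hypothetical hosts running the cAdvisor exporter
      - targets: ['vm1:8080', 'vm2:8080']
        labels:
          environment: 'efpf'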

Creating Alerts

The monitoring infrastructure uses the built-in alerting functionality of Prometheus and Loki. The alerting rules for Prometheus can be seen under Alerting Rules below. Loki alerts can be created following the official guides; currently, no log-based alerting is set up for EFPF. The alerts created from these rules are then managed using Alertmanager.
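
Although no log-based alerting is currently configured, a Loki alerting rule would follow the same group/rule structure as the Prometheus rules, with a LogQL expression. The job label and threshold below are hypothetical and only sketch what such a rule could look like:

groups:
- name: log_alerts
  rules:
  - alert: dataspine_log_errors
    # LogQL: rate of log lines containing "error" for a hypothetical "dataspine" job
    expr: sum(rate({job="dataspine"} |= "error" [5m])) > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High error rate in dataspine logs"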

Alerting Rules

Alerting rules are added to a dedicated monitoring-server/conf/prometheus_alert.rules file. The sample file used in EFPF is here.

groups:
- name: targets
  rules:
  - alert: monitor_service_down
    expr: up == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "Monitor service non-operational"
      description: "Service {{ $labels.instance }} is down."

- name: containers
  rules:
  - alert: dataspine_down
    expr: absent(container_memory_usage_bytes{name="dataspine_container"})
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dataspine down"
      description: "dataspine container is down for more than 30 seconds."

The first alert, monitor_service_down, is triggered whenever any of the Prometheus exporters is down.

The second alert, dataspine_down, is triggered whenever the container named dataspine_container is not running. The absent() function returns 1 when no time series match the given selector, so the alert fires once the container has stopped reporting metrics for more than 30 seconds.
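
Prometheus only evaluates rule files that are listed under rule_files in prometheus.yml. In the EFPF toolkit this wiring is presumably already in place; for reference, a minimal entry could look like the following (the path as seen by the Prometheus process depends on how the conf/ directory is mounted and is an assumption here):

rule_files:
  # assumed to resolve to monitoring-server/conf/prometheus_alert.rules inside the container
  - 'prometheus_alert.rules'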

The running alerting rules and their status can be seen at <prometheus_url>/alerts

Alert Manager

Prometheus Alertmanager takes care of deduplicating, grouping and routing alerts to the right receivers. A simple Alertmanager configuration can look as follows:

global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'example.com:8825'
  smtp_from: 'efpf-monitor-no-reply@example.com'
  smtp_require_tls: false

# The root route on which each incoming alert enters.
route:
  receiver: 'an-example-team'
  group_by: [...]
  repeat_interval: 4h

receivers:
- name: 'an-example-team'
  email_configs:
  - to: 'an-example-mailing-list@example.com'

In the above example, Alertmanager sends each alert at most once every 4 hours. This can be expanded further to support filtering and grouping of alerts and routing them based on pattern matching.
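
As a sketch of such routing, alerts can be grouped by alert name and critical alerts matched on their severity label can be sent to a dedicated receiver with a shorter repeat interval. The receiver names and addresses below are hypothetical:

route:
  receiver: 'an-example-team'
  group_by: ['alertname']
  repeat_interval: 4h
  routes:
  # alerts with severity=critical are routed to a separate (hypothetical) receiver
  - match:
      severity: critical
    receiver: 'on-call-team'
    repeat_interval: 1h

receivers:
- name: 'an-example-team'
  email_configs:
  - to: 'an-example-mailing-list@example.com'
- name: 'on-call-team'
  email_configs:
  - to: 'on-call@example.com'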

Further examples of Alertmanager configuration can be found here.

Visualization using Grafana

1. Setting up the data source

In order to visualize the Prometheus metrics in Grafana, the Prometheus data source plugin in Grafana needs to be configured. First, log in to Grafana as admin.

Then, follow the instructions specified in the official documentation to set up the plugin.
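
Alternatively, if the Grafana instance in the toolkit is started with provisioning enabled, the data source can be declared in a provisioning file instead of being added manually through the UI. The file location and the Prometheus URL below are assumptions about the deployment:

# e.g. grafana/provisioning/datasources/prometheus.yml (path is an assumption)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090   # internal address of the Prometheus container (assumed)
    isDefault: true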

2. Adding Dashboard

See the User guide.
