Observability

Purpose

These tutorials describe the complete monitoring flow for your services in Kyma. You get an overview of the monitoring tools Prometheus, Grafana, and Alertmanager, and you learn how and where to observe and visualize your service metrics so that you can watch for values that should trigger an alert.

Kyma comes with a Prometheus stack that is designed and sized to monitor Kyma's system components. To monitor your custom metrics, we recommend that you set up an additional Prometheus stack.

All the tutorials use the monitoring-custom-metrics example and one of its services called sample-metrics-8081. This service exposes the cpu_temperature_celsius custom metric on the /metrics endpoint. This custom metric is the central element of the whole tutorial set. The metric value simulates the current processor temperature and varies randomly between 60 and 90 degrees Celsius. The alerting threshold in these tutorials is 75 degrees Celsius. If the temperature reaches or exceeds this value, the Grafana dashboard, PrometheusRule, and Alertmanager notifications that you create inform you about it.
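To make the shape of such a metric concrete, the following is a minimal sketch of a service that exposes cpu_temperature_celsius on /metrics in the Prometheus text exposition format. It is not the code of the monitoring-custom-metrics example: it uses only the Python standard library, and the function names, the help text, and the port 8081 (guessed from the service name) are assumptions made for illustration.

```python
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

METRIC_NAME = "cpu_temperature_celsius"


def read_cpu_temperature() -> float:
    """Simulate the processor temperature: a random value from 60 to 90 degrees Celsius."""
    return random.uniform(60.0, 90.0)


def render_metrics(temperature: float) -> str:
    """Render a single gauge sample in the Prometheus text exposition format."""
    return (
        f"# HELP {METRIC_NAME} Current temperature of the CPU.\n"  # help text is an assumption
        f"# TYPE {METRIC_NAME} gauge\n"
        f"{METRIC_NAME} {temperature:.2f}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve a fresh simulated sample on every GET /metrics request."""

    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(read_cpu_temperature()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8081) -> None:
    """Start the HTTP server; the port is an assumption, not taken from the example."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


# Print one sample payload, as Prometheus would receive it on a scrape.
print(render_metrics(read_cpu_temperature()), end="")
```

Calling serve() starts the server so that a scraper, or a browser pointed at http://localhost:8081/metrics, receives a new random value on each request.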

Sequence of tasks

The instructions cover the following tasks:

Deploy a custom Prometheus stack, in which you deploy the kube-prometheus-stack from the upstream Helm chart.
Observe application metrics, in which you make the cpu_temperature_celsius metric available on localhost and in the Prometheus UI. You then observe how the metric value changes at the predefined 10-second interval at which Prometheus scrapes the metric values from the service's /metrics endpoint.
Create a Grafana dashboard, in which you create a Gauge-type Grafana dashboard for the cpu_temperature_celsius metric. The dashboard turns red as soon as the CPU temperature is equal to or higher than the predefined threshold of 75 degrees Celsius.
Define alerting rules, in which you define the CPUTempHigh alerting rule by creating a PrometheusRule resource. Prometheus accesses the /metrics endpoint every 10 seconds and checks the current value of the cpu_temperature_celsius metric. If the value is equal to or higher than 75 degrees Celsius, Prometheus waits for 10 seconds and checks it again. If the value is still at or above the threshold, Prometheus triggers the rule. You can observe both the rule and the alert it generates on the Prometheus dashboard.
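As a rough illustration of the last task, the sketch below builds a PrometheusRule manifest for the CPUTempHigh alert as a Python dictionary and prints it as JSON, a format that kubectl also accepts. The alert name, the metric, the 75-degree threshold, and the 10-second wait come from the description above. The resource name, the release label, the severity label, and the annotation text are assumptions, and the label that your custom Prometheus stack uses to select rules may differ.

```python
import json


def build_cpu_temp_rule(threshold_celsius: int = 75, hold: str = "10s") -> dict:
    """Build a PrometheusRule manifest for the CPUTempHigh alert."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {
            "name": "cpu-temp-rules",  # hypothetical resource name
            # Hypothetical label: it must match the rule selector of your Prometheus stack.
            "labels": {"release": "custom-monitoring"},
        },
        "spec": {
            "groups": [
                {
                    "name": "cpu.temp.rules",  # hypothetical group name
                    "rules": [
                        {
                            "alert": "CPUTempHigh",
                            # The condition holds at or above the threshold.
                            "expr": f"cpu_temperature_celsius >= {threshold_celsius}",
                            # The condition must hold this long before the alert fires.
                            "for": hold,
                            "labels": {"severity": "critical"},  # assumed severity
                            "annotations": {
                                # Assumed wording for the notification text.
                                "description": (
                                    "CPU temperature is equal to or higher than "
                                    f"{threshold_celsius} degrees Celsius"
                                ),
                            },
                        }
                    ],
                }
            ]
        },
    }


# Print the manifest; the output can be saved to a file and applied to the cluster.
print(json.dumps(build_cpu_temp_rule(), indent=2))
```

The "for" field expresses the waiting period: while the expression is true but the 10 seconds have not yet elapsed, Prometheus lists the alert as pending, and it moves to firing only if the condition still holds afterwards.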