Why we adopted it

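While working on the project, we kept asking how to use monitoring more effectively. Should we split the dashboards up further? But nobody can watch Grafana all day. Out of those discussions we decided to adopt Alertmanager with Slack notifications.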

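Alertmanager can send alerts to a variety of receivers, including Slack. We already use Prometheus to monitor the services and applications running in our Kubernetes cloud environment, so using the Alertmanager that comes with the Prometheus ecosystem was the natural fit. Watching Grafana dashboards continuously is not practical, and the project is on a limited budget. Given that, it matters a great deal to have the various Kubernetes components watched in real time and to be notified as soon as a problem occurs.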

In short, considering how our team communicates, what we need from monitoring, and our budget limits, Alertmanager was the clear choice. We believe it helps us keep the system stable and respond to problems quickly.

Concept

Alertmanager is the alert management tool of the Prometheus monitoring system.

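Centralized alert handling: Manages the alerts generated by Prometheus. A single Alertmanager instance can handle alerts coming from multiple Prometheus servers.
Alert routing: Alertmanager receives alerts and delivers them to the configured receivers (for example email, Slack, or SMS). This makes it easy to get each alert to the team or person it concerns.
Alert inhibition and deduplication: Suppresses related alerts and removes repeated ones to prevent a flood of notifications. For example, when one large failure triggers many related alerts, they can be grouped and delivered as a single notification.
Alert conditions: Alerting rules are written in the Prometheus Query Language (PromQL), which allows complex conditions. These rules are defined and evaluated in Prometheus itself; when a rule's condition is met, Prometheus sends the alert to Alertmanager.
High availability: Alertmanager supports a high-availability setup in which multiple instances work together to provide a reliable alerting service.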
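
As an illustration of the inhibition feature described above, here is a minimal sketch of an inhibit rule in the same values.yaml layout used later in this post. It assumes that alerts carry a `severity` label with the values `critical` and `warning`, and that related alerts share the `alertname` and `instance` labels; adjust these to whatever labels your rules actually set.

```yaml
alertmanager:
  config:
    inhibit_rules:
      # While a matching critical alert is firing, mute warning alerts
      # that have the same alertname and instance label values.
      - source_matchers:
          - severity = "critical"
        target_matchers:
          - severity = "warning"
        equal: ['alertname', 'instance']
```

With a rule like this, only the critical notification would reach Slack while both severities are firing for the same instance.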

Configuration

Create a new workspace.

Enter the email address to use with Slack.

Set the name of the new workspace.

Create a dedicated channel to receive Alertmanager notifications.

Add a Webhook URL.

Installation

Install kube-prometheus-stack with helm.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install [RELEASE_NAME] prometheus-community/kube-prometheus-stack

Register the Slack Webhook URL in the values.yaml configuration file

alertmanager:
  config:
    global:
      resolve_timeout: 5m
      slack_api_url: 'https://hooks.slack.com/services/xxx'
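    route:
      # Alerts are grouped by the labels listed here. Alerts normally carry an
      # 'alertname' label rather than 'alertmanager'; grouping by a label that
      # no alert has puts every alert into a single group.
      group_by: ['alertmanager']
      group_wait: 10s       # wait before sending the first notification for a new group
      group_interval: 1m    # wait before notifying about new alerts added to a group
      repeat_interval: 5m   # wait before re-sending a notification that is still firing
      receiver: 'slack-notifications'
      routes:
      # InfoInhibitor and Watchdog are also delivered to Slack here.
      - receiver: 'slack-notifications'
        matchers:
          - alertname =~ "InfoInhibitor|Watchdog"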
    receivers:
    - name: 'slack-notifications'
      slack_configs:
      - channel: '#k8s-alert'
        send_resolved: true
        text: |
          *Alert:* {{ .CommonLabels.alertname }}
          *Severity:* {{ .CommonLabels.severity }}
          *Instance:* {{ .CommonLabels.instance }}
          *Description:* {{ .CommonAnnotations.description }}
          *Summary:* {{ .CommonAnnotations.summary }}
        title: '[{{.Status | toUpper}}] {{ .CommonLabels.alertname }}'

✔ slack_api_url:

◾ Set this to the Slack Webhook URL created in the Slack API console

✔ receivers:

◾ Register a receiver named 'slack-notifications' to use Slack as the delivery channel; for email, the receiver would use 'email_configs' instead

✔ slack_configs:

◾ Set the name of the Slack channel that receives the alert messages
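
To show how the routing tree and receivers fit together, here is a sketch that sends critical alerts to a separate channel while everything else keeps going to '#k8s-alert'. The 'slack-critical' receiver and the '#k8s-critical' channel are hypothetical names, and the sketch assumes alerts carry a `severity` label; it is not part of the configuration used in this project.

```yaml
alertmanager:
  config:
    route:
      # Default receiver for any alert that no child route matches.
      receiver: 'slack-notifications'
      routes:
        # Child route: alerts labelled severity=critical go to their own receiver.
        - receiver: 'slack-critical'
          matchers:
            - severity = "critical"
    receivers:
      - name: 'slack-notifications'
        slack_configs:
          - channel: '#k8s-alert'
            send_resolved: true
      - name: 'slack-critical'        # hypothetical receiver name
        slack_configs:
          - channel: '#k8s-critical'  # hypothetical channel
            send_resolved: true
```

Both receivers reuse the global slack_api_url, so only the channel differs between them.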

Configuring custom prometheusrules policies

The page below collects best-practice alerting rules for a wide range of systems that you can apply to your own applications.

https://samber.github.io/awesome-prometheus-alerts/rules#kubernetes

    - alert: NodeFilesystemAlmostOutOfFiles
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available inodes left.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutoffiles
        summary: Filesystem has less than 20% inodes left.
      expr: |-
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 20
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1m
      labels:
        alertname: NodeFilesystemAlmostOutOfFiles
        severity: warning
    - alert: HighMemoryUsage
      annotations:
        description: '{{ $labels.instance }} has memory usage over 80% (current value:
          {{ $value }}%)'
        summary: High Memory Usage
      expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes
        * 100 > 80
      for: 5m
      labels:
        alertname: HighMemoryUsage
        instance: '{{ $labels.instance}}'
        severity: warning

NodeFilesystemAlmostOutOfFiles

Purpose: Warns when inode usage on a filesystem exceeds 80%.
Details: Fires when the ratio of node_filesystem_files_free (free inodes) to node_filesystem_files (total inodes) falls below 20%, which means the filesystem has almost no free inodes left.
Condition: Applies only to filesystems that are not in a read-only state.
Duration: The alert fires once the condition has held for 1 minute.

HighMemoryUsage

Purpose: Warns when the system's memory usage exceeds 80%.
Details: Fires when total memory (node_memory_MemTotal_bytes) minus available memory (node_memory_MemAvailable_bytes), divided by total memory, exceeds 80%.
Duration: The alert fires once the condition has held for 5 minutes.

These are the two rules we added.
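
One way rules like these can be registered with kube-prometheus-stack is through the chart's `additionalPrometheusRulesMap` value. The sketch below is an assumption about how that might look, not necessarily how this project wired it up; the map key 'custom-node-rules' and the group name 'custom-node.rules' are hypothetical.

```yaml
additionalPrometheusRulesMap:
  custom-node-rules:                # hypothetical map key
    groups:
      - name: custom-node.rules     # hypothetical group name
        rules:
          - alert: HighMemoryUsage
            # Used memory as a percentage of total memory, above 80%.
            expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: High Memory Usage
```

The chart renders each entry into a PrometheusRule resource, which the Prometheus Operator then loads into Prometheus.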