Prometheus and Grafana Monitoring Stack Setup

Why Monitor Your Homelab?

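Monitoring helps you catch problems such as high CPU or memory usage, low disk space, and unreachable hosts before they turn into outages.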

Architecture Overview

We’ll set up:

  1. Prometheus: Metrics collection and storage
  2. Grafana: Visualization and dashboards
  3. Node Exporter: System metrics from Linux hosts
  4. cAdvisor: Container metrics
  5. AlertManager: Alert routing and notifications

Docker Compose Setup

Directory Structure

monitoring/
├── docker-compose.yml
├── prometheus/
│   ├── prometheus.yml
│   ├── alerts.yml
│   ├── alertmanager.yml
│   └── rules/
└── grafana/
    ├── provisioning/
    │   └── dashboards/
    └── dashboards/

Docker Compose File

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
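    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alerts.yml:/etc/prometheus/alerts.yml
      # Needed so the "rules/*.yml" entry in prometheus.yml can find its files
      - ./prometheus/rules:/etc/prometheus/rules
      - prometheus_data:/prometheus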
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
    ports:
      - "9090:9090"
    networks:
      - monitoring
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: admin_password
      GF_INSTALL_PLUGINS: 'grafana-piechart-panel'
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    networks:
      - monitoring
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      # Read filesystem metrics from the host root mounted at /rootfs
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    networks:
      - monitoring
    restart: unless-stopped

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8080:8080"
    networks:
      - monitoring
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    volumes:
      - ./prometheus/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    ports:
      - "9093:9093"
    networks:
      - monitoring
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:

networks:
  monitoring:
    driver: bridge
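
Once the stack is running (docker-compose up -d), a quick way to confirm that every service came up is to poll each published port. The following is a minimal sketch, not part of the original setup: the helper names (health_urls, is_healthy, check_stack) are hypothetical, and it assumes the ports from the compose file above plus the health endpoints each project exposes. Node Exporter has no dedicated health path, so the sketch requests /metrics instead.

```python
import urllib.error
import urllib.request

# Port and health path per service; ports match the compose file above.
HEALTH_PATHS = {
    "prometheus": (9090, "/-/healthy"),
    "grafana": (3000, "/api/health"),
    "node-exporter": (9100, "/metrics"),  # no dedicated health endpoint
    "cadvisor": (8080, "/healthz"),
    "alertmanager": (9093, "/-/healthy"),
}


def health_urls(host="localhost"):
    """Build one health-check URL per service for the given host."""
    return {
        name: f"http://{host}:{port}{path}"
        for name, (port, path) in HEALTH_PATHS.items()
    }


def is_healthy(url, timeout=3.0):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx response
        return False


def check_stack(host="localhost"):
    """Map each service name to True/False depending on its health check."""
    return {name: is_healthy(url) for name, url in health_urls(host).items()}
```

Calling check_stack() on the Docker host returns a dictionary such as {"prometheus": True, ...}; any False entry is a service to inspect with docker-compose logs.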

Prometheus Configuration

Create prometheus/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'homelab'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - "alerts.yml"
  - "rules/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          instance: 'homelab'

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

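  # Docker daemon metrics (requires "metrics-addr" in the Docker daemon config).
  # Inside the Prometheus container, localhost is the container itself, so target the host.
  # On Linux, add extra_hosts: ["host.docker.internal:host-gateway"] to the prometheus service.
  - job_name: 'docker'
    static_configs:
      - targets: ['host.docker.internal:9323']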

  # Add external targets
  - job_name: 'external-servers'
    static_configs:
      - targets: ['192.168.1.10:9100']
        labels:
          instance: 'server1'
      - targets: ['192.168.1.11:9100']
        labels:
          instance: 'server2'

Alert Rules

Create prometheus/alerts.yml:

groups:
  - name: system
    interval: 30s
    rules:
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.85
        for: 5m
        labels:
          # severity is one of the labels the AlertManager route groups by
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85%"

      - alert: HighCPUUsage
        expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80%"

      - alert: DiskFull
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Less than 10% free disk space"

      - alert: ServiceDown
        expr: up{job="node"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"
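
To make the thresholds concrete, the arithmetic behind the memory and CPU expressions above can be reproduced by hand. This is only an illustrative sketch: the function names and the sample readings are hypothetical, and the CPU helper models a single core, whereas the rule averages over all cores of an instance.

```python
def memory_usage_fraction(mem_available_bytes, mem_total_bytes):
    """Mirror of the rule expression: 1 - (MemAvailable / MemTotal)."""
    return 1 - (mem_available_bytes / mem_total_bytes)


def cpu_usage_percent(idle_start, idle_end, window_seconds):
    """Mirror of: 100 - rate(idle counter) * 100, for a single core.

    idle_start and idle_end are two readings of the idle-seconds counter
    taken window_seconds apart, which is what rate() works from.
    """
    idle_rate = (idle_end - idle_start) / window_seconds  # idle seconds per second
    return 100 - idle_rate * 100


# Hypothetical host: 16 GiB of RAM with 2 GiB still available
mem = memory_usage_fraction(2 * 1024**3, 16 * 1024**3)
print(mem, mem > 0.85)  # 0.875 True

# Hypothetical core that was idle for 37.5 of the last 300 seconds
cpu = cpu_usage_percent(1000.0, 1037.5, 300)
print(cpu, cpu > 80)  # 87.5 True
```

Both sample values cross their thresholds, so the corresponding alerts would fire once the condition had held for the 5m given in for.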

Grafana Setup

Access Grafana

  1. Open http://localhost:3000
  2. Log in with admin/admin_password
  3. Change the password immediately

Add Prometheus Data Source

  1. Settings → Data Sources → Add data source
  2. Choose Prometheus
  3. URL: http://prometheus:9090
  4. Save & Test

Import Dashboards

  1. Dashboards → Browse
  2. Click Import
  3. Enter dashboard ID (examples):
    • Node Exporter: 1860
    • Docker: 1229
    • Kubernetes: 6417

Create Custom Dashboard

  1. Dashboards → Create → Dashboard
  2. Add Panel
  3. Select Prometheus data source
  4. Enter PromQL query:
# CPU Usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory Usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk Usage
(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes)) * 100

Useful PromQL Queries

# CPU usage per instance (percent)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Fraction of memory still available (0 to 1)
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# Disk I/O
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])

# Network traffic
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])

# Container memory usage
container_memory_usage_bytes

# Container CPU usage
rate(container_cpu_usage_seconds_total[5m])
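
These queries can also be run outside Grafana through the Prometheus HTTP API (GET /api/v1/query). The sketch below assumes Prometheus is reachable at http://localhost:9090, as published by the compose file; the helper names are hypothetical and only the standard library is used.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(expr, base_url="http://localhost:9090"):
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def parse_vector(payload):
    """Turn an instant-vector API response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample looks like {"metric": {...labels}, "value": [unix_ts, "value"]}
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def instant_query(expr, base_url="http://localhost:9090", timeout=5.0):
    """Run a PromQL instant query against a live Prometheus and parse it."""
    url = build_query_url(expr, base_url)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_vector(json.load(resp))
```

For example, instant_query("container_memory_usage_bytes") returns one (labels, value) pair per container. Range expressions such as rate(...[5m]) work the same way, because the outer result is still an instant vector.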

AlertManager Configuration

Create prometheus/alertmanager.yml:

global:
  resolve_timeout: 5m

route:
  receiver: 'default'
  group_by: ['instance', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h

receivers:
  - name: 'default'
    webhook_configs:
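      # Placeholder endpoint: replace with your own webhook receiver.
      # Inside the AlertManager container, localhost is the container itself.
      - url: 'http://localhost:5001/'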
        send_resolved: true

  - name: 'email'
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.gmail.com:587'
        auth_username: '[email protected]'
        auth_password: 'app_password'
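
The default receiver above posts alerts as JSON to a webhook on port 5001. For testing, a minimal receiver can be sketched with the Python standard library. This is an assumption-laden example rather than a production service: summarize_alerts, AlertHandler, and serve are hypothetical names, and the sketch reads only the alerts, status, labels, and annotations fields of the webhook payload.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_alerts(payload):
    """Return one human-readable line per alert in a webhook payload."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            "[{status}] {name} on {instance}: {summary}".format(
                status=alert.get("status", "unknown"),
                name=labels.get("alertname", "unnamed"),
                instance=labels.get("instance", "unknown"),
                summary=annotations.get("summary", ""),
            )
        )
    return lines


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
        except json.JSONDecodeError:
            self.send_response(400)  # body was not valid JSON
            self.end_headers()
            return
        for line in summarize_alerts(payload):
            print(line, flush=True)
        self.send_response(200)
        self.end_headers()


def serve(port=5001):
    """Listen on all interfaces so the AlertManager container can reach it."""
    HTTPServer(("0.0.0.0", port), AlertHandler).serve_forever()
```

Calling serve() starts the listener and blocks until interrupted. Because the receiver runs outside the AlertManager container, the webhook url has to point at an address the container can reach rather than at localhost.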

Long-Term Storage

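Prometheus keeps 15 days of data by default. To retain more, set the retention flag on the prometheus service:

# docker-compose.yml, under the prometheus service
command:
  - '--config.file=/etc/prometheus/prometheus.yml'
  - '--storage.tsdb.path=/prometheus'
  - '--storage.tsdb.retention.time=30d'

For retention beyond what local disk allows, Prometheus can also forward samples to external long-term storage through its remote_write configuration.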

Backup Strategy

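#!/bin/bash
# Stop Prometheus so the on-disk TSDB is consistent (kill -HUP only reloads the config)
docker-compose stop prometheus
# prometheus_data is a named volume; the "monitoring_" prefix assumes the project runs from monitoring/
docker run --rm -v monitoring_prometheus_data:/data:ro -v "$PWD":/backup alpine tar -czf "/backup/prometheus_backup_$(date +%Y%m%d).tar.gz" -C /data .
docker-compose start prometheus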

Performance Optimization

  1. Scrape Interval: Increase if you don’t need real-time metrics
  2. Retention: Shorten the retention period if disk space is limited
  3. Relabel Configs: Drop unnecessary labels
  4. Recording Rules: Pre-calculate expensive queries
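
The effect of the first two settings on disk usage can be estimated with the sizing rule of thumb from the Prometheus documentation: retention time in seconds, times ingested samples per second, times bytes per sample (roughly 1 to 2 bytes after compression). The sketch below applies it; the function name and the figure of 3000 active series are hypothetical, so substitute the series count of your own instance.

```python
def estimate_disk_bytes(active_series, scrape_interval_s, retention_days,
                        bytes_per_sample=2.0):
    """Rough TSDB size: retention seconds * samples per second * bytes per sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 86400
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical homelab with 3000 active series
baseline = estimate_disk_bytes(3000, 15, 30)    # 15s scrapes, 30 days kept
slower = estimate_disk_bytes(3000, 30, 30)      # doubling the interval halves the size
shorter = estimate_disk_bytes(3000, 15, 7)      # 7 days of retention instead of 30

for label, size in [("baseline", baseline), ("30s interval", slower), ("7d retention", shorter)]:
    print(f"{label}: {size / 1e9:.2f} GB")
```

The estimate is linear in both knobs, so it shows quickly whether a longer scrape interval or a shorter retention period is enough to fit the available disk.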

Troubleshooting

Metrics Not Appearing

# Check targets
curl http://localhost:9090/api/v1/targets

# Check metrics available
curl http://localhost:9100/metrics | grep node_
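
The targets endpoint returns a large JSON document, so it can help to filter it down to the targets that are failing. A minimal sketch, assuming the standard response shape of /api/v1/targets (data.activeTargets with health, scrapeUrl, and lastError fields); the function names are hypothetical.

```python
import json
import urllib.request


def unhealthy_targets(payload):
    """List (job, scrape URL, last error) for every active target that is not up."""
    problems = []
    for target in payload["data"]["activeTargets"]:
        if target.get("health") != "up":
            problems.append((
                target.get("labels", {}).get("job", "unknown"),
                target.get("scrapeUrl", ""),
                target.get("lastError", ""),
            ))
    return problems


def fetch_unhealthy(base_url="http://localhost:9090", timeout=5.0):
    """Query a live Prometheus and return its failing targets."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return unhealthy_targets(json.load(resp))
```

An empty list from fetch_unhealthy() means every configured target is being scraped; otherwise the last error usually points at a wrong hostname, a closed port, or an exporter that is not running.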

High Disk Usage

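# Check database size (prometheus_data is a named volume mounted at /prometheus)
docker-compose exec prometheus du -sh /prometheus

# Reduce retention time
# Add to the prometheus command in docker-compose.yml: --storage.tsdb.retention.time=7d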

Conclusion

This monitoring stack provides visibility into your homelab infrastructure. Start simple and expand as your monitoring needs grow.
