Prometheus and Grafana Monitoring Stack Setup
Why Monitor Your Homelab?
Monitoring helps you:
- Identify performance bottlenecks
- Detect issues before they become problems
- Understand resource usage patterns
- Plan capacity upgrades
- Track application health
Architecture Overview
We’ll set up:
- Prometheus: Metrics collection and storage
- Grafana: Visualization and dashboards
- Node Exporter: System metrics from Linux hosts
- cAdvisor: Container metrics
- AlertManager: Alert management
Docker Compose Setup
Directory Structure
monitoring/
├── docker-compose.yml
├── prometheus/
│   ├── prometheus.yml
│   ├── alerts.yml
│   ├── alertmanager.yml
│   └── rules/
└── grafana/
    ├── provisioning/
    │   └── dashboards/
    └── dashboards/
Docker Compose File
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alerts.yml:/etc/prometheus/alerts.yml
      # Without this mount the "rules/*.yml" glob in prometheus.yml matches nothing
      - ./prometheus/rules:/etc/prometheus/rules
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
    extra_hosts:
      # Lets the container reach services on the Docker host (Docker 20.10+)
      - "host.docker.internal:host-gateway"
    ports:
      - "9090:9090"
    networks:
      - monitoring
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: admin_password
      GF_INSTALL_PLUGINS: 'grafana-piechart-panel'
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    networks:
      - monitoring
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      # Report the host's filesystems (mounted at /rootfs) instead of the container's
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    networks:
      - monitoring
    restart: unless-stopped

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8080:8080"
    networks:
      - monitoring
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    volumes:
      - ./prometheus/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    ports:
      - "9093:9093"
    networks:
      - monitoring
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:

networks:
  monitoring:
    driver: bridge
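Once the stack is started with `docker-compose up -d`, it helps to confirm that every service answers on its published port. The following is a minimal sketch rather than a definitive health check: the function names `stack_endpoints` and `check_stack` are made up for this example, and it assumes `curl` is installed on the Docker host and that the ports match the compose file above.

```shell
#!/bin/sh
# Readiness URL for each service, using the host ports published in docker-compose.yml.
stack_endpoints() {
  cat <<'EOF'
prometheus http://localhost:9090/-/ready
grafana http://localhost:3000/api/health
node-exporter http://localhost:9100/metrics
cadvisor http://localhost:8080/healthz
alertmanager http://localhost:9093/-/ready
EOF
}

# Probe every endpoint; the return status is non-zero if any service fails to answer.
check_stack() {
  stack_failed=0
  while read -r svc url; do
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      echo "ok   $svc"
    else
      echo "FAIL $svc ($url)"
      stack_failed=1
    fi
  done <<EOF
$(stack_endpoints)
EOF
  return "$stack_failed"
}
```

Calling `check_stack` after startup prints one line per service. A `FAIL` line usually means the container is still starting or has exited, in which case `docker-compose logs <service>` is the next place to look.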
Prometheus Configuration
Create prometheus/prometheus.yml:
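global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'homelab'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - "alerts.yml"
  - "rules/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          instance: 'homelab'

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  # Docker daemon metrics. Requires "metrics-addr" in the host's /etc/docker/daemon.json,
  # bound to an address the container can reach. Inside the Prometheus container,
  # localhost is the container itself, so target the host instead. On Linux,
  # host.docker.internal only resolves if the prometheus service has an
  # extra_hosts entry mapping it to host-gateway.
  - job_name: 'docker'
    static_configs:
      - targets: ['host.docker.internal:9323']

  # Add external targets
  - job_name: 'external-servers'
    static_configs:
      - targets: ['192.168.1.10:9100']
        labels:
          instance: 'server1'
      - targets: ['192.168.1.11:9100']
        labels:
          instance: 'server2'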
Alert Rules
Create prometheus/alerts.yml:
groups:
  - name: system
    interval: 30s
    rules:
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85%"

      - alert: HighCPUUsage
        expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80%"

      - alert: DiskFull
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Less than 10% free disk space"

      - alert: ServiceDown
        expr: up{job="node"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"
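Before relying on a new prometheus.yml or alerts.yml, it is worth validating the files and then asking the running server to reload them. The sketch below assumes the stack is already up and uses `promtool`, which ships in the prom/prometheus image. The function name `validate_and_reload` and the `COMPOSE` variable are hypothetical helpers for this example, not part of any tool.

```shell
#!/bin/sh
# COMPOSE is a hypothetical override, e.g. COMPOSE="docker compose" for Compose v2.
validate_and_reload() {
  compose_cmd=${COMPOSE:-docker-compose}
  # "check config" validates prometheus.yml and the rule files it references.
  $compose_cmd exec -T prometheus promtool check config /etc/prometheus/prometheus.yml || return 1
  # SIGHUP asks the running Prometheus to reload its configuration and rules.
  $compose_cmd kill -s SIGHUP prometheus
}
```

Run `validate_and_reload` from the monitoring/ directory after each edit. If validation fails, the reload is skipped and the server keeps running with its previous configuration.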
Grafana Setup
Access Grafana
- Open http://localhost:3000
- Log in with admin/admin_password
- Change the password immediately
Add Prometheus Data Source
- Settings → Data Sources → Add data source
- Choose Prometheus
- URL: http://prometheus:9090
- Save & Test
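The same data source can also be created without the UI through Grafana's HTTP API, which is handy when rebuilding the stack. This is a sketch under assumptions: it uses the admin credentials from the compose file (adjust them if the password has been changed), and `datasource_payload`, `add_prometheus_datasource`, `GRAFANA_URL` and `GRAFANA_AUTH` are names invented for this example.

```shell
#!/bin/sh
# JSON body describing the Prometheus data source, matching the URL used in the UI steps.
datasource_payload() {
  printf '%s' '{"name":"Prometheus","type":"prometheus","access":"proxy","url":"http://prometheus:9090","isDefault":true}'
}

# POST the data source to Grafana; GRAFANA_URL and GRAFANA_AUTH are hypothetical overrides.
add_prometheus_datasource() {
  curl -fsS -X POST \
    -u "${GRAFANA_AUTH:-admin:admin_password}" \
    -H 'Content-Type: application/json' \
    -d "$(datasource_payload)" \
    "${GRAFANA_URL:-http://localhost:3000}/api/datasources"
}
```

The URL stays `http://prometheus:9090` because Grafana reaches Prometheus over the shared `monitoring` network, not through the host's published port.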
Import Dashboards
- Dashboards → Browse
- Click Import
- Enter dashboard ID (examples):
  - Node Exporter: 1860
  - Docker: 1229
  - Kubernetes: 6417
Create Custom Dashboard
- Dashboards → Create → Dashboard
- Add Panel
- Select Prometheus data source
- Enter PromQL query:
# CPU Usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory Usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk Usage
(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes)) * 100
Useful PromQL Queries
# CPU usage per instance (percent)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Fraction of memory still available (0 to 1)
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
# Disk I/O
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])
# Network traffic
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
# Container memory usage
container_memory_usage_bytes
# Container CPU usage
rate(container_cpu_usage_seconds_total[5m])
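Any of these queries can be tried from a terminal through the Prometheus HTTP API before building a panel around it. The wrapper below is a small sketch: `promql` and `PROM_URL` are names made up for this example, and it assumes `curl` is available and Prometheus is published on port 9090 as in the compose file.

```shell
#!/bin/sh
# Run an instant PromQL query; PROM_URL is a hypothetical override for a non-default address.
promql() {
  # -G sends the URL-encoded query as a GET parameter, so braces and quotes survive intact.
  curl -fsS -G "${PROM_URL:-http://localhost:9090}/api/v1/query" --data-urlencode "query=$1"
}
```

For example, `promql 'rate(container_cpu_usage_seconds_total[5m])'` returns the result as JSON. Single quotes around the query keep the shell from interpreting the brackets and braces.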
AlertManager Configuration
Create prometheus/alertmanager.yml:
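global:
  resolve_timeout: 5m

route:
  receiver: 'default'
  group_by: ['instance', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h

receivers:
  - name: 'default'
    webhook_configs:
      # Inside the alertmanager container, localhost is the container itself;
      # point this at an address where your webhook receiver is reachable.
      - url: 'http://localhost:5001/'
        send_resolved: true
  - name: 'email'
    # Not referenced by the route above; set receiver: 'email' (or add a child route) to use it.
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.gmail.com:587'
        auth_username: '[email protected]'
        auth_password: 'app_password'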
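A quick way to confirm that routing and receivers behave as expected is to validate the file and then post a hand-made alert to AlertManager. The sketch below assumes the stack is running and uses `amtool`, which ships in the prom/alertmanager image. The alert name, its summary text, and the helper names (`test_alert_payload`, `send_test_alert`, `COMPOSE`, `ALERTMANAGER_URL`) are invented for this example.

```shell
#!/bin/sh
# Build a one-alert JSON array; the alert name and summary are made up for the test.
test_alert_payload() {
  printf '[{"labels":{"alertname":"%s","instance":"homelab","severity":"warning"},"annotations":{"summary":"Manual test alert"}}]' "$1"
}

# Validate the config inside the container, then post a test alert to the v2 API.
send_test_alert() {
  compose_cmd=${COMPOSE:-docker-compose}
  $compose_cmd exec -T alertmanager amtool check-config /etc/alertmanager/alertmanager.yml || return 1
  curl -fsS -X POST -H 'Content-Type: application/json' \
    -d "$(test_alert_payload "${1:-TestAlert}")" \
    "${ALERTMANAGER_URL:-http://localhost:9093}/api/v2/alerts"
}
```

After `send_test_alert`, the alert should appear in the AlertManager UI at http://localhost:9093 and, once `group_wait` has elapsed, reach the default receiver.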
Long-Term Storage
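Prometheus keeps 15 days of data by default. Retention is normally set with a startup flag, not in the scrape configuration, so to keep data longer add the flag to the command list of the prometheus service:
# docker-compose.yml, under services.prometheus.command
- '--storage.tsdb.retention.time=30d'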
Or integrate with external storage like:
- Thanos
- Cortex
- VictoriaMetrics
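Before raising retention, a rough disk estimate helps decide whether local storage is enough. The Prometheus storage documentation gives the rule of thumb: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample. The function below is a sketch of that arithmetic. `estimate_tsdb_bytes` is a made-up name and the figures in the usage note are illustrative, not measurements.

```shell
#!/bin/sh
# needed bytes = retention seconds * samples per second * bytes per sample
# Arguments: retention in days, ingested samples per second, bytes per sample (default 2).
estimate_tsdb_bytes() {
  retention_days=$1
  samples_per_second=$2
  bytes_per_sample=${3:-2}
  echo $(( retention_days * 86400 * samples_per_second * bytes_per_sample ))
}
```

For instance, `estimate_tsdb_bytes 30 1000` prints 5184000000, about 5.2 GB for 30 days at 1,000 samples per second. The real ingestion rate can be read from Prometheus itself with `rate(prometheus_tsdb_head_samples_appended_total[1h])`.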
Backup Strategy
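#!/bin/bash
# Stop Prometheus so the TSDB files are consistent, archive the named volume via a throwaway alpine container, then restart.
# Assumes the Compose project is named "monitoring", which makes the volume monitoring_prometheus_data.
docker-compose stop prometheus
docker run --rm -v monitoring_prometheus_data:/data:ro -v "$(pwd)":/backup alpine tar -czf "/backup/prometheus_backup_$(date +%Y%m%d).tar.gz" -C /data .
docker-compose start prometheus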
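Dated archives accumulate quickly, so a backup job usually pairs with a cleanup step. The following sketch removes old archives that match the `prometheus_backup_*.tar.gz` naming used above. The function name `prune_backups` and the 14-day default are assumptions chosen for illustration.

```shell
#!/bin/sh
# Delete backup archives last modified more than N days ago (default 14) from a directory.
prune_backups() {
  backup_dir=${1:-.}
  keep_days=${2:-14}
  find "$backup_dir" -maxdepth 1 -type f -name 'prometheus_backup_*.tar.gz' \
    -mtime +"$keep_days" -exec rm -f {} +
}
```

Running `prune_backups . 14` right after the backup script keeps roughly two weeks of archives in the current directory.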
Performance Optimization
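- Scrape Interval: Raise it (for example from 15s to 30s or 60s) if you don’t need near-real-time metrics; fewer samples means less CPU and disk usage
- Retention: Shorten the retention period when disk space is limited
- Relabel Configs: Use metric_relabel_configs to drop metrics and labels you never query
- Recording Rules: Pre-calculate expensive queries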
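As an illustration of the recording-rules item, the sketch below writes a rule file that precomputes the per-instance CPU usage expression used elsewhere in this guide. The file name `recording.yml`, the group name, and the rule name `instance:node_cpu_usage:percent` are all hypothetical. It also assumes `./prometheus/rules` is mounted into the container at `/etc/prometheus/rules` so that the `rules/*.yml` glob in prometheus.yml picks the file up.

```shell
#!/bin/sh
# Write a recording-rule file into the given rules directory (default: prometheus/rules).
write_recording_rules() {
  rules_dir=${1:-prometheus/rules}
  mkdir -p "$rules_dir"
  cat > "$rules_dir/recording.yml" <<'EOF'
groups:
  - name: node_recording
    rules:
      # Precompute per-instance CPU usage so dashboards read one cheap series.
      - record: instance:node_cpu_usage:percent
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
EOF
}
```

Once Prometheus has reloaded its configuration, panels and alerts can query `instance:node_cpu_usage:percent` directly instead of re-evaluating the full expression on every refresh.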
Troubleshooting
Metrics Not Appearing
# Check targets
curl http://localhost:9090/api/v1/targets
# Check metrics available
curl http://localhost:9100/metrics | grep node_
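The targets endpoint returns a large JSON document, so a short summary of health states is often easier to read. The helper below is a rough sketch that counts targets by health using only `grep`, `cut`, `sort` and `uniq`. It assumes the compact JSON that Prometheus emits, and `summarize_target_health` is a name invented for this example.

```shell
#!/bin/sh
# Read a /api/v1/targets response on stdin and count targets per health state.
summarize_target_health() {
  grep -o '"health":"[a-z]*"' | cut -d'"' -f4 | sort | uniq -c
}
```

Use it as `curl -s http://localhost:9090/api/v1/targets | summarize_target_health`. Any count next to `down` points to a target whose `lastError` field in the full response explains the failure.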
High Disk Usage
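# Check database size inside the container (prometheus_data is a named volume, not a local directory)
docker-compose exec prometheus du -sh /prometheus
# Reduce retention time
# In docker-compose.yml, add to the prometheus command list: '--storage.tsdb.retention.time=7d'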
Conclusion
This monitoring stack provides visibility into your homelab infrastructure. Start simple and expand as your monitoring needs grow.