Everything we've built so far is reactive. You open Grafana, you look at things, you notice problems. That's fine if you enjoy staring at dashboards — no judgment, we're all here for the same reason. But the real value of an observability stack is when it taps you on the shoulder at 2am and says "hey, that ZFS drive just went offline" before you wake up to find your media library gone.
This article is about building that tap on the shoulder. We're going to configure Grafana alerting to notify you via Discord and email when something genuinely needs attention — with enough tuning to avoid the opposite problem, where your phone buzzes every five minutes because nginx got a 404 from a bot somewhere and your threshold is set to "literally anything."
The goal is actionable alerts. Not noisy alerts. Not silent alerts. Actionable ones.
How Grafana Alerting Works
Three components:
- Alert Rules — the conditions being evaluated ("disk usage above 85%")
- Contact Points — where notifications go (Discord, email, etc.)
- Notification Policies — which alerts go where, and how often
We're provisioning all of this as code using Grafana's provisioning system. Alert rules live in YAML files on disk, load automatically when Grafana starts, and get version controlled in the git repo. The alternative is clicking through the UI to create rules manually — which works until you recreate the container and lose everything. We've been burned enough times in this project to know better.
Setting Up Contact Points
To get a Discord webhook URL:
- Open Discord and go to your server
- Edit a channel (or create a new #homelab-alerts channel)
- Go to Integrations → Webhooks → New Webhook
- Name it Grafana and copy the webhook URL
Add it to your .env file on Nexus:
DISCORD_WEBHOOK_URL=https://discord.com/api/webhooks/your/webhook/url
Make sure Grafana's environment variables include it so the reference in the contact points file resolves correctly:
grafana:
  environment:
    - DISCORD_WEBHOOK_URL=${DISCORD_WEBHOOK_URL}
Create config/grafana/provisioning/alerting/contact-points.yaml:
apiVersion: 1
contactPoints:
  - orgId: 1
    name: Discord and Email
    receivers:
      - uid: discord_critical
        type: discord
        settings:
          url: ${DISCORD_WEBHOOK_URL}
          message: "**{{ .CommonAnnotations.summary }}**\n{{ .CommonAnnotations.description }}\n**Status:** {{ .Status | toUpper }}\n**Severity:** {{ .CommonLabels.severity | toUpper }}"
      - uid: email_critical
        type: email
        settings:
          addresses: [email protected]
  - orgId: 1
    name: Discord
    receivers:
      - uid: discord_warning
        type: discord
        settings:
          url: ${DISCORD_WEBHOOK_URL}
          message: "**{{ .CommonAnnotations.summary }}**\n{{ .CommonAnnotations.description }}\n**Status:** {{ .Status | toUpper }}\n**Severity:** {{ .CommonLabels.severity | toUpper }}"
The explicit message: field in the contact point settings is important. Grafana's default Discord message template sends raw Go template syntax as literal text in some versions. The explicit field overrides the default and keeps your Discord notifications clean. That one cost me about two hours to figure out.
Notification Policies
Create config/grafana/provisioning/alerting/notification-policies.yaml:
apiVersion: 1
policies:
  - orgId: 1
    receiver: Discord
    group_by: ['alertname', 'instance']
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 12h
    routes:
      - receiver: Discord and Email
        matchers:
          - severity = critical
        repeat_interval: 4h
      - receiver: Discord
        matchers:
          - severity = warning
        repeat_interval: 12h
The repeat_interval settings matter more than you'd think. We learned this during a power outage that took down all three Proxmox nodes: with repeat_interval: 1h on critical alerts, Discord became a very loud place very quickly at 2am.
The policy above means:
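- Critical alerts: Discord and email, first notification after the 30-second group_wait, reminder every 4 hours while still firing
- Warning alerts: Discord only, first notification after the same group_wait, reminder every 12 hours while still firing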
Reasonable balance between "I need to know about this" and "please stop."
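For a rough feel of what those intervals mean during a long outage, here is a back-of-the-envelope sketch in Python. It assumes a deliberately simplified model (one alert group that fires continuously, one initial notification, then one reminder per full repeat_interval) and ignores group_wait and group_interval. The function name and the 8-hour outage are made up for illustration.

```python
import math


def notifications_during_outage(outage_hours: float, repeat_interval_hours: float) -> int:
    """Rough message count for ONE alert group that stays firing.

    Simplified model (an assumption, not Grafana's exact scheduler):
    one initial notification, then one reminder each time a full
    repeat_interval elapses while the alert is still firing.
    """
    if repeat_interval_hours <= 0:
        raise ValueError("repeat_interval_hours must be positive")
    return 1 + math.floor(outage_hours / repeat_interval_hours)


if __name__ == "__main__":
    outage = 8  # hypothetical overnight outage, in hours
    for label, interval in [("1h", 1), ("4h", 4), ("12h", 12)]:
        count = notifications_during_outage(outage, interval)
        print(f"repeat_interval={label}: {count} message(s) per alert group")
```

Multiply the result by the number of alert groups that are firing. With group_by on alertname and instance, every host that goes down is its own group, so the gap between 1h and 4h adds up quickly.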
Prometheus Alert Rules
Create config/grafana/provisioning/alerting/prometheus-rules.yaml.
Before you use this file, get your Prometheus datasource UID:
curl -s -u admin:your_password http://localhost:3001/api/datasources | python3 -m json.tool | grep -E '"uid"|"name"'
Replace YOUR_PROMETHEUS_UID throughout the file with the actual UID.
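If you would rather script the replacement than do it by hand, a minimal Python sketch might look like this. It assumes the /api/datasources response is a JSON list of objects with name and uid fields, and that your datasource is named "Prometheus". The sample response and the UIDs in it are invented for illustration.

```python
import json


def datasource_uids(api_response: str) -> dict:
    """Map datasource name -> uid from the JSON text returned by /api/datasources."""
    return {ds["name"]: ds["uid"] for ds in json.loads(api_response)}


def fill_placeholder(rules_yaml: str, placeholder: str, uid: str) -> str:
    """Replace every occurrence of the placeholder with the real UID."""
    return rules_yaml.replace(placeholder, uid)


if __name__ == "__main__":
    # Hypothetical API response; your names and UIDs will differ.
    sample = (
        '[{"name": "Prometheus", "uid": "abc123", "type": "prometheus"},'
        ' {"name": "Loki", "uid": "def456", "type": "loki"}]'
    )
    uids = datasource_uids(sample)
    snippet = "datasourceUid: YOUR_PROMETHEUS_UID"
    print(fill_placeholder(snippet, "YOUR_PROMETHEUS_UID", uids["Prometheus"]))
```

In practice you would feed it the output of the curl command above and the contents of prometheus-rules.yaml, then write the result back to disk.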
apiVersion: 1
groups:
  - orgId: 1
    name: host-alerts
    folder: Infrastructure
    interval: 1m
    rules:
      # Host Down
      - uid: host_down
        title: Host Down
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: YOUR_PROMETHEUS_UID
            model:
              expr: up{job="node"}
              intervalMs: 1000
              maxDataPoints: 43200
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              expression: A
              reducer: last
              settings:
                mode: dropNN
              type: reduce
          - refId: C
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [1]
                    type: lt
                  operator:
                    type: and
                  query:
                    params: [B]
                  reducer:
                    type: last
              expression: B
              type: threshold
        noDataState: Alerting
        execErrState: Alerting
        for: 2m
        annotations:
          summary: "Host down"
          description: "A host has been unreachable for more than 2 minutes - check Prometheus targets"
        labels:
          severity: critical
      # Disk Space Warning - Boot Drives
      - uid: disk_warning_boot
        title: Disk Space Warning - Boot Drive
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: YOUR_PROMETHEUS_UID
            model:
              expr: |
                (1 - node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"}
                  / node_filesystem_size_bytes{mountpoint="/", fstype!="tmpfs"}) * 100
              intervalMs: 1000
              maxDataPoints: 43200
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              expression: A
              reducer: last
              settings:
                mode: dropNN
              type: reduce
          - refId: C
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [75]
                    type: gt
                  operator:
                    type: and
                  query:
                    params: [B]
                  reducer:
                    type: last
              expression: B
              type: threshold
        noDataState: NoData
        execErrState: Error
        for: 5m
        annotations:
          summary: "Disk space warning - boot drive"
          description: "A boot drive is above 75% capacity - check node_exporter metrics"
        labels:
          severity: warning
      # Disk Space Critical - Boot Drives
      - uid: disk_critical_boot
        title: Disk Space Critical - Boot Drive
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: YOUR_PROMETHEUS_UID
            model:
              expr: |
                (1 - node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"}
                  / node_filesystem_size_bytes{mountpoint="/", fstype!="tmpfs"}) * 100
              intervalMs: 1000
              maxDataPoints: 43200
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              expression: A
              reducer: last
              settings:
                mode: dropNN
              type: reduce
          - refId: C
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [90]
                    type: gt
                  operator:
                    type: and
                  query:
                    params: [B]
                  reducer:
                    type: last
              expression: B
              type: threshold
        noDataState: NoData
        execErrState: Error
        for: 5m
        annotations:
          summary: "Disk space critical - boot drive"
          description: "A boot drive is above 90% capacity - immediate attention required"
        labels:
          severity: critical
      # Disk Space Warning - Log Storage
      - uid: disk_warning_logs
        title: Disk Space Warning - Log Storage
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: YOUR_PROMETHEUS_UID
            model:
              expr: |
                (1 - node_filesystem_avail_bytes{instance="nexus", mountpoint="/media/disk1"}
                  / node_filesystem_size_bytes{instance="nexus", mountpoint="/media/disk1"}) * 100
              intervalMs: 1000
              maxDataPoints: 43200
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              expression: A
              reducer: last
              settings:
                mode: dropNN
              type: reduce
          - refId: C
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [70]
                    type: gt
                  operator:
                    type: and
                  query:
                    params: [B]
                  reducer:
                    type: last
              expression: B
              type: threshold
        noDataState: NoData
        execErrState: Error
        for: 5m
        annotations:
          summary: "Log storage space warning"
          description: "Log storage on Nexus is above 70% - consider adjusting retention settings"
        labels:
          severity: warning
      # Disk Space Critical - Log Storage
      - uid: disk_critical_logs
        title: Disk Space Critical - Log Storage
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: YOUR_PROMETHEUS_UID
            model:
              expr: |
                (1 - node_filesystem_avail_bytes{instance="nexus", mountpoint="/media/disk1"}
                  / node_filesystem_size_bytes{instance="nexus", mountpoint="/media/disk1"}) * 100
              intervalMs: 1000
              maxDataPoints: 43200
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              expression: A
              reducer: last
              settings:
                mode: dropNN
              type: reduce
          - refId: C
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [85]
                    type: gt
                  operator:
                    type: and
                  query:
                    params: [B]
                  reducer:
                    type: last
              expression: B
              type: threshold
        noDataState: NoData
        execErrState: Error
        for: 5m
        annotations:
          summary: "Log storage space critical"
          description: "Log storage on Nexus is above 85% - reduce retention or expand storage"
        labels:
          severity: critical
      # Disk Space Warning - Media Pool
      - uid: disk_warning_mediapool
        title: Disk Space Warning - Media Pool
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: YOUR_PROMETHEUS_UID
            model:
              expr: |
                (1 - node_filesystem_avail_bytes{instance="vault", mountpoint="/MediaPool"}
                  / node_filesystem_size_bytes{instance="vault", mountpoint="/MediaPool"}) * 100
              intervalMs: 1000
              maxDataPoints: 43200
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              expression: A
              reducer: last
              settings:
                mode: dropNN
              type: reduce
          - refId: C
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [80]
                    type: gt
                  operator:
                    type: and
                  query:
                    params: [B]
                  reducer:
                    type: last
              expression: B
              type: threshold
        noDataState: NoData
        execErrState: Error
        for: 5m
        annotations:
          summary: "Media pool space warning"
          description: "MediaPool on Vault is above 80% - time to think about expansion"
        labels:
          severity: warning
      # Disk Space Critical - Media Pool
      - uid: disk_critical_mediapool
        title: Disk Space Critical - Media Pool
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: YOUR_PROMETHEUS_UID
            model:
              expr: |
                (1 - node_filesystem_avail_bytes{instance="vault", mountpoint="/MediaPool"}
                  / node_filesystem_size_bytes{instance="vault", mountpoint="/MediaPool"}) * 100
              intervalMs: 1000
              maxDataPoints: 43200
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              expression: A
              reducer: last
              settings:
                mode: dropNN
              type: reduce
          - refId: C
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [90]
                    type: gt
                  operator:
                    type: and
                  query:
                    params: [B]
                  reducer:
                    type: last
              expression: B
              type: threshold
        noDataState: NoData
        execErrState: Error
        for: 5m
        annotations:
          summary: "Media pool space critical"
          description: "MediaPool on Vault is above 90% - immediate attention required"
        labels:
          severity: critical
      # ZFS Pool Health
      # node_exporter's node_zfs_zpool_state exposes one series per state
      # (value 1 for the pool's current state, 0 otherwise), so this rule
      # watches the "online" series and fires when it drops below 1.
      - uid: zfs_pool_health
        title: ZFS Pool Degraded
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: YOUR_PROMETHEUS_UID
            model:
              expr: node_zfs_zpool_state{instance="vault", state="online"}
              intervalMs: 1000
              maxDataPoints: 43200
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              expression: A
              reducer: last
              settings:
                mode: dropNN
              type: reduce
          - refId: C
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [1]
                    type: lt
                  operator:
                    type: and
                  query:
                    params: [B]
                  reducer:
                    type: last
              expression: B
              type: threshold
        noDataState: Alerting
        execErrState: Alerting
        for: 1m
        annotations:
          summary: "ZFS pool degraded"
          description: "MediaPool on Vault is no longer in ONLINE state - immediate attention required"
        labels:
          severity: critical
      # High Memory Usage
      - uid: high_memory
        title: High Memory Usage
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: YOUR_PROMETHEUS_UID
            model:
              expr: |
                (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
              intervalMs: 1000
              maxDataPoints: 43200
          - refId: B
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: __expr__
            model:
              expression: A
              reducer: last
              settings:
                mode: dropNN
              type: reduce
          - refId: C
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [90]
                    type: gt
                  operator:
                    type: and
                  query:
                    params: [B]
                  reducer:
                    type: last
              expression: B
              type: threshold
        noDataState: NoData
        execErrState: Error
        for: 10m
        annotations:
          summary: "High memory usage"
          description: "A host has been above 90% memory usage for more than 10 minutes"
        labels:
          severity: warning
      # High CPU Usage
      - uid: high_cpu
        title: High CPU Usage
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 900
              to: 0
            datasourceUid: YOUR_PROMETHEUS_UID
            model:
              expr: |
                100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
              intervalMs: 1000
              maxDataPoints: 43200
          - refId: B
            relativeTimeRange:
              from: 900
              to: 0
            datasourceUid: __expr__
            model:
              expression: A
              reducer: last
              settings:
                mode: dropNN
              type: reduce
          - refId: C
            relativeTimeRange:
              from: 900
              to: 0
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [85]
                    type: gt
                  operator:
                    type: and
                  query:
                    params: [B]
                  reducer:
                    type: last
              expression: B
              type: threshold
        noDataState: NoData
        execErrState: Error
        for: 15m
        annotations:
          summary: "High CPU usage"
          description: "A host has been above 85% CPU for more than 15 minutes"
        labels:
          severity: warning
      # Pi-hole Down
      - uid: pihole_down
        title: Pi-hole Down
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: YOUR_PROMETHEUS_UID
            model:
              expr: up{job="pihole"}
              intervalMs: 1000
              maxDataPoints: 43200
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              expression: A
              reducer: last
              settings:
                mode: dropNN
              type: reduce
          - refId: C
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [1]
                    type: lt
                  operator:
                    type: and
                  query:
                    params: [B]
                  reducer:
                    type: last
              expression: B
              type: threshold
        noDataState: Alerting
        execErrState: Alerting
        for: 2m
        annotations:
          summary: "Pi-hole down"
          description: "A Pi-hole instance has stopped reporting - DNS resolution may be affected"
        labels:
          severity: critical
Loki Alert Rules
Create config/grafana/provisioning/alerting/loki-rules.yaml. Get your Loki datasource UID the same way as Prometheus above and replace YOUR_LOKI_UID throughout.
apiVersion: 1
groups:
  - orgId: 1
    name: loki-alerts
    folder: Logs
    interval: 5m
    rules:
      # Error Rate Spike
      - uid: error_rate_spike
        title: Error Rate Spike
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: YOUR_LOKI_UID
            model:
              expr: |
                sum by (container_name, host) (
                  count_over_time({job="docker", container_name!="grafana"}
                    |~ "(?i)(error|exception|fatal|panic)"
                    != "401"
                    != "Unauthorized"
                    != "token needs to be rotated"
                    != "Ignoring invalid configuration option"
                    != "Error parsing filter"
                    [5m])
                )
              intervalMs: 1000
              maxDataPoints: 43200
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              expression: A
              reducer: last
              settings:
                mode: dropNN
              type: reduce
          - refId: C
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [25]
                    type: gt
                  operator:
                    type: and
                  query:
                    params: [B]
                  reducer:
                    type: last
              expression: B
              type: threshold
        noDataState: NoData
        execErrState: Error
        for: 5m
        annotations:
          summary: "Error rate spike detected"
          description: "A container has logged more than 25 errors in the last 5 minutes - check Grafana Explore with query: {job=\"docker\"} |~ \"(?i)(error|exception|fatal|panic)\""
        labels:
          severity: warning
      # Plex Transcoding Error
      - uid: plex_transcode_error
        title: Plex Transcoding Error
        isPaused: false
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: YOUR_LOKI_UID
            model:
              expr: |
                count_over_time({job="plex"}
                  |~ "(?i)(transcoder exited with error|transcode.*failed|failed.*transcode|error starting transcode|transcoder crashed)" [5m])
              intervalMs: 1000
              maxDataPoints: 43200
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              expression: A
              reducer: last
              settings:
                mode: dropNN
              type: reduce
          - refId: C
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [1]
                    type: gt
                  operator:
                    type: and
                  query:
                    params: [B]
                  reducer:
                    type: last
              expression: B
              type: threshold
        noDataState: NoData
        execErrState: Error
        for: 0m
        annotations:
          summary: "Plex transcoding error detected"
          description: "Plex has logged transcoding failures in the last 5 minutes - check Grafana Explore with query: {job=\"plex\"} |~ \"(?i)(transcoder exited with error|transcode.*failed)\""
        labels:
          severity: warning
Applying the Config
Make sure the provisioning directory is mounted in Grafana's compose:
grafana:
  volumes:
    - /media/disk1/logStorage/grafana:/var/lib/grafana
    - ./config/grafana/provisioning:/etc/grafana/provisioning:ro
Set permissions on the provisioning directory:
sudo chmod -R 777 /home/youruser/docker-projects/logStack/config/grafana/
Restart Grafana:
docker compose restart grafana
Check the logs to confirm provisioning succeeded:
docker compose logs grafana | grep -i "provision\|alert\|error" | tail -20
Looking for:
logger=provisioning.alerting msg="finished to provision alerting"
Without errors between start and finish.
Verifying in the UI
Go to Alerting → Alert Rules. You should see two folders — Infrastructure and Logs — with rules inside them. All rules should show Normal state with ok health after a few evaluation cycles.
If any show Error health, click View to see the error message. The most common cause is a datasource UID mismatch. Double check the UIDs in your YAML files against what Grafana has configured.
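One way to catch a UID mismatch before Grafana does is to scan the rule files for datasourceUid values you don't recognise. The sketch below is a plain-text check in Python, not a YAML parser. The function name, the sample text, and the UIDs abc123 and def456 are made up for illustration; __expr__ is always allowed because the rules above use it for the reduce and threshold steps.

```python
import re

# Matches lines such as "datasourceUid: abc123" at any indentation.
UID_LINE = re.compile(r"^[ \t]*datasourceUid:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def unknown_datasource_uids(rules_yaml: str, known_uids: set) -> set:
    """Return datasourceUid values that are neither known nor __expr__."""
    found = set(UID_LINE.findall(rules_yaml))
    return found - set(known_uids) - {"__expr__"}


if __name__ == "__main__":
    # Hypothetical excerpt from a rules file with one forgotten placeholder.
    sample_rules = (
        "datasourceUid: abc123\n"
        "    datasourceUid: __expr__\n"
        "datasourceUid: YOUR_LOKI_UID\n"
    )
    # Hypothetical UIDs, as reported by /api/datasources.
    print(unknown_datasource_uids(sample_rules, {"abc123", "def456"}))
```

An empty set means every rule points at a datasource you listed. Anything else, typically a leftover YOUR_..._UID placeholder, is worth fixing before you restart Grafana.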
Tuning Your Alerts
Out of the box the error rate spike alert will probably fire immediately. Here's what triggered false positives in my setup and how I handled each one:
Grafana's own 401 errors — if Grafana is behind a Cloudflare tunnel, session token rotation generates a steady stream of 401 errors. These match the error filter. Exclude Grafana entirely with container_name!="grafana" and filter out 401 and Unauthorized strings.
nginx 404 errors — any nginx container with internet exposure logs [error] for every 404. Bots constantly probe for robots.txt, favicon.ico, /.git/index, and other files that don't exist. Harmless but noisy. Worth noting: if you have port forwarding rules on your router pointing at these containers, turn them off. Route everything through Cloudflare tunnels instead. You'll be amazed how much noise disappears.
Application-specific noise — Ghost logs a MySQL2 warning as error level on every database connection. Completely harmless, extremely chatty. Add it to the exclusions.
The general exclusion pattern in LogQL:
{job="docker", container_name!="grafana"}
  |~ "(?i)(error|exception|fatal|panic)"
  != "401"
  != "Unauthorized"
  != "your noisy string here"
Raising the threshold is also valid. A threshold of 25 errors per 5 minutes means a container has to be genuinely misbehaving before you hear about it.
Testing Delivery
Grafana has a built-in test feature:
- Go to Alerting → Contact Points
- Click the test button next to any receiver
- Click Send test notification
Check Discord and your inbox. If the Discord message shows raw Go template syntax instead of rendered content, you have a duplicate contact point — one created manually and one from provisioning. Find and delete the duplicate via the API:
curl -s -u admin:your_password http://localhost:3001/api/v1/provisioning/contact-points | python3 -m json.tool | grep -E '"uid"|"name"'
curl -X DELETE -u admin:your_password http://localhost:3001/api/v1/provisioning/contact-points/THE_DUPLICATE_UID
Where We Are
- ✅ Discord notifications for all alert severities
- ✅ Email notifications for critical alerts
- ✅ Host down detection with 2 minute grace period
- ✅ Disk space warnings and critical alerts for all drives
- ✅ ZFS pool health monitoring
- ✅ High CPU and memory alerts with sensible thresholds
- ✅ Pi-hole availability monitoring
- ✅ Error rate spike detection with noise filtering
- ✅ Sane repeat intervals that won't wake your household during a power outage
The Series
- Introduction & Architecture –– Stop Flying Blind, Series Introduction
- Setting Up the Core Stack — Loki, Grafana, and Fluent Bit on your main host
- Shipping Logs from Multiple Hosts — expanding Fluent Bit across your network
- Metrics with Prometheus — node_exporter, Pi-hole metrics, and Proxmox monitoring
- Alerting — getting notified when things actually break
- Lessons Learned — everything that went wrong and how we fixed it
In the final article we pull back and talk about everything that went wrong during this build — the version tag that didn't exist, the AppArmor permissions saga, the GELF logs firing into the void, and the Proxmox firewall that kept eating our iptables rules.
One more coffee. You're almost there.