# Metrics

Collecting, aggregating, and alerting on application and infrastructure metrics.
Metrics are numerical measurements collected over time that provide insight into system behavior. Unlike logs, which record discrete events, metrics aggregate measurements to show trends, patterns, and anomalies.
## Types of Metrics
### The Four Golden Signals

From Google's SRE book, these are the essential metrics for any service:
| Signal | Description | Example |
|---|---|---|
| Latency | Time to service a request | P50: 45ms, P99: 200ms |
| Traffic | Demand on the system | 1,500 requests/second |
| Errors | Rate of failed requests | 0.5% error rate |
| Saturation | How "full" the system is | 75% CPU, 80% memory |
### RED Method (Request-focused)
For request-driven services:
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Request latency distribution
### USE Method (Resource-focused)
For infrastructure:
- Utilization: Resource busy percentage
- Saturation: Amount of work queued
- Errors: Count of error events
## Metric Types

### Counter

A monotonically increasing value (it can only go up, or reset to zero on restart):

```javascript
// Typical counters:
//   http_requests_total
//   errors_total
//   user_signups_total

const requestCounter = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'path', 'status'],
});

// Increment
requestCounter.inc({ method: 'GET', path: '/api/users', status: '200' });
```

### Gauge
A value that can go up or down:

```javascript
// Typical gauges:
//   active_connections
//   queue_size
//   temperature

const activeConnections = new Gauge({
  name: 'active_connections',
  help: 'Number of active connections',
});

// Set, increment, or decrement the value
activeConnections.set(42);
activeConnections.inc();
activeConnections.dec();
```

### Histogram
A distribution of values in buckets:

```javascript
// Typical histograms:
//   request_duration_seconds
//   response_size_bytes

const requestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Request duration in seconds',
  labelNames: ['method', 'path'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});

// Observe a value directly
requestDuration.observe({ method: 'GET', path: '/api/users' }, 0.045);

// Or use a timer
const timer = requestDuration.startTimer({ method: 'GET', path: '/api/users' });
// ... do work ...
timer(); // Records the duration
```

### Summary
Similar to a histogram, but quantiles are calculated on the client side (which means they cannot be meaningfully re-aggregated across instances):

```javascript
const requestDuration = new Summary({
  name: 'http_request_duration_seconds',
  help: 'Request duration in seconds',
  percentiles: [0.5, 0.9, 0.95, 0.99],
});
```

## Implementation with Prometheus
### Node.js Example
```javascript
import express from 'express';
import { collectDefaultMetrics, Registry, Counter, Histogram } from 'prom-client';

const app = express();
const register = new Registry();

// Collect default Node.js metrics (memory, CPU, etc.)
collectDefaultMetrics({ register });

// Custom metrics
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status'],
  registers: [register],
});

const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'path'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5],
  registers: [register],
});

// Middleware to track metrics
app.use((req, res, next) => {
  const timer = httpRequestDuration.startTimer({
    method: req.method,
    path: req.route?.path || req.path,
  });

  res.on('finish', () => {
    timer();
    httpRequestsTotal.inc({
      method: req.method,
      path: req.route?.path || req.path,
      status: res.statusCode.toString(),
    });
  });

  next();
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.send(await register.metrics());
});
```

### Output Format
```text
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/api/users",status="200"} 1523
http_requests_total{method="POST",path="/api/users",status="201"} 45
http_requests_total{method="GET",path="/api/users",status="500"} 12

# HELP http_request_duration_seconds HTTP request duration
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{method="GET",path="/api/users",le="0.01"} 100
http_request_duration_seconds_bucket{method="GET",path="/api/users",le="0.05"} 450
http_request_duration_seconds_bucket{method="GET",path="/api/users",le="0.1"} 980
http_request_duration_seconds_bucket{method="GET",path="/api/users",le="+Inf"} 1523
http_request_duration_seconds_sum{method="GET",path="/api/users"} 76.5
http_request_duration_seconds_count{method="GET",path="/api/users"} 1523
```

## Business Metrics
Beyond technical metrics, track business-relevant data:
```javascript
// User activity
const userSignups = new Counter({
  name: 'user_signups_total',
  help: 'Total user signups',
  labelNames: ['source', 'plan'],
});

const activeUsers = new Gauge({
  name: 'active_users',
  help: 'Currently active users',
});

// E-commerce
const ordersTotal = new Counter({
  name: 'orders_total',
  help: 'Total orders placed',
  labelNames: ['status', 'payment_method'],
});

const orderValue = new Histogram({
  name: 'order_value_dollars',
  help: 'Order value distribution',
  buckets: [10, 25, 50, 100, 250, 500, 1000],
});

// Healthcare
const appointmentsScheduled = new Counter({
  name: 'appointments_scheduled_total',
  help: 'Total appointments scheduled',
  labelNames: ['type', 'department'],
});

const waitTime = new Histogram({
  name: 'patient_wait_time_minutes',
  help: 'Patient wait time distribution',
  buckets: [5, 10, 15, 30, 45, 60, 90],
});
```

## Labels and Cardinality
### Label Best Practices
Labels add dimensions to metrics but increase cardinality:
```text
// Good: Limited, bounded label values
http_requests_total{method="GET", status="200", endpoint="/users"}

// Bad: High cardinality (unbounded values)
http_requests_total{user_id="user-12345", timestamp="1705320000"}
```

### Cardinality Guidelines
| Label Type | Good Example | Bad Example |
|---|---|---|
| HTTP Method | GET, POST, PUT | N/A |
| Status Code | 200, 400, 500 | N/A |
| Endpoint | /users, /orders | /users/12345 (user ID) |
| Environment | prod, staging | N/A |
| Service | api, worker | N/A |
Rule of thumb: total cardinality is the product of the number of distinct values for each label.

- 4 methods × 5 status codes × 10 endpoints = 200 series (good)
- 4 methods × 1M user IDs = 4M series (bad!)
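The arithmetic above can be sketched in a few lines; the label sets here are purely illustrative:

```javascript
// Estimate series count as the product of distinct values per label.
function estimateCardinality(labelValues) {
  return Object.values(labelValues).reduce(
    (total, values) => total * values.length,
    1,
  );
}

const seriesCount = estimateCardinality({
  method: ['GET', 'POST', 'PUT', 'DELETE'],                  // 4
  status: ['200', '201', '400', '404', '500'],               // 5
  endpoint: Array.from({ length: 10 }, (_, i) => `/r${i}`),  // 10
});
console.log(seriesCount); // 200 series: fine

// An unbounded label makes this explode:
// 4 methods x 1,000,000 user IDs = 4,000,000 series
```

Running the same estimate with an unbounded label (user IDs, session IDs, timestamps) is usually the quickest way to catch a cardinality problem before it reaches production.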
## Prometheus Configuration

### prometheus.yml
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - 'alerts/*.yml'

scrape_configs:
  - job_name: 'api-service'
    static_configs:
      - targets: ['api:3000']
    metrics_path: '/metrics'

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
```

## Alerting Rules
### Alert Definition
```yaml
# alerts/api-alerts.yml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m]))
            > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
            > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }}s"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.job }} has been down for more than 1 minute"
```

### Alert Routing
```yaml
# alertmanager.yml
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'xxx'
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'
```

## PromQL Queries
### Common Queries
```promql
# Request rate (per second)
rate(http_requests_total[5m])

# Error rate (percentage)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))
  * 100

# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Requests per endpoint
sum by (path) (rate(http_requests_total[5m]))

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
  /
node_memory_MemTotal_bytes
  * 100

# CPU usage percentage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

## SLIs and SLOs
### Service Level Indicators (SLIs)
```promql
# Availability SLI
sum(rate(http_requests_total{status!~"5.."}[30d]))
  /
sum(rate(http_requests_total[30d]))

# Latency SLI (% of requests under 200ms)
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[30d]))
  /
sum(rate(http_request_duration_seconds_count[30d]))
```

### Service Level Objectives (SLOs)
| Service | SLI | SLO |
|---|---|---|
| API | Availability | 99.9% |
| API | P95 Latency | < 200ms |
| Database | Query Success | 99.99% |
| Queue | Processing Latency | P99 < 30s |
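Each availability SLO implies an error budget: the amount of unavailability the service can absorb over the window before the objective is violated. A minimal sketch, assuming a 30-day window:

```javascript
// Convert an availability SLO (as a percentage) into allowed
// downtime minutes over the window.
function errorBudgetMinutes(sloPercent, windowDays = 30) {
  const totalMinutes = windowDays * 24 * 60;
  return totalMinutes * (1 - sloPercent / 100);
}

console.log(errorBudgetMinutes(99.9));  // ≈ 43.2 minutes per 30 days
console.log(errorBudgetMinutes(99.99)); // ≈ 4.3 minutes per 30 days
```

So the 99.9% API availability SLO in the table leaves roughly 43 minutes of downtime per 30 days; the 99.99% database objective leaves only about 4 minutes.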
## Best Practices

### Do
- Use the four golden signals as a starting point
- Keep label cardinality low
- Set meaningful alert thresholds
- Document what each metric measures
- Aggregate metrics before querying when possible
- Use recording rules for complex queries
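For the recording-rules point above, a sketch of a rule file (the file path and rule names are illustrative, following the `level:metric:operations` naming convention):

```yaml
# rules/recording.yml (illustrative)
groups:
  - name: api-recording
    interval: 30s
    rules:
      # Precompute the P95 latency used by dashboards and alerts
      - record: job:http_request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))

      # Precompute the error ratio
      - record: job:http_requests:error_ratio_5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
```

Dashboards and alerts can then query the cheap precomputed series instead of re-evaluating the expensive expression on every refresh.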
### Don't
- Create high-cardinality metrics
- Alert on every minor variation
- Ignore metric naming conventions
- Skip units in metric names
- Store timestamps as label values
- Create alerts without runbooks
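Two of the points above, naming conventions and units in names, can be linted mechanically. A minimal sketch, with a deliberately simplified list of unit suffixes:

```javascript
// Simplified check of common Prometheus naming conventions:
// snake_case plus a unit (or _total for counters) as the final suffix.
const UNIT_SUFFIXES = ['seconds', 'bytes', 'ratio', 'total', 'celsius', 'minutes'];

function checkMetricName(name) {
  const problems = [];
  if (!/^[a-z_][a-z0-9_]*$/.test(name)) {
    problems.push('not snake_case');
  }
  const lastPart = name.split('_').pop();
  if (!UNIT_SUFFIXES.includes(lastPart)) {
    problems.push('missing unit or _total suffix');
  }
  return problems;
}

console.log(checkMetricName('http_request_duration_seconds')); // []
console.log(checkMetricName('requestDuration')); // both problems flagged
```

A real linter would also check label names and base units; this only illustrates that the conventions are machine-checkable.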
## Compliance
This section fulfills ISO 13485 requirements for monitoring and measurement (8.2.3, 8.2.4) and data analysis (8.4), and ISO 27001 requirements for monitoring activities (A.8.16) and capacity management (A.8.6).