# Metrics

Collecting, aggregating, and alerting on application and infrastructure metrics.
Metrics are numerical measurements collected over time that provide insight into system behavior. Unlike logs, which record discrete events, metrics aggregate measurements to show trends, patterns, and anomalies.
## Types of Metrics
### The Four Golden Signals

From Google's SRE book, these are the essential metrics for any service:
| Signal | Description | Example |
|---|---|---|
| Latency | Time to service a request | P50: 45ms, P99: 200ms |
| Traffic | Demand on the system | 1,500 requests/second |
| Errors | Rate of failed requests | 0.5% error rate |
| Saturation | How "full" the system is | 75% CPU, 80% memory |
### RED Method (Request-focused)
For request-driven services:
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Request latency distribution
### USE Method (Resource-focused)
For infrastructure:
- Utilization: Resource busy percentage
- Saturation: Amount of work queued
- Errors: Count of error events
## Metric Types

### Counter

A monotonically increasing value (it can only go up, or reset to zero on restart):

```javascript
// Typical counters:
//   http_requests_total
//   errors_total
//   user_signups_total

const requestCounter = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'path', 'status'],
});

// Increment
requestCounter.inc({ method: 'GET', path: '/api/users', status: '200' });
```

### Gauge
A value that can go up or down:

```javascript
// Typical gauges:
//   active_connections
//   queue_size
//   temperature

const activeConnections = new Gauge({
  name: 'active_connections',
  help: 'Number of active connections',
});

// Set, increment, or decrement the value
activeConnections.set(42);
activeConnections.inc();
activeConnections.dec();
```

### Histogram
A distribution of values in buckets:

```javascript
// Typical histograms:
//   request_duration_seconds
//   response_size_bytes

const requestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Request duration in seconds',
  labelNames: ['method', 'path'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});

// Observe a value directly
requestDuration.observe({ method: 'GET', path: '/api/users' }, 0.045);

// Or use a timer
const timer = requestDuration.startTimer({ method: 'GET', path: '/api/users' });
// ... do work ...
timer(); // Records the duration
```

### Summary
Similar to a histogram, but quantiles are calculated on the client side (which means they cannot be meaningfully re-aggregated across instances):

```javascript
const requestDuration = new Summary({
  name: 'http_request_duration_seconds',
  help: 'Request duration in seconds',
  percentiles: [0.5, 0.9, 0.95, 0.99],
});
```

## Implementation with Prometheus
### Node.js Example
```javascript
import express from 'express';
import { collectDefaultMetrics, Registry, Counter, Histogram } from 'prom-client';

const app = express();
const register = new Registry();

// Collect default Node.js metrics (memory, CPU, etc.)
collectDefaultMetrics({ register });

// Custom metrics
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status'],
  registers: [register],
});

const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'path'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5],
  registers: [register],
});

// Middleware to track metrics
app.use((req, res, next) => {
  const timer = httpRequestDuration.startTimer({
    method: req.method,
    path: req.route?.path || req.path,
  });

  res.on('finish', () => {
    timer();
    httpRequestsTotal.inc({
      method: req.method,
      path: req.route?.path || req.path,
      status: res.statusCode.toString(),
    });
  });

  next();
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.send(await register.metrics());
});
```

### Output Format
```text
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/api/users",status="200"} 1523
http_requests_total{method="POST",path="/api/users",status="201"} 45
http_requests_total{method="GET",path="/api/users",status="500"} 12

# HELP http_request_duration_seconds HTTP request duration
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{method="GET",path="/api/users",le="0.01"} 100
http_request_duration_seconds_bucket{method="GET",path="/api/users",le="0.05"} 450
http_request_duration_seconds_bucket{method="GET",path="/api/users",le="0.1"} 980
http_request_duration_seconds_bucket{method="GET",path="/api/users",le="+Inf"} 1523
http_request_duration_seconds_sum{method="GET",path="/api/users"} 76.5
http_request_duration_seconds_count{method="GET",path="/api/users"} 1523
```

## Business Metrics
Beyond technical metrics, track business-relevant data:
```javascript
// User activity
const userSignups = new Counter({
  name: 'user_signups_total',
  help: 'Total user signups',
  labelNames: ['source', 'plan'],
});

const activeUsers = new Gauge({
  name: 'active_users',
  help: 'Currently active users',
});

// E-commerce
const ordersTotal = new Counter({
  name: 'orders_total',
  help: 'Total orders placed',
  labelNames: ['status', 'payment_method'],
});

const orderValue = new Histogram({
  name: 'order_value_dollars',
  help: 'Order value distribution',
  buckets: [10, 25, 50, 100, 250, 500, 1000],
});

// Healthcare
const appointmentsScheduled = new Counter({
  name: 'appointments_scheduled_total',
  help: 'Total appointments scheduled',
  labelNames: ['type', 'department'],
});

const waitTime = new Histogram({
  name: 'patient_wait_time_minutes',
  help: 'Patient wait time distribution',
  buckets: [5, 10, 15, 30, 45, 60, 90],
});
```

## Labels and Cardinality
### Label Best Practices
Labels add dimensions to metrics but increase cardinality:
```text
// Good: Limited, bounded label values
http_requests_total{method="GET", status="200", endpoint="/users"}

// Bad: High cardinality (unbounded values)
http_requests_total{user_id="user-12345", timestamp="1705320000"}
```

### Cardinality Guidelines
| Label Type | Good Example | Bad Example |
|---|---|---|
| HTTP Method | GET, POST, PUT | N/A |
| Status Code | 200, 400, 500 | N/A |
| Endpoint | /users, /orders | /users/12345 (user ID) |
| Environment | prod, staging | N/A |
| Service | api, worker | N/A |
Rule of thumb: total cardinality is the product of the number of distinct values for each label.

- 4 methods × 5 status codes × 10 endpoints = 200 series (good)
- 4 methods × 1M user IDs = 4M series (bad!)
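The arithmetic above can be sketched in a few lines; the label sets here are purely illustrative:

```javascript
// Estimate series count as the product of distinct values per label.
function estimateCardinality(labelValues) {
  return Object.values(labelValues).reduce(
    (total, values) => total * values.length,
    1,
  );
}

const seriesCount = estimateCardinality({
  method: ['GET', 'POST', 'PUT', 'DELETE'],                  // 4
  status: ['200', '201', '400', '404', '500'],               // 5
  endpoint: Array.from({ length: 10 }, (_, i) => `/r${i}`),  // 10
});
console.log(seriesCount); // 200 series: fine

// An unbounded label makes this explode:
// 4 methods x 1,000,000 user IDs = 4,000,000 series
```

Running the same estimate with an unbounded label (user IDs, session IDs, timestamps) is usually the quickest way to catch a cardinality problem before it reaches production.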
## Prometheus Configuration

### prometheus.yml
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - 'alerts/*.yml'

scrape_configs:
  - job_name: 'api-service'
    static_configs:
      - targets: ['api:3000']
    metrics_path: '/metrics'

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
```

## Alerting Rules
### Alert Definition
```yaml
# alerts/api-alerts.yml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m]))
            > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
            > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }}s"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.job }} has been down for more than 1 minute"
```

### Alert Routing
```yaml
# alertmanager.yml
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'xxx'
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'
```

## PromQL Queries
### Common Queries
```promql
# Request rate (per second)
rate(http_requests_total[5m])

# Error rate (percentage)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))
  * 100

# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Requests per endpoint
sum by (path) (rate(http_requests_total[5m]))

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
  /
node_memory_MemTotal_bytes
  * 100

# CPU usage percentage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

## SLIs and SLOs
### Service Level Indicators (SLIs)
```promql
# Availability SLI
sum(rate(http_requests_total{status!~"5.."}[30d]))
  /
sum(rate(http_requests_total[30d]))

# Latency SLI (% of requests under 200ms)
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[30d]))
  /
sum(rate(http_request_duration_seconds_count[30d]))
```

### Service Level Objectives (SLOs)
| Service | SLI | SLO |
|---|---|---|
| API | Availability | 99.9% |
| API | P95 Latency | < 200ms |
| Database | Query Success | 99.99% |
| Queue | Processing Latency | P99 < 30s |
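Each availability SLO implies an error budget: the amount of unavailability the service can absorb over the window before the objective is violated. A minimal sketch, assuming a 30-day window:

```javascript
// Convert an availability SLO (as a percentage) into allowed
// downtime minutes over the window.
function errorBudgetMinutes(sloPercent, windowDays = 30) {
  const totalMinutes = windowDays * 24 * 60;
  return totalMinutes * (1 - sloPercent / 100);
}

console.log(errorBudgetMinutes(99.9));  // ≈ 43.2 minutes per 30 days
console.log(errorBudgetMinutes(99.99)); // ≈ 4.3 minutes per 30 days
```

So the 99.9% API availability SLO in the table leaves roughly 43 minutes of downtime per 30 days; the 99.99% database objective leaves only about 4 minutes.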
## Best Practices

### Do
- Use the four golden signals as a starting point
- Keep label cardinality low
- Set meaningful alert thresholds
- Document what each metric measures
- Aggregate metrics before querying when possible
- Use recording rules for complex queries
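For the recording-rules point above, a sketch of a rule file (the file path and rule names are illustrative, following the `level:metric:operations` naming convention):

```yaml
# rules/recording.yml (illustrative)
groups:
  - name: api-recording
    interval: 30s
    rules:
      # Precompute the P95 latency used by dashboards and alerts
      - record: job:http_request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))

      # Precompute the error ratio
      - record: job:http_requests:error_ratio_5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
```

Dashboards and alerts can then query the cheap precomputed series instead of re-evaluating the expensive expression on every refresh.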
### Don't
- Create high-cardinality metrics
- Alert on every minor variation
- Ignore metric naming conventions
- Skip units in metric names
- Store timestamps as label values
- Create alerts without runbooks
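Two of the points above, naming conventions and units in names, can be linted mechanically. A minimal sketch, with a deliberately simplified list of unit suffixes:

```javascript
// Simplified check of common Prometheus naming conventions:
// snake_case plus a unit (or _total for counters) as the final suffix.
const UNIT_SUFFIXES = ['seconds', 'bytes', 'ratio', 'total', 'celsius', 'minutes'];

function checkMetricName(name) {
  const problems = [];
  if (!/^[a-z_][a-z0-9_]*$/.test(name)) {
    problems.push('not snake_case');
  }
  const lastPart = name.split('_').pop();
  if (!UNIT_SUFFIXES.includes(lastPart)) {
    problems.push('missing unit or _total suffix');
  }
  return problems;
}

console.log(checkMetricName('http_request_duration_seconds')); // []
console.log(checkMetricName('requestDuration')); // both problems flagged
```

A real linter would also check label names and base units; this only illustrates that the conventions are machine-checkable.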
## Compliance
This section fulfills ISO 13485 requirements for monitoring and measurement (8.2.3, 8.2.4) and data analysis (8.4), and ISO 27001 requirements for monitoring activities (A.8.16) and capacity management (A.8.6).