Metrics

Collecting, aggregating, and alerting on application and infrastructure metrics

Metrics are numerical measurements collected over time that provide insight into system behavior. Unlike logs, which record discrete events, metrics aggregate data to show trends, patterns, and anomalies.

Types of Metrics

The Four Golden Signals

From Google's SRE book, these are the essential metrics for any service:

Signal | Description | Example
Latency | Time to service a request | P50: 45ms, P99: 200ms
Traffic | Demand on the system | 1,500 requests/second
Errors | Rate of failed requests | 0.5% error rate
Saturation | How "full" the system is | 75% CPU, 80% memory

RED Method (Request-focused)

For request-driven services:

  • Rate: Requests per second
  • Errors: Failed requests per second
  • Duration: Request latency distribution
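
As a rough illustration of how the three RED values relate to raw request data, the sketch below derives them from an in-memory list of request records. The record shape (`status`, `durationMs`), the function names, and the nearest-rank percentile are assumptions made for this example; in practice these numbers would normally come from a metrics backend rather than application code.

```javascript
// Nearest-rank percentile over an ascending-sorted array of numbers.
function percentile(sorted, p) {
  if (sorted.length === 0) return NaN;
  const rank = Math.max(Math.ceil(p * sorted.length), 1);
  return sorted[rank - 1];
}

// Summarize a window of requests as Rate, Errors, and Duration.
// Each record is assumed to look like { status: 200, durationMs: 45 }.
function redSummary(requests, windowSeconds) {
  const failed = requests.filter((r) => r.status >= 500).length;
  const durations = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  return {
    ratePerSecond: requests.length / windowSeconds,
    errorsPerSecond: failed / windowSeconds,
    p50Ms: percentile(durations, 0.5),
    p99Ms: percentile(durations, 0.99),
  };
}

const sample = [
  { status: 200, durationMs: 40 },
  { status: 200, durationMs: 45 },
  { status: 500, durationMs: 210 },
  { status: 200, durationMs: 50 },
];
console.log(redSummary(sample, 2));
// { ratePerSecond: 2, errorsPerSecond: 0.5, p50Ms: 45, p99Ms: 210 }
```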

USE Method (Resource-focused)

For infrastructure:

  • Utilization: Resource busy percentage
  • Saturation: Amount of work queued
  • Errors: Count of error events

Metric Types

Counter

A monotonically increasing value. It only goes up, apart from a reset to zero when the process restarts:

import { Counter } from 'prom-client';

// Typical counter names:
//   http_requests_total
//   errors_total
//   user_signups_total

// Usage
const requestCounter = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'path', 'status'],
});

// Increment by 1 for one label combination
requestCounter.inc({ method: 'GET', path: '/api/users', status: '200' });
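
Because a counter only ever rises, or drops back to zero on a restart, a backend can recover the true increase from raw samples by treating any decrease as a reset. A minimal sketch of that idea follows. `counterIncrease` and `perSecondRate` are illustrative names, the sample shape is an assumption, and the logic is simpler than what Prometheus's `rate()` does (for instance, there is no extrapolation to the window boundaries).

```javascript
// Total increase across samples, treating any drop as a counter reset.
// samples: [{ t: seconds, v: counterValue }], ordered by time.
function counterIncrease(samples) {
  let total = 0;
  for (let i = 1; i < samples.length; i++) {
    const prev = samples[i - 1].v;
    const curr = samples[i].v;
    // After a reset the counter restarted from zero, so the whole current value is new.
    total += curr >= prev ? curr - prev : curr;
  }
  return total;
}

// Average per-second rate over the time span covered by the samples.
function perSecondRate(samples) {
  if (samples.length < 2) return NaN;
  const elapsed = samples[samples.length - 1].t - samples[0].t;
  return counterIncrease(samples) / elapsed;
}

const scrapes = [
  { t: 0, v: 100 },
  { t: 15, v: 130 },
  { t: 30, v: 10 }, // the process restarted between scrapes
  { t: 45, v: 40 },
  { t: 60, v: 70 },
];
console.log(counterIncrease(scrapes)); // 100
console.log(perSecondRate(scrapes).toFixed(2)); // "1.67"
```

A gauge could not be handled this way, since a drop in a gauge is a legitimate value rather than a reset.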

Gauge

Value that can go up or down:

import { Gauge } from 'prom-client';

// Typical gauge names:
//   active_connections
//   queue_size
//   temperature

// Usage
const activeConnections = new Gauge({
  name: 'active_connections',
  help: 'Number of active connections',
});

// Set an absolute value, or adjust it by 1
activeConnections.set(42);
activeConnections.inc();
activeConnections.dec();

Histogram

Counts observations in cumulative buckets, so the distribution and its quantiles can be estimated at query time:

import { Histogram } from 'prom-client';

// Typical histogram names:
//   request_duration_seconds
//   response_size_bytes

// Usage
const requestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Request duration in seconds',
  labelNames: ['method', 'path'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});

// Observe a value (in seconds)
requestDuration.observe({ method: 'GET', path: '/api/users' }, 0.045);

// Or use a timer
const endTimer = requestDuration.startTimer({ method: 'GET', path: '/api/users' });
// ... do work ...
endTimer();  // Records the elapsed time in seconds
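
To make the bucket layout concrete, here is a small, library-free sketch of how a cumulative histogram could record observations. It is not prom-client's implementation; `createHistogram` and its fields are hypothetical names used only for illustration.

```javascript
// Minimal cumulative histogram: each bucket counts observations <= its upper bound.
function createHistogram(upperBounds) {
  const bounds = [...upperBounds].sort((a, b) => a - b);
  const counts = new Array(bounds.length).fill(0);
  let sum = 0;
  let count = 0;
  return {
    observe(value) {
      sum += value;
      count += 1;
      // Cumulative: a value increments every bucket whose bound is >= the value.
      bounds.forEach((le, i) => {
        if (value <= le) counts[i] += 1;
      });
    },
    snapshot() {
      const buckets = bounds.map((le, i) => ({ le, count: counts[i] }));
      buckets.push({ le: Infinity, count }); // the +Inf bucket equals the total count
      return { buckets, sum, count };
    },
  };
}

const demo = createHistogram([0.05, 0.1, 0.5]);
[0.045, 0.08, 0.3, 2].forEach((v) => demo.observe(v));
console.log(demo.snapshot().buckets.map((b) => b.count)); // [ 1, 2, 3, 4 ]
```

The observation of 2 seconds exceeds every finite bound, so it appears only in the +Inf bucket. This is why bucket boundaries should cover the range of values the service is expected to produce.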

Summary

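Similar to a histogram, but quantiles are calculated in the client process, so they cannot be aggregated across instances afterwards:

import { Summary } from 'prom-client';

// A registry accepts each metric name only once, so use this instead of,
// not alongside, the histogram of the same name.
const requestDuration = new Summary({
  name: 'http_request_duration_seconds',
  help: 'Request duration in seconds',
  percentiles: [0.5, 0.9, 0.95, 0.99],
});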
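
One practical consequence of client-side quantiles is that per-instance values cannot be meaningfully combined. The sketch below uses made-up latency samples and a hypothetical nearest-rank `quantile` helper to compare the average of two per-instance P90 values with the P90 of the merged data.

```javascript
// Nearest-rank quantile of an unsorted array of numbers.
function quantile(values, q) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(Math.ceil(q * sorted.length), 1);
  return sorted[rank - 1];
}

// Instance A serves most of the traffic and is fast (10..27 ms).
const instanceA = Array.from({ length: 18 }, (_, i) => 10 + i);
// Instance B served only two requests, both slow.
const instanceB = [500, 900];

const averagedP90 = (quantile(instanceA, 0.9) + quantile(instanceB, 0.9)) / 2;
const mergedP90 = quantile([...instanceA, ...instanceB], 0.9);

console.log(averagedP90); // 463
console.log(mergedP90); // 27
```

Averaging gives every instance equal weight regardless of how much traffic it handled, which is why the two results differ so much here. Histogram buckets avoid the problem because bucket counts can be summed across instances before a quantile is estimated.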

Implementation with Prometheus

Node.js Example

import express from 'express';
import { collectDefaultMetrics, Registry, Counter, Histogram } from 'prom-client';

const app = express();
const register = new Registry();

// Collect default Node.js metrics (memory, CPU, etc.)
collectDefaultMetrics({ register });

// Custom metrics
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status'],
  registers: [register],
});

const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'path'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5],
  registers: [register],
});

// Middleware to track metrics
app.use((req, res, next) => {
  const endTimer = httpRequestDuration.startTimer();

  res.on('finish', () => {
    // req.route is only set once routing has run, so resolve labels here.
    // Unmatched requests share one label value to keep cardinality bounded.
    const labels = {
      method: req.method,
      path: req.route ? req.baseUrl + req.route.path : 'unmatched',
    };
    endTimer(labels);
    httpRequestsTotal.inc({ ...labels, status: res.statusCode.toString() });
  });

  next();
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.send(await register.metrics());
});

// Port 3000 matches the api:3000 scrape target used in prometheus.yml
app.listen(3000);

Output Format

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/api/users",status="200"} 1523
http_requests_total{method="POST",path="/api/users",status="201"} 45
http_requests_total{method="GET",path="/api/users",status="500"} 12

# HELP http_request_duration_seconds HTTP request duration
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{method="GET",path="/api/users",le="0.01"} 100
http_request_duration_seconds_bucket{method="GET",path="/api/users",le="0.05"} 450
http_request_duration_seconds_bucket{method="GET",path="/api/users",le="0.1"} 980
http_request_duration_seconds_bucket{method="GET",path="/api/users",le="+Inf"} 1523
http_request_duration_seconds_sum{method="GET",path="/api/users"} 76.5
http_request_duration_seconds_count{method="GET",path="/api/users"} 1523
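
The bucket lines above are the input that `histogram_quantile()` works from. As a simplified sketch of the linear interpolation Prometheus documents for that function, the following estimates quantiles from the sample output. The helper name `estimateQuantile` is made up for this example, and edge cases (empty histograms, NaN handling) are omitted.

```javascript
// buckets: cumulative counts ordered by upper bound; the last entry is +Inf.
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);
  // A rank that lands in the +Inf bucket is reported as the highest finite bound.
  if (buckets[i].le === Infinity) return buckets[buckets.length - 2].le;
  const lowerBound = i === 0 ? 0 : buckets[i - 1].le;
  const lowerCount = i === 0 ? 0 : buckets[i - 1].count;
  // Assume observations are spread evenly inside the bucket.
  const fraction = (rank - lowerCount) / (buckets[i].count - lowerCount);
  return lowerBound + (buckets[i].le - lowerBound) * fraction;
}

// Values copied from the GET /api/users lines of the sample output.
const getUsersBuckets = [
  { le: 0.01, count: 100 },
  { le: 0.05, count: 450 },
  { le: 0.1, count: 980 },
  { le: Infinity, count: 1523 },
];

console.log(estimateQuantile(0.5, getUsersBuckets).toFixed(4)); // "0.0794"
console.log(estimateQuantile(0.95, getUsersBuckets)); // 0.1
console.log((76.5 / 1523).toFixed(4)); // mean from _sum / _count: "0.0502"
```

With only the buckets shown, the P95 estimate is capped at 0.1 because more than a third of the observations fall above the last finite bound listed. Estimates are only as precise as the bucket boundaries around the quantile of interest.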

Business Metrics

Beyond technical metrics, track business-relevant data:

import { Counter, Gauge, Histogram } from 'prom-client';

// User activity
const userSignups = new Counter({
  name: 'user_signups_total',
  help: 'Total user signups',
  labelNames: ['source', 'plan'],
});

const activeUsers = new Gauge({
  name: 'active_users',
  help: 'Currently active users',
});

// E-commerce
const ordersTotal = new Counter({
  name: 'orders_total',
  help: 'Total orders placed',
  labelNames: ['status', 'payment_method'],
});

const orderValue = new Histogram({
  name: 'order_value_dollars',
  help: 'Order value distribution',
  buckets: [10, 25, 50, 100, 250, 500, 1000],
});

// Healthcare
const appointmentsScheduled = new Counter({
  name: 'appointments_scheduled_total',
  help: 'Total appointments scheduled',
  labelNames: ['type', 'department'],
});

const waitTime = new Histogram({
  name: 'patient_wait_time_minutes',
  help: 'Patient wait time distribution',
  buckets: [5, 10, 15, 30, 45, 60, 90],
});

Labels and Cardinality

Label Best Practices

Labels add dimensions to metrics but increase cardinality:

// Good: Limited, bounded label values
http_requests_total{method="GET", status="200", endpoint="/users"}

// Bad: High cardinality (unbounded values)
http_requests_total{user_id="user-12345", timestamp="1705320000"}

Cardinality Guidelines

Label Type | Good Example | Bad Example
HTTP Method | GET, POST, PUT | N/A
Status Code | 200, 400, 500 | N/A
Endpoint | /users, /orders | /users/12345 (user ID)
Environment | prod, staging | N/A
Service | api, worker | N/A
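
When a framework does not expose a route template, one possible way to keep the endpoint label bounded is to replace identifier-like path segments with placeholders before using the path as a label value. The sketch below is only one approach under that assumption; `normalizePath`, the `:id` and `:uuid` placeholders, and the patterns it recognizes are all hypothetical.

```javascript
const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Collapse numeric IDs and UUIDs so many URLs map to one label value.
function normalizePath(path) {
  return path
    .split('?')[0] // drop the query string
    .split('/')
    .map((segment) => {
      if (/^\d+$/.test(segment)) return ':id';
      if (UUID_PATTERN.test(segment)) return ':uuid';
      return segment;
    })
    .join('/');
}

console.log(normalizePath('/users/12345')); // "/users/:id"
console.log(normalizePath('/orders/550e8400-e29b-41d4-a716-446655440000/items?page=2'));
// "/orders/:uuid/items"
```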

Rule of thumb: the worst-case number of time series for a metric is the product of the number of distinct values of each label:

  • 4 methods × 5 status codes × 10 endpoints = 200 series (good)
  • 4 methods × 1M user IDs = 4M series (bad!)
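
The arithmetic above can be checked with a couple of lines. `maxSeries` and `histogramSeries` are illustrative helper names; the histogram variant reflects that each label combination of a histogram exposes one series per finite bucket plus the `+Inf` bucket, `_sum`, and `_count`.

```javascript
// Worst-case series count: the product of distinct values per label.
function maxSeries(labelValueCounts) {
  return Object.values(labelValueCounts).reduce((product, n) => product * n, 1);
}

// A histogram multiplies that by (finite buckets + 3): +Inf, _sum, _count.
function histogramSeries(labelValueCounts, finiteBuckets) {
  return maxSeries(labelValueCounts) * (finiteBuckets + 3);
}

console.log(maxSeries({ method: 4, status: 5, endpoint: 10 })); // 200
console.log(maxSeries({ method: 4, user_id: 1000000 })); // 4000000
console.log(histogramSeries({ method: 4, path: 10 }, 6)); // 360
```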

Prometheus Configuration

prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - 'alerts/*.yml'

scrape_configs:
  - job_name: 'api-service'
    static_configs:
      - targets: ['api:3000']
    metrics_path: '/metrics'

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

Alerting Rules

Alert Definition

# alerts/api-alerts.yml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
          > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }}s"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.job }} has been down for more than 1 minute"
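
The `for:` clause means a rule's expression must stay true across consecutive evaluations before the alert fires. A simplified model of that pending-to-firing transition is sketched below. `alertStates` is a hypothetical helper, evaluations are assumed to happen at a fixed interval, and real Prometheus behavior has more options than shown here.

```javascript
// Given one boolean per rule evaluation, return the alert state after each one.
function alertStates(conditionResults, evalIntervalSeconds, forSeconds) {
  let activeSince = null; // time at which the condition first became true
  return conditionResults.map((isTrue, i) => {
    const now = i * evalIntervalSeconds;
    if (!isTrue) {
      activeSince = null; // any false evaluation resets the timer
      return 'inactive';
    }
    if (activeSince === null) activeSince = now;
    return now - activeSince >= forSeconds ? 'firing' : 'pending';
  });
}

// Evaluated every 15s with a 1m hold, as in the ServiceDown rule.
const evaluations = [true, true, false, true, true, true, true, true];
console.log(alertStates(evaluations, 15, 60));
// [ 'pending', 'pending', 'inactive', 'pending', 'pending', 'pending', 'pending', 'firing' ]
```

The brief recovery at the third evaluation restarts the clock, which is how `for:` filters out short blips at the cost of a delayed notification.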

Alert Routing

# alertmanager.yml
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'

  routes:
    - matchers:
        - severity="critical"
      receiver: 'pagerduty'

    - matchers:
        - severity="warning"
      receiver: 'slack'

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'xxx'

  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'
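
How an alert's labels select a receiver in the tree above can be modelled in a few lines. This is a deliberately reduced sketch: `pickReceiver` and the object layout are hypothetical, only exact label equality is shown, and features such as `continue`, nested routes, and regex matching are left out.

```javascript
// A plain-object model of the routing tree from alertmanager.yml.
const routingTree = {
  receiver: 'default',
  routes: [
    { match: { severity: 'critical' }, receiver: 'pagerduty' },
    { match: { severity: 'warning' }, receiver: 'slack' },
  ],
};

// Return the receiver of the first child route whose labels all match,
// falling back to the root receiver when none do.
function pickReceiver(alertLabels, root) {
  const child = root.routes.find((r) =>
    Object.entries(r.match).every(([name, value]) => alertLabels[name] === value)
  );
  return child ? child.receiver : root.receiver;
}

console.log(pickReceiver({ alertname: 'ServiceDown', severity: 'critical' }, routingTree)); // "pagerduty"
console.log(pickReceiver({ alertname: 'HighLatency', severity: 'warning' }, routingTree)); // "slack"
console.log(pickReceiver({ alertname: 'DiskFilling', severity: 'info' }, routingTree)); // "default"
```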

PromQL Queries

Common Queries

# Request rate (per second)
rate(http_requests_total[5m])

# Error rate (percentage)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100

# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Requests per endpoint
sum by (path) (rate(http_requests_total[5m]))

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/
node_memory_MemTotal_bytes
* 100

# CPU usage percentage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

SLIs and SLOs

Service Level Indicators (SLIs)

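# Availability SLI (fraction of requests that did not return 5xx, 0 to 1)
sum(rate(http_requests_total{status!~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))

# Latency SLI (fraction of requests served in under 200ms, 0 to 1)
# Requires 0.2 to be one of the histogram's bucket boundaries;
# otherwise le="0.2" matches no series and the query returns nothing.
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[30d]))
/
sum(rate(http_request_duration_seconds_count[30d]))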

Service Level Objectives (SLOs)

Service | SLI | SLO
API | Availability | 99.9%
API | P95 Latency | < 200ms
Database | Query Success | 99.99%
Queue | Processing Latency | P99 < 30s
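
An availability SLO implies an error budget: the amount of failure that is still within target. The short sketch below works that out for the availability targets in the table, assuming a 30-day window to match the SLI queries. Both function names are illustrative.

```javascript
// Minutes of full unavailability an availability SLO allows over a window.
function errorBudgetMinutes(sloPercent, windowDays) {
  const windowMinutes = windowDays * 24 * 60;
  return windowMinutes * (1 - sloPercent / 100);
}

// Fraction of the error budget still unspent, given the measured SLI (0 to 1).
function budgetRemaining(sloPercent, measuredSli) {
  const allowedFailure = 1 - sloPercent / 100;
  const actualFailure = 1 - measuredSli;
  return 1 - actualFailure / allowedFailure;
}

console.log(errorBudgetMinutes(99.9, 30).toFixed(1)); // "43.2"
console.log(errorBudgetMinutes(99.99, 30).toFixed(2)); // "4.32"
console.log(budgetRemaining(99.9, 0.9995).toFixed(2)); // "0.50"
```

A measured availability of 99.95% against a 99.9% target has therefore used about half of the budget. A negative result would mean the SLO has been breached for the window.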

Best Practices

Do

  • Use the four golden signals as a starting point
  • Keep label cardinality low
  • Set meaningful alert thresholds
  • Document what each metric measures
  • Aggregate metrics before querying when possible
  • Use recording rules for complex queries

Don't

  • Create high-cardinality metrics
  • Alert on every minor variation
  • Ignore metric naming conventions
  • Skip units in metric names
  • Store timestamps as label values
  • Create alerts without runbooks


Compliance

This section fulfills ISO 13485 requirements for monitoring and measurement (8.2.3, 8.2.4) and data analysis (8.4), and ISO 27001 requirements for monitoring activities (A.8.16) and capacity management (A.8.6).
