Observability
Comprehensive observability practices - logging, metrics, tracing, and dashboards
Observability is the ability to understand the internal state of your system by examining its outputs. For regulated software, observability is crucial for compliance, incident response, and demonstrating system health to auditors.
The Three Pillars of Observability
| Pillar | Purpose | Example |
|---|---|---|
| Logs | Record discrete events | "User 123 logged in at 10:32:15" |
| Metrics | Measure aggregated data | "Average response time: 245ms" |
| Traces | Track request flow | "Request took 500ms: DB 300ms, API 150ms, Auth 50ms" |
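The three pillars can be seen in miniature in a single instrumented request. A stdlib-only Python sketch (the in-memory stores and the `handle_request` function are illustrative, not a real backend):

```python
import json
import time

# Illustrative in-memory sinks for each pillar
LOGS = []
METRICS = {"requests": 0, "latency_ms": []}
SPANS = []

def handle_request(user_id: str, trace_id: str) -> None:
    """Serve one request, emitting a log event, metrics, and a trace span."""
    start = time.perf_counter()
    # ... real work would happen here ...
    duration_ms = (time.perf_counter() - start) * 1000

    # Pillar 1: a discrete log event
    LOGS.append(json.dumps({"traceId": trace_id, "userId": user_id,
                            "action": "login",
                            "duration_ms": round(duration_ms, 2)}))
    # Pillar 2: aggregated measurements
    METRICS["requests"] += 1
    METRICS["latency_ms"].append(duration_ms)
    # Pillar 3: a trace span for this unit of work
    SPANS.append({"traceId": trace_id, "name": "handle_request",
                  "duration_ms": duration_ms})

handle_request("user-456", "abc123")
```

The same `traceId` appears in the log line and the span, which is what later makes correlation across pillars possible.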
Why Observability Matters
For Software Quality
- Debugging: Quickly identify and fix issues
- Performance: Understand bottlenecks and optimize
- Reliability: Detect problems before users report them
- Capacity Planning: Predict resource needs
For Compliance
| Requirement | How Observability Helps |
|---|---|
| Audit Trails | Logs provide evidence of system actions |
| Access Monitoring | Track who accessed what data when |
| Incident Response | Document and analyze security events |
| SLA Compliance | Metrics prove service level adherence |
| Change Tracking | Log all configuration and code changes |
Observability Maturity Model
Level 1: Basic Logging
- Console output to files
- No structured format
- Manual log analysis
Level 2: Centralized Logging
- Logs aggregated in central system
- Structured logging format
- Basic search and filtering
Level 3: Metrics and Alerting
- Key metrics collected
- Dashboards for visualization
- Alerts for critical thresholds
Level 4: Distributed Tracing
- Request tracing across services
- Latency breakdown by component
- Root cause analysis capability
Level 5: Full Observability
- Correlation across logs, metrics, traces
- AI-powered anomaly detection
- Proactive issue prediction
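Level 3 of the ladder above — key metrics plus threshold alerts — can be sketched with nothing but the standard library. The sample data and the alert thresholds below are illustrative:

```python
# Raw observations: (duration_ms, http_status) per request — sample data
requests = [(120, 200), (95, 200), (480, 500), (210, 200), (640, 200)]

# Aggregate into two key metrics: error rate and p95 latency
error_rate = sum(1 for _, status in requests if status >= 500) / len(requests)
p95_latency = sorted(d for d, _ in requests)[int(0.95 * len(requests)) - 1]

# Alert only when a critical threshold is crossed (values are illustrative)
alerts = []
if error_rate > 0.05:
    alerts.append(f"error rate {error_rate:.0%} exceeds 5%")
if p95_latency > 500:
    alerts.append(f"p95 latency {p95_latency}ms exceeds 500ms")
```

In practice a metrics system such as Prometheus does this aggregation continuously, but the shape of the computation is the same: raw events in, a small number of decision-driving numbers out.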
Core Principles
1. Measure What Matters
Focus on metrics that drive decisions:
The Four Golden Signals (Google SRE):
├── Latency - How long requests take
├── Traffic - How much demand on the system
├── Errors - Rate of failed requests
└── Saturation - How "full" the system is
2. Structured Data
Always use structured formats for machine parsing:
{
"timestamp": "2024-01-15T10:32:15.123Z",
"level": "info",
"service": "user-service",
"traceId": "abc123",
"userId": "user-456",
"action": "login",
"duration_ms": 45,
"success": true
}
3. Correlation
Enable linking related data across pillars:
Request ID: req-789
├── Logs: 5 log entries with traceId=req-789
├── Metrics: Response time 450ms, Status 200
└── Traces: api-gateway → auth-service → user-db
4. Context is Key
Include relevant context with every observation:
| Context | Example |
|---|---|
| Request ID | traceId: "abc-123" |
| User ID | userId: "user-456" |
| Environment | env: "production" |
| Version | version: "1.2.3" |
| Host | host: "server-01" |
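Principles 2 through 4 combine naturally: set the request-scoped context once and let every log line carry it automatically. A sketch using Python's `contextvars` and a `logging.Filter` (field names follow the examples above; the logger wiring is illustrative):

```python
import contextvars
import json
import logging

# Request-scoped context, set once per request
trace_id = contextvars.ContextVar("trace_id", default="-")
user_id = contextvars.ContextVar("user_id", default="-")

class ContextFilter(logging.Filter):
    """Inject the current trace/user context into every log record."""
    def filter(self, record):
        record.traceId = trace_id.get()
        record.userId = user_id.get()
        return True

class JsonFormatter(logging.Formatter):
    """Emit structured, machine-parseable JSON lines."""
    def format(self, record):
        return json.dumps({"level": record.levelname.lower(),
                           "traceId": record.traceId,
                           "userId": record.userId,
                           "message": record.getMessage()})

logger = logging.getLogger("user-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
handler.addFilter(ContextFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# At the start of a request, set the context once...
trace_id.set("req-789")
user_id.set("user-456")
# ...and every subsequent log line is correlatable by traceId.
logger.info("login succeeded")
```

Because the filter runs on every record, developers never have to remember to pass the trace ID by hand, which is where correlation usually breaks down.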
Observability Stack
Common Tool Combinations
Open Source Stack (PLG/LGTM):
- Prometheus: Metrics collection
- Loki: Log aggregation
- Grafana: Visualization
- Tempo/Jaeger: Distributed tracing
Cloud Provider Stacks:
- AWS: CloudWatch, X-Ray
- Azure: Application Insights, Monitor
- GCP: Cloud Logging, Cloud Monitoring, Cloud Trace
Commercial Solutions:
- Datadog, New Relic, Splunk, Elastic
Architecture Example
Implementation Guidelines
For New Projects
- Start with structured logging - Easy to add, immediate value
- Add key metrics early - Response time, error rate, throughput
- Implement health checks - /health and /ready endpoints
- Add tracing when needed - When debugging distributed systems
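The health-check guidance above can be sketched as a plain WSGI app — one liveness endpoint and one readiness endpoint. Paths, payloads, and the `check_dependencies` stub are illustrative; real services usually hang this off their web framework:

```python
import json

def check_dependencies() -> bool:
    # Stub: a real check would ping the database, cache, etc.
    return True

def app(environ, start_response):
    """Minimal WSGI app exposing liveness and readiness endpoints."""
    path = environ.get("PATH_INFO", "/")
    if path == "/health":
        # Liveness: the process is up and able to answer.
        body, status = {"status": "ok"}, "200 OK"
    elif path == "/ready":
        # Readiness: dependencies are reachable, so traffic can be routed here.
        deps_ok = check_dependencies()
        body = {"status": "ready" if deps_ok else "not ready"}
        status = "200 OK" if deps_ok else "503 Service Unavailable"
    else:
        body, status = {"error": "not found"}, "404 Not Found"
    start_response(status, [("Content-Type", "application/json")])
    return [json.dumps(body).encode()]
```

Keeping liveness and readiness separate matters: an orchestrator restarts a pod that fails /health but merely stops routing traffic to one that fails /ready.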
For Existing Projects
- Audit current state - What's being logged/measured?
- Standardize formats - Consistent log structure
- Add missing pillars - Usually metrics and traces
- Create dashboards - Visualize key indicators
Compliance Considerations
Audit Logging Requirements
For regulated systems, certain events MUST be logged:
| Event Type | Required Fields |
|---|---|
| Authentication | User ID, timestamp, success/failure, IP |
| Authorization | User ID, resource, action, decision |
| Data Access | User ID, record ID, action, timestamp |
| Data Modification | User ID, record ID, before/after values |
| System Changes | Admin ID, change type, timestamp |
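The required-fields table can be enforced at write time, so an audit event missing a mandatory field is rejected rather than silently under-logged. The schema literals below mirror three rows of the table; everything else (function name, field spellings) is illustrative:

```python
import datetime
import json

# Required fields per event type, mirroring the table above
REQUIRED_FIELDS = {
    "authentication": {"user_id", "timestamp", "success", "ip"},
    "authorization": {"user_id", "resource", "action", "decision"},
    "data_access": {"user_id", "record_id", "action", "timestamp"},
}

def audit_log(event_type: str, **fields) -> str:
    """Serialize an audit event, refusing incomplete records."""
    missing = REQUIRED_FIELDS[event_type] - fields.keys()
    if missing:
        raise ValueError(f"{event_type} event missing fields: {sorted(missing)}")
    return json.dumps({"event_type": event_type, **fields}, default=str)

entry = audit_log("authentication", user_id="user-456",
                  timestamp=datetime.datetime.now(datetime.timezone.utc),
                  success=True, ip="203.0.113.7")
```

Failing loudly here is deliberate: an audit trail with gaps is often worse for an audit than no trail at all, because it suggests the control exists but does not work.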
Log Retention
| Regulation | Minimum Retention |
|---|---|
| HIPAA | 6 years |
| SOX | 7 years |
| GDPR | Purpose-dependent |
| PCI DSS | 1 year (3 months online) |
Data Protection in Logs
Never log:
- Passwords or authentication tokens
- Full credit card numbers
- Social Security Numbers
- Unencrypted PHI/PII
Do log:
- Masked identifiers (last 4 digits)
- Hashed values for correlation
- Reference IDs, not actual data
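The do/don't lists above can be enforced with small helpers applied before data ever reaches a log call. A sketch (function names are illustrative; real salt handling would treat the salt as a managed secret):

```python
import hashlib

def mask_pan(card_number: str) -> str:
    """Keep only the last 4 digits of a card number."""
    digits = "".join(ch for ch in card_number if ch.isdigit())
    return "*" * (len(digits) - 4) + digits[-4:]

def correlation_hash(value: str, salt: str = "log-salt") -> str:
    """Stable hash so the same identifier correlates across log lines
    without exposing the raw value. The hard-coded salt is illustrative —
    a real system would load it from secret storage."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]

masked = mask_pan("4111 1111 1111 1234")   # "************1234"
```

Centralizing this in shared helpers beats ad hoc masking at each call site: one reviewed code path is far easier to audit than hundreds of individual log statements.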
Best Practices Summary
Do
- Use structured, machine-parseable formats
- Include correlation IDs across all pillars
- Set up alerting for critical metrics
- Document what you're measuring and why
- Regularly review and update dashboards
- Test your observability during incidents
Don't
- Log sensitive data (PII, credentials, PHI)
- Ignore log volume and storage costs
- Alert on every minor issue (alert fatigue)
- Assume logs are being collected without verifying it
- Skip local development observability
- Treat observability as optional
Compliance
This section fulfills ISO 13485 requirements for control of records (4.2.4), process monitoring (8.2.3), product monitoring (8.2.4), data analysis (8.4), and corrective action (8.5.2), and ISO 27001 requirements for logging (A.8.15), monitoring activities (A.8.16), clock synchronization (A.8.17), evidence collection (A.5.28), and security event assessment (A.5.25).