Observability
Comprehensive observability practices - logging, metrics, tracing, and dashboards
Observability is the ability to understand the internal state of your system by examining its outputs. For regulated software, observability is crucial for compliance, incident response, and demonstrating system health to auditors.
The Three Pillars of Observability
| Pillar | Purpose | Example |
|---|---|---|
| Logs | Record discrete events | "User 123 logged in at 10:32:15" |
| Metrics | Measure aggregated data | "Average response time: 245ms" |
| Traces | Track request flow | "Request took 500ms: DB 300ms, API 150ms, Auth 50ms" |
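The three pillars can be seen in miniature in a single instrumented request. A stdlib-only Python sketch (the in-memory stores and the `handle_request` function are illustrative, not a real backend):

```python
import json
import time

# Illustrative in-memory sinks for each pillar
LOGS = []
METRICS = {"requests": 0, "latency_ms": []}
SPANS = []

def handle_request(user_id: str, trace_id: str) -> None:
    """Serve one request, emitting a log event, metrics, and a trace span."""
    start = time.perf_counter()
    # ... real work would happen here ...
    duration_ms = (time.perf_counter() - start) * 1000

    # Pillar 1: a discrete log event
    LOGS.append(json.dumps({"traceId": trace_id, "userId": user_id,
                            "action": "login",
                            "duration_ms": round(duration_ms, 2)}))
    # Pillar 2: aggregated measurements
    METRICS["requests"] += 1
    METRICS["latency_ms"].append(duration_ms)
    # Pillar 3: a trace span for this unit of work
    SPANS.append({"traceId": trace_id, "name": "handle_request",
                  "duration_ms": duration_ms})

handle_request("user-456", "abc123")
```

The same `traceId` appears in the log line and the span, which is what later makes correlation across pillars possible.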
Why Observability Matters
For Software Quality
- Debugging: Quickly identify and fix issues
- Performance: Understand bottlenecks and optimize
- Reliability: Detect problems before users report them
- Capacity Planning: Predict resource needs
For Compliance
| Requirement | How Observability Helps |
|---|---|
| Audit Trails | Logs provide evidence of system actions |
| Access Monitoring | Track who accessed what data when |
| Incident Response | Document and analyze security events |
| SLA Compliance | Metrics prove service level adherence |
| Change Tracking | Log all configuration and code changes |
Observability Maturity Model
Level 1: Basic Logging
- Console output to files
- No structured format
- Manual log analysis
Level 2: Centralized Logging
- Logs aggregated in central system
- Structured logging format
- Basic search and filtering
Level 3: Metrics and Alerting
- Key metrics collected
- Dashboards for visualization
- Alerts for critical thresholds
Level 4: Distributed Tracing
- Request tracing across services
- Latency breakdown by component
- Root cause analysis capability
Level 5: Full Observability
- Correlation across logs, metrics, traces
- AI-powered anomaly detection
- Proactive issue prediction
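Level 3 of the ladder above — key metrics plus threshold alerts — can be sketched with nothing but the standard library. The sample data and the alert thresholds below are illustrative:

```python
# Raw observations: (duration_ms, http_status) per request — sample data
requests = [(120, 200), (95, 200), (480, 500), (210, 200), (640, 200)]

# Aggregate into two key metrics: error rate and p95 latency
error_rate = sum(1 for _, status in requests if status >= 500) / len(requests)
p95_latency = sorted(d for d, _ in requests)[int(0.95 * len(requests)) - 1]

# Alert only when a critical threshold is crossed (values are illustrative)
alerts = []
if error_rate > 0.05:
    alerts.append(f"error rate {error_rate:.0%} exceeds 5%")
if p95_latency > 500:
    alerts.append(f"p95 latency {p95_latency}ms exceeds 500ms")
```

In practice a metrics system such as Prometheus does this aggregation continuously, but the shape of the computation is the same: raw events in, a small number of decision-driving numbers out.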
Core Principles
1. Measure What Matters
Focus on metrics that drive decisions:
The Four Golden Signals (Google SRE):
├── Latency - How long requests take
├── Traffic - How much demand on the system
├── Errors - Rate of failed requests
└── Saturation - How "full" the system is
2. Structured Data
Always use structured formats for machine parsing:
{
"timestamp": "2024-01-15T10:32:15.123Z",
"level": "info",
"service": "user-service",
"traceId": "abc123",
"userId": "user-456",
"action": "login",
"duration_ms": 45,
"success": true
}
3. Correlation
Enable linking related data across pillars:
Request ID: req-789
├── Logs: 5 log entries with traceId=req-789
├── Metrics: Response time 450ms, Status 200
└── Traces: api-gateway → auth-service → user-db
4. Context is Key
Include relevant context with every observation:
| Context | Example |
|---|---|
| Request ID | traceId: "abc-123" |
| User ID | userId: "user-456" |
| Environment | env: "production" |
| Version | version: "1.2.3" |
| Host | host: "server-01" |
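Principles 2 through 4 combine naturally: set the request-scoped context once and let every log line carry it automatically. A sketch using Python's `contextvars` and a `logging.Filter` (field names follow the examples above; the logger wiring is illustrative):

```python
import contextvars
import json
import logging

# Request-scoped context, set once per request
trace_id = contextvars.ContextVar("trace_id", default="-")
user_id = contextvars.ContextVar("user_id", default="-")

class ContextFilter(logging.Filter):
    """Inject the current trace/user context into every log record."""
    def filter(self, record):
        record.traceId = trace_id.get()
        record.userId = user_id.get()
        return True

class JsonFormatter(logging.Formatter):
    """Emit structured, machine-parseable JSON lines."""
    def format(self, record):
        return json.dumps({"level": record.levelname.lower(),
                           "traceId": record.traceId,
                           "userId": record.userId,
                           "message": record.getMessage()})

logger = logging.getLogger("user-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
handler.addFilter(ContextFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# At the start of a request, set the context once...
trace_id.set("req-789")
user_id.set("user-456")
# ...and every subsequent log line is correlatable by traceId.
logger.info("login succeeded")
```

Because the filter runs on every record, developers never have to remember to pass the trace ID by hand, which is where correlation usually breaks down.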
Observability Stack
Common Tool Combinations
Open Source Stack (PLG/LGTM):
- Prometheus: Metrics collection
- Loki: Log aggregation
- Grafana: Visualization
- Tempo/Jaeger: Distributed tracing
Cloud Provider Stacks:
- AWS: CloudWatch, X-Ray
- Azure: Application Insights, Monitor
- GCP: Cloud Logging, Cloud Monitoring, Cloud Trace
Commercial Solutions:
- Datadog, New Relic, Splunk, Elastic
Architecture Example
Implementation Guidelines
For New Projects
- Start with structured logging - Easy to add, immediate value
- Add key metrics early - Response time, error rate, throughput
- Implement health checks - /health and /ready endpoints
- Add tracing when needed - When debugging distributed systems
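The health-check guidance above can be sketched as a plain WSGI app — one liveness endpoint and one readiness endpoint. Paths, payloads, and the `check_dependencies` stub are illustrative; real services usually hang this off their web framework:

```python
import json

def check_dependencies() -> bool:
    # Stub: a real check would ping the database, cache, etc.
    return True

def app(environ, start_response):
    """Minimal WSGI app exposing liveness and readiness endpoints."""
    path = environ.get("PATH_INFO", "/")
    if path == "/health":
        # Liveness: the process is up and able to answer.
        body, status = {"status": "ok"}, "200 OK"
    elif path == "/ready":
        # Readiness: dependencies are reachable, so traffic can be routed here.
        deps_ok = check_dependencies()
        body = {"status": "ready" if deps_ok else "not ready"}
        status = "200 OK" if deps_ok else "503 Service Unavailable"
    else:
        body, status = {"error": "not found"}, "404 Not Found"
    start_response(status, [("Content-Type", "application/json")])
    return [json.dumps(body).encode()]
```

Keeping liveness and readiness separate matters: an orchestrator restarts a pod that fails /health but merely stops routing traffic to one that fails /ready.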
For Existing Projects
- Audit current state - What's being logged/measured?
- Standardize formats - Consistent log structure
- Add missing pillars - Usually metrics and traces
- Create dashboards - Visualize key indicators
Compliance Considerations
Audit Logging Requirements
For regulated systems, certain events MUST be logged:
| Event Type | Required Fields |
|---|---|
| Authentication | User ID, timestamp, success/failure, IP |
| Authorization | User ID, resource, action, decision |
| Data Access | User ID, record ID, action, timestamp |
| Data Modification | User ID, record ID, before/after values |
| System Changes | Admin ID, change type, timestamp |
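The required-fields table can be enforced at write time, so an audit event missing a mandatory field is rejected rather than silently under-logged. The schema literals below mirror three rows of the table; everything else (function name, field spellings) is illustrative:

```python
import datetime
import json

# Required fields per event type, mirroring the table above
REQUIRED_FIELDS = {
    "authentication": {"user_id", "timestamp", "success", "ip"},
    "authorization": {"user_id", "resource", "action", "decision"},
    "data_access": {"user_id", "record_id", "action", "timestamp"},
}

def audit_log(event_type: str, **fields) -> str:
    """Serialize an audit event, refusing incomplete records."""
    missing = REQUIRED_FIELDS[event_type] - fields.keys()
    if missing:
        raise ValueError(f"{event_type} event missing fields: {sorted(missing)}")
    return json.dumps({"event_type": event_type, **fields}, default=str)

entry = audit_log("authentication", user_id="user-456",
                  timestamp=datetime.datetime.now(datetime.timezone.utc),
                  success=True, ip="203.0.113.7")
```

Failing loudly here is deliberate: an audit trail with gaps is often worse for an audit than no trail at all, because it suggests the control exists but does not work.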
Log Retention
| Regulation | Minimum Retention |
|---|---|
| HIPAA | 6 years |
| SOX | 7 years |
| GDPR | Purpose-dependent |
| PCI DSS | 1 year (3 months online) |
Data Protection in Logs
Never log:
- Passwords or authentication tokens
- Full credit card numbers
- Social Security Numbers
- Unencrypted PHI/PII
Do log:
- Masked identifiers (last 4 digits)
- Hashed values for correlation
- Reference IDs, not actual data
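The do/don't lists above can be enforced with small helpers applied before data ever reaches a log call. A sketch (function names are illustrative; real salt handling would treat the salt as a managed secret):

```python
import hashlib

def mask_pan(card_number: str) -> str:
    """Keep only the last 4 digits of a card number."""
    digits = "".join(ch for ch in card_number if ch.isdigit())
    return "*" * (len(digits) - 4) + digits[-4:]

def correlation_hash(value: str, salt: str = "log-salt") -> str:
    """Stable hash so the same identifier correlates across log lines
    without exposing the raw value. The hard-coded salt is illustrative —
    a real system would load it from secret storage."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]

masked = mask_pan("4111 1111 1111 1234")   # "************1234"
```

Centralizing this in shared helpers beats ad hoc masking at each call site: one reviewed code path is far easier to audit than hundreds of individual log statements.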
Best Practices Summary
Do
- Use structured, machine-parseable formats
- Include correlation IDs across all pillars
- Set up alerting for critical metrics
- Document what you're measuring and why
- Regularly review and update dashboards
- Test your observability during incidents
Don't
- Log sensitive data (PII, credentials, PHI)
- Ignore log volume and storage costs
- Alert on every minor issue (alert fatigue)
- Assume logs are being collected without verifying it
- Skip local development observability
- Treat observability as optional
Compliance
This section fulfills ISO 13485 requirements for control of records (4.2.4), process monitoring (8.2.3), product monitoring (8.2.4), data analysis (8.4), and corrective action (8.5.2), and ISO 27001 requirements for logging (A.8.15), monitoring activities (A.8.16), clock synchronization (A.8.17), evidence collection (A.5.28), and security event assessment (A.5.25).