Observability

Comprehensive observability practices - logging, metrics, tracing, and dashboards

Observability is the ability to understand the internal state of your system by examining its outputs. For regulated software, observability is crucial for compliance, incident response, and demonstrating system health to auditors.

The Three Pillars of Observability

| Pillar | Purpose | Example |
|---|---|---|
| Logs | Record discrete events | "User 123 logged in at 10:32:15" |
| Metrics | Measure aggregated data | "Average response time: 245ms" |
| Traces | Track request flow | "Request took 500ms: DB 300ms, API 150ms, Auth 50ms" |

Why Observability Matters

For Software Quality

  • Debugging: Quickly identify and fix issues
  • Performance: Understand bottlenecks and optimize
  • Reliability: Detect problems before users report them
  • Capacity Planning: Predict resource needs

For Compliance

| Requirement | How Observability Helps |
|---|---|
| Audit Trails | Logs provide evidence of system actions |
| Access Monitoring | Track who accessed what data when |
| Incident Response | Document and analyze security events |
| SLA Compliance | Metrics prove service level adherence |
| Change Tracking | Log all configuration and code changes |

Observability Maturity Model

Level 1: Basic Logging

  • Console output to files
  • No structured format
  • Manual log analysis

Level 2: Centralized Logging

  • Logs aggregated in central system
  • Structured logging format
  • Basic search and filtering

Level 3: Metrics and Alerting

  • Key metrics collected
  • Dashboards for visualization
  • Alerts for critical thresholds
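
Threshold alerting at this level can be as simple as comparing a metrics snapshot against configured limits. A minimal sketch follows; the threshold values are illustrative, not recommendations:

```python
# Illustrative alert thresholds (assumed values, tune per service).
THRESHOLDS = {"error_rate": 0.05, "latency_ms": 500}

def evaluate_alerts(metrics, thresholds=THRESHOLDS):
    """Return the names of metrics that exceed their thresholds."""
    return [name for name, limit in thresholds.items()
            if metrics.get(name, 0) > limit]

# A 10% error rate breaches the 5% limit; 200ms latency is within bounds.
firing = evaluate_alerts({"error_rate": 0.10, "latency_ms": 200})
```

In practice this evaluation lives in an alerting system (Prometheus Alertmanager, CloudWatch alarms), not application code, but the comparison logic is the same.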

Level 4: Distributed Tracing

  • Request tracing across services
  • Latency breakdown by component
  • Root cause analysis capability

Level 5: Full Observability

  • Correlation across logs, metrics, traces
  • AI-powered anomaly detection
  • Proactive issue prediction

Core Principles

1. Measure What Matters

Focus on metrics that drive decisions:

The Four Golden Signals (Google SRE):
├── Latency      - How long requests take
├── Traffic      - How much demand on the system
├── Errors       - Rate of failed requests
└── Saturation   - How "full" the system is
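
As a sketch of how the first three signals can be derived from per-request samples, here is a minimal in-process tracker over a rolling time window. This is illustrative only: real systems export these to a metrics backend, and saturation usually comes from resource metrics (CPU, memory, queue depth) rather than from request samples:

```python
import time
from collections import deque

class GoldenSignals:
    """Minimal rolling-window tracker for latency, traffic, and errors."""

    def __init__(self, window_s=60):
        self.window_s = window_s
        self.samples = deque()  # (timestamp, latency_ms, is_error)

    def record(self, latency_ms, is_error=False, now=None):
        now = time.monotonic() if now is None else now
        self.samples.append((now, latency_ms, is_error))
        self._evict(now)

    def _evict(self, now):
        # Drop samples that have fallen out of the rolling window.
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()

    def snapshot(self, now=None):
        now = time.monotonic() if now is None else now
        self._evict(now)
        n = len(self.samples)
        if n == 0:
            return {"mean_latency_ms": 0.0, "traffic_rps": 0.0, "error_rate": 0.0}
        return {
            "mean_latency_ms": sum(s[1] for s in self.samples) / n,  # Latency
            "traffic_rps": n / self.window_s,                        # Traffic
            "error_rate": sum(1 for s in self.samples if s[2]) / n,  # Errors
        }
```

A mean hides tail latency; production dashboards typically track percentiles (p50/p95/p99) instead.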

2. Structured Data

Always use structured formats for machine parsing:

{
  "timestamp": "2024-01-15T10:32:15.123Z",
  "level": "info",
  "service": "user-service",
  "traceId": "abc123",
  "userId": "user-456",
  "action": "login",
  "duration_ms": 45,
  "success": true
}
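
One way to emit entries in this shape is a custom formatter on the standard library logger. A minimal sketch, assuming the field names above (they mirror the example and are not a standard schema):

```python
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    """Format each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "timestamp": time.strftime(
                "%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname.lower(),
            "service": "user-service",  # assumed service name
            "message": record.getMessage(),
        }
        # Merge structured context passed via `extra={"context": {...}}`.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

logger = logging.getLogger("user-service")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("login", extra={"context": {
    "traceId": "abc123", "userId": "user-456",
    "duration_ms": 45, "success": True}})
```

Libraries such as structlog or python-json-logger do this more robustly (exception info, reserved-field handling), but the principle is the same: one machine-parseable object per event.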

3. Correlation

Enable linking related data across pillars:

Request ID: req-789
├── Logs:    5 log entries with traceId=req-789
├── Metrics: Response time 450ms, Status 200
└── Traces:  api-gateway → auth-service → user-db
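
Correlation like this usually rests on a request-scoped ID that every observation picks up automatically. A sketch using `contextvars` from the standard library (the function names here are illustrative, not a library API):

```python
import contextvars
import uuid

# Request-scoped correlation ID: set once at the edge of the request,
# readable anywhere downstream without threading it through every call.
_trace_id = contextvars.ContextVar("trace_id", default=None)

def start_request(trace_id=None):
    """Called at request entry; generates an ID if the caller has none."""
    _trace_id.set(trace_id or f"req-{uuid.uuid4().hex[:8]}")

def current_trace_id():
    return _trace_id.get()

def log(message, **fields):
    """Every log entry automatically carries the correlation ID."""
    return {"traceId": current_trace_id(), "message": message, **fields}

start_request("req-789")
entry = log("user lookup", duration_ms=450, status=200)
```

The same ID would be attached as a metric label and as the trace ID, which is what makes cross-pillar queries like the tree above possible. In distributed systems, standards such as W3C Trace Context define how the ID propagates across service boundaries via HTTP headers.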

4. Context is Key

Include relevant context with every observation:

| Context | Example |
|---|---|
| Request ID | traceId: "abc-123" |
| User ID | userId: "user-456" |
| Environment | env: "production" |
| Version | version: "1.2.3" |
| Host | host: "server-01" |

Observability Stack

Common Tool Combinations

Open Source Stack (PLG/LGTM):

  • Prometheus: Metrics collection
  • Loki: Log aggregation
  • Grafana: Visualization
  • Tempo/Jaeger: Distributed tracing

Cloud Provider Stacks:

  • AWS: CloudWatch, X-Ray
  • Azure: Application Insights, Monitor
  • GCP: Cloud Logging, Cloud Monitoring, Cloud Trace

Commercial Solutions:

  • Datadog, New Relic, Splunk, Elastic

Architecture Example

[Diagram: Observability Stack]

Implementation Guidelines

For New Projects

  1. Start with structured logging - Easy to add, immediate value
  2. Add key metrics early - Response time, error rate, throughput
  3. Implement health checks - /health and /ready endpoints
  4. Add tracing when needed - When debugging distributed systems
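
The health-check step can be sketched with nothing but the standard library. A liveness endpoint answers "is the process up?"; a readiness endpoint additionally asks "are my dependencies reachable?". The probe names here are hypothetical:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def readiness_status(checks):
    """Evaluate dependency probes; return 200 only if all pass."""
    failed = [name for name, probe in checks.items() if not probe()]
    if failed:
        return 503, {"status": "not ready", "failed": failed}
    return 200, {"status": "ready"}

# Hypothetical probes; a real service would ping its DB, cache, etc.
READINESS_CHECKS = {"database": lambda: True}

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # Liveness: the process is up and serving requests.
            self._reply(200, {"status": "ok"})
        elif self.path == "/ready":
            # Readiness: dependencies are reachable too.
            self._reply(*readiness_status(READINESS_CHECKS))
        else:
            self._reply(404, {"error": "not found"})

    def _reply(self, code, body):
        data = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

# To serve: HTTPServer(("", 8080), HealthHandler).serve_forever()
```

Orchestrators such as Kubernetes probe these two endpoints separately, which is why the distinction matters: a failing readiness check removes the instance from load balancing, while a failing liveness check restarts it.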

For Existing Projects

  1. Audit current state - What's being logged/measured?
  2. Standardize formats - Consistent log structure
  3. Add missing pillars - Usually metrics and traces
  4. Create dashboards - Visualize key indicators

Compliance Considerations

Audit Logging Requirements

For regulated systems, certain events MUST be logged:

| Event Type | Required Fields |
|---|---|
| Authentication | User ID, timestamp, success/failure, IP |
| Authorization | User ID, resource, action, decision |
| Data Access | User ID, record ID, action, timestamp |
| Data Modification | User ID, record ID, before/after values |
| System Changes | Admin ID, change type, timestamp |
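
Because these fields are mandatory, it helps to enforce them at the point of emission rather than hope every call site remembers. A sketch of a validating audit logger; the field sets shown are illustrative, not a regulatory checklist:

```python
import json
import time

# Illustrative required-field sets per event type (subset of the table above).
REQUIRED_FIELDS = {
    "authentication": {"user_id", "success", "ip"},
    "data_access": {"user_id", "record_id", "action"},
}

def audit_event(event_type, **fields):
    """Emit an audit entry, refusing to log one with missing fields."""
    missing = REQUIRED_FIELDS.get(event_type, set()) - fields.keys()
    if missing:
        raise ValueError(f"audit event missing fields: {sorted(missing)}")
    return json.dumps({"event": event_type,
                       "timestamp": time.time(), **fields})
```

Failing loudly on an incomplete audit record is usually preferable to silently writing one that will not satisfy an auditor.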

Log Retention

| Regulation | Minimum Retention |
|---|---|
| HIPAA | 6 years |
| SOX | 7 years |
| GDPR | Purpose-dependent |
| PCI DSS | 1 year (3 months online) |

Data Protection in Logs

Never log:

  • Passwords or authentication tokens
  • Full credit card numbers
  • Social Security Numbers
  • Unencrypted PHI/PII

Do log:

  • Masked identifiers (last 4 digits)
  • Hashed values for correlation
  • Reference IDs, not actual data
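
Masking and hashing can be sketched in a few lines. Note the salt here is an assumed constant for illustration; a real system would use a keyed HMAC with a secret so the hashes cannot be brute-forced from known identifiers:

```python
import hashlib

def mask_card(number):
    """Keep only the last four digits of a card number."""
    digits = "".join(ch for ch in number if ch.isdigit())
    return "*" * (len(digits) - 4) + digits[-4:]

def correlation_hash(value, salt="log-salt"):
    """One-way hash so the same value can be correlated across log
    entries without exposing the raw identifier. The salt is an
    illustrative constant; use a keyed HMAC with a secret in practice."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]
```

The same `correlation_hash` output appears in every entry for a given user, so incidents can still be investigated end to end without the logs ever containing the raw identifier.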

Best Practices Summary

Do

  • Use structured, machine-parseable formats
  • Include correlation IDs across all pillars
  • Set up alerting for critical metrics
  • Document what you're measuring and why
  • Regularly review and update dashboards
  • Test your observability during incidents

Don't

  • Log sensitive data (PII, credentials, PHI)
  • Ignore log volume and storage costs
  • Alert on every minor issue (alert fatigue)
  • Assume logs are being collected
  • Skip local development observability
  • Treat observability as optional


Compliance

This section fulfills ISO 13485 requirements for control of records (4.2.4), process monitoring (8.2.3), product monitoring (8.2.4), data analysis (8.4), and corrective action (8.5.2), and ISO 27001 requirements for logging (A.8.15), monitoring activities (A.8.16), clock synchronization (A.8.17), evidence collection (A.5.28), and security event assessment (A.5.25).
