
Monitoring

Dubby exposes Prometheus metrics, structured logs, and audit trails out of the box. This page covers how to collect and use them. For the built-in admin dashboard UI, see Dashboard.

The server exposes a /metrics endpoint in Prometheus text format. Metrics are enabled by default and can be disabled with the DUBBY_METRICS_ENABLED=false environment variable.
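For example, to turn metrics off on Kubernetes, set the variable in the server container spec (standard Kubernetes `env` syntax; the surrounding Deployment fields are omitted here):

```yaml
# Container env fragment — disables the /metrics endpoint
env:
  - name: DUBBY_METRICS_ENABLED
    value: "false"
```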

Point your Prometheus instance at the server pod on port 3000:

prometheus.yml
scrape_configs:
  - job_name: dubby
    scrape_interval: 15s
    static_configs:
      - targets: ['dubby-server:3000']

On Kubernetes with the Prometheus Operator, add pod annotations instead:

server:
  podAnnotations:
    prometheus.io/scrape: 'true'
    prometheus.io/port: '3000'
    prometheus.io/path: '/metrics'

Or create a ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dubby
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dubby
      app.kubernetes.io/component: server
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
Streaming

| Metric | Type | Description |
| --- | --- | --- |
| dubby_streaming_active_sessions | Gauge | Current active playback sessions |
| dubby_streaming_concurrent_users | Gauge | Distinct users with active sessions |
| dubby_streaming_sessions_created_total | Counter | Total sessions by type (direct_play / transcode) |
| dubby_streaming_session_duration_seconds | Histogram | Session duration |
| dubby_streaming_transcode_startup_seconds | Histogram | Time to first segment |
| dubby_streaming_seek_duration_seconds | Histogram | Seek operation latency |
| dubby_streaming_ffmpeg_processes | Gauge | Active FFmpeg processes |
| dubby_streaming_playback_tier_total | Counter | Sessions by playback mode tier |
| dubby_streaming_errors_total | Counter | Errors by type (ffmpeg_crash, segment_timeout, etc.) |
| dubby_streaming_bytes_served_total | Counter | Total bytes delivered to clients |

Workflows

| Metric | Type | Description |
| --- | --- | --- |
| dubby_workflow_active | Gauge | Currently running workflows by type |
| dubby_workflow_queue_depth | Gauge | Job queue depth by queue name |
| dubby_workflow_completed_total | Counter | Completed workflows by type and status |
| dubby_workflow_duration_seconds | Histogram | Workflow total duration |
| dubby_workflow_step_duration_seconds | Histogram | Individual step duration |

API

| Metric | Type | Description |
| --- | --- | --- |
| dubby_api_requests_total | Counter | Total requests by router/procedure/status |
| dubby_api_request_duration_seconds | Histogram | Request duration |
| dubby_api_errors_total | Counter | Errors by router and error code |

System

| Metric | Type | Description |
| --- | --- | --- |
| dubby_system_memory_bytes | Gauge | Process memory by type (rss, heap_used, heap_total) |
| dubby_system_uptime_seconds | Gauge | Process uptime |
| dubby_system_cpu_seconds_total | Gauge | CPU time by type (user, system) |
| dubby_system_transcode_cache_bytes | Gauge | Transcode cache disk usage |

Library and metadata

| Metric | Type | Description |
| --- | --- | --- |
| dubby_library_items_total | Gauge | Library items by content type |
| dubby_library_storage_bytes | Gauge | Total library storage |
| dubby_metadata_requests_total | Counter | External metadata API requests by provider and status |

Build Grafana dashboards on top of the metrics above. Some useful panels:

  • Active streams: dubby_streaming_active_sessions
  • Transcode startup P95: histogram_quantile(0.95, rate(dubby_streaming_transcode_startup_seconds_bucket[5m]))
  • API error rate: rate(dubby_api_errors_total[5m])
  • Workflow queue depth: dubby_workflow_queue_depth — a growing queue means the worker is falling behind
  • Cache disk usage: dubby_system_transcode_cache_bytes — alert if approaching your disk limit
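A useful derived panel is the API error ratio, which combines two of the counters listed above into a single fraction (a sketch; adjust the rate window to your scrape interval):

```promql
sum(rate(dubby_api_errors_total[5m])) / sum(rate(dubby_api_requests_total[5m]))
```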
# Prometheus alerting rules
groups:
  - name: dubby
    rules:
      - alert: DubbyDown
        expr: up{job="dubby"} == 0
        for: 1m
        annotations:
          summary: Dubby server is down
      - alert: DubbyHighTranscodeStartup
        expr: histogram_quantile(0.95, rate(dubby_streaming_transcode_startup_seconds_bucket[5m])) > 10
        for: 5m
        annotations:
          summary: Transcode startup P95 exceeds 10 seconds
      - alert: DubbyWorkflowQueueBacklog
        expr: dubby_workflow_queue_depth > 50
        for: 10m
        annotations:
          summary: Workflow queue depth has exceeded 50 for 10 minutes
      - alert: DubbyCacheDiskHigh
        expr: dubby_system_transcode_cache_bytes > 50e9
        for: 5m
        annotations:
          summary: Transcode cache exceeds 50 GB

Dubby uses structured logging (pino) with configurable level.

| Environment variable | Default | Options |
| --- | --- | --- |
| LOG_LEVEL | info | debug, info, warn, error |

Log format is determined by NODE_ENV: JSON in production, pretty-printed with color in development. There is no separate format toggle.

In production (NODE_ENV=production), each log line is a single JSON object suitable for ingestion by Loki, Elasticsearch, Datadog, or any structured log collector:

{
  "level": "info",
  "time": 1710000000000,
  "service": "streaming",
  "sessionId": "abc123",
  "msg": "session created"
}
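Any JSON-aware tool can consume these lines directly. As a quick illustration (Python here purely as an example consumer), the sample line above parses like this — note that pino's `time` field is Unix epoch milliseconds, not seconds:

```python
import json
from datetime import datetime, timezone

# One production log line, as in the sample above
line = '{"level":"info","time":1710000000000,"service":"streaming","sessionId":"abc123","msg":"session created"}'

entry = json.loads(line)
# pino timestamps are epoch milliseconds, so divide by 1000
ts = datetime.fromtimestamp(entry["time"] / 1000, tz=timezone.utc)
print(f'{ts.isoformat()} [{entry["level"]}] {entry["service"]}: {entry["msg"]}')
# → 2024-03-09T16:00:00+00:00 [info] streaming: session created
```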

In development, logs are human-readable with color (via pino-pretty).

Dubby logs to stdout, so any Kubernetes log collector works:

  • Loki + Promtail/Alloy — scrape pod logs by label
  • Fluentd / Fluent Bit — parse JSON logs natively
  • Datadog Agent — auto-discovers pod logs

No sidecar is needed. Example Promtail scrape config:

scrape_configs:
  - job_name: dubby
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        regex: dubby
        action: keep
    pipeline_stages:
      - json:
          expressions:
            level: level
            service: service
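Once ingested into Loki, log lines can be filtered at query time with LogQL — a sketch, assuming a namespace label is mapped from pod metadata (the Promtail scrape config above would need a corresponding relabel rule to attach it):

```logql
{namespace="dubby"} | json | level="error"
```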
| Level | What it includes |
| --- | --- |
| error | Unrecoverable failures (crashed FFmpeg, database errors) |
| warn | Recoverable issues (retry, timeout, missing metadata) |
| info | Key lifecycle events (session created, scan started, migration applied) |
| debug | Verbose internals (FFmpeg args, SQL queries, segment timing) |
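To capture the verbose internals listed under debug, raise LOG_LEVEL on the server container — for example with standard Kubernetes `env` syntax (the surrounding Deployment fields are omitted):

```yaml
# Container env fragment — enables debug-level logging
env:
  - name: LOG_LEVEL
    value: debug
```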

Dubby records security-relevant events to an audit log stored in the database. Audit logs are accessible to admins via the API and the admin UI.

| Action | When it’s recorded |
| --- | --- |
| login_attempt | Every login (success or failure), with email and IP |
| user_create | New user account created |
| user_delete | User account deleted |
| config_change | Server configuration modified |
| privacy_change | Privacy settings updated |
| data_access | User data accessed |
| data_export | User data exported (GDPR) |
| data_delete | User data deleted (GDPR) |
| external_request | Outbound request to external service |

REST API:

# List recent audit logs (admin only)
curl -H "Authorization: Bearer <token>" \
  "https://dubby.example.com/api/v1/audit-logs?limit=50"

# Filter by action
curl -H "Authorization: Bearer <token>" \
  "https://dubby.example.com/api/v1/audit-logs?action=login_attempt"

# Aggregate stats
curl -H "Authorization: Bearer <token>" \
  https://dubby.example.com/api/v1/audit-logs/stats

Audit log retention is configurable in the privacy settings:

config:
  privacy:
    auditRetentionDays: 90 # delete entries older than 90 days

Set to null (the default) to retain indefinitely.

The server exposes health check endpoints used by Kubernetes probes and external monitoring:

| Endpoint | Purpose | Checks |
| --- | --- | --- |
| GET /health | Basic health | Returns { status: "ok" } |
| GET /health/live | Liveness probe | Always returns { status: "alive" } unless the process is hanging |
| GET /health/ready | Readiness probe | Verifies database connectivity with SELECT 1 |

The readiness endpoint returns HTTP 503 during startup (before migrations complete) and during graceful shutdown, preventing traffic from reaching a pod that isn’t ready to serve requests.

These endpoints do not require authentication.
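Wiring these into Kubernetes probes uses the standard probe fields — a sketch, assuming the server listens on port 3000 as in the metrics examples above (adjust timings to your environment):

```yaml
# Pod container fragment — probe wiring for the health endpoints
livenessProbe:
  httpGet:
    path: /health/live
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health/ready
    port: 3000
  periodSeconds: 5
```

Because /health/ready returns 503 until migrations finish, the readiness probe keeps the pod out of Service endpoints during startup without any extra configuration.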