# Monitoring
Dubby exposes Prometheus metrics, structured logs, and audit trails out of the box. This page covers how to collect and use them. For the built-in admin dashboard UI, see Dashboard.
## Prometheus metrics

The server exposes a `/metrics` endpoint in Prometheus text exposition format. Metrics are enabled by default and can be disabled by setting the `DUBBY_METRICS_ENABLED=false` environment variable.
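A quick way to confirm the endpoint is up and emitting data (the hostname and port are illustrative; substitute your own):

```shell
# Prints the first metric families in Prometheus text exposition format
curl -s http://dubby-server:3000/metrics | head -n 20
```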
### Scrape configuration

Point your Prometheus instance at the server on port 3000:

```yaml
scrape_configs:
  - job_name: dubby
    scrape_interval: 15s
    static_configs:
      - targets: ['dubby-server:3000']
```

On Kubernetes with annotation-based discovery, add pod annotations instead:

```yaml
server:
  podAnnotations:
    prometheus.io/scrape: 'true'
    prometheus.io/port: '3000'
    prometheus.io/path: '/metrics'
```

With the Prometheus Operator, create a ServiceMonitor:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dubby
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dubby
      app.kubernetes.io/component: server
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
```

### Available metrics
#### Streaming

| Metric | Type | Description |
|---|---|---|
| `dubby_streaming_active_sessions` | Gauge | Current active playback sessions |
| `dubby_streaming_concurrent_users` | Gauge | Distinct users with active sessions |
| `dubby_streaming_sessions_created_total` | Counter | Total sessions by type (`direct_play` / `transcode`) |
| `dubby_streaming_session_duration_seconds` | Histogram | Session duration |
| `dubby_streaming_transcode_startup_seconds` | Histogram | Time to first segment |
| `dubby_streaming_seek_duration_seconds` | Histogram | Seek operation latency |
| `dubby_streaming_ffmpeg_processes` | Gauge | Active FFmpeg processes |
| `dubby_streaming_playback_tier_total` | Counter | Sessions by playback mode tier |
| `dubby_streaming_errors_total` | Counter | Errors by type (`ffmpeg_crash`, `segment_timeout`, etc.) |
| `dubby_streaming_bytes_served_total` | Counter | Total bytes delivered to clients |
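For example, the share of new sessions that need transcoding can be derived from the sessions counter. A sketch (the `type` label name is an assumption based on the table above):

```promql
sum(rate(dubby_streaming_sessions_created_total{type="transcode"}[5m]))
/
sum(rate(dubby_streaming_sessions_created_total[5m]))
```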
#### Workflows (background jobs)

| Metric | Type | Description |
|---|---|---|
| `dubby_workflow_active` | Gauge | Currently running workflows by type |
| `dubby_workflow_queue_depth` | Gauge | Job queue depth by queue name |
| `dubby_workflow_completed_total` | Counter | Completed workflows by type and status |
| `dubby_workflow_duration_seconds` | Histogram | Workflow total duration |
| `dubby_workflow_step_duration_seconds` | Histogram | Individual step duration |
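A per-type failure-rate query over the completion counter might look like this (the `status` label and its `failed` value are assumptions based on the table above; check your `/metrics` output for the exact label values):

```promql
sum by (type) (rate(dubby_workflow_completed_total{status="failed"}[5m]))
```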
#### API

| Metric | Type | Description |
|---|---|---|
| `dubby_api_requests_total` | Counter | Total requests by router, procedure, and status |
| `dubby_api_request_duration_seconds` | Histogram | Request duration |
| `dubby_api_errors_total` | Counter | Errors by router and error code |
#### System

| Metric | Type | Description |
|---|---|---|
| `dubby_system_memory_bytes` | Gauge | Process memory by type (`rss`, `heap_used`, `heap_total`) |
| `dubby_system_uptime_seconds` | Gauge | Process uptime |
| `dubby_system_cpu_seconds_total` | Gauge | CPU time by type (`user`, `system`) |
| `dubby_system_transcode_cache_bytes` | Gauge | Transcode cache disk usage |
#### Library

| Metric | Type | Description |
|---|---|---|
| `dubby_library_items_total` | Gauge | Library items by content type |
| `dubby_library_storage_bytes` | Gauge | Total library storage |
| `dubby_metadata_requests_total` | Counter | External metadata API requests by provider and status |
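Because `dubby_library_storage_bytes` is a gauge, Prometheus can extrapolate its growth for capacity planning. For example, projected library storage one week out (a sketch; tune the lookback window to your scan frequency):

```promql
predict_linear(dubby_library_storage_bytes[1d], 7 * 86400)
```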
## Grafana

Import the metrics above into Grafana to build dashboards. Some useful panels:

- Active streams: `dubby_streaming_active_sessions`
- Transcode startup P95: `histogram_quantile(0.95, rate(dubby_streaming_transcode_startup_seconds_bucket[5m]))`
- API error rate: `rate(dubby_api_errors_total[5m])`
- Workflow queue depth: `dubby_workflow_queue_depth` (a growing queue means the worker is falling behind)
- Cache disk usage: `dubby_system_transcode_cache_bytes` (alert if it approaches your disk limit)
## Alerting examples

```yaml
# Prometheus alerting rules
groups:
  - name: dubby
    rules:
      - alert: DubbyDown
        expr: up{job="dubby"} == 0
        for: 1m
        annotations:
          summary: Dubby server is down

      - alert: DubbyHighTranscodeStartup
        expr: histogram_quantile(0.95, rate(dubby_streaming_transcode_startup_seconds_bucket[5m])) > 10
        for: 5m
        annotations:
          summary: Transcode startup P95 exceeds 10 seconds

      - alert: DubbyWorkflowQueueBacklog
        expr: dubby_workflow_queue_depth > 50
        for: 10m
        annotations:
          summary: Workflow queue depth has exceeded 50 for 10 minutes

      - alert: DubbyCacheDiskHigh
        expr: dubby_system_transcode_cache_bytes > 50e9
        for: 5m
        annotations:
          summary: Transcode cache exceeds 50 GB
```

## Logging

Dubby uses structured logging (via pino) with a configurable log level.
### Configuration

| Environment variable | Default | Options |
|---|---|---|
| `LOG_LEVEL` | `info` | `debug`, `info`, `warn`, `error` |

Log format is determined by `NODE_ENV`: JSON in production, pretty-printed with color in development. There is no separate format toggle.
In production (`NODE_ENV=production`), each log line is a single JSON object suitable for ingestion by Loki, Elasticsearch, Datadog, or any structured log collector:

```json
{
  "level": "info",
  "time": 1710000000000,
  "service": "streaming",
  "sessionId": "abc123",
  "msg": "session created"
}
```

In development, logs are human-readable with color (via pino-pretty).
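Because each production line is self-describing JSON, ad-hoc filtering works with standard tools. A quick sketch with `jq` (assuming `jq` is installed; the sample log lines and file name are illustrative):

```shell
# Sample of Dubby's production JSON log lines (illustrative)
cat > dubby.log <<'EOF'
{"level":"info","time":1710000000000,"service":"api","msg":"request ok"}
{"level":"error","time":1710000000500,"service":"streaming","msg":"ffmpeg crashed"}
EOF

# Keep only warn/error lines and project a few fields
jq -c 'select(.level == "warn" or .level == "error") | {time, service, msg}' dubby.log
```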
### Collecting logs on Kubernetes

Dubby logs to stdout, so any Kubernetes log collector works:

- Loki + Promtail/Alloy: scrape pod logs by label
- Fluentd / Fluent Bit: parse JSON logs natively
- Datadog Agent: auto-discovers pod logs

No sidecar is needed. Example Promtail scrape config:

```yaml
scrape_configs:
  - job_name: dubby
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        regex: dubby
        action: keep
    pipeline_stages:
      - json:
          expressions:
            level: level
            service: service
```

### Log levels
| Level | What it includes |
|---|---|
| `error` | Unrecoverable failures (crashed FFmpeg, database errors) |
| `warn` | Recoverable issues (retry, timeout, missing metadata) |
| `info` | Key lifecycle events (session created, scan started, migration applied) |
| `debug` | Verbose internals (FFmpeg args, SQL queries, segment timing) |
## Audit logs

Dubby records security-relevant events to an audit log stored in the database. Audit logs are accessible to admins via the API and the admin UI.
### Tracked events

| Action | When it’s recorded |
|---|---|
| `login_attempt` | Every login (success or failure), with email and IP |
| `user_create` | New user account created |
| `user_delete` | User account deleted |
| `config_change` | Server configuration modified |
| `privacy_change` | Privacy settings updated |
| `data_access` | User data accessed |
| `data_export` | User data exported (GDPR) |
| `data_delete` | User data deleted (GDPR) |
| `external_request` | Outbound request to an external service |
### Querying audit logs

REST API:

```shell
# List recent audit logs (admin only)
curl -H "Authorization: Bearer <token>" \
  "https://dubby.example.com/api/v1/audit-logs?limit=50"

# Filter by action
curl -H "Authorization: Bearer <token>" \
  "https://dubby.example.com/api/v1/audit-logs?action=login_attempt"

# Aggregate stats
curl -H "Authorization: Bearer <token>" \
  https://dubby.example.com/api/v1/audit-logs/stats
```

### Retention

Audit log retention is configurable in the privacy settings:

```yaml
config:
  privacy:
    auditRetentionDays: 90 # delete entries older than 90 days
```

Set to `null` (the default) to retain entries indefinitely.
## Health endpoints

The server exposes health check endpoints used by Kubernetes probes and external monitoring:

| Endpoint | Purpose | Checks |
|---|---|---|
| `GET /health/` | Basic health | Returns `{ status: "ok" }` |
| `GET /health/live` | Liveness probe | Always returns `{ status: "alive" }` unless the process is hanging |
| `GET /health/ready` | Readiness probe | Verifies database connectivity with `SELECT 1` |
The readiness endpoint returns HTTP 503 during startup (before migrations complete) and during graceful shutdown, preventing traffic from reaching a pod that isn’t ready to serve requests.
These endpoints do not require authentication.
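Wired into a Kubernetes Deployment, these endpoints map onto probes like this (a sketch; the port and timing values are illustrative, not defaults shipped with Dubby):

```yaml
# Container spec fragment
livenessProbe:
  httpGet:
    path: /health/live
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health/ready
    port: 3000
  periodSeconds: 5
  failureThreshold: 3
```

Keeping the liveness check on `/health/live` (rather than `/health/ready`) matters: a pod that is merely waiting on the database should be pulled out of rotation, not restarted.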