Unified observability and monitoring for data & analytics
Datadog is a cloud-native, cloud-scale observability platform that collects metrics, traces, and logs across infrastructure, applications, and user experience in hybrid and multi-cloud environments. Its core capability is correlating metrics, traces, and logs in a single platform to surface root cause, with real-time dashboards and alerting that help SREs and engineers detect, troubleshoot, and optimize systems. Datadog's differentiator is broad agent and integration coverage (400+ integrations) plus APM, logs, and metrics in one UI, serving SRE, DevOps, and platform engineering teams from startups to enterprises. Pricing is modular and per-product: free tiers cover limited hosts and logs, and pay-as-you-go plans let teams scale observability costs as they grow.
Datadog is a software-as-a-service observability and monitoring platform founded by Olivier Pomel and Alexis Lê-Quôc and launched to serve cloud-native infrastructure teams. Positioned as an end-to-end observability suite, Datadog unifies infrastructure monitoring, application performance monitoring (APM), log management, synthetic and real user monitoring, and security signals into a single data platform. The core value proposition is to ingest telemetry—metrics, traces, and logs—at scale, correlate them automatically, and present actionable context so teams can reduce MTTD/MTTR across distributed systems.
Datadog’s product portfolio comprises distinct but integrated capabilities. Infrastructure Monitoring captures host and container metrics with the Datadog Agent and supports custom metrics, tagging, and out-of-the-box dashboards. APM (Application Performance Monitoring) traces requests end-to-end and surfaces latency hotspots, service maps, and flame graphs; it supports sampling controls and distributed tracing for languages including Java, Python, Go, and Node.js.
Log Management ingests and indexes logs with retention and live-tail options, plus structured log processing pipelines and log rehydration; billing is based on ingested and indexed volumes. Synthetic Monitoring runs scripted API and browser checks for performance baselines, while Real User Monitoring (RUM) measures front-end performance and errors. Security features include Cloud SIEM and Application Security Management for threat detection.
Datadog’s pricing is modular and per-product, with usage-based meters for each service. There is a free tier for Infrastructure (a limited host count, with 1-day metric retention on some plans) and a free APM trial; Log Management offers free ingest sampling and a limited indexing tier. Common published prices (subject to change) include Infrastructure Pro at a per-host, per-month rate, APM billed per host or per million spans, and Log Management billed per GB ingested and per GB indexed; Synthetic Monitoring and RUM have per-check and per-session rates respectively.
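Because each product meters independently, a rough monthly estimate is just a sum of per-unit charges. The sketch below uses assumed, illustrative rates (only the ~$18/host Infrastructure figure is cited elsewhere in this article; the log rates are placeholders), so substitute current figures from Datadog's pricing page before relying on any number.

```python
# Illustrative per-unit rates -- NOT published prices; replace with
# current figures from Datadog's pricing page.
RATES = {
    "infra_per_host": 18.00,      # Infrastructure Pro, per host/month (approx., cited above)
    "logs_ingest_per_gb": 0.10,   # assumed placeholder
    "logs_index_per_gb": 1.70,    # assumed placeholder
}

def estimate_monthly_cost(hosts: int, ingest_gb: float, index_gb: float) -> float:
    """Sum the independent per-product meters into one monthly figure."""
    return round(
        hosts * RATES["infra_per_host"]
        + ingest_gb * RATES["logs_ingest_per_gb"]
        + index_gb * RATES["logs_index_per_gb"],
        2,
    )

# 20 hosts, 500 GB/month ingested, 150 GB/month indexed
print(estimate_monthly_cost(hosts=20, ingest_gb=500, index_gb=150))  # 665.0
```

The useful takeaway is structural: indexed log volume is often the dominant line item, which is why the cost-control levers discussed later (exclusion filters, sampling, archival) target indexing first.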
Enterprise customers can purchase annual commitments and negotiated volume discounts; Datadog also offers cost controls like ingestion pipelines and retention settings to manage bills. Datadog is used by SREs and platform teams to monitor microservices and cloud infrastructure, by backend engineers to debug latency using APM traces, and by product teams to measure frontend performance with RUM. Examples: an SRE using Datadog APM to reduce P95 latency by tracing service calls, and a DevOps engineer using Infrastructure Monitoring plus Log Management to cut incident detection time.
For teams evaluating alternatives, Datadog is often compared to New Relic—Datadog emphasizes agent-based integrations and cross-product correlation, while New Relic packages telemetry differently and sometimes favors per-user or per-entity models.
Three capabilities set Datadog apart from its nearest competitors: breadth of integrations (400+), automatic correlation of metrics, traces, and logs, and a single UI spanning APM, logs, and infrastructure.
Current tiers and what you get at each price point. Confirm figures against the vendor's pricing page before budgeting, as rates change.
| Plan | Price | What you get | Best for |
|---|---|---|---|
| Free | Free | Limited host metrics, basic dashboards, 1-day metric retention on some plans | Small teams evaluating observability features |
| Pro / Pay-as-you-go | Varies by product (e.g., Infrastructure ~$18/host/month) | Per-host or per-GB billing; feature access depends on product | Growing teams needing full APM/logs per product |
| Enterprise / Annual | Custom (volume discounts negotiated) | Includes advanced security, SAML, longer retention options | Large orgs needing SLAs and enterprise controls |
Copy these prompts into your LLM assistant as-is. Each targets a different high-value Datadog workflow.
Role: You are a Datadog monitoring engineer. Constraints: produce a single Datadog monitor definition for host CPU usage that triggers on sustained spikes, include severity tags, a recovery condition, and limit noise with a short-term aggregation. Input (replace placeholders): service_name, env (prod/stage), host_tag. Output format: JSON object with fields: name, type, query, message, tags, options (thresholds, evaluation_delay, notify_no_data, renotify_interval). Example: show a monitor that alerts at >85% CPU for 5 minutes and warns at >70% for 10 minutes. Provide the exact monitor query and message payload ready to paste into Datadog API or UI.
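For reference, here is a hedged sketch of the monitor JSON this prompt should produce. The query follows Datadog's metric-alert syntax, but the service tag, notification handle, and threshold values are placeholders to adapt before pasting into the API or UI.

```python
import json

# Placeholder CPU monitor: query syntax follows Datadog's metric-alert
# form; service_name, @pagerduty, and thresholds are example values.
monitor = {
    "name": "High CPU on {{host.name}} (service_name, prod)",
    "type": "query alert",
    "query": "avg(last_5m):avg:system.cpu.user{env:prod,service:service_name} by {host} > 85",
    "message": "CPU above 85% for 5 minutes on {{host.name}}. "
               "Check recent deploys and noisy neighbors. @pagerduty",
    "tags": ["severity:critical", "env:prod", "service:service_name"],
    "options": {
        "thresholds": {"critical": 85, "warning": 70},
        "evaluation_delay": 60,     # seconds; absorbs late metric points
        "notify_no_data": False,
        "renotify_interval": 30,    # minutes between re-alerts
    },
}
print(json.dumps(monitor, indent=2))
```

Note that the prompt's "warn at >70% for 10 minutes" implies a longer warning window than a single metric-alert query expresses; in practice teams either accept one evaluation window or create a second monitor for the warning condition.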
Role: You are a platform observability designer. Constraints: produce a single-page Datadog dashboard design with no more than 6 widgets, include template variables (service, env), and ensure widgets work for both prod and staging. Output format: numbered widget list with widget type, title, Datadog query, visualization type, size, and brief why-it-matters note. Examples where useful: include a P95 latency timeseries, error rate, throughput, slow endpoint table, heatmap by region, and a resource saturation widget. Provide concrete Datadog query snippets (use metric names like trace.http.request.duration) that are ready to paste into widget queries.
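To make the expected output concrete, here is one of the six widgets sketched as dashboard-widget JSON. The percentile query assumes `trace.http.request.duration` is available as a distribution metric and that the dashboard defines `$service` and `$env` template variables, as the prompt specifies; treat the exact field names as a best-effort sketch of Datadog's widget schema.

```python
import json

# Sketch of a P95 latency timeseries widget; field names approximate
# Datadog's dashboard-widget JSON and should be verified before use.
latency_widget = {
    "definition": {
        "type": "timeseries",
        "title": "P95 request latency",
        "requests": [
            # Percentile query over a distribution metric, scoped by
            # the dashboard's template variables.
            {"q": "p95:trace.http.request.duration{service:$service,env:$env}"}
        ],
    },
}
print(json.dumps(latency_widget, indent=2))
```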
Role: Act as an APM analyst. Constraints: analyze the last 30 minutes (parameterizable), return the top 5 spans by P95 latency for a given service_name, include average latency, p95, span count, example trace_id for reproduction, and one-sentence hypothesis per span. Output format: JSON array of objects [{span_name, avg_ms, p95_ms, sample_count, example_trace_id, hypothesis, suggested_fixes[]}]. Variable: service_name (replace when running). Examples: show span_name 'db.query' with p95=450ms and a suggested fix 'add index / connection pool tuning'.
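The P95 figure this prompt asks for is a tail percentile; a minimal sketch of computing it from raw span latencies (nearest-rank method) shows why a single outlier can dominate it. The sample latencies below are made up for illustration.

```python
import math

def p95(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile: the smallest sample that is
    greater than or equal to 95% of all samples."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

# Hypothetical latencies for a span like 'db.query' (ms)
samples = [120, 130, 125, 140, 450, 135, 128, 132, 138, 126]
print(p95(samples))  # 450 -- one slow call dominates the tail
```

This is why the prompt pairs each P95 with an `example_trace_id`: the tail is usually a handful of reproducible slow traces, not a uniform slowdown.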
Role: You are an SRE defining error budget policies. Constraints: produce one SLO YAML/JSON for availability or latency with objective (e.g., 99.9%), rolling window (30d), and two alert conditions (warning at 75% error budget spent, critical at 95% spent). Output format: YAML with fields: name, service, metric/query, objective, timeframe, thresholds (warning/critical), alert_messages (notify channels, runbook links). Variable: service_name and indicator (errors or p95_latency). Example: include a sample monitor message that mentions remaining error budget and links to the runbook.
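The 75%/95% alert thresholds in this prompt come from simple error-budget arithmetic: a 99.9% objective leaves a 0.1% budget of "bad" time, and alerts fire on the fraction of that budget already consumed. A minimal sketch:

```python
def error_budget_spent(objective: float, observed_availability: float) -> float:
    """Fraction of the error budget consumed so far (can exceed 1.0
    once the SLO is blown)."""
    budget = 1.0 - objective              # e.g. 0.001 for a 99.9% objective
    burned = 1.0 - observed_availability  # observed failure fraction
    return burned / budget

# 99.9% objective, 99.925% observed over the rolling window
spent = error_budget_spent(objective=0.999, observed_availability=0.99925)
print(f"{spent:.0%}")  # 75% -- exactly the warning threshold above
```

Embedding the remaining budget (here, 25%) in the monitor message, as the prompt's example suggests, tells the on-call engineer how much headroom is left before the critical threshold.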
Role: You are a senior SRE writing an incident runbook and postmortem template. Multi-step instruction: 1) Use the two few-shot examples below as style guides. 2) Produce a runbook with immediate mitigation steps, verification checks, escalation matrix, required Datadog queries/dashboards to open, and a checklist for on-call. 3) Produce a postmortem template with timeline, root cause analysis, impact, corrective actions, owner, and deadlines. Output format: Markdown with sections and actionable commands/queries. Examples: Example A: "DB connection pool exhaustion" runbook snippet; Example B: "Cache eviction cascade" runbook snippet. Now generate for incident: 'external API rate-limited responses skyrocketing for service_name'.
Role: Act as an observability cost-optimization lead. Multi-step instructions: 1) Given current_ingestion_gb_per_day (replace placeholder) and retention_days, analyze high-level cost drivers. 2) Recommend 6 prioritized actions (parsing, pipelines, exclusion filters, sample rules, archival, index management) with implementation steps, rough estimated GB/day savings (range), effort level, and risk. 3) Provide Datadog pipeline rules or example processors for the top 2 changes. Output format: JSON with keys: summary, assumptions, actions[] (name, estimated_savings_gb_range, effort_hours, risk, steps), pipeline_examples[]. Examples where useful: show a grok-like parsing rule and an exclusion filter for debug logs.
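As a concrete anchor for the "exclusion filter for debug logs" this prompt mentions, here is a sketch shaped like Datadog's logs-index exclusion-filter JSON. The field names are a best-effort reading of the public API and the query is an example, so verify both before applying.

```python
import json

# Sketch of an index exclusion filter that drops debug-level prod logs
# from indexing (they are still ingested and can be archived).
debug_exclusion = {
    "name": "drop-debug-logs",
    "is_enabled": True,
    "filter": {
        "query": "status:debug env:prod",  # example match criteria
        "sample_rate": 1.0,                # exclude 100% of matches
    },
}
print(json.dumps(debug_exclusion, indent=2))
```

Since indexed volume is typically the dominant log cost driver, an exclusion filter like this is usually the highest-savings, lowest-risk item on the prioritized action list the prompt requests.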
Choose Datadog over New Relic if you need broader native integrations and tighter cross-product telemetry correlation.