📊

Datadog

Unified observability and monitoring for data & analytics

Free | Freemium | Paid | Enterprise ⭐⭐⭐⭐☆ 4.4/5 📊 Data & Analytics 🕒 Updated
Visit Datadog ↗ Official website
Quick Verdict

Datadog is a cloud-scale monitoring and observability platform that collects metrics, traces, and logs to help SREs and engineers detect, troubleshoot, and optimize systems. It suits teams from startups to enterprises, with modular per-product pricing that starts at free tiers and scales through pay-as-you-go plans.

Datadog is a cloud-native observability platform that monitors infrastructure, applications, logs, and user experience across hybrid and multi-cloud environments. Its core capability is correlating metrics, traces, and logs in a single platform to surface root cause, with real-time dashboards and alerting. Datadog’s differentiator is broad agent and integration coverage (400+ integrations) plus APM, logs, and metrics in one UI, serving SREs, DevOps, and platform engineering teams. Pricing is modular: there are free tiers for limited hosts/logs and pay-as-you-go plans per product so teams can scale observability costs.

About Datadog

Datadog is a software-as-a-service observability and monitoring platform founded by Olivier Pomel and Alexis Lê-Quôc and launched to serve cloud-native infrastructure teams. Positioned as an end-to-end observability suite, Datadog unifies infrastructure monitoring, application performance monitoring (APM), log management, synthetic and real user monitoring, and security signals into a single data platform. The core value proposition is to ingest telemetry—metrics, traces, and logs—at scale, correlate them automatically, and present actionable context so teams can reduce MTTD/MTTR across distributed systems.

Datadog’s product portfolio includes distinct but integrated capabilities. Infrastructure Monitoring captures host and container metrics with the Datadog Agent and supports custom metrics, tagging, and out-of-the-box dashboards. APM (Application Performance Monitoring) traces requests end-to-end and surfaces latency hotspots, service maps, and flame graphs; it supports sampling controls and distributed tracing for languages and runtimes such as Java, Python, Go, and Node.js.

Log Management ingests and indexes logs with retention and live-tail options, plus structured log processing pipelines and log rehydration; billing is based on ingested and indexed volumes. Synthetic Monitoring runs scripted API and browser checks for performance baselines, while Real User Monitoring (RUM) measures front-end performance and errors. Security features include Cloud SIEM and Application Security Management for threat detection.

Datadog’s pricing is modular and per-product, with usage-based meters for each service. There is a free tier for Infrastructure (limited host count with 1-day metric retention for some plans) and a free APM trial; Logging offers a free ingest/sample and limited indexing tier. Common published prices (subject to change) include Infrastructure Pro at a per-host, per-month rate, APM billed per host or per million spans, and Log Management billed per GB ingested and per GB indexed; Synthetic and RUM have per-check and per-session rates respectively.
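To see how the per-product meters combine, here is a back-of-envelope cost model. The rates below are illustrative placeholders, not Datadog's published prices; check the vendor's pricing page for current figures.

```python
# Rough monthly cost estimate for two common Datadog meters:
# per-host Infrastructure billing and per-GB log ingestion.
# ASSUMPTION: the rates below are placeholders, not real prices.

def monthly_estimate(hosts, log_gb_per_day,
                     host_rate=18.0,      # assumed $/host/month
                     ingest_rate=0.10):   # assumed $/GB ingested
    """Host meter plus 30 days of log ingestion."""
    return hosts * host_rate + log_gb_per_day * 30 * ingest_rate

# Example: 20 hosts and 50 GB/day of logs.
print(monthly_estimate(20, 50))
```

Even a crude model like this shows why retention and ingestion controls matter: the log term grows linearly with daily volume, independent of host count.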

Enterprise customers can purchase annual commitments and negotiated volume discounts; Datadog also offers cost controls like ingestion pipelines and retention settings to manage bills. Datadog is used by SREs and platform teams to monitor microservices and cloud infrastructure, by backend engineers to debug latency using APM traces, and by product teams to measure frontend performance with RUM. Examples: an SRE using Datadog APM to reduce P95 latency by tracing service calls, and a DevOps engineer using Infrastructure Monitoring plus Log Management to cut incident detection time.

For teams evaluating alternatives, Datadog is often compared to New Relic—Datadog emphasizes agent-based integrations and cross-product correlation, while New Relic packages telemetry differently and sometimes favors per-user or per-entity models.

What makes Datadog different

Three capabilities that set Datadog apart from its nearest competitors.

  • Datadog offers 400+ native integrations that instrument services automatically via its Agent and APIs.
  • Telemetry correlation: Datadog stitches metrics, traces, and logs across products in a single UI.
  • Modular, per-product usage billing with per-host and per-GB meters and enterprise negotiation options.

Is Datadog right for you?

✅ Best for
  • SREs who need correlated metrics, traces, and logs for incident response
  • Platform engineers who need broad cloud and container integrations across environments
  • Backend engineers who need APM traces to reduce P95/P99 latency
  • DevOps teams who need Synthetic checks and RUM to monitor user experience
❌ Skip it if
  • Skip if you require open-source-only tooling and zero SaaS vendor lock-in.
  • Skip if you need strictly per-seat pricing and cannot accept per-host or per-GB billing.

✅ Pros

  • Wide integration catalog (400+ integrations) that accelerates instrumentation across infra and services
  • Unified UI that correlates metrics, traces, and logs for faster root-cause analysis
  • Modular telemetry billing lets teams pick only needed observability products and scale usage

❌ Cons

  • Can become expensive at scale due to per-host and per-GB pricing without careful data retention controls
  • High volume environments may need significant configuration to control ingestion and costs

Datadog Pricing Plans

Current tiers and what you get at each price point. Verified against the vendor's pricing page.

  • Free — Free. Limited host metrics, basic dashboards, 1-day metric retention on some plans. Best for small teams evaluating observability features.
  • Pro / Pay-as-you-go — Varies by product (e.g., Infrastructure ~$18/host/month). Per-host or per-GB billing; feature access depends on product. Best for growing teams needing full APM/logs per product.
  • Enterprise / Annual — Custom (volume discounts negotiated). Includes advanced security, SAML, longer retention options. Best for large orgs needing SLAs and enterprise controls.

Best Use Cases

  • Site Reliability Engineer using it to reduce incident MTTD by 50% via alerting and dashboards
  • Backend Engineer using it to cut P95 latency 30% by locating hotspots with APM traces
  • DevOps Lead using it to lower mean time to resolution by correlating logs with traces

Integrations

  • AWS
  • Kubernetes
  • Azure

How to Use Datadog

  1. Create and verify workspace
    Sign up at the Datadog site and verify your email. Choose a trial or the Free plan, then create an Organization; success looks like access to the Datadog dashboard with a prompt to install the Agent.
  2. Install the Datadog Agent
    From Integrations → Agent, follow the platform-specific install (RPM/APT, Docker, or Helm). Confirm the Agent appears under Infrastructure → Hosts; success is visible host metrics populating charts.
  3. Enable APM and instrument services
    Install the language-specific APM library (e.g., ddtrace for Python, dd-trace-js for Node.js) and set the DD_SERVICE and DD_ENV environment variables. Verify traces show up under APM → Services with a service map and flame graphs.
  4. Create a dashboard and alert
    Go to Dashboards → New Dashboard and add widgets for host CPU, APM latency, and logs; then create a Monitor (e.g., alert on high P95 latency). Success equals active dashboard panels and an alert test notification.
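The monitor in step 4 can also be created programmatically. The sketch below builds a "metric alert" payload in the shape Datadog's monitors API (POST /api/v1/monitor) accepts; the thresholds, tags, and runbook placeholder are example values, not a definitive configuration.

```python
import json

# Sketch of a Datadog "metric alert" monitor payload. Query syntax
# and field names follow the Datadog monitor API; thresholds, tags,
# and the runbook link are EXAMPLE values to replace.
monitor = {
    "name": "High CPU on {{host.name}}",
    "type": "metric alert",
    # Alert when the 5-minute average CPU exceeds 85% on any prod host.
    "query": "avg(last_5m):avg:system.cpu.user{env:prod} by {host} > 85",
    "message": "CPU above threshold on {{host.name}}. Runbook: <link>",
    "tags": ["env:prod", "team:sre"],
    "options": {
        "thresholds": {"critical": 85, "warning": 70},
        "notify_no_data": False,
        "renotify_interval": 30,   # minutes between re-notifications
    },
}

payload = json.dumps(monitor)  # body for the API call or UI import
print(len(payload) > 0)
```

Sending this payload with an API and application key creates the same monitor the UI walkthrough produces, which makes monitors reviewable in version control.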

Ready-to-Use Prompts for Datadog

Copy these prompts into your AI assistant as-is. Each targets a different high-value workflow.

Create CPU Spike Monitor
Detect and alert on CPU usage spikes
Role: You are a Datadog monitoring engineer. Constraints: produce a single Datadog monitor definition for host CPU usage that triggers on sustained spikes, include severity tags, a recovery condition, and limit noise with a short-term aggregation. Input (replace placeholders): service_name, env (prod/stage), host_tag. Output format: JSON object with fields: name, type, query, message, tags, options (thresholds, evaluation_delay, notify_no_data, renotify_interval). Example: show a monitor that alerts at >85% CPU for 5 minutes and warns at >70% for 10 minutes. Provide the exact monitor query and message payload ready to paste into Datadog API or UI.
Expected output: One JSON monitor definition (name, query, message, tags, options) ready for Datadog API/UI.
Pro tip: Use 'avg(last_5m)' with a 'by {host}' grouping and include runbook links and pager severity in the message to reduce on-call confusion.
Service Latency Dashboard Layout
Visualize service latency across environments
Role: You are a platform observability designer. Constraints: produce a single-page Datadog dashboard design with no more than 6 widgets, include template variables (service, env), and ensure widgets work for both prod and staging. Output format: numbered widget list with widget type, title, Datadog query, visualization type, size, and brief why-it-matters note. Examples where useful: include a P95 latency timeseries, error rate, throughput, slow endpoint table, heatmap by region, and a resource saturation widget. Provide concrete Datadog query snippets (use metric names like trace.http.request.duration) that are ready to paste into widget queries.
Expected output: A list of 5–6 dashboard widgets with titles, queries, viz types, sizes, and short intent notes.
Pro tip: Add template variables for service and env and use conditional color thresholds to make problem states visible at a glance.
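One of the widgets this prompt asks for can be sketched as dashboard JSON. The $service and $env template variables and the p95-of-trace-duration query follow Datadog conventions, but treat the exact field layout as an illustrative assumption rather than a verified schema.

```python
import json

# Sketch of a single timeseries widget in Datadog dashboard JSON.
# $service and $env are template variables; the query takes the p95
# of the standard APM request-duration metric. ASSUMPTION: field
# layout is illustrative, not a verified dashboard schema.
widget = {
    "definition": {
        "type": "timeseries",
        "title": "P95 request latency",
        "requests": [
            {"q": "p95:trace.http.request.duration{$service,$env}"}
        ],
    },
}

print(json.dumps(widget, indent=2)[:30])
```

Defining widgets as data like this makes it easy to stamp out the remaining five widgets (error rate, throughput, slow endpoints, regional heatmap, saturation) from a template.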
Summarize APM Trace Hotspots
Identify top latency hotspots in APM traces
Role: Act as an APM analyst. Constraints: analyze the last 30 minutes (parameterizable), return the top 5 spans by P95 latency for a given service_name, include average latency, p95, span count, example trace_id for reproduction, and one-sentence hypothesis per span. Output format: JSON array of objects [{span_name, avg_ms, p95_ms, sample_count, example_trace_id, hypothesis, suggested_fixes[]}]. Variable: service_name (replace when running). Examples: show span_name 'db.query' with p95=450ms and a suggested fix 'add index / connection pool tuning'.
Expected output: JSON array of up to 5 hotspot objects containing metrics, an example trace_id, hypothesis, and a short list of suggested fixes.
Pro tip: Ask Datadog to include deep links to the example trace and the trace flamegraph to accelerate triage.
Define SLO and Alerting Policy
Create SLO and error-budget alerts for service
Role: You are an SRE defining error budget policies. Constraints: produce one SLO YAML/JSON for availability or latency with objective (e.g., 99.9%), rolling window (30d), and two alert conditions (warning at 75% error budget spent, critical at 95% spent). Output format: YAML with fields: name, service, metric/query, objective, timeframe, thresholds (warning/critical), alert_messages (notify channels, runbook links). Variable: service_name and indicator (errors or p95_latency). Example: include a sample monitor message that mentions remaining error budget and links to the runbook.
Expected output: One YAML SLO definition including two alert thresholds and ready-to-deploy monitor messages.
Pro tip: Tie alert messages to an automated scheduling action (e.g., auto-create a postmortem ticket) to reduce lead time during high-severity breaches.
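For reference, a metric-based SLO in the shape Datadog's SLO API (POST /api/v1/slo) accepts might look like the sketch below. The numerator/denominator queries use standard APM trace metrics, but the service name, targets, and tags are placeholders.

```python
# Sketch of a metric-based availability SLO payload. ASSUMPTION:
# "my-service" and the 99.9% target are placeholders; the query
# metrics are Datadog's standard APM hit/error counts.
slo = {
    "name": "my-service availability",
    "type": "metric",
    "query": {
        # good events = total hits minus errors
        "numerator": ("sum:trace.http.request.hits{service:my-service} - "
                      "sum:trace.http.request.errors{service:my-service}"),
        "denominator": "sum:trace.http.request.hits{service:my-service}",
    },
    "thresholds": [
        {"timeframe": "30d", "target": 99.9, "warning": 99.95},
    ],
    "tags": ["service:my-service", "env:prod"],
}

# Error budget implied by 99.9% over a 30-day window, in minutes.
budget_minutes = (1 - 0.999) * 30 * 24 * 60
print(round(budget_minutes, 1))
```

The budget arithmetic is worth internalizing: 99.9% over 30 days leaves roughly 43 minutes of allowable downtime, which is what the 75%/95% budget-spent alerts in the prompt are measured against.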
Generate Incident Runbook and Postmortem
Produce runbook and postmortem template for incidents
Role: You are a senior SRE writing an incident runbook and postmortem template. Multi-step instruction: 1) Use the two few-shot examples below as style guides. 2) Produce a runbook with immediate mitigation steps, verification checks, escalation matrix, required Datadog queries/dashboards to open, and a checklist for on-call. 3) Produce a postmortem template with timeline, root cause analysis, impact, corrective actions, owner, and deadlines. Output format: Markdown with sections and actionable commands/queries. Examples: Example A: "DB connection pool exhaustion" runbook snippet; Example B: "Cache eviction cascade" runbook snippet. Now generate for incident: 'external API rate-limited responses skyrocketing for service_name'.
Expected output: Markdown document containing a runnable incident runbook and a postmortem template tailored to the specified incident.
Pro tip: Include exact Datadog queries and the minimal set of users/teams to notify to avoid 'who to page' ambiguity during the incident.
Log Ingestion Cost Optimization Plan
Reduce Datadog log ingestion and retention costs
Role: Act as an observability cost-optimization lead. Multi-step instructions: 1) Given current_ingestion_gb_per_day (replace placeholder) and retention_days, analyze high-level cost drivers. 2) Recommend 6 prioritized actions (parsing, pipelines, exclusion filters, sample rules, archival, index management) with implementation steps, rough estimated GB/day savings (range), effort level, and risk. 3) Provide Datadog pipeline rules or example processors for the top 2 changes. Output format: JSON with keys: summary, assumptions, actions[] (name, estimated_savings_gb_range, effort_hours, risk, steps), pipeline_examples[]. Examples where useful: show a grok-like parsing rule and an exclusion filter for debug logs.
Expected output: JSON object with a summary, assumptions, and an array of 6 prioritized actions with estimated savings and implementation steps, plus 1–2 pipeline examples.
Pro tip: Start by measuring high-cardinality attributes and high-volume sources—dropping or extracting specific attributes often yields the largest cost reductions with minimal telemetry loss.
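The highest-leverage action the prompt lists, dropping debug logs before indexing, can be expressed as an index exclusion filter. The sketch below follows the shape Datadog's logs index configuration uses; the query and sample rate are example policy choices.

```python
# Sketch of a logs index exclusion filter. ASSUMPTION: the policy
# (drop 100% of debug-level logs) and the debug-share estimate are
# example numbers; the field layout mirrors Datadog's index config.
exclusion_filter = {
    "name": "drop-debug-logs",
    "is_enabled": True,
    "filter": {
        "query": "status:debug",  # match debug-severity logs
        "sample_rate": 1.0,       # exclude 100% of matches from indexing
    },
}

# Estimated GB/day removed from indexing if debug logs make up ~40%
# of an assumed 50 GB/day stream.
saved_gb = 50 * 0.40 * exclusion_filter["filter"]["sample_rate"]
print(saved_gb)
```

Note that exclusion filters reduce indexed volume only; excluded logs are still ingested, so pairing filters with pipeline-level drops or archiving is what moves the ingestion meter.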

Datadog vs Alternatives

Bottom line

Choose Datadog over New Relic if you need broader native integrations and tighter cross-product telemetry correlation.

Frequently Asked Questions

How much does Datadog cost?
Pricing is modular and usage-based. Datadog bills per product: Infrastructure typically carries a per-host, per-month charge (roughly in the low tens of dollars per host), APM is billed per host or per million spans, and Log Management is billed per GB ingested and indexed. Enterprise pricing and discounts are available via annual contracts. Use retention and ingestion controls to manage costs.
Is there a free version of Datadog?
Yes — Datadog offers free tiers and trials. There is a Free plan with limited host metrics and basic dashboards; individual products like APM and Logs have free trials or limited free ingestion/indexing tiers. Free tiers vary by product, so verify limits on host counts, retention, and indexed log volume before relying on them.
How does Datadog compare to New Relic?
Datadog emphasizes agent-based integrations and cross-product telemetry correlation. New Relic offers a different pricing and packaging model and a unified data platform too; choose Datadog if you need broad native integrations and per-product modular billing, or New Relic if you prefer their usage/ingest pricing and user-centric UI.
What is Datadog best used for?
Datadog is best for full-stack observability. It excels at correlating metrics, traces, and logs across cloud-native environments to reduce MTTD/MTTR, support APM workflows, run synthetic checks, and monitor frontend user experience with RUM and session replay.
How do I get started with Datadog?
Start with the Datadog trial and install the Agent. Sign up at datadoghq.com, install the Agent (or Helm chart for Kubernetes), enable APM instrumentation with language libraries, and create dashboards and monitors; success is visible host metrics, traces, and sample logs in the UI.

More Data & Analytics Tools

Browse all Data & Analytics tools →
  • 📊 Databricks — Unified Lakehouse for Data & Analytics-driven AI and BI (updated Apr 21, 2026)
  • 📊 Snowflake — Cloud data platform for analytics-driven decision making (updated Apr 21, 2026)
  • 📊 Microsoft Power BI — Turn data into decisions with enterprise-grade data analytics (updated Apr 22, 2026)