
Provider Health & Automatic Failover

Deepline continuously monitors 19 enrichment providers so your waterfalls never stall on a dead endpoint. When a provider goes down, waterfall execution skips it automatically. When it recovers, traffic resumes. You do not configure anything — it works out of the box on every enrichment request.

Primitives

  • Providers monitored: 19 (14 API probes + 5 DNS probes, one of them mixed)
  • API probe timeout: 10 seconds
  • DNS probe timeout: 5 seconds
  • Degraded threshold: response time exceeds 5 seconds
  • State storage: Vercel KV
  • Transition alerts: Slack (automatic on every status change)
  • Check frequency: cron-driven (runs on schedule, not per-request)
  • Failover behavior: automatic skip in waterfall when provider status is down

What provider health monitoring means for your enrichment reliability

Not every team needs to think about provider health. Here is how to decide whether this matters for your use case:
  • Production-critical pipelines. If your enrichment runs feed CRM syncs, outbound sequences, or lead scoring models, a single provider outage cascades into stale data and failed automations. Automatic rerouting keeps your pipeline producing results even when a provider is down. No retry logic to build, no status pages to monitor.
  • Multi-provider waterfalls. The typical email waterfall hits 4-6 providers in sequence. A provider returning HTTP 500s gets marked down within one health check cycle, and every subsequent waterfall skips it until recovery. No wasted credits on providers that are going to fail anyway.
  • Teams without dedicated DevOps. Most enrichment teams are 2-3 people. Nobody has time to poll 19 status pages and toggle provider configs when something breaks. Deepline handles monitoring, state management, alerting, and failover as one system. You get Slack notifications on status changes, and traffic reroutes without intervention.
  • Not relevant for single-provider setups. If your pipeline only calls Apollo for people search, health monitoring still runs — but failover has nowhere to reroute. You still get Slack alerts when Apollo goes down, but automatic rerouting only matters when your waterfall has multiple providers.

How health monitoring works

The health monitoring system has four layers: probes that test each provider, state management that tracks status over time, alerts that notify your team, and failover logic that acts on that state during waterfall execution.

Layer 1: Probes

Every health check cycle runs 19 probes concurrently using Promise.allSettled. Each probe targets a lightweight endpoint on the provider's API — something that authenticates and responds without consuming credits or triggering rate limits. There are two probe types:

API probes (14 providers) make an authenticated HTTP request to a real API endpoint. The request includes the provider's API key, uses the correct auth mechanism (header, query parameter, or request body, depending on the provider), and expects a successful HTTP response. The timeout is 10 seconds. A response with a non-2xx status code marks the provider down; a successful response that takes longer than 5 seconds marks it degraded. Providers using API probes: Apollo, Apify, Dropleads, HeyReach, Instantly, LeadMagic, Lemlist, Forager, ZeroBounce, Prospeo, People Data Labs, Deepline Native, Crustdata, and Hunter.

DNS probes (5 providers) send an unauthenticated HEAD request to the provider's API base URL. These are used for providers that do not expose a free, zero-side-effect health endpoint. The timeout is 5 seconds. A DNS probe tests that the hostname resolves and the server accepts TCP connections. Any HTTP response — even a 403 or 404 — counts as up, because the probe is testing reachability, not API correctness. Providers using DNS probes: Adyntel, Parallel, Exa, Google Search, and Icypeas.

One provider (Icypeas) uses a mixed approach: it has an API key configured but uses a DNS probe, because its API does not expose a free health-check endpoint.
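The probe cycle above can be sketched as follows. This is an illustrative sketch, not Deepline's actual source: the type names, the fetch-based probe, and the helper names are assumptions; the timeouts and thresholds are the ones documented on this page.

```typescript
type ProbeType = "api" | "dns";
type ProbeStatus = "up" | "degraded" | "down";

const API_TIMEOUT_MS = 10_000;
const DNS_TIMEOUT_MS = 5_000;
const DEGRADED_THRESHOLD_MS = 5_000;

// Pure classification rule: API probes require a 2xx response; DNS probes
// count any HTTP response as reachable. Slow successes are "degraded".
function classify(type: ProbeType, httpOk: boolean, latencyMs: number): ProbeStatus {
  if (type === "api" && !httpOk) return "down";
  return latencyMs > DEGRADED_THRESHOLD_MS ? "degraded" : "up";
}

async function runProbe(provider: string, type: ProbeType, url: string) {
  const start = Date.now();
  try {
    const res = await fetch(url, {
      method: type === "dns" ? "HEAD" : "GET",
      signal: AbortSignal.timeout(type === "api" ? API_TIMEOUT_MS : DNS_TIMEOUT_MS),
    });
    const latencyMs = Date.now() - start;
    return { provider, status: classify(type, res.ok, latencyMs), latencyMs };
  } catch (err) {
    // Timeout, DNS failure, or refused connection all mark the provider down.
    return { provider, status: "down" as ProbeStatus, latencyMs: Date.now() - start, error: String(err) };
  }
}

// One cycle: all probes fire concurrently; Promise.allSettled keeps a
// single rejection from aborting the rest of the cycle.
async function runCycle(probes: Array<{ provider: string; type: ProbeType; url: string }>) {
  const settled = await Promise.allSettled(probes.map((p) => runProbe(p.provider, p.type, p.url)));
  return settled.flatMap((s) => (s.status === "fulfilled" ? [s.value] : []));
}
```

Note that the degraded threshold applies only after a probe succeeds: a slow failure is still down, not degraded.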

Layer 2: State management

After all 19 probes complete, the system compares the current results against the previous results stored in Vercel KV under the key provider-health:latest. The comparison logic walks through every provider and checks whether its status changed since the last run. Transitions involving the skipped status are ignored — if a provider’s API key is not configured, it stays skipped and does not generate noise. When a transition is detected (for example, up to down, or degraded to up), the system records it with the provider name, previous status, new status, current latency, and any error message. The current probe results are then written back to Vercel KV, replacing the previous snapshot. This means KV always holds exactly one record: the most recent health state of all 19 providers.
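The comparison pass can be sketched as a pure function over two snapshots (the shapes and names here are illustrative assumptions; the KV key is the one named above):

```typescript
type Status = "up" | "degraded" | "down" | "skipped";

// Shape of the snapshot stored in Vercel KV under "provider-health:latest".
interface Snapshot {
  [provider: string]: { status: Status; latencyMs: number; error?: string };
}

interface Transition {
  provider: string;
  from: Status;
  to: Status;
  latencyMs: number;
  error?: string;
}

function detectTransitions(prev: Snapshot | null, curr: Snapshot): Transition[] {
  if (!prev) return []; // first run: no previous state, so no transitions
  const out: Transition[] = [];
  for (const [provider, result] of Object.entries(curr)) {
    const before = prev[provider]?.status;
    if (before === undefined || before === result.status) continue;
    // Transitions involving "skipped" are ignored: an unconfigured
    // provider should not generate alert noise.
    if (before === "skipped" || result.status === "skipped") continue;
    out.push({
      provider,
      from: before,
      to: result.status,
      latencyMs: result.latencyMs,
      error: result.error,
    });
  }
  return out;
}
```

After this pass, the current snapshot overwrites the previous one in KV, so KV always holds exactly one record.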

Layer 3: Alerts

Every detected transition triggers a Slack notification. The alert includes the provider name, the direction of the change (e.g., “Apollo: up -> down”), the measured latency, and the error message if applicable. Alerts fire on every transition, in both directions. You get notified when a provider goes down, and you get notified when it comes back up. This gives your team a complete audit trail of provider reliability without checking dashboards.
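A minimal sketch of the alert formatting and delivery, assuming a transition record like the one described above (the helper names and message layout are illustrative; Slack incoming webhooks accept a JSON body with a "text" field):

```typescript
interface Transition {
  provider: string;
  from: string;
  to: string;
  latencyMs: number;
  error?: string;
}

// Matches the "Apollo: up -> down" style shown above, appending the
// measured latency and the error message when one exists.
function formatAlert(t: Transition): string {
  const base = `${t.provider}: ${t.from} -> ${t.to} (${t.latencyMs}ms)`;
  return t.error ? `${base} [${t.error}]` : base;
}

// Post the formatted message to a Slack incoming webhook.
async function sendAlert(webhookUrl: string, t: Transition): Promise<void> {
  await fetch(webhookUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: formatAlert(t) }),
  });
}
```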

Layer 4: Failover

During waterfall execution, Deepline reads the current provider health state from Vercel KV before building the execution plan. Any provider with a status of down is excluded from the waterfall chain for that request. This means:
  1. The waterfall does not waste time or credits calling a provider that is known to be unresponsive.
  2. The next healthy provider in the chain takes over, maintaining your enrichment hit rate.
  3. When the downed provider recovers (detected on the next health check cycle), it is automatically re-included in future waterfall executions.
The failover is transparent. You do not see different API responses or need to handle it in your code. The waterfall result comes back the same way it always does — you just get results faster because dead providers are skipped.
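The filtering step can be sketched as follows (the snapshot shape and function name are illustrative assumptions):

```typescript
type HealthSnapshot = Record<string, { status: "up" | "degraded" | "down" | "skipped" }>;

// Only "down" providers are removed from the chain. "degraded" providers
// stay in, and a provider missing from the snapshot (or a missing
// snapshot entirely) defaults to available — the safe fallback when no
// health state can be read.
function buildChain(waterfall: string[], health: HealthSnapshot | null): string[] {
  if (!health) return waterfall;
  return waterfall.filter((name) => health[name]?.status !== "down");
}
```

Because the filter runs per request against the latest snapshot, a recovered provider rejoins the chain automatically on the first request after the health check that marks it up.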

Status definitions

Each provider is assigned exactly one status after every health check cycle:
  • up: Probe succeeded; response time under 5 seconds. Included in waterfall chains normally.
  • degraded: Probe succeeded, but response time exceeded 5 seconds. Still included in waterfall chains; results may be slower than usual.
  • down: Probe failed (HTTP error, timeout, or DNS resolution failure). Skipped in waterfall chains until the next successful health check.
  • skipped: No API key configured for this provider. Excluded from health checks entirely and not available for waterfall use.
The degraded status is a signal, not a circuit breaker. Providers running slow are still usable; they just take longer. The 5-second threshold was chosen because most enrichment API calls complete in 1-3 seconds under normal conditions. A response time above 5 seconds indicates the provider is under load or experiencing partial degradation, but still returning valid data.

The down status is the circuit breaker. When a provider is down, waterfall execution treats it as if it does not exist for that request cycle. This is the key mechanism that prevents your pipeline from stalling on an unresponsive provider.

Frequently asked questions

How often do health checks run?
Health checks are cron-driven. The check runs all 19 probes concurrently on each invocation. The exact schedule is configured in vercel.json as a Vercel Cron Job. Each cycle completes in under 15 seconds because all probes run in parallel with individual timeouts (10 seconds for API probes, 5 seconds for DNS probes).
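The cron wiring follows Vercel's standard crons configuration; the path and schedule below are illustrative placeholders, not the actual deployment values:

```json
{
  "crons": [
    {
      "path": "/api/cron/provider-health",
      "schedule": "*/10 * * * *"
    }
  ]
}
```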
Do health checks consume enrichment credits?
No. API probes are designed to hit free, zero-side-effect endpoints: credit balance checks, campaign list endpoints, autocomplete queries with minimal input, and similar read-only calls. None of these endpoints consume enrichment credits on any provider we have tested.
What happens if Vercel KV is unavailable?
If the KV read fails (for example, during a Vercel KV outage), the health check treats the situation as a first run — no previous state, so no transitions are detected and no alerts fire. The current results are still computed and an attempt is made to write them back to KV. Waterfall execution falls back to treating all providers as available, which is the safe default. You lose the failover optimization temporarily, but you do not lose enrichment capability.
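The safe-default read can be sketched as follows (the kv client interface is an assumption for illustration; "provider-health:latest" is the key named earlier on this page):

```typescript
type Snapshot = Record<string, { status: string; latencyMs: number }>;

async function readLatestHealth(
  kv: { get(key: string): Promise<Snapshot | null> }
): Promise<Snapshot | null> {
  try {
    return await kv.get("provider-health:latest");
  } catch {
    // KV outage: behave like a first run. No previous state means no
    // transitions, no alerts, and all providers treated as available.
    return null;
  }
}
```

A null return is the same value a genuine first run produces, so downstream code needs no special outage path.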
Can I inspect the current health state of all providers?
Yes. The health check endpoint returns a full JSON report including every provider's status, latency, probe type, and any error messages. The response also includes a summary object with counts of providers in each status (up, degraded, down, skipped) and a list of any transitions detected during that check cycle.
What happens if a provider is flapping between up and down?
Every transition generates a Slack alert. If a provider is flapping, you will see a series of alternating alerts ("Apollo: up -> down", "Apollo: down -> up"), which makes the instability visible. The waterfall respects the most recent health check result, so a provider that was down at the last check will be skipped until the next check finds it up again. There is no debounce or cooldown period — the system tracks the provider's actual state, not a smoothed version of it.
Does a degraded status trigger failover?
No. A degraded provider still returns valid data — it just takes longer. Waterfall execution includes degraded providers normally. The status is informational: it tells you that a provider is slower than usual, which might indicate an emerging issue. If latency continues to increase past the timeout threshold, the next health check will mark the provider as down and failover will engage.
Can I manually mark a provider as down?
Not currently. The system is fully automated. If you know a provider is having issues before the next health check detects it, the impact is limited to one check cycle. In practice, provider outages are detected within minutes because the cron runs on a regular schedule.
Which providers are monitored?
All 19 providers in the probe list: Apollo, Apify, Dropleads, HeyReach, Instantly, LeadMagic, Lemlist, Forager, ZeroBounce, Prospeo, People Data Labs, Deepline Native, Crustdata, Hunter, Icypeas, Adyntel, Parallel, Exa, and Google Search. If a provider's API key is not configured in your environment, its status will be skipped and it will not participate in health checks or waterfall execution.