How I keep 100+ scrapers alive
Error classification patterns for browser automation at scale
Running 100+ Playwright integrations across authenticated mortgage platforms sounds brittle — and it would be, if you treated every failure the same way. The trick isn’t writing better scrapers. The trick is accepting that failures are the default state and engineering around that.
This post walks through the error classification system that has kept a production data-sync pipeline running 3× daily for over a year, with zero manual interventions so far.
Failures are data, not exceptions
Most scraping code treats an error as an exception: catch it, maybe retry once, log it, move on. That approach works for tiny scripts. It falls apart the moment you’re running dozens of integrations at once, because every single failure mode is a policy decision in disguise:
- A 403 from the target site might mean your session expired.
- A 403 from the target site might mean your IP was rate-limited.
- A 403 from the target site might mean the team rotated the SSO provider.
Those three causes have nothing in common except the status code. If you retry blindly, you’ll make the problem worse in at least two of the three. If you page a human on every 403, you’ll train them to ignore alerts. The answer is to classify first, then route.
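Classify-then-route can be sketched in a few lines. Everything below is illustrative: the marker strings, the class names, and the idea of fingerprinting the response body are assumptions for the sketch, not the production system's actual signals (which would also look at cookies, redirects, and timing).

```python
from enum import Enum, auto

class FailureClass(Enum):
    SESSION_EXPIRED = auto()  # re-authenticate, then retry
    RATE_LIMITED = auto()     # back off before retrying
    AUTH_DRIFT = auto()       # login flow changed; escalate
    UNKNOWN = auto()          # no fingerprint matched

def classify_403(body: str) -> FailureClass:
    """Map a 403 response body to a failure class.

    The substrings below are hypothetical fingerprints; each target
    site needs its own, maintained as the site evolves.
    """
    text = body.lower()
    if "session expired" in text or "please log in" in text:
        return FailureClass.SESSION_EXPIRED
    if "too many requests" in text:
        return FailureClass.RATE_LIMITED
    if "single sign-on" in text or "sso" in text:
        return FailureClass.AUTH_DRIFT
    return FailureClass.UNKNOWN
```

The point of the sketch is the shape, not the fingerprints: the status code alone never decides the recovery action; the classifier does, and the router acts on the class.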
The nine classes
In the production system I maintain, every failure lands in one of nine buckets. Each bucket has a different recovery policy. Most of them don’t page a human — they just resolve themselves.
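The bucket-to-policy mapping can be sketched as a simple lookup table. The class names and policy values below are illustrative placeholders, not the production taxonomy; the structural idea is that each class carries its own retry budget, backoff, and escalation flag.

```python
from dataclasses import dataclass
from enum import Enum, auto

class ErrorClass(Enum):
    TRANSIENT_NETWORK = auto()
    SESSION_EXPIRED = auto()
    RATE_LIMITED = auto()
    AUTH_DRIFT = auto()
    # ...the remaining classes follow the same pattern

@dataclass(frozen=True)
class RecoveryPolicy:
    max_retries: int
    backoff_seconds: float
    reauthenticate: bool = False
    page_human: bool = False

# Illustrative policy table: most classes resolve without paging anyone.
POLICIES = {
    ErrorClass.TRANSIENT_NETWORK: RecoveryPolicy(max_retries=3, backoff_seconds=5),
    ErrorClass.SESSION_EXPIRED: RecoveryPolicy(max_retries=1, backoff_seconds=0,
                                               reauthenticate=True),
    ErrorClass.RATE_LIMITED: RecoveryPolicy(max_retries=2, backoff_seconds=300),
    ErrorClass.AUTH_DRIFT: RecoveryPolicy(max_retries=0, backoff_seconds=0,
                                          page_human=True),
}
```

Keeping the policies in data rather than scattered through `try`/`except` blocks is what makes the system auditable: you can read the whole recovery posture in one place, and changing a policy never touches scraper code.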
Full post coming soon. The remaining sections cover: how we detect authentication drift from inside a running pipeline, the checkpoint protocol that lets workflows resume mid-flight, and the retry budgeting strategy that prevents cascading failures from overwhelming target sites.