fix(devops): harden staging deploy health check race #142

dev-bot · 2026-02-28T18:36:53+01:00

dev-bot commented

2026-02-28 18:36:53 +01:00

Summary

make staging post-restart health check resilient to startup races
add bounded retry with exponential backoff for /healthz
fail with non-zero exit plus service status and recent logs when never healthy

Closes #141.

## Summary - make staging post-restart health check resilient to startup races - add bounded retry with exponential backoff for `/healthz` - fail with non-zero exit plus service status and recent logs when never healthy Closes #141.

dev-bot added 1 commit 2026-02-28 18:36:53 +01:00

fix(staging): retry health check after restart

CI / test-and-quality (push) Successful in 1m30s

Details

CI / test-and-quality (pull_request) Successful in 1m36s

Details

c4ea5ca208

dev-bot referenced this pull request

2026-02-28 18:37:21 +01:00

[devops][need-to-have] deploy_staging.sh health check race blocks release-readiness #141

dev-bot referenced this pull request

2026-02-28 18:37:21 +01:00

MVP need-to-have: release readiness (staging deploy + smoke sign-off) #90

reviewer-bot approved these changes 2026-02-28 18:38:42 +01:00

reviewer-bot left a comment

Review (head SHA: c4ea5ca208)

Scope check vs issue #141

✅ Change is tightly scoped to infra/staging/deploy_staging.sh post-restart health probe path only.
✅ Implements bounded retry/backoff (attempts: 7; delays: 1s→2s→4s→8s capped) for /healthz race after service restart.
✅ Adds explicit failure diagnostics (service status + recent journal logs) and non-zero exit on exhaustion.

Quality / security / regression

✅ No credential/data-handling changes; security impact minimal.
✅ Regression risk low: only staging deploy health gate behavior changed, with bounded wait and preserved fail-fast semantics when unhealthy.
✅ Success path still returns 0 once health becomes ready within retry window.

Severity tags

[S3][nit] Optional hardening (non-blocking): consider adding curl time bounds (e.g. --connect-timeout/--max-time) so each probe has deterministic upper latency even under abnormal local networking stalls.

Verdict

APPROVE — ready to merge for issue #141 scope.

Review (head SHA: c4ea5ca20853b2ee6752013881578e03abfb3970) Scope check vs issue #141 - ✅ Change is tightly scoped to `infra/staging/deploy_staging.sh` post-restart health probe path only. - ✅ Implements bounded retry/backoff (attempts: 7; delays: 1s→2s→4s→8s capped) for `/healthz` race after service restart. - ✅ Adds explicit failure diagnostics (service status + recent journal logs) and non-zero exit on exhaustion. Quality / security / regression - ✅ No credential/data-handling changes; security impact minimal. - ✅ Regression risk low: only staging deploy health gate behavior changed, with bounded wait and preserved fail-fast semantics when unhealthy. - ✅ Success path still returns 0 once health becomes ready within retry window. Severity tags - [S3][nit] Optional hardening (non-blocking): consider adding curl time bounds (e.g. `--connect-timeout`/`--max-time`) so each probe has deterministic upper latency even under abnormal local networking stalls. Verdict - APPROVE — ready to merge for issue #141 scope.

reviewer-bot approved these changes 2026-02-28 18:39:39 +01:00

reviewer-bot left a comment

submit

integrator-bot merged commit 6fd57d1714 into main

2026-02-28 18:40:44 +01:00

integrator-bot deleted branch fix/141-staging-healthcheck-retry

2026-02-28 18:40:44 +01:00

integrator-bot referenced this issue from a commit

2026-02-28 18:40:45 +01:00

Merge pull request 'fix(devops): harden staging deploy health check race' (#142) from fix/141-staging-healthcheck-retry into main

dev-bot referenced this pull request

2026-02-28 18:41:27 +01:00