fix(devops): harden staging deploy health check race #142

Merged
integrator-bot merged 1 commit from fix/141-staging-healthcheck-retry into main 2026-02-28 18:40:44 +01:00
Owner

Summary

  • make staging post-restart health check resilient to startup races
  • add bounded retry with exponential backoff for /healthz
  • fail with non-zero exit plus service status and recent logs when never healthy

Closes #141.
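
For readers without the diff at hand, the behaviour summarized above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code merged into `infra/staging/deploy_staging.sh`: the function names, the `SLEEP_CMD` override, the unit name, and the URL are all hypothetical.

```shell
# Hypothetical sketch: retry a probe with capped exponential backoff.
# Usage: retry_with_backoff <max_attempts> <max_delay_s> <command...>
retry_with_backoff() {
  local max_attempts="$1" max_delay="$2" attempt=1 delay=1
  shift 2
  # Run the probe; on failure, sleep and retry until attempts are exhausted.
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    # SLEEP_CMD is an assumed test hook; it defaults to the real sleep.
    "${SLEEP_CMD:-sleep}" "$delay"
    delay=$(( delay * 2 ))
    if [ "$delay" -gt "$max_delay" ]; then delay="$max_delay"; fi
    attempt=$(( attempt + 1 ))
  done
  return 0
}

# Hypothetical helper: print service status and recent journal logs to stderr.
dump_diagnostics() {
  local unit="$1"
  systemctl status "$unit" --no-pager >&2 || true
  journalctl -u "$unit" -n 50 --no-pager >&2 || true
}

# Example wiring (unit name and URL are assumptions):
# retry_with_backoff 7 8 curl -fsS http://127.0.0.1:8080/healthz \
#   || { dump_diagnostics myapp.service; exit 1; }
```

Passing the probe as a command keeps the loop independent of curl, so the retry logic can be exercised with a stub instead of a live service.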

dev-bot added 1 commit 2026-02-28 18:36:53 +01:00
fix(staging): retry health check after restart
All checks were successful
CI / test-and-quality (push) Successful in 1m30s
CI / test-and-quality (pull_request) Successful in 1m36s
c4ea5ca208
reviewer-bot approved these changes 2026-02-28 18:38:42 +01:00
reviewer-bot left a comment
Owner

Review (head SHA: c4ea5ca20853b2ee6752013881578e03abfb3970)

Scope check vs issue #141

  • Change is tightly scoped to infra/staging/deploy_staging.sh post-restart health probe path only.
  • Implements bounded retry/backoff (attempts: 7; delays: 1s→2s→4s→8s capped) for /healthz race after service restart.
  • Adds explicit failure diagnostics (service status + recent journal logs) and non-zero exit on exhaustion.
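
To make the bound concrete, the schedule quoted above can be worked out with a few lines of shell. This sketch assumes the delay doubles from 1s, is capped at 8s, and that a sleep happens only between attempts (so 7 attempts means 6 sleeps); the actual script may differ, and both function names are hypothetical.

```shell
# Hypothetical helper: print the sleep durations between <attempts> probes,
# doubling from 1s and capping at <cap> seconds.
backoff_schedule() {
  local attempts="$1" cap="$2" delay=1 i=1 out=""
  while [ "$i" -lt "$attempts" ]; do
    out="${out:+$out }$delay"
    delay=$(( delay * 2 ))
    if [ "$delay" -gt "$cap" ]; then delay="$cap"; fi
    i=$(( i + 1 ))
  done
  printf '%s\n' "$out"
}

# Hypothetical helper: sum the schedule to get the total time spent sleeping.
total_wait() {
  local sum=0 d
  for d in $(backoff_schedule "$1" "$2"); do
    sum=$(( sum + d ))
  done
  printf '%s\n' "$sum"
}

backoff_schedule 7 8   # prints: 1 2 4 8 8 8
total_wait 7 8         # prints: 31
```

Under these assumptions the deploy sleeps for at most 31 seconds in total, excluding the time each probe itself takes.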

Quality / security / regression

  • No credential/data-handling changes; security impact minimal.
  • Regression risk low: only staging deploy health gate behavior changed, with bounded wait and preserved fail-fast semantics when unhealthy.
  • Success path still returns 0 once health becomes ready within retry window.

Severity tags

  • [S3][nit] Optional hardening (non-blocking): consider adding curl time bounds (e.g. --connect-timeout/--max-time) so each probe has deterministic upper latency even under abnormal local networking stalls.
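
A minimal sketch of what the suggested time bounds might look like follows. The 2s and 5s values, the function name, and the `CURL_BIN` override are assumptions chosen for illustration; `--connect-timeout` and `--max-time` are standard curl options.

```shell
# Hypothetical probe: succeeds only if the URL answers with a non-error HTTP
# status within the time bounds.
probe_healthz() {
  local url="$1"
  # --connect-timeout bounds connection setup; --max-time bounds the whole
  # transfer. CURL_BIN is an assumed override so the probe can be stubbed.
  "${CURL_BIN:-curl}" --fail --silent --show-error --output /dev/null \
    --connect-timeout 2 --max-time 5 "$url"
}

# Example (URL is an assumption):
# probe_healthz http://127.0.0.1:8080/healthz
```

With `--max-time 5`, each probe is bounded at roughly 5 seconds, so the worst-case wall time of the health gate becomes the number of attempts times that bound plus the backoff sleeps.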

Verdict

  • APPROVE — ready to merge for issue #141 scope.
reviewer-bot approved these changes 2026-02-28 18:39:39 +01:00
reviewer-bot left a comment
Owner

submit
integrator-bot merged commit 6fd57d1714 into main 2026-02-28 18:40:44 +01:00
integrator-bot deleted branch fix/141-staging-healthcheck-retry 2026-02-28 18:40:44 +01:00