The problem
Upheal's AI mental health platform was scaling fast, but its product engineering team had no spare DevOps capacity. Off-hours incidents waited until morning, a real risk for clinicians using the platform during patient sessions. Disaster recovery had never been tested, and SOC2 attestation was becoming a hard prerequisite for enterprise deals involving PHI.
What we shipped
We layered CloudWatch monitoring with application-level signals (clinical-note API latency, AI queue depth, DB pool utilisation) and fed it into Slack-native alerting backed by a 15-minute on-call SLA. SOC2 readiness work hardened IAM with enforced MFA and quarterly access reviews, centralised CloudTrail logs in an S3 bucket with Object Lock enabled, and forced all changes through version-controlled pipelines. Cross-region RDS replication, S3 versioning and quarterly DR exercises gave Upheal verifiable recovery procedures.
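As a rough illustration of the application-level alarms described above, the sketch below builds one CloudWatch alarm definition per signal. The namespace, metric names, thresholds and SNS topic ARN are hypothetical placeholders, not Upheal's actual configuration, and the idea that an SNS topic relays alarms to Slack is an assumption about the wiring. The dictionary keys follow the parameters of boto3's `put_metric_alarm`.

```python
from typing import Any

# Hypothetical values for illustration only: the namespace, metric names,
# thresholds and SNS topic are assumptions, not the real configuration.
NAMESPACE = "Example/App"
ALERT_TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:example-oncall-alerts"

SIGNALS = [
    {"metric": "ClinicalNoteApiLatencyMs", "stat": "p95", "threshold": 2000.0},
    {"metric": "AiQueueDepth", "stat": "Maximum", "threshold": 500.0},
    {"metric": "DbPoolUtilisationPercent", "stat": "Average", "threshold": 85.0},
]


def build_alarm(signal: dict[str, Any]) -> dict[str, Any]:
    """Return keyword arguments suitable for CloudWatch put_metric_alarm."""
    alarm: dict[str, Any] = {
        "AlarmName": f"{signal['metric']}-high",
        "Namespace": NAMESPACE,
        "MetricName": signal["metric"],
        "Period": 60,               # evaluate one-minute datapoints
        "EvaluationPeriods": 3,     # three consecutive breaches before alarming
        "Threshold": signal["threshold"],
        "ComparisonOperator": "GreaterThanThreshold",
        # Assumption: a silent application is treated as a problem.
        "TreatMissingData": "breaching",
        "AlarmActions": [ALERT_TOPIC_ARN],
    }
    # Percentiles go in ExtendedStatistic; standard statistics in Statistic.
    if signal["stat"].startswith("p"):
        alarm["ExtendedStatistic"] = signal["stat"]
    else:
        alarm["Statistic"] = signal["stat"]
    return alarm


ALARMS = [build_alarm(s) for s in SIGNALS]
```

Each entry in `ALARMS` could then be passed to `boto3.client("cloudwatch").put_metric_alarm(**alarm)`. Keeping the definitions as plain data in a repository fits the requirement that all changes go through version-controlled pipelines.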
The outcome
Mean time to acknowledge dropped from hours to under 10 minutes, and MTTR for common incidents fell by 70% through runbook automation. Two real incidents during the engagement were resolved cleanly using the tested DR procedures, with no data loss. SOC2 readiness was reached in four months, and the engineering team came off infrastructure on-call entirely.