- Change the dates, developer names, and other placeholders accordingly.
- A well-documented write-up is sufficient.
- This file contains two versions of the documentation (new + old).
- – Fivetran sync alert: connection_refused
- – Cloud SQL instance production becomes unreachable
- – Engineering declares Disaster Mode (DR‑001)
- – Emergency database access granted via break‑glass IAM role (cloudsql.emergencyReader)
- – Automated fail‑over attempted but failed (HA standby unhealthy)
- – Point‑in‑time recovery (PITR) clone created (restore point 03:10; see the sketch below)
- – Service fully restored —
- RTO: 1 hour ➜ Actual: (✖ Miss)
- RPO: 15 minutes ➜ Actual: (✔ Pass)
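For reference, the PITR step in the timeline above can be reproduced with a point-in-time clone of the affected instance. The sketch below is a minimal illustration rather than the run-book itself: the instance name production is taken from the timeline, while the clone name, the calendar date in the restore timestamp, and the assumption that times are UTC are placeholders.

```python
# Minimal sketch: create a point-in-time clone of the production Cloud SQL
# instance at the 03:10 restore point using the gcloud CLI.
import subprocess

SOURCE_INSTANCE = "production"             # instance name from the timeline
TARGET_INSTANCE = "production-pitr-0310"   # hypothetical name for the clone
RESTORE_POINT = "2025-05-10T03:10:00Z"     # RFC 3339; date and zone are placeholders

subprocess.run(
    [
        "gcloud", "sql", "instances", "clone",
        SOURCE_INSTANCE, TARGET_INSTANCE,
        "--point-in-time", RESTORE_POINT,
    ],
    check=True,  # raise if gcloud rejects the clone request
)
```

Once the clone is healthy, traffic is cut over to it and the 02:51–03:10 gap is replayed from upstream logs, as noted in the impact section below.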
- 19 minutes of transactional data affected between 02:51–03:10 (replayed from upstream logs after cut‑over).
- API error rate peaked at 100 % for 72 minutes; degraded performance until 05:29.
- Fivetran sync jobs queued for > 2 hours; dbt models failed.
- Detection via automated alerting (Fivetran & Stackdriver).
- Break‑glass procedure invoked: the emergency IAM role allowed the on‑call engineer to export binary logs (.wal) for forensic replay (a sketch of a time‑bound grant follows this list).
- The failed HA fail‑over highlighted a standby mis‑configuration.
- PITR clone recovery followed the GCP run‑book; checksum integrity was validated.
- Customer communications were posted every 30 minutes via the status page.
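The break‑glass grant mentioned above is easiest to keep auditable when it is time‑bound. Below is a hedged sketch of such a grant using an IAM condition; the project ID, on‑call identity, and expiry are placeholders, and cloudsql.emergencyReader is assumed to be a custom project‑level role as named in this report.

```python
# Sketch: grant the break-glass role to the on-call engineer with an expiry,
# so emergency access lapses automatically after the incident window.
import subprocess

PROJECT_ID = "optimsync-prod"                    # placeholder project ID
MEMBER = "user:oncall-engineer@example.com"      # placeholder on-call identity
ROLE = f"projects/{PROJECT_ID}/roles/cloudsql.emergencyReader"  # custom role from the report
EXPIRY = "2025-05-10T05:00:00Z"                  # placeholder expiry timestamp

condition = (
    f"expression=request.time < timestamp('{EXPIRY}'),"
    "title=break-glass-expiry"
)

subprocess.run(
    [
        "gcloud", "projects", "add-iam-policy-binding", PROJECT_ID,
        "--member", MEMBER,
        "--role", ROLE,
        "--condition", condition,
    ],
    check=True,
)
```

The condition expression makes the grant self-expiring, which avoids relying on a manual revocation step after the incident closes.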
- — Owner: Data Eng (target 15 Jul 2025; In Progress)
- — Owner: DevOps (22 Jul 2025; Scheduled)
- — Owner: Compliance (31 Jul 2025; Not Started)
- — Owner: Security (31 Aug 2025; Not Started)
Optimsync – Unplanned Disaster Event (Live Incident)
- System: Google Cloud SQL (Production Database)
- Date: [Insert actual date, e.g., May 10, 2025]
- Prepared by: Developer 2
- Engineering Team
- Data Team (ETL / Fivetran / dbt)
During a scheduled ETL job using Fivetran and dbt, our production Google Cloud SQL instance unexpectedly went down. This caused a complete failure in our data ingestion pipeline, leading to the loss of processed data.
- Mid-process failure interrupted data extraction and transformation.
- Any data not yet captured in a backup was lost.
- Approximately [insert time, e.g., 2 hours] of unavailability.
- No automated failover or backup restoration was triggered at the time (see the configuration check sketched after this list).
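One quick check that would have surfaced this gap is reading the instance's backup configuration. The sketch below is illustrative only; the instance name is a placeholder and the field names follow the Cloud SQL Admin API response as returned by gcloud.

```python
# Sketch: inspect whether automated backups and point-in-time recovery are
# enabled on the production instance.
import json
import subprocess

INSTANCE = "production"  # placeholder instance name

describe = subprocess.run(
    ["gcloud", "sql", "instances", "describe", INSTANCE, "--format=json"],
    capture_output=True, text=True, check=True,
)

backup_cfg = json.loads(describe.stdout).get("settings", {}).get("backupConfiguration", {})
print("automated backups enabled:", backup_cfg.get("enabled", False))
print("point-in-time recovery enabled:", backup_cfg.get("pointInTimeRecoveryEnabled", False))
```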
- The incident was detected via failed Fivetran sync alerts.
- Cloud SQL logs were reviewed to identify the root cause (a log query sketch follows this list).
- Manual restoration attempts were made but failed due to lack of recent backups.
- The incident was escalated to GCP support.
- Postmortem was conducted to evaluate weaknesses in our DR strategy.
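For the log-review step, the relevant entries can be pulled from Cloud Logging with a resource filter. The sketch below assumes a placeholder time window and that the affected project is the active gcloud configuration.

```python
# Sketch: list recent Cloud SQL error-level log entries around the outage window.
import subprocess

FILTER = (
    'resource.type="cloudsql_database" '
    'AND severity>=ERROR '
    'AND timestamp>="2025-05-10T02:30:00Z"'  # placeholder start of the window
)

subprocess.run(
    ["gcloud", "logging", "read", FILTER, "--limit", "100", "--format", "json"],
    check=True,
)
```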
- Regular automated backups were not enforced or verified.
- Disaster recovery (DR) plan was not up to date.
- No clear RTO/RPO objectives were previously defined.
- Automated backups enabled in Google Cloud SQL with 7-day retention (see the sketch after this list).
- RTO (Recovery Time Objective): 1 hour
- RPO (Recovery Point Objective): 15 minutes
- Backup restore checks implemented to test integrity.
- Alerting integrated for ETL failures.
- Regular DR drills scheduled moving forward.
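As a concrete reference for the backup changes listed above, the settings can be applied with a single patch of the instance. This is a sketch under assumptions: the instance name and backup window are placeholders, and the point-in-time-recovery flag shown applies to PostgreSQL instances (MySQL uses binary logging instead).

```python
# Sketch: enable daily automated backups with 7-day retention and point-in-time
# recovery on the production Cloud SQL instance.
import subprocess

INSTANCE = "production"  # placeholder instance name

subprocess.run(
    [
        "gcloud", "sql", "instances", "patch", INSTANCE,
        "--backup-start-time", "03:00",     # daily backup window start (UTC), placeholder
        "--retained-backups-count", "7",    # keep seven automated backups
        "--enable-point-in-time-recovery",  # for MySQL, use --enable-bin-log instead
    ],
    check=True,
)
```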
- DR drill scenario: simulated database failure and recovery (outlined in the sketch below)
- Scheduled date: [e.g., August 1, 2025]
- Owner: Engineering / DevOps
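A rough outline of what the drill could automate is sketched below: restore a clone from a chosen point in time, run validation against it, and clean up. The clone name, restore timestamp, and the validation hook are placeholders rather than an agreed run-book.

```python
# Sketch: simulated failure-and-recovery drill for the production instance.
import subprocess

SOURCE = "production"                    # placeholder instance name
DRILL_CLONE = "production-dr-drill"      # placeholder clone name
RESTORE_POINT = "2025-08-01T03:00:00Z"   # placeholder restore timestamp

def run(cmd):
    """Run a gcloud command and fail loudly if it errors."""
    subprocess.run(cmd, check=True)

# 1. Restore: clone the instance at the chosen point in time.
run(["gcloud", "sql", "instances", "clone", SOURCE, DRILL_CLONE,
     "--point-in-time", RESTORE_POINT])

# 2. Validate: placeholder for checksum / row-count comparisons against production.
print(f"TODO: run integrity checks against {DRILL_CLONE}")

# 3. Clean up: delete the drill clone so it does not accrue cost.
run(["gcloud", "sql", "instances", "delete", DRILL_CLONE, "--quiet"])
```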