# Incident Response Playbook

This playbook covers the five most critical incident scenarios for the Gatelithix Gateway. Each scenario includes symptoms, diagnosis steps, resolution procedures, and escalation paths.
## General Incident Protocol

1. **Detect:** Monitoring alerts fire (Cloud Monitoring, uptime checks, error-rate spikes)
2. **Assess:** Determine severity and affected services
3. **Contain:** Isolate the issue (circuit breaker, traffic shift, scale down)
4. **Resolve:** Apply the appropriate fix from the scenarios below
5. **Post-mortem:** Document root cause, timeline, and preventive measures
## Scenario 1: Database Connection Failure

### Symptoms

- Health probe (`/health/ready`) returns 503
- Connection pool errors in Cloud Run logs: `failed to acquire connection`, `connection refused`
- Increased latency on all API endpoints
- Cloud Monitoring alert: `Cloud SQL connection errors > threshold`
Diagnosis
# Check Cloud SQL instance status
gcloud sql instances describe gatelithix-core-db \
--project gatelithix-core --format="value(state)"
# Check Cloud SQL Auth Proxy logs (if applicable)
gcloud logging read 'resource.type="cloud_run_revision" AND textPayload:"cloudsqlconn"' \
--project gatelithix-core --limit 50
# Check active connections
gcloud sql instances describe gatelithix-core-db \
--project gatelithix-core --format="value(settings.databaseFlags)"
# Verify IAM authentication is working
gcloud sql instances describe gatelithix-core-db \
--project gatelithix-core --format="value(settings.databaseFlags[].value)"Resolution
- If the Cloud SQL instance is stopped or suspended: restart it via the console or `gcloud sql instances restart`
- If the connection pool is exhausted: scale down Cloud Run instances to reduce the total connection count (`MaxConns=5` per instance)
- If IAM auth is failing: verify the service account has `roles/cloudsql.client` and `roles/cloudsql.instanceUser`
- If there is a network issue: check VPC connector status and verify firewall rules have not changed
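The pool-exhaustion case above comes down to arithmetic: worst-case connections are Cloud Run max instances times `MaxConns` per instance, and that product must stay below the Cloud SQL `max_connections` flag. A minimal sketch of the budget check, with placeholder values (the max-instances and max_connections figures below are assumptions, not documented limits):

```shell
# Connection budget sanity check. Values are placeholders; substitute the
# real figures from your Cloud Run and Cloud SQL configuration.
MAX_CONNS_PER_INSTANCE=5      # MaxConns in the application pool config
CLOUD_RUN_MAX_INSTANCES=20    # Cloud Run --max-instances (assumed)
SQL_MAX_CONNECTIONS=100       # Cloud SQL max_connections flag (assumed)

# Worst case: every instance opens its full pool simultaneously.
worst_case=$(( MAX_CONNS_PER_INSTANCE * CLOUD_RUN_MAX_INSTANCES ))
if [ "$worst_case" -ge "$SQL_MAX_CONNECTIONS" ]; then
  echo "OVER BUDGET: up to $worst_case connections vs limit $SQL_MAX_CONNECTIONS"
else
  echo "OK: worst case $worst_case of $SQL_MAX_CONNECTIONS"
fi
```

With these placeholder numbers the check fails (100 of 100), which is exactly the situation where scaling down instances, rather than restarting, is the right containment step.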
```shell
# Restart the Cloud Run revision to reset connection pools
gcloud run services update api-gateway \
  --region us-central1 --project gatelithix-core \
  --update-env-vars="RESTART_TRIGGER=$(date +%s)"
```

### Vault PCI Database Connectivity
The vault service's `/health/ready` probe checks PCI Cloud SQL connectivity. When the PCI database is unreachable:

- Vault `/health/ready` returns 503 with `{"status":"unhealthy","error":"database unavailable"}`
- Cloud Run stops routing traffic to unhealthy vault instances
- All tokenization and PAN operations fail until connectivity is restored

Diagnosis checklist for vault DB failures:

- **Cloud SQL Auth Proxy:** Verify the PCI Cloud SQL instance is accepting connections via the Go connector (`cloudsqlconn` library)
- **IAM permissions:** The vault service account needs `roles/cloudsql.client` and `roles/cloudsql.instanceUser` on the PCI project
- **Network peering:** Confirm the `core-to-pci` VPC peering is in the `ACTIVE` state and firewall rules allow TCP 5432 from the vault Cloud Run VPC connector subnet
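The checklist can be read as a conjunction: readiness only passes when all three conditions hold. A small sketch of that logic, with the check results hard-coded as placeholders where the real values would come from the `gcloud` commands noted in the comments (the exact readiness implementation is assumed, not taken from the vault source):

```shell
# Simplified model of the vault readiness decision. Each variable stands in
# for the output of the corresponding diagnosis command.
sql_state="RUNNABLE"    # gcloud sql instances describe ... --format="value(state)"
peering_state="ACTIVE"  # gcloud compute networks peerings list ...
iam_ok="yes"            # vault SA has cloudsql.client + cloudsql.instanceUser

# /health/ready returns 200 only if every dependency check passes.
if [ "$sql_state" = "RUNNABLE" ] && [ "$peering_state" = "ACTIVE" ] && [ "$iam_ok" = "yes" ]; then
  ready=200
else
  ready=503
fi
echo "readiness: $ready"
```

Working the checklist top to bottom maps directly onto flipping each variable: any single failing item is enough to explain a 503.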
### Escalation

- Connection pool issues persisting after restart: review the `MaxConns` setting against the Cloud SQL `max_connections` flag
- Cloud SQL instance unavailable: open a GCP Support ticket (Priority P1 for production)
- IAM auth failures across all services: check for organization-level IAM policy changes
## Scenario 2: Pub/Sub Message Backlog

### Symptoms

- Webhook delivery delays reported by merchants
- Subscription backlog growing in Cloud Monitoring
- `gatelithix pubsub-dlq peek` shows increasing dead-letter messages
- Payment event processing latency increasing
### Diagnosis

```shell
# Check the subscription's message retention settings
gcloud pubsub subscriptions describe webhook-outbound-sub \
  --project gatelithix-core \
  --format="value(messageRetentionDuration)"

# Check the DLQ subscription configuration
gcloud pubsub subscriptions describe webhook-outbound-dlq-sub \
  --project gatelithix-core

# View recent DLQ messages (non-destructive peek)
gatelithix pubsub-dlq peek --subscription webhook-outbound-dlq-sub --limit 10

# Check subscriber Cloud Run logs for processing errors
gcloud logging read 'resource.type="cloud_run_revision" AND severity>=ERROR AND textPayload:"webhook"' \
  --project gatelithix-core --limit 50
```

### Resolution
- If the subscriber is crashing: check Cloud Run logs for panics or OOM kills; increase memory if needed
- If there are processing errors: fix the processing bug, redeploy, then replay DLQ messages
- If the backlog is due to a slow consumer: increase Cloud Run max instances for the subscriber service
- If a merchant endpoint is down: messages retry automatically (exponential backoff 10s-600s, max 5 attempts before DLQ)
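The retry behavior in the last bullet can be made concrete. Assuming a doubling backoff from the 10s minimum, capped at 600s (the doubling factor is an assumption; the source states only the 10s-600s range and the 5-attempt limit), the delay schedule before a message lands in the DLQ looks like this:

```shell
# Sketch of the webhook retry schedule: exponential backoff between the
# stated 10s floor and 600s ceiling, for the stated maximum of 5 attempts.
# The doubling factor is assumed, not confirmed by the playbook.
min=10; max=600
delay=$min
schedule=""
for attempt in 1 2 3 4 5; do
  schedule="$schedule $delay"
  delay=$(( delay * 2 ))
  if [ "$delay" -gt "$max" ]; then delay=$max; fi
done
echo "retry delays (s):$schedule"
```

Under these assumptions a failing merchant endpoint has roughly five minutes of cumulative retries before dead-lettering, which is why short merchant-side blips usually recover on their own while longer outages require a DLQ replay.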
```shell
# Replay DLQ messages after fixing the issue
gatelithix pubsub-dlq replay --subscription webhook-outbound-dlq-sub

# Scale up subscriber instances temporarily
gcloud run services update api-gateway \
  --max-instances 20 \
  --region us-central1 --project gatelithix-core
```

### Escalation
- DLQ growing despite healthy subscribers: Investigate message format changes or schema drift
- Pub/Sub service degradation: Check GCP Status Dashboard
- Message replay failures: Manual investigation of individual DLQ messages required
## Scenario 3: Circuit Breaker Trip

### Symptoms

- A connector returns consistent errors (5xx, timeouts)
- The circuit breaker enters the OPEN state; all requests to that connector fail immediately
- Error logs: `circuit breaker OPEN for connector [name]`
- Payment decline rate spikes for merchants routed to the affected connector
### Diagnosis

```shell
# Check connector health status
gatelithix connector-health

# View connector Cloud Run logs
gcloud logging read 'resource.type="cloud_run_revision" AND resource.labels.service_name="stripe-connector"' \
  --project gatelithix-core --limit 50

# Check PSP status pages:
#   Stripe:   https://status.stripe.com/
#   NMI:      https://status.nmi.com/
#   FluidPay: https://status.fluidpay.com/

# Verify the connector Cloud Run service is healthy
gcloud run services describe stripe-connector \
  --region us-central1 --project gatelithix-core \
  --format="value(status.conditions)"
```

### Resolution
- If the PSP is down: wait for PSP recovery; the circuit breaker will automatically transition to HALF-OPEN and probe for recovery
- If the connector service is unhealthy: check the deployment and redeploy if needed
- If there is a network issue: verify egress firewall rules allow TCP 443 to PSP IPs
- If an API key is expired or revoked: rotate the connector API key using the appropriate rotation script
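The OPEN and HALF-OPEN behavior referenced above follows the standard circuit breaker state machine. A minimal sketch, assuming a consecutive-failure threshold (the threshold value and exact transition rules are assumptions; the connectors' actual breaker implementation may differ):

```shell
# Toy circuit breaker: CLOSED -> OPEN after N consecutive failures;
# OPEN fails fast until a cooldown promotes it to HALF-OPEN; one probe
# in HALF-OPEN decides between CLOSED (success) and OPEN (failure).
# The threshold of 5 is an assumed value for illustration.
FAILURE_THRESHOLD=5
state="CLOSED"
failures=0

record_result() {  # $1 = "ok" or "fail"
  case "$state" in
    CLOSED)
      if [ "$1" = "fail" ]; then
        failures=$(( failures + 1 ))
        if [ "$failures" -ge "$FAILURE_THRESHOLD" ]; then state="OPEN"; fi
      else
        failures=0
      fi ;;
    OPEN)
      # Requests fail fast here; a cooldown timer (not modeled) moves
      # the breaker to HALF-OPEN to probe for PSP recovery.
      : ;;
    HALF-OPEN)
      if [ "$1" = "ok" ]; then state="CLOSED"; failures=0; else state="OPEN"; fi ;;
  esac
}

for r in fail fail fail fail fail; do record_result "$r"; done
echo "state after 5 consecutive failures: $state"
```

This also explains the restart-based reset in the snippet below the resolution list: because the breaker state lives in process memory, redeploying the connector starts it back at CLOSED.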
```shell
# Force a circuit breaker reset (after confirming the PSP is healthy):
# redeploying the connector service resets its in-memory breaker state
gcloud run services update stripe-connector \
  --region us-central1 --project gatelithix-core \
  --update-env-vars="RESTART_TRIGGER=$(date +%s)"
```

### Escalation
- PSP outage lasting more than 30 minutes: Evaluate failover routing to alternate connector
- Circuit breaker tripping on all connectors simultaneously: Investigate gateway egress network issue
- Persistent intermittent failures: Contact PSP support with request IDs and timestamps
## Scenario 4: KMS Unavailable

### Symptoms

- Vault service returns 500 errors on tokenization requests
- Error logs: `encrypt: rpc error`, `kms: permission denied`, `kms: key not found`
- All payment flows requiring tokenization fail
- Gateway returns `error_code: vault_unavailable` to merchants
### Diagnosis

```shell
# Check KMS key status
gcloud kms keys describe pan-encryption-key \
  --keyring gatelithix-vault \
  --location us-central1 \
  --project gatelithix-pci

# Check KMS key version status
gcloud kms keys versions list \
  --key pan-encryption-key \
  --keyring gatelithix-vault \
  --location us-central1 \
  --project gatelithix-pci

# Verify the vault SA has KMS permissions
gcloud kms keys get-iam-policy pan-encryption-key \
  --keyring gatelithix-vault \
  --location us-central1 \
  --project gatelithix-pci

# Check that the Cloud KMS API is enabled
gcloud services list --project gatelithix-pci | grep cloudkms
```

### Resolution
- If a KMS key version is disabled: enable the latest key version
- If IAM permissions were removed: re-apply `roles/cloudkms.cryptoKeyEncrypterDecrypter` to the vault SA
- If the KMS API is disabled: re-enable `cloudkms.googleapis.com`
- If key rotation created a new primary version: ensure the application handles decryption with previous key versions
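The last bullet works because Cloud KMS symmetric ciphertext records which key version produced it, and decryption uses that recorded version rather than the current primary. Rotation therefore only breaks decryption if an older version gets disabled or destroyed. A simplified model of that lookup (the version numbers are hypothetical; real KMS embeds the version inside the ciphertext rather than storing it separately):

```shell
# Simplified model: decrypt succeeds iff the version that encrypted the
# ciphertext is still among the key's ENABLED versions, regardless of
# which version is currently primary. Version numbers are placeholders.
primary_version=3
ciphertext_version=1      # version recorded in the ciphertext at encrypt time
enabled_versions="1 2 3"  # gcloud kms keys versions list ... --filter="state=ENABLED"

can_decrypt=no
for v in $enabled_versions; do
  if [ "$v" = "$ciphertext_version" ]; then can_decrypt=yes; fi
done
echo "decrypt with version $ciphertext_version (primary=$primary_version): $can_decrypt"
```

The practical consequence for this scenario: after rotation, never disable or schedule destruction of old key versions while any stored PAN ciphertext still depends on them.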
```shell
# Re-enable a disabled key version
gcloud kms keys versions enable VERSION_NUMBER \
  --key pan-encryption-key \
  --keyring gatelithix-vault \
  --location us-central1 \
  --project gatelithix-pci

# Grant KMS permissions (if removed)
gcloud kms keys add-iam-policy-binding pan-encryption-key \
  --keyring gatelithix-vault \
  --location us-central1 \
  --project gatelithix-pci \
  --member="serviceAccount:vault-sa@gatelithix-pci.iam.gserviceaccount.com" \
  --role="roles/cloudkms.cryptoKeyEncrypterDecrypter"
```

### Escalation
- KMS key destroyed (not recoverable): this is a critical data-loss event; engage GCP Support immediately
- KMS service-wide outage: check the GCP Status Dashboard and open a P1 support ticket
- HSM hardware failure: GCP handles HSM redundancy automatically; if the issue persists, escalate to GCP
## Scenario 5: Vault Unreachable

### Symptoms

- Gateway returns proxy errors when calling the vault service
- Error logs: `vault proxy: connection refused`, `vault: deadline exceeded`
- Tokenization requests fail with 502/504
- Health probe shows vault as unhealthy in the gateway readiness check
### Diagnosis

```shell
# Check vault Cloud Run health
gcloud run services describe vault \
  --region us-central1 --project gatelithix-pci \
  --format="value(status.conditions)"

# Check vault Cloud Run logs
gcloud logging read 'resource.type="cloud_run_revision" AND resource.labels.service_name="vault"' \
  --project gatelithix-pci --limit 50

# Verify VPC peering is active
gcloud compute networks peerings list \
  --network core-vpc --project gatelithix-core

# Check firewall rules
gcloud compute firewall-rules describe pci-vpc-allow-core-ingress \
  --project gatelithix-pci

# Verify the vault service URL in the gateway config
gcloud run services describe api-gateway \
  --region us-central1 --project gatelithix-core \
  --format="value(spec.template.spec.containers[0].env)"
```

### Resolution
- If the vault Cloud Run instance crashed: check for OOM kills or panics in the logs; increase memory if needed
- If VPC peering is broken: re-establish peering (`terraform apply` on the core and PCI network modules)
- If a firewall rule was deleted: re-apply it via `terraform apply` on the PCI network module
- If the vault service URL changed: update the `VAULT_URL` environment variable in the gateway Cloud Run config
- If the PCI database is down: follow the Scenario 1 diagnosis for the PCI Cloud SQL instance
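The stale-URL case reduces to comparing two values: the `VAULT_URL` the gateway is configured with, and the URL the vault service actually reports. A sketch of that comparison, with placeholder URLs standing in for the output of the two `gcloud` commands shown in the Diagnosis section (the URLs below are hypothetical):

```shell
# Consistency check between gateway config and deployed vault service.
# Both values are placeholders for the output of:
#   gateway_vault_url: gcloud run services describe api-gateway ... (VAULT_URL env var)
#   vault_status_url:  gcloud run services describe vault --format="value(status.url)"
gateway_vault_url="https://vault-abc123-uc.a.run.app"
vault_status_url="https://vault-abc123-uc.a.run.app"

if [ "$gateway_vault_url" = "$vault_status_url" ]; then
  echo "VAULT_URL matches the deployed vault service"
else
  echo "MISMATCH: update VAULT_URL in the gateway Cloud Run config"
fi
```

A mismatch here points at the fourth resolution bullet; a match means the problem is network-level (peering, firewall) or the vault service itself.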
```shell
# Restart the vault service
gcloud run services update vault \
  --region us-central1 --project gatelithix-pci \
  --update-env-vars="RESTART_TRIGGER=$(date +%s)"

# Look up the vault service URL (to check the gateway's VAULT_URL against)
gcloud run services describe vault \
  --region us-central1 --project gatelithix-pci \
  --format="value(status.url)"
```

### Escalation
- VPC peering failure persisting after `terraform apply`: escalate to the GCP networking team for investigation
- Vault service repeatedly crashing: check for database corruption or KMS issues (see Scenarios 1 and 4)
- Cross-project IAM binding failure: verify the `roles/run.invoker` binding for the gateway SA on the vault service