# Incident Response Playbook

This playbook covers the five most critical incident scenarios for the Gatelithix Gateway. Each scenario includes symptoms, diagnosis steps, resolution procedures, and escalation paths.
## General Incident Protocol

1. **Detect:** Monitoring alerts fire (Cloud Monitoring, uptime checks, error-rate spikes)
2. **Assess:** Determine severity and affected services
3. **Contain:** Isolate the issue (circuit breaker, traffic shift, scale down)
4. **Resolve:** Apply the appropriate fix from the scenarios below
5. **Post-mortem:** Document root cause, timeline, and preventive measures
## Scenario 1: Database Connection Failure

### Symptoms

- Health probe (`/health/ready`) returns 503
- Connection pool errors in Cloud Run logs: `failed to acquire connection`, `connection refused`
- Increased latency on all API endpoints
- Cloud Monitoring alert: `Cloud SQL connection errors > threshold`
Diagnosis
# Check Cloud SQL instance status
gcloud sql instances describe gatelithix-core-db \
--project gatelithix-core --format="value(state)"
# Check Cloud SQL Auth Proxy logs (if applicable)
gcloud logging read 'resource.type="cloud_run_revision" AND textPayload:"cloudsqlconn"' \
--project gatelithix-core --limit 50
# Check active connections
gcloud sql instances describe gatelithix-core-db \
--project gatelithix-core --format="value(settings.databaseFlags)"
# Verify IAM authentication is working
gcloud sql instances describe gatelithix-core-db \
--project gatelithix-core --format="value(settings.databaseFlags[].value)"Resolution
- If the Cloud SQL instance is stopped or suspended: restart it via the console or `gcloud sql instances restart`
- If the connection pool is exhausted: scale down Cloud Run instances to reduce the total connection count (`MaxConns=5` per instance)
- If IAM auth is failing: verify the service account has `roles/cloudsql.client` and `roles/cloudsql.instanceUser`
- If there is a network issue: check VPC connector status and verify firewall rules have not changed
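The pool-exhaustion case above comes down to arithmetic: worst-case connections are Cloud Run max instances times `MaxConns` per instance, and that product must stay below the Cloud SQL `max_connections` flag. A minimal sketch of the budget check, with placeholder values (the max-instances and max_connections figures below are assumptions, not documented limits):

```shell
# Connection budget sanity check. Values are placeholders; substitute the
# real figures from your Cloud Run and Cloud SQL configuration.
MAX_CONNS_PER_INSTANCE=5      # MaxConns in the application pool config
CLOUD_RUN_MAX_INSTANCES=20    # Cloud Run --max-instances (assumed)
SQL_MAX_CONNECTIONS=100       # Cloud SQL max_connections flag (assumed)

# Worst case: every instance opens its full pool simultaneously.
worst_case=$(( MAX_CONNS_PER_INSTANCE * CLOUD_RUN_MAX_INSTANCES ))
if [ "$worst_case" -ge "$SQL_MAX_CONNECTIONS" ]; then
  echo "OVER BUDGET: up to $worst_case connections vs limit $SQL_MAX_CONNECTIONS"
else
  echo "OK: worst case $worst_case of $SQL_MAX_CONNECTIONS"
fi
```

With these placeholder numbers the check fails (100 of 100), which is exactly the situation where scaling down instances, rather than restarting, is the right containment step.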
```shell
# Restart the Cloud Run revision to reset connection pools
gcloud run services update api-gateway \
  --region us-central1 --project gatelithix-core \
  --update-env-vars="RESTART_TRIGGER=$(date +%s)"
```

### Vault PCI Database Connectivity
The vault service's `/health/ready` probe checks PCI Cloud SQL connectivity. When the PCI database is unreachable:

- Vault `/health/ready` returns 503 with `{"status":"unhealthy","error":"database unavailable"}`
- Cloud Run stops routing traffic to unhealthy vault instances
- All tokenization and PAN operations fail until connectivity is restored

Diagnosis checklist for vault DB failures:

- **Cloud SQL Auth Proxy:** Verify the PCI Cloud SQL instance is accepting connections via the Go connector (`cloudsqlconn` library)
- **IAM permissions:** The vault service account needs `roles/cloudsql.client` and `roles/cloudsql.instanceUser` on the PCI project
- **Network peering:** Confirm the `core-to-pci` VPC peering is in the `ACTIVE` state and firewall rules allow TCP 5432 from the vault Cloud Run VPC connector subnet
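The checklist can be read as a conjunction: readiness only passes when all three conditions hold. A small sketch of that logic, with the check results hard-coded as placeholders where the real values would come from the `gcloud` commands noted in the comments (the exact readiness implementation is assumed, not taken from the vault source):

```shell
# Simplified model of the vault readiness decision. Each variable stands in
# for the output of the corresponding diagnosis command.
sql_state="RUNNABLE"    # gcloud sql instances describe ... --format="value(state)"
peering_state="ACTIVE"  # gcloud compute networks peerings list ...
iam_ok="yes"            # vault SA has cloudsql.client + cloudsql.instanceUser

# /health/ready returns 200 only if every dependency check passes.
if [ "$sql_state" = "RUNNABLE" ] && [ "$peering_state" = "ACTIVE" ] && [ "$iam_ok" = "yes" ]; then
  ready=200
else
  ready=503
fi
echo "readiness: $ready"
```

Working the checklist top to bottom maps directly onto flipping each variable: any single failing item is enough to explain a 503.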
### Escalation

- Connection pool issues persisting after restart: review the `MaxConns` setting against the Cloud SQL `max_connections` flag
- Cloud SQL instance unavailable: open a GCP Support ticket (Priority P1 for production)
- IAM auth failures across all services: check for organization-level IAM policy changes
## Scenario 2: Pub/Sub Message Backlog

### Symptoms

- Webhook delivery delays reported by merchants
- Subscription backlog growing in Cloud Monitoring
- `gatelithix pubsub-dlq peek` shows increasing dead-letter messages
- Payment event processing latency increasing
### Diagnosis

```shell
# Check the subscription's message retention settings
gcloud pubsub subscriptions describe webhook-outbound-sub \
  --project gatelithix-core \
  --format="value(messageRetentionDuration)"

# Check the DLQ subscription configuration
gcloud pubsub subscriptions describe webhook-outbound-dlq-sub \
  --project gatelithix-core

# View recent DLQ messages (non-destructive peek)
gatelithix pubsub-dlq peek --subscription webhook-outbound-dlq-sub --limit 10

# Check subscriber Cloud Run logs for processing errors
gcloud logging read 'resource.type="cloud_run_revision" AND severity>=ERROR AND textPayload:"webhook"' \
  --project gatelithix-core --limit 50
```

### Resolution
- If the subscriber is crashing: check Cloud Run logs for panics or OOM kills; increase memory if needed
- If there are processing errors: fix the processing bug, redeploy, then replay DLQ messages
- If the backlog is due to a slow consumer: increase Cloud Run max instances for the subscriber service
- If a merchant endpoint is down: messages retry automatically (exponential backoff 10s-600s, max 5 attempts before DLQ)
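The retry behavior in the last bullet can be made concrete. Assuming a doubling backoff from the 10s minimum, capped at 600s (the doubling factor is an assumption; the source states only the 10s-600s range and the 5-attempt limit), the delay schedule before a message lands in the DLQ looks like this:

```shell
# Sketch of the webhook retry schedule: exponential backoff between the
# stated 10s floor and 600s ceiling, for the stated maximum of 5 attempts.
# The doubling factor is assumed, not confirmed by the playbook.
min=10; max=600
delay=$min
schedule=""
for attempt in 1 2 3 4 5; do
  schedule="$schedule $delay"
  delay=$(( delay * 2 ))
  if [ "$delay" -gt "$max" ]; then delay=$max; fi
done
echo "retry delays (s):$schedule"
```

Under these assumptions a failing merchant endpoint has roughly five minutes of cumulative retries before dead-lettering, which is why short merchant-side blips usually recover on their own while longer outages require a DLQ replay.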
```shell
# Replay DLQ messages after fixing the issue
gatelithix pubsub-dlq replay --subscription webhook-outbound-dlq-sub

# Scale up subscriber instances temporarily
gcloud run services update api-gateway \
  --max-instances 20 \
  --region us-central1 --project gatelithix-core
```

### Escalation
- DLQ growing despite healthy subscribers: Investigate message format changes or schema drift
- Pub/Sub service degradation: Check GCP Status Dashboard
- Message replay failures: Manual investigation of individual DLQ messages required
## Scenario 3: Circuit Breaker Trip

### Symptoms

- A connector returns consistent errors (5xx, timeouts)
- The circuit breaker enters the OPEN state; all requests to that connector fail immediately
- Error logs: `circuit breaker OPEN for connector [name]`
- Payment decline rate spikes for merchants routed to the affected connector
### Diagnosis

```shell
# Check connector health status
gatelithix connector-health

# View connector Cloud Run logs
gcloud logging read 'resource.type="cloud_run_revision" AND resource.labels.service_name="stripe-connector"' \
  --project gatelithix-core --limit 50

# Check PSP status pages:
#   Stripe:   https://status.stripe.com/
#   NMI:      https://status.nmi.com/
#   FluidPay: https://status.fluidpay.com/

# Verify the connector Cloud Run service is healthy
gcloud run services describe stripe-connector \
  --region us-central1 --project gatelithix-core \
  --format="value(status.conditions)"
```

### Resolution
- If the PSP is down: wait for PSP recovery; the circuit breaker will automatically transition to HALF-OPEN and probe for recovery
- If the connector service is unhealthy: check the deployment and redeploy if needed
- If there is a network issue: verify egress firewall rules allow TCP 443 to PSP IPs
- If an API key is expired or revoked: rotate the connector API key using the appropriate rotation script
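The OPEN and HALF-OPEN behavior referenced above follows the standard circuit breaker state machine. A minimal sketch, assuming a consecutive-failure threshold (the threshold value and exact transition rules are assumptions; the connectors' actual breaker implementation may differ):

```shell
# Toy circuit breaker: CLOSED -> OPEN after N consecutive failures;
# OPEN fails fast until a cooldown promotes it to HALF-OPEN; one probe
# in HALF-OPEN decides between CLOSED (success) and OPEN (failure).
# The threshold of 5 is an assumed value for illustration.
FAILURE_THRESHOLD=5
state="CLOSED"
failures=0

record_result() {  # $1 = "ok" or "fail"
  case "$state" in
    CLOSED)
      if [ "$1" = "fail" ]; then
        failures=$(( failures + 1 ))
        if [ "$failures" -ge "$FAILURE_THRESHOLD" ]; then state="OPEN"; fi
      else
        failures=0
      fi ;;
    OPEN)
      # Requests fail fast here; a cooldown timer (not modeled) moves
      # the breaker to HALF-OPEN to probe for PSP recovery.
      : ;;
    HALF-OPEN)
      if [ "$1" = "ok" ]; then state="CLOSED"; failures=0; else state="OPEN"; fi ;;
  esac
}

for r in fail fail fail fail fail; do record_result "$r"; done
echo "state after 5 consecutive failures: $state"
```

This also explains the restart-based reset in the snippet below the resolution list: because the breaker state lives in process memory, redeploying the connector starts it back at CLOSED.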
```shell
# Force a circuit breaker reset (after confirming the PSP is healthy):
# redeploying the connector service resets its in-memory breaker state
gcloud run services update stripe-connector \
  --region us-central1 --project gatelithix-core \
  --update-env-vars="RESTART_TRIGGER=$(date +%s)"
```

### Escalation
- PSP outage lasting more than 30 minutes: Evaluate failover routing to alternate connector
- Circuit breaker tripping on all connectors simultaneously: Investigate gateway egress network issue
- Persistent intermittent failures: Contact PSP support with request IDs and timestamps
## Scenario 4: KMS Unavailable

### Symptoms

- Vault service returns 500 errors on tokenization requests
- Error logs: `encrypt: rpc error`, `kms: permission denied`, `kms: key not found`
- All payment flows requiring tokenization fail
- Gateway returns `error_code: vault_unavailable` to merchants
### Diagnosis

```shell
# Check KMS key status
gcloud kms keys describe pan-encryption-key \
  --keyring gatelithix-vault \
  --location us-central1 \
  --project gatelithix-pci

# Check KMS key version status
gcloud kms keys versions list \
  --key pan-encryption-key \
  --keyring gatelithix-vault \
  --location us-central1 \
  --project gatelithix-pci

# Verify the vault SA has KMS permissions
gcloud kms keys get-iam-policy pan-encryption-key \
  --keyring gatelithix-vault \
  --location us-central1 \
  --project gatelithix-pci

# Check that the Cloud KMS API is enabled
gcloud services list --project gatelithix-pci | grep cloudkms
```

### Resolution
- If a KMS key version is disabled: enable the latest key version
- If IAM permissions were removed: re-apply `roles/cloudkms.cryptoKeyEncrypterDecrypter` to the vault SA
- If the KMS API is disabled: re-enable `cloudkms.googleapis.com`
- If key rotation created a new primary version: ensure the application handles decryption with previous key versions
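The last bullet works because Cloud KMS symmetric ciphertext records which key version produced it, and decryption uses that recorded version rather than the current primary. Rotation therefore only breaks decryption if an older version gets disabled or destroyed. A simplified model of that lookup (the version numbers are hypothetical; real KMS embeds the version inside the ciphertext rather than storing it separately):

```shell
# Simplified model: decrypt succeeds iff the version that encrypted the
# ciphertext is still among the key's ENABLED versions, regardless of
# which version is currently primary. Version numbers are placeholders.
primary_version=3
ciphertext_version=1      # version recorded in the ciphertext at encrypt time
enabled_versions="1 2 3"  # gcloud kms keys versions list ... --filter="state=ENABLED"

can_decrypt=no
for v in $enabled_versions; do
  if [ "$v" = "$ciphertext_version" ]; then can_decrypt=yes; fi
done
echo "decrypt with version $ciphertext_version (primary=$primary_version): $can_decrypt"
```

The practical consequence for this scenario: after rotation, never disable or schedule destruction of old key versions while any stored PAN ciphertext still depends on them.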
```shell
# Re-enable a disabled key version
gcloud kms keys versions enable VERSION_NUMBER \
  --key pan-encryption-key \
  --keyring gatelithix-vault \
  --location us-central1 \
  --project gatelithix-pci

# Grant KMS permissions (if removed)
gcloud kms keys add-iam-policy-binding pan-encryption-key \
  --keyring gatelithix-vault \
  --location us-central1 \
  --project gatelithix-pci \
  --member="serviceAccount:vault-sa@gatelithix-pci.iam.gserviceaccount.com" \
  --role="roles/cloudkms.cryptoKeyEncrypterDecrypter"
```

### Escalation
- KMS key destroyed (not recoverable): this is a critical data-loss event; engage GCP Support immediately
- KMS service-wide outage: check the GCP Status Dashboard and open a P1 support ticket
- HSM hardware failure: GCP handles HSM redundancy automatically; if the issue persists, escalate to GCP
## Scenario 5: Vault Unreachable

### Symptoms

- Gateway returns proxy errors when calling the vault service
- Error logs: `vault proxy: connection refused`, `vault: deadline exceeded`
- Tokenization requests fail with 502/504
- Health probe shows vault as unhealthy in the gateway readiness check
### Diagnosis

```shell
# Check vault Cloud Run health
gcloud run services describe vault \
  --region us-central1 --project gatelithix-pci \
  --format="value(status.conditions)"

# Check vault Cloud Run logs
gcloud logging read 'resource.type="cloud_run_revision" AND resource.labels.service_name="vault"' \
  --project gatelithix-pci --limit 50

# Verify VPC peering is active
gcloud compute networks peerings list \
  --network core-vpc --project gatelithix-core

# Check firewall rules
gcloud compute firewall-rules describe pci-vpc-allow-core-ingress \
  --project gatelithix-pci

# Verify the vault service URL in the gateway config
gcloud run services describe api-gateway \
  --region us-central1 --project gatelithix-core \
  --format="value(spec.template.spec.containers[0].env)"
```

### Resolution
- If the vault Cloud Run instance crashed: check for OOM kills or panics in the logs; increase memory if needed
- If VPC peering is broken: re-establish peering (`terraform apply` on the core and PCI network modules)
- If a firewall rule was deleted: re-apply it via `terraform apply` on the PCI network module
- If the vault service URL changed: update the `VAULT_URL` environment variable in the gateway Cloud Run config
- If the PCI database is down: follow the Scenario 1 diagnosis for the PCI Cloud SQL instance
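The stale-URL case reduces to comparing two values: the `VAULT_URL` the gateway is configured with, and the URL the vault service actually reports. A sketch of that comparison, with placeholder URLs standing in for the output of the two `gcloud` commands shown in the Diagnosis section (the URLs below are hypothetical):

```shell
# Consistency check between gateway config and deployed vault service.
# Both values are placeholders for the output of:
#   gateway_vault_url: gcloud run services describe api-gateway ... (VAULT_URL env var)
#   vault_status_url:  gcloud run services describe vault --format="value(status.url)"
gateway_vault_url="https://vault-abc123-uc.a.run.app"
vault_status_url="https://vault-abc123-uc.a.run.app"

if [ "$gateway_vault_url" = "$vault_status_url" ]; then
  echo "VAULT_URL matches the deployed vault service"
else
  echo "MISMATCH: update VAULT_URL in the gateway Cloud Run config"
fi
```

A mismatch here points at the fourth resolution bullet; a match means the problem is network-level (peering, firewall) or the vault service itself.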
```shell
# Restart the vault service
gcloud run services update vault \
  --region us-central1 --project gatelithix-pci \
  --update-env-vars="RESTART_TRIGGER=$(date +%s)"

# Look up the vault service URL (to check the gateway's VAULT_URL against)
gcloud run services describe vault \
  --region us-central1 --project gatelithix-pci \
  --format="value(status.url)"
```

### Escalation
- VPC peering failure persisting after `terraform apply`: escalate to the GCP networking team for investigation
- Vault service repeatedly crashing: check for database corruption or KMS issues (see Scenarios 1 and 4)
- Cross-project IAM binding failure: verify the `roles/run.invoker` binding for the gateway SA on the vault service