Incident Response Playbook

This playbook covers the top 5 critical incident scenarios for the Gatelithix Gateway. Each scenario includes symptoms, diagnosis steps, resolution procedures, and escalation paths.

General Incident Protocol

  1. Detect: Monitoring alerts fire (Cloud Monitoring, uptime checks, error rate spikes)
  2. Assess: Determine severity and affected services
  3. Contain: Isolate the issue (circuit breaker, traffic shift, scale down)
  4. Resolve: Apply the appropriate fix from the scenario below
  5. Post-mortem: Document root cause, timeline, and preventive measures

Scenario 1: Database Connection Failure

Symptoms

  • Health probe (/health/ready) returns 503
  • Connection pool errors in Cloud Run logs: failed to acquire connection, connection refused
  • Increased latency on all API endpoints
  • Cloud Monitoring alert: Cloud SQL connection errors > threshold

Diagnosis

# Check Cloud SQL instance status
gcloud sql instances describe gatelithix-core-db \
  --project gatelithix-core --format="value(state)"

# Check Cloud SQL Auth Proxy logs (if applicable)
gcloud logging read 'resource.type="cloud_run_revision" AND textPayload:"cloudsqlconn"' \
  --project gatelithix-core --limit 50

# Inspect database flags (e.g. max_connections)
gcloud sql instances describe gatelithix-core-db \
  --project gatelithix-core --format="value(settings.databaseFlags)"

# Verify the cloudsql.iam_authentication flag is enabled
gcloud sql instances describe gatelithix-core-db \
  --project gatelithix-core --format="value(settings.databaseFlags[].value)"

Resolution

  1. If Cloud SQL instance is stopped/suspended: Restart via console or gcloud sql instances restart
  2. If connection pool exhausted: Scale down Cloud Run instances to reduce connection count (MaxConns=5 per instance)
  3. If IAM auth failing: Verify service account has roles/cloudsql.client and roles/cloudsql.instanceUser
  4. If network issue: Check VPC connector status, verify firewall rules have not changed
# Restart Cloud Run revision to reset connection pools
gcloud run services update api-gateway \
  --region us-central1 --project gatelithix-core \
  --update-env-vars="RESTART_TRIGGER=$(date +%s)"
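As a quick sanity check for resolution step 2, the arithmetic behind pool exhaustion can be sketched in shell. The max-instances and max_connections values below are illustrative placeholders, not the production settings:

```shell
MAX_INSTANCES=20          # assumed Cloud Run max-instances setting
MAXCONNS_PER_INSTANCE=5   # MaxConns=5 per instance, per step 2 above
SQL_MAX_CONNECTIONS=100   # assumed Cloud SQL max_connections flag

# Peak connection demand is instances x pool size per instance
DEMAND=$((MAX_INSTANCES * MAXCONNS_PER_INSTANCE))
if [ "$DEMAND" -gt "$SQL_MAX_CONNECTIONS" ]; then
  echo "RISK: peak demand $DEMAND exceeds max_connections $SQL_MAX_CONNECTIONS"
else
  echo "OK: peak demand $DEMAND within max_connections $SQL_MAX_CONNECTIONS"
fi
```

If the demand exceeds max_connections, scaling down Cloud Run (or lowering MaxConns) is the faster lever; raising max_connections requires a Cloud SQL flag change.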

Vault PCI Database Connectivity

The vault service’s /health/ready probe checks PCI Cloud SQL connectivity. When the PCI database is unreachable:

  • Vault /health/ready returns 503 with {"status":"unhealthy","error":"database unavailable"}
  • Cloud Run stops routing traffic to unhealthy vault instances
  • All tokenization and PAN operations fail until connectivity is restored
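The gateway-side handling of that readiness payload amounts to a string check on the status field. A minimal sketch, using the example body from above rather than a live curl against the vault URL:

```shell
# Example readiness body from an unhealthy vault (copied from the symptom above)
BODY='{"status":"unhealthy","error":"database unavailable"}'

# Classify the response: unhealthy means fail fast rather than attempt PAN operations
if echo "$BODY" | grep -q '"status":"unhealthy"'; then
  echo "vault NOT ready: $BODY"
else
  echo "vault ready"
fi
```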

Diagnosis checklist for vault DB failures:

  1. Cloud SQL Auth Proxy — Verify the PCI Cloud SQL instance is accepting connections via the Go connector (cloudsqlconn library)
  2. IAM permissions — Vault service account needs roles/cloudsql.client and roles/cloudsql.instanceUser on the PCI project
  3. Network peering — Confirm core-to-pci VPC peering is in ACTIVE state and firewall rules allow TCP 5432 from the vault Cloud Run VPC connector subnet
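Check 3 reduces to a gate on the peering state. The sketch below takes the state as a plain argument so the logic is visible; in practice it would come from gcloud compute networks peerings list --network core-vpc:

```shell
# Decide the next diagnosis step from the core-to-pci peering state
check_peering() {  # $1 = peering state string, e.g. ACTIVE or INACTIVE
  if [ "$1" = "ACTIVE" ]; then
    echo "peering ACTIVE: check firewall rules for TCP 5432 next"
  else
    echo "peering $1: re-establish peering before anything else"
  fi
}

check_peering ACTIVE
check_peering INACTIVE
```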

Escalation

  • Connection pool issues persisting after restart: Review MaxConns setting against Cloud SQL max_connections
  • Cloud SQL instance unavailable: Open GCP Support ticket (Priority P1 for production)
  • IAM auth failures across all services: Check organization IAM policy changes

Scenario 2: Pub/Sub Message Backlog

Symptoms

  • Webhook delivery delays reported by merchants
  • Subscription backlog growing in Cloud Monitoring
  • gatelithix pubsub-dlq peek shows increasing dead-letter messages
  • Payment event processing latency increasing

Diagnosis

# Inspect subscription settings (backlog depth itself is the
# num_undelivered_messages metric in Cloud Monitoring)
gcloud pubsub subscriptions describe webhook-outbound-sub \
  --project gatelithix-core \
  --format="value(messageRetentionDuration)"

# Inspect DLQ subscription configuration
gcloud pubsub subscriptions describe webhook-outbound-dlq-sub \
  --project gatelithix-core

# View recent DLQ messages (non-destructive peek)
gatelithix pubsub-dlq peek --subscription webhook-outbound-dlq-sub --limit 10

# Check subscriber Cloud Run logs for processing errors
gcloud logging read 'resource.type="cloud_run_revision" AND severity>=ERROR AND textPayload:"webhook"' \
  --project gatelithix-core --limit 50

Resolution

  1. If subscriber is crashing: Check Cloud Run logs for panic/OOM, increase memory if needed
  2. If processing errors: Fix the processing bug, redeploy, then replay DLQ messages
  3. If backlog due to slow consumer: Increase Cloud Run max instances for the subscriber service
  4. If merchant endpoint is down: Messages will naturally retry (exponential backoff 10s-600s, max 5 attempts before DLQ)
# Replay DLQ messages after fixing the issue
gatelithix pubsub-dlq replay --subscription webhook-outbound-dlq-sub

# Scale up subscriber instances temporarily
gcloud run services update api-gateway \
  --max-instances 20 \
  --region us-central1 --project gatelithix-core
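When estimating recovery time for step 4, the retry schedule can be approximated in shell. This assumes simple doubling between the 10s minimum and the 600s cap; the actual backoff timing is managed by Pub/Sub:

```shell
# Approximate the retry delays before a message lands in the DLQ
MIN=10; MAX=600; ATTEMPTS=5

delay=$MIN
for i in $(seq 1 $ATTEMPTS); do
  echo "attempt $i: retry after ${delay}s"
  delay=$((delay * 2))                       # assumed doubling policy
  if [ "$delay" -gt "$MAX" ]; then delay=$MAX; fi
done
```

Under the doubling assumption, all five attempts complete within a few minutes of the first failure, so a backlog that persists longer than that points at the subscriber, not the retry policy.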

Escalation

  • DLQ growing despite healthy subscribers: Investigate message format changes or schema drift
  • Pub/Sub service degradation: Check the GCP Status Dashboard (status.cloud.google.com)
  • Message replay failures: Manual investigation of individual DLQ messages required

Scenario 3: Circuit Breaker Trip

Symptoms

  • Connector returning consistent errors (5xx, timeouts)
  • Circuit breaker enters OPEN state, all requests to that connector fail immediately
  • Error logs: circuit breaker OPEN for connector [name]
  • Payment decline rate spikes for merchants routed to the affected connector

Diagnosis

# Check connector health status
gatelithix connector-health

# View connector Cloud Run logs
gcloud logging read 'resource.type="cloud_run_revision" AND resource.labels.service_name="stripe-connector"' \
  --project gatelithix-core --limit 50

# Check PSP status pages
#   Stripe:   https://status.stripe.com/
#   NMI:      https://status.nmi.com/
#   FluidPay: https://status.fluidpay.com/

# Verify connector Cloud Run instance is healthy
gcloud run services describe stripe-connector \
  --region us-central1 --project gatelithix-core \
  --format="value(status.conditions)"

Resolution

  1. If PSP is down: Wait for PSP recovery. Circuit breaker will automatically transition to HALF-OPEN and test recovery
  2. If connector service is unhealthy: Check deployment, redeploy if needed
  3. If network issue: Verify egress firewall rules allow TCP 443 to PSP IPs
  4. If API key expired/revoked: Rotate the connector API key using the appropriate rotation script
# Force circuit breaker reset (after confirming PSP is healthy):
# redeploy the connector service to reset in-memory circuit breaker state
gcloud run services update stripe-connector \
  --region us-central1 --project gatelithix-core \
  --update-env-vars="RESTART_TRIGGER=$(date +%s)"
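The behaviour described in step 1 (OPEN failing fast, HALF-OPEN probing, CLOSED again after a successful probe) can be sketched as a tiny state machine. The threshold of 3 consecutive failures is illustrative; the connector's real threshold and cool-down timing may differ:

```shell
STATE=CLOSED; FAILURES=0; THRESHOLD=3   # assumed failure threshold

record_result() {  # $1 = success|failure
  if [ "$1" = "failure" ]; then
    FAILURES=$((FAILURES + 1))
    # too many consecutive failures trips the breaker
    if [ "$FAILURES" -ge "$THRESHOLD" ]; then STATE=OPEN; fi
  else
    FAILURES=0
    # a success while probing in HALF_OPEN closes the breaker again
    if [ "$STATE" = "HALF_OPEN" ]; then STATE=CLOSED; fi
  fi
}

record_result failure; record_result failure; record_result failure
echo "after 3 failures: $STATE"         # requests now fail fast

STATE=HALF_OPEN                         # cool-down elapsed; one probe allowed
record_result success
echo "after successful probe: $STATE"   # normal traffic resumes
```

This also shows why redeploying resets the breaker: the state lives in process memory, so a fresh instance starts in CLOSED.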

Escalation

  • PSP outage lasting more than 30 minutes: Evaluate failover routing to alternate connector
  • Circuit breaker tripping on all connectors simultaneously: Investigate gateway egress network issue
  • Persistent intermittent failures: Contact PSP support with request IDs and timestamps

Scenario 4: KMS Unavailable

Symptoms

  • Vault service returning 500 errors on tokenization requests
  • Error logs: encrypt: rpc error, kms: permission denied, kms: key not found
  • All payment flows requiring tokenization fail
  • Gateway returns error_code: vault_unavailable to merchants

Diagnosis

# Check KMS key status
gcloud kms keys describe pan-encryption-key \
  --keyring gatelithix-vault \
  --location us-central1 \
  --project gatelithix-pci

# Check KMS key version status
gcloud kms keys versions list \
  --key pan-encryption-key \
  --keyring gatelithix-vault \
  --location us-central1 \
  --project gatelithix-pci

# Verify vault SA has KMS permissions
gcloud kms keys get-iam-policy pan-encryption-key \
  --keyring gatelithix-vault \
  --location us-central1 \
  --project gatelithix-pci

# Check Cloud KMS API is enabled
gcloud services list --project gatelithix-pci | grep cloudkms

Resolution

  1. If KMS key version disabled: Enable the latest key version
  2. If IAM permissions removed: Re-apply roles/cloudkms.cryptoKeyEncrypterDecrypter to vault SA
  3. If KMS API disabled: Re-enable cloudkms.googleapis.com
  4. If key rotation created a new primary version: Ensure application handles decryption with previous key versions
# Re-enable a disabled key version
gcloud kms keys versions enable VERSION_NUMBER \
  --key pan-encryption-key \
  --keyring gatelithix-vault \
  --location us-central1 \
  --project gatelithix-pci

# Grant KMS permissions (if removed)
gcloud kms keys add-iam-policy-binding pan-encryption-key \
  --keyring gatelithix-vault \
  --location us-central1 \
  --project gatelithix-pci \
  --member="serviceAccount:vault-sa@gatelithix-pci.iam.gserviceaccount.com" \
  --role="roles/cloudkms.cryptoKeyEncrypterDecrypter"
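Regarding step 4: KMS ciphertext records which key version produced it, so older tokens can only be decrypted while that version remains ENABLED. A sketch that flags problematic versions, given version:state pairs as gcloud kms keys versions list would report them (the states below are illustrative):

```shell
# Illustrative version states; in practice, derive these from
# `gcloud kms keys versions list --key pan-encryption-key ...`
VERSIONS="1:DISABLED 2:ENABLED 3:ENABLED"

for v in $VERSIONS; do
  num=${v%%:*}; state=${v##*:}
  if [ "$state" != "ENABLED" ]; then
    echo "version $num is $state: ciphertexts it produced cannot be decrypted"
  fi
done
```

Rotation only changes which version encrypts new data; it never makes old versions safe to disable while their ciphertexts are still live.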

Escalation

  • KMS key destroyed (not recoverable): This is a critical data loss event. Engage GCP Support immediately
  • KMS service-wide outage: Check the GCP Status Dashboard (status.cloud.google.com) and open a P1 support ticket
  • HSM hardware failure: GCP handles HSM redundancy automatically; if persisting, escalate to GCP

Scenario 5: Vault Unreachable

Symptoms

  • Gateway returns proxy errors when calling vault service
  • Error logs: vault proxy: connection refused, vault: deadline exceeded
  • Tokenization requests fail with 502/504
  • Health probe shows vault as unhealthy in gateway readiness check

Diagnosis

# Check vault Cloud Run health
gcloud run services describe vault \
  --region us-central1 --project gatelithix-pci \
  --format="value(status.conditions)"

# Check vault Cloud Run logs
gcloud logging read 'resource.type="cloud_run_revision" AND resource.labels.service_name="vault"' \
  --project gatelithix-pci --limit 50

# Verify VPC peering is active
gcloud compute networks peerings list \
  --network core-vpc --project gatelithix-core

# Check firewall rules
gcloud compute firewall-rules describe pci-vpc-allow-core-ingress \
  --project gatelithix-pci

# Verify vault service URL in gateway config
gcloud run services describe api-gateway \
  --region us-central1 --project gatelithix-core \
  --format="value(spec.template.spec.containers[0].env)"

Resolution

  1. If vault Cloud Run instance crashed: Check for OOM or panic in logs, increase memory if needed
  2. If VPC peering is broken: Re-establish peering (terraform apply on core and PCI network modules)
  3. If firewall rule deleted: Re-apply via terraform apply on PCI network module
  4. If vault service URL changed: Update VAULT_URL environment variable in gateway Cloud Run config
  5. If PCI database is down: Follow Scenario 1 diagnosis for PCI Cloud SQL instance
# Restart vault service
gcloud run services update vault \
  --region us-central1 --project gatelithix-pci \
  --update-env-vars="RESTART_TRIGGER=$(date +%s)"

# Look up the vault service URL the gateway should be targeting
gcloud run services describe vault \
  --region us-central1 --project gatelithix-pci \
  --format="value(status.url)"
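For resolution step 4, the check is a straight comparison between the gateway's configured VAULT_URL and the vault service's actual URL. Both values below are illustrative placeholders; in practice they come from the two gcloud run services describe commands in the diagnosis section:

```shell
# VAULT_URL as configured on api-gateway (illustrative)
CONFIGURED="https://vault-old-hash-uc.a.run.app"
# status.url reported by the vault service itself (illustrative)
ACTUAL="https://vault-new-hash-uc.a.run.app"

if [ "$CONFIGURED" != "$ACTUAL" ]; then
  echo "MISMATCH: update VAULT_URL on api-gateway to $ACTUAL"
else
  echo "OK: gateway VAULT_URL matches vault service URL"
fi
```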

Escalation

  • VPC peering failure persisting after terraform apply: GCP networking team investigation
  • Vault service repeatedly crashing: Check for database corruption, KMS issues (see Scenarios 1, 4)
  • Cross-project IAM binding failure: Verify roles/run.invoker binding for gateway SA on vault service