How Cascading Failures Work

This article explains in detail how a single slow dependency can take down an entire system, and what to do about it.

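How a Cascading Failure Unfolds

Typical scenario

1. The external API slows down (normally 100 ms → 5 s)
   ↓
2. Threads waiting on the connection are never released, and the pool saturates
   ↓
3. Every endpoint becomes unresponsive
   ↓
4. All processes go down, leaving incomplete DB transactions behind

→ A single slow dependency takes down the whole system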
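
The arithmetic behind steps 1 to 3 can be sketched with Little's Law: the number of threads in use is roughly the arrival rate multiplied by the time each request holds a thread. The figures below are assumptions chosen for illustration (10 requests per second, a pool of 5 threads), and `threads_needed` is a hypothetical helper, not part of any library.

```ruby
# Little's Law: average requests in flight = arrival rate x time each one holds a thread.
# Latency is given in milliseconds so the example values divide cleanly.
def threads_needed(arrival_rate_per_sec, latency_ms)
  (arrival_rate_per_sec * latency_ms / 1000.0).ceil
end

POOL_SIZE = 5             # assumption: size of the web server's thread pool
ARRIVAL_RATE_PER_SEC = 10 # assumption: one request every 100 ms

normal   = threads_needed(ARRIVAL_RATE_PER_SEC, 100)   # healthy API (100 ms)
degraded = threads_needed(ARRIVAL_RATE_PER_SEC, 5_000) # slow API (5 s)

puts "normal:   #{normal} thread(s) busy, a pool of #{POOL_SIZE} is enough"
puts "degraded: #{degraded} threads needed, a pool of #{POOL_SIZE} saturates"
```

Under these assumptions the same traffic that kept one thread busy now needs fifty, which is why the pool fills up even though the request rate has not changed.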

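Timeline in Practice

Time: 2024-01-01 10:00:00
Situation: the external payment API slows down

10:00:00.000 - External payment API latency rises from the usual 100 ms to 5 s
10:00:00.100 - Request 1 received (acquires thread 1, starts the external API call)
10:00:00.200 - Request 2 received (acquires thread 2, starts the external API call)
10:00:00.300 - Request 3 received (acquires thread 3, starts the external API call)
...
10:00:01.000 - Thread pool saturated (5/5 threads waiting on the external API)
10:00:01.100 - Request 6 received (thread pool full, request waits)
10:00:01.200 - Request 7 received (thread pool full, request waits)
...
10:00:05.000 - External API responds (5 seconds elapsed)
10:00:05.100 - Thread 1 is released but is immediately taken by the next request
10:00:05.200 - Thread 2 is released but is immediately taken by the next request
10:00:05.300 - External API slows down again (it is under heavy load)
10:00:10.000 - All threads are once again saturated waiting on the external API
10:00:10.100 - New requests cannot be processed and timeout errors occur
10:00:10.200 - Error logs are emitted in bulk, putting the logging system under heavy load too
10:00:15.000 - The entire system becomes unresponsive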
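
The way waiting time builds up in a timeline like this can be reproduced with a small deterministic simulation. This is only a sketch under stated assumptions: a request arrives every 100 ms, every call holds its thread for the full 5 s, and the pool has 5 threads. `simulate_pool` is a hypothetical helper written for this example.

```ruby
# Simulates a fixed-size thread pool in which every request holds a thread
# for `latency_ms`. Returns, per request, how long it waited for a free thread.
def simulate_pool(pool_size:, latency_ms:, interval_ms:, requests:)
  free_at = Array.new(pool_size, 0) # time (ms) at which each thread becomes free

  (1..requests).map do |n|
    arrival = n * interval_ms
    idx = free_at.each_with_index.min.last # index of the thread that frees up first
    start = [arrival, free_at[idx]].max    # wait if no thread is free yet
    free_at[idx] = start + latency_ms
    { request: n, arrival_ms: arrival, wait_ms: start - arrival }
  end
end

slow = simulate_pool(pool_size: 5, latency_ms: 5_000, interval_ms: 100, requests: 15)
slow.each do |r|
  puts format('request %2d arrives at %5d ms, waits %5d ms', r[:request], r[:arrival_ms], r[:wait_ms])
end
```

In this model the first five requests start immediately, requests 6 to 10 each wait 4.5 s, and requests 11 to 15 wait 9 s. The queueing delay keeps growing for as long as the dependency stays slow, which matches the shape of the timeline.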

Problematic Code

# ❌ Problematic code: no timeout, no circuit breaker
require 'httparty'
require 'json'

class PaymentService
  def charge_payment(order_id, amount)
    # Problem: no explicit timeout, so Net::HTTP's defaults
    #          (60 seconds each for connect and read) apply
    # Problem: no circuit breaker
    response = HTTParty.post(
      'https://payment-api.example.com/charge',
      body: {
        order_id: order_id,
        amount: amount
      }.to_json,
      headers: { 'Content-Type' => 'application/json' }
    )

    JSON.parse(response.body)
  end
end

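Why this fails:

No timeout: when the external API slows down, each call holds its thread for a long time
No circuit breaker: calls keep going out even while the external API is failing
Thread pool saturation: every thread is waiting on the external API, so no other request can be processed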

Solutions

1. Set a Timeout

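# ✅ Good example: set a timeout
require 'httparty'
require 'json'

# Raised when the payment API does not answer in time.
class PaymentTimeoutError < StandardError; end

class PaymentService
  def charge_payment(order_id, amount)
    response = HTTParty.post(
      'https://payment-api.example.com/charge',
      body: {
        order_id: order_id,
        amount: amount
      }.to_json,
      headers: { 'Content-Type' => 'application/json' },
      timeout: 3 # give up after 3 seconds (applies to both connecting and reading)
    )

    JSON.parse(response.body)
  rescue Net::ReadTimeout, Net::OpenTimeout => e
    # Convert the low-level timeout into a domain error so callers can handle it
    raise PaymentTimeoutError, "Payment API timeout: #{e.message}"
  end
end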
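
The connect and read phases can also be bounded separately, which lets an unreachable host fail faster than a slow response. The sketch below uses only the standard library's `Net::HTTP` so it runs without extra gems; `build_payment_http` and the 1 s / 3 s values are assumptions for illustration. HTTParty accepts `open_timeout` and `read_timeout` options that serve the same purpose.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds a Net::HTTP client with separate connect and read limits.
# No connection is opened until a request is actually sent.
def build_payment_http(url, open_timeout: 1, read_timeout: 3)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == 'https')
  http.open_timeout = open_timeout # seconds allowed for establishing the connection
  http.read_timeout = read_timeout # seconds allowed for each read from the socket
  http
end

http = build_payment_http('https://payment-api.example.com/charge')
puts "connect limit: #{http.open_timeout}s, read limit: #{http.read_timeout}s"
```

Note that `read_timeout` limits each individual read rather than the whole request, so neither value is an overall deadline. A response that trickles in slowly can still hold a thread longer than 3 seconds.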

2. Implement a Circuit Breaker

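# ✅ Good example: a circuit breaker using Circuitbox
# Assumption: Circuitbox 2.x, where `run` raises when the circuit is open
# (on 1.x the raising variant is `run!`).
require 'circuitbox'
require 'httparty'
require 'json'

# Raised when the circuit is open and the call is rejected without being sent.
class PaymentServiceUnavailableError < StandardError; end

class PaymentService
  def initialize
    @circuit = Circuitbox.circuit(
      :payment_api,
      exceptions: [Net::ReadTimeout, Net::OpenTimeout], # errors counted as failures
      time_window: 60,      # seconds over which the error rate is measured
      volume_threshold: 10, # minimum requests in the window before the circuit can open
      error_threshold: 50,  # error rate (%) at which the circuit opens
      sleep_window: 30      # seconds the circuit stays open before a trial request
    )
  end

  def charge_payment(order_id, amount)
    @circuit.run do
      response = HTTParty.post(
        'https://payment-api.example.com/charge',
        body: {
          order_id: order_id,
          amount: amount
        }.to_json,
        headers: { 'Content-Type' => 'application/json' },
        timeout: 3
      )

      JSON.parse(response.body)
    end
  rescue Circuitbox::OpenCircuitError
    # The circuit is open: fail fast instead of waiting on the external API.
    # A timeout inside the block may surface as Circuitbox::ServiceFailureError,
    # which this method does not rescue.
    raise PaymentServiceUnavailableError, 'Payment service is temporarily unavailable'
  end
end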
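
To make the state transitions behind a circuit breaker concrete, here is a deliberately small, hand-written version. It is not Circuitbox's implementation: it counts consecutive failures instead of an error rate over a time window, keeps its state in memory, and is not thread-safe. `SimpleCircuitBreaker` and its parameters are hypothetical names for this sketch, and the clock is injectable so the behavior can be checked without sleeping.

```ruby
# Minimal circuit breaker: closed -> open after repeated failures,
# open -> half-open once the sleep window has passed, half-open -> closed on success.
class SimpleCircuitBreaker
  class OpenError < StandardError; end

  attr_reader :state

  def initialize(failure_threshold: 3, sleep_window: 30,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @failure_threshold = failure_threshold
    @sleep_window = sleep_window # seconds to stay open before allowing a trial call
    @clock = clock
    @state = :closed
    @failures = 0
    @opened_at = nil
  end

  def run
    if @state == :open
      # Fail fast while the sleep window is still running
      raise OpenError, 'circuit is open' if @clock.call - @opened_at < @sleep_window

      @state = :half_open # window elapsed: let one trial call through
    end

    result = yield
    reset
    result
  rescue OpenError
    raise
  rescue StandardError
    record_failure
    raise
  end

  private

  def reset
    @state = :closed
    @failures = 0
  end

  def record_failure
    @failures += 1
    trip if @state == :half_open || @failures >= @failure_threshold
  end

  def trip
    @state = :open
    @opened_at = @clock.call
  end
end
```

While the breaker is open, `run` raises without executing the block, so no thread is spent waiting on the failing dependency. That fail-fast behavior is what keeps the pool from saturating.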

3. Process Asynchronously with Active Job

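# ✅ Good example: process the payment asynchronously with Active Job
class PaymentJob < ApplicationJob
  queue_as :default

  # Run at most 3 times, waiting longer before each retry.
  # On Rails 7.1+, :exponentially_longer is deprecated in favor of :polynomially_longer.
  # Retrying a charge is only safe if the payment API call is idempotent.
  retry_on StandardError, wait: :exponentially_longer, attempts: 3

  def perform(order_id, amount)
    PaymentService.new.charge_payment(order_id, amount)
  end
end

class OrderService
  def create_order(order_data)
    order = Order.create!(order_data)

    # Enqueue the payment for asynchronous processing
    # (the web request thread is not held while the external API is called)
    PaymentJob.perform_later(order.id, order_data[:amount])

    order
  end
end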
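
For a sense of how the retries are spaced, the Rails API documentation describes the `:exponentially_longer` wait as `(executions**4) + jitter + 2` seconds. The sketch below reproduces only the deterministic part of that formula. It leaves out the random jitter that Rails adds, and `base_retry_delay` and `retry_schedule` are hypothetical helpers, not Active Job methods.

```ruby
# Deterministic part of the documented :exponentially_longer delay, in seconds.
# Rails adds random jitter on top, which is omitted here.
def base_retry_delay(executions)
  (executions**4) + 2
end

# With `attempts: n` the job runs at most n times, so there are n - 1 retries.
def retry_schedule(attempts)
  (1...attempts).map { |executions| base_retry_delay(executions) }
end

puts retry_schedule(3).inspect # => [3, 18]
```

With `attempts: 3`, the job is retried roughly 3 seconds after the first failure and roughly 18 seconds after the second. A third failure is not retried and the error propagates to the queue backend.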

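Summary

Key points about how cascading failures work:

Failure flow: external API latency → thread pool saturation → all endpoints unresponsive → whole system down
Countermeasures: timeouts, circuit breakers, asynchronous processing with Active Job
Principle: design every component with this flow in mind, and protect it by bounding time and resources

With the right countermeasures in place, a single slow dependency no longer takes down the whole system.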