Mastering CUBIC Congestion Control: Debugging a Stuck Congestion Window in QUIC


Overview

This tutorial examines a subtle but critical bug that appeared when a Linux kernel change to the CUBIC congestion control algorithm was ported to quiche, Cloudflare's QUIC implementation. The bug left the congestion window (cwnd) permanently stuck at its minimum value after a congestion collapse, so the connection never recovered. You'll learn the underlying mechanics, how to reproduce the bug step by step, a simple fix, and common pitfalls to avoid.

Source: blog.cloudflare.com

Prerequisites

  • Basic understanding of TCP/IP and QUIC protocols
  • Familiarity with congestion control concepts (cwnd, slow start, congestion avoidance)
  • Access to a Linux environment for testing (optional but recommended)
  • Knowledge of C or Rust (quiche is written in Rust, and the examples use Rust-style pseudo-code)

Step-by-Step Instructions

1. Understanding CUBIC's Core Logic

CUBIC, standardized in RFC 9438, is the default congestion controller in Linux. It manages the cwnd to probe for available bandwidth: increasing cwnd when no loss is detected (probing), and decreasing it when loss occurs (backoff). The algorithm uses a cubic function (hence the name) to grow cwnd after a loss event, aiming for better network utilization.
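As a rough illustration of the cubic growth curve described above, the sketch below evaluates the window function from RFC 9438, W_cubic(t) = C * (t - K)^3 + W_max, with the RFC's constants C = 0.4 and beta_cubic = 0.7. The function names are made up for this example, windows are expressed in packets and time in seconds, and this is not quiche's code.

```rust
/// Constants from RFC 9438.
const C: f64 = 0.4;
const BETA_CUBIC: f64 = 0.7;

/// Window right after a loss event: the multiplicative decrease of `w_max`.
fn cwnd_after_loss(w_max: f64) -> f64 {
    w_max * BETA_CUBIC
}

/// K: seconds the curve needs to climb from `cwnd_epoch` back to `w_max`.
fn cubic_k(w_max: f64, cwnd_epoch: f64) -> f64 {
    ((w_max - cwnd_epoch) / C).cbrt()
}

/// W_cubic(t) = C * (t - K)^3 + W_max, where `t` is seconds since the
/// start of the current congestion avoidance epoch.
fn w_cubic(t: f64, w_max: f64, cwnd_epoch: f64) -> f64 {
    let k = cubic_k(w_max, cwnd_epoch);
    C * (t - k).powi(3) + w_max
}
```

With W_max = 100 packets, the curve starts at about 70 packets, flattens as it approaches 100 at t = K, and then accelerates again to probe for more bandwidth.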

2. The Bug: A Stuck Congestion Window

The bug manifests when the connection experiences heavy loss early, driving cwnd to cwnd_min (typically 2 or 4 packets). Normally, after a loss event, CUBIC should eventually recover and grow cwnd. However, due to an interaction with the app-limited exclusion (RFC 9438 §4.2-12), the cwnd becomes permanently stuck at the minimum. The app-limited rule is designed to prevent premature growth when the application isn't sending enough data, but a logic error causes CUBIC to never exit the recovery state.
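To see how quickly heavy early loss can push cwnd to its floor, the sketch below repeatedly applies CUBIC's multiplicative decrease (beta_cubic = 0.7 in RFC 9438) and clamps the result at cwnd_min. The starting window of 10 packets and the floor of 2 packets are assumptions chosen for illustration, and the function name is hypothetical.

```rust
/// Multiplicative decrease factor from RFC 9438.
const BETA_CUBIC: f64 = 0.7;

/// Counts how many loss events it takes for cwnd (in packets) to fall
/// from `initial_cwnd` to the floor `cwnd_min`.
fn loss_events_to_floor(initial_cwnd: f64, cwnd_min: f64) -> u32 {
    let mut cwnd = initial_cwnd;
    let mut events = 0;
    while cwnd > cwnd_min {
        // Each loss event shrinks the window, but never below the floor.
        cwnd = (cwnd * BETA_CUBIC).max(cwnd_min);
        events += 1;
    }
    events
}
```

Under these assumptions, `loss_events_to_floor(10.0, 2.0)` returns 5: the window goes 10, 7, 4.9, 3.43, 2.401, and is then clamped to 2. A handful of lossy round trips at connection start is enough to reach the state where the bug bites.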

3. Reproducing the Bug

To reproduce, set up a QUIC connection using quiche with CUBIC as the congestion controller. Simulate heavy packet loss (e.g., a 50% loss rate) during the first few round trips, then monitor cwnd over time. Observed behavior with the bug present: cwnd drops to cwnd_min and stays there indefinitely, even after the loss stops.

// Example in the style of a quiche unit test (pseudo-code: the Cubic type
// and its methods are illustrative, not quiche's actual API)
let mut cc = Cubic::default();
cc.on_loss(initial_packet);  // heavy early loss
assert_eq!(cc.cwnd, cwnd_min);
// Simulate many ACK rounds; a healthy controller would grow cwnd here
for _ in 0..1000 {
    cc.on_ack(now());
}
assert_eq!(cc.cwnd, cwnd_min);  // passes on the buggy build: cwnd never increases

4. Root Cause Analysis

The bug stems from porting a Linux kernel patch that aligned CUBIC with the app-limited exclusion. In the Linux TCP stack, the app-limited check sits inside a larger condition that applies only when the connection is not in recovery (recovery being the phase that follows a loss event). quiche's port omitted that guard, so the app-limited exclusion fired even during recovery and prevented CUBIC from ever leaving the minimum cwnd. The affected code is in the cubic_update() function, where the tcp_friendliness adjustments are made.

5. The Fix: A One-Line Change

The fix adds a condition that skips the app-limited check while the connection is still in the recovery phase. In the quiche source, this is a single-line change to cubic.rs:

// Before (buggy):
if app_limited { return; }

// After (fixed):
if app_limited && !self.recovery { return; }

This ensures that during recovery (post-loss), CUBIC continues to grow cwnd even if the application is not fully utilizing the window. Once recovery ends, the original app-limited logic applies.
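The difference between the two conditions can be tried out in a small toy model. Everything in the sketch below is hypothetical: `ToyCubic`, its fields, and the "+1 packet per ACK round" growth rule are simplifications invented for illustration and do not reflect quiche's real data structures. The model starts right after a collapse (cwnd at the floor, recovery still active) and treats every ACK round as app-limited.

```rust
/// Minimum congestion window in packets (assumed value for this toy model).
const CWND_MIN: u32 = 2;

/// Hypothetical toy controller; not quiche's actual types or API.
struct ToyCubic {
    cwnd: u32,
    in_recovery: bool,
    /// `true` models the fixed check, `false` the buggy one.
    guard_recovery: bool,
}

impl ToyCubic {
    /// State right after a collapse: cwnd at the floor, still in recovery.
    fn after_collapse(guard_recovery: bool) -> Self {
        ToyCubic { cwnd: CWND_MIN, in_recovery: true, guard_recovery }
    }

    /// Marks the end of the recovery phase.
    fn exit_recovery(&mut self) {
        self.in_recovery = false;
    }

    /// Processes one round of ACKs. Growth is simplified to +1 packet
    /// per round instead of the real cubic curve.
    fn on_ack_round(&mut self, app_limited: bool) {
        let skip_growth = if self.guard_recovery {
            // Fixed: the app-limited exclusion applies only outside recovery.
            app_limited && !self.in_recovery
        } else {
            // Buggy: the exclusion fires unconditionally.
            app_limited
        };
        if skip_growth {
            return;
        }
        self.cwnd += 1;
    }
}

/// Runs `rounds` app-limited ACK rounds after a collapse and returns cwnd.
fn cwnd_after_rounds(guard_recovery: bool, rounds: u32) -> u32 {
    let mut cc = ToyCubic::after_collapse(guard_recovery);
    for _ in 0..rounds {
        cc.on_ack_round(true);
    }
    cc.cwnd
}
```

In this model, `cwnd_after_rounds(false, 1000)` stays at 2 no matter how many rounds are simulated, while `cwnd_after_rounds(true, 3)` grows to 5. Calling `exit_recovery()` on the fixed variant restores the normal app-limited behavior, so later app-limited rounds no longer grow the window.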

6. Verifying the Fix

Re-run the reproduction test. The cwnd should now start increasing after recovery, eventually leaving the minimum. Use a debug trace to confirm the sequence:

  • Initial loss -> cwnd drops to 2
  • ACKs arrive, cwnd bypasses app-limited check
  • cwnd grows (e.g., 3, 4, 5...)
  • Eventually leaves recovery, cwnd continues normal cubic growth
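If the debug trace is captured as a series of cwnd samples, the sequence above can be checked mechanically. The helper below is a hypothetical sketch (its name and the trace format are assumptions, not part of quiche): it looks for the first sample at or below the floor, then reports whether any later sample rises above it.

```rust
/// Returns `None` if the trace never reaches `cwnd_min`, otherwise
/// `Some(true)` if cwnd later grows above the floor and `Some(false)`
/// if it stays stuck there.
fn recovers_from_floor(trace: &[u32], cwnd_min: u32) -> Option<bool> {
    // Index of the first sample where the window hit the floor.
    let floor_at = trace.iter().position(|&cwnd| cwnd <= cwnd_min)?;
    // Any later sample above the floor counts as recovery.
    Some(trace[floor_at..].iter().any(|&cwnd| cwnd > cwnd_min))
}
```

A trace such as `[10, 7, 4, 2, 2, 3, 4, 5]` yields `Some(true)`, whereas a trace from a build with the bug, such as `[10, 7, 4, 2, 2, 2]`, yields `Some(false)`.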

Common Mistakes

  • Assuming TCP and QUIC congestion control are identical: While RFC 9438 defines CUBIC for TCP, QUIC implementations may have subtle differences (e.g., loss detection, app-limited semantics). Always test both protocols.
  • Neglecting edge cases: The bug only appears under heavy early loss. Many tests skip this scenario. Ensure your test suite includes extreme loss patterns.
  • Overlooking the app-limited condition: App-limited is meant to prevent over-probing, but can interact poorly with recovery logic. Always audit all state transitions.
  • Copying kernel code verbatim: The Linux kernel has intricate dependencies. Porting requires understanding of the surrounding context (e.g., the recovery flag).

Summary

This tutorial walked through a real-world bug where CUBIC's congestion window got stuck at minimum due to a misapplied app-limited exclusion in a QUIC implementation. By understanding the core logic, reproducing the issue, and applying a one-line fix, we prevented permanent throughput collapse. Key takeaways: always verify edge-case behavior, avoid blind code porting, and test recovery paths thoroughly.

For further details, refer to the original post on blog.cloudflare.com or explore the quiche source code.
