Why a Single Edge Server Can Absorb Higher Burst Traffic: What We Actually Optimized in the Kernel Hot Path

Why single-node burst tolerance is not mainly a hardware question, but a hot-path design question, and how we reduced pressure across pacing, handshake gating, async device checks, sharded connection management, and dirty-state convergence inside the edge kernel.


When people talk about the capacity of a single edge server, they usually focus on just two numbers: bandwidth and CPU cores.

Both matter, of course. But they are often not the real dividing line that determines whether one machine can withstand burst traffic.

What usually brings a single machine down is not that average throughput is insufficient. It is that once a burst arrives, a string of small hot-path problems gets amplified at the same time.

  • The handshake path gets blocked by control logic.
  • Rate limiting turns user-space scheduling into a hotspot.
  • Connection registration and eviction depend on global locks or full scans.
  • External validation enters the first-packet path and directly magnifies latency.
  • Every connection performs work that should have been moved earlier or pushed downward.

So whether one machine can absorb higher bursts is not fundamentally a hardware-stacking problem. It is a kernel hot-path design problem.

The goal of our kernel-side work has never been to make a benchmark graph look slightly nicer. The goal is this:

Let burst traffic be absorbed first by the cheapest layer, the layer closest to the kernel, and the layer least dependent on global contention, instead of allowing user-space hot paths to get blown open immediately.

That is the one thing this article is about.

1. The conclusion first: burst tolerance on a single node is not mainly about processing each request faster, but about not letting the hot path kill itself first

A lot of systems look perfectly normal at off-peak times and then show very familiar symptoms as soon as traffic spikes.

  • Handshakes suddenly become slower.
  • New connections begin to jitter noticeably.
  • CPU shoots upward while real throughput does not rise proportionally.
  • Rate limiting is enabled, yet overall service experience actually gets worse.
  • A small amount of abnormal source traffic disturbs the entire scheduling rhythm of the machine.

Behind those symptoms there is usually one common cause:

The system has pushed too much work into user space that never belonged in the first-packet path or the data hot path in the first place.

All of our kernel-side optimization work follows the same set of principles.

  1. Push work down toward the socket side whenever it can be pushed down.
  2. If work must remain in user space, make it as lock-free, sharded, and asynchronous as possible.
  3. If external validation is unavoidable, do not make the first request wait for it when you can avoid doing so.
  4. If state must converge, track only objects that are truly dirty rather than scanning everything meaninglessly.

The improved capacity of a single machine comes from that full hot-path pressure-relief design, not from one isolated tuning parameter.

2. Layer one: push TCP download pacing into the kernel instead of trying to force it with sleep loops in user space

In sustained-transfer scenarios, the download direction is one of the easiest places to get wrong.

A very common implementation keeps a limiter in user space for every connection and approximates the target rate through sleeps and wakeups.

That can look acceptable for a short time. Under high concurrency and burst conditions, however, its weaknesses become structural.

  • Large numbers of connections sleep and wake together, turning scheduling itself into extra overhead.
  • Send rhythm tends to collapse into batch spikes.
  • User-space threads get occupied by rate-limiting work instead of real transfer work.
  • CPU stays busy, but it is busy scheduling rather than actually moving data.

That is why on Linux we prefer to hand the TCP download path to kernel-level pacing first.

Two concrete actions matter in this design.

1. Install the `fq` qdisc on eligible interfaces

At startup, the system installs the `fq` queueing discipline on eligible interfaces: ones that are up and are not loopback. That is not cosmetic configuration. It provides the kernel-side surface needed for later pacing control.

2. Set `SO_MAX_PACING_RATE` on the connection

Where the socket supports this option, the kernel limits the send rate directly at the socket layer. In other words, the real download rhythm no longer depends on a user-space loop repeatedly asking whether it is time to send; it moves closer to the kernel transmit path.

The gains are immediate.

  • Smoother pacing
  • Fewer batch wakeups
  • Lower user-space scheduling burden
  • A better chance of keeping CPU available for logic that truly must remain in user space during peak periods

A lot of people hear this and treat it as just another rate-control tweak. It is more important than that.

When a large burst arrives, the system is no longer forcing all send rhythm to pile up in user space. That directly determines whether the machine can stay stable under peak conditions.

3. Layer two: user-space throttling should not become a second scheduler. It should stay lightweight, lock-free, directional, and debt-bounded

Not every path is a good fit for kernel pacing. Upload traffic, non-TCP paths, and connection objects without the relevant socket capability still need a user-space fallback.

But the hard part here is not building a limiter. It is building one that does not become its own bottleneck under burst conditions.

We deliberately avoided turning user-space throttling into one large central scheduler. Instead, we use a restrained design:

  • Advance a virtual timeline through atomic CAS operations.
  • Do not grant extra burst credit.
  • Enforce a hard upper bound on wait time.
  • Keep upload and download completely separate.

Each of those points matters.

1. Go lock-free to reduce hotspot contention

Traditional throttlers built around heavy locking often turn the rate limiter itself into a hotspot once concurrency rises. We use atomic timeline advancement instead, so individual chunks of data do not have to fight over one global piece of scheduling state.

2. Do not allow hidden large bursts

Many limiters accumulate credit and then release a noticeable burst over a short interval. That may not stand out in average-throughput numbers, but for a single machine trying to survive bursts it transforms a controllable rhythm into harder-to-predict instantaneous peaks.

We intentionally do not take that path.

3. Cap scheduling debt

One of the easiest user-space throttling problems to overlook is that closed connections can keep leaving behind scheduling debt, which then distorts tail behavior. We give wait time an explicit upper bound so old connections do not keep dragging the system into meaningless debt repayment after they are gone.

That may stay invisible at off-peak times, but it becomes very important during real bursts. Burst traffic is exactly when you do not want stale debt hanging around outside the hot path and slowing recovery.

4. Layer three: separate handshake cadence from sustained throughput so the first-packet path is not slowed down by the full rate-limiting system

Many systems make one fundamental mistake: they treat new-connection establishment speed and sustained transfer speed as the same rate-limiting problem.

That creates a particularly bad outcome under burst traffic.

  • The original intention was merely to limit total throughput.
  • But the first-packet path gets slowed down at the same time.
  • What users experience is simple: even connection establishment starts stuttering during peak periods.

We split those two things apart explicitly inside the kernel.

Session establishment goes through an independent handshake gate

The system maintains an independent handshake-cadence gate for each service instance. It is not a simple sleep; it is a fast decision path built on time intervals and CAS.

More importantly, it is not a rigid one-size-fits-all cutoff.

  • There is a strict baseline rate.
  • There is also a controlled small-window explicit burst.

That burst is not there to loosen control casually. It exists to accommodate normal behavior such as first-page fan-out, concurrent asset loads, and other short bursts that belong to legitimate client behavior. But its window and count are explicitly bounded, so it does not spill outward into "whoever hits harder wins."

The significance of this design is substantial.

  • Burst traffic does not immediately blow up the connection-establishment path.
  • Legitimate small fan-out is less likely to be harmed too early.
  • Throughput limiting and handshake gating no longer contaminate one another.

Where many systems think in terms of controlling large traffic, our real question is slightly different:

How do you control large traffic without letting the first-packet path collapse first?

5. Layer four: device limits must not enter the first-packet blocking chain. They should follow a latency-first, asynchronously convergent path

This is one of the layers we think very few teams implement carefully, even though it matters enormously for single-node burst tolerance.

When capabilities like device limits are implemented crudely, the most common model looks like this:

  • A new public IP appears.
  • The runtime synchronously asks the control plane.
  • The first request waits for the answer before it is allowed or denied.

That is functionally correct, but during peak periods it ties first-packet latency directly to an external dependency. Once the control plane is even slightly slow, the edge node begins to stall itself.

We do not take that path.

Instead, this path uses a latency-first asynchronous adjudication model.

1. The first decision prioritizes first-packet latency

When a new public IP appears for the first time, the system does not put the first request behind an external check. It grants a very short grace allow so the connection can pass through the hot path first.

2. Device validation runs asynchronously in the background

The real validation work happens concurrently in the background rather than blocking a request thread.

3. Bursts from the same service and the same IP are deduplicated

If sixteen concurrent requests arrive from the same new IP, the system does not fire sixteen external validations. It performs one asynchronous check. That matters tremendously for bursts because it prevents the platform from generating a secondary flood against its own control plane.

4. Results are written back into short local allow and deny caches

Once the asynchronous result arrives, the node quickly writes an allow or deny entry into local cache. If the final judgment is denial, the system evicts the affected connections precisely by service and IP instead of handling the case ambiguously.

The value of this path is substantial.

  • The first request is not directly frozen by an external interface.
  • Control-plane latency does not get amplified straight into first-packet latency on the edge node.
  • Same-source bursts do not turn into an external-validation storm.
  • The final state still converges back to the correct answer.

This is a textbook example of runtime design built for single-node burst tolerance, not an ordinary feature toggle.

6. Layer five: connection registration and eviction use sharded indexes instead of one global lock and one full scan

Once burst traffic appears, another point that can drag a single machine down very quickly is connection-state management.

A lot of systems lose ground here for the same reasons.

  • All connections hang off one global structure.
  • Registration, deletion, and targeted eviction all contend for the same lock.
  • If the system needs to clean up by user or by IP, it can only scan everything.

That design may look harmless at low concurrency. At high concurrency it becomes a hotspot almost immediately.

Our kernel does not use that model.

1. Connection registration is sharded by user key

The system breaks connection registration into multiple shards instead of pushing everything into one large locked structure. As a result, registration and teardown do not force every concurrent connection to compete for one lock.

2. Maintain an additional `service + IP` index

Beyond the sharded registry, the system maintains a connection index keyed by `service_id + ip`. That is important because many governance actions are fundamentally local actions against one source under one service.

Once that index exists, the system can:

  • Evict connections for one IP under one service precisely.
  • Avoid scanning every connection on the machine.
  • Keep local anomalies local.
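The data-structure shape can be sketched briefly. This is an assumed layout for illustration (shard count, field names, and the `"service|ip"` key format are all hypothetical), showing why a targeted eviction never touches connections outside the affected source.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

const shardCount = 16

// registry sketches the sharded connection table plus the secondary
// service+IP index: registration contends only on one shard lock, and
// eviction by source consults the index instead of scanning.
type registry struct {
	shards [shardCount]struct {
		mu    sync.Mutex
		conns map[uint64]string // conn id -> user key
	}
	idxMu   sync.Mutex
	bySvcIP map[string]map[uint64]string // "service|ip" -> conn id -> user key
}

func newRegistry() *registry {
	r := &registry{bySvcIP: make(map[string]map[uint64]string)}
	for i := range r.shards {
		r.shards[i].conns = make(map[uint64]string)
	}
	return r
}

func shardOf(userKey string) int {
	h := fnv.New32a()
	h.Write([]byte(userKey))
	return int(h.Sum32() % shardCount)
}

func (r *registry) register(id uint64, userKey, svcIP string) {
	s := &r.shards[shardOf(userKey)] // only this shard's lock is taken
	s.mu.Lock()
	s.conns[id] = userKey
	s.mu.Unlock()

	r.idxMu.Lock()
	if r.bySvcIP[svcIP] == nil {
		r.bySvcIP[svcIP] = make(map[uint64]string)
	}
	r.bySvcIP[svcIP][id] = userKey
	r.idxMu.Unlock()
}

// evictSvcIP drops every connection for one service+IP pair. The index
// names the exact victims, so no shard is ever scanned.
func (r *registry) evictSvcIP(svcIP string) int {
	r.idxMu.Lock()
	victims := r.bySvcIP[svcIP]
	delete(r.bySvcIP, svcIP)
	r.idxMu.Unlock()

	for id, userKey := range victims {
		s := &r.shards[shardOf(userKey)]
		s.mu.Lock()
		delete(s.conns, id)
		s.mu.Unlock()
	}
	return len(victims)
}

func main() {
	r := newRegistry()
	r.register(1, "userA", "svc1|1.2.3.4")
	r.register(2, "userB", "svc1|1.2.3.4")
	r.register(3, "userC", "svc1|5.6.7.8")
	fmt.Println("evicted:", r.evictSvcIP("svc1|1.2.3.4")) // evicted: 2
	left := 0
	for i := range r.shards {
		left += len(r.shards[i].conns)
	}
	fmt.Println("remaining:", left) // remaining: 1
}
```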

This sort of data-structure decision is not easy to showcase in marketing language, but it very often determines whether a single machine stays controllable once peak load and abnormal conditions start stacking together.

Because what actually crashes systems is often not slightly more traffic. It is this:

After traffic rises slightly, every governance action suddenly needs a global scan.

We took special care to avoid that.

7. Layer six: track only dirty service state instead of scanning every user on every cycle

Single-node burst tolerance is not only about how to receive load. It is also about how to flush state outward without blowing yourself up in return.

Many systems make statistics, online state, and reporting unnecessarily heavy.

  • Every cycle scans all users.
  • Payloads are rebuilt even when nothing changed.
  • State aggregation becomes a new periodic hotspot.

We do not do that.

Inside the kernel, traffic and online state follow a dirty-mark plus hot-set model.

  • Only service instances that truly changed enter the candidate set for reporting.
  • Objects that did not change are not repeatedly repackaged.
  • After reporting converges, the state is marked clean and removed from the hot set.

That produces one especially important benefit:

At off-peak times the system spends almost no effort on cold objects. At peak times it processes only the objects that actually became hot.

That is a major difference from systems that rescan everything every cycle. During a burst, the last thing you want is a background convergence thread performing meaningless full-volume work at the same time.
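The dirty-mark plus hot-set model reduces to a very small amount of code. The sketch below uses assumed names (`dirtyReporter`, `addBytes`, `flush`) and a plain mutex for clarity; it is a sketch of the technique, not the production structure.

```go
package main

import (
	"fmt"
	"sync"
)

// dirtyReporter: counters update in place, but only services touched
// since the last flush enter the reporting candidate set (the hot set).
type dirtyReporter struct {
	mu    sync.Mutex
	bytes map[string]int64
	dirty map[string]struct{} // the hot set
}

func newDirtyReporter() *dirtyReporter {
	return &dirtyReporter{bytes: make(map[string]int64), dirty: make(map[string]struct{})}
}

func (r *dirtyReporter) addBytes(svc string, n int64) {
	r.mu.Lock()
	r.bytes[svc] += n
	r.dirty[svc] = struct{}{} // mark dirty; cold services are never touched
	r.mu.Unlock()
}

// flush reports only the hot set, then marks everything clean.
func (r *dirtyReporter) flush() map[string]int64 {
	r.mu.Lock()
	out := make(map[string]int64, len(r.dirty))
	for svc := range r.dirty {
		out[svc] = r.bytes[svc]
	}
	r.dirty = make(map[string]struct{}) // converged: hot set emptied
	r.mu.Unlock()
	return out
}

func main() {
	r := newDirtyReporter()
	for _, svc := range []string{"a", "b", "c", "d", "e"} {
		r.addBytes(svc, 1) // initial traffic on five services
	}
	r.flush() // first cycle reports all five
	r.addBytes("b", 10)
	r.addBytes("d", 5)
	fmt.Println(len(r.flush())) // only the two hot services: prints 2
	fmt.Println(len(r.flush())) // nothing changed, nothing repackaged: prints 0
}
```

The cost of each reporting cycle is proportional to what actually changed, not to the total population, which is exactly the property that keeps the background convergence thread quiet during a burst.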

8. Layer seven: self-clean long-running state so the machine does not grow heavier the longer it runs

A lot of systems look good at startup and then become progressively heavier over time. The root cause is often not the workload itself, but the fact that state structures become dirtier and dirtier.

  • Old connection traces are not cleaned up fully.
  • State for inactive users stays resident forever.
  • Temporary caches keep growing.
  • Log-deduplication maps and device-state maps expand without stopping.

We built explicit governance for that into the kernel.

  • If a user is already inactive, has no online references, and has no traffic waiting to be sent, its state object is reclaimed.
  • Short allow and deny IP caches are pruned by time.
  • Rate-limit and device-related log-throttling structures are cleaned on a best-effort basis.

That may be less flashy than talking about higher throughput, but it is critical for single-node pressure handling. A server that really absorbs bursts in production does not only need to survive this minute. It needs to keep a stable hot path after days, weeks, and months.

In other words, single-node capability is not just peak capability. It is also the ability not to degrade over long periods of operation.

9. Why this combination of optimizations lets one machine absorb peaks more steadily

When you look at these design choices together, the picture becomes clear. What we built is not one magical performance switch. It is a layered burst-absorption system.

Layer 1: stop the bleeding at the first packet and the handshake

  • Handshake gating exists independently.
  • Small bursts are explicitly bounded.
  • The first packet is not easily slowed down by bulk-throughput throttling.

Layer 2: hand bulk transfers to the layer closest to the kernel whenever possible

  • TCP download prefers kernel pacing.
  • User space keeps only the necessary fallback throttling.

Layer 3: external dependencies do not block the hot path

  • Device limits begin with latency-first handling.
  • Adjudication happens asynchronously in the background.
  • Same-source bursts are deduplicated.

Layer 4: connection management and governance stay local

  • The connection table is sharded.
  • A local `service + ip` index exists.
  • Local anomalies do not drag down the entire machine.

Layer 5: state convergence touches only the objects that are truly hot

  • Dirty marks
  • Hot sets
  • Self-cleaning and long-term stability

Once these layers work together, a single edge server behaves very differently under burst traffic.

  • It does not explode first in user-space scheduling.
  • It does not explode first while waiting on external validation.
  • It does not explode first in connection governance or state scanning.

It absorbs as much of the burst as possible at the cheapest, most local, and most kernel-adjacent point first.

That is the real reason the pressure tolerance of a single node goes up.

10. Why this is not self-congratulatory micro-optimization, but a real change to the single-node ceiling

Many optimizations sound highly technical while delivering little beyond a better local benchmark. What matters to us is whether the optimization genuinely changes the capacity boundary of the machine.

These optimizations matter because they strike several of the real weak points that burst traffic exposes in single-node systems:

  • User-space scheduling that is too heavy
  • External dependencies blocking the hot path
  • Handshake control and throughput control mixed together
  • Connection governance built around global scanning
  • State convergence rebuilt from scratch every cycle

Once those weaknesses are separated and addressed one by one, the machine is no longer merely surviving with higher CPU usage. Instead:

The same hardware can absorb peaks more calmly, and the same peak is less likely to kill the system by destroying its own hot path first.

That says far more than any generic statement that "we improved performance."

Closing

We believe what truly determines whether a single edge server can absorb higher burst traffic is not one isolated parameter, and not a simple increase in hardware budget, but this instead:

Whether handshake control, pacing, device validation, connection governance, and state convergence have all been turned into burst-friendly kernel-level design paths rather than hot-path liabilities.

That is why we keep investing in the edge kernel.

Because truly mature single-node capability has never meant that average throughput looks decent on paper. It means this:

When the burst really arrives, the system does not get pierced by its own hot path first.
