CPU Isolation and Real-Time Performance Tuning

5G vRAN workloads running L1 PHY signal processing have timing budgets measured in microseconds. A single OS timer interrupt or RCU callback on the wrong core can cause a missed deadline, resulting in dropped radio frames. This post covers the complete CPU isolation and real-time tuning stack used in the EIB-Customer platform.

The Problem: OS Noise

In a default Linux system, every CPU core is subject to:

Periodic timer ticks (CONFIG_HZ, typically 250–1000/s)
RCU (Read-Copy-Update) callbacks: deferred cleanup work for the kernel's lock-free data structures
Hardware interrupt delivery
Kernel thread migrations
Scheduler load balancing

Each of these introduces latency jitter. For a 5G DU, jitter exceeding ~10μs on processing cores is unacceptable. The solution: partition the CPUs into housekeeping and isolated sets.

CPU Partitioning: 64 Cores, Two Roles

NUMA Node 0 (Cores 0–31)      NUMA Node 1 (Cores 32–63)
┌────────────────────┐         ┌────────────────────┐
│  Core 0            │         │  Core 32            │
│  Housekeeping      │         │  Housekeeping       │
│                    │         │                     │
│  Cores 1–30        │         │  Cores 33–62        │
│  ISOLATED          │         │  ISOLATED           │
│  (Workloads)       │         │  (Workloads)        │
│                    │         │                     │
│  Core 31           │         │  Core 63            │
│  Housekeeping      │         │  Housekeeping       │
└────────────────────┘         └────────────────────┘

Housekeeping cores (0, 31, 32, 63): Handle all OS activity — kernel threads, system services, Kubernetes system pods, interrupt handling.

Isolated cores (1–30, 33–62): Reserved exclusively for workload pods — DPDK PMDs, 5G RAN functions, real-time applications. 60 of 64 cores are isolated.
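The core counts above are easy to get wrong when the lists change, so it can help to compute them. The following is a minimal sketch, not part of the platform tooling: expand_cpulist and count_cpus are hypothetical helper names, and the lists are the ones used in this post.

```shell
#!/bin/sh
# Expand a kernel-style CPU list such as "1-30,33-62" into one CPU id per line.
expand_cpulist() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS=- read -r lo hi; do
        [ -n "$lo" ] || continue                # skip empty fields
        i=$lo
        while [ "$i" -le "${hi:-$lo}" ]; do     # single ids have no "hi" part
            printf '%s\n' "$i"
            i=$((i + 1))
        done
    done
}

# Count the CPUs named by a list.
count_cpus() {
    expand_cpulist "$1" | wc -l | tr -d ' '
}

count_cpus "1-30,33-62"   # isolated set from this post: prints 60
count_cpus "0,31,32,63"   # housekeeping set: prints 4
```

The same list syntax is what the kernel parameters and the sysfs CPU files use, so the helper can be pointed at either.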

Kernel Boot Parameters

CPU isolation is configured in the kernel command line, injected via edge-cluster.yaml:

isolcpus=domain,nohz,managed_irq,1-30,33-62
nohz_full=1-30,33-62
rcu_nocbs=1-30,33-62
irqaffinity=0,31,32,63

isolcpus

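Removes cores from the general scheduler pool. The domain flag removes them from the scheduler's load-balancing domains. nohz enables per-core tickless mode. managed_irq steers managed interrupts away from isolated cores on a best-effort basis: it only helps when a device queue's interrupt mask also contains a housekeeping core.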
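One way to confirm the parameter took effect is to compare what was requested on the command line with what the kernel reports in /sys/devices/system/cpu/isolated. This is a sketch under assumptions: cmdline_value and isolcpus_cpulist are hypothetical helper names, and the flag-stripping pattern assumes flags are lowercase words placed before the CPU list.

```shell
#!/bin/sh
# Print the value of kernel parameter $1 from a command-line string.
# $2 is optional and defaults to the running kernel's /proc/cmdline.
cmdline_value() {
    printf '%s\n' "${2:-$(cat /proc/cmdline)}" | tr ' ' '\n' | sed -n "s/^$1=//p"
}

# Strip the leading flag words (domain, nohz, managed_irq) from an isolcpus
# value, leaving only the CPU list.
isolcpus_cpulist() {
    printf '%s\n' "$1" | sed 's/^[a-z_,]*,//'
}

# Usage on a live node (commented out so the sketch has no side effects):
#   want=$(isolcpus_cpulist "$(cmdline_value isolcpus)")
#   have=$(cat /sys/devices/system/cpu/isolated)
#   [ "$want" = "$have" ] && echo "isolcpus applied: $have"
```

If the two values differ, the most likely causes are a typo in the command line or a boot entry that was not regenerated.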

nohz_full

Enables adaptive ticks — the timer interrupt is suppressed on isolated cores when only one runnable task is present. Eliminates the most common source of OS noise on application cores.
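Whether the tick is really suppressed can be checked by sampling the local timer interrupt (LOC) counter for an isolated core twice while a single busy task runs on it. The helper below is a hypothetical sketch; it assumes all CPUs are online, so that CPU n is column n+2 of the LOC row in /proc/interrupts.

```shell
#!/bin/sh
# Print the local timer interrupt (LOC) count for one CPU.
# $1 = CPU number, $2 = file in /proc/interrupts format (default: /proc/interrupts).
loc_ticks() {
    awk -v col="$(( $1 + 2 ))" '$1 == "LOC:" { print $col }' "${2:-/proc/interrupts}"
}

# Usage on a live node (commented out so the sketch has no side effects):
#   before=$(loc_ticks 1); sleep 10; after=$(loc_ticks 1)
#   echo "CPU1 timer interrupts in 10 s: $((after - before))"
```

On a core where adaptive ticks are working, the delta should be far below CONFIG_HZ multiplied by the sampling interval. A delta close to that product suggests more than one runnable task on the core.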

rcu_nocbs

Offloads RCU callbacks from isolated cores to dedicated RCU threads running on housekeeping cores. Without this, RCU work can interrupt DPDK poll loops at unpredictable intervals.

irqaffinity

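Sets the default affinity mask for hardware interrupts to the housekeeping cores. NIC and other device interrupts then land on cores 0, 31, 32, and 63 by default. This is a default rather than a guarantee: a driver or a tool such as irqbalance can still change an interrupt's affinity later, so it is worth auditing on a running system.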
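A quick audit is to list every IRQ whose configured affinity still includes an isolated core. The sketch below reads the smp_affinity_list files under /proc/irq; the function names are hypothetical, and the check looks at the configured mask, not at where interrupts were actually delivered.

```shell
#!/bin/sh
# Expand a CPU list such as "1-30,33-62" into one CPU id per line.
expand_cpulist() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS=- read -r lo hi; do
        [ -n "$lo" ] || continue
        i=$lo
        while [ "$i" -le "${hi:-$lo}" ]; do
            printf '%s\n' "$i"
            i=$((i + 1))
        done
    done
}

# Succeeds if CPU list $1 shares at least one CPU with CPU list $2.
lists_overlap() {
    a=$(expand_cpulist "$1")
    b=$(expand_cpulist "$2")
    for cpu in $a; do
        for other in $b; do
            [ "$cpu" = "$other" ] && return 0
        done
    done
    return 1
}

# Print the IRQ numbers whose configured affinity touches the isolated set.
# $1 = isolated CPU list, $2 = IRQ directory (default: /proc/irq).
irqs_on_isolated() {
    for f in "${2:-/proc/irq}"/*/smp_affinity_list; do
        [ -r "$f" ] || continue
        if lists_overlap "$(cat "$f")" "$1"; then
            irq=${f%/smp_affinity_list}
            printf '%s\n' "${irq##*/}"
        fi
    done
}

# Usage on a live node: irqs_on_isolated "1-30,33-62"
```

An empty result means no IRQ is currently allowed onto the isolated cores. Any IRQ numbers printed can be looked up in /proc/interrupts to find the owning device.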

Kubernetes CPU Management

Kernel isolation alone isn't sufficient — Kubernetes must also respect core boundaries. Two kubelet policies enforce this:

# /etc/kubelet-conf/topologymgr.conf
cpuManagerPolicy: static
topologyManagerPolicy: single-numa-node

Static CPU Manager Policy

Pods in the Guaranteed QoS class that request whole-number CPU counts receive exclusive CPU allocation. The kubelet pins each such container to specific CPUs through its cpuset, and that assignment stays fixed for the life of the container.
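The two conditions (requests equal to limits, and a whole-number CPU count) are easy to miss in a pod spec. The following sketch encodes them for a single container. gets_exclusive_cpus is a hypothetical helper, and it is deliberately simplified: it compares the values as strings, so "2" and "2000m" are treated as different, and it does not check the other containers in the pod.

```shell
#!/bin/sh
# Succeeds if a container's resources would qualify for exclusive cores under
# the static CPU manager policy: requests equal limits, and CPU is a whole number.
# Args: cpu_request cpu_limit mem_request mem_limit (strings as in a pod spec).
gets_exclusive_cpus() {
    [ "$1" = "$2" ] || return 1          # CPU request must equal CPU limit
    [ "$3" = "$4" ] || return 1          # memory request must equal memory limit
    case $1 in
        ''|*[!0-9]*) return 1 ;;         # "500m", "1.5" and "" are not whole cores
        0) return 1 ;;                   # zero CPUs is not an exclusive allocation
    esac
    return 0
}

gets_exclusive_cpus 4 4 8Gi 8Gi && echo "4 CPUs: exclusive cores"
gets_exclusive_cpus 500m 500m 1Gi 1Gi || echo "500m: shared pool"
```

On a running node, the result can be cross-checked by reading the Cpus_allowed_list field in /proc/<pid>/status for a process inside the container.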

Single-NUMA-Node Topology Policy

Ensures that CPUs, memory, and devices (SR-IOV NICs, hugepages) allocated to a pod all come from the same NUMA node. Cross-NUMA memory access adds roughly 100ns of latency per access, which is unacceptable for real-time workloads.

TuneD: cpu-partitioning Profile

The cpu-partitioning.service systemd unit applies the TuneD cpu-partitioning profile at boot. This consolidates and reinforces the kernel boot parameters and adds:

CPU governor set to performance on all cores
C-states disabled (no CPU idle states)
P-state tuning for maximum frequency
Additional kernel parameters for real-time scheduling

# Verify active profile
tuned-adm active
# Expected: Current active profile: cpu-partitioning

# Verify CPU governor
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Expected: all lines read "performance"

Huge Pages

DPDK requires huge pages for its memory allocator. Each huge page reduces TLB (Translation Lookaside Buffer) pressure by mapping 1GB instead of 4KB per entry — critical when the DPDK application is constantly accessing large packet buffers.

# In edge-cluster.yaml kernel cmdline:
hugepagesz=1G hugepages=40

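# Verify at runtime. /proc/meminfo counts only the default huge page size, so this
# assumes 1G is the default (default_hugepagesz=1G); otherwise read
# /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages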
grep HugePages /proc/meminfo
# HugePages_Total: 40
# HugePages_Free: 38  (2 allocated to running DPDK pods)

40GB of 1GB huge pages represents a significant memory commitment on a 512GB system (~8%). This is intentional — DPDK applications require contiguous physical memory regions that cannot be swapped.
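The percentage is simple arithmetic, and a small helper makes it easy to redo for a different node size. hugepage_share is a hypothetical name, and the figures are the ones from this post.

```shell
#!/bin/sh
# Percentage of system memory reserved by huge pages, to one decimal place.
# Args: page_count page_size_gib total_memory_gib
hugepage_share() {
    awk -v n="$1" -v sz="$2" -v total="$3" \
        'BEGIN { printf "%.1f\n", 100 * n * sz / total }'
}

hugepage_share 40 1 512   # prints 7.8
```

Because the pods are NUMA-aligned, the per-node split matters as much as the total. It can be read from /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages and the matching node1 path.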

Real-Time Kernel (PREEMPT_RT)

The base OS is SLE Micro 6.1 RT — built with the PREEMPT_RT patchset. Key differences from a standard kernel:

Most kernel spinlocks become sleeping locks built on rt_mutex, so the code holding them can be preempted
Interrupt handlers run in preemptible thread context
High-resolution timers for sub-millisecond precision
Priority inheritance on those locks, which bounds priority inversion

This gives worst-case latency bounds rather than just good average-case performance.
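It is worth confirming that a node actually booted the RT kernel, since a fallback boot entry would silently undo the latency guarantees. A minimal sketch, assuming the kernel advertises PREEMPT_RT (or the older "PREEMPT RT" spelling) in the version string printed by uname -v; is_rt_version is a hypothetical helper name.

```shell
#!/bin/sh
# Succeeds if a kernel version string (as printed by `uname -v`) advertises
# a PREEMPT_RT build. Older RT kernels spell it "PREEMPT RT".
is_rt_version() {
    case $1 in
        *PREEMPT_RT*|*"PREEMPT RT"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on a live node (commented out so the sketch has no side effects):
#   is_rt_version "$(uname -v)" && echo "RT kernel" || echo "not an RT kernel"
```

RT kernels that expose /sys/kernel/realtime offer a second check: the file reads 1 when the RT kernel is running.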

Performance Optimisation Summary

Component	Optimisation	Effect
Kernel	PREEMPT_RT variant	Deterministic scheduling, <1ms worst-case
CPU Governor	performance	Max frequency, no power saving
CPU Isolation	60 cores isolated	Exclusive allocation to workloads
NUMA Policy	single-numa-node	Enforce memory locality
Huge Pages	40GB (1GB pages)	Reduce TLB misses by 90%+
nohz_full	Isolated cores tickless	Eliminate timer interrupts
rcu_nocbs	RCU offloaded	Remove RCU jitter from app cores
irqaffinity	Pinned to cores 0,31,32,63	Predictable interrupt handling

Measuring Latency

# cyclictest — measure timer latency on isolated cores
taskset -c 1 cyclictest -p 99 -m -n -i 200 -D 10m

# Expected on a well-tuned system:
# T: 0 ( pid) I:200 C:3000000 Min:    2 Act:    3 Avg:    3 Max:    8
# Max latency < 10μs is the target for DPDK workloads

Target metrics: CPU jitter <10μs on isolated cores, verified by cyclictest running for 10+ minutes with no single spike exceeding the threshold.
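For automated acceptance checks, the Max column can be extracted from the cyclictest output and compared with the threshold. The sketch below assumes the per-thread summary format shown above, with "Max:" followed by a value in microseconds. Both function names are hypothetical.

```shell
#!/bin/sh
# Print the largest "Max:" value found in cyclictest output read from stdin.
# Prints 0 if no "Max:" field is present, so check that the input is not empty.
max_latency_us() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "Max:" && $(i + 1) + 0 > max) max = $(i + 1) + 0
    } END { print max + 0 }'
}

# Succeeds if the worst-case latency on stdin is below the threshold in $1 (us).
within_budget() {
    [ "$(max_latency_us)" -lt "$1" ]
}

# Usage, assuming the cyclictest output was saved to a file named cyclictest.log:
#   within_budget 10 < cyclictest.log && echo PASS || echo FAIL
```

Taking the maximum across all reported threads matters because a single spike on any isolated core is enough to miss a deadline.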

This tuning stack — kernel RT, CPU isolation, NUMA topology management, and huge pages — is what makes this platform suitable for 5G vRAN rather than just general-purpose workloads.