Kubernetes Testing Strategy for High-Performance Networking

Deploying a high-performance Kubernetes platform for 5G workloads is only half the job. You also need to validate that the hardware acceleration actually works: that DPDK VFs are reachable, SR-IOV resources are allocatable, and MacVLAN pods get the connectivity they need. This post covers the test manifests and validation strategy for the EIB-Customer platform.

What We're Testing

Three distinct network acceleration paths need validation:

DPDK: userspace packet processing via VFs bound to vfio-pci, exposed as the rancher.io/dpdk resource
SR-IOV Netdevice: kernel-mode VFs, exposed as the intel.com/sriov_netdevice resource
MacVLAN: Layer 2 secondary network via a Multus NetworkAttachmentDefinition (NAD)

Each requires different resource requests, security contexts, and network attachments.

Test 1: DPDK Validation

What to verify

Pod schedules successfully (DPDK VFs available as resources)
Hugepages mounted correctly at /hugepages-1Gi
Network annotation suse-dpdk attaches a secondary interface
Required capabilities granted (NET_ADMIN, NET_RAW, IPC_LOCK)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: dpdk-test
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: dpdk-test
  template:
    metadata:
      labels:
        app: dpdk-test
      annotations:
        k8s.v1.cni.cncf.io/networks: suse-dpdk
    spec:
      containers:
      - name: dpdk
        image: registry.suse.com/bci/bci-busybox:16.0
        command: ["sh", "-c", "echo 'DPDK pod running' && sleep 3600"]
        resources:
          limits:
            rancher.io/dpdk: "1"
            hugepages-1Gi: 2Gi
            memory: 2Gi
          requests:
            rancher.io/dpdk: "1"
            hugepages-1Gi: 2Gi
            memory: 2Gi
        securityContext:
          capabilities:
            add: ["NET_ADMIN", "NET_RAW", "IPC_LOCK"]
        volumeMounts:
        - mountPath: /hugepages-1Gi
          name: hugepages
      volumes:
      - name: hugepages
        emptyDir:
          medium: HugePages-1Gi

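Requests and limits must be identical for hugepages and for extended resources such as rancher.io/dpdk; the API server rejects pod specs where they differ. Matching values alone do not put the pod in the Guaranteed QoS class. That also requires equal CPU requests and limits, which this manifest omits, so the pod runs as Burstable.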
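To see which QoS class the test pods actually land in, a small helper along these lines can be used. This is a sketch, not part of the platform tooling: the function name dpdk_qos is made up for illustration, and it assumes the pods carry the app=dpdk-test label from the manifest above.

```shell
# Print "<pod name> <QoS class>" for every dpdk-test pod.
# The QoS class is read from the pod status field .status.qosClass.
dpdk_qos() {
  kubectl get pods -l app=dpdk-test \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.qosClass}{"\n"}{end}'
}
```

Run dpdk_qos once the Deployment is up. If Guaranteed QoS matters for your workload, add equal CPU requests and limits to the container and check again.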

Validation commands

# Check pod scheduled and running
kubectl get pods -l app=dpdk-test

# Verify hugepages mounted
kubectl exec dpdk-test-xxx -- mount | grep hugepages

# Verify secondary interface attached
kubectl exec dpdk-test-xxx -- ip addr show
# Should show eth0 (Calico) and net1 (DPDK VF)

# Check resource allocation on node
kubectl describe node node1 | grep -A 5 Allocated
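
The dpdk-test-xxx names above are placeholders for the generated pod names. One way to avoid copying names by hand is to resolve them from the Deployment's label and run the same command in every replica. The helper below is a sketch; dpdk_exec_all is a hypothetical name, and it assumes the pods live in the current namespace.

```shell
# Run a command in every pod that carries the app=dpdk-test label.
# "kubectl get -o name" prints "pod/<name>", which kubectl exec accepts as-is.
dpdk_exec_all() {
  for pod in $(kubectl get pods -l app=dpdk-test -o name); do
    echo "== $pod"
    # Stop at the first pod where the command fails.
    kubectl exec "$pod" -- "$@" || return 1
  done
}
```

For example, dpdk_exec_all mount | grep -e '^==' -e hugepages repeats the hugepages check for both replicas, and dpdk_exec_all ip addr show does the same for the interface check.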

Test 2: SR-IOV Netdevice Validation

What to verify

intel.com/sriov_netdevice resource available on nodes
Pod receives a VF with a working kernel network interface
VLAN 538 reachable from the pod

apiVersion: v1
kind: Pod
metadata:
  name: sriov-test
  annotations:
    k8s.v1.cni.cncf.io/networks: suse-sriov-netdevice
spec:
  containers:
  - name: sriov
    image: registry.suse.com/bci/bci-busybox:16.0
    command: ["sleep", "3600"]
    resources:
      limits:
        intel.com/sriov_netdevice: "1"
      requests:
        intel.com/sriov_netdevice: "1"
    securityContext:
      capabilities:
        add: ["NET_ADMIN"]

Validation commands

# Verify SR-IOV resources advertised
kubectl get nodes -o json | \
  jq '.items[].status.allocatable | with_entries(select(.key | contains("sriov")))'

# Expected:
# {
#   "intel.com/sriov_dpdk": "8",
#   "intel.com/sriov_netdevice": "8"
# }

# Verify interface in pod
kubectl exec sriov-test -- ip addr show net1
# Should show IP from 192.168.27.128/27 range
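
Checking by eye that an address falls inside a /27 is easy to get wrong, so the range check can be scripted. The functions below are a sketch in plain POSIX shell arithmetic; ip_to_int and ip_in_cidr are names invented for this post, and they handle IPv4 only.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  # Split the address on dots into $1..$4.
  set -- $1
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Return 0 if address $1 lies inside CIDR block $2, e.g. 192.168.27.128/27.
ip_in_cidr() {
  addr=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  # Netmask with the top $bits bits set and the rest clear.
  mask=$(( (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF ))
  [ $(( addr & mask )) -eq $(( net & mask )) ]
}
```

Pass it the address reported on net1 and the expected range, for example ip_in_cidr "$POD_IP" 192.168.27.128/27 && echo "in range", where POD_IP is whatever you read from the ip addr output. The same function works for the MacVLAN range later in this post.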

Test 3: MacVLAN Validation

apiVersion: v1
kind: Pod
metadata:
  name: macvlan-test
  annotations:
    k8s.v1.cni.cncf.io/networks: macvlan-conf
spec:
  containers:
  - name: macvlan
    image: registry.suse.com/bci/bci-busybox:16.0
    command: ["sleep", "3600"]

# Verify Layer 2 connectivity
kubectl exec macvlan-test -- ip addr show net1
# Expected: IP from 192.168.41.48/28 range

kubectl exec macvlan-test -- ping -c 3 192.168.41.1
# Expected: gateway reachable

Cluster Health Validation

Before running network tests, verify the full cluster stack is healthy:

# All system pods running
kubectl get pods -A --no-headers | grep -v Running | grep -v Completed
# Expected: empty (no non-running pods)

# All Helm charts deployed
kubectl get helmcharts -n kube-system
# Expected: all show DEPLOYED status

# SR-IOV operator ready
kubectl get pods -n sriov-system
kubectl get sriovnetworknodepolicies -n sriov-system

# Longhorn storage healthy
kubectl get pods -n longhorn-system
kubectl get storageclass
# Expected: longhorn (default) StorageClass present
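
If these checks run from a script or a CI job, it helps to turn the pod check into a pass/fail gate. The function below is a sketch of that idea; all_pods_healthy is a hypothetical name. It only looks at the STATUS column, so a pod that is Running but not Ready still passes.

```shell
# Succeed only when every pod in the cluster is Running or Completed.
all_pods_healthy() {
  # grep -v exits non-zero when nothing is left over, hence the "|| true".
  bad=$(kubectl get pods -A --no-headers | grep -v -e Running -e Completed || true)
  if [ -n "$bad" ]; then
    echo "Unhealthy pods:"
    echo "$bad"
    return 1
  fi
  echo "All pods Running or Completed"
}
```

A typical use is all_pods_healthy || exit 1 at the top of the network test script, so the acceleration tests never run against a degraded cluster.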

Troubleshooting: Common Failures

DPDK pod stuck in Pending

# Check if DPDK resources advertised
kubectl describe node | grep rancher.io/dpdk
# If missing: SR-IOV operator may not have bound the VFs

# Check SR-IOV operator logs
kubectl logs -n sriov-system -l app=sriov-network-config-daemon

# Verify VF driver binding on node
cat /sys/class/net/p3p1/device/virtfn8/driver_override
# Expected: vfio-pci
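
To inspect every VF at once rather than one virtfn entry at a time, the sysfs tree can be walked in a loop. The sketch below relies on the standard sysfs layout, where each virtfnN entry has a driver symlink pointing at the bound driver; list_vf_drivers is a made-up name, and the directory is passed in as an argument so it is not tied to p3p1.

```shell
# Print "<virtfnN> <driver>" for every VF under a PF's sysfs device directory.
# Usage: list_vf_drivers /sys/class/net/p3p1/device
list_vf_drivers() {
  for vf in "$1"/virtfn*; do
    # Skip the unexpanded glob when the PF has no VFs at all.
    [ -e "$vf" ] || continue
    if [ -e "$vf/driver" ]; then
      # The driver symlink's last path component is the driver name.
      drv=$(basename "$(readlink "$vf/driver")")
    else
      drv="(unbound)"
    fi
    echo "$(basename "$vf") $drv"
  done
}
```

VFs meant for DPDK should report vfio-pci. An empty listing means no VFs exist on that PF, which points at the VF creation step rather than the driver binding.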

SR-IOV VFs not created

# Check the custom VF creation service
systemctl status sriov-custom-auto-vfs.service

# Manually apply SR-IOV policies
kubectl apply -f /opt/sriov/

# Check VFs on the physical NIC
ip link show p3p1
# Look for "vf 0 ... vf 7 ..." entries

Node fails to join cluster

# Check API VIP reachable from node2
ping 192.168.41.30

# Verify token matches on both nodes
grep token /etc/rancher/rke2/config.yaml

# Check firewall allows required ports
iptables -L INPUT -n | grep -E "6443|9345"

# RKE2 agent logs
journalctl -u rke2-agent -f

Performance Benchmarks

Once functional validation passes, establish baseline performance metrics:

Test	Tool	Target
CPU latency jitter	cyclictest	<10μs max on isolated cores
DPDK throughput	testpmd	95+ Gbps line rate
SR-IOV latency	iperf3 + sockperf	<50μs RTT
Storage I/O	fio	Baseline for NVMe + Longhorn
Memory bandwidth	stream	125 GB/s (NUMA-local)

Document your baselines. Without measured baselines at deployment time, you have no reference point when performance degrades during operation. Run these tests before handing the platform over.
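
One lightweight way to keep those baselines is a tab-separated file that later runs are compared against. The two functions below are a sketch under that assumption; the names record_baseline and within_baseline, the file layout, and the metric names are all invented here. The comparison treats lower values as better, which suits the latency rows; throughput metrics would need the inequality reversed.

```shell
# Append one measurement to a tab-separated baseline file.
# Usage: record_baseline FILE METRIC VALUE UNIT
record_baseline() {
  printf '%s\t%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$2" "$3" "$4" >> "$1"
}

# Succeed if VALUE is at most PCT percent above the latest baseline for METRIC.
# Exit status: 0 within tolerance, 1 regression, 2 no baseline recorded.
# Usage: within_baseline FILE METRIC VALUE PCT
within_baseline() {
  awk -F '\t' -v m="$2" -v v="$3" -v p="$4" '
    $2 == m { base = $3; found = 1 }   # keep the most recent matching row
    END {
      if (!found) exit 2
      exit (v <= base * (1 + p / 100)) ? 0 : 1
    }' "$1"
}
```

At handover you would call record_baseline once per row of the table with the value you measured. During operation, within_baseline baselines.tsv cyclictest_max "$NEW_MAX" 10 then flags any run more than 10% worse than the recorded figure. The file name, metric name, and variable in that example are placeholders.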

Litmus Chaos Engineering

The platform ships with Litmus Chaos Operator v3.10.0 pre-installed. Define chaos experiments to validate resilience:

Node restart — does the cluster recover automatically?
Network partition — does the API VIP fail over correctly?
Pod kill — do Kubernetes restarts converge correctly?
Storage disruption — does Longhorn replication heal?

Litmus is installed but no experiments were defined at handover — this is a gap worth addressing before production traffic lands on the platform.
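
As a starting point for closing that gap, a pod kill experiment against the dpdk-test Deployment from Test 1 might look like the sketch below. It wraps a ChaosEngine manifest in a shell function so it can sit next to the other validation commands. Several things here are assumptions rather than facts about this platform: the function name, the engine name, the litmus-admin service account, the duration values, and that the pod-delete ChaosExperiment is already installed in the default namespace. Litmus 3.x setups often drive experiments through ChaosCenter instead, so treat this as one possible shape.

```shell
# Submit a pod-delete ChaosEngine aimed at the dpdk-test Deployment.
run_pod_delete_chaos() {
  kubectl apply -f - <<'EOF'
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: dpdk-test-pod-delete        # hypothetical name
  namespace: default
spec:
  engineState: active
  appinfo:
    appns: default
    applabel: app=dpdk-test
    appkind: deployment
  chaosServiceAccount: litmus-admin # assumption: adjust to your RBAC setup
  experiments:
  - name: pod-delete
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION  # seconds of chaos in total
          value: "60"
        - name: CHAOS_INTERVAL        # seconds between pod kills
          value: "10"
        - name: FORCE
          value: "false"
EOF
}
```

After calling run_pod_delete_chaos, watching kubectl get pods -l app=dpdk-test shows whether the Deployment returns to two running replicas, which answers the pod kill question above.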