DC24 observations

ECN

Before the data challenge, our ECN config ('net.ipv4.tcp_ecn') was set to 1 (ECN always enabled, and requested on all outgoing connections).
We saw throughput to SARA fall to very low rates during the period their network link was loaded.

@Alexander Rogovskiy then tried transfers from a test gateway with ECN disabled (set to 0 - never negotiate ECN)
and found the speed increased from a few hundred kbps to 16 MB/s.

@Thomas, Jyothish (STFC,RAL,SC) made a sandbox with those changes and asked @James Adams to review, who suggested using ecn=2 (off by default, enabled when requested by the remote end), which is the Linux kernel default. This change was made to the sandbox.
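
For reference, the three values of the sysctl and how they can be checked and applied at runtime (a sketch - on our gateways the persistent setting is managed through the sandbox config rather than a hand-run command):

    # net.ipv4.tcp_ecn values (from the Linux ip-sysctl documentation):
    #   0 - never request or accept ECN
    #   1 - request ECN on outgoing connections, accept it on incoming ones
    #   2 - do not request ECN, but accept it if the peer asks (kernel default)

    sysctl net.ipv4.tcp_ecn        # show the current value
    sysctl -w net.ipv4.tcp_ecn=2   # apply the kernel-default behaviour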

All gateways were added to this sandbox and recompiled. We saw an increase in throughput as a result, along with an increase in dropped packets. This was thought to be due to insufficient TCP buffer sizes, so @Thomas, Jyothish (STFC,RAL,SC) made further changes increasing the TCP and ring buffers following the ESnet 100G network tuning recommendations (see the sketch below). This was initially rolled out to a subset of gateways on 20 February at 16:15, and that subset stopped dropping packets. The change was left in place overnight to confirm the stability of those gateways, then deployed to the rest of the gateways the next morning. All gateways then stopped dropping packets and maintained near-saturation throughput.
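
The settings below are a sketch of the public ESnet fasterdata guidance for 100G hosts; the exact values deployed on the gateways live in the sandbox config and may differ, and the interface name is illustrative:

    # TCP buffers (sysctl), per ESnet 100G host tuning: allow socket
    # buffers up to ~512 MB so a stream can fill a high
    # bandwidth-delay-product path
    net.core.rmem_max = 536870912
    net.core.wmem_max = 536870912
    net.ipv4.tcp_rmem = 4096 87380 536870912
    net.ipv4.tcp_wmem = 4096 65536 536870912
    net.ipv4.tcp_mtu_probing = 1
    net.core.default_qdisc = fq

    # NIC ring buffers (ethtool): query the hardware maximum, then raise
    # the rx/tx rings towards it (8192 is a common maximum; NIC-dependent)
    ethtool -g eth100                  # show current and maximum ring sizes
    ethtool -G eth100 rx 8192 tx 8192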

Load Balancing

Initial load balancing parameters: 80/20 sysload/cpu weighting, with a 20s reporting interval. The interval was reduced to 3s in preparation, as pre-DC traffic caused the load distribution to be somewhat uneven. When all gateways became loaded, this was switched to a 50/50 sysload/network weighting, with little difference. We tested other metrics, such as ping skew, number of connections and a 60/40 network/cpu split. They all produced a similarly spiky distribution; some maxed out the network card in bursts, but all had a fairly consistent functional check failure rate.

We then switched to a round robin (static load) setup, which provided the most stability in terms of throughput and efficiency. It worked well during the day, but overnight it fell into an old pattern of a couple of individual gateways building up load and memory usage until they hit the systemd memory limit and restarted. The next morning we switched to a round robin with a failsafe, where the reported load is static until the sysload goes above 95% (sketched below). This has resulted in a stable run since. The inadequate balancing under dynamic load may be due to the algorithm implemented in the selection by load, which can lead to inconsistent results.
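
A minimal sketch of the failsafe idea, assuming each gateway runs a script that reports a load figure to the balancer. The script name, the static value and the drain value are illustrative; only the 95% sysload threshold is the one described above:

    #!/bin/bash
    # report-load.sh (hypothetical): report a constant load so the
    # balancer spreads work round-robin, unless the 1-minute sysload
    # (normalised to core count, as a percentage) exceeds 95%, in which
    # case advertise the gateway as fully loaded so it drains.
    STATIC_LOAD=50
    cores=$(nproc)
    load1=$(awk '{print $1}' /proc/loadavg)
    sysload=$(awk -v l="$load1" -v c="$cores" 'BEGIN { printf "%d", 100 * l / c }')

    if [ "$sysload" -gt 95 ]; then
        echo 100            # failsafe: stop attracting new connections
    else
        echo "$STATIC_LOAD" # constant report -> effectively round robin
    fi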

Deletions

Deletions struggled when the gateways were loaded. The switch to round robin load balancing appears to have had a positive effect on the deletion rate, but it also coincided with a gradual decrease in load as the DC drew to a close.

Current Capacity

Removing gw4-7 caused the remaining gateways to start failing functional tests before reaching the maximum throughput levels observed on the 20th. They were re-introduced later that day and stayed on for the remainder of the challenge. This suggests we would be running close to capacity with the current gateway set under DC24-level traffic.


[image-20240222-152438.png] Packet drops: ECN enabled, ECN disabled, TCP and ring buffer tuning.
[image-20240222-152543.png] Switch to standard round robin.