Weighted Random Load Balancing

Introduction

This load balancing aims to provide a fast, scalable and order agnostic selection for XRootD Cluster Management System redirects.

The XRootD NodeTab (list of nodes in the cluster) is static and initialized to the maximum allowed size (STMax).

How does this work?

The basic principle of this algorithm is a random weighted selection. An easy way to picture it is as follows:

Imagine a spinning wheel divided in slices. The less loaded gateways will have a larger slice of the pie. Now throw a dart at the board. Whichever slice the dart landed on is going to be selected for the transfer.

 

image-20240729-134418.png

If a node is unavailable or has issues, it has an effective slice size of 0, meaning it will never be selected.

A fuzz value provides a baseline for each slice size, providing some tuning adjustments for a more even distribution. e.g. a fuzz of 20 on the same values above will result in this wheel:

 

image-20240729-134802.png

There’s still a preference for the 50% loaded node, but it’s a more even distribution.

Variables used

sp - selected node
np - current node
np->load - load reported by the node
TotWeight - the current sum of inverse load, adjusted with a fuzz factor for tuning
NodeWeight - array of total weights at the current node

Initialization

  1. TotWeight is set to 0

  2. NodeWeight for the current node is set to 0

  3. sp is set to the first valid node

Looping on each node (including the first sp)

  1. TotWeight is incremented by fuzz + (100 - np->load)

    1. fuzz prevents inverse load being 0 for a gateway at 100 load. this provides even load balancing in cases where every node is at 100 load

    2. higher fuzz values reduce the importance assigned to the load, helping to tune the algorithm.

  2. NodeWeight for the current node is set to the current TotWeight.

 

 

After the loop has gone trough all available nodes

  1. Generate a random number from 1 to the final TotWeight

  2. Select the first node where the value in NodeWeight is greater than the random number

 

In the above example, the final TotWeight is 90. The random number generated is 45. since for the first node, the TotWeight was 20, it is skipped as 20<45. For the second node the TotWeight was 50 and since 50>45, this node is selected and returned.

 

 

 

Code

/******************************************************************************/ /* S e l b y L o a d R */ /******************************************************************************/ // Caller must have the STMutex locked. The returned node, if any, is unlocked. XrdCmsNode *XrdCmsCluster::SelbyLoadR(SMask_t mask, XrdCmsSelector &selR) { static std::random_device rand_dev; static std::default_random_engine generator(rand_dev()); XrdCmsNode *np = nullptr, *sp = nullptr; bool reqSS = (selR.needSpace & XrdCmsNode::allowsSS) != 0; // Scan for a node (preset possible, suspended, overloaded, full, and dead) selR.Reset(); SelTcnt++; int totWeight = 0; for (int i = 0; i <= STHi; ++i) { NodeWeight[i] = 0; // make node unselectable first if (!((np = NodeTab[i]) && (np->NodeMask & mask))) continue; if (!(selR.needNet & np->hasNet)) { selR.xNoNet = true; continue; } selR.nPick++; if (np->isOffline) { selR.xOff = true; continue; } if (np->isBad) { selR.xSusp = true; continue; } if (np->myLoad > Config.MaxLoad) { selR.xOvld = true; continue; } if (selR.needSpace) { if (np->DiskFree < np->DiskMinF || (reqSS && np->isNoStage)) { selR.xFull = true; continue; } } // If node passes filters, give it a weight totWeight += Config.P_fuzz + (100 - np->myLoad); NodeWeight[i] = totWeight; } std::uniform_int_distribution<int> distr(1, totWeight); int selected = distr(generator); for (int i = 0; i <= STHi; ++i) { if (NodeWeight[i] < selected) continue; sp = NodeTab[i]; break; } return sp ? sp : calcDelay(selR); }