At the core of the Edge Gateway service lives a request queue that remained relatively unchanged through the majority its existence. It was stable, performant and resourceful, but ultimately ill-suited to the ultimate goal; to distribute work to devices prioritised by their absolute real-time relative performance grade.
Centralised load balancing
As those of you who have ever had to horizontally scale a services deployment with a typical cloud hosting provider will know, the mechanics are fairly simple. You create a replicable ‘image’, set some limits, and sit back and watch y̵o̵u̵r̵ ̵b̵a̵n̵k̵ ̵b̵a̵l̵a̵n̵c̵e̵ as server-after-server spawn in the wake of a traffic increase or sudden surge.
A lightweight request proxy service known as a load balancer evenly distributes the requests amongst the cohort of servers, doing all it can to spread the load, whilst simultaneously running regular albeit rudimentary health checks on each device.
The approach makes a couple of assumptions. Firstly that all requests require equal or similar resources, and the secondly that all servers are of an identical specification. With dedicated devices these are relatively harmless prerequisites however on a decentralised network where computational power, availability and connectivity are unpredictable, distributing fairly is a fresh and interesting challenge.
The Gateway queue
When a device establishes a connection with the nearest Gateway it opens a bi-directional gRPC stream which acts as a subscription to the Gateways request queue. As requests enter the queue they are assigned to a device based on the order in which the device connected and the Gateway cycles through this devices list ad nauseam. If two devices were connected, each would receive every other request.
In its very first deployment we noticed that under relatively heavy load a single underperforming device could significantly degrade the experience for the end user simply by taking longer than average to respond. With CDN this would typically manifest itself as one or more slowly loading images on an otherwise rapid page load. The more images on a single page, the more frequent obvious the delay.
To circumvent this degraded performance we upgraded the queue to re-issue slow requests to the queue to allow a larger share of devices to asynchronously attempt it. This significantly reduced the possibility of a degraded performance, but did increase the number of roundtrips to the image origin.
Device performance grade
Deciding how to distribute requests based on device performance has been a key goal for a while, and cracking this would be one of the most valuable breakthroughs for the Edge network.
It’s entirely possible to grading a device based on its hardware spec, however until a request has been resolved its impossible to know what resources will be utilised. As a request terminates at the Gateway, there is no meaningful information about the request, and therefore no way of prioritising specific devices based on a perceived performance grade.
Adding to this uncertainty there are further complications during device appraisal in respect to the operational costs attached to each request. For example resizing a large source image may have a low CPU load whilst seriously impacting the systems memory and network bandwidth. A small source image requiring a colour or quality regrade may require little to no bandwidth, but have high CPU processing time.
There are other contributing factors that shape overall performance, therefore we must accept that spare computational capacity will be impacted by unpredictable external factors. The device owner may be running other applications, potentially consuming a significant proportion of resources. There’s a multitudinous array of impacting factors such as the inescapable reality that a home network may have any number of concurrent users, removes any hope of establishing a reliable and sustained device grade.
The law of probability
We started the research process making some assumptions that would later be challenged. The most important assertion I personally made relates to distribution and probability and it started with a simple example.
Let’s imagine there are just two devices on the network, one highly performant, and another with far lower resources. If there were just two requests made, there would be an equal number of requests per device, and should one of these requests be more complex by order of magnitude, we would certainly expect to see skewed results.
Should the complex request by resolved by the powerful device, and the far simpler request be resolved by the low powered of the two, we might observe a similar response time. In contrast, should the simplest of the two request be resolved by the more powerful device whilst the complex request burdens the low powered device, an observer may assume a far larger disparity between the two devices. It was this assessment that originally turned us away from the idea of grading based on response time alone.
What we overlooked here was a rather simple fact; a small sample can produce inaccurate results, and as the sample size is expanded, the edge cases lose their significance. Whilst two requests to two devices is certainly going to give skewed results, thousands of requests to the same two devices will ultimately balance out, with the probability that deviations will settle at an average. This is the basis for the law of probability. Given time and an equal opportunity, all devices will have incurred a cumulative load.
Realtime impact score
With no discriminatory factors at play, it became possible to evaluate each device based purely on average response time. We started off by building a proof of concept heap that reordered devices based on their average response time on a per-request basis. Using this method, Gateway could assign requests to connected devices based on a real time score, updating the score at the point it received a successful response. The score is directly based on response time, so in this case lower means better.
With a low volume of requests this method proved sufficient but our higher load benchmarking revealed a weakness in the design. Whilst a device with the lower score is clearly more likely to respond faster, a high volume of concurrent requests risk flooding the device, causing it to respond slowly to each request, potentially causing a crash. In order to avoid this binary traffic assignment we needed a way to make sure other connected devices handled a proportionate share of concurrent requests.
The solution was to develop a multiplier to the impact score of the device that took into account the number of in-flight requests the device was processing. The equation is simple. Take the current score (x) and multiply it by the number of concurrent requests (n) + 1. The equation is 𝑥 ( 𝑛 + 1 ). At the point a high scoring device receives a request, it’s impact score is immediately multiplied by the number of concurrent requests, and only when it has responded to each request does this multiplier reduce. Here’s an example.
Device A has an average response time of 150ms so its score at rest is 150
Device B has an average response time of 400 so its score at rest is 400
When the first request hits Gateway it’s assigned to Device A. Now Device A has 1 in-flight request, so applying our equation, its real time score is calculated as 150(1+1) = 300.
If a second concurrent request is enqueued before Device A has resolved the first, the score becomes 150(2+1) = 450.
At this point, whilst Device A may have the lowest score at rest, it now temporarily has a score of 450, every so slightly higher than Device B’s score of 400.
This trickle-down scoring guarantees a well proportioned distribution of requests with immediate consideration for the impact of concurrency.
After 5 weeks of rigorous analysis on testnet the new load balanced priority queue was deployed live a few days ago. We’re still gathering a full report on its long term impact on response time, but the immediate performance improvements have been huge. This is a promising step towards fully contribution-based earnings which in turn will autonomise the process by which network contributors can evaluate and upgrade their device portfolios. We hope that eventually the network can spot hardware faults and localised network issues simply by analysing real time impact scores.
Written by Arthur Mingard. Arthur is the CTO of Edge Network Technologies.