Ditch your pets, be free with the flock!

Toby O'Rourke
Engineering at FundApps
5 min read · May 12, 2022


Or, how we’ve gone from 20 “pet” boxes to somewhere in the order of 20,000 containers and have fewer sleepless nights as a result.

FundApps was founded in 2010 and, until recently, our technology was rooted in its time. Yeah. There’s a monolith.

Our Platform

Shareholding disclosure is a complex domain. We cover more than 100 jurisdictions and pull data from more than 30 sources. At the heart of what we do is a bespoke rule engine. It runs a batch process that follows a workflow of validation, planning and calculation to analyse a client dataset. This typically happens once per day, per client.

That analysis comprises more than 500 rules and our largest clients supply tens of millions of individual assets, daily. They can have tens of thousands of individual portfolios, each of which must be analysed against those 500 rules and may contain any number of those millions of assets.

It’s a huge computational problem and it was all happening on a single node, because the engine architecture relied on a shared memory model.

You can probably work out where this is going. A small farm of relatively large EC2 instances. These instances mainly hang around waiting for work. Before we re-platformed, overall utilisation — optimistically — stood somewhere around 20%.

Yet we still saw issues. A regular patch management dance that necessitated downtime and weekend work. Drifting configuration. Noisy neighbour events: client workloads would sometimes behave unpredictably, consuming excess resources and crowding out concurrent runs.

Toil. Incidents. Waste. Clearly, something needed to be done.

Modernising a Monolith with Serverless

We’ve made a big bet on Serverless and heavily managed services. I’ve spoken before about the importance of not doing undifferentiated work. In a business like ours, any time not spent making an ever more compelling proposition to our clients and prospects is waste.

We wanted to minimise the amount of old-school “sysadmin” work. We wanted to maximise the value add of every engineer’s effort. We went all in on a 12-factor approach. Serverless is a great fit for batch workloads: if you’re not working, you’re not paying.

We decided that each tenant would reside in a separate AWS account under our organisation. Total segregation of data at rest and in flight, which makes InfoSec folks happy. And a full set of rate limits per tenant, which makes our most demanding workloads a breeze.
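The post doesn’t spell out how that account boundary is crossed operationally, but the usual pattern (and it’s an assumption here, right down to the role name and account id) is a deployment role in each tenant account that tooling in a central account can assume. A minimal boto3 sketch:

```python
import boto3

# Hypothetical account id and role name, for illustration only.
TENANT_ACCOUNT_ID = "111122223333"
DEPLOY_ROLE = f"arn:aws:iam::{TENANT_ACCOUNT_ID}:role/TenantDeployRole"

sts = boto3.client("sts")

# Assume a role scoped to a single tenant account; every call made with
# these credentials is confined to that tenant's data and rate limits.
creds = sts.assume_role(
    RoleArn=DEPLOY_ROLE,
    RoleSessionName="tenant-deploy",
)["Credentials"]

tenant_session = boto3.Session(
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

# Clients built from this session operate inside the tenant account only.
ecs = tenant_session.client("ecs")
```

One role assumption per tenant keeps the blast radius of any single credential, and any single noisy workload, to exactly one account.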

The new FundApps Rule Engine Architecture

We make heavy use of Step Functions to orchestrate the workflow, which hasn’t changed that much. Each step is actually a child step function. They all start off by sending a bunch of messages via SQS, which get consumed by ECS tasks running on Fargate. DynamoDB and ElastiCache sit around the sides to help us break that in-memory dependency. S3 is our persistence layer: S3 Select is a great product. Using the most managed services. Avoiding the undifferentiated stuff. Spending maximum time adding value.
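To make that shape concrete, here’s a minimal sketch of the fan-out, not our production code: a step drops one message per portfolio onto SQS, and a Fargate worker pulls just the rows it needs from S3 with S3 Select. The queue URL, bucket, key and CSV layout are all invented for illustration.

```python
import json
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

# Hypothetical names, for illustration only.
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/calc-work-queue"
BUCKET = "tenant-dataset-bucket"
KEY = "uploads/assets.csv"


def fan_out(portfolio_ids):
    """A step fans work out by sending one message per portfolio (in batches of ten)."""
    for i in range(0, len(portfolio_ids), 10):
        batch = portfolio_ids[i:i + 10]
        sqs.send_message_batch(
            QueueUrl=QUEUE_URL,
            Entries=[
                {"Id": str(n), "MessageBody": json.dumps({"portfolio_id": pid})}
                for n, pid in enumerate(batch)
            ],
        )


def load_portfolio_assets(portfolio_id):
    """A Fargate worker pulls only the rows it needs using S3 Select."""
    resp = s3.select_object_content(
        Bucket=BUCKET,
        Key=KEY,
        ExpressionType="SQL",
        Expression=f"SELECT * FROM S3Object s WHERE s.portfolio_id = '{portfolio_id}'",
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"JSON": {}},
    )
    rows = []
    for event in resp["Payload"]:
        if "Records" in event:
            rows.append(event["Records"]["Payload"].decode("utf-8"))
    return rows
```

The point of the Select call is that a worker never has to drag the whole multi-million-row dataset into memory just to analyse one portfolio.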

A batch run typically takes 5–30 minutes and runs on 50–200 containers. We have 100 clients. Most have a Production and UAT environment. Each is a separate tenant. Along with demo and internal environments, that’s well over 200 individual accounts. That’s potentially tens of thousands of containers running at once.
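Back of the envelope: 200+ tenants each peaking at 50–200 containers puts the theoretical ceiling somewhere between 10,000 and 40,000 containers if enough runs land at the same time.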

But, and this is the kicker, because it’s all IaC and ephemeral, there’s less toil, less undifferentiated work and no more resource contention or wasteful over-provisioning.

Maintaining Consistency

With all changes of this kind, the devil is in the detail. For us, that’s making sure all containers in an ECS cluster are the same for the life of a calculation run, while still being able to ship new code to our clients as quickly as possible.

Right now, the FundApps team is ~40 engineers (but, y’know, let’s talk). Not everyone on that team works on the engine, but even so, tens of changes land every day.

Ship changes, calculate results, don’t worry!

Once our test suite passes, a GitHub Action bakes a container image and pushes it to ECR in a central CI/CD account. We then create a new task definition in each tenant account that references this new image. We trust the pipeline: a good build will produce a container and update the task definition in 200+ accounts. And we trust our ECS configuration not to pick up changes while a batch is running.
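In rough terms, and with every name, role and sizing value below invented for illustration, the per-tenant step of that pipeline looks something like this: assume a deploy role in the tenant account and register a task definition revision pinned to the freshly built image.

```python
import boto3

# Hypothetical values, for illustration only.
IMAGE_URI = "123456789012.dkr.ecr.eu-west-1.amazonaws.com/rule-engine@sha256:abc123"
TENANT_ACCOUNTS = ["111122223333", "444455556666"]  # in reality, 200+ of these


def register_in_tenant(account_id: str) -> str:
    """Assume a deploy role in the tenant account and register a new task
    definition revision pinned to the freshly built image."""
    creds = boto3.client("sts").assume_role(
        RoleArn=f"arn:aws:iam::{account_id}:role/TenantDeployRole",
        RoleSessionName="engine-release",
    )["Credentials"]
    ecs = boto3.client(
        "ecs",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    result = ecs.register_task_definition(
        family="rule-engine",
        requiresCompatibilities=["FARGATE"],
        networkMode="awsvpc",
        cpu="1024",
        memory="4096",
        executionRoleArn=f"arn:aws:iam::{account_id}:role/EngineTaskExecutionRole",
        containerDefinitions=[
            {"name": "engine", "image": IMAGE_URI, "essential": True},
        ],
    )
    # The revision ARN is what the next batch run gets pinned to.
    return result["taskDefinition"]["taskDefinitionArn"]


for account in TENANT_ACCOUNTS:
    print(register_in_tenant(account))
```

Note the image is referenced by digest rather than a mutable tag, so a revision always means exactly one build.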

Then it’s just a waiting game. Eventually, we’ll see input from clients. The whole apparatus will spin into life. The ECS cluster will see its size change from 0 to (say) 100 tasks. As tasks spin up, they pull the correct image from ECR. Work happens. Results get calculated. Markets become more transparent. And then it’s done and we spin it all down again. And that’s it.
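Seen from the ECS API, that spin-up is little more than pointing the worker service at the pinned task definition revision and raising its desired count. The cluster, service and revision names here are made up for the sketch.

```python
import boto3

ecs = boto3.client("ecs")

# Hypothetical names; the revision was registered by the CD pipeline above.
CLUSTER = "rule-engine-cluster"
SERVICE = "rule-engine-workers"
PINNED_TASK_DEF = "rule-engine:42"


def scale_for_run(worker_count: int) -> None:
    """Scale the worker service out for a batch run. Because the service is
    pinned to a specific task definition revision, any task that dies mid-run
    is replaced with exactly the same image."""
    ecs.update_service(
        cluster=CLUSTER,
        service=SERVICE,
        taskDefinition=PINNED_TASK_DEF,
        desiredCount=worker_count,
    )


# Spin up for the run...
scale_for_run(100)
# ...and back to zero (and zero cost) once the results are calculated.
scale_for_run(0)
```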

Because the task definition is pinned to a specific revision, even if a task dies for some reason, its replacement starts from the same version. There’s no risk of an inconsistent set of tasks within a cluster.

Because there’s a CD process always running, two batch runs started seconds apart could be on different versions. But that’s what we want: the latest and greatest software in front of clients at the earliest possible opportunity.

The payoff

We’ve gone from lovingly tending a handful of big, heavy boxes to starting, upgrading and terminating thousands of containers every day, and not worrying about it.

Because we are using ephemeral, immutable compute (container images), there’s no worry about config drift, patch management or maintenance windows in the traditional sense. It just happens.

Our Infra folk have their weekends back. Our clients have higher quality of service. We’re dedicating more effort to new features. And we’re not wastefully consuming power running servers at idle.

That’s cool. And cheap. And secure. And fast.
