Maude Lemaire is a Sr. Staff Software Engineer @ Slack and technical lead for the backend performance infrastructure team.
Slack didn’t have any load testing tooling. They had a big customer that was putting strain on their infra.
The initial tool (API Blast) just made API requests to the server. It had a few parameters for specifying concurrency, rate limits etc. but that’s it.
- Why did they not use an off-the-shelf tool that already provides this functionality?
API Blast didn’t test the web-socket stack. The only thing that was tested is ingest. Message propagation wasn’t tested since there were no incoming clients.
- The Edge API serves data that is cached in various Points-of-Presence around the world and is needed instantaneously (like the quick switcher, the search bar that comes up on ⌘ + K).
- Real-time services maintain all active web-socket connections with all users around the world and organize those connections by channel.
This worked for a while, but a newer customer wanted a single channel to house all 300K of its users, where 100K users were likely to be active at the same time.
Slack is susceptible to load-related performance problems in 3 key ways:
- Massive fan-out (users sending messages in big channels)
- Event floods
- Thundering herd
- E.g.: If someone posts a message that goes to 100K active users and even 10 users each send 1 reaction, those reactions need to be propagated to 100K clients, which leads to 1 million web-socket events.
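The fan-out arithmetic in that example can be checked with a short sketch. The function name `fanout_events` is hypothetical and only illustrates the multiplication; it assumes every event is delivered once to every active client.

```python
def fanout_events(events_sent: int, active_clients: int) -> int:
    """Web-socket deliveries when each event goes to every active client."""
    return events_sent * active_clients


# The original message alone: 1 event delivered to 100K active clients.
message_deliveries = fanout_events(1, 100_000)

# 10 reactions, each propagated to the same 100K active clients.
reaction_deliveries = fanout_events(10, 100_000)

print(message_deliveries, reaction_deliveries)
```

The point is that delivery count grows with the product of senders and listeners, so a handful of reactions in a huge channel is already a large burst.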
In mid-2019, they built a tool called Puppet Show.
- Simulated real desktop users by spinning up 1000s of headless Chrome browsers logged into Slack (distinct users, distinct tokens) across a K8s cluster.
- Scripts simulated different actions like switching to a channel & posting a message, switching to another & adding a reaction, trolling Slackbot etc.
- A central component (Puppeteer) oversaw all the puppets. The puppets would check in regularly to update the puppeteer about their state. Puppets would receive a script from Puppeteer and start executing it.
- Sidenote: Don’t confuse Puppeteer with the Node.js library used to control Chromium.
- Pros
  - High fidelity. Nothing beats logging into the Slack client and executing actions.
  - Flexible scripting using JavaScript
- Cons
  - Costs a lot. Each puppet instance cost 37 cents per day, so running 100K instances would cost 37K USD every day.
  - Spinning up 100K instances took several days and pods would crash frequently.
- Once it was verified that Slack can handle the load, they stopped using this tool.
They signed up a customer (in 2020, around the pandemic) that wanted support for 500K users (probably IBM) in the same Slack instance.
The headless Chrome browsers were replaced with lightweight client simulators written in Go.
- A koi is a Slack client simulation (a single goroutine)
- A school is a collection of koi (single Go program that runs in a single pod on K8s)
- The keeper manages the schools and keeps track of the overall load test state and parameters.
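A rough sketch of that hierarchy, written in Python for brevity rather than Go. Every class, field, and function name here is hypothetical, and the 1,000 koi per school figure is an assumption for illustration only; the notes do not say how many koi a school holds.

```python
from dataclasses import dataclass, field


@dataclass
class Koi:
    """One simulated Slack client (a single goroutine in the real tool)."""
    user_id: int


@dataclass
class School:
    """A group of koi that would run together in one pod."""
    koi: list = field(default_factory=list)


@dataclass
class Keeper:
    """Tracks all schools and the overall size of the load test."""
    schools: list = field(default_factory=list)

    def total_koi(self) -> int:
        return sum(len(school.koi) for school in self.schools)


def build_keeper(total_koi: int, koi_per_school: int) -> Keeper:
    """Split the requested number of koi into schools of a fixed maximum size."""
    keeper = Keeper()
    for start in range(0, total_koi, koi_per_school):
        end = min(start + koi_per_school, total_koi)
        keeper.schools.append(School([Koi(i) for i in range(start, end)]))
    return keeper


keeper = build_keeper(total_koi=2_500, koi_per_school=1_000)
print(len(keeper.schools), keeper.total_koi())
```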
A JSON configuration file needs to be provided when you boot up a load test. It tells a “koi” what to do once booted.
Each action is mapped to a probability of being performed (“behaviors”), and each action has a matching sequence of steps that the koi executes when it performs that action (“sequences”).
{
  "behaviors": {
    "chat.postMessage": {
      "frequency": 0.043
    }
  },
  "sequences": {
    "chat.postMessage": {
      "doc": "Sends a message to a random channel.",
      "steps": [
        ...
      ]
    }
  }
}
Every tick (configurable but 1s by default), a koi runs through its entire configuration and performs the actions based on their odds.
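A minimal sketch of one tick, assuming `frequency` is the probability that an action fires on a given tick and that each action is checked independently. The notes do not confirm either detail, and `run_tick` is a hypothetical name, not the tool's API.

```python
import random


def run_tick(behaviors: dict, rng: random.Random) -> list:
    """Return the actions a koi performs on a single tick.

    Assumption: each action fires independently with probability `frequency`.
    """
    performed = []
    for action, settings in behaviors.items():
        if rng.random() < settings["frequency"]:
            performed.append(action)
    return performed


behaviors = {"chat.postMessage": {"frequency": 0.043}}
rng = random.Random(0)  # seeded so the simulation is repeatable

ticks = 10_000
posts = sum(len(run_tick(behaviors, rng)) for _ in range(ticks))
print(posts / ticks)  # observed rate, which should land near 0.043
```

With a 1s tick, a frequency of 0.043 would mean a koi posts a message roughly once every 23 seconds on average under these assumptions.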
This was good enough to simulate massive fan-out and event floods, but not thundering herds, since independent random actions can’t simulate coordinated behavior. That’s why they have “formations”, which allow specifying the percentage of users that perform a sequence and the period of time within which they begin.
{
  "formations": [
    {
      "name": "Populate announcement channel with reactions",
      "begin_within_secs": 30,
      "percent": 1.0,
      "sequence": {
        "steps": [
          ...
        ]
      }
    }
  ]
}
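One way a formation could be scheduled is sketched below. This is an assumption about the mechanism, not the tool's actual implementation: it picks a share of the koi and gives each a random start offset inside the `begin_within_secs` window. The notes do not say whether `"percent": 1.0` means 1% or 100%, so the hypothetical `schedule_formation` takes an explicit fraction between 0 and 1 instead.

```python
import random


def schedule_formation(koi_ids: list, fraction: float,
                       begin_within_secs: float, rng: random.Random) -> dict:
    """Map each participating koi to a start offset (seconds) inside the window."""
    count = round(len(koi_ids) * fraction)
    chosen = rng.sample(koi_ids, count)  # participants, without repeats
    return {koi: rng.uniform(0, begin_within_secs) for koi in chosen}


koi_ids = list(range(1_000))
starts = schedule_formation(koi_ids, fraction=0.5, begin_within_secs=30,
                            rng=random.Random(1))

print(len(starts), max(starts.values()) <= 30)
```

A short window packs the same number of participants into less time, which is what makes the load resemble a thundering herd rather than steady background traffic.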
A koi costs 0.1 cents per day to run.
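For scale, here is the daily cost of 100K simulated clients under each tool, using only the per-client figures quoted in these notes (37 cents per puppet, 0.1 cents per koi). The helper name `daily_cost_usd` is hypothetical.

```python
def daily_cost_usd(clients: int, cents_per_client_per_day: float) -> float:
    """Total daily cost in USD for a given number of simulated clients."""
    return clients * cents_per_client_per_day / 100


CLIENTS = 100_000
puppet_cost = daily_cost_usd(CLIENTS, 37)   # Puppet Show: 37 cents per puppet
koi_cost = daily_cost_usd(CLIENTS, 0.1)     # Koi Pond: 0.1 cents per koi

print(f"puppets: ${puppet_cost:,.0f}/day, koi: ${koi_cost:,.0f}/day")
print(f"ratio: {puppet_cost / koi_cost:.0f}x")
```

At these rates, 100K koi cost about 100 USD per day against about 37,000 USD for the same number of puppets, a difference of roughly 370x.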