ragona

Building a high-performance TCP client with async/await in Rust

August 21, 2019

As you might have heard, async/await is coming to Rust soon. This is a big deal. Rust already has popular crates (tokio, actix) that provide asynchronous concurrency, but the async syntax coming to stable in 1.39 is much, much more approachable. My experience has been that you can write and reason about application flow much more easily, which has made me significantly more productive when dealing with highly concurrent systems.

To kick the tires of this new syntax I dug into the nightly branch and built a high-performance TCP client called clobber. In this post I'll talk about why I think async/await in Rust is a big deal, and walk you through some of the code in clobber.

So why is this a big deal?

It significantly lowers the barrier to entry for async programming. This will make Rust a good choice for a greater range of applications -- and in my opinion this will really unlock the ability to build web applications for developers who are interested in Rust, but who also place a high priority on easily understandable code. Lowering the friction to learn a language and make something cool is a big deal, and I think that this change in particular will result in a significant uptick in adoption.

Building a TCP load generator

I've built a couple of high-concurrency HTTP and TCP clients, and it's a fun way to learn about performance in a language. It's also heavily network I/O bound, so your job is to get out of the CPU's way. It's a perfect use case to test async.

Project goals

1. Ease of use

When you're getting a service or a service feature ready to launch, running the load test is one of the steps that always seems to take longer than you budgeted for. Configuring and tuning a load test is hard, and depending on the characteristics of the client and server, you can end up needing to spin up a bunch of big servers just to test your service.

2. Performance

I wanted to build a high-performance TCP client that throws requests at a server as fast as possible, with the goal of supporting more requests with fewer CPU cycles on the load generator than other solutions. I want to be able to do quick testing against services running on localhost and achieve good performance using a small number of OS threads.

3. Readable code

The part of async syntax that I'm really excited about is that it produces much more readable code. One of the tricky parts here is that the Rust community doesn't yet have well-established and documented idioms around async, but luckily you can mostly fall back on the standard Rust style. In this project it was especially important to me to produce code that is easy to understand, since I struggled a bit to find meaty but not incredibly complicated examples of the async coding style in Rust.

Dependencies

One of the current rough edges in Rust async is that most of the popular libraries do not yet support the syntax. In order to create clobber I had to dip into some lesser-known libraries. tokio, for example, doesn't support async yet, but luckily there is an actively maintained fork called romio that offers basic TCP functionality using the async syntax. romio is the main dependency for clobber, and just like tokio, romio is a wrapper on top of mio, which itself wraps non-blocking sockets in std::net.
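To make "non-blocking sockets" a little more concrete, here is a minimal std-only sketch. It does not use romio or mio at all, and the function name idle_read_would_block is just made up for this illustration. It shows that a read on a non-blocking socket with nothing to read returns a WouldBlock error immediately instead of parking the thread. Libraries like mio pair this behavior with OS readiness notification (epoll, kqueue and friends) so they know when it's worth trying again.

```rust
use std::io::{self, ErrorKind, Read};
use std::net::{TcpListener, TcpStream};

/// Hypothetical helper: connects to a local listener that never sends anything
/// and reports whether a non-blocking read comes back with `WouldBlock`.
fn idle_read_would_block() -> io::Result<bool> {
    // Port 0 asks the OS for any free port, so no external server is needed.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let mut stream = TcpStream::connect(listener.local_addr()?)?;

    // From here on, reads and writes return immediately instead of blocking.
    stream.set_nonblocking(true)?;

    let mut buf = [0u8; 64];
    let would_block = match stream.read(&mut buf) {
        Err(e) if e.kind() == ErrorKind::WouldBlock => true,
        _ => false,
    };

    // `listener` stays alive until here, so the connection is not closed early.
    Ok(would_block)
}

fn main() -> io::Result<()> {
    println!("idle read would block: {}", idle_read_would_block()?);
    Ok(())
}
```

An async runtime turns that WouldBlock into "this future is pending", which is what lets one thread juggle many sockets.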

The performance I'm seeing out of romio is solid, but I do see some impact from std::sync::Mutex in my profiling, and I wonder if we might be able to squeeze a bit more performance out of the stack by offering a thread-local reactor. I don't know much about that area though, so there may be a very good reason that won't work.

clobber example code

Let's dig into the code. The whole program is kicked off with a normal synchronous function which will block until the test is done. It's handy that the async keyword can be scoped to a routine so that clients calling your library from sync code don't have to worry about it:

fn main() -> io::Result<()> {
    // setup
    tcp::clobber(settings, Message::new(bytes)).expect("Failed to clobber :(");
    // teardown
    Ok(())
}

The main clobber() function handles creating the actual async futures. Since you can only take advantage of .await within an async context, you have to pass your initial future to an executor, which will handle polling the future. I wanted to use a LocalPool executor to constrain execution to a single OS thread since clobber is heavily network I/O bound, and this isn't a good use case for work stealing. This improves performance, but requires some additional logic.

pub fn clobber(config: Config, message: Message) -> std::io::Result<()> {
    let mut threads = Vec::with_capacity(config.num_threads() as usize);

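    for _ in 0..config.num_threads() {
        // Each thread gets its own copies (assumes Config and Message implement Clone)
        let message = message.clone();
        let config = config.clone();

        let thread = std::thread::spawn(move || {
            let mut pool = LocalPool::new();
            let mut spawner = pool.spawner();

            for _ in 0..config.connections_per_thread() {
                // Each connection future takes ownership of its own copies
                let message = message.clone();
                let config = config.clone();

                spawner
                    .spawn(async move {
                        // do async things!
                        connection(message, config)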
                            .await
                            .expect("Failed to run connection");
                    })
                    .unwrap();
            }
            pool.run();
        });
        threads.push(thread);
    }
    for handle in threads {
        handle.join().unwrap();
    }

    Ok(())
}

If you didn't care about LocalPool and managing OS threads then this example would be even shorter -- you could just use something like juliex to spawn all futures. I wanted to be able to more precisely control OS threads, and I also wanted to show a use case that is somewhat bigger than the standard bite-sized snippet.
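If "an executor polls the future" still feels abstract, here is a rough std-only sketch of the smallest executor I can think of. This is not how LocalPool or juliex are implemented; block_on, ThreadWaker, add and add_sync are names invented for this example. It also shows the trick from earlier of hiding async code behind a plain synchronous function.

```rust
use std::future::Future;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Wakes the thread that is parked inside `block_on`.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Hypothetical minimal executor: drives one future to completion on the
/// current thread by polling it until it reports `Ready`.
fn block_on<F: Future>(future: F) -> F::Output {
    // Futures must be pinned before they can be polled.
    let mut future = Box::pin(future);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);

    loop {
        match future.as_mut().poll(&mut cx) {
            Poll::Ready(output) => return output,
            // Not ready yet: sleep until the waker unparks us, then poll again.
            Poll::Pending => thread::park(),
        }
    }
}

/// An async function; calling it only builds a future and runs nothing yet.
async fn add(a: u32, b: u32) -> u32 {
    a + b
}

/// A synchronous wrapper, so callers never see the async internals.
fn add_sync(a: u32, b: u32) -> u32 {
    block_on(add(a, b))
}

fn main() {
    println!("1 + 2 = {}", add_sync(1, 2));
}
```

A real executor like LocalPool does the same job for many spawned futures at once, and only re-polls the ones that have been woken.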

From here on out we're into async functions, so we don't have to worry about an executor. Here is the core of the connection() function:

async fn connection(message: Message, config: Config) -> std::io::Result<()> {
    // This is the guts of the application; the loop that executes requests
    while !loop_complete() {
        if let Ok(mut stream) = connect(&config.target).await {
            if write(&mut stream, &message.body).await.is_ok() {
                read(&mut stream).await.ok();
            }
        }
    }

    // Returning a Result is what lets the caller use .expect(...)
    Ok(())
}

Alright, here's what's awesome about async. I've expressed a three-step process here (connect, write, read) in a tight loop, and the loop repeats for the entire lifetime of the application. This is not going to block the OS thread when waiting on I/O though; each .await yields execution, and each OS thread will hop between its request loops as appropriate. This pretty much just works, and I got solid performance results right out of the box with this approach. (Though I will note that this was not the first implementation I attempted, and I got less promising results out of some of the default reactors.)
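Here is a toy, std-only sketch of that hopping behavior. Everything in it is hypothetical: YieldNow stands in for "waiting on I/O", request_loop stands in for one pass through connection(), and the round-robin loop is a crude stand-in for a real executor plus reactor (which would only re-poll a task once its socket is ready, rather than polling everything in turn).

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// A waker that does nothing; the loop below re-polls every pending task anyway.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Stand-in for an I/O wait: pending on the first poll, ready on the second.
struct YieldNow(bool);

impl Future for YieldNow {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending
        }
    }
}

type Log = Rc<RefCell<Vec<String>>>;

/// Hypothetical request loop: logs each step, then "waits on I/O" by yielding.
async fn request_loop(name: &'static str, log: Log) {
    for step in ["connect", "write", "read"] {
        log.borrow_mut().push(format!("{}:{}", name, step));
        YieldNow(false).await;
    }
}

/// Runs two request loops on the current thread, polling them round-robin.
fn run_interleaved() -> Vec<String> {
    let log: Log = Rc::new(RefCell::new(Vec::new()));
    let mut tasks: VecDeque<Pin<Box<dyn Future<Output = ()>>>> = VecDeque::new();
    tasks.push_back(Box::pin(request_loop("a", log.clone())));
    tasks.push_back(Box::pin(request_loop("b", log.clone())));

    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);

    // A task that is still pending goes to the back of the queue.
    while let Some(mut task) = tasks.pop_front() {
        if task.as_mut().poll(&mut cx).is_pending() {
            tasks.push_back(task);
        }
    }

    let entries = log.borrow().clone();
    entries
}

fn main() {
    for entry in run_interleaved() {
        println!("{}", entry);
    }
}
```

The log alternates between the two loops (a connects, b connects, a writes, b writes, and so on) even though only one thread is involved, because every await point hands control back to the loop that is driving the tasks.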

Performance Results

I'm gonna stay a bit fuzzy here, since I'm currently just testing on my laptop, but it's fast! Easily faster than other similarly simple implementations that I've written in other languages. My laptop can happily send over 20,000 requests per second to a local host, and if you're willing to bundle multiple requests into a single write then you can hit over a million requests per second. I tried out some comparable tools written in C, and was pleased to see that the Rust implementation kept up or surpassed them.
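For anyone wondering what "bundle multiple requests into a single write" looks like, here is a small sketch using only std. The names (bundle, CountingWriter, send_individually, send_bundled) and the PING payload are invented for illustration and are not taken from clobber. CountingWriter stands in for a socket so the difference in write calls is easy to see.

```rust
use std::io::{self, Write};

/// Hypothetical helper: repeats one request payload `count` times in one buffer.
fn bundle(request: &[u8], count: usize) -> Vec<u8> {
    request.repeat(count)
}

/// Stand-in for a socket that records how often `write` is called.
#[derive(Default)]
struct CountingWriter {
    calls: usize,
    bytes: usize,
}

impl Write for CountingWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.calls += 1;
        self.bytes += buf.len();
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// One write per request.
fn send_individually<W: Write>(out: &mut W, request: &[u8], count: usize) -> io::Result<()> {
    for _ in 0..count {
        out.write_all(request)?;
    }
    Ok(())
}

/// All requests concatenated and handed to the writer in one go.
fn send_bundled<W: Write>(out: &mut W, request: &[u8], count: usize) -> io::Result<()> {
    out.write_all(&bundle(request, count))
}

fn main() -> io::Result<()> {
    let request = b"PING\r\n"; // hypothetical payload

    let mut one_by_one = CountingWriter::default();
    send_individually(&mut one_by_one, request, 100)?;

    let mut bundled = CountingWriter::default();
    send_bundled(&mut bundled, request, 100)?;

    println!("individual: {} calls, {} bytes", one_by_one.calls, one_by_one.bytes);
    println!("bundled:    {} calls, {} bytes", bundled.calls, bundled.bytes);
    Ok(())
}
```

The same bytes go out either way; the bundled version just makes far fewer trips through the write path, which is where the extra throughput comes from. Whether the server treats a bundle as separate requests depends on the protocol.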

I haven't tried to implement any interesting performance black magic, and there are plenty of areas that could be optimized for better performance. (If you happen to spot one, point it out! I'd love to hear about it.) I haven't done more formal benchmarking yet, but initial results suggest that for some use cases this is faster than some of my favorite existing tools, which often struggle around 10k per second.

Next Steps

std::sync usage

The standard sync primitives like Mutex are still showing up in my profiling as potentially costly. There are still parts of this application that use std::sync via dependencies, and with enough worker threads I'm concerned that whatever thread is doing cross-thread sync will ultimately become a bottleneck that limits top-end performance.
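To illustrate the general concern (this is not a reproduction of what romio or clobber do internally, and both function names are made up), here is a sketch of two ways to count requests across threads. The first takes a shared Mutex on every request; the second keeps a private tally per thread and only combines the results once at the end.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Every thread locks one shared counter for each request it records.
fn count_with_shared_mutex(threads: u64, requests_per_thread: u64) -> u64 {
    let total = Arc::new(Mutex::new(0u64));

    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let total = Arc::clone(&total);
            thread::spawn(move || {
                for _ in 0..requests_per_thread {
                    // Cross-thread synchronization on every single request.
                    *total.lock().unwrap() += 1;
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    let result = *total.lock().unwrap();
    result
}

/// Every thread keeps a private counter and hands it back once, when it finishes.
fn count_with_local_tallies(threads: u64, requests_per_thread: u64) -> u64 {
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            thread::spawn(move || {
                let mut local = 0u64;
                for _ in 0..requests_per_thread {
                    // No locking on the hot path.
                    local += 1;
                }
                local
            })
        })
        .collect();

    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    println!("shared mutex:  {}", count_with_shared_mutex(4, 10_000));
    println!("local tallies: {}", count_with_local_tallies(4, 10_000));
}
```

Both versions produce the same total. The difference is how often threads have to coordinate, which is the kind of cost I'd like to push out of the hot path.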

TLS support

I suspect that 99% of the use of such a tool would be for HTTP requests, and this enables fast testing of local HTTP endpoints by performing a connect/write/read loop. A full version of such a tool would need TLS support and a number of other currently-missing features, and Rust doesn't currently have a stable async TLS implementation. In general this is something I expect the Rust community to take care of, but it'll take time.
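As a rough sketch of what the HTTP case looks like from the client side, the message is just the raw bytes of a request. The helper below is hypothetical (http_get is not part of clobber), and it assumes a plain-text HTTP/1.1 endpoint with no TLS. It adds Connection: close because the loop opens a fresh connection for every request.

```rust
/// Hypothetical helper: builds the raw bytes of a minimal HTTP/1.1 GET request.
fn http_get(host: &str, path: &str) -> Vec<u8> {
    format!(
        "GET {} HTTP/1.1\r\nHost: {}\r\nConnection: close\r\n\r\n",
        path, host
    )
    .into_bytes()
}

fn main() {
    let body = http_get("localhost", "/");
    // Print the request so the exact bytes on the wire are visible.
    print!("{}", String::from_utf8_lossy(&body));
}
```

Bytes like these are what the connect/write/read loop would write to the socket on every iteration.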

Conclusion

Async in Rust is going to enable a much better developer experience for a big category of applications, and I expect that it will drive adoption. As soon as the async MVP hits stable I expect to see a flurry of activity, and I can't wait to see what the Rust community builds. There are still some rough edges, and it'll take some time before the async library ecosystem has stabilized. That might take a year or two, but Rust is making very fast progress in this area and producing excellent results.

Is it time to go turn your production web service into an async Rust application? No, probably not. But that time is coming, and early signs are incredibly promising.