Video live streaming: Notes on RTMP, HLS, and WebRTC

These days, when people talk about "live streaming," they might be talking about three quite different underlying technologies.

RTMP is widely used to send video into a live session, but is rarely used for viewing video streams.
HLS is how platforms like Twitch deliver live streams to large audiences.
WebRTC was designed to support interactive use cases like video calls, and is the underlying technology in applications like Google Meet, Microsoft Teams video, and Discord video.

Why are there three standards?

Well, partly these standards evolved over time, and are still evolving. Our computers and phones and networks continue to get faster, people keep thinking up new things to do with faster computers and networks, and better and better technical building blocks are created to support these new things we're all doing.

But also, delivering video at scale is challenging, and engineering trade-offs abound. HLS and WebRTC are nice examples of two (good) approaches to solving a problem (video streaming) that optimize for very different aspects of that problem.

From RTMP to HLS: Two decades of progress in live streaming

RTMP is 20 years old, and was originally developed by Macromedia for the Flash server and player. That's pretty ancient technology in Internet video terms. But because RTMP has been around so long, it's widely supported as a way to send live video from one computer system to another.

If you are a Twitch streamer, for example, you're almost certainly using RTMP to send video from your gaming machine/music studio/vibe lab to Twitch's cloud infrastructure.

On the other hand, if you're watching a stream on Twitch, you're not watching that original RTMP stream. Twitch uses a much newer standard, HLS, to deliver video to the millions and millions of people who watch live streams every day.^[1]

There are quite a few nice things about HLS, but two advantages over RTMP stand out.

First, HLS supports multiple bitrates and allows viewing clients to switch between multiple bitrates dynamically. This is generally important for video delivery at scale, and it's especially important for live streams. If I'm watching a video on my phone on a not-very-great cellular data connection, I have a lot less bandwidth available than if I were watching the same video at home where my Internet connection is (usually) very fast.^[2]

What this means is that I actually need to watch a different encoding of the video if I'm on my cellular data connection than if I'm using my home internet. On my phone, I just want to see the video and I know the quality's not going to be great. At home, I want the best quality video available.

A further wrinkle is that the bandwidth I have available might change while I’m watching the video. My phone could move closer to a cell tower and have faster service. Or at home, my laptop might decide to download a big system update which could cut in half the bandwidth available for video.

Just the chunks, ma’am

To support dynamic playback bitrates, HLS video is packaged as a series of short time chunks. Each chunk is encoded at several different bitrates. A typical HLS "bitrate ladder" today will include five encodings ranging from about 6 megabits per second on the high end to about 250 kilobits per second at the bottom. (Higher bitrates deliver higher video quality.)

Video players that support HLS are able to continuously monitor approximately how much bandwidth is available and to switch to downloading smaller-bitrate or larger-bitrate chunks.
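Here is a minimal sketch of that switching decision, in Python. Only the top and bottom of the ladder come from the numbers above. The intermediate rungs, the `pick_bitrate` name, and the 0.8 safety factor are assumptions made up for illustration, and real players tend to use more elaborate heuristics.

```python
# Hypothetical five-rung bitrate ladder, in bits per second.
# Only the ~6 Mbps top and ~250 kbps bottom come from the article;
# the rungs in between are illustrative assumptions.
LADDER_BPS = [6_000_000, 3_000_000, 1_500_000, 700_000, 250_000]


def pick_bitrate(measured_bps, ladder=LADDER_BPS, safety=0.8):
    """Return the highest rung that fits within a safety fraction of the
    measured bandwidth, falling back to the lowest rung."""
    budget = measured_bps * safety
    for rung in sorted(ladder, reverse=True):
        if rung <= budget:
            return rung
    return min(ladder)


fast_home = pick_bitrate(10_000_000)   # plenty of headroom
busy_home = pick_bitrate(2_000_000)    # budget is 1.6 Mbps
weak_cell = pick_bitrate(100_000)      # below every rung
```

The safety factor leaves headroom so that a small dip in bandwidth does not immediately stall playback. A player would call something like this before each chunk download.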

The metadata about chunks is stored in manifest files, which are usually named something.m3u8. If you open Chrome devtools and start playing a video on Twitch, you can look in the network panel for fetches of .m3u8 files, and see a bunch of interesting information about the HLS encoding!
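To give a feel for what is inside a top-level manifest, here is a small Python sketch that lists the available bitrates. The `#EXT-X-STREAM-INF` tag and its `BANDWIDTH` attribute are part of the HLS playlist format. The sample playlist itself, its paths, and the `list_variants` helper are invented for illustration and are not what Twitch actually serves.

```python
import re

# A made-up master playlist with three variants (illustrative only).
SAMPLE_MASTER_PLAYLIST = """\
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=6000000,RESOLUTION=1920x1080
1080p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1500000,RESOLUTION=1280x720
720p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=250000,RESOLUTION=416x234
234p/index.m3u8
"""


def list_variants(playlist_text):
    """Return (bandwidth_bps, uri) pairs for each variant stream."""
    lines = [ln.strip() for ln in playlist_text.splitlines() if ln.strip()]
    variants = []
    for i, line in enumerate(lines):
        if line.startswith("#EXT-X-STREAM-INF:"):
            # Match BANDWIDTH but not AVERAGE-BANDWIDTH.
            match = re.search(r"(?:^|[:,])BANDWIDTH=(\d+)", line)
            if match and i + 1 < len(lines):
                # The URI for a variant is on the line after its tag.
                variants.append((int(match.group(1)), lines[i + 1]))
    return variants


variants = list_variants(SAMPLE_MASTER_PLAYLIST)
for bandwidth, uri in variants:
    print(f"{bandwidth / 1_000_000:.2f} Mbps -> {uri}")
```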

Standing on the shoulders of giants

The second advantage that HLS has over RTMP is that the HLS chunks can be delivered efficiently by HTTP CDNs like Cloudflare, Fastly, Akamai, and CloudFront.

"HLS" is an acronym that expands to "HTTP Live Streaming." HLS chunks and manifests mostly just look like files. Players fetch manifests and chunks with normal HTTP requests. So HLS benefits from the amazing capabilities of modern CDNs.

RTMP, on the other hand, is its own TCP-level thing. It is a streaming-oriented protocol, rather than a request-oriented protocol like HLS. To send video and audio over an RTMP connection, you open up a TCP socket and you start pushing RTMP-formatted data over that socket.

An enormous amount of infrastructure engineering work has gone into making HTTP fast, scalable, and cost effective everywhere in the world. The CDN-friendly nature of HLS was a core design goal and is a big deal.

But there's a catch. Latency. Which will bring us to WebRTC.

But first… an interlude on latency

Let's do some hand-waving and define the latency we're interested in for live streaming as "the end-to-end delay between a video sender and a video viewer."

Here's a very schematic view of the stages that video bits go through to get from a video sender to a viewer, relayed through a media server:

1. capture (for example, a webcam)
2. encoding (into a compressed video format like AVC)
3. transmission to the media server (move those bytes across the network somehow)
4. processing the video on the media server
5. transmission to the viewer (move those bytes across the network again)
6. playback (decode, decompress, and finally show the video on a screen)

For each of these steps, there are complicated trade-offs between video quality, reliability, cost, and latency.^[3]

With today's fast computers and nifty video codecs it's possible to compress video at fairly high quality with just a few tens of milliseconds of delay. Similarly, networking in general is pretty fast. We've gotten good at routing packets on the Internet. Steps 1-6 typically add up to somewhere between 50ms and 300ms.

Except... for possible knock-on effects from how we choose to encode, package, and transfer the video data. (Steps two through four, above.) The chunked, request-oriented approach of HLS forces a lot of extra latency at several stages.

Six seconds is a typical HLS chunk length.^[4] To oversimplify a little bit, that means there's six seconds of latency added to the pipeline while we encode, package, and upload each chunk. Then our CDN has to fetch and cache the chunks. And finally, on the viewer side, most players will download two full chunks and cache them locally before starting to play the video while the third chunk is downloading.

Our latency is now somewhere between 12 and 20 seconds.
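The arithmetic behind that range can be sketched roughly as follows. This is a back-of-the-envelope model, not a measurement. The `estimate_hls_latency_s` name and the one-second allowance for CDN and network overhead are assumptions chosen for illustration.

```python
def estimate_hls_latency_s(chunk_s=6.0, buffered_chunks=2, cdn_and_network_s=1.0):
    """Rough end-to-end HLS latency estimate, in seconds.

    chunk_s:            chunk duration (six seconds is typical)
    buffered_chunks:    full chunks the player downloads before starting
    cdn_and_network_s:  assumed allowance for CDN fetch/cache and transfer
    """
    # A chunk can't be published until all of it has been encoded.
    packaging_delay = chunk_s
    # The player holds this much video before it starts playing.
    player_buffer = buffered_chunks * chunk_s
    return packaging_delay + player_buffer + cdn_and_network_s


typical = estimate_hls_latency_s()                # 6 + 12 + 1
short_chunks = estimate_hls_latency_s(chunk_s=2.0)  # 2 + 4 + 1
print(typical, short_chunks)
```

The model makes one thing obvious: almost every term scales with chunk duration, which is why shortening chunks is the first lever people reach for.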

It's possible to push HLS latency down by using shorter chunk sizes, streaming the chunk uploads to our CDN origin server, extending our CDN's low-level mechanics so that data can start propagating while chunks are in flight, and optimizing buffering on the playback side.

All of these changes make playback less resilient and cut against the core engineering trade-off at the heart of HLS: optimizing video for delivery via HTTP infrastructure.

Below chunk sizes of about one second, we lose most of the benefits of delivering video via CDNs.^[4:1] A request-oriented approach like HLS can therefore get down to about two seconds of latency in the best case, and four or five seconds of latency in the typical case.

If our target latency is lower than that, we'll need to take a different approach.

WebRTC – an industry standard for low-latency audio and video

WebRTC is a stream-oriented standard (like RTMP) that supports adaptive bitrates and is natively supported in today's web browsers (like HLS). Low latency was the primary design goal of WebRTC.

With WebRTC, video encoding happens on the sending side and video is sent as a continuous stream. WebRTC media servers don't usually transcode the video.^[5] When possible, WebRTC connections are UDP rather than TCP, which lowers networking overhead and latency. And, finally, on the receiving side, a typical playback buffer is 30ms, rather than HLS's typical multi-second buffers.

All of this together means that WebRTC end-to-end latency will usually be between 50ms and 200ms. That's low enough to work well for two people having a conversation and is in line with typical latencies for telephone calls.

WebRTC's low latency comes with two big trade-offs.

First, the approach to maintaining video quality has to be different from the approach that HLS relies on. And second, it's harder to support large audiences with WebRTC.

Bandwidth and packet loss

From a network engineer's perspective, video quality hinges on the answers to just two simple questions:

1. How many video packets can you push through a network connection? (This is typically called "bandwidth.") And,
2. How reliably do those packets arrive within a specific latency window? (This is a combination of packet loss and jitter.)

These two metrics are both important, and because a network stack is a complicated layer cake of protocols, they interact with each other in interesting and sometimes surprising ways.^[6] In all cases, though, the end user experience of a “worse network” is the same because there are only two degrees of freedom available to the video stack: the video either pauses while the player buffers (or re-buffers), or the user sees a lower-quality (lower-bitrate) version of the video.

Very approximately, as packet loss or jitter go up, the effective bandwidth of a connection goes down. However, it's fair to say that at moderate levels of packet loss, if you don't care too much about latency, you can just buffer a lot of packets and not think too much about packet loss and jitter.^[7]

Buffering a lot of packets — many seconds worth of packets — is the approach that HLS takes to maintaining video quality in the face of real-world network behavior.

Microwave ovens and latency budgets

Let's say you're using your home wifi connection to watch a movie on Netflix.^[8] You decide to microwave some popcorn. Your wifi connection is using one of the 2.4 GHz bands, so packet loss on the network spikes terribly as soon as the microwave oven turns on.

HLS can deal with this pretty well. There are already several seconds of video cached, so that video keeps playing just fine. The player will notice that the in-progress chunk downloads are taking longer. If the packet loss spike is shorter than a few seconds, TCP retries will probably just compensate for the packet loss and the player won't need to do anything at all. If the packet loss continues, the player has plenty of time to decide whether to switch gears and start downloading lower-bitrate chunks.

The very tight latency budget for WebRTC makes for a completely different situation.

WebRTC playback buffers usually don't hold more than 50ms or so of video packets. If there's a big packet loss spike, the player can't just keep rendering cached video data. And the round-trip time to re-request a lost packet will often be slow enough that it's not worth asking the server to send us packets we missed.^[9] Finally, the player has to make any decision about hopping down to a lower bitrate very quickly.

All of this together means that the typical WebRTC failure mode for dropped packets is just to skip that packet and continue playing the video as well as possible.^[10]
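To make the contrast concrete, here is a toy Python model of the same burst of delayed packets hitting a small playout buffer and a large one. The delay values, the buffer sizes, and the `playable_fraction` helper are all invented for illustration. Real jitter buffers are adaptive and considerably more subtle.

```python
def playable_fraction(extra_delays_ms, buffer_ms):
    """Fraction of packets that arrive in time to be played.

    extra_delays_ms: how late each packet arrives relative to when it was
                     expected (a retransmitted packet just shows up later)
    buffer_ms:       how much video the player holds before playing it
    """
    if not extra_delays_ms:
        return 0.0
    on_time = sum(1 for delay in extra_delays_ms if delay <= buffer_ms)
    return on_time / len(extra_delays_ms)


# Hypothetical delays during a short burst of interference, in milliseconds.
delays_ms = [10, 20, 400, 30, 900, 15, 25, 12]

webrtc_like = playable_fraction(delays_ms, buffer_ms=50)     # tiny buffer
hls_like = playable_fraction(delays_ms, buffer_ms=6000)      # seconds of buffer
print(webrtc_like, hls_like)
```

With a 50ms buffer, the two badly delayed packets are effectively lost, so the player has to skip them. With several seconds of buffer, every packet still arrives before it is needed.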

In a real-time video call, video quality issues look like small framerate glitches, worsening to either visual corruption or long freezes if a video keyframe is missed.^[11] To compensate for packet loss, the media server and the player can cooperate to choose a lower bitrate stream. But that probably won't happen quickly enough to avoid visual artifacts. To compensate for all of this, WebRTC implementations usually default to somewhat lower bitrates than HLS implementations do.

HLS is more resilient to network issues, at the cost of much higher latency than WebRTC.

WebRTC can deliver sub-200ms latencies in the average case (and sub-400ms latencies almost always) but at somewhat lower average video quality.

If interactive latencies – latencies below 400ms – are the goal, WebRTC's trade-offs are clearly the right ones.

Not all heroes wear capes (some of them carry pagers)

The other big difference between HLS and WebRTC is how difficult it is to support large audiences for a live stream and large numbers of users across a service or application.

HLS offloads to HTTP CDNs most of the complexity of scaling up distribution. This is great, because modern CDNs are very good.

WebRTC is newer than HTTP, and vastly fewer engineering hours have gone into building infrastructure at scale for WebRTC than for HTTP.

However, more and more work is going into WebRTC infrastructure, because low-latency video is more and more widely used. A number of companies now offer scalable WebRTC infrastructure as a service. And an increasing number of open source projects provide excellent building blocks for deploying production infrastructure and for experimenting with new approaches.^[12]

Improving WebRTC infrastructure

The big challenges involved in building out a "WebRTC CDN" are:

Media servers need to copy incoming UDP video and audio packets and route the copies to each viewer of a live stream. This has to be done fast. This copy-and-routing job is relatively inexpensive from a compute perspective, and it's relatively simple conceptually. But down at the implementation level, there are a number of tricky components, such as estimating the bandwidth available to each viewer and switching to different bitrates as needed.

At some point, the number of packets that need to be routed exceeds the limits of a single machine. So it's necessary to implement cascading or mesh networking between servers.

Relatedly, the building blocks that are now standard for horizontally scaling HTTP and other request-oriented workloads don’t really work for WebRTC servers. So scaling an application or service (lots of live sessions in parallel) requires writing custom scale-out/scale-in logic, implementing appropriate service discovery, creating new monitoring and observability tooling, and so on.

Packet loss and jitter are generally much lower when connecting to nearby servers than to servers that are far away. So if the audience for a live stream is geographically distributed, it's important to have a distributed infrastructure of media servers and (again) necessary to implement mesh or cascading server-to-server media transit.

Encoding a stream in real time for playback with very small buffers is challenging. The current state of the art is to encode to three bitrates (rather than to five or six, as is typical with HLS).
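The copy-and-route job from the first challenge, combined with the three-bitrate encoding from the last one, can be sketched in a few lines of Python. This is a conceptual illustration only. The layer names and bitrates, the `Viewer` class, and the `choose_layer` and `fan_out` functions are all hypothetical, and a real media server works on individual RTP packets with continuously updated bandwidth estimates.

```python
from dataclasses import dataclass

# Hypothetical three-layer encoding, in bits per second (illustrative values).
LAYERS_BPS = {"high": 2_500_000, "medium": 800_000, "low": 200_000}


@dataclass
class Viewer:
    viewer_id: str
    estimated_bps: float  # the server's current bandwidth estimate


def choose_layer(viewer, layers=LAYERS_BPS):
    """Pick the highest-bitrate layer that fits the viewer's estimate."""
    fitting = [name for name, bps in layers.items() if bps <= viewer.estimated_bps]
    if not fitting:
        # Nothing fits: send the smallest layer and hope for the best.
        return min(layers, key=layers.get)
    return max(fitting, key=layers.get)


def fan_out(packet_layer, viewers):
    """Return the ids of viewers who should get a copy of this packet."""
    return [v.viewer_id for v in viewers if choose_layer(v) == packet_layer]


viewers = [
    Viewer("fiber", 5_000_000),
    Viewer("dsl", 1_000_000),
    Viewer("train", 100_000),
]
chosen = [choose_layer(v) for v in viewers]
medium_recipients = fan_out("medium", viewers)
print(chosen, medium_recipients)
```

Note that nothing here re-encodes video. The server only decides which already-encoded layer each viewer receives, which is what keeps routing cheap compared to transcoding.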

Today, WebRTC is mature enough that for live streams with up to roughly 15,000 viewers, it's often preferable to use WebRTC over HLS. Low latency allows features like bringing viewers onto a "stage" to participate in the live stream, interactive audience features like polls and emoji reactions, and real-time bidding in live auctions.

As WebRTC infrastructure continues to improve, larger and larger low-latency live streams are likely to become more and more widely used.

[1] Twitch's servers transcode the RTMP video to HLS in real time. And that transcoding is actually multiple encodes to generate streams at several different bitrates. (Six bitrates, today at least. I just started watching a random Twitch stream in Chrome, and looked at the manifest files in the devtools network console, to check.) That's really impressive. Did I mention that computers keep getting faster?

[2] And also things like battery life, but we're trying to keep this explanation relatively simple.

[3] Chunk sizes are typically between five and ten seconds because chunks that are at least that big make it easy for clients to buffer enough video to keep playing even when there’s a temporary decrease in network bandwidth, and having relatively big chunks improves CDN performance (and cost) by increasing the likelihood that the chunks will be cached and served directly from the CDN’s edge nodes.

[4] As mentioned in the previous footnote, small chunks have lower CDN cache hit rates. To decide which chunk to download, the player needs to look at recent network performance. With smaller chunks, the player must average network performance over a shorter time window, so bitrate variance is higher. There are also more chunks, which adds to cache pressure. Finally, chunk downloads will overlap for a larger percentage of wall clock time, complicating bandwidth estimation.

[5] Media servers are a big topic, in and of themselves. For our purposes here, it's sufficient to note that WebRTC media servers typically just route packets rather than transcode video. Routing is much cheaper than transcoding in both time and dollars, both of which are important to scaling to large numbers of simultaneous viewers.

[6] Most Internet speed test tools only show you bandwidth. They don't show packet loss and jitter. They don't test for cliffs in packet loss by ramping up bandwidth slowly. (Non-linear fall-offs in effective bandwidth are the norm on real-world connections.) Finally, many ISPs prioritize packets going to/from the big speed test origins. (I’m shocked, shocked to find that gambling is going on in here!) So … most speed tests are only very vaguely useful for testing whether a video call will work well.

[7] Sincere apologies to any queueing theory aficionados who are unhappy about this extreme over-simplification.

[8] Netflix does not use HLS. Netflix uses a custom version of MPEG-DASH. So do YouTube and HBO Max. (Well, YouTube is partly a testbed for new Google technologies, so it's a little bit hard to generalize about YouTube.) MPEG-DASH is similar enough to HLS that it's fine for the purposes of this article to treat them as the same thing. My understanding is that Netflix uses DASH partly for historical reasons and partly because DASH supports DRM better (for some version of "better") than HLS does.

[9] Remember that we're using UDP rather than TCP, so the player logic has to request missed packets. There's no automatic mechanism for that in UDP.

[10] Interestingly, the "very bad network" failure mode for the HLS approach is arguably worse than the "very bad network" failure mode of WebRTC. An HLS connection that can't maintain sustained real-time download and exhausts its (big) buffer will have to buffer until new chunks can load. Thanks to this failure mode and to YouTube's UI for it, "buffering" is a universally understood technical concept. On the other hand, a WebRTC implementation can always just play whatever video frames are able to squeeze through the bad network connection. Queueing theorists, are you happier now?

[11] Keyframes and encoding strategies for video resilience are another big and interesting topic. Check out Daily engineer Vanessa's glitch art talk on the subject.

[12] Companies that support large WebRTC live streams as a major feature of their commercial platforms include Millicast, Agora, and Daily. Open source WebRTC media server projects of note include Jitsi, Mediasoup, and Pion. Zoom offers low latency live streams using a proprietary media stack that is broadly similar to WebRTC. Most of the technical discussion of WebRTC in this article applies, more or less, to Zoom's (very good) stack, too. Zoom's proprietary stack does not run natively in the web browser, however, so Zoom's low latency live streams require using Zoom's native client apps.
