
How video works

This is an informational site about video and how it works. We started by talking about Playback, next we added content about Delivery. Stay tuned for more about Processing and Capture.

  1. Playback
  2. Delivery
  3. Processing
  4. Capture

Playback

It might seem obvious, but there's a lot wrapped up in the word "playback"! The video player's UI is the first thing that most people associate with playback since it's what they see and interact with the most whenever they watch video on a site or platform. The icons and colors used in a player’s controls might be the most visible to the end-user, but that's just the tip of the playback iceberg.

The player also controls or exposes vital aspects of playback beyond the controls themselves: subtitles and captions, programmatic APIs for controlling playback, hooks for things like client-side analytics and ads, and much more.

Perhaps most importantly, a modern video platform will use what's called adaptive bitrate streaming, which means they provide a few different versions of a video, also known as renditions, for the player to pick from. These different versions vary in display sizes (resolutions) and file sizes (bitrate), and the player selects the best version it thinks it can stream smoothly without needing to pause so it can load more of the video (buffering). Different players make different decisions around how and when to switch to the different versions, so the player can make a big difference in the viewer's experience!

You might remember watching videos on Netflix or YouTube and noticing that sometimes, in the middle of the video, the quality gets worse for a few minutes and then suddenly gets better. That quality change was adaptive bitrate streaming in action. If you are doing any kind of video streaming over the internet, your solution must support this feature; without it, a large number of your viewers are likely to be unable to stream your content.


HLS

While the spec overall can be intimidating, the basic concept behind HLS is surprisingly simple. Even though the name stands for "HTTP Live Streaming", this technology has also been adopted as the standard way to play video on demand. You take one big video file and break it up into small segments that can be anywhere from 2 to 12 seconds long. So if you have a two-hour-long video broken up into 10-second segments, you would have 720 segments.

Each segment is a file that ends with .ts. They are usually numbered sequentially so you get a directory that looks like this:

  • segments/

    • 00001.ts
    • 00002.ts
    • 00003.ts
    • 00004.ts

The player will download and play each segment as the user streams, and it will keep a buffer of segments ahead in case it loses its network connection later.

Now let's take this simple HLS idea a step further: we can create the segment files at multiple renditions. Working off our example above, for a 2-hour-long video with 10-second segments we can create:

  • 720 segment files at 1080p
  • 720 segment files at 720p
  • 720 segment files at 360p

The directory structure might look something like this:

  • segments/

    • 1080p/

      • 00001.ts
      • 00002.ts
      • 00003.ts
      • 00004.ts
    • 720p/

      • 00001.ts
      • 00002.ts
      • 00003.ts
      • 00004.ts
    • 360p/

      • 00001.ts
      • 00002.ts
      • 00003.ts
      • 00004.ts

Now this starts to get cool: the player playing our HLS files can decide for itself which rendition it wants to consume. To do this, the player tries to estimate the amount of bandwidth available, then makes its best guess as to which rendition it should download and show to you.

The coolest part is that if the amount of bandwidth available to the player changes, the player can adapt quickly. This is called adaptive bitrate streaming.
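To make the idea concrete, here is a toy TypeScript sketch of that decision. Real players use smoothed throughput estimates, buffer occupancy, and more; every name and number below is made up for illustration.

    // Toy rendition picker: choose the highest-bitrate rendition that fits
    // within a safety fraction of the estimated bandwidth.
    interface Rendition {
      name: string;    // e.g. "1080p"
      bitrate: number; // bits per second the rendition consumes
    }

    function pickRendition(
      renditions: Rendition[],
      estimatedBandwidthBps: number,
      safetyFactor = 0.8, // leave headroom so playback doesn't stall
    ): Rendition {
      const budget = estimatedBandwidthBps * safetyFactor;
      const affordable = renditions
        .filter((r) => r.bitrate <= budget)
        .sort((a, b) => b.bitrate - a.bitrate);
      // If nothing fits the budget, fall back to the lowest rendition.
      return affordable[0] ?? renditions.reduce((lo, r) => (r.bitrate < lo.bitrate ? r : lo));
    }

    const renditions: Rendition[] = [
      { name: "1080p", bitrate: 2_300_000 },
      { name: "720p", bitrate: 1_700_000 },
      { name: "360p", bitrate: 900_000 },
    ];

    // With ~2.5 Mbps estimated, the budget is 2 Mbps, so "720p" is chosen.
    console.log(pickRendition(renditions, 2_500_000).name);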

Manifest Files (a.k.a. Playlist Files)

Since the individual segment files are the actual content broken up into little pieces, it is the job of the manifest files (a.k.a. playlist files) to tell the player where to find those segment files.

There are two different kinds of manifest files. For a single video there is one master manifest and multiple rendition manifests. The master manifest file is the first point of contact for the player. For an HTML player in the browser, the master manifest is what would get loaded as the src= attribute on the player. The master manifest will tell the player about each rendition. For example, it might say:

  • I have a 1080p rendition that uses 2,300,000 bits per second of bandwidth. It's using these particular codecs, and the relative path for that manifest file is "manifests/rendition_1.m3u8".
  • I have a 720p rendition that uses 1,700,000 bits per second of bandwidth. It's using these particular codecs, and the relative path for that manifest file is "manifests/rendition_2.m3u8".
  • I have a 360p rendition that uses 900,000 bits per second of bandwidth. It's using these particular codecs, and the relative path for that manifest file is "manifests/rendition_3.m3u8".
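In actual HLS syntax, a master manifest saying roughly that might look like the sketch below. The bandwidth values and paths mirror the example above; the CODECS strings are illustrative placeholders for real codec identifiers.

    #EXTM3U
    #EXT-X-STREAM-INF:BANDWIDTH=2300000,RESOLUTION=1920x1080,CODECS="avc1.640028,mp4a.40.2"
    manifests/rendition_1.m3u8
    #EXT-X-STREAM-INF:BANDWIDTH=1700000,RESOLUTION=1280x720,CODECS="avc1.64001f,mp4a.40.2"
    manifests/rendition_2.m3u8
    #EXT-X-STREAM-INF:BANDWIDTH=900000,RESOLUTION=640x360,CODECS="avc1.64001e,mp4a.40.2"
    manifests/rendition_3.m3u8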

When the player loads the master manifest, it sees all the renditions that are available and picks the best one. To continue this example, let's say our player picks the 1080p rendition because it has enough bandwidth available. So now it's time to load the rendition manifest ("manifests/rendition_1.m3u8").

The rendition manifest looks a lot different from the master. It has some metadata and a link to every individual segment; remember, the segments from up above are the actual pieces of video content. In our example, the 2-hour-long video is broken up into 720 segments of 10 seconds each, so the rendition_1.m3u8 manifest will have an ordered list of all 720 segment files. As one would expect, for a long video this file can get quite large:

  • segments/1080p/00001.ts
  • segments/1080p/00002.ts
  • segments/1080p/00003.ts
  • segments/1080p/00004.ts
  • segments/1080p/00005.ts
  • …and so on, for all 720 segments
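In actual HLS syntax, the start of that rendition manifest might look like this (again a simplified sketch; real manifests carry more tags):

    #EXTM3U
    #EXT-X-VERSION:3
    #EXT-X-TARGETDURATION:10
    #EXT-X-MEDIA-SEQUENCE:1
    #EXTINF:10.0,
    segments/1080p/00001.ts
    #EXTINF:10.0,
    segments/1080p/00002.ts
    #EXTINF:10.0,
    segments/1080p/00003.ts
    # ...and so on for all 720 segments, then:
    #EXT-X-ENDLIST

Each #EXTINF line gives the duration of the segment whose URL follows it, and #EXT-X-ENDLIST marks the video as complete rather than live.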

In summary, these are the steps the player goes through to play a video:

  1. Load the master manifest which has information about each rendition
  2. Find out which renditions are available and pick the best one (based on available bandwidth)
  3. Load the rendition manifest to find out where the segments are
  4. Load the segments and start playback
  5. After playback starts, keep adapting to the available bandwidth; this is where adaptive bitrate streaming comes in
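In a browser, a playback library typically handles all of these steps for you. Here is a minimal sketch using the open-source hls.js library; the manifest URL is hypothetical.

    import Hls from "hls.js";

    const video = document.querySelector("video") as HTMLVideoElement;
    const src = "https://example.com/master.m3u8"; // hypothetical master manifest URL

    if (Hls.isSupported()) {
      // hls.js loads the master manifest, picks a rendition, and fetches segments.
      const hls = new Hls();
      hls.loadSource(src);
      hls.attachMedia(video);
      hls.on(Hls.Events.MANIFEST_PARSED, () => video.play());
    } else if (video.canPlayType("application/vnd.apple.mpegurl")) {
      // Safari and iOS can play HLS natively, straight from the src attribute.
      video.src = src;
    }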

DASH

DASH employs the same strategy as HLS: one video file is broken up into small segments at different resolutions. The format of a DASH playlist file is XML instead of plaintext as with HLS, and the specifics of how the DASH manifest tells the player where to find each segment file are a little different. Instead of linking to each segment specifically, the DASH manifest supplies a "SegmentTemplate" value that tells the player how to calculate the specific link for each segment.
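For example, a trimmed DASH manifest (MPD) using a segment template might look like this sketch, which mirrors our directory layout above (values illustrative):

    <MPD xmlns="urn:mpeg:dash:schema:mpd:2011" type="static">
      <Period>
        <AdaptationSet mimeType="video/mp4">
          <!-- $RepresentationID$ and $Number%05d$ are filled in per request -->
          <SegmentTemplate media="segments/$RepresentationID$/$Number%05d$.m4s"
                           duration="10" timescale="1" startNumber="1"/>
          <Representation id="1080p" bandwidth="2300000" width="1920" height="1080"/>
          <Representation id="720p" bandwidth="1700000" width="1280" height="720"/>
          <Representation id="360p" bandwidth="900000" width="640" height="360"/>
        </AdaptationSet>
      </Period>
    </MPD>

From the template, the player computes segment URLs itself — segments/1080p/00001.m4s, segments/1080p/00002.m4s, and so on — rather than reading an explicit list.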

Whether you use HLS or DASH, the biggest benefit they both bring to the table is that manifest files and segments are delivered over standard HTTP. Having a streaming format that works over standard HTTP means that all of this content can be served by tried-and-true HTTP servers and cached by existing CDN infrastructure. Moving all of this video content around is as simple as sending and receiving HTTP requests.


Adaptive Bitrate Streaming

For both HLS and DASH, since we are creating all these different renditions of our content, players can adapt to the different renditions in real-time on a segment-by-segment basis.

For example, in the beginning the player might have a lot of bandwidth, so it starts streaming at the highest resolution available (1080p). Streaming goes smoothly for the first 5 minutes. Then the internet connection starts to suffer and less bandwidth is available, so the player degrades to 360p for as long as it needs to. As more bandwidth becomes available again, the player ratchets back up to higher resolutions.

All of this resolution switching is entirely up to the player. Using the right player can make a huge difference.

Let's take a real-world example. You open up Netflix on your mobile device and decide to watch The Office for the 11th time in a row. You scroll around, pick your favorite episode (Season 2, Episode 4), and hit play.

Now the player kicks into gear and you see a red spinner. While you watch that spinner, the player is making an HTTP request to Netflix's servers to determine what resolutions are available for this video. Next, the player runs a bandwidth estimation algorithm to get a sense of how strong your internet connection is. Right now you are on a good WiFi connection, so the player starts playing the highest rendition available for your screen size.

As you watch this 23-minute episode of The Office, the player is working hard in the background to keep up with your streaming. Let's say you go for a walk, leave the WiFi, and end up on a cellular network without a strong signal. You may notice that at times the video gets a little blurry for a few minutes, then recovers. Right before the video got blurry, the player determined that there was not enough bandwidth available to keep streaming at a high rendition, so it had two options: (1) buffer, meaning pause the video, show a loading spinner, and make you wait while it downloads more segments, or (2) degrade to a lower resolution so you can keep watching. A good player will pick number 2: it's better to give you a lower resolution than to make you wait.

The player’s goal is always to give you the highest rendition that it can, without making you wait.

MP4 & WebM

MP4 and WebM formats are what we would call pseudo-streaming or "progressive download". These formats do not support adaptive bitrate streaming. If you have ever taken an HTML <video> element and added a "src" attribute that points directly to an MP4 file, then this is what you are doing. When linking directly to a file, most players will progressively download it. The good thing about progressive download is that you don't have to wait for the player to fetch the entire file before you start watching: you can click play and start watching while the file downloads in the background. Most players will also let you drag the playhead to a specific place in the video timeline, and the player will use byte-range requests to estimate which part of the file corresponds to the place you are trying to seek to.
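A byte-range request is plain HTTP: the player asks for just a slice of the file with a Range header, and the server answers with a 206 Partial Content response. The hostname and byte offsets below are made up for illustration.

    GET /video.mp4 HTTP/1.1
    Host: example.com
    Range: bytes=10485760-20971519

    HTTP/1.1 206 Partial Content
    Content-Type: video/mp4
    Content-Range: bytes 10485760-20971519/104857600
    Content-Length: 10485760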

What makes MP4 and WebM playback problematic is the lack of adaptive bitrate support. Every user who watches your content must have enough bandwidth available to download the file faster than they play it back. When using these formats you constantly have to make a tradeoff between serving a high-resolution file that requires more bandwidth (and thus locks out users with lower bandwidth) and serving a lower-resolution file that requires less bandwidth (and thus unnecessarily lowers the quality for high-bandwidth users). This becomes especially important as more and more users stream video from mobile devices on cellular connections. Cellular connections are notoriously inconsistent and flaky, and the way to reliably stream to these devices is through formats that support adaptive bitrate streaming.

Players

There is a nearly infinite number of players on the market to choose from for HLS and DASH. Some are free and open source; some are paid and require a proper license to use. Each player supports different features, for example captions, DRM, ad injection, and thumbnail previews. When choosing a player you will need to make sure it supports the features you need and lets you customize the UI elements enough to control the look and feel. These are decisions you will need to make for web-based playback in the browser, native app playback on iOS and Android, and any other operating system where your content will be streamed.

Delivery

For video delivery, the goal always remains the same: deliver the video segment as fast as possible to avoid buffering and ensure an uninterrupted viewing experience.

As we learned in the Playback section, any video player supporting adaptive bitrate (ABR) has the goal to provide the highest quality video without interruption. For this goal to be achieved, the system that stores the video content must deliver it as fast as possible.

Video content can include segmented delivery files such as TS segments (for older HLS delivery), MP4 fragments, or CMAF chunks for DASH and modern HLS. As for the system that’s responsible for delivering that content, there are two primary components: the origin server and the content delivery network (CDN).

As a video developer, the origin is your source of truth. It’s where you upload your original video files and where other system components like the CDN pull files from to help you deliver your video content faster.

You can deliver video content directly from your origin, but it's not a good idea if your audience is large and dispersed. When this is the case, CDNs help you scale your service and distribute videos to many viewers at higher speeds.

Below, we’ll talk about CDNs and other system components that are responsible for the smooth streaming experiences we’ve all come to expect.

Content Delivery Networks

Before diving into more technical details about CDNs and video streaming, let’s break the CDN concept down into a simple analogy.

A single video server (i.e. origin) responding to numerous requests from streaming devices in multiple regions is like a single cashier responding to numerous requests for purchases from customers in multiple service lines. Just like the cashier experiences stress, so does the video server. And when things come under stress they tend to slow down or shut down completely. The CDN prevents this from happening, especially for video content that is popular among a dispersed audience.


In more technical terms, a CDN is a system of interconnected servers located across the globe that uses geographical proximity as the main criteria for distributing cached content (e.g. segmented video files) to viewers. When a viewer requests content from their device (i.e. clicks a video play button), the request is routed to the closest server in the content delivery network.

If this is the first request for the video segment to that CDN server, the server will forward the request to the origin server where the original file is stored. The origin will respond to the CDN server with the requested file, and, in addition to delivering the file to the viewer, the CDN server will cache (i.e. store) a copy of that file locally. Now, when future viewers request the same file, the origin server is bypassed and the video is served immediately from the local CDN server.
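The flow described above is the classic cache-aside pattern. Here is a toy TypeScript sketch of what a CDN edge server does on each request; the paths and hostnames are hypothetical, and real edge servers also handle expiry, eviction, and origin failures.

    // In-memory cache of segment bytes, keyed by request path.
    const cache = new Map<string, ArrayBuffer>();

    async function serveSegment(path: string, originBase: string): Promise<ArrayBuffer> {
      const cached = cache.get(path);
      if (cached) return cached; // cache hit: the origin is bypassed entirely

      // Cache miss: forward the request to the origin, keep a local copy, serve it.
      const response = await fetch(originBase + path);
      const body = await response.arrayBuffer();
      cache.set(path, body);
      return body;
    }

    // The first request for a segment travels all the way to the origin;
    // every later request for the same path is answered from the local copy.
    serveSegment("/segments/1080p/00001.ts", "https://origin.example.com");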


Considering that the internet is mostly made up of fiber buried underground and underwater, it's easy to see why the location of video servers is important. For example, a viewer in Asia wanting to stream content from an origin server based in the United States will experience poor loading times, as the request has to travel across oceans and continents. But with a CDN, the video is almost always delivered from a CDN server in Asia. These locations that store cached video content are known as points of presence (PoPs).


The business case for improving the viewer experience with CDNs should make sense at this point. By getting video from a local cache, viewers don’t have to wait more than a couple hundred milliseconds for a video they want to watch to start playing. This amount of response time is virtually unnoticeable and creates an experience that incentivizes viewers to stay on the platform where the video is playing.

As for the business case from a cost perspective, delivering video with CDNs significantly reduces bandwidth costs because content doesn't have to travel as far. Delivering content from local caches puts less stress on networks, and the resulting savings in data transmission are passed on to the video provider. Imagine all the network overhead if requests from all over the world were served by a single server!

Another business case deals with uptime and reliability. CDNs have traffic capacity that exceeds most normal enterprise network capabilities. Where a self-hosted video may be unavailable due to unexpected traffic peaks, CDNs are more distributed and remain stable during peak traffic instances. For this reason content delivery networks are also referred to as content distribution networks.

How Video Content Travels

When talking about how content travels (and how the Internet works) it’s common to use the terms first mile, middle mile, and last mile.

First mile: When content travels from the origin to the CDN. For example, video content located on an origin server like Amazon S3 is sent to a CDN like StackPath when it’s requested for the first time by a viewer. This content travels from the origin to the CDN using the Internet backbone as a highway. This backbone is composed of various networks owned by different companies that link up using an agreement called peering.

Middle mile: When content travels from the CDN to the ISP. After the video content is cached in the CDN, the CDN can connect directly to the Internet Service Providers (ISPs) that viewers use to connect to the Internet. Direct connections to tier-1 network carriers like Verizon and Comcast allow the CDN to provide enterprise-grade performance and to route traffic around the network congestion and weather-caused outages that are common on the public Internet.

Last mile: When content travels from the ISP to the end user. At this point, the video content is traveling through buried fiber, telephone lines, or cellular towers to make it to the viewer’s device.

As for how these various pieces of the Internet communicate, the most common language is the HTTP protocol. HTTP is stateless, and video files are fetched with simple GET requests whose responses are cacheable, meaning the content is the same no matter which server along the way answers the request.
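Concretely, a segment request and its cacheable response might look like this (hostname illustrative):

    GET /segments/1080p/00001.ts HTTP/1.1
    Host: cdn.example.com

    HTTP/1.1 200 OK
    Content-Type: video/mp2t
    Cache-Control: public, max-age=31536000

Because any server holding a copy of that segment can return the identical response, caching it at the edge is safe.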

With this in mind, a CDN that has dozens or even hundreds of locations does not need the servers in those locations to communicate with one another to make sure they have the right video content. As long as each server fetches the video content from the origin, every viewer will receive the same video, regardless of their location in the world.

Multi-CDN

In a multi-CDN environment, the goal is to distribute load among two or more CDNs. Multi-CDN enables you to direct user requests to the optimal CDN according to your business needs.

Imagine three CDN providers: A, B, and C. CDN provider A provides an excellent service for users 1 and 2, but CDN provider C offers a better experience for user 3. In a multi-CDN environment, requests by users 1 and 2 can be directed to CDN A, and requests by user 3 can be directed to CDN C.


Depending on the technique, selecting an optimal CDN could be based on a number of criteria: availability, geographic location, traffic type, capacity, cost, performance, or combinations of the above. Services like NS1 make this type of routing possible.
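As a toy illustration of how such a selection might be scored, here is a TypeScript sketch. The criteria, weights, and numbers are all made up; real traffic steering relies on live measurement data.

    // Toy multi-CDN selector: score each available CDN and pick the best.
    interface CdnCandidate {
      name: string;
      available: boolean;
      latencyMs: number; // measured from the viewer's region
      costPerGb: number; // contractual delivery cost in dollars
    }

    function pickCdn(candidates: CdnCandidate[]): CdnCandidate {
      const usable = candidates.filter((c) => c.available);
      if (usable.length === 0) throw new Error("no CDN available");
      // Lower score is better; weight performance ahead of cost.
      const score = (c: CdnCandidate) => 0.8 * c.latencyMs + 0.2 * c.costPerGb * 1000;
      return usable.reduce((best, c) => (score(c) < score(best) ? c : best));
    }

    // For this viewer's measurements, CDN C wins even though A may win elsewhere.
    console.log(
      pickCdn([
        { name: "A", available: true, latencyMs: 120, costPerGb: 0.02 },
        { name: "B", available: true, latencyMs: 95, costPerGb: 0.05 },
        { name: "C", available: true, latencyMs: 40, costPerGb: 0.04 },
      ]).name, // -> "C"
    );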

Multi-CDN is something Mux uses itself. For example, in June 2019, Verizon made an erroneous routing announcement update which channeled a major chunk of Internet traffic through a small ISP in Pennsylvania. This led to significant network congestion and degraded performance for some of the large CDNs in its multi-CDN setup. By using CDN switching, Mux was able to divert most video traffic to StackPath’s CDN and continue to deliver optimal viewing performance during the outage.

Using various content delivery networks, Mux is driving HTTP Live Streaming (HLS) latency down to the lowest possible levels, and partnering with the best services at every mile of delivery is crucial in supporting this continued goal.

Brought to you by StackPath