Be up to date even more - low latency streaming to the rescue

What is latency?

Can you recall a situation when your neighbor was shouting through the wall because they knew a goal had been scored before you did? Most of us have probably experienced it at least once. How is it possible that we are watching the same game but at slightly different times? The answer is latency: the amount of time between the moment something happens live and the moment it is displayed on the device screen. This is commonly known as “glass-to-glass” or “end-to-end” latency. To be specific, the typical latency for cable HD TV is 5 seconds. Below this value, but above 1 second, we are talking about low latency, which appears in game streaming and live sports events. Ultra-low latency drops below 1 second. It is also called near-real-time latency, and it is desired in scenarios where the video must be delivered immediately, e.g. bi-directional interviews, online auctions or real-time device control, as with drones.
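The bands described above can be written down as a small classifier. This is only a sketch: the function name `classify_latency` and the handling of the exact boundaries (1 second counted as low, 5 seconds counted as typical) are assumptions made for illustration.

```python
def classify_latency(seconds: float) -> str:
    """Map a glass-to-glass delay to the bands used in this article."""
    if seconds < 0:
        raise ValueError("latency cannot be negative")
    if seconds < 1.0:
        return "ultra-low"   # near real time
    if seconds < 5.0:
        return "low"         # between 1 s and the typical 5 s
    return "typical"         # cable HD TV territory and above


for delay in (0.4, 3.0, 6.0):
    print(delay, classify_latency(delay))
```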

The delay doesn’t have to be less than 1 second in all cases; it depends on the purpose of the application. For example, streams of previously recorded events can tolerate higher latency, which lowers the risk of packet loss and results in better picture quality. However, when it comes to live streaming, nothing satisfies viewers more than being as up to date with the real event as possible.

Where does latency come from?

The journey that video takes from “glass to glass” consists of a few stages: encoding the video from the raw camera input, packaging and ingest operations, CDN propagation, and then decoding and loading the video into the player buffer (figure 1). Each of them adds latency. Encoding is the process in which raw video content is compressed and converted into a digital file or format. Thanks to it, the content is compatible with different devices and platforms. The main goal of encoding is to achieve a smaller file size. The delay produced here depends on the encoder performance, the streaming protocol used and the output format. The next stage, ingest upload time, depends on the distance between the ingest cluster and the place where encoding happens. CDN propagation time can be reduced by designing the CDN for very fast transfer: the lower the throughput, the bigger the delay. We should also choose the CDN closest to the end user to minimize the delay. Decoding is the opposite of encoding. The role of a decoder is to decompress the encoded video stream so that the raw video can be displayed on the viewer’s screen.

Figure 1.

The biggest latency is created by the player – by media segment length, to be specific. For example, if the segment is 4 seconds long, the player will generate at least 4 seconds of delay from the moment when the first part is requested. Moreover, the common practice is to download at least one additional segment to put it in the buffer before playback starts, which increases the time to see the first frame of the video.
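The arithmetic in this paragraph can be sketched in a few lines. The function name `min_player_delay` and its parameters are hypothetical; the model only counts whole segments that must be downloaded before playback starts and ignores network and decoding time.

```python
def min_player_delay(segment_seconds: float, segments_before_playback: int = 1) -> float:
    """Lower bound on player-side delay, in seconds.

    Every segment that has to be complete before playback starts
    contributes its full duration to the delay.
    """
    if segment_seconds <= 0 or segments_before_playback < 1:
        raise ValueError("need a positive duration and at least one segment")
    return segment_seconds * segments_before_playback


# One 4-second segment requested, then one more buffered before playback.
print(min_player_delay(4.0))      # 4.0
print(min_player_delay(4.0, 2))   # 8.0
```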

Use LL-DASH

Companies all over the world keep looking for ways to push video latency as low as possible. THEOPlayer is one example. They created the HESP protocol, which is not an official standard but sounds promising: they claim that a video can be started at any point in time at the right latency. This is possible thanks to so-called Initialization Streams, which contain the independent frames needed to start playback. Two official protocols that support ultra-low latency are LL-DASH and LL-HLS. In this article, I want to touch on only the first one. DASH stands for Dynamic Adaptive Streaming over HTTP. It uses the adaptive bitrate (ABR) technique, which dynamically picks the highest video quality that the user’s bandwidth allows.

DASH breaks the video content into smaller parts called segments and sends them to the client over HTTP. These parts are available in different qualities, because a single source is encoded at multiple bit rates. The player picks the highest quality that can be downloaded and played without stalls or re-buffering, and it can switch between encodings to keep the video playing continuously. So, when a user’s Internet connection is poor, they receive the video in a lower quality until the connection gets better. Watching a video without interruptions in a worse quality pays off more than waiting for ages for the highest quality to load.
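The selection rule described above can be sketched as follows. This is a minimal illustration and not how any particular player implements ABR: the function name `pick_bitrate` and the `safety` margin are assumptions, and the two bitrates are simply the `bandwidth` values from the manifest shown later in this article.

```python
from typing import Sequence


def pick_bitrate(available_bps: Sequence[int], measured_bps: float, safety: float = 0.8) -> int:
    """Return the highest encoding that fits the measured bandwidth.

    `safety` keeps some headroom so that small throughput dips do not
    stall playback (an assumed heuristic, not part of DASH itself).
    """
    budget = measured_bps * safety
    fitting = [bps for bps in available_bps if bps <= budget]
    # If nothing fits, fall back to the lowest quality instead of stalling.
    return max(fitting) if fitting else min(available_bps)


encodings = [1_500_000, 800_000]
print(pick_bitrate(encodings, 2_000_000))  # good connection
print(pick_bitrate(encodings, 1_200_000))  # weaker connection
```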

The main role in DASH is played by its manifest, the Media Presentation Description (MPD). It is an XML document that contains information about media segments, how they are related, the bandwidths associated with the streams, and metadata such as mimeType, codecs, chunk byte ranges, video resolutions and more. In other words, it tells the player what has to be downloaded to play the video correctly. The manifest also holds information about when a segment will be available on the server. Thanks to that, the client can tell when segments become ready to download and request each one as soon as it is available, which reduces the delay a little.
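To make the availability idea concrete, here is a rough sketch of how a client could work out when a segment becomes downloadable. It assumes fixed-length segments and a known stream start time, and it ignores period offsets and any other timing attributes a real manifest may carry. The function name `segment_available_at` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def segment_available_at(stream_start: datetime, segment_seconds: float,
                         number: int, start_number: int = 1) -> datetime:
    """Time at which segment `number` is complete on the server.

    In this simplified model a segment can be requested only once it
    has been fully produced, i.e. at the end of its own time span.
    """
    index = number - start_number
    if index < 0:
        raise ValueError("segment number is before the first segment")
    return stream_start + timedelta(seconds=segment_seconds * (index + 1))


start = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)  # assumed start time
print(segment_available_at(start, 2.0, 1))  # 2 s after the stream starts
print(segment_available_at(start, 2.0, 3))  # 6 s after the stream starts
```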

The manifest structure is shown below and in figure 1 inside the packager stage. A Period includes Adaptation Sets, each of which can hold several Representations that differ in resolution (the width and height attributes). Inside each Representation, we can see the Segments.

<Period id="0" start="PT0.0S">
	<AdaptationSet id="0" contentType="video" startWithSAP="1" segmentAlignment="true" bitstreamSwitching="true" frameRate="30/1" maxWidth="1280" maxHeight="720" par="16:9">
		<Representation id="0" mimeType="video/mp4" codecs="avc1.7a001f" bandwidth="1500000" width="1280" height="720" sar="1:1">
			<SegmentTemplate timescale="15360" initialization="init-stream$RepresentationID$.m4s" media="chunk-stream$RepresentationID$-$Number%05d$.m4s" startNumber="1">
				<SegmentTimeline>
					<S t="0" d="128000" />
				</SegmentTimeline>
			</SegmentTemplate>
		</Representation>
		<Representation id="1" mimeType="video/mp4" codecs="avc1.7a001f" bandwidth="800000" width="960" height="540" sar="1:1">
			<SegmentTemplate timescale="15360" initialization="init-stream$RepresentationID$.m4s" media="chunk-stream$RepresentationID$-$Number%05d$.m4s" startNumber="1">
				<SegmentTimeline>
					<S t="0" d="128000" />
				</SegmentTimeline>
			</SegmentTemplate>
		</Representation>
	</AdaptationSet>
</Period>
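As a rough illustration of how a player might read this fragment, the sketch below parses a trimmed copy of it with Python's standard library, lists the Representations, converts the segment duration to seconds and expands the `media` template into the first segment name. The helper names `expand_template` and `describe` are hypothetical. A complete MPD normally declares an XML namespace, which would require namespace-aware lookups that this sketch leaves out.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed copy of the fragment above (attributes not used here are dropped).
MPD_FRAGMENT = """
<Period id="0" start="PT0.0S">
  <AdaptationSet id="0" contentType="video">
    <Representation id="0" bandwidth="1500000" width="1280" height="720">
      <SegmentTemplate timescale="15360" media="chunk-stream$RepresentationID$-$Number%05d$.m4s" startNumber="1">
        <SegmentTimeline><S t="0" d="128000" /></SegmentTimeline>
      </SegmentTemplate>
    </Representation>
    <Representation id="1" bandwidth="800000" width="960" height="540">
      <SegmentTemplate timescale="15360" media="chunk-stream$RepresentationID$-$Number%05d$.m4s" startNumber="1">
        <SegmentTimeline><S t="0" d="128000" /></SegmentTimeline>
      </SegmentTemplate>
    </Representation>
  </AdaptationSet>
</Period>
"""


def expand_template(template: str, rep_id: str, number: int) -> str:
    """Fill in $RepresentationID$ and $Number$ (with optional %0Nd width)."""
    def substitute(match: "re.Match[str]") -> str:
        name, fmt = match.group(1), match.group(2)
        if name == "RepresentationID":
            return rep_id
        return (fmt % number) if fmt else str(number)

    return re.sub(r"\$(RepresentationID|Number)(%0\d+d)?\$", substitute, template)


def describe(period_xml: str) -> list:
    """Summarise each Representation found in a Period fragment."""
    period = ET.fromstring(period_xml.strip())
    summary = []
    for rep in period.iter("Representation"):
        template = rep.find("SegmentTemplate")
        first = template.find("SegmentTimeline/S")
        summary.append({
            "id": rep.get("id"),
            "resolution": f'{rep.get("width")}x{rep.get("height")}',
            "bandwidth": int(rep.get("bandwidth")),
            # d is expressed in timescale units: 128000 / 15360 is about 8.33 s
            "segment_seconds": int(first.get("d")) / int(template.get("timescale")),
            "first_url": expand_template(template.get("media"), rep.get("id"),
                                         int(template.get("startNumber"))),
        })
    return summary


for entry in describe(MPD_FRAGMENT):
    print(entry)
```

Note that the duration comes out of `d` and `timescale` together: neither attribute means much on its own.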

One segment is usually 2 to 10 seconds long. With DASH (not LL-DASH), the whole segment must be ready before its first frame can be played, which causes a delay. The longer the segment, the higher the latency. Shortening the segments is not a good way to lower the latency: the delay would be smaller, but at the cost of quality. Fortunately, CMAF comes to the rescue. The Common Media Application Format is a container format that simplifies streaming by providing one common container that works across platforms and devices. It holds audio and video content. It provides interoperability between HLS and DASH, which means that the same stream can be played on Apple as well as on Microsoft devices (figure 2). Before CMAF, one video had to be encoded and stored twice, because Apple software required HLS to play a video whereas Microsoft required DASH.

Figure 2.

The second crucial thing about CMAF is that it works with chunked transfer encoding. Video segments can be cut into smaller chunks (figure 3). A moof is a Movie Fragment Box, which contains information about the media streams included in a single fragment, for example the timestamp. An mdat is a Media Data Box, which stores the binary codec data.
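The moof/mdat layout can be illustrated with a tiny box walker. Boxes in this container family start with a 4-byte big-endian size (which includes the header) followed by a 4-character type. The sketch below handles only that basic header form, and the payloads are placeholder bytes rather than real video. The helper names `make_box` and `iter_boxes` are hypothetical.

```python
import struct


def make_box(box_type: str, payload: bytes) -> bytes:
    """Build a box: 4-byte size, 4-byte type, then the payload."""
    return struct.pack(">I4s", 8 + len(payload), box_type.encode("ascii")) + payload


def iter_boxes(data: bytes):
    """Yield (type, payload) for each top-level box in `data`."""
    offset = 0
    while offset + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, offset)
        if size < 8 or offset + size > len(data):
            raise ValueError("unsupported or truncated box")
        yield box_type.decode("ascii"), data[offset + 8:offset + size]
        offset += size


# A segment made of two chunks; each chunk is a moof followed by an mdat.
segment = (make_box("moof", b"meta-1") + make_box("mdat", b"frames-1")
           + make_box("moof", b"meta-2") + make_box("mdat", b"frames-2"))

box_types = [box_type for box_type, _ in iter_boxes(segment)]
print(box_types)
```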

Figure 3.

How is such a low latency achieved?

Each chunk is independent of the others and is sent to the player individually. Thanks to this setup, the player doesn’t have to wait for the whole segment to be completed, because chunks are delivered as soon as they are available using chunked transfer encoding. The player can therefore play chunks of a segment that is not finished yet, while its remaining chunks are still coming. That’s the gist of low latency. The overhead is reduced, as is the client’s buffer and, consequently, the latency. One disadvantage of this solution is the risk that playback will be unstable: with a small buffer, only a small amount of content is loaded and ready to play.

Let’s imagine that a camera is currently capturing the sixth video segment, and that each segment is 2 seconds long. With standard DASH, we as viewers see the third segment on our screens, because the player keeps a buffer three segments long. Moving to LL-DASH with 500 ms chunks makes the user experience far better: end users can watch the sixth video segment while the camera at the live event is still capturing it.
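The scenario can be checked with a little arithmetic. The sketch below assumes the camera is 11.9 seconds into the stream (late in the sixth segment, which spans 10 to 12 seconds) and that the LL-DASH player buffers three 500 ms chunks. Both numbers are assumptions chosen for illustration, as is the function name `segment_on_screen`.

```python
import math


def segment_on_screen(capture_time_s: float, delay_s: float, segment_s: float = 2.0) -> int:
    """1-based index of the segment the viewer sees at a given moment."""
    playback_time = capture_time_s - delay_s
    if playback_time < 0:
        raise ValueError("playback has not started yet")
    return math.floor(playback_time / segment_s) + 1


capture_time = 11.9              # assumed: camera is late in the sixth segment
standard_delay = 3 * 2.0         # buffer of three 2-second segments
low_latency_delay = 3 * 0.5      # assumed buffer of three 500 ms chunks

print(segment_on_screen(capture_time, 0.0))                # segment being captured
print(segment_on_screen(capture_time, standard_delay))     # standard DASH viewer
print(segment_on_screen(capture_time, low_latency_delay))  # LL-DASH viewer
```

With these numbers the standard DASH viewer is three segments behind the camera, while the LL-DASH viewer is on the same segment. Near the start of a segment, the LL-DASH viewer may briefly still be on the previous one.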

Figure 4.

As we can see, the player is responsible for the largest share of the delay. By cutting video segments into smaller portions, we can reduce the latency several times over.

Low latency is always a tradeoff between the most up-to-date video and stream stability. Ultra-low latency is even stricter, because the delay must be 1 second or less. The key to efficient streaming is to find the right balance between the purpose of the video and the acceptable delay.