Working with video call participants’ media tracks

This is part of our social gaming series, in which we walk through key features of our Daily-powered social game, Code of Daily: Modern Wordfare. This is a standalone post. You do not need to read the rest of the series to follow along. But if you’re curious about the broader context of our application, check out part one or the rest of the series.

Introduction

Daily’s Client SDK gives developers a great deal of flexibility when it comes to running their video calls. This enables seamless integration of Daily-powered calls with the UX and branding of the consuming application.

That flexibility also means developers end up doing a bit more handling of their users’ video, audio, and screen tracks in their own code—specifically, retrieving tracks from Daily and keeping them updated on their application’s media DOM elements.

In this post, we’ll take a look at how Daily video and audio features were incorporated into our social game, Code of Daily: Modern Wordfare. Specifically, we’ll focus on handling players’ video and audio with the media tracks Daily provides.

Toggling video in Code of Daily: Modern Wordfare
💡
For a less hands-on video call implementation that requires no track handling on your part, check out Daily Prebuilt, which you can embed straight into your application in just a few lines of code.

Following along

Check out the demo repository to follow along directly on GitHub. I’ll link any relevant reference to specific code here. You can also follow the instructions to clone and run the game locally if you’d like to spin up the application on your own machine.

Media properties of a Daily call participant

Each presence-enabled call participant in Daily can be retrieved through the Daily call object’s participants() instance method. The participant object contains references to that participant’s video and audio tracks.

Daily provides a few pieces of key data in relation to each track. You can find a list of all the track properties Daily provides in our documentation.

In Code of Daily: Modern Wordfare (CoD), we’ve used two main Daily-provided track properties to render players’ video and audio:

  • state: whether the track is playable, blocked, loading, or in some other state
  • persistentTrack: the media track itself
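
As a rough illustration of where these two properties live, here is a minimal sketch that summarizes each participant's video and audio state. The `ParticipantLike` and `TrackInfoLike` interfaces and the `summarizeTracks()` helper are hypothetical names used only for this example; they model just the fields read here, not Daily's full participant type.

```typescript
// Structural subset of a Daily participant, limited to what this sketch reads.
// These interfaces are illustrative assumptions, not Daily's own types.
interface TrackInfoLike {
  state: string;
  persistentTrack?: { id: string; kind: string };
}

interface ParticipantLike {
  session_id: string;
  local: boolean;
  tracks: { video: TrackInfoLike; audio: TrackInfoLike };
}

// Hypothetical helper: builds one summary line per participant.
function summarizeTracks(participants: { [id: string]: ParticipantLike }): string[] {
  return Object.keys(participants).map((key) => {
    const p = participants[key];
    const { video, audio } = p.tracks;
    const hasVideo = video.persistentTrack !== undefined;
    return `${p.session_id}: video=${video.state} (${hasVideo ? "has" : "no"} persistentTrack), audio=${audio.state}`;
  });
}
```

In an application, the argument would be the object returned by the call object's participants() method.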

Using these two properties, we can implement all of the track juggling we need for the social game. We’ll go through how I’ve done that here. But first, let’s take a closer look at what a track actually is.

What is a track?

A Daily track is an instance of MediaStreamTrack. A MediaStreamTrack contains the audio or video data that a call participant is sending to other participants who are subscribed to their tracks. Such tracks can be attached to a MediaStream, which is then assigned as the source of a media DOM element (like an HTMLVideoElement or HTMLAudioElement). Each MediaStream instance can contain one or more MediaStreamTracks. Each media DOM element can contain one source object, which is usually a MediaStream.

Both an HTMLVideoElement and an HTMLAudioElement can be created with document.createElement(), passing "video" or "audio" as the tag name.
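
For reference, here is a minimal sketch of creating both kinds of element with document.createElement(). The `DocumentLike` and `InlineVideoLike` interfaces and the two helper functions are hypothetical, introduced so the snippet does not depend on DOM typings. Setting playsInline is an assumption on my part about what a video tile usually wants, not something CoD necessarily does.

```typescript
// Structural subset of the browser's Document, so this sketch does not
// depend on DOM typings. In a browser, pass the global `document`.
interface DocumentLike {
  createElement(tagName: string): unknown;
}

// Only the property this sketch sets on the created <video> element.
interface InlineVideoLike {
  playsInline: boolean;
}

// Creates a <video> element for a participant tile.
function createVideoElement(doc: DocumentLike): InlineVideoLike {
  const video = doc.createElement("video") as InlineVideoLike;
  // Ask mobile browsers to play inline rather than going fullscreen.
  video.playsInline = true;
  return video;
}

// Creates an <audio> element; it shows no UI unless `controls` is set.
function createAudioElement(doc: DocumentLike): unknown {
  return doc.createElement("audio");
}
```

In the browser, usage would look like `createVideoElement(document)` and `createAudioElement(document)`.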

In CoD we only concern ourselves with a participant’s video and audio tracks. But if your application calls for screen sharing, your tracks could also be screen video and system audio.

With Daily’s Client SDK, developers have direct control over which remote tracks a user is subscribed to. In the case of CoD, since our game sessions are intended for only a small number of players, each player subscribes to every other player’s video and audio automatically.

You might notice that when retrieving a track from a Daily participant, the track object contains two fields that seem quite similar:

  • persistentTrack (use this one!): contains a reference to a track in any state
  • track (avoid): contains a playable track, if one exists

Why use persistentTrack?

Both track and persistentTrack are instances of MediaStreamTrack. However, track is only set when there is a ready-to-be-played media track available. That means you will never find a track in a "blocked", "off", or other non-playable state in this field.

The persistentTrack field contains a track regardless of its playability. This track may be in an "interrupted", "blocked", "loading", or any other non-playable state.

We recommend using persistentTrack in your application because it maintains a track reference regardless of its playability, allowing you to minimize repeatedly resetting tracks on your media DOM elements. Redundantly swapping tracks and streams on your media elements is not recommended for the following reasons:

  • Juggling playable and non-playable tracks on associated DOM elements exposes visual issues like black frames during track interrupts in some browsers.
  • In some browsers (like Safari), swapping audio tracks on the media DOM element can cause audio interruptions and prevent autoplays, especially if the user has the application tab in the background.

We plan to eventually update the track property to behave exactly like persistentTrack, so you may as well future-proof your implementation by using persistentTrack today.

How to retrieve Daily’s media tracks

There are many ways to utilize Daily’s media tracks and your application flow might call for something unique, but let’s go through the approach used in CoD.

First, I added handlers to the "track-started" and "track-stopped" events. These are emitted by Daily when the playability of a track has changed. This can be triggered by a participant toggling their media on or off, device permission changes, or even network hiccups. The handler definition happens in CoD’s Game class, during the initial call setup.

    this.call.registerTrackStartedHandler((p) => {
      const tracks = Call.getParticipantTracks(p.participant);
      try {
        updateMedia(p.participant.session_id, tracks);
      } catch (e) {
        console.warn(e);
      }
    });

    this.call.registerTrackStoppedHandler((p) => {
      const tracks = Call.getParticipantTracks(p.participant);
      try {
        updateMedia(p.participant.session_id, tracks);
      } catch (e) {
        console.warn(e);
      }
    });

When Daily emits a "track-started" event, the handler above retrieves all available participant tracks using the getParticipantTracks() static method on our Call class:

  // getParticipantTracks() retrieves video and audio tracks
  // for the given participant, if they are usable.
  // playableState and loadingState are constants holding the
  // "playable" and "loading" track state strings.
  static getParticipantTracks(p: DailyParticipant): Tracks {
    const mediaTracks: Tracks = {
      videoTrack: null,
      audioTrack: null,
    };

    const tracks = p?.tracks;
    if (!tracks) return mediaTracks;

    // Guard the track info consistently with optional chaining
    const vt = tracks.video;
    const vs = vt?.state;
    if (vt?.persistentTrack && (vs === playableState || vs === loadingState)) {
      mediaTracks.videoTrack = vt.persistentTrack;
    }

    // Only get audio track if this is a remote participant
    if (!p.local) {
      const at = tracks.audio;
      const as = at?.state;
      if (at?.persistentTrack && (as === playableState || as === loadingState)) {
        mediaTracks.audioTrack = at.persistentTrack;
      }
    }
    return mediaTracks;
  }

Above, we retrieve the participant's audio and video tracks from their participant object. If a track is in a "playable" or "loading" (i.e., expected to be playable shortly) state, we set it on our mediaTracks object and return it to the caller (which is our track event-handling function).

If the tracks are not playable, we do not return them, and the local participant will experience the given participant as hidden or muted (we’ll go through how we do that below).

The Tracks type that is returned above looks as follows, with fields for both kinds of track that we care about (video and audio):

export type Tracks = {
  videoTrack: MediaStreamTrack | null;
  audioTrack: MediaStreamTrack | null;
};

After retrieving the participant’s media tracks, our track event handler calls updateMedia(), which is where we’ll actually update the relevant DOM element with the retrieved tracks. Let’s go through how that’s handled.

Handling media tracks

When assigning Daily media tracks to a media DOM element, we have to deal with a few scenarios:

  • Brand new tracks being assigned on a media element that doesn’t yet have any. In CoD, this happens when a new participant first joins the game.
  • The participant sends tracks that differ from those previously set, which means the tracks on our media DOM element need to be updated. For example, this could happen if a move to another SFU (Selective Forwarding Unit) is triggered mid-call.
  • The participant mutes their video or audio and the media DOM element has to be updated to suit (such as hiding the video in favor of a stylized “video-off” background).

A reliable way of setting media tracks on DOM elements

To reliably juggle media tracks on DOM elements, I would propose implementing updateMedia() as follows:

  • If there is no existing MediaStream set as the media DOM element’s source object, construct a new stream and set the available tracks on it
  • Otherwise, check whether the provided video and audio track IDs (obtainable via the MediaStreamTrack’s id property) match those that already exist on the media DOM element’s MediaStream. If either ID does not match, replace just that track, not the entire stream.
  • Additionally, if the newly provided tracks do not contain a valid video track, hide the associated video element to show whatever “cam-off” style you might want for that participant.
A video tile showing a gradient background for participant with their camera off
Gradient background for call participants with their camera off
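
The steps above can be sketched roughly as follows. This is not the CoD implementation: `updateMediaSketch()`, `syncTrack()`, and the `*Like` interfaces are hypothetical names, and the interfaces are structural stand-ins for MediaStreamTrack, MediaStream, and the media element so the snippet stays self-contained. The `createStream` parameter is also an assumption; in a browser it would be `() => new MediaStream()`.

```typescript
// Structural stand-ins for the browser's MediaStreamTrack, MediaStream, and
// HTMLVideoElement, limited to the members this sketch uses.
interface TrackLike { readonly id: string; readonly kind: string; }
interface StreamLike {
  getVideoTracks(): TrackLike[];
  getAudioTracks(): TrackLike[];
  addTrack(track: TrackLike): void;
  removeTrack(track: TrackLike): void;
}
interface MediaElementLike { srcObject: StreamLike | null; }
type TracksLike = { videoTrack: TrackLike | null; audioTrack: TrackLike | null };

// Swaps `next` into the stream only when its id differs from what is there.
function syncTrack(stream: StreamLike, existing: TrackLike[], next: TrackLike | null): void {
  if (!next) return; // keep the old track in place; styling signals "off"
  if (existing.some((t) => t.id === next.id)) return; // unchanged: no-op
  existing.forEach((t) => stream.removeTrack(t));
  stream.addTrack(next);
}

// Returns true when there is a video track to show, false when the caller
// should apply its "cam-off" styling instead.
function updateMediaSketch(
  el: MediaElementLike,
  tracks: TracksLike,
  createStream: () => StreamLike,
): boolean {
  const existing = el.srcObject;
  // A stream is constructed only the first time tracks are set.
  const stream = existing ?? createStream();
  syncTrack(stream, stream.getVideoTracks(), tracks.videoTrack);
  syncTrack(stream, stream.getAudioTracks(), tracks.audioTrack);
  if (!existing) el.srcObject = stream;
  return tracks.videoTrack !== null;
}
```

Returning a boolean rather than touching styles directly is a choice made for this sketch; it leaves the "cam-off" presentation up to the caller.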

You can have a look at the implementation of this approach on GitHub. You will note the usage of the following methods on the MediaStream class in the implementation:

  • getTracks(): Obtaining all existing tracks on the video element’s source object MediaStream
  • getAudioTracks(): Obtaining all audio tracks from the existing source object
  • getVideoTracks(): Obtaining all video tracks from the existing source object
  • removeTrack(): Removing an out-of-date MediaStreamTrack from the existing source object
  • addTrack(): Adding a new MediaStreamTrack to the existing source object

This implementation results in a new MediaStream being constructed only once, when first setting the tracks. From then on, you can replace just the relevant tracks as needed (those that have actually changed). If the new tracks we receive are identical to those that are already set, this results in a no-op. Special styling can be applied to indicate cam-off or mic-off states when the video and audio tracks are null, without removing the old tracks from the MediaStream.

A tempting, but flawed, approach to setting media tracks on DOM elements

A developer might be tempted to implement our updateMedia() method by simply re-creating a MediaStream with the updated media tracks each time new tracks are retrieved from Daily. If no playable tracks are received, you might remove the media DOM element’s srcObject completely since there is nothing to play. In fact, you can see an example of this approach (as well as the updates I made to implement the more robust approach outlined above) in an older version of Modern Wordfare.

The problem with this approach is primarily twofold:

  • We open ourselves up to playback issues when resetting the media source on a DOM element: video flickering and audio hiccups, similar to the kind you might experience when using the playable-only track property we covered above.
  • We perform needless construction and teardown of the media element’s source by repeatedly setting a new MediaStream as the video element’s source. For example, in WebKit, setting srcObject on a media element invokes various loading operations.
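
For contrast only, the tempting approach might look something like the sketch below. It is an illustration of the anti-pattern, not code from the older version of the game; `naiveUpdateMedia()` and the `*Like` interfaces are hypothetical, and `createStream` is assumed to be `(tracks) => new MediaStream(tracks)` in a browser.

```typescript
// Structural stand-ins for MediaStreamTrack, MediaStream, and the media
// element, limited to what this sketch uses.
interface TrackLike { readonly id: string; readonly kind: string; }
interface StreamLike { getTracks(): TrackLike[]; }
interface MediaElementLike { srcObject: StreamLike | null; }
type TracksLike = { videoTrack: TrackLike | null; audioTrack: TrackLike | null };

// Anti-pattern: rebuilds the element's source on every call.
function naiveUpdateMedia(
  el: MediaElementLike,
  tracks: TracksLike,
  createStream: (tracks: TrackLike[]) => StreamLike,
): void {
  const playable: TrackLike[] = [];
  if (tracks.videoTrack) playable.push(tracks.videoTrack);
  if (tracks.audioTrack) playable.push(tracks.audioTrack);

  if (playable.length === 0) {
    el.srcObject = null; // the source is torn down entirely
    return;
  }
  // A brand new stream is assigned even if nothing actually changed.
  el.srcObject = createStream(playable);
}
```

Calling this twice with identical tracks assigns two different streams to the element, which is exactly the redundant source reset described above.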

Consider separate video and audio DOM elements

In CoD, we went with adding all tracks to a single stream on a single HTMLVideoElement for simplicity. This is because our social game currently subscribes to the tracks of every other participant automatically (Daily’s default behavior), and does no hiding or scrolling of certain participants’ tiles.

But there are cases for separating the video and audio into separate DOM elements.

Pagination of video tiles

You might have a use case where certain participants’ video gets hidden behind pagination or similar features. For example, this is the case in our own Daily Prebuilt, which supports large numbers of participants and therefore has to hide some of them on the screen.

In these cases, you can consider assigning a participant's video track to an HTMLVideoElement and the audio track to its own HTMLAudioElement, each with its own respective MediaStream.

This way, even if a call participant scrolls or otherwise goes out of view of the local participant, you can still opt to play their audio to the local user. They don’t have to be seen to be heard.
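
A minimal sketch of that split, assuming one element and one stream per media kind, might look as follows. `updateSplitMedia()`, `setSingleTrack()`, and the `*Like` interfaces are hypothetical names for this example, and `createStream` is assumed to be `() => new MediaStream()` in a browser.

```typescript
// Structural stand-ins for MediaStreamTrack, MediaStream, and the media
// elements, limited to what this sketch uses.
interface TrackLike { readonly id: string; readonly kind: string; }
interface StreamLike {
  getTracks(): TrackLike[];
  addTrack(track: TrackLike): void;
  removeTrack(track: TrackLike): void;
}
interface MediaElementLike { srcObject: StreamLike | null; }
type TracksLike = { videoTrack: TrackLike | null; audioTrack: TrackLike | null };

// Keeps at most one track on the element's own stream, creating the stream
// only once and replacing the track only when its id changes.
function setSingleTrack(
  el: MediaElementLike,
  track: TrackLike | null,
  createStream: () => StreamLike,
): void {
  if (!track) return;
  const existing = el.srcObject;
  const stream = existing ?? createStream();
  const current = stream.getTracks();
  if (!current.some((t) => t.id === track.id)) {
    current.forEach((t) => stream.removeTrack(t));
    stream.addTrack(track);
  }
  if (!existing) el.srcObject = stream;
}

// Video goes to the video element, audio to a dedicated audio element.
function updateSplitMedia(
  videoEl: MediaElementLike,
  audioEl: MediaElementLike,
  tracks: TracksLike,
  createStream: () => StreamLike,
): void {
  setSingleTrack(videoEl, tracks.videoTrack, createStream);
  setSingleTrack(audioEl, tracks.audioTrack, createStream);
}
```

With this shape, the video element can be removed or hidden for pagination while the audio element keeps playing.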

Browser autoplay considerations

Browsers may block autoplay for audio elements, and for video elements that have audio tracks assigned. Automatic playback of audio is commonly blocked if the user has not yet interacted with the page via a gesture (clicking or tapping, for example). You can read more about audio autoplay behavior on MDN.

For this reason, it can be worth considering separating a participant’s media tracks into a separate HTMLVideoElement and HTMLAudioElement. Otherwise, if audio and video are both on a single video element and autoplay is blocked, the end user might see frozen video frames in addition to no audio until the conditions for playing the media are met.
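
One way to detect blocked playback is to call play() yourself and handle the rejected promise, which browsers typically reject with a NotAllowedError when autoplay is not permitted. The sketch below is a generic illustration and not necessarily what CoD does; `playOrNotify()` and `PlayableLike` are hypothetical names, with `PlayableLike` standing in for an HTMLMediaElement.

```typescript
// Structural stand-in for HTMLMediaElement: play() returns a promise that
// rejects when the browser refuses to start playback.
interface PlayableLike {
  play(): PromiseLike<void>;
}

// Tries to start playback. If the browser refuses (for example because the
// user has not interacted with the page yet), calls onBlocked so the app can
// prompt the user for a gesture and retry afterwards.
function playOrNotify(el: PlayableLike, onBlocked: (reason: unknown) => void): void {
  el.play().then(
    () => {
      // Playback started; nothing else to do.
    },
    (reason) => onBlocked(reason),
  );
}
```

In an application, `onBlocked` might reveal a "click to enable audio" button whose click handler calls `playOrNotify()` again.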

💡
In Code of Daily: Modern Wordfare, we decided to forgo the use of the autoplay property completely after running into an issue with especially aggressive autoplay blocking in Safari. If you’re curious, check out the PR containing the implementation and reasoning.


You will likely want to hide the default playback controls associated with an HTMLAudioElement. This can be done by constructing the element through JavaScript and not attaching it to the DOM, or by defining the <audio> tag without the controls attribute.

Wrapping up

In this post, we covered a lot of information about Daily’s media tracks. You now know:

  • What Daily media tracks are
  • The difference between Daily’s track and persistentTrack properties
  • How to retrieve a participant’s tracks
  • How to render the media tracks in the DOM, along with some common pitfalls to avoid
  • When it might make sense to assign all tracks to a single HTMLVideoElement and when you might consider using a dedicated HTMLAudioElement for the audio track

Please reach out if you have any feedback or questions about working with Daily’s participant media tracks.
