WEBVTT

00:00.000 --> 00:12.480
All right, everyone. Good morning. Still the morning. I want to thank Yaron for setting

00:12.480 --> 00:18.320
up the stage today and organizing this Open Media devroom. My name is Romain

00:18.320 --> 00:23.560
Beauxis. I'm a software engineer. I've been working in the field for about 10 years,

00:23.560 --> 00:28.920
a little bit more. And today I'm going to try to talk about how you bridge the gap between

00:28.920 --> 00:35.520
web browser APIs and what you do on the server and the backend. So that's the plan.

00:35.520 --> 00:42.120
So yeah, over the past 20 years the web has become very, very rich in features,

00:42.120 --> 00:48.080
and particularly in media. And it's also becoming the number one software that people really

00:48.080 --> 00:51.560
have installed and where they do most of their work. It's getting more and more rare to

00:51.560 --> 00:57.320
install third-party applications. So we can do a lot of things in browsers, but

00:57.320 --> 01:04.160
not everything can be done. We have a lot of APIs for manipulating media, listed here. But

01:04.160 --> 01:08.440
there are some interesting challenges when you want to do something really state-of-the-art in

01:08.440 --> 01:14.400
browser-specific experiences. And so what I'm going to bring up today are very specific examples

01:14.400 --> 01:19.000
that I've learned in the past six months since I started in my current position, which

01:19.000 --> 01:24.320
I'm going to explain in the next slides. So my background is: I started with media streaming

01:24.320 --> 01:30.320
in college. I was at École Centrale, where VLC was at the time. Jean-Baptiste, who also

01:30.320 --> 01:34.240
will talk after me, and was there too. Then I did some open source contributions to Liquidsoap,

01:34.240 --> 01:39.200
and I'm trying to contribute to FFmpeg. But all of that was my non-professional

01:39.200 --> 01:45.160
career, and then the other part of it was web. I did a lot of web work: building websites,

01:45.160 --> 01:50.120
rich applications, and other things. But recently I started my current role, where I got hired

01:50.120 --> 01:55.120
to work on a web application that is actually also very media-rich, and it was very exciting

01:55.120 --> 01:59.400
to be able to bridge that gap and see how these two worlds can interact. And that's

01:59.400 --> 02:04.200
where I've based all the topics of discussion for this talk.

02:04.200 --> 02:11.600
So Descript, the website that I work for, is an online video editor. It's basically a web

02:11.600 --> 02:18.080
application you can go to. You can upload your files, compose them, add text,

02:18.080 --> 02:23.760
put some effects, and export that as a video. So it's in the same space as Premiere and those

02:23.760 --> 02:30.760
things, but it's all web. It started originally as text-based editing: you would upload an

02:30.760 --> 02:36.760
audio file and edit the text, and the audio would be cut accordingly. It then evolved to video

02:36.760 --> 02:43.640
and gained a lot of features. In recent years, they incorporated an AI research

02:43.640 --> 02:48.520
team and started shipping their own effects that do, you know, background removal,

02:48.520 --> 02:52.800
eye contact so you appear to follow the camera, and those effects are typically done in the backend.

02:52.800 --> 02:56.960
So you're starting to see that we have an application that's very rich, that does a number of

02:56.960 --> 03:02.680
things, and some of it has to be done on the server, some of it on the client.

03:02.680 --> 03:07.120
The other thing I want to highlight is that, specific to the web, a web application

03:07.120 --> 03:11.040
functions very differently from a native application. And one of the things you

03:11.120 --> 03:17.880
can think about is that you have a whole range of execution contexts. You can have different

03:17.880 --> 03:25.400
browsers, a different OS so you can be Chrome on iOS. You can be Firefox on Linux.

03:25.400 --> 03:30.480
You can be Safari on a Mac too. Also, you can have different tabs. You might

03:30.480 --> 03:35.000
want to have the same project open in different tabs or different browsers. Or you're

03:35.000 --> 03:39.240
going to want to collaborate: I'm using Chrome on Linux, but I'm sharing that

03:39.280 --> 03:44.560
project with my colleague who has, you know, Safari on a Mac, and I want both of us to have

03:44.560 --> 03:51.520
a good user experience in the browser, and that's where the real complexity comes in terms

03:51.520 --> 03:57.600
of doing that. So yeah, that's the full stack media editor and how do you make that work

03:57.600 --> 04:03.680
in the context of browsers. So I'm going to focus on a small subset of the features, because

04:03.680 --> 04:09.040
there are a lot of features. I'm going to focus on how you ingest media, display them

04:09.040 --> 04:14.480
to the user and export them at the end. So the first thing we're going to start looking at

04:14.480 --> 04:22.640
is what are the capabilities in a browser to ingest media. So imagine that it's a really

04:22.640 --> 04:28.000
user-oriented application. Most of the users are not software developers. They don't know a lot

04:28.000 --> 04:33.680
about media. They might upload very large files in very exotic formats, and they expect everything

04:33.680 --> 04:40.680
to work very quickly. So we have all these APIs that can be used: WebCodecs. We have

04:40.680 --> 04:46.040
the Web Audio API. We have WebGL for composition. But how do you maintain a consistent

04:46.040 --> 04:51.040
user experience across different browsers, platforms, and collaborative sessions?

04:51.040 --> 05:01.400
So typically, here's one example. Codec support is very different across browsers. And we

05:01.400 --> 05:06.080
want to support all of that. We really want our users to be able to upload their Matroska

05:06.080 --> 05:12.120
files or their MOV files or their AAC, Opus, whatever they have. We're going to support it

05:12.120 --> 05:18.000
in the browser. Unfortunately, the range of things that browsers support is extremely

05:18.000 --> 05:25.160
unpredictable. Chrome on macOS may support things it doesn't support on Linux.

05:25.160 --> 05:32.160
So we're going to receive all these files and have to decide what to do with them. On the

05:32.160 --> 05:37.960
contrary, you can see backend tools like FFmpeg can process pretty much

05:37.960 --> 05:45.680
anything if you compile it correctly. On top of that, the set of codecs that we can use inside

05:45.680 --> 05:49.840
WebCodecs to decode in hardware is even more limited than what the browser might support. Some

05:49.840 --> 05:54.560
of them are supported in Chrome, you see, but not in WebKit. So it's just a lot. But what

05:54.640 --> 05:59.600
we really want to try to do is take advantage of the native decoding APIs when

05:59.600 --> 06:04.920
they exist, so that we can at least leverage the parts of the web APIs that are relevant,

06:04.920 --> 06:12.080
and then use the backend when we can, or when we need to. So it's not just codecs.
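[Editor's note: the fallback logic described here can be sketched roughly as follows. This is an illustration, not the speaker's actual code; the names and the three-way split are assumptions, and in a real application the support tables would be probed at runtime, e.g. via `VideoDecoder.isConfigSupported`.]

```typescript
// Hypothetical sketch: pick a processing path for an uploaded file
// based on what this particular browser reports it can handle.
type DecodePath = "native" | "wasm-demux" | "server-transcode";

interface FileProbe {
  container: string; // e.g. "mp4", "mkv", "mpegts"
  codec: string;     // e.g. "avc1.64001f", "vp9", "prores"
}

interface BrowserSupport {
  containers: Set<string>; // containers the browser can demux natively
  codecs: Set<string>;     // codecs WebCodecs can decode
}

function chooseDecodePath(file: FileProbe, support: BrowserSupport): DecodePath {
  if (support.containers.has(file.container) && support.codecs.has(file.codec)) {
    return "native"; // browser handles demuxing and decoding itself
  }
  if (support.codecs.has(file.codec)) {
    // Codec is decodable but the container is not: demux in wasm,
    // then feed the packets to WebCodecs.
    return "wasm-demux";
  }
  // Neither works: send the file to the backend transcoder.
  return "server-transcode";
}
```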

06:12.080 --> 06:17.760
It's also containers. So remember, you have the way you encode the data on one hand, but

06:17.760 --> 06:23.200
then you have the way you store it in those big boxes: Matroska, MOV, and a lot

06:23.200 --> 06:28.440
of them. And same thing, they're not all supported. And worse than that, there's

06:28.440 --> 06:36.440
actually, I mean, there's actually no support for demuxing in the browser. So let's say

06:36.440 --> 06:41.320
you receive a Matroska file, which is not supported, or an MPEG-TS, which is not supported, but

06:41.320 --> 06:46.200
it still has H.264 in it that you can decode natively in the browser. You still need

06:46.200 --> 06:52.680
to be able to access that packet of data, send it to the browser to decode and then display

06:52.680 --> 06:58.280
the frame back. So how you do that is a question that we're going to address.

06:58.280 --> 07:05.600
That's typically one of the examples where we can bridge gaps between browser APIs and backend APIs.

07:05.600 --> 07:14.160
So first of all, you have to think about the WebCodecs audio decoding and video decoding APIs.

07:14.160 --> 07:20.360
So those are native to the browser. And they use ArrayBuffers, which are basically chunks of

07:20.360 --> 07:28.440
memory that you can decode natively. The same exists for video. That's the one we're

07:28.440 --> 07:32.880
interested in here, and that's what we're going to use. But what we wanted to figure out is how

07:32.880 --> 07:37.680
we can access that chunk of data that's stored in a Matroska file when we don't have

07:37.680 --> 07:43.360
native support for it. And we'll see that in a minute. So yeah, that's one of the

07:43.360 --> 07:49.600
problems we have: your users start sending a lot of files to your application, in very

07:49.600 --> 07:54.640
exotic formats. Some of them we could decode natively, but they are in certain forms of containers

07:54.640 --> 08:00.840
that are not natively supported by the web APIs. How do you bridge that gap? So that's one example

08:00.840 --> 08:07.600
we're going to see right next here. And the solution, as you can see, is to leverage

08:07.600 --> 08:12.880
the wasm compilation of technologies that are essentially backend, but to lift them

08:12.880 --> 08:19.120
into the browser. So you can start executing them, bridging the gap between what the web offers

08:19.120 --> 08:25.440
and what you typically have in the backend, which is FFmpeg's demuxing capabilities, but

08:25.440 --> 08:33.320
this time you lift them into the client. So that was a presentation of the kind of technologies

08:33.320 --> 08:39.000
that are in the browser. So the next, oh yes, sorry, I forgot. The last thing we need to do,

08:39.000 --> 08:43.880
and it's also important, we also need to do rendering. So you have your video, right? And

08:43.880 --> 08:48.760
you're composing those videos, you're adding different layers on it, you want some text,

08:48.760 --> 08:53.160
what we're going to do is leverage WebGL composition. I might go a little bit

08:53.160 --> 08:58.120
faster on that because of time limitations, but that's the last part of the application. And

08:58.120 --> 09:03.240
this also is challenging, particularly because remember, we want to add text, a perfect example

09:03.240 --> 09:09.080
of something that you would think is simple and actually turns out to be pretty hard, because when

09:09.080 --> 09:17.080
you add text, you have things like this. Say we want to be able to do a collaborative session.

09:17.080 --> 09:23.880
So I am working on macOS and I'm adding emoji; my browser renders them with the local system

09:23.880 --> 09:29.560
fonts. But my colleague is on Android or on Windows, has different emoji fonts, and it starts

09:29.560 --> 09:35.560
looking different. How do you handle that too? So I'm going to run through some examples of how we did that;

09:35.560 --> 09:40.200
it's going to have to be a little bit quick because of the time limit, but I hope this will be

09:40.200 --> 09:45.720
insightful; at least I found it interesting when I was working on it. So the first challenge is how do we

09:45.720 --> 09:51.640
seek when the user uploads a file? So let's say I'm uploading a file that comes from a screen

09:51.640 --> 09:57.480
recording or a smartphone video. These files can have a wide range of codec parameters. And one of them

09:57.480 --> 10:02.680
is that they're going to have a very unpredictable keyframe rate. Remember the keyframes? So that's what

10:02.680 --> 10:08.440
I mean: you have keyframes spaced at very different places, and you want to be able to seek very fast.

10:08.440 --> 10:14.040
So the user expectation in the application is: I'm editing my video and I want to jump to 4.8

10:14.040 --> 10:20.600
seconds, then 5.2 seconds, and I want the application to react very quickly. But how do you do that in the way

10:20.840 --> 10:26.440
that's the best possible? What you really have to do is know where the keyframes

10:26.440 --> 10:32.920
are exactly in your media. So if you want to go to 5.2, but your left keyframe is 4.8, you need to

10:32.920 --> 10:39.400
know that that is your keyframe, seek right to it, and decode immediately from there, to

10:39.400 --> 10:44.360
give yourself the best chance to be as fast as possible in your seeking. The browser can't

10:44.360 --> 10:49.560
do that, because it doesn't have demuxing capabilities, as I was saying. So how do you do

10:49.560 --> 10:57.000
that without having the native API? This is a good example. So what you do is you lift up the

10:57.000 --> 11:07.080
wasm build of a backend library, here FFmpeg. So you open your file with FFmpeg. You scan

11:07.080 --> 11:12.520
all the keyframes inside the file so you know that that's where your keyframe is and then you seek

11:12.520 --> 11:18.200
to it using FFmpeg. You move forward, and you get exactly to the frame you want, and that's

11:18.200 --> 11:22.280
the point where you're going to hand it to the browser. So first step, you extract the timings.

11:22.280 --> 11:28.840
Then you seek to that keyframe at 4.8. Second thing, you read the packets until you get to your

11:28.840 --> 11:37.000
frame and then you take over with the native browser API, render it and eventually display it

11:37.160 --> 11:45.160
as a frame on the screen. So that's the last step, as you can see. You have this packet that's

11:45.160 --> 11:51.880
available. You copy it back into your JavaScript memory: that's the packet data. Pass that

11:51.880 --> 11:57.880
to an EncodedVideoChunk and decode it natively. That way you bridge the gap between what is available in

11:57.880 --> 12:04.120
the browser and what is typically available in a backend. Another example, and I'm probably
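[Editor's note: the seek strategy just described can be sketched like this. It's an illustration under assumptions, not the speaker's code: a wasm demuxer is assumed to have already produced the packet list with keyframe flags, and the frame rendering itself (via WebCodecs) is left out.]

```typescript
// Hypothetical sketch: after scanning the file once with a wasm demuxer,
// we hold a sorted index of keyframe timestamps. To display the frame at
// time t, decode from the nearest keyframe at or before t up to t.
interface Packet {
  timestamp: number; // seconds
  isKeyframe: boolean;
}

// Largest keyframe timestamp <= target (binary search over the sorted index).
function nearestKeyframe(keyframes: number[], target: number): number {
  let lo = 0, hi = keyframes.length - 1, best = keyframes[0];
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (keyframes[mid] <= target) { best = keyframes[mid]; lo = mid + 1; }
    else { hi = mid - 1; }
  }
  return best;
}

// The packets that must be fed to the decoder to show `target`.
function packetsForSeek(packets: Packet[], target: number): Packet[] {
  const kfs = packets.filter(p => p.isKeyframe).map(p => p.timestamp);
  const start = nearestKeyframe(kfs, target);
  return packets.filter(p => p.timestamp >= start && p.timestamp <= target);
}
```

So for the talk's example, seeking to 5.2 with a keyframe at 4.8 selects the 4.8 keyframe and everything up to 5.2, and nothing before it.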

12:04.120 --> 12:11.080
just going to have time for that one, is when browser processing is not enough and you want to upload to

12:11.080 --> 12:17.800
a back end infrastructure. So as I was saying, we have the users throwing a lot of media at us and

12:17.800 --> 12:22.760
some of it we can decode and give the kind of experience I just described. Some of it we cannot,

12:22.760 --> 12:28.120
because it's a codec that's not supported, for many reasons. And in this case, what we want to do

12:28.120 --> 12:35.720
is be able to send that to a server and immediately retrieve transcoded media files that

12:35.720 --> 12:39.480
are usable. Typically, they're going to have a high keyframe rate. We're also going to shrink the

12:39.480 --> 12:44.040
resolution, because if they're editing on the screen they don't need a 4K video. And

12:39.480 --> 12:44.040
that's when we can add all the server-side effects: background removal, eye contact, and so on.
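[Editor's note: the resolution-shrinking step can be sketched as below. The 1080-line cap is an assumption for illustration, not a number from the talk; the even-dimension rounding is there because most encoders require even width and height.]

```typescript
// Hypothetical sketch: compute a proxy resolution for editing.
// Keeps the aspect ratio and rounds to even dimensions.
function proxySize(
  width: number,
  height: number,
  maxHeight = 1080, // assumed cap; editing in the browser rarely needs 4K
): [number, number] {
  if (height <= maxHeight) return [width, height]; // already small enough
  const scale = maxHeight / height;
  const even = (n: number) => 2 * Math.round((n * scale) / 2);
  return [even(width), even(height)];
}
```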

12:50.040 --> 12:56.200
The second piece of technology that we use in this case is this scenario, which adds a lot of complexity

12:56.280 --> 13:03.000
to the application. Basically, it's a user-facing application that we want to be friendly. So we really

13:03.000 --> 13:08.360
want things to work as quickly as possible. So the way this scenario works is you upload your files

13:09.000 --> 13:15.000
and you put your files in the web application, and we start displaying them using the native capabilities

13:15.000 --> 13:20.680
of your browser if possible as we just described for seeking. Meanwhile we start uploading in the

13:20.680 --> 13:28.760
background. And when the file is fully uploaded, we are ready to hit the backend server,

13:28.760 --> 13:35.960
and we switch immediately to chunks of media that have been transcoded on demand just for you

13:35.960 --> 13:42.520
and are really in a codec that's supported, in an encapsulation container that's supported,

13:42.520 --> 13:47.160
and then your experience becomes better. But that has a lot of complexity because it means that you have

13:47.320 --> 13:52.600
a first, kind of degraded experience in the application while you're uploading, where we kind of

13:52.600 --> 13:59.080
do what we can, and then we immediately switch to what we can fully do with the application once your file

13:59.080 --> 14:03.960
is uploaded. So that's the second piece: if you can't really bridge the gap in the browser,

14:03.960 --> 14:08.520
you bridge the gap with a backend service that kicks in as early as it can. That's the one

14:08.520 --> 14:14.840
we call the media transform server. Yeah, so that server is implemented and works; it's a tiny

14:15.480 --> 14:24.040
FFI layer on top of the FFmpeg libraries. It produces MP4 segments. Yeah, they're dynamically generated,

14:24.040 --> 14:28.520
because there are a lot of parameters that change as you edit your video, but we're also experimenting with

14:28.520 --> 14:35.720
pre-segmenting for caching. Yeah that's the second piece of the puzzle. And I have one minute
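[Editor's note: pre-segmenting for caching hinges on a deterministic cache key per segment. A rough sketch; the field names are illustrative assumptions, not the actual server's schema:]

```typescript
// Hypothetical sketch: a deterministic cache key for a transcoded segment.
// Every parameter that affects the output bytes must be part of the key,
// or the cache will serve stale or mismatched segments.
interface SegmentRequest {
  sourceId: string;        // e.g. a content hash of the uploaded file
  codec: string;           // output codec, e.g. "h264"
  height: number;          // proxy resolution
  segmentDuration: number; // seconds per segment
  segmentIndex: number;    // which chunk of the timeline
}

function segmentCacheKey(r: SegmentRequest): string {
  // Fixed field order so the same request always yields the same key.
  return [r.sourceId, r.codec, r.height, r.segmentDuration, r.segmentIndex].join("/");
}
```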

14:35.720 --> 14:41.960
to talk about the last piece, which is more of an open question. Remember, I was saying we use

14:42.520 --> 14:51.320
GL composition, with WebGPU or WebGL, to do the final composition of the canvas into a

14:51.320 --> 14:55.960
rendered frame with text and everything. One thing that is still a challenge is how do you do the

14:55.960 --> 15:01.800
composition in a way that, remember, is described in the browser but needs to be rendered in the backend

15:01.800 --> 15:07.560
if possible, because in the backend we may have higher quality. We can do the rendering while

15:07.560 --> 15:11.720
the client is doing something else but the client has been describing their video in the browser.

15:11.720 --> 15:16.600
So when you start rendering it out on the server, you have different fonts, but you also don't have the

15:16.600 --> 15:21.400
web APIs. So one of the challenges that we're thinking about at the moment is how you

15:21.960 --> 15:27.800
translate that, and one of the ideas we have is to use the wgpu stack, because that's a stack

15:27.800 --> 15:34.440
implemented natively in Rust that supports WebGPU, and it targets native Vulkan on Linux but

15:34.440 --> 15:40.840
also WebGPU on the web. So the kind of challenge we're thinking about for the future is: can we rewrite the

15:40.840 --> 15:46.760
rendering so the composition is described in this stack in a way that can be rendered both in the

15:46.760 --> 15:51.960
browser and in the backend, in a way that's consistent and looks the same for everyone? Thank you,

15:51.960 --> 16:06.520
there was a lot to present, so I appreciate your patience. Happy to take questions.

16:06.520 --> 16:19.400
What FFI? There is a Node module that's called ffi-rs. Yeah, that's the FFI we

16:19.400 --> 16:30.200
use to interface with FFmpeg, and it's a tiny Rust layer on top of

16:30.200 --> 16:36.440
libffi that provides, like, a TypeScript-safe way to interact with FFmpeg.

16:36.440 --> 16:44.040
No... yes, the binding itself is native, but all the rest of the stack is TypeScript, so we can

16:44.120 --> 16:55.880
plug right into that. All right, if there are no more questions, we'll yield the floor. Thank you guys

16:55.880 --> 17:08.680
again.

