WEBVTT

00:00.000 --> 00:23.280
Hello, everyone. Welcome. This is a talk about the new hardware for FOSDEM, for the video

00:23.280 --> 00:29.600
infrastructure. I am Martin Brahm, and I am Ankel; everyone in the team calls me Dexter,

00:29.600 --> 00:37.760
and now Martin will start presenting what we did this year to move to a new revision of

00:37.760 --> 00:48.480
the video box. So this is the block diagram of the audio hardware that has been used last

00:48.480 --> 00:54.080
year at FOSDEM and also this year in all rooms except here, because this is an experiment

00:54.080 --> 01:01.200
right now. Always test in production, it's the right way. In this block diagram of the audio

01:01.200 --> 01:08.960
hardware we have two SGTL5000 audio codecs, and they both have two inputs and two outputs,

01:08.960 --> 01:13.520
and they worked fine for audio last year at FOSDEM, but they have a very

01:14.320 --> 01:20.240
small dynamic range, and that is a bit problematic: if the gain setting is off, it either clips

01:20.240 --> 01:28.800
all the microphones or there are a lot of noise issues. So we decided to make a slight redesign to

01:28.800 --> 01:37.120
switch to some newer audio codecs on the same audio board, and these are very nice 32-bit audio

01:37.120 --> 01:43.040
codecs that give us so much dynamic range that we don't really have to do any analog gain

01:43.120 --> 01:56.320
anymore to set the correct audio levels for whatever you plug in. This is also split up into separate

01:56.320 --> 02:02.080
ADC chips and DAC chips, because we found out that you can get chips that do everything at once,

02:02.080 --> 02:07.680
but they do everything badly, whereas chips that do one thing do it well; it's like

02:07.680 --> 02:19.360
Linux. We also changed the audio front end. This is a schematic of one of the inputs on the new

02:19.360 --> 02:26.160
audio hardware, and this allows us to have basically only passives in the audio path, which makes

02:26.160 --> 02:32.000
the design a lot easier. We now have phantom power, so we can use any microphone and not just the

02:32.080 --> 02:40.080
wireless receivers from FOSDEM. And this is a lot simpler compared to the old board;

02:40.080 --> 02:45.280
it has fewer parts, but these are all op-amps and they are a bit annoying to power.

02:47.520 --> 02:56.400
The only reason the old board had fewer parts than the new one is that it had a lot fewer features.

02:56.480 --> 03:04.800
The old one was hard-wired to only support the specific gain that we use for the default

03:04.800 --> 03:13.920
FOSDEM microphones' output, and so those op-amps' gains are fixed. But for the new one, the

03:13.920 --> 03:24.720
chips that we use are both cheaper than this front end, and they also have integrated programmable

03:25.360 --> 03:35.840
amplification at the input stage, so you can have the firmware set the gain programmatically,

03:36.560 --> 03:48.720
and so the new front end basically does nothing at all in terms of gain. The only

03:49.680 --> 03:56.960
part that has lots of components here actually just feeds phantom power into the input

03:56.960 --> 04:04.400
if that's needed, for the case when we want to use a condenser mic or something like that,

04:05.600 --> 04:11.920
like a non-standard setup, in order to make this more portable and even usable for other conferences.

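The claim that high-resolution converters plus a digitally set gain can replace analog gain staging can be sanity-checked with some decibel arithmetic. A rough sketch with ideal converters; the function names and figures are ours, not from the talk:

```go
package main

import (
	"fmt"
	"math"
)

// idealDynamicRangeDB returns the theoretical dynamic range of an ideal
// n-bit converter: 20*log10(2^n), i.e. about 6 dB per bit.
func idealDynamicRangeDB(bits int) float64 {
	return 20 * float64(bits) * math.Log10(2)
}

// digitalGainDB converts a linear digital gain factor to decibels, to show
// how much level correction can happen after the ADC instead of in analog.
func digitalGainDB(factor float64) float64 {
	return 20 * math.Log10(factor)
}

func main() {
	for _, bits := range []int{16, 24, 32} {
		fmt.Printf("%2d-bit converter: ~%.0f dB dynamic range\n", bits, idealDynamicRangeDB(bits))
	}
	// Boosting a quiet source 16x digitally costs about 24 dB of range,
	// which the extra bits of a 32-bit converter cover comfortably.
	fmt.Printf("16x digital gain = %.1f dB\n", digitalGainDB(16))
}
```

A 16-bit path offers roughly 96 dB, a 32-bit path roughly 192 dB, so digitally boosting or cutting a mismatched input leaves plenty of headroom over the analog noise floor.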
04:12.400 --> 04:20.800
But the audio board is not the only thing we changed, because if we redo all the audio

04:20.800 --> 04:27.040
and do everything correctly, nobody will notice a thing. But we also decided to replace

04:27.040 --> 04:35.360
the way video mixing is done, which is a bit more complicated. And that is this setup: this is currently

04:35.440 --> 04:40.800
how it's done at FOSDEM. There are two boxes, you might have seen them; they are the black and

04:40.800 --> 04:46.320
wood-colored boxes in the front and back of the room. One is capturing the video from the camera

04:46.320 --> 04:52.240
and one from the laptop of the person giving the presentation, and these streams are sent over the

04:52.240 --> 04:59.680
network, through the cable link down here, to a rack of laptops which mixes both feeds together

04:59.680 --> 05:09.600
and sends it to the internet, which is the cloud. So that's basically three devices for every

05:09.600 --> 05:15.760
single stream: one video box here, one video box at the back with the camera, and one

05:15.760 --> 05:21.840
laptop that mixes the two streams that come over the network from the room.

05:22.000 --> 05:33.680
So this is an experiment I made, now one and a half years ago.

05:34.240 --> 05:39.440
I decided it would be fun to learn Go by doing some complicated graphics programming.

05:40.080 --> 05:45.920
This is the first running version of a software video mixer that should be able to do everything

05:46.000 --> 05:54.000
that the three boxes do, in one setup. And here it is: that image is the wallpaper of

05:54.000 --> 06:01.040
my laptop, captured over HDMI, and that is the internal camera of the laptop, and this is all

06:01.040 --> 06:11.280
being composited together in OpenGL and output. And this is basically all that is really needed

06:11.280 --> 06:24.720
to make video for FOSDEM. But this was very slow, the way I designed it. As a novice in

06:24.720 --> 06:29.760
Go programming, I thought it would be a great idea to just capture the frames in a goroutine,

06:30.560 --> 06:35.680
send them over a channel to the rendering thread, and then render them. But that

06:35.760 --> 06:42.560
involves copying a lot of data, and copying data in video streams is slow. So that's when

06:42.560 --> 06:48.960
Dexter helped, because Dexter actually knows how to do Go programming. Yes, well, at least better than

06:48.960 --> 06:55.520
me. And then we spent a lot of time refactoring and optimizing it, so we don't have any copies anymore.

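The copy-free refactor described here can be sketched in Go. The idea is to stop sending pixel data by value over channels and instead circulate pointers to a small fixed pool of buffers; the type and function names are illustrative, not the actual Fazantix code:

```go
package main

import "fmt"

// Frame is a reusable video buffer. Names are illustrative, not the actual
// Fazantix types.
type Frame struct {
	Seq  int
	Data []byte
}

// runPipeline models the refactored design: the capture goroutine and the
// render loop exchange *pointers* into a small fixed pool of buffers.
// "filled" carries frames to be rendered; "free" recycles them, so no pixel
// data is ever copied between goroutines.
func runPipeline(nBuffers, nFrames int) (rendered, distinctBuffers int) {
	free := make(chan *Frame, nBuffers)
	filled := make(chan *Frame, nBuffers)
	for i := 0; i < nBuffers; i++ {
		free <- &Frame{Data: make([]byte, 16)}
	}
	go func() {
		for seq := 0; seq < nFrames; seq++ {
			f := <-free // reuse a recycled buffer instead of allocating
			f.Seq = seq
			filled <- f
		}
		close(filled)
	}()
	seen := map[*Frame]bool{}
	for f := range filled {
		seen[f] = true // track which backing buffers were actually used
		rendered++
		free <- f // hand the buffer back for the next capture
	}
	return rendered, len(seen)
}

func main() {
	rendered, buffers := runPipeline(4, 100)
	fmt.Printf("rendered %d frames using %d buffers\n", rendered, buffers)
}
```

However many frames flow through, only the fixed pool of buffers is ever touched; the channels move eight-byte pointers, not megabytes of pixels.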
06:55.920 --> 07:07.760
yeah so in the current version of the mixing pipeline we basically

07:09.760 --> 07:16.240
use the Video4Linux driver, and basically, for the Video4Linux driver, what you have to do is

07:16.240 --> 07:23.520
you hand it a buffer, it fills it at some point, and then it gives you an indication that this

07:23.600 --> 07:29.600
buffer is ready to be consumed while it fills the next buffer, and you have like a circular

07:29.600 --> 07:39.840
queue of buffers. And then what we did was, we have these frame forwarder objects that keep track

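The circular buffer queue just described follows the Video4Linux streaming I/O pattern (enqueue a ring of buffers, the driver fills them, the application dequeues, uses, and re-enqueues). A toy model of that hand-off, with a mock driver in place of the real ioctl calls:

```go
package main

import "fmt"

// ring is a toy model of the V4L2 streaming I/O buffer queue. The real
// driver is driven with VIDIOC_QBUF/VIDIOC_DQBUF ioctls; here a mock
// "driver" just fills buffers in order.
type ring struct {
	bufs [][]byte
	next int // next buffer the mock driver will fill
}

func newRing(n, size int) *ring {
	r := &ring{bufs: make([][]byte, n)}
	for i := range r.bufs {
		r.bufs[i] = make([]byte, size)
	}
	return r
}

// dequeue simulates VIDIOC_DQBUF: the driver hands back the index of the
// buffer it has just filled; ownership passes to the application.
func (r *ring) dequeue(frame byte) int {
	i := r.next
	r.bufs[i][0] = frame // stand-in for a real capture
	r.next = (r.next + 1) % len(r.bufs)
	return i
}

// requeue simulates VIDIOC_QBUF: the application gives the buffer back to
// the driver once it is done with it (a no-op in this mock).
func (r *ring) requeue(i int) {}

func main() {
	r := newRing(4, 16)
	for frame := byte(0); frame < 6; frame++ {
		i := r.dequeue(frame)
		fmt.Printf("frame %d arrived in buffer %d\n", frame, i)
		// ...upload r.bufs[i] to the GPU here, then:
		r.requeue(i)
	}
}
```

With four buffers, frame indices wrap around (0, 1, 2, 3, 0, 1, ...), which is exactly the circular queue the talk describes.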
07:39.840 --> 07:48.240
of each such buffer, and what we do is: the buffer that comes from Video4Linux, the one that

07:49.120 --> 07:58.320
has just now been filled, gets sent directly to the GPU from the other thread via

07:59.360 --> 08:09.520
a texture upload, and we get the content of the buffer with just a single copy operation into GPU

08:09.520 --> 08:19.680
memory. And then with the shader we can do whatever we like in terms of processing; namely, because

08:19.680 --> 08:26.800
our capture cards output YUYV data, we do that conversion in the shader. This is probably

08:29.600 --> 08:37.440
quite a stupid idea, but it works in practice: we just use a shader to do all of the conversions

08:37.440 --> 08:48.800
in the entire video mixing process. And then we have two output streams; one of them goes out into a

08:50.720 --> 08:58.080
window, like a regular OpenGL window, that gets displayed here on the projector. So

08:58.560 --> 09:07.760
these frames you're seeing on the stream on the projector right now are being displayed by

09:07.760 --> 09:18.080
this window. And then we have other render passes that run the same pipeline for other targets;

09:18.080 --> 09:26.480
namely, the other target we use for FOSDEM is the stream that gets sent over the network.

09:26.480 --> 09:36.000
That's basically the composite of the stream from the laptop, the stream from the camera, and the

09:36.000 --> 09:45.520
nice FOSDEM logo and all that, and that gets basically fed into the standard input of

09:45.520 --> 09:57.120
FFmpeg, and FFmpeg just encodes and streams that over the network. And our most challenging aspect

09:57.120 --> 10:07.120
of this has been optimizing this video mixing system in terms of memory bandwidth used, because

10:07.680 --> 10:18.320
in these boxes, the CPU in them is the Intel N100; it actually has a pretty good

10:18.320 --> 10:25.280
Intel graphics GPU, but the bottleneck there is the memory bandwidth to

10:26.880 --> 10:34.240
the memory shared between the GPU and the CPU, because that's only

10:34.480 --> 10:45.360
six gigabits per second, and we've had to optimize everything just to lower the memory bandwidth usage

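A quick back-of-the-envelope calculation shows why a budget of a few gigabits per second is tight for uncompressed video. The resolutions and frame rates below are illustrative assumptions, not measurements from the actual box:

```go
package main

import "fmt"

// streamGbps returns the raw bandwidth of an uncompressed video stream:
// every captured or composited frame crosses the CPU/GPU shared memory at
// least once.
func streamGbps(width, height, bytesPerPixel, fps int) float64 {
	bytesPerSecond := float64(width * height * bytesPerPixel * fps)
	return bytesPerSecond * 8 / 1e9
}

func main() {
	yuyv := streamGbps(1920, 1080, 2, 60) // capture format: 2 bytes/pixel
	rgba := streamGbps(1920, 1080, 4, 60) // after conversion: 4 bytes/pixel
	fmt.Printf("1080p60 YUYV: %.2f Gbit/s\n", yuyv)
	fmt.Printf("1080p60 RGBA: %.2f Gbit/s\n", rgba)
	// Two inputs plus two outputs already approach a budget of a few
	// gigabits per second of CPU<->GPU bandwidth.
	fmt.Printf("2 YUYV inputs + 2 RGBA outputs: %.2f Gbit/s\n", 2*yuyv+2*rgba)
}
```

A single 1080p60 YUYV stream is about 2 Gbit/s and an RGBA one about 4 Gbit/s, so avoiding even one redundant pass over a frame is a significant saving.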
10:46.800 --> 10:57.920
at the cost of more CPU and more GPU usage. Yeah, the video mixer is also set up pretty modularly,

10:58.880 --> 11:05.680
with several ways to input and output data. The most important one, the first one I wrote,

11:05.680 --> 11:11.600
is the Video4Linux source that just directly asks the kernel for frames; that was relatively easy

11:11.600 --> 11:19.280
to implement. We have an image source that is used to display the little background behind the

11:19.280 --> 11:25.520
two screens, which is always visible, and we have an FFmpeg source, because it's very convenient

11:25.520 --> 11:31.760
to be able to input anything that FFmpeg supports, because FFmpeg supports a lot of different

11:31.760 --> 11:38.960
types of input, and if something isn't supported, you can compile a different one. Also, not being used right

11:38.960 --> 11:44.000
here now, there's Open Media Transport support; that's a relatively new network protocol that

11:44.000 --> 11:50.560
can stream high-quality, basically relatively uncompressed video with low latency over the network.

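This modularity of sources (V4L2, images, FFmpeg, OMT, HTML, PDF) and sinks (OpenGL window, FFmpeg, OMT) can be sketched as a pair of interfaces. These are hypothetical names for illustration; the real Fazantix types are likely shaped differently:

```go
package main

import "fmt"

// Source produces frames: V4L2, image files, FFmpeg, OMT, HTML, PDF...
type Source interface {
	Name() string
	NextFrame() []byte
}

// Sink consumes composited frames: OpenGL window, FFmpeg, OMT...
type Sink interface {
	Name() string
	Write(frame []byte)
}

// testPattern is a trivial Source emitting one gray RGBA pixel per frame.
type testPattern struct{}

func (testPattern) Name() string      { return "test-pattern" }
func (testPattern) NextFrame() []byte { return []byte{0x80, 0x80, 0x80, 0xFF} }

// logSink is a trivial Sink that counts and reports frames.
type logSink struct{ frames int }

func (*logSink) Name() string { return "log" }
func (s *logSink) Write(frame []byte) {
	s.frames++
	fmt.Printf("wrote frame of %d bytes\n", len(frame))
}

// mixOnce pulls one frame from every source and fans the result out to
// every sink; the real mixer composites in a shader in between.
func mixOnce(sources []Source, sinks []Sink) {
	for _, src := range sources {
		f := src.NextFrame()
		for _, snk := range sinks {
			snk.Write(f)
		}
	}
}

func main() {
	out := &logSink{}
	mixOnce([]Source{testPattern{}}, []Sink{out})
}
```

New inputs or outputs then only need to satisfy the interface, which is what makes dropping in a PDF renderer or an OMT output straightforward.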
11:51.520 --> 12:00.080
And I added an HTML renderer, because it's nice to display graphics. And, well, kind of last

12:00.080 --> 12:04.640
minute, I also added a PDF renderer. It's not in here yet, but that's currently being used to display

12:04.640 --> 12:11.200
this PDF. And because this presentation is also running in Fazantix on my laptop,

12:11.200 --> 12:16.560
being sent to the box running Fazantix that renders it for the projectors, we are

12:16.560 --> 12:23.680
nesting Fazantix mixers here. And for the outputs, we have the OpenGL window, which is

12:23.680 --> 12:29.280
what you're currently seeing, and FFmpeg again as the tool for streaming, and we have

12:29.280 --> 12:34.320
another output to also send out Open Media Transport streams; that's also currently not being used.

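The FFmpeg output mentioned above works by writing raw frames into FFmpeg's standard input, as described earlier in the talk. A minimal sketch of that plumbing; the FFmpeg flags shown in the comment are plausible placeholders, not the flags Fazantix actually uses:

```go
package main

import (
	"fmt"
	"io"
	"os/exec"
)

// pipeFrames starts cmdName with args, writes each raw frame to its
// standard input, and waits for the process to exit. With FFmpeg as the
// command, FFmpeg takes care of encoding and network streaming.
func pipeFrames(cmdName string, args []string, frames [][]byte) error {
	cmd := exec.Command(cmdName, args...)
	cmd.Stdout = io.Discard
	stdin, err := cmd.StdinPipe()
	if err != nil {
		return err
	}
	if err := cmd.Start(); err != nil {
		return err
	}
	for _, f := range frames {
		if _, err := stdin.Write(f); err != nil {
			return err
		}
	}
	stdin.Close() // EOF tells the encoder the stream is over
	return cmd.Wait()
}

func main() {
	// With real FFmpeg this might look like (illustrative flags):
	//   pipeFrames("ffmpeg", []string{"-f", "rawvideo", "-pix_fmt", "rgba",
	//       "-s", "1920x1080", "-r", "60", "-i", "-",
	//       "-f", "mpegts", "udp://example.invalid:1234"}, frames)
	// Here we just demonstrate the hand-off using `cat` as a stand-in.
	err := pipeFrames("cat", nil, [][]byte{{1, 2, 3}, {4, 5, 6}})
	fmt.Println("pipe result:", err)
}
```

Keeping the encoder in a separate process like this means the mixer never needs to link against codec libraries; it only moves bytes into a pipe.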
12:36.880 --> 12:44.800
So this is the HTML source that is rendering the current schedule of this room, and it has some

12:44.800 --> 12:50.720
custom CSS injected to scale things a bit and remove the background. As a demo, it's not very

12:50.720 --> 13:00.480
readable at the top, but that's what demos are for. This is an FFmpeg source; it has a simple loop.

13:00.480 --> 13:05.120
The stuttering in the loop is something in FFmpeg, not even Fazantix, and I have not been

13:05.120 --> 13:13.120
looking into how to do seamless loops in FFmpeg, but it's working. And, yes, the

13:13.120 --> 13:25.600
camera source works if we rotate it. Yes, perfect. So yeah, this is basically what Fazantix does.

13:25.920 --> 13:44.080
So far, we've managed to develop this, and it sort of works okay in production,

13:45.440 --> 13:54.000
like, for example, this box here that streams out of this room. But there are still a few challenges

13:54.000 --> 14:05.040
that need to be addressed in order to make it a production-ready video mixer. For example,

14:06.720 --> 14:16.480
we currently support multiple sinks, like multiple video streams being output, but in practice

14:16.560 --> 14:29.120
there always has to be exactly one window-based output, because that's where we run

14:29.120 --> 14:38.080
the render loop. It basically runs the same way a game engine would run: there's this window that

14:38.080 --> 14:46.080
you render to and then we piggyback on that render loop to actually do the other render steps for

14:46.160 --> 14:52.240
the other outputs that are based on FFmpeg. But currently we can't open a second window,

14:52.240 --> 15:01.680
and we can't just stream to FFmpeg with zero window outputs, because we need a window

15:01.680 --> 15:12.240
output for something to run the shaders, and that's something we would have to research how to

15:12.320 --> 15:24.080
do, because we're not really graphics people; we're more like hardware people who happen to

15:25.440 --> 15:34.720
be looking into how to do some stuff with graphics. The other interesting issue that we currently

15:35.680 --> 15:55.040
have with the implementation is the fact that the YUYV conversion, and basically any pixel-format

15:55.040 --> 16:04.080
conversion, is done directly in the shader, in a way that just samples the pixels from the source

16:04.080 --> 16:15.760
texture directly. So interpolation is basically broken, because you cannot do that: the source

16:17.200 --> 16:25.920
pixels would not be correct if you interpolate them and then assume they are correct

16:25.920 --> 16:35.040
YUYV data. So this is another thing that we have to look into. So we're basically also

16:35.040 --> 16:45.440
soliciting contributions. The code is in the FOSDEM video Fazantix repo on GitHub,

16:45.520 --> 16:57.760
and you're all welcome to look at it, and to message us and yell at us for extremely stupid things that we

16:57.760 --> 17:05.280
might have done, because we basically don't know much about graphics programming in general. We just thought,

17:05.280 --> 17:12.880
how hard could it be to write a video mixer? And we just did some stuff, and it happens to work,

17:12.880 --> 17:23.360
and it's actually fast, but there are still some rough edges. Martin can tell you more about the

17:24.400 --> 17:33.680
shader pipeline while I try to do a quick demo of the scene transitions.

17:34.640 --> 17:45.680
Yeah, currently, in the shader, most of these sources are

17:45.680 --> 17:50.640
YUYV data, and I'm currently loading that into an RGBA texture, because apparently there are

17:50.640 --> 18:00.720
no YUYV formats in OpenGL. And then in the shader, I do a rough interpolation of the

18:00.720 --> 18:07.840
YUYV data to make it look not crap, and this only works as long as I have nearest-neighbor

18:07.840 --> 18:13.520
interpolation enabled. Because when I scale one of the sources down, like the

18:13.520 --> 18:19.440
webcam currently is, I no longer have a one-to-one mapping between the pixels of the source

18:19.440 --> 18:27.040
and the destination, and, due to interpolation, edges will start to appear two pixels away.

18:27.040 --> 18:32.080
This looks like some really weird ghosting, and I'm pretty sure the solution is to

18:33.680 --> 18:41.600
do the shading in multiple steps, where the first decodes the YUYV data to RGB, and then the next shader

18:41.600 --> 18:47.040
actually does the video mixing, but I have no clue what the performance impact of that would be.

18:48.640 --> 18:52.960
But I guess most of the people here are graphics people, so maybe one of you will know,

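The YUYV-to-RGB decode that this first shader pass would do can be written out on the CPU for reference. This assumes full-range BT.601 coefficients; the capture cards may actually produce limited-range or BT.709 video, so the constants are an assumption:

```go
package main

import "fmt"

// clamp8 limits a value to the 0..255 byte range.
func clamp8(v float64) uint8 {
	if v < 0 {
		return 0
	}
	if v > 255 {
		return 255
	}
	return uint8(v)
}

// yuyvToRGBA decodes one YUYV macropixel (two pixels sharing one Cb/Cr
// pair) into two RGBA pixels, using full-range BT.601 coefficients.
func yuyvToRGBA(y0, cb, y1, cr uint8) [2][4]uint8 {
	var out [2][4]uint8
	cbf := float64(cb) - 128
	crf := float64(cr) - 128
	for i, y := range []uint8{y0, y1} {
		yf := float64(y)
		out[i] = [4]uint8{
			clamp8(yf + 1.402*crf),                   // R
			clamp8(yf - 0.344136*cbf - 0.714136*crf), // G
			clamp8(yf + 1.772*cbf),                   // B
			255,                                      // A
		}
	}
	return out
}

func main() {
	// Neutral chroma (Cb = Cr = 128) must give gray pixels equal to Y.
	fmt.Println(yuyvToRGBA(128, 128, 235, 128))
}
```

Because two horizontal pixels share one chroma sample, naive bilinear filtering across a YUYV-packed texture mixes unrelated Y and chroma bytes, which is exactly why a decode-first pass makes interpolation safe.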
18:53.920 --> 19:03.280
hopefully. And another big performance issue here is, when adding more sources, we are rendering twice

19:05.520 --> 19:12.160
by calling the shader again. And we were initially using glReadPixels or something to get

19:12.160 --> 19:17.200
the framebuffer for the second output, which is very slow, and I've migrated that over to

19:17.360 --> 19:28.320
one of the async APIs, and it seems faster, but I still have the feeling I'm doing it wrong.

19:29.520 --> 19:31.120
So, patches welcome.

19:34.240 --> 19:40.640
So the other thing we currently have implemented is scene transitions. So now, if I

19:40.640 --> 19:48.560
just unplug the HDMI from the laptop, it should transition to another scene that shows the

19:48.560 --> 19:57.040
placeholder image. And now, hopefully, if I plug it into the other laptop, it should smoothly transition to

19:57.040 --> 20:07.120
the output of that laptop. Fancy stuff. And this is the web user interface of the video mixer. So here

20:08.000 --> 20:16.800
you can see the two output stages, the one for the projector and the one for the stream, and, for

20:16.800 --> 20:28.640
example, if I switch the projector to camera full-screen, this will happen, and you will be

20:28.640 --> 20:39.360
able to see me on the projector. So this works on the FOSDEM video box. And yeah, basically,

20:40.320 --> 20:51.280
this is the scene that gets sent out to the stream online. And the other thing I want to

20:51.280 --> 21:03.120
show you before we wrap up, let me just do this, is to just open the code

21:03.520 --> 21:25.520
and show you the actual video mixing, because it's a really short piece of GLSL code.

21:25.840 --> 21:34.560
The only thing we have is these functions that sample the input textures. For example, this is

21:34.560 --> 21:42.640
the function that samples the YUYV texture that gets sent from the Video4Linux buffer. And

21:44.480 --> 21:52.720
these are all separate functions to sample the different input formats that we

21:52.720 --> 22:01.280
might have, and so basically we have this step here that's like a switch, run depending

22:01.280 --> 22:10.080
on the format of the current input being sampled. And the whole mixing procedure is just this: we sample

22:10.960 --> 22:22.080
each input layer. This code gets generated, so that this block here is present for every single input

22:22.080 --> 22:31.840
that we have. So we just sample the input pixels, and then we blend them together. That's it; that's

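The sample-then-blend step generated per layer amounts to the standard "over" compositing operator. A CPU-side sketch of that blend with straight (non-premultiplied) alpha, in 0..1 floats; whether the shader uses straight or premultiplied alpha is not stated in the talk:

```go
package main

import "fmt"

// blendOver composites src over dst using the "over" operator with
// straight alpha. Each color is [R, G, B, A] in the 0..1 range. This
// mirrors the generated per-layer block in the shader: sample a layer,
// then blend it onto the accumulated result.
func blendOver(dst, src [4]float64) [4]float64 {
	a := src[3]
	var out [4]float64
	for i := 0; i < 3; i++ {
		out[i] = src[i]*a + dst[i]*(1-a)
	}
	out[3] = a + dst[3]*(1-a)
	return out
}

func main() {
	background := [4]float64{0, 0, 1, 1} // opaque blue
	logo := [4]float64{1, 1, 1, 0.5}     // half-transparent white overlay
	fmt.Println(blendOver(background, logo))
}
```

Applying this once per layer, in stacking order, is the entire mixing step the talk describes.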
22:31.840 --> 22:43.440
the whole video mixing. And basically, the layer data that we have here just tells us how to transform

22:43.520 --> 22:54.800
this layer; so it's basically the transformation matrix that tells us that this texture gets rendered

22:54.800 --> 23:02.640
into this rectangle and that one gets rendered into that rectangle. And as you can possibly see, the

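The per-layer transform described here is, in effect, a 2D scale plus translate mapping a layer's unit texture square into its output rectangle. A minimal sketch; the rectangle values are illustrative, not the actual FOSDEM layout:

```go
package main

import "fmt"

// rect is an output rectangle in normalized frame coordinates (0..1).
type rect struct{ x, y, w, h float64 }

// mapPoint maps a point in a layer's unit square (u, v texture
// coordinates in 0..1) into the output rectangle: the scale+translate
// that the per-layer matrix in the shader encodes.
func (r rect) mapPoint(u, v float64) (float64, float64) {
	return r.x + u*r.w, r.y + v*r.h
}

func main() {
	// e.g. a slides layer occupying the left three quarters of the frame
	// (illustrative numbers).
	slides := rect{x: 0, y: 0.125, w: 0.75, h: 0.75}
	fmt.Println(slides.mapPoint(0, 0)) // top-left corner of the layer
	fmt.Println(slides.mapPoint(1, 1)) // bottom-right corner of the layer
}
```

Moving or resizing a layer then just means updating these four numbers; the shader itself never changes.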
23:03.520 --> 23:13.600
text there on the slides is quite jagged and not really well anti-aliased. That is because of this

23:14.320 --> 23:24.080
function here that samples the YUYV texture, and it samples it by getting pixels with nearest-neighbor

23:24.080 --> 23:32.080
interpolation; it just takes a pixel, and that's why it looks like this. So if this gets improved,

23:32.160 --> 23:42.000
if, hopefully, one of you sends us a patch or an idea of how to improve this, we will be able to

23:42.960 --> 23:47.200
actually run this in production with feature parity with the current setup.

23:47.200 --> 23:56.480
Oh no, question time's up.

