WEBVTT

00:00.000 --> 00:14.800
Hello everybody, welcome to the Lightning Talks in Brussels here at FOSDEM. I want to introduce

00:14.800 --> 00:22.960
to you Philipp Stanner. He's talking about how to clean up a 16-year-old Linux kernel API.

00:22.960 --> 00:28.960
Give him a warm welcome and enjoy the talk.

00:31.360 --> 00:37.200
So, hi everyone, I hope you're having a good FOSDEM and not too many beers. So, I'm Philipp,

00:37.200 --> 00:41.760
I'm currently a Linux kernel engineer, mostly for graphics cards and the infrastructure they need.

00:42.640 --> 00:47.200
And since this is a lightning talk and time is a scarce resource, we'll jump right in.

00:48.160 --> 00:55.040
Just one disclaimer: we are looking here at some quite old code that is quite bad.

00:55.760 --> 00:59.600
And I just want to point out that we just criticize the code, right, and we don't want to condemn

00:59.600 --> 01:04.080
the authors, because you know what it's like: people are sometimes under time pressure, or are doing

01:04.080 --> 01:08.080
things for the first time, and sometimes we have 10 more years of experience, so we have a better

01:08.080 --> 01:13.760
overview, actually. As a teaser, I have here a quote that I like quite a bit from a former colleague of mine,

01:13.840 --> 01:17.680
who said: as a software developer, you know, at first you think that the others can't code.

01:17.680 --> 01:22.960
And ultimately, you realize that no one really can. I think that's approximately true,

01:22.960 --> 01:31.360
maybe quite true. And so, we are looking into a specific subsystem of the Linux kernel,

01:31.360 --> 01:39.200
the PCI subsystem, and notably PCI Express. PCI Express is probably the most important

01:39.280 --> 01:44.240
bus in your computer; virtually everything your computer does, as far as I know, is multiplexed

01:44.240 --> 01:49.680
through PCI Express lanes ultimately. And it's quite old; I think it's from 2003 or something,

01:49.680 --> 01:54.240
so more than 20 years. It's quite old code, probably older than many people in the audience, right?

01:55.440 --> 02:00.960
And the subsystem in the kernel where this code lives currently has only one full-time maintainer,

02:00.960 --> 02:08.080
which is not that many. And as far as I know, it has two or three APIs that are kind of broken,

02:08.160 --> 02:12.880
that should be replaced but are very difficult to get rid of. And we're looking into one of those APIs,

02:12.880 --> 02:19.920
which can cause overflows and potentially undefined behavior. And we don't have to understand

02:19.920 --> 02:26.800
that much about PCI. All that's important for our talk here is the following. So in PCI

02:26.800 --> 02:32.640
you have PCI devices, for example your graphics card, and the card consists of so-called BARs,

02:32.640 --> 02:37.040
base address registers, which describe memory regions, basically: physical memory regions.

02:37.040 --> 02:42.240
The best example is probably your video RAM; that's a PCI BAR. And now as a driver, when you're

02:42.240 --> 02:46.240
a graphics driver, for example, you want to move shaders and stuff into your video memory,

02:46.240 --> 02:50.960
so you need to access it from the CPU. And to do that, you need to do something that's called

02:50.960 --> 02:57.120
an I/O remap: you need to map the physical addresses into your virtual address space, which is

02:57.120 --> 03:00.640
on the right side here. And then you get an I/O pointer, and with that pointer you can access

03:00.800 --> 03:06.880
the device and copy stuff and so on. And PCI has several functions to do that,

03:06.880 --> 03:11.360
depending on what precisely you want to do, the ranges you want to use, and other requirements.

03:11.360 --> 03:16.560
Those details aren't interesting for us, but we're looking into those I/O remap functions in this

03:16.560 --> 03:25.840
talk here. And so one day, I was tasked with some GPU stuff, and I wasn't expecting anything

03:25.840 --> 03:29.920
evil to happen, and then I stumbled upon this function here. It's called

03:29.920 --> 03:36.480
pcim_iomap_table, which has quite an interesting prototype. We know the C language is sometimes a bit

03:36.480 --> 03:42.640
nasty with those, and then I was like, okay, what's that? And this is a function; it takes a PCI

03:42.640 --> 03:49.120
device as its parameter, and it returns a pointer to a constant pointer. And I didn't

03:49.120 --> 03:54.960
really get what it is doing. And then you do some git blame, and see it's from 2007 to 2009, right? It's

03:55.040 --> 03:59.920
quite an old thing. And if you don't understand how something works, then the best thing to

03:59.920 --> 04:04.880
do in the kernel is look for users: look who is using that function and then try to get it. And

04:04.880 --> 04:11.760
alternatively you could also read the documentation, but who does that, right? So we look for users

04:11.760 --> 04:17.280
and then you discover the following: this function always occurs as a pair with another function,

04:17.280 --> 04:22.080
which is pcim_iomap_regions. This is precisely the iomap function, which does the I/O remapping

04:22.080 --> 04:30.480
into the virtual address space. And, well, what it does is, it does the iomap and the

04:30.480 --> 04:37.680
memory request. And for some reason, you specify which BAR you want to map with this bit mask.

04:37.680 --> 04:45.040
So you set the zeroth bit, and you're going to get the zeroth BAR. And then, for some reason, the actual

04:45.040 --> 04:50.880
I/O address your driver obtains through the said iomap table function. And you do it by indexing

04:51.760 --> 04:57.600
over the function. Well, compiler people would beat me: you can't actually index over a function,

04:57.600 --> 05:01.680
but you index over what the function returns. But you get the idea. I had never seen something

05:01.680 --> 05:05.760
like that in C; I didn't even know it was possible. But miracles happen by themselves,

05:05.760 --> 05:13.920
I guess. So, and then you ask yourself: why did they implement it that way, like 16 years ago?

05:14.240 --> 05:22.480
Well, why? What was the background? And the background is the following. I assume; I don't know

05:22.480 --> 05:29.040
for sure. But it seems the authors' intention was that this bit mask functionality would allow

05:29.040 --> 05:36.080
for mapping many BARs, more than one simultaneously, by setting several bits. And that now

05:36.080 --> 05:40.800
seems to be the reason why the second function became necessary. Because in C, there's not really

05:40.800 --> 05:45.360
a smart mechanism to return several I/O pointers, or pointers in general, to the caller.
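
NOTE
Editor's sketch, not from the talk: a minimal userspace mock in C of the two-step pattern described above. All names (mock_iomap_regions, mock_iomap_table, mock_bar_memory, mock_driver_probe) are made up for illustration and only imitate the shape of the kernel functions; the real ones take a PCI device and return __iomem pointers. Step one maps the BARs selected by a bit mask and records the pointers in a table; step two returns that table for the caller to index.
```c
#include <assert.h>
#include <stddef.h>
#define MOCK_NUM_BARS 6 /* PCI defines at most six standard BARs */
/* Hypothetical stand-ins: fake device memory and the table of mapped addresses. */
static char mock_bar_memory[MOCK_NUM_BARS][16];
static void *mock_table[MOCK_NUM_BARS];
/* Step 1: map every BAR whose bit is set in mask and record it in the table. */
static int mock_iomap_regions(unsigned int mask)
{
    for (int bar = 0; bar < MOCK_NUM_BARS; bar++)
        if (mask & (1u << bar))
            mock_table[bar] = mock_bar_memory[bar];
    return 0;
}
/* Step 2: hand out the whole table; nothing checks the index the caller uses. */
static void *const *mock_iomap_table(void)
{
    return mock_table;
}
/* Driver-side usage: request BAR 0 via the mask, then index the returned table. */
static void *mock_driver_probe(void)
{
    int err = mock_iomap_regions(1u << 0);
    assert(err == 0);
    return mock_iomap_table()[0];
}
```
Note that an index such as mock_iomap_table()[42] compiles just as well as [0], because neither the function nor the language sees the index.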

05:45.360 --> 05:49.920
I mean, you can do some tricks, I know, but, you know. And here a graphic shows you what's been

05:49.920 --> 05:55.360
done. So, first you do the iomap. Then that function enters the I/O pointers into some global

05:55.360 --> 05:59.280
table that's administered somewhere. And then through the iomap table function, you index

05:59.280 --> 06:03.760
over that. And then you get the I/O address. And you can pass it to your driver and then the driver

06:03.840 --> 06:15.840
can do all of its stuff. Yeah, and that's bad. Why is it bad? First of all, most notably,

06:17.440 --> 06:22.960
this awkward function index mechanism doesn't allow for bounds checking. So, neither the function

06:22.960 --> 06:26.400
can check the parameter you're passing here, nor does the C language, which is, I mean, well,

06:26.400 --> 06:30.960
C, have a mechanism to check that. So, if you had a few beers too many and then start hacking,

06:31.760 --> 06:36.160
you can just randomly overflow this table we just saw in the other graphic. And then you

06:36.160 --> 06:40.080
have the archetype, the example par excellence, of undefined behavior in C, because you just

06:40.880 --> 06:45.280
get some random pointer into memory passed to the driver; the driver might try to write through

06:45.280 --> 06:53.040
the random pointer somewhere into memory, and everything explodes in your face. Well, and the other

06:53.040 --> 07:00.320
thing: PCI currently supports a maximum of six PCI BARs. Most devices have

07:00.320 --> 07:05.760
fewer, just one BAR, or maybe three. And now the entire mechanism becomes questionable,

07:05.760 --> 07:10.240
because the bit mask is an integer with like 32 bits at most; it isn't even

07:10.240 --> 07:15.200
extensible, if PCI one day introduces like one thousand BARs, you know.

07:16.480 --> 07:21.600
So, that's not good, we don't really want that. Because, you know, kernel APIs,

07:21.600 --> 07:25.360
centrally provided APIs, should be robust. They should be as robust as possible, because

07:25.360 --> 07:29.200
drivers, for example, are not always written by people like Ken Thompson and super geniuses;

07:29.200 --> 07:32.720
they're sometimes really written by interns and all that. So, the more robust your central

07:32.720 --> 07:38.160
infrastructure is, the better. And the other thing is that there are people who like to be clever.

07:40.320 --> 07:46.160
That's actually code from the kernel; here's the file where this is going on.

07:46.160 --> 07:50.960
That's only part of the code, by the way. So, here it is: it's 0x3, because someone was

07:50.960 --> 07:56.800
too lazy to set the bits in a readable manner, and then you know, okay, 0x3 is like

07:56.800 --> 08:01.120
the lowest two bits being set. So, it should be the first two PCI BARs that the driver is

08:01.120 --> 08:06.320
requesting, but then it's being shifted by a base, and I'll tell you that the whole code is in a loop.

08:06.320 --> 08:12.000
So, you have no clue what the code is doing anymore. And we don't want, at least I'd

08:12.000 --> 08:17.120
say we don't want, people to write super-duper clever code, you know. We could

08:17.120 --> 08:21.120
blame the author here now, but we don't want to encourage it. If you want your children to

08:21.120 --> 08:24.800
eat green vegetables, then what you shouldn't do is give them sweets, right?

08:24.880 --> 08:28.240
It's not a good strategy; you shouldn't provoke the problem.

08:30.640 --> 08:33.440
Yeah, and then when you search through the kernel: how is it actually being used?

08:33.440 --> 08:38.080
When I started working on this, there were like 131 users, almost all of them

08:38.080 --> 08:45.600
request just one BAR, so they set one bit in the bit mask thing. And as I said, the maximum possible

08:45.600 --> 08:49.440
would be six BARs, and if you really need six BARs, then there's a revolutionary

08:49.440 --> 08:53.040
invention in programming called the loop; then you just call a function that maps a single BAR

08:53.040 --> 08:58.640
like six times, and that's it. So our conclusion is: this API is overengineered, and we would like to replace

08:58.640 --> 09:03.760
it with a simpler alternative, an iomap_region function, without the s, where you just pass a BAR

09:03.760 --> 09:08.160
index as a number, and it immediately returns the pointer, and then you're done.
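
NOTE
Editor's sketch, not from the talk: a minimal userspace mock in C of the simpler shape described above, where the caller passes one BAR index and gets one pointer back. The names (mock_iomap_region, mock_bar_memory, mock_driver_probe) are made up for illustration, and reporting errors as NULL is a simplification of this mock, not a claim about the kernel function. The point shown is that a plain index argument can be range-checked inside the function.
```c
#include <assert.h>
#include <stddef.h>
#define MOCK_NUM_BARS 6 /* PCI defines at most six standard BARs */
static char mock_bar_memory[MOCK_NUM_BARS][16]; /* fake device memory */
/* Map exactly one BAR, selected by a plain index. Because the index is an
 * ordinary argument, the function itself can reject out-of-range values. */
static void *mock_iomap_region(int bar)
{
    if (bar < 0 || bar >= MOCK_NUM_BARS)
        return NULL; /* simplification: this mock signals errors with NULL */
    return mock_bar_memory[bar];
}
/* Driver-side usage: one call, one pointer, and a failure the caller can see. */
static int mock_driver_probe(int bar, void **out)
{
    void *addr;
    assert(out != NULL);
    addr = mock_iomap_region(bar);
    if (addr == NULL)
        return -1;
    *out = addr;
    return 0;
}
```
A driver that needs several BARs would simply call the function once per index, for example in a loop.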

09:09.920 --> 09:17.280
And now, okay, you can say you can easily remove it, can't you? But there are some obstacles.

09:17.280 --> 09:25.360
As we saw, there are 131 users; that means this code is scattered across dozens of subsystems

09:25.360 --> 09:30.720
and drivers, and kernel patches are typically merged per subsystem, so you need the acknowledgment,

09:30.720 --> 09:36.480
the ack, the approval of each of the maintainers, and it should be reviewed, and for other reasons it's

09:36.480 --> 09:42.720
kind of tricky. And so effectively, for kernel-political reasons, it's not possible to replace

09:42.880 --> 09:47.280
the API in one go; that's very difficult. And that's, by the way, something that's surprising

09:47.280 --> 09:50.720
for people who are new: the kernel is not only about technology, it's a lot about

09:50.720 --> 09:56.560
process, about discussions, about kernel politics, you could call it. Anyways, we want to

09:56.560 --> 10:01.120
get rid of it; how would you do that? So that's the pre-solution state, what it was like

10:01.120 --> 10:08.000
one year ago: you have all the drivers up there, they iomap their PCI BAR, this function

10:08.000 --> 10:12.720
then passes the I/O pointer into the magic table, and then they obtain the address through this

10:12.720 --> 10:21.040
magic index-the-function thing. Okay, and how can we phase it out in a good manner?

10:21.040 --> 10:27.680
You can do it by providing our simpler alternative, whose prototype I just addressed,

10:27.680 --> 10:34.240
and that would be a function that just requests and iomaps one BAR. And where do we implement it?

10:34.240 --> 10:39.120
So the way I did it is I implemented this as the new base of the iomap_regions function;

10:40.000 --> 10:44.960
it just returns one mapping address, and the only user at the beginning is the regions function,

10:44.960 --> 10:51.120
the broken one, and the rest is not changed. And this has the great advantage, kernel-process-

10:51.120 --> 10:55.600
wise, that if you blow things up and cause a regression, it's very easy to revert, because you have only

10:55.600 --> 11:01.120
one place, the PCI subsystem, where you can repair the function: you do a git revert and you're done,

11:01.120 --> 11:06.480
instead of 131 places. So that's already good. The other thing is you immediately recognize if it's broken,

11:06.480 --> 11:11.280
because all the users will use your code, which is also handy when later convincing

11:11.280 --> 11:14.560
maintainers to merge more of your code, because you can say, hey, you're already using it,

11:14.560 --> 11:22.960
below the surface. And now that we have this simple alternative, we can expose this new API,

11:22.960 --> 11:26.960
make it public. If a new driver like F here comes along, in the upper right corner,

11:27.040 --> 11:30.880
it can immediately use the new function and obtain its mapping directly, without this awkward table

11:30.880 --> 11:36.880
index thing. And now that the new alternative is public, we can deprecate the old API

11:36.880 --> 11:41.920
and say, please don't use it anymore. And programmers always read documentation, so it will definitely

11:41.920 --> 11:45.760
not happen that someone still uses the old API, and it will definitely not happen two days before

11:45.760 --> 11:53.200
my talk at FOSDEM. Yeah, and then you do it step by step: you port more and more drivers, like

11:53.200 --> 11:59.520
D, and one day, 131 users later, you're done; then you can delete the old code,

12:00.320 --> 12:05.680
and you have one clean function that does this one simple thing: map the stuff, return the address,

12:05.680 --> 12:11.760
or an error if there's some problem. Classic Unix philosophy, you know: do one thing and do it well.
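
NOTE
Editor's sketch, not from the talk: a minimal userspace mock in C of the phase-out strategy described above, in which the new single-BAR function becomes the base of the old mask-based one. All names (mock_iomap_region, mock_iomap_regions, mock_table, mock_check_both_paths) are made up for illustration, and the NULL and -1 error values are simplifications of this mock. Unconverted callers keep using the mask and the table, yet every call already runs through the new function, so a regression can be fixed or reverted in one place.
```c
#include <assert.h>
#include <stddef.h>
#define MOCK_NUM_BARS 6 /* PCI defines at most six standard BARs */
static char mock_bar_memory[MOCK_NUM_BARS][16]; /* fake device memory */
static void *mock_table[MOCK_NUM_BARS]; /* legacy table, kept for old callers */
/* New building block: map a single BAR by index, NULL on a bad index. */
static void *mock_iomap_region(int bar)
{
    if (bar < 0 || bar >= MOCK_NUM_BARS)
        return NULL;
    return mock_bar_memory[bar];
}
/* Legacy mask-based entry point, now only a loop over the new function.
 * Mask bits beyond the last BAR are ignored in this mock. */
static int mock_iomap_regions(unsigned int mask)
{
    for (int bar = 0; bar < MOCK_NUM_BARS; bar++) {
        void *addr;
        if (!(mask & (1u << bar)))
            continue;
        addr = mock_iomap_region(bar);
        if (addr == NULL)
            return -1;
        mock_table[bar] = addr;
    }
    return 0;
}
/* An old-style caller and a converted caller end up with the same address. */
static void mock_check_both_paths(void)
{
    assert(mock_iomap_regions(0x3) == 0);
    assert(mock_table[1] == mock_iomap_region(1));
}
```
Once no caller reads mock_table any more, the legacy function and the table can be deleted without touching mock_iomap_region.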

12:11.760 --> 12:26.960
So, now some conclusions, because many people would like to contribute to the kernel;

12:26.960 --> 12:34.240
maybe some of you would like to. I would recommend this guide here: I would recommend you browse code

12:34.240 --> 12:39.520
that interests you, preferably code that you have some experience with already; that would be good.

12:40.000 --> 12:46.240
And if you see something that looks broken, then it actually, most likely, is broken,

12:46.240 --> 12:50.960
because my reaction to that code, for example, was: there must be some genius reason why it's that way,

12:50.960 --> 12:55.200
I just haven't understood it. That's often the thing when you're doing things in the

12:55.200 --> 12:59.680
kernel for the first time, because I sometimes have the feeling some people seem to

12:59.680 --> 13:03.680
perceive kernel engineers a bit like the elves in Lord of the Rings, you know, the most

13:03.680 --> 13:09.120
noble and wisest of all beings. But if it looks broken, likely it's broken, so I'd recommend

13:09.120 --> 13:13.360
you try to repair it, and in case of doubt, ask on the mailing list, ask the maintainers.

13:13.360 --> 13:19.360
It's the maintainer's duty, in my opinion, to guide you, to guide new contributors, through a patch series;

13:19.920 --> 13:23.440
if the maintainer doesn't do it, then he doesn't have the right, in my opinion, to complain about burnout.

13:24.800 --> 13:30.240
Yes, so yeah, don't hesitate, repair things. Also, it's good for your

13:30.240 --> 13:33.920
reputation, in my opinion, because it's good to be known as someone who repairs stuff, right,

13:33.920 --> 13:37.200
because many people only want to implement new features, because that's the exciting thing,

13:37.760 --> 13:42.720
but I think it's good for your reputation to say: I repaired something that was broken for

13:42.720 --> 13:49.680
16 years. It's a cool thing. So, and then you have success, and that's good. And in that sense,

13:49.680 --> 13:56.480
if you'd like to jump into the kernel, into kernel work: go right ahead, ask. And have

13:56.480 --> 14:00.880
a good remaining FOSDEM.

