WEBVTT

00:00.000 --> 00:11.240
My next speaker is Brian. Brian has spoken in the Go devroom before, even online if

00:11.240 --> 00:16.800
I remember correctly, on some amazingly complex subjects of how Go works internally, and has been

00:16.800 --> 00:18.760
doing an amazing job.

00:18.760 --> 00:24.360
So I invited him back this year, well, actually, he asked, to talk about Swiss maps, which

00:24.360 --> 00:29.520
I previously, wrongly, showed as this map; he's going to talk about the actual Swiss map,

00:29.520 --> 00:31.360
so round of applause for Swiss maps.

00:31.360 --> 01:00.160
This is a Swiss map, all right, no, well, okay.

01:00.160 --> 01:11.680
So Swiss map is a new map implementation in Go 1.24, and so we understand why we're here.

01:11.680 --> 01:18.920
Hello, my name is Brian Boreham, I work at Grafana Labs, where pretty much all of our backend

01:18.920 --> 01:28.160
code is written in Go. I've been working in Go for 10 years or so, and at work I work on

01:28.160 --> 01:38.160
these things, open source projects for storing metrics, logs, traces.

01:38.160 --> 01:39.400
Why am I talking about this?

01:39.400 --> 01:45.000
Well, I watched this talk, by Matt Kulukundis. Is Matt in the room?

01:45.000 --> 01:46.640
No, okay.

01:46.640 --> 01:50.880
Is anyone who worked on this in the room?

01:50.880 --> 01:55.680
That's good, I can just tell you anything, okay.

01:55.720 --> 02:04.320
So yeah, I watched this talk, which is about the C++ implementation of Swiss maps, and

02:04.320 --> 02:11.760
I found it very engaging and funny. And so this talk, Matt's talk, is an hour

02:11.760 --> 02:16.840
long, and in fact, this is part two, so he took two hours, and I have like 25 minutes,

02:16.840 --> 02:22.040
so we're going to go.

02:22.040 --> 02:25.240
So there's a map, right?

02:25.240 --> 02:36.400
There's a map in Go. A map is a key-value store in memory, and the one thing that we need

02:36.400 --> 02:42.760
to get out of a map is that everything is pretty much constant time, so putting things in,

02:42.760 --> 02:48.720
looking things up, and deleting things, all constant time.

02:48.720 --> 02:52.560
So how do we do that?

02:52.560 --> 03:00.440
So this is kind of an abstract, how a map works, a hash table, how does the hash table work?

03:00.440 --> 03:09.600
Well, we take the key, I can't see the red dot, oh well, we take the key, which is over

03:09.600 --> 03:15.840
on the left, we put it through a hash function; the hash function generates a 64-bit number,

03:15.840 --> 03:22.400
and that number changes for every different key, hopefully, 64 bits might be smaller than

03:22.400 --> 03:28.640
your key, so they might sometimes map onto the same hash, but hopefully you get a wide diversity

03:28.640 --> 03:39.720
of hashes, and then you take that number, modulo the number of buckets, and if two things

03:39.720 --> 03:44.440
map onto the same bucket, then you make a chain with pointers.

03:44.440 --> 03:50.760
So this is called open hashing; this is the kind of classical technique that you might have learned

03:50.760 --> 03:53.960
in university, something like that.
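
The chaining scheme described here can be sketched in a few lines of Go. This is an illustrative toy, not the runtime's code; the names (`chainedMap`, `entry`) and the FNV-1a hash are my own stand-ins.

```go
package main

import "fmt"

// entry is one key-value pair; colliding entries form a linked chain.
type entry struct {
	key, value string
	next       *entry
}

// chainedMap is a toy open-hashing (separate chaining) table.
type chainedMap struct {
	buckets []*entry
}

// fnv is FNV-1a, a simple stand-in for the runtime's hash function.
func fnv(s string) uint64 {
	var h uint64 = 14695981039346656037
	for i := 0; i < len(s); i++ {
		h ^= uint64(s[i])
		h *= 1099511628211
	}
	return h
}

func (m *chainedMap) put(k, v string) {
	i := fnv(k) % uint64(len(m.buckets))
	for e := m.buckets[i]; e != nil; e = e.next {
		if e.key == k { // key already present: update in place
			e.value = v
			return
		}
	}
	// prepend a new entry to the bucket's chain
	m.buckets[i] = &entry{key: k, value: v, next: m.buckets[i]}
}

func (m *chainedMap) get(k string) (string, bool) {
	i := fnv(k) % uint64(len(m.buckets))
	for e := m.buckets[i]; e != nil; e = e.next {
		if e.key == k {
			return e.value, true
		}
	}
	return "", false
}

func main() {
	m := &chainedMap{buckets: make([]*entry, 8)}
	m.put("a", "1")
	m.put("b", "2")
	v, ok := m.get("a")
	fmt.Println(v, ok) // 1 true
}
```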

03:53.960 --> 04:00.040
There is another way of doing it, this one's called closed hashing, where instead of making

04:00.040 --> 04:06.840
a chain you just put it in the next slot, that's not entirely true, I'm going to lie

04:06.840 --> 04:13.440
from time to time in order to go faster and simplify what really happens.
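
A matching toy for closed hashing (open addressing): on a collision we probe the next slot instead of following a pointer. Again this is illustrative only; there is no resizing or deletion here, and as the speaker says, the real scheme is more subtle.

```go
package main

import "fmt"

// slot is one position in the table; used marks it occupied.
type slot struct {
	used       bool
	key, value string
}

// probedMap is a toy closed-hashing table. len(slots) is a power of two,
// so we can mask instead of taking a modulus.
type probedMap struct {
	slots []slot
}

// fnv is FNV-1a, a simple stand-in for the runtime's hash function.
func fnv(s string) uint64 {
	var h uint64 = 14695981039346656037
	for i := 0; i < len(s); i++ {
		h ^= uint64(s[i])
		h *= 1099511628211
	}
	return h
}

// put probes linearly for a free slot or an existing key.
// (A real table resizes before it fills; this toy would spin if full.)
func (m *probedMap) put(k, v string) {
	mask := uint64(len(m.slots) - 1)
	for i := fnv(k) & mask; ; i = (i + 1) & mask {
		if !m.slots[i].used || m.slots[i].key == k {
			m.slots[i] = slot{used: true, key: k, value: v}
			return
		}
	}
}

// get probes until it finds the key or an empty slot.
func (m *probedMap) get(k string) (string, bool) {
	mask := uint64(len(m.slots) - 1)
	for i := fnv(k) & mask; m.slots[i].used; i = (i + 1) & mask {
		if m.slots[i].key == k {
			return m.slots[i].value, true
		}
	}
	return "", false
}

func main() {
	m := &probedMap{slots: make([]slot, 8)}
	m.put("x", "1")
	m.put("y", "2")
	v, ok := m.get("y")
	fmt.Println(v, ok) // 2 true
}
```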

04:13.440 --> 04:21.480
So this one is called closed hashing, so I put those words in big text, focus on the

04:21.480 --> 04:37.800
first two letters, so this is why it's a Swiss map, because CH is the country code for

04:38.800 --> 04:48.000
Switzerland. It didn't hurt that the people mainly working on it worked at Google Zurich, but apparently

04:48.000 --> 04:54.960
this is the story, it's called a Swiss map, because it uses closed hashing, alright, let's

04:54.960 --> 05:04.240
get into it. So the previous version, the map up to Go 1.23, just as a contrast, how does that work?

05:05.160 --> 05:13.040
I'm going to have a spirited attempt to get a dot, no, how about if I move this somewhere

05:13.040 --> 05:14.040
over here?

05:14.040 --> 05:20.480
Yeah, right, so this is the data structure, right, in memory, it starts with a little bit

05:20.480 --> 05:28.160
of sort of header information, and there is one slice of buckets, just like we saw before

05:28.160 --> 05:35.880
we take the hash value; the number of buckets is always a power of two, so the modulus is cheaper, because we're

05:35.880 --> 05:41.600
very, very focused on performance in this whole thing. And then, what does a bucket look like?

05:41.600 --> 05:48.840
A bucket, in this case in the Go map, has up to eight things in it, and they are laid

05:48.840 --> 05:55.360
out as all the keys, then all the values, then a little bit of metadata for the bucket, and then

05:55.360 --> 06:02.920
if you manage to fill the bucket, another one will be created with an overflow pointer,

06:02.920 --> 06:08.040
so there's this chaining thing going on, but eight at a time, and if you get a lot of

06:08.040 --> 06:14.140
overflows, then the whole table is resized, doubled in size, so that's how the previous

06:14.140 --> 06:27.260
Go map worked. This is the new one. It's pretty similar; there are a couple of kind

06:27.260 --> 06:35.020
of main important changes, that the overflow pointer has disappeared, so true to the name

06:35.020 --> 06:43.020
we're using closed hashing, we do not make chains of buckets, but the other big change,

06:43.020 --> 06:48.660
I mean the keys go key value, key value, that's a little bit of a change, but the big

06:48.660 --> 06:57.420
big change is instead of just one table of buckets, there are multiple, and we look a

06:57.420 --> 07:01.860
little bit more about how that works, but basically what that allows is it allows the table

07:01.860 --> 07:10.020
to grow more gently; each table can grow independently, so we

07:10.020 --> 07:18.420
don't have to double the size of the whole thing all at once, right, we look at a little

07:18.420 --> 07:30.540
bit more detail, so first of all let's just talk about what is that metadata, okay, so this

07:30.540 --> 07:36.620
in this example I've got four directory entries and three tables, because two of the entries

07:36.620 --> 07:47.340
point to the same table, that's how it allows it to grow gently, you can have one table

07:47.340 --> 07:54.580
two, four, eight, whatever, you can have five tables, and the directory doubles in size as

07:54.580 --> 08:00.100
these things grow, and we take a certain number of bits off the top of the hash in order

08:00.100 --> 08:03.580
to look up the directory, so in this case we've got four entries in the directory, we take

08:03.580 --> 08:12.660
two bits. It's going to get more complicated, so strap in. We take two bits off the top

08:12.660 --> 08:21.300
of the hash to look up the directory, we take the next bits down to bit 57 to look into

08:21.300 --> 08:26.940
the bucket, to pick a bucket by modulus with the size of the table, and then what do we do with

08:26.940 --> 08:34.500
the other seven bits? Oh, we'll come back to that, I thought you'd be wanting to see

08:34.500 --> 08:53.420
some benchmarks, so I put a benchmark in next. It's faster. So the x-axis is the number

08:53.420 --> 09:08.500
of elements, so this is a benchmark from the go runtime, so we make a map and add a certain

09:08.500 --> 09:16.380
number, like a million or whatever, elements to it, and then we look up all of the elements,

09:16.380 --> 09:25.540
so every time we look it up in this benchmark it's a hit, and then we run this as a go benchmark,
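
The benchmark shape being described can be reconstructed roughly as below. This is not the runtime's actual benchmark code, just the same idea in standalone form: fill a map with n elements (no size hint), then look every one of them up so each probe is a hit.

```go
package main

import (
	"fmt"
	"time"
)

// fillAndSum builds a map with n elements, then looks each one up,
// mirroring the hit-only benchmark described in the talk. It returns
// the sum of the looked-up values so the work can't be optimized away.
func fillAndSum(n int64) int64 {
	m := make(map[int64]int64) // no size hint, so the map has to grow
	for i := int64(0); i < n; i++ {
		m[i] = i
	}
	var sum int64
	for i := int64(0); i < n; i++ {
		sum += m[i] // every lookup is a hit
	}
	return sum
}

func main() {
	const n = 1 << 20 // about a million elements
	start := time.Now()
	fillAndSum(n)
	fmt.Printf("%.1f ns per element (fill plus one hit lookup)\n",
		float64(time.Since(start).Nanoseconds())/float64(n))
}
```

The runtime's real benchmarks use the `testing` package and report ns/op; this version just makes the shape of the measurement visible.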

09:25.540 --> 09:34.820
so we get nanoseconds per op, and so just to point out the numbers, it's basically

09:34.820 --> 09:43.180
the same for small maps, and the big picture, why does it slow down, because the map

09:43.180 --> 09:51.340
no longer fits in cache, right, your CPU has an L1 cache, which is tiny, a few tens

09:51.340 --> 09:57.100
of K, something like that; it has a level 2 cache, which is maybe a megabyte; it has a level

09:57.100 --> 10:04.540
3 cache, which might be 32 megabytes, once you get out of the fast L1 cache, it starts

10:04.540 --> 10:13.140
to slow down, and so basically with everything in cache they're going at about the same speed,

10:13.140 --> 10:18.940
and as the map gets bigger, there's this kind of a bit in the middle, where it's like two or

10:18.940 --> 10:26.340
three times faster, and then it kind of asymptotically reaches a few percent faster,

10:26.340 --> 10:36.300
on this benchmark. Benchmarks are basically lies, right, so I ran it with a real program,

10:36.380 --> 10:44.060
which is Prometheus that I work on, and unfortunately it's like, it's about the same,

10:44.060 --> 11:01.540
there's very little difference observed between Go 1.23 and Go 1.24 RC2, and the reasons for that

11:01.620 --> 11:09.460
are probably complicated, and I have had some discussions with Michael Pratt, who did a lot

11:09.460 --> 11:16.180
of the work on the Go version of this, so hopefully we can clear this up, but yeah, that's

11:16.180 --> 11:21.540
a little disappointing, so I thought I'd stick that in the middle, so we get the sadness over

11:21.620 --> 11:35.140
with. Okay, so where's my clicker? Right, back to the metadata. So what we're going to do

11:35.140 --> 11:42.340
with those seven bits, we're going to store them in the metadata, so once again we take the key,

11:42.340 --> 11:49.140
we hash it, just the once on the way in, or we need to rehash it if we did this thing of doubling

11:49.220 --> 11:57.540
the table, we need to rehash all the elements, but generally speaking just once, we call the hash

11:57.540 --> 12:05.940
function, we get 64 bits, and we take the bottom seven bits and stick that in one byte of the

12:05.940 --> 12:13.460
metadata word. Why seven? Well, we have one more bit which tells us if there's anything there at

12:13.460 --> 12:21.620
all, so the one bit says that it's empty or deleted, and then, if it's not empty or deleted,

12:21.620 --> 12:30.100
the other seven bits are from the hash, one byte per element, so I've got three elements in this

12:30.100 --> 12:36.980
bucket, so we get three hash values, and then the other five bytes here are going to say that it's empty,
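
The bit-splitting just described can be sketched directly. This follows the talk's narration (a few top bits index the directory, the bottom seven bits become the per-element control byte); the runtime's exact bit positions and names may differ.

```go
package main

import "fmt"

// emptySlot is the control byte for an empty slot: high bit set,
// as described in the talk.
const emptySlot = 0b1000_0000

// split divides a 64-bit hash the way the talk describes: the top
// dirBits bits select a directory entry, and the low 7 bits become the
// control byte stored in the bucket's metadata word. (The bits in
// between, down to bit 57 in the talk's example, pick a bucket within
// the table; that part is omitted here.)
func split(hash uint64, dirBits uint) (dir uint64, ctrl uint8) {
	dir = hash >> (64 - dirBits) // top bits index the directory
	ctrl = uint8(hash & 0x7f)    // low 7 bits; high bit stays 0 = occupied
	return
}

func main() {
	hash := uint64(0xDEADBEEFCAFEF00D)
	dir, ctrl := split(hash, 2) // 4 directory entries -> 2 bits
	fmt.Printf("directory entry %d, control byte 0x%02x\n", dir, ctrl)
}
```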

12:37.060 --> 12:50.180
that's the bit pattern: a one at the top means empty. Let's look at an example of how this

12:50.900 --> 13:00.420
works in the code, so I've made up some numbers here; once again, the 0x80 means empty,

13:01.380 --> 13:09.300
we're going to look for something that has the seven bits equal to three, can anyone see it,

13:10.740 --> 13:20.500
yes, it's easier if you're human. But if you're a computer? So, the obvious way to do this,

13:20.500 --> 13:28.420
and in fact the Go 1.23 implementation loops. So a little for loop that says: is it

13:28.500 --> 13:36.260
the first one, is it the second one, is it the third one. Loops are kind of slow, so cache misses and

13:36.260 --> 13:47.380
branches are the two things that are slow right at the lowest level in your CPU, so we do something

13:47.460 --> 13:56.660
extremely funky (I say we; I didn't write this): we replicate that byte eight times,

13:58.500 --> 14:08.260
we exclusive-OR it. Seriously. So the one that we're looking for is the one where we got zero,

14:08.660 --> 14:18.660
so we do a funky thing to basically turn the zero into FF and then we mask that down to just one bit,

14:21.060 --> 14:28.740
so in just a few instructions with no branches we have found the element we're looking for
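
Since this trick is plain integer arithmetic, it can be written out in Go. This follows the description above (replicate the byte, XOR so matches become zero, then turn zero bytes into 0x80 with the classic subtract/AND-NOT trick); the runtime's real code differs in detail. Note that the zero-byte trick can itself report rare false positives when a borrow propagates, which is harmless here because, as the talk says later, every candidate is confirmed against the full key anyway.

```go
package main

import "fmt"

const (
	lo = 0x0101010101010101 // 0x01 in every byte
	hi = 0x8080808080808080 // 0x80 in every byte
)

// matchByte returns a word with 0x80 in every byte position of ctrl that
// equals b (b must be a 7-bit value, i.e. an occupied slot's control byte).
// No loops, no branches.
func matchByte(ctrl uint64, b uint8) uint64 {
	x := ctrl ^ (lo * uint64(b)) // multiply broadcasts b; matching bytes become zero
	return (x - lo) &^ x & hi    // zero bytes -> 0x80, other bytes -> 0x00
}

func main() {
	// Three occupied slots with control bytes 0x03, 0x51, 0x22 (byte 0 is
	// the lowest byte); the remaining five slots are empty (0x80), as in
	// the talk's example.
	ctrl := uint64(0x8080808080225103)
	fmt.Printf("%016x\n", matchByte(ctrl, 0x03)) // 0x80 in byte 0: slot 0 matches
}
```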

14:29.700 --> 14:39.300
and I thought you might think I was lying, so I used Godbolt, so this is the actual

14:40.340 --> 14:45.700
machine code for this function. The Go code is on the left; it kind of looks like noise,

14:48.580 --> 14:50.980
the machine code is easier to understand,

14:51.380 --> 15:06.660
and that is, it's on the right here, and yeah, so we take the thing, we multiply it by that,

15:06.660 --> 15:13.060
I mean that's the decimal version of that 0101 0101 thing; we do a multiply, we do an XOR,

15:13.060 --> 15:26.420
an ADD, a NOT, an AND, and done. No branches. So that's pretty cool, but there's more: if you read the description

15:26.420 --> 15:36.900
of this talk, I promised you SIMD. What is SIMD? Well, this is not SIMD. The sort of typical

15:36.900 --> 15:40.420
traditional way a computer works: you have a stream of instructions, you have a stream of data,

15:41.380 --> 15:47.620
and we compute some sort of results; the green bit is the logical processing unit in

15:47.620 --> 15:54.820
the middle of your your computer, so this is Cindy, we still have one stream of instructions,

15:54.820 --> 16:01.380
but we have multiple sets of data coming in at once and producing multiple sets of results,

16:02.340 --> 16:09.140
so this kind of idea was invented really for graphics, and it can also be used in

16:10.020 --> 16:16.500
sound processing, but once it's in the CPU, people use it for all kinds of things, and

16:17.940 --> 16:25.300
we're going to use it to look up in a map. How do you do that? How do you write SIMD in Go?

16:25.300 --> 16:36.420
Well, you can't. I mean, the only way this works is inside the compiler. So this is in the Go

16:36.420 --> 16:46.660
compiler source code, internal/ssagen/intrinsics.go; it matches on the name of that function,

16:47.300 --> 16:59.940
which we had earlier. There's an environment variable, GOAMD64, which takes a value v1, v2, v3, or v4,

16:59.940 --> 17:06.740
and that tells the Go compiler how capable your processor is. The default is v1, which is like

17:06.740 --> 17:17.300
a, I don't know, Pentium from 1980 or something that nobody uses; every processor pretty much has

17:17.300 --> 17:27.300
got these nice SIMD instructions, but you do need to set the variable to v2 to convince the compiler

17:27.300 --> 17:38.500
that it should be using them. And yeah, I couldn't use Godbolt, because you can't set the environment variable,

17:38.500 --> 17:44.180
and you can't persuade Godbolt to compile that exact function, so I can't show you it that way,

17:44.980 --> 17:51.300
but I'll show you, I'll show you in this visualization how the SIMD instructions work,

17:51.380 --> 17:57.860
I mean the basic point of this is that things like that multiply that we saw earlier,

17:57.860 --> 18:04.260
a multiply is a pretty slow instruction in the CPU; it might take 14 nanoseconds,

18:05.700 --> 18:11.140
whereas most of the instructions we're going to look at here take one nanosecond, plus or minus;

18:12.100 --> 18:20.820
I lie a lot. Okay, so we saw this before, right: these are the seven-bit hashes,

18:21.300 --> 18:31.380
three of them, and 0x80 means empty. This instruction is called a broadcast, so we go from the

18:32.500 --> 18:37.220
one byte to eight bytes all the same in a single instruction in a single clock cycle.

18:38.180 --> 18:45.940
We don't have to do an XOR and an AND and a NOT, because there's a single instruction

18:47.220 --> 18:52.500
that matches each byte individually, so we find the one that we're looking for,

18:53.780 --> 19:03.540
and then we squish that down to a single byte where one bit tells us which one we found it at,

19:04.500 --> 19:09.940
and it's a little confusing because the bytes and the bits are the opposite ways round, so

19:10.820 --> 19:20.900
blame Intel. So, yeah, this is about as complicated as my talk gets,

19:22.100 --> 19:28.260
so what are we looking at here? We've used SIMD instructions, where each instruction

19:28.420 --> 19:36.500
executes more or less in one clock cycle. From a bucket of up to eight elements,

19:36.500 --> 19:44.100
we've found the one, in my example, that matches. Because it's only seven bits, there's a one

19:44.100 --> 19:51.940
in 128 chance that you get a false positive, so it's a little bit better than 99% of the time,

19:51.940 --> 19:59.300
you find the exact right one instantly. But we need to check

19:59.300 --> 20:03.940
the whole key anyway; once you find a match, we need to check the whole key,

20:05.300 --> 20:12.100
because the hashes are small anyway. So, false positives are a little bit of a drag,

20:12.900 --> 20:20.740
but less than one percent of the time. And so this, it blows my mind; this is really why I

20:20.820 --> 20:28.100
wanted to do this talk, because this code is so clever and so tight, and I believe this is

20:28.100 --> 20:36.020
down to Jeff Dean and Sanjay, again, who are the sort of superbeings of Google.

20:38.260 --> 20:44.660
All right, what else we got? More benchmarks. This is memory size.

20:45.620 --> 21:00.500
The Go 1.24 map is generally a bit smaller, and sometimes it's a lot smaller, so let me just explain

21:00.500 --> 21:06.820
how I drew this chart, how I measured this. I actually, when you run go benchmarks, it gives you a

21:06.820 --> 21:12.580
memory number, but that includes memory that was allocated and freed. This is just the live

21:13.540 --> 21:17.380
memory. The stuff that's left live, what your program actually needs to run.
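
One way to take that measurement yourself is sketched below, along the lines described: force a GC and read the live heap before and after building the map, so freed intermediate buckets don't count. The function names here are my own; this is not how the talk's chart was necessarily produced.

```go
package main

import (
	"fmt"
	"runtime"
)

// keep is package-level so the map stays reachable while we measure.
var keep map[int64]int64

// liveBytes forces a garbage collection, then reports live heap bytes,
// so memory that was allocated and already freed is excluded.
func liveBytes() int64 {
	runtime.GC()
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return int64(ms.HeapAlloc)
}

// mapLiveBytes reports roughly how much live heap a map of n elements
// retains: allocated minus freed, not the cumulative figure that Go
// benchmarks report.
func mapLiveBytes(n int64) int64 {
	before := liveBytes()
	keep = make(map[int64]int64) // no size hint: the map grows as we insert
	for i := int64(0); i < n; i++ {
		keep[i] = i
	}
	return liveBytes() - before
}

func main() {
	fmt.Printf("live memory for 100000 elements: about %d KiB\n",
		mapLiveBytes(100_000)/1024)
}
```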

21:19.860 --> 21:25.300
And what I do is I call make, and I don't give it the right size. I don't give it any size at all.

21:25.300 --> 21:31.220
So the map has to grow, and what we see with those humps is all of those overflow buckets,

21:32.180 --> 21:40.740
and then the doubling. Those two things give you this massive jump. So the Go 1.24 map can be a

21:40.820 --> 21:48.980
lot smaller, just depending on how unlucky you get. It tends to always be a bit smaller,

21:50.100 --> 21:56.820
because it is packed tighter. What we call the load factor, how many elements it will allow

21:56.820 --> 22:01.380
in a map before doubling the size, went up slightly.

22:01.860 --> 22:10.820
Yeah, I just wanted to point out the overflow buckets; that's why those big humps are there: the overflow

22:10.820 --> 22:24.820
buckets. Okay, shut up. This is the worst case for memory. So this is a map of an empty struct,

22:25.780 --> 22:32.260
right, a struct with nothing in it, and people use this as a set type; we just know if it's

22:32.260 --> 22:41.780
in the map or not. The Go 1.24 map ends up allocating eight bytes for each empty struct,

22:43.140 --> 22:54.500
which is a bit of a mistake. Sorry. So yeah, this is showing the Go 1.24 map,

22:54.580 --> 22:58.740
the Swiss map in the worst possible light from all the benchmarks and all the

22:58.740 --> 23:04.020
investigation that I did, this is the worst thing it does. Probably this will get fixed.

23:05.060 --> 23:12.100
Oh, by the way, the release. Oh, yeah, let me talk about that in a second. So this is the issue

23:12.100 --> 23:20.660
to track, 71368, if you want to know when they fix the empty struct bug, or inefficiency.

23:24.900 --> 23:34.180
The release of Go 1.24 is delayed, because I found this bug. It's not a bug in Go. This is a bug

23:34.180 --> 23:40.820
in the library called modern-go/reflect2, which I had not heard of, but it is used in

23:40.820 --> 23:50.660
json-iterator, which I use all the time. json-iterator was using unsafe techniques to look at the

23:50.740 --> 24:01.700
internals of map. The internals have changed a lot. It's broken. So there's going to be an RC3

24:03.300 --> 24:16.340
Go 1.24 RC3, like on Tuesday, roughly. So that pushes back the release by like another week.

24:16.660 --> 24:25.140
What they did: so this is the tracking bug, where the problem exists, but this library

24:25.140 --> 24:37.540
seems to be abandoned, so you know, just stop using it. So in Go, they basically added

24:37.540 --> 24:41.940
an extra layer of indirection, because that's the way you fix every problem in computing.

24:42.020 --> 24:51.140
So basically when you use this or any library that is going behind the back of map to look at

24:51.140 --> 25:00.580
the internals, it actually gets a compatibility layer, which now works. But adding the

25:00.580 --> 25:07.060
compatibility layer slows you down. So it only slows down the people using the back door

25:07.060 --> 25:14.900
access, the unsafe access. But that's kind of worth knowing about. All right, we're basically

25:14.900 --> 25:20.100
done. I wanted to put up all the people that really worked on it, just to be clear, I did not work on

25:20.100 --> 25:31.060
this. So very, very clever people. And yeah, I could take questions. I have some links here

25:31.060 --> 25:37.060
if, uh, and some credits.

25:43.460 --> 25:48.500
Are there any pressing questions? I see a raised hand right over there. Please speak as

25:48.500 --> 25:55.620
close as you can to the microphone. Thank you for the talk. If we go down to the

25:55.780 --> 26:07.460
instructions, it means there's already an architecture dependency on the processor? Yeah, yeah.

26:07.460 --> 26:19.300
So the code I showed was Intel/AMD code. And so they, they have not written, yeah, they've

26:19.300 --> 26:27.380
not written this code for ARM or any other architecture yet. But it can be written. And then the

26:27.380 --> 26:34.580
other version with the multiply, that does work on ARM. So it's already faster on ARM, but we don't

26:34.580 --> 26:42.100
have the SIMD implementation on ARM yet. Thank you. If you have more questions, he's

26:42.100 --> 26:49.940
going to be in the hallway. We have to switch speakers now, so one last round of applause.

