WEBVTT

00:00.000 --> 00:09.400
I'm going to do the introduction to this talk now.

00:09.400 --> 00:15.840
We are still in the Deep Dive Go track, and you heard us talk about garbage collection

00:15.840 --> 00:19.840
a lot, which reminds me: remember, if you have garbage you want to collect, the garbage

00:19.840 --> 00:21.840
bin is there.

00:21.840 --> 00:26.080
We are a garbage-collected language, I don't want to see any garbage at the end of the day,

00:26.080 --> 00:28.760
only if you do reference counting.

00:28.760 --> 00:34.200
But we are going to talk about going easy on memory, and Sümer here is an amazing person

00:34.200 --> 00:40.240
to talk about this, with a very cute graphic here, to explain to us how we can treat the

00:40.240 --> 00:44.720
memory of our servers and our computers way better. Round of applause.

00:44.720 --> 00:53.200
Thanks guys, thank you, I'm really happy to be here.

00:53.200 --> 00:59.480
So my name is Sümer Cip, I am a software engineer and a full-time dad of three monsters;

00:59.480 --> 01:04.600
basically I contribute to open source as much as I can, but these days it's a bit hard.

01:04.600 --> 01:10.720
I'm mostly interested in observability, distributed systems and databases. I have spent

01:10.720 --> 01:17.080
like all my life coding; basically I'm a systems guy. Before Go I was writing profilers

01:17.080 --> 01:22.280
in Python as a C extension, but we as a company decided to implement a continuous

01:22.360 --> 01:28.000
profiling ingestion pipeline in Golang, and it turned out to be an excellent idea.

01:28.000 --> 01:34.640
So my main motivation for this talk is that there is tons of material around

01:34.640 --> 01:38.640
how garbage collection works, what types of garbage collectors are out

01:38.640 --> 01:44.080
there, but there's less emphasis on writing garbage-collection-friendly code. What I mean is,

01:44.080 --> 01:50.240
I feel that when you're writing code you should know what kind of tricks you

01:50.320 --> 01:55.360
have in your bag, what you should do to write more efficient code in terms of garbage

01:55.360 --> 02:03.080
collection. And even though the topic is very language-agnostic, you mostly see the same

02:03.080 --> 02:06.080
things in Java as well.

02:06.080 --> 02:11.800
I aim to be as practical as possible, and let me first start with some real-world data. This

02:11.880 --> 02:20.880
is from one of my favorite conferences: Pekka, he is the CEO of Turso, which is a horizontally scalable

02:20.880 --> 02:26.560
SQLite database, and he mentions that avoiding dynamic memory management is the only

02:26.560 --> 02:31.560
thing that matters for low latency. It's kind of a bold statement, but I think,

02:31.560 --> 02:36.640
after some point, it becomes true, because you optimize all the low-hanging fruit in your

02:36.720 --> 02:42.240
code and maybe do lots of optimizations in your algorithms and data structures, but

02:42.240 --> 02:47.520
then what you're left with is just memory management, and you need to do much better in that area

02:47.520 --> 02:55.680
as well. And yeah, this slide is from a previous talk; I'm not sure if you

02:55.680 --> 03:04.200
can see everything, but this is Uber's horizontally scalable database, and if you

03:04.280 --> 03:10.440
sum some of these numbers, you will see that half of the time is actually spent on garbage

03:10.440 --> 03:16.120
collection, and this is from a production system. One other example is from this talk, and

03:16.120 --> 03:22.680
you can see the video. And this is, I think, a perfect pprof output: if you see the first function, it

03:22.680 --> 03:29.400
basically takes like 40% of the time and it's just runtime.gcBgMarkWorker; it's basically the marking

03:29.480 --> 03:36.840
process, the marking goroutine of the garbage collection. And if you want some more, this is from our

03:36.840 --> 03:41.720
continuous profiling ingestion pipeline. Here, like half of the time is actually

03:41.720 --> 03:48.360
the handler code, the ingestion code, and you can see that 25% of the time is actually spent on

03:48.360 --> 03:54.440
garbage collection as well. And this is a recent Datadog blog; again, this is a metric database

03:54.440 --> 04:00.600
from them. By the way, the code that I showed you from Uber was M3DB, which was their

04:00.600 --> 04:05.880
metric database as well. And this is from a recent blog post from Datadog: they switched

04:05.880 --> 04:14.680
to Rust because they were spending 30% of resources on garbage collection. And there are these GC-friendly libraries,

04:14.680 --> 04:21.640
there are lots of them, like zerolog and friends; all of them advertise

04:22.600 --> 04:27.960
themselves as being zero-allocation and that kind of stuff, and people just download

04:27.960 --> 04:35.160
them, which means that there's a real need for this. And yeah, in the Java land, I'm not sure

04:35.160 --> 04:41.960
if you have ever heard the term "death spiral"; that was a term I heard a lot

04:41.960 --> 04:48.840
in Java, from the guys sitting behind me, not myself. The death spiral means

04:48.840 --> 04:54.600
that your application is basically doing nothing useful, just trying to keep up with the pace

04:54.600 --> 05:00.440
of the allocations. So I'm going to give a very quick overview here, I'm going very fast,

05:01.480 --> 05:07.880
to just mention a bit about how memory works in Go. Basically we have two types,

05:07.880 --> 05:14.600
stack and heap, and both are in memory, of course. The stack is basically pre-allocated memory that

05:14.680 --> 05:20.200
dynamically grows: it starts small (a few kilobytes) by default for a goroutine, and then

05:20.200 --> 05:25.960
it can dynamically grow. It is fast because when you allocate, it's just incrementing

05:25.960 --> 05:32.440
and decrementing a pointer, so it's very fast, and the access patterns are very

05:32.440 --> 05:38.680
well known, so there's faster access, because most of the time when you enter a function you access

05:38.760 --> 05:46.280
variables in the same stack frame, which means that by nature your code,

05:46.280 --> 05:52.920
your stack memory, will be cache friendly, because it will be cached in L1, L2, whatever. And it's

05:52.920 --> 06:00.360
managed by the compiler in Go, which means whenever you compile the binary, the stack layout will be

06:00.360 --> 06:07.160
known. And the heap is dynamically allocated and managed by the garbage collector. The only thing I would

06:07.240 --> 06:13.960
like to say is that whenever a variable escapes to the heap, you can see it as pressure

06:13.960 --> 06:19.880
on the garbage collector. And one more thing I really would like to mention is that

06:19.880 --> 06:27.560
the stack is cache friendly, so it's mostly in L1; you're mostly accessing the same

06:27.560 --> 06:34.600
variables again and again, which puts them closer to the CPU, that's how CPUs work. And if you look

06:34.600 --> 06:39.800
at the numbers, you can see that accessing RAM is much, much worse than just being in the cache,

06:40.520 --> 06:46.360
and these numbers are usually hidden, you cannot see them in your CPU profiles whatsoever.

06:46.360 --> 06:52.440
So what I'm trying to say is: do your best to stay on the stack while writing code. If

06:52.440 --> 06:56.760
there's a possibility you can put that variable on the stack, please put it on the stack.
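The stack-versus-heap point can be sketched in a few lines (a toy example of mine, not from the slides); running `go build -gcflags=-m` on a file like this typically reports the pointer-returning version as escaping:

```go
package main

import "fmt"

type point struct{ x, y int }

// newPointPtr returns a pointer: the value must outlive the call, so
// escape analysis moves it to the heap, which is extra GC work.
func newPointPtr() *point {
	p := point{x: 1, y: 2}
	return &p
}

// newPointVal returns a copy, which can stay on the caller's stack.
func newPointVal() point {
	return point{x: 1, y: 2}
}

func main() {
	hp := newPointPtr()
	vp := newPointVal()
	fmt.Println(hp.x+hp.y, vp.x+vp.y) // 3 3
}
```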

06:57.400 --> 07:02.360
And before moving forward into the tricks, I also would like to mention a little bit

07:02.360 --> 07:10.360
about understanding garbage collection, how garbage collection works. Before that, I just want to

07:10.360 --> 07:17.160
say that it is a very complex piece of software, and that is probably due to the inherently wide range of requirements

07:17.160 --> 07:24.280
of all the applications it needs to support. The allocation rate of an application can be very high,

07:24.280 --> 07:31.240
the volume can be very high, there might be few goroutines or there might be millions of goroutines

07:31.240 --> 07:36.360
running in your application, there are fragmentation issues it needs to consider, and there's also

07:36.360 --> 07:43.720
the pacing: it needs to basically say no to allocations when the time comes,

07:43.720 --> 07:47.880
and it needs to do it with minimal latency, otherwise people just switch to other languages like

07:47.880 --> 07:55.800
Rust. So yeah, how does garbage collection work? It basically works in three phases. This is again

07:55.800 --> 08:03.400
like a two-thousand-foot overview of this, so I don't want to shortchange the people who work on this,

08:03.400 --> 08:10.600
because it is extremely complex software. So the first phase, this is called the initial

08:10.600 --> 08:16.680
mark phase, and there is this stop-the-world. What stop-the-world

08:16.680 --> 08:22.840
means is that it basically stops your whole application, nothing runs in this period. This

08:22.840 --> 08:31.320
is done because it needs to get a consistent snapshot of the variables that point into the heap.

08:31.320 --> 08:36.520
So it basically stops the application, nothing is running; it then walks over the stacks of your

08:36.520 --> 08:42.840
goroutines, trying to identify which ones are pointers. There are lots of optimizations

08:42.840 --> 08:48.680
in doing so, but I'm just saying, in this initial phase it is stopping the world, so effectively

08:48.680 --> 08:55.640
it means a lot for the latency. And in the second phase it can then proceed concurrently, because

08:55.640 --> 09:02.680
in the initial phase it basically enabled some write barriers. What write barriers mean

09:02.680 --> 09:09.880
is that in this concurrent mark phase, when the write barriers are in place, you can

09:09.880 --> 09:16.280
basically get the updates or writes that happen to those pointers, and then you can

09:16.360 --> 09:22.680
basically detect them. You can think of a write barrier as a callback on your pointer changes.

09:24.200 --> 09:31.400
So this all happens concurrently, and after marking all the reachable objects in the heap,

09:31.400 --> 09:36.120
the unreachable ones are just swept, which is the third phase of the garbage collection.

09:36.120 --> 09:42.520
What I would like to highlight is that it

09:42.600 --> 09:47.640
might be very unpredictable, because with the stop-the-world you basically stop the

09:47.640 --> 09:53.000
application, and the garbage collection is also very dependent on your application, because it is

09:53.000 --> 10:00.360
linear in the number of pointers you have on the heap. And it acts very differently under pressure:

10:00.360 --> 10:06.680
whenever you are near the limit, under pressure, it acts differently, it basically kicks in more

10:06.840 --> 10:12.600
and more. And one thing that's overlooked is the CPU cache flushing that I talked about before:

10:12.600 --> 10:17.960
before garbage collection it is like this, everything is green and everything works perfectly, but after

10:17.960 --> 10:23.400
garbage collection the cache is just empty, because if you think about it, garbage collection

10:23.400 --> 10:31.320
basically traverses the whole heap, so it is very cache-unfriendly, if you think about it.

10:31.400 --> 10:36.600
So to wrap up: garbage collection can have high impact, we saw from real-world data that it can make up

10:36.600 --> 10:44.200
40-50%, it usually goes unnoticed, and it can be unpredictable. So let's keep the garbage collector happy.

10:46.360 --> 10:54.040
Okay, I shamelessly copy-pasted this from Bryan's talk, hello Bryan. This is from

10:54.040 --> 11:01.480
his talk: reuse, reduce, recycle. This is a mantra, and my motto

11:01.480 --> 11:06.920
is basically just to eliminate waste, and I think it's a very nice analogy for this. So let's start with

11:06.920 --> 11:14.760
reduce, basically. By the way, reduce is not only about reducing allocations or pointers,

11:15.960 --> 11:23.720
it is also about reducing size; reducing any size of your memory always has compounding benefits.

11:24.360 --> 11:31.640
Like, I was a Python guy, and I know that over a series of

11:32.120 --> 11:38.680
releases they have shrunk the base object (everything is an object in Python, by the way), and

11:38.680 --> 11:46.760
basically they shrank its size by 95% and saw something like 60% improvement on runtime without doing any optimization.

11:46.760 --> 11:53.720
That's how you benefit from cache locality and that kind of stuff, so mechanical sympathy is very

11:53.720 --> 12:00.200
important. Yeah, I have written a simple blog post about it. So the first thing I would like

12:00.200 --> 12:06.760
to mention is the stack versus heap. So whenever you return some reference type or

12:06.760 --> 12:14.280
pointer, it basically escapes to the heap. So if it is possible, you should

12:14.280 --> 12:21.880
try to design your APIs such that you accept it instead of returning it; the

12:21.880 --> 12:27.880
Reader interface is a good example. I saw this example in Jacob's talk; he mentions that

12:27.880 --> 12:34.680
whenever you return it, it means that the allocation is already done, it escaped to the heap, but if you

12:34.680 --> 12:39.800
accept it, there is a chance that you can allocate it on the stack, and whenever you call the function

12:40.600 --> 12:48.280
it might still be on the stack. There is no rule on that; so returning escapes to the heap, but

12:48.280 --> 12:56.520
passing it in does not have to. And yeah, closure variables are tricky; if you use closure variables,

12:56.520 --> 13:03.720
just be cautious about them, they might escape to the heap. Interfaces and generics escape to the heap as

13:03.720 --> 13:10.200
well, because the compiler doesn't know the type and thus the size, and they're slower on hot paths, so please

13:10.840 --> 13:17.880
prefer using concrete types.
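Coming back to accepting instead of returning: a hedged sketch (the names are mine, the shape is modeled on io.Reader's Read([]byte)) where the caller owns the buffer, so the callee never has to allocate:

```go
package main

import "fmt"

// fillGreeting writes into a caller-supplied buffer instead of returning a
// fresh []byte. A returned slice would always escape to the heap; here the
// caller decides where the buffer lives.
func fillGreeting(buf []byte) int {
	return copy(buf, "hello")
}

func main() {
	var buf [16]byte // an array on the caller's stack
	n := fillGreeting(buf[:])
	fmt.Println(string(buf[:n])) // hello
}
```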

13:17.880 --> 13:24.600
So the most important thing, I think, is this line here in this talk: avoiding pointers. I was mentioning some mindset before

13:24.600 --> 13:31.320
this talk, and I think it is this: whenever you use a pointer, just try to be

13:31.320 --> 13:38.760
mindful about it, because it basically means more garbage collection pressure; it is linear in the number of

13:38.760 --> 13:44.840
pointers, and if you don't use pointers, it can skip entire regions without pointers; I'm talking

13:44.840 --> 13:51.080
about the garbage collector. It is also more cache friendly not to use pointers, by the way,

13:51.720 --> 13:56.840
and the compiler generates extra checks, like nil checks that can panic, for example; you need to have some checks

13:56.840 --> 14:03.160
in the compiled output. And sometimes these pointers can go unnoticed: I didn't know before

14:03.160 --> 14:08.760
this talk that time.Time, for example, has a pointer inside, so if you, for example, have a slice of

14:08.840 --> 14:14.360
time.Time objects, they basically contain a pointer, which means garbage collection pressure.

14:16.120 --> 14:23.480
Yeah, string, time.Time, they all contain pointers inside, so be careful: maps with reference types,

14:23.480 --> 14:31.720
slice values, string keys, and many more. So one technique that is being used is that

14:31.720 --> 14:37.800
if you have this kind of struct, basically two fields, region and tenant ID, instead

14:37.800 --> 14:44.760
of formatting them into a string and using that string as a key, please just use the struct as a key,

14:44.760 --> 14:54.360
because you will be avoiding pointers for free. And this one is also interesting for me as

14:54.360 --> 14:59.560
well; I don't know if you remember the Swiss map talk from Bryan, the previous

15:00.360 --> 15:07.400
talk, where he was measuring the buckets. So if you allocate a struct value or key of

15:08.120 --> 15:17.960
more than 128 bytes, this means that the map implementation needs to allocate more memory and use a

15:17.960 --> 15:25.480
pointer instead of inlining the value inside the bucket, but if you are smaller than

15:26.120 --> 15:33.080
this special value, you will be inlined integrally, so there is no extra allocation. You can

15:33.080 --> 15:37.800
test it yourself, just benchmark the code with allocation counts, and you can see that it is different.
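A sketch of such a benchmark (the field names are mine): testing.AllocsPerRun makes the struct-key versus formatted-string-key difference visible:

```go
package main

import (
	"fmt"
	"testing"
)

// tenantKey is a comparable struct that can be used directly as a map key,
// avoiding the per-lookup string allocation.
type tenantKey struct {
	Region   string
	TenantID int
}

func main() {
	byString := map[string]int{"eu-1": 7}
	byStruct := map[tenantKey]int{{Region: "eu", TenantID: 1}: 7}

	stringAllocs := testing.AllocsPerRun(100, func() {
		_ = byString[fmt.Sprintf("%s-%d", "eu", 1)] // builds a key string every time
	})
	structAllocs := testing.AllocsPerRun(100, func() {
		_ = byStruct[tenantKey{Region: "eu", TenantID: 1}] // no allocation
	})
	fmt.Println(stringAllocs > structAllocs) // true
}
```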

15:37.800 --> 15:49.320
I'll speed up for this one. Yeah, one more thing is that, for guys coming from the C

15:49.400 --> 15:57.000
world like me, there is this myth that copying is expensive, but if you think about it in terms

15:57.000 --> 16:06.120
of CPU, it is a myth, because basically copying 64 bytes is just the same as copying

16:06.120 --> 16:15.160
a pointer, because the CPU and RAM just operate at cache-line granularity. Yeah, I mean,

16:15.320 --> 16:22.520
prefer to use non-pointer versions of data structures, like linked lists; you would be amazed

16:22.520 --> 16:28.840
how many data structures can be implemented without using pointers.

16:28.840 --> 16:34.440
The next one might not be about avoiding pointers, but I just wanted to mention it because this bit me in production. There is a big

16:34.440 --> 16:40.360
protobuf payload we used in our ingestion pipeline, and we were caching a small

16:40.360 --> 16:47.080
struct from inside it, which was not a pointer. If you do this kind of thing, I mean, the big

16:47.080 --> 16:54.440
struct will not get deallocated. It is obvious, but just keep it in mind while doing

16:54.440 --> 17:01.400
these kinds of things, because it got us. And yeah, it took us lots of time;

17:01.400 --> 17:09.160
our application was running out of memory like crazy. So basically, avoid holding references inside large

17:09.160 --> 17:16.840
objects. And remember the zero-allocation libraries: use them, they're awesome, there are lots of them,

17:16.840 --> 17:22.440
and if you wonder how they work, the main trick is just pre-allocating memory and using integer

17:22.440 --> 17:28.840
indexes to reference objects, so it's basically just another way of avoiding pointers, that's all.
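That trick can be sketched like this (a toy of mine, not any particular library): nodes live in one pre-allocated slice and reference each other by index, so the GC sees a single pointer-free backing array:

```go
package main

import "fmt"

// node refers to the next element by slice index instead of a pointer;
// -1 means "end of list".
type node struct {
	value int
	next  int32
}

// push prepends value to the list and returns the new head index.
func push(nodes *[]node, head int32, value int) int32 {
	*nodes = append(*nodes, node{value: value, next: head})
	return int32(len(*nodes) - 1)
}

// sum walks the list by index; there are no pointers for the GC to chase.
func sum(nodes []node, head int32) int {
	total := 0
	for i := head; i >= 0; i = nodes[i].next {
		total += nodes[i].value
	}
	return total
}

func main() {
	nodes := make([]node, 0, 8) // pre-allocated, pointer-free backing array
	head := int32(-1)
	for v := 1; v <= 3; v++ {
		head = push(&nodes, head, v)
	}
	fmt.Println(sum(nodes, head)) // 6
}
```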

17:29.000 --> 17:38.680
And reusing, okay, reuse. I think it should start with sync.Pool, definitely, because it's the basic

17:38.680 --> 17:46.440
tool. sync.Pool, again for people coming from the C world: sync.Pool is not a free list, you cannot just

17:46.440 --> 17:54.120
put something in and wait for it to be present whenever you request it. It is different, because

17:54.200 --> 18:02.040
the values, when you put them in the sync.Pool, they can be garbage collected. So the main

18:02.040 --> 18:10.120
trick here is that if you would like to use sync.Pool, you should have allocations that you would

18:10.120 --> 18:18.760
like to reuse between two garbage collection cycles, basically. I say two because whenever you put

18:18.840 --> 18:26.040
something into the sync.Pool, it stays one more time, it persists through the garbage collection

18:26.040 --> 18:34.600
cycle one more time. And one more thing is that it is defined in sync, this one. Yeah, it is a very

18:34.600 --> 18:41.000
simple thing, but I mean, if you think about it, sometimes you don't see it: it's not in some

18:41.000 --> 18:46.600
memory package, mempool or something like that, it's sync.Pool, because it is very, very optimized for

18:46.600 --> 18:52.600
concurrent access; there are no locks inside the implementation of sync.Pool. There is a very good

18:52.600 --> 18:57.800
blog post about it from VictoriaMetrics that I just read, and basically what it does is, under the

18:57.800 --> 19:06.840
hood in the runtime, it has per-P caches, it basically uses very small caches, but it somehow

19:06.840 --> 19:16.120
manages to do it without using any mutex. Yeah, it is very useful but a bit misunderstood, so you need to understand it very well.
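A minimal sync.Pool sketch (my own): New supplies a value on a cold Get, you Reset before Put, and you never assume a value survives a collection:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

var bufPool = sync.Pool{
	// Called only when the pool is empty; pooled items may be dropped
	// by the garbage collector at any cycle.
	New: func() any { return new(bytes.Buffer) },
}

func render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset() // scrub before returning; never reuse stale contents
		bufPool.Put(buf)
	}()
	buf.WriteString("hello ")
	buf.WriteString(name)
	return buf.String()
}

func main() {
	fmt.Println(render("gophers")) // hello gophers
}
```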

19:16.280 --> 19:23.480
One of the tricky things: be careful with returning non-pointers.

19:23.480 --> 19:31.720
Why? Because, if you look at the example in red here, if you take a slice header

19:33.320 --> 19:41.480
and return it like that, what happens is that when you put the buffer as a slice,

19:42.280 --> 19:49.640
as Put accepts an interface, there is a conversion, so the buffer, the slice

19:49.640 --> 19:55.000
header, escapes to the heap, which means that you have another, unnecessary, allocation. So

19:57.080 --> 20:04.040
the reason you use sync.Pool is reusing, not allocating, but here you are

20:04.040 --> 20:09.320
allocating more, so this is exactly the opposite: you optimized for allocations but

20:09.320 --> 20:15.800
introduced one for the next cycle. And yeah, this might be hard to spot in production,

20:15.800 --> 20:21.400
and that's why we have this staticcheck rule; please just use staticcheck on your code as well.

20:22.440 --> 20:27.960
Sorry, if you read the staticcheck explanation: when passing a value that is not a pointer to a function,

20:28.680 --> 20:33.800
the value needs to be placed on the heap, which means an additional allocation, and this is the reason.
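The usual fix for that rule can be sketched as follows (my version): pool a *[]byte so Put receives one pointer instead of heap-allocating a slice header on every call:

```go
package main

import (
	"fmt"
	"sync"
)

var slicePool = sync.Pool{
	// Store *[]byte: putting a bare []byte into the any-typed pool would
	// allocate a fresh slice header on every Put (staticcheck SA6002).
	New: func() any {
		b := make([]byte, 0, 1024)
		return &b
	},
}

func useBuffer() int {
	bp := slicePool.Get().(*[]byte)
	b := (*bp)[:0] // reuse capacity, discard old contents
	b = append(b, "some payload"...)
	n := len(b)
	*bp = b
	slicePool.Put(bp) // puts a pointer; no header allocation
	return n
}

func main() {
	fmt.Println(useBuffer()) // 12
}
```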

20:34.280 --> 20:41.560
And I'm not going into too much detail on this one, this is very simple, I guess: if you

20:41.560 --> 20:48.120
allocate, basically, a slice of numbers, and if you have any chance of reusing it, please

20:48.120 --> 20:55.160
reuse it; you can just point back to the start of the slice with this trick, the append trick. And please, again, if

20:55.640 --> 21:03.640
you allocate a map or a slice, big or small, it doesn't matter, please pre-allocate, because

21:04.280 --> 21:12.200
if you let your slice or map grow dynamically, it might lead to fragmentation as well.
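Both ideas in one small sketch of mine: pre-size with make to avoid growth reallocations, then reset with s = s[:0] so later batches reuse the same backing array:

```go
package main

import "fmt"

func main() {
	// Pre-allocate: one allocation up front instead of repeated growth.
	batch := make([]int, 0, 64)

	for round := 0; round < 3; round++ {
		batch = batch[:0] // length back to zero; capacity (memory) is kept
		for i := 0; i < 10; i++ {
			batch = append(batch, i)
		}
	}
	fmt.Println(len(batch), cap(batch)) // 10 64
}
```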

21:12.200 --> 21:19.640
And yeah, strings.Builder, bytes.Buffer, all that stuff:

21:20.280 --> 21:27.880
they're optimized for making no intermediate allocations as well.
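For example (my own snippet), strings.Builder appends into one growing buffer instead of allocating a new string for every concatenation:

```go
package main

import (
	"fmt"
	"strings"
)

func join(words []string) string {
	var b strings.Builder
	b.Grow(32) // optional: pre-size to skip intermediate growth
	for i, w := range words {
		if i > 0 {
			b.WriteByte(' ')
		}
		b.WriteString(w)
	}
	return b.String() // single final allocation for the result string
}

func main() {
	fmt.Println(join([]string{"no", "intermediate", "garbage"}))
}
```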

21:27.880 --> 21:32.920
Oops, okay, recycle. So for me, recycle is all about tuning the garbage collector, tuning its parameters.

21:33.480 --> 21:40.680
And if you look at it, I really think that the Golang team has done a great job at

21:40.680 --> 21:45.800
abstracting all these garbage collection details into two basic configuration parameters.

21:46.440 --> 21:54.200
One of them is GOGC, which means that, okay, if you set it to 100 percent, for

21:54.200 --> 22:01.160
example, what it means is: I have the current heap, it is X bytes; whenever it

22:01.160 --> 22:08.280
doubles, becomes 100 percent more, just trigger the GC. That's

22:08.280 --> 22:14.280
what it does. And GOMEMLIMIT is basically saying: okay, this is the hard limit

22:14.280 --> 22:21.240
for the Go runtime, do not exceed this threshold. But it is also a nice thing, it is also

22:21.240 --> 22:26.440
a soft limit from the operating system's point of view, because you would like to avoid out-of-memory errors;

22:26.440 --> 22:32.360
whenever out-of-memory occurs, it might end very weirdly. This gives you a chance to set a

22:32.360 --> 22:39.320
proper value to avoid going out of memory, and the Go runtime can act accordingly. And by the way,

22:39.320 --> 22:46.360
when the garbage collector nears this limit, this GOMEMLIMIT, it acts more aggressively to

22:46.360 --> 22:53.480
try to reduce memory. There is a nice visualization tool where you can just set GOGC and the memory limit and play with it.

22:54.920 --> 23:01.560
Okay, tools, very briefly, I have a few minutes left, so this is a quick overview, because I think Go is

23:01.560 --> 23:07.480
unparalleled when it comes to observability; in the runtime you have lots and lots of tools.

23:07.640 --> 23:13.480
I'm not going into much detail on them, there's tons of material. I just would like to mention

23:13.480 --> 23:20.120
profiling memory: you can basically use heap live object counts and sizes to debug memory leaks, you can use

23:20.120 --> 23:26.600
allocation counts at per-function and per-line level. By the way, line level is very interesting as

23:26.600 --> 23:33.400
well: you can basically observe allocation frequency and maybe think about a pool. And this is again,

23:34.280 --> 23:41.960
like, I'm ashamed, I was really, really lazy to do it myself, so I used Bryan's talk again,

23:41.960 --> 23:47.960
but this is line profiling, I'm still amazed by it. You can do line profiling and it shows you

23:47.960 --> 23:55.400
per-line allocation information. And escape analysis: again, I said try to be on the

23:55.400 --> 24:01.880
stack, not the heap, so this is your tool for that. If you run this tool, it basically compiles and

24:01.880 --> 24:07.480
tells you whether a variable will be on the stack or not. So you can use this tool to decide if you're on the stack; sometimes it's not easy to spot.

24:07.480 --> 24:13.960
And one tool that I'd like to mention, because I

24:20.840 --> 24:27.720
feel it's kind of underrated: the execution tracer. I feel that it is the most

24:27.720 --> 24:34.600
cinematic visualization. What I mean by that is that it offers the most realistic view of what happens

24:34.600 --> 24:41.480
inside your application, because it shows the time and it shows the important events happening

24:41.480 --> 24:46.920
in your application. It can be a lock acquire, it can be garbage collection, whatever:

24:46.920 --> 24:54.120
you will see a timeline of events happening, and you can also see the garbage collection

24:54.120 --> 24:59.560
pressure and phases, and the latency, like stop-the-world; every phase that I just mentioned,

24:59.560 --> 25:05.800
you can see and visualize. And yeah, you can debug lock contention issues as well. And it is

25:05.800 --> 25:12.600
safe on production, kind of safe. Why do I say that? Because with Go 1.21 the overhead dropped to

25:12.600 --> 25:20.040
one to two percent of runtime; thanks, kudos, to Felix Geisendörfer, who works on these

25:20.040 --> 25:27.320
kinds of topics; basically they optimized the stack unwinding code to drop it from 10 percent. So it is

25:28.520 --> 25:35.320
kind of safe to use on production. So yeah, this is how it looks, if you haven't seen it already.

25:35.320 --> 25:40.200
And for me there's one more environment variable, I don't know how to put it, but it is for me

25:41.400 --> 25:48.840
kind of a CLI way of using the execution tracer; I just would like to mention it.

25:41.400 --> 25:48.840
Okay, I would like to just wrap up. Reduce: prefer stack over heap if you can, avoid pointers,

25:49.080 --> 25:54.520
make it a habit. I mean, there's no way you can completely avoid pointers, I'm just saying, be in the

25:54.520 --> 26:01.000
mindset that they put pressure on garbage collection. Use them sparingly, I mean, use interfaces and

26:01.000 --> 26:09.000
generics sparingly. sync.Pool is your friend, but understand it very well. Reuse: pre-allocate maps and

26:09.000 --> 26:17.800
slices whenever possible. And use observability tools; I think any time spent on using observability

26:17.880 --> 26:25.800
tools is just time well spent, I mean profiles, benchmarks, the execution tracer. And one bonus one: I saw

26:27.000 --> 26:35.480
the morning talk from Martin about memory regions. I couldn't decide, until an hour ago, whether

26:35.480 --> 26:43.480
I should put this in or not, but memory regions: before this, arenas were proposed, which is

26:43.560 --> 26:50.120
a way of, like, implementing your own stack in some sense; basically you say that

26:50.440 --> 26:58.120
I don't want to involve garbage collection at all in this namespace. And regions are awesome, and

26:59.720 --> 27:06.680
they're better than arenas, because arenas were just found to be less ergonomic in the previous

27:06.680 --> 27:13.960
implementation. But I don't know it deeply, I think it will be good,

27:13.960 --> 27:17.000
and it has minimal garbage collection impact as well. With that, thank you very much.

