WEBVTT

00:00.000 --> 00:09.000
Thank you, everyone, for coming to this.

00:09.000 --> 00:12.000
The objective of this talk is to give a few pointers and

00:12.000 --> 00:16.000
ideas on how you can maximize performance when using

00:16.000 --> 00:20.000
gRPC, and specifically the Go implementation of gRPC.

00:20.000 --> 00:25.000
A lot of people are already familiar with gRPC, but I will just give a

00:25.000 --> 00:28.000
basic introduction to re-explain what it is.

00:28.000 --> 00:31.000
gRPC is an RPC protocol made by Google,

00:31.000 --> 00:34.000
that takes a schema-first approach.

00:34.000 --> 00:38.000
So you first need to define your service specification in a

00:38.000 --> 00:39.000
protobuf file.

00:39.000 --> 00:42.000
You will list all of the endpoints that exist on your service.

00:42.000 --> 00:47.000
All the message types that they take as input and that they output.

00:47.000 --> 00:51.000
From this specification, you can then generate

00:51.000 --> 00:57.000
client and server code in any language that you want.

00:57.000 --> 01:00.000
It means that with gRPC it is easy to have

01:00.000 --> 01:03.000
cross-language support, to have a client written in one language

01:03.000 --> 01:07.000
talking to a server written in another language without having to do

01:07.000 --> 01:08.000
much work.

01:08.000 --> 01:13.000
gRPC is based on top of HTTP/2 and it uses binary encoding for

01:13.000 --> 01:15.000
the payloads.

01:15.000 --> 01:20.000
It supports either unary RPC or streams.

01:20.000 --> 01:25.000
So let's have a look at how it would work to do a unary

01:25.000 --> 01:27.000
request.

01:27.000 --> 01:32.000
I define in a proto file my service that has a single endpoint,

01:32.000 --> 01:34.000
CreateUser.

01:34.000 --> 01:37.000
It takes a specific message type as input,

01:37.000 --> 01:41.000
CreateUserRequest, and returns a specific message type as

01:41.000 --> 01:42.000
output.

01:42.000 --> 01:46.000
From this protobuf specification, I will be able to

01:46.000 --> 01:49.000
invoke the protobuf

01:49.000 --> 01:52.000
compiler to generate the associated Go code.

01:52.000 --> 01:56.000
And so it will generate a Go struct with all of the fields that

01:56.000 --> 01:59.000
I have defined in the protobuf definition.
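
NOTE
A rough sketch of what the generated Go code could look like for this CreateUser RPC; the field names here are illustrative, not the speaker's actual schema, and real protoc-gen-go output carries extra internal state.

package pb

// CreateUserRequest mirrors the request message defined in the proto file.
type CreateUserRequest struct {
	Name  string
	Email string
}

// CreateUserResponse mirrors the response message defined in the proto file.
type CreateUserResponse struct {
	Name  string
	Email string
}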

01:59.000 --> 02:03.000
With this generated code, it is easy to implement

02:03.000 --> 02:05.000
The server itself.

02:05.000 --> 02:08.000
I just need to create a type that has a method,

02:08.000 --> 02:13.000
CreateUser, with a signature where it takes my message

02:13.000 --> 02:17.000
type as input and returns the message type as output.

02:17.000 --> 02:20.000
The implementation here is very dumb.

02:20.000 --> 02:23.000
It's just echoing back all of the fields of the request in

02:23.000 --> 02:25.000
the response.
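
NOTE
A minimal sketch of the echo handler described above, assuming generated code in a package pb with a UserService service; the names are illustrative.

package server

import (
	"context"

	pb "example.com/myservice/pb" // hypothetical generated package
)

// userServer implements the generated UserService server interface.
type userServer struct {
	pb.UnimplementedUserServiceServer
}

// CreateUser simply echoes the request fields back in the response.
func (s *userServer) CreateUser(ctx context.Context, req *pb.CreateUserRequest) (*pb.CreateUserResponse, error) {
	return &pb.CreateUserResponse{
		Name:  req.Name,
		Email: req.Email,
	}, nil
}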

02:25.000 --> 02:28.000
And once we have this implementation,

02:28.000 --> 02:32.000
I can run a quick benchmark to see how efficient that is.

02:32.000 --> 02:36.000
To do that, I use the standard Go benchmark tooling.

02:36.000 --> 02:39.000
And I create a benchmark function.

02:39.000 --> 02:41.000
First, I will set up the server.

02:41.000 --> 02:43.000
I will bind to a local socket.

02:43.000 --> 02:45.000
I will create a new gRPC server,

02:45.000 --> 02:48.000
attach my implementation that I showed previously

02:48.000 --> 02:51.000
to it, and ask it to serve requests.

02:51.000 --> 02:54.000
On the other side, I will create the client.

02:54.000 --> 02:58.000
I connect to that local socket and create a client

02:58.000 --> 03:00.000
object from it.

03:00.000 --> 03:05.000
With this client object, I can then benchmark the

03:05.000 --> 03:10.000
act of calling CreateUser on the client object.
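
NOTE
A hedged sketch of the benchmark setup described here, using the standard Go testing package and grpc-go; the service, package and field names are assumptions carried over from the earlier sketches.

package server

import (
	"context"
	"net"
	"testing"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	pb "example.com/myservice/pb" // hypothetical generated package
)

func BenchmarkCreateUser(b *testing.B) {
	// Server side: listen on a local socket, register our implementation, serve.
	lis, err := net.Listen("tcp", "localhost:0")
	if err != nil {
		b.Fatal(err)
	}
	srv := grpc.NewServer()
	pb.RegisterUserServiceServer(srv, &userServer{})
	go srv.Serve(lis)
	defer srv.Stop()

	// Client side: connect to that local address and build a client object.
	conn, err := grpc.NewClient(lis.Addr().String(),
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		b.Fatal(err)
	}
	defer conn.Close()
	client := pb.NewUserServiceClient(conn)

	req := &pb.CreateUserRequest{Name: "gopher", Email: "gopher@example.com"}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		// Each iteration is a full RPC: marshal, send, handle, reply, unmarshal.
		if _, err := client.CreateUser(context.Background(), req); err != nil {
			b.Fatal(err)
		}
	}
}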

03:10.000 --> 03:14.000
And while this looks like just a regular function call,

03:14.000 --> 03:17.000
it is actually doing a remote procedure call and doing all

03:17.000 --> 03:19.000
of the following steps.

03:19.000 --> 03:23.000
It will first marshal the request object to transform it to a

03:23.000 --> 03:26.000
byte array that is the wire representation of it.

03:26.000 --> 03:28.000
Send that to the network connection.

03:28.000 --> 03:30.000
The server will receive it.

03:30.000 --> 03:33.000
It will unmarshal it to get back the request object.

03:33.000 --> 03:36.000
Execute our handler implementation.

03:36.000 --> 03:40.000
Which was the basic echo implementation that I showed earlier.

03:40.000 --> 03:43.000
Then it will get a response object that it could then marshal

03:43.000 --> 03:45.000
to a wire representation.

03:45.000 --> 03:46.000
Send that over the network.

03:46.000 --> 03:48.000
The client will receive it.

03:48.000 --> 03:51.000
And finally, it will unmarshal it to get a response object that

03:51.000 --> 03:55.000
can be returned from the function.

03:55.000 --> 03:59.000
And we can run the benchmark to have a rough idea of what is

03:59.000 --> 04:02.000
the performance of this basic implementation.

04:02.000 --> 04:07.000
And it takes roughly 36 microseconds to do all of these steps.

04:07.000 --> 04:12.000
And so the overhead of a gRPC call seems to be around

04:12.000 --> 04:17.000
36 microseconds, because I do nothing interesting in my handler.

04:17.000 --> 04:20.000
And it's actually not that slow, that's good.

04:20.000 --> 04:25.000
But it's actually possible to reduce this time.

04:25.000 --> 04:29.000
And if we look back at all of the steps that were taken in

04:29.000 --> 04:32.000
our benchmark, a lot of them are about marshalling

04:32.000 --> 04:38.000
and unmarshalling objects to and from their wire representation.

04:38.000 --> 04:42.000
By default, what the gRPC library does is that it uses

04:42.000 --> 04:43.000
reflection.

04:43.000 --> 04:47.000
It iterates over all of the public fields of the Go

04:47.000 --> 04:50.000
struct and it will marshal them one by one.

04:50.000 --> 04:54.000
And while this works, reflection isn't the fastest thing.

04:55.000 --> 04:58.000
So it is possible to

04:58.000 --> 05:01.000
change the codec that is used to marshal and unmarshal objects,

05:01.000 --> 05:04.000
to use a different implementation that may be faster.

05:04.000 --> 05:09.000
To do that, grpc-go as a library exposes a Codec interface

05:09.000 --> 05:13.000
that has two methods, Marshal and Unmarshal.

05:13.000 --> 05:17.000
And it is possible to pass a user-defined codec implementation

05:17.000 --> 05:20.000
to do this transformation.

05:21.000 --> 05:25.000
And it's not that hard and it's possible to make it faster.
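
NOTE
For reference, the codec interface exposed by grpc-go's encoding package has roughly this shape; it also carries a Name method used when registering it, and exact details vary by version.

// Paraphrased from google.golang.org/grpc/encoding; check your grpc-go version.
type Codec interface {
	// Marshal turns a Go value (the generated struct) into its wire bytes.
	Marshal(v any) ([]byte, error)
	// Unmarshal fills the Go value from the wire bytes.
	Unmarshal(data []byte, v any) error
	// Name identifies the codec, for example "proto".
	Name() string
}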

05:25.000 --> 05:28.000
And to do that, I will introduce the vtprotobuf plugin

05:28.000 --> 05:33.000
that will help at doing faster marshalling and unmarshalling.

05:33.000 --> 05:36.000
This is an open source project and it's a protocol buffer

05:36.000 --> 05:37.000
compiler plugin.

05:37.000 --> 05:41.000
So it takes your protobuf definition

05:41.000 --> 05:44.000
and it will generate additional code.

05:44.000 --> 05:48.000
Using it is fairly straightforward: in your compilation setup,

05:49.000 --> 05:51.000
where you invoke the protocol buffer compiler,

05:51.000 --> 05:54.000
specifically with the Go plugin and the gRPC plugin,

05:54.000 --> 05:58.000
we just add one more:

05:58.000 --> 06:00.000
the vtprotobuf plugin.

06:00.000 --> 06:03.000
And when we generate the Go code from our proto file,

06:03.000 --> 06:06.000
it will generate an additional file called myservice

06:06.000 --> 06:09.000
_vtproto.pb.go.

06:09.000 --> 06:12.000
This file will contain additional methods

06:12.000 --> 06:16.000
on top of our Go structs that represent our protobuf objects,

06:17.000 --> 06:20.000
and these methods can be used to implement a custom codec

06:20.000 --> 06:22.000
that is faster.

06:22.000 --> 06:25.000
So we want to implement a codec that has a Marshal

06:25.000 --> 06:28.000
method and an Unmarshal method, and we will create

06:28.000 --> 06:31.000
such a type, and we will delegate the actual marshalling

06:31.000 --> 06:34.000
and unmarshalling

06:34.000 --> 06:38.000
to specific methods of the protobuf Go types,

06:38.000 --> 06:41.000
specifically MarshalVT and UnmarshalVT.

06:41.000 --> 06:45.000
These two methods are generated by the vtprotobuf plugin.

06:46.000 --> 06:49.000
And they contain a hand-rolled implementation of the actual

06:49.000 --> 06:52.000
marshalling and unmarshalling that does not rely on reflection.

06:52.000 --> 06:55.000
So it means that we don't have to use reflection

06:55.000 --> 06:59.000
and that this implementation is way more friendly

06:59.000 --> 07:02.000
to the inliner and to the compiler.

07:02.000 --> 07:07.000
Once we've defined this type that implements the codec interface,

07:07.000 --> 07:10.000
we can tell the gRPC library to actually use it.

07:10.000 --> 07:13.000
This can be done at any time, here with a call to

07:13.000 --> 07:17.000
encoding.RegisterCodec, or it can also be done when you create a gRPC

07:17.000 --> 07:20.000
server object or a client object.
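
NOTE
A sketch of such a codec backed by the vtprotobuf-generated MarshalVT/UnmarshalVT methods, plus its registration; the fallback to the standard proto functions for types without VT methods is an assumption, not something stated in the talk.

package vtcodec

import (
	"fmt"

	"google.golang.org/grpc/encoding"
	"google.golang.org/protobuf/proto"
)

// vtMessage is implemented by structs generated with the vtprotobuf plugin.
type vtMessage interface {
	MarshalVT() ([]byte, error)
	UnmarshalVT(data []byte) error
}

type codec struct{}

// Registering under the name "proto" replaces the default reflection-based codec.
func (codec) Name() string { return "proto" }

func (codec) Marshal(v any) ([]byte, error) {
	if m, ok := v.(vtMessage); ok {
		return m.MarshalVT() // reflection-free, generated code
	}
	if m, ok := v.(proto.Message); ok {
		return proto.Marshal(m) // assumed fallback for plain protobuf types
	}
	return nil, fmt.Errorf("unsupported message type %T", v)
}

func (codec) Unmarshal(data []byte, v any) error {
	if m, ok := v.(vtMessage); ok {
		return m.UnmarshalVT(data)
	}
	if m, ok := v.(proto.Message); ok {
		return proto.Unmarshal(data, m)
	}
	return fmt.Errorf("unsupported message type %T", v)
}

func init() {
	encoding.RegisterCodec(codec{})
}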

07:20.000 --> 07:24.000
And with this change we can run our benchmark again.

07:24.000 --> 07:28.000
And while it used to take 36 microseconds to actually do the whole

07:28.000 --> 07:33.000
RPC sequence, it now takes around 33 microseconds.

07:33.000 --> 07:38.000
To ensure that we have a real difference and it's not just

07:38.000 --> 07:43.000
noise in the measurement, we can run the benchmark several times

07:43.000 --> 07:48.000
with the -count option and save the output in a file.

07:48.000 --> 07:53.000
And then we can ask benchstat to compare the results that we got,

07:53.000 --> 07:58.000
and benchstat will do statistical analysis on

07:58.000 --> 08:01.000
those results to confirm or deny that there is indeed a difference

08:01.000 --> 08:03.000
in our two implementations.

08:03.000 --> 08:07.000
And in this case specifically, it confirms that there is a statistically

08:07.000 --> 08:12.000
significant difference, and that with the vtprotobuf

08:12.000 --> 08:19.000
codec we use 12% less CPU time when actually performing

08:19.000 --> 08:22.000
our benchmark.

08:22.000 --> 08:28.000
So with very little code here we just had to enable the use of one plugin

08:28.000 --> 08:30.000
and make a simple codec implementation.

08:30.000 --> 08:35.000
We managed to reduce a bit the overhead of the gRPC layer.

08:35.000 --> 08:39.000
And there was no need to change our handler implementation.

08:39.000 --> 08:42.000
So this was for unary requests.

08:42.000 --> 08:47.000
Now let's have a look at gRPC streams, and specifically gRPC streams

08:47.000 --> 08:50.000
that send a larger amount of data.

08:50.000 --> 08:55.000
In my service definition I have two RPCs:

08:55.000 --> 08:59.000
Put, where we will send a stream from the client to the server,

08:59.000 --> 09:04.000
and Get, where we send a stream from the server to the client.

09:05.000 --> 09:09.000
In gRPC, a stream is just a sequence of protobuf messages.

09:09.000 --> 09:13.000
And specifically here what we consider to be a message in our stream

09:13.000 --> 09:19.000
is just a chunk that has a single field, which is a byte slice.

09:19.000 --> 09:27.000
From this definition we can write a basic implementation

09:27.000 --> 09:31.000
of Put. On Put we just iterate over the stream and

09:31.000 --> 09:34.000
call stream.Recv on it, up until the stream gets closed.

09:34.000 --> 09:42.000
And on Get we generate a random byte slice and we send that to the stream.

09:42.000 --> 09:51.000
So we will work with chunks of 16 kilobytes and a total stream size of 16 megabytes.
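
NOTE
A rough sketch of the naive Put and Get handlers described above; the service, message and field names are illustrative, with 16 KiB chunks and 16 MiB per stream as in the talk.

package server

import (
	"crypto/rand"
	"io"

	pb "example.com/blobservice/pb" // hypothetical generated package
)

const (
	chunkSize  = 16 * 1024        // 16 KiB per chunk message
	streamSize = 16 * 1024 * 1024 // 16 MiB per stream
)

type blobServer struct {
	pb.UnimplementedBlobServiceServer
}

// Put drains the client stream, one chunk at a time, until it is closed.
func (s *blobServer) Put(stream pb.BlobService_PutServer) error {
	for {
		chunk, err := stream.Recv()
		if err == io.EOF {
			return stream.SendAndClose(&pb.PutResponse{})
		}
		if err != nil {
			return err
		}
		_ = chunk.Data // a real handler would do something with the payload
	}
}

// Get sends random chunks to the client until streamSize bytes have been sent.
func (s *blobServer) Get(req *pb.GetRequest, stream pb.BlobService_GetServer) error {
	buf := make([]byte, chunkSize)
	for sent := 0; sent < streamSize; sent += chunkSize {
		rand.Read(buf)
		if err := stream.Send(&pb.Chunk{Data: buf}); err != nil {
			return err
		}
	}
	return nil
}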

09:51.000 --> 09:56.000
And so we can have a look at how fast that basic implementation of gRPC streams

09:56.000 --> 10:01.000
would be. This time I will not use the Go benchmark setup.

10:01.000 --> 10:05.000
I will have the client and the server separated, and look at

10:05.000 --> 10:12.000
system metrics coming from the host as well as Go runtime metrics.

10:12.000 --> 10:18.000
So when we run our gRPC server implementation and we stress it with

10:19.000 --> 10:25.000
Gets... So, sorry, our test setup has two virtual machines.

10:25.000 --> 10:30.000
One for the client, one for the server. Each of them has two virtual CPUs and they share

10:30.000 --> 10:33.000
a five gigabit per second network link.

10:33.000 --> 10:38.000
So when stressing the gRPC server with Gets we can see that we manage to saturate

10:38.000 --> 10:42.000
this five gigabit per second network link, and that to do that we

10:42.000 --> 10:46.000
consume 1.3 CPU cores.

10:46.000 --> 10:50.000
So that's good: our basic naive gRPC implementation can saturate

10:50.000 --> 10:56.000
a kind of high-speed network link, but it costs some CPU.

10:56.000 --> 11:01.000
And if we compare with another project, for example Caddy, which is a popular

11:01.000 --> 11:05.000
HTTP/2 server written in Go, we can see that Caddy is able to saturate

11:05.000 --> 11:09.000
the same network link but for only 0.2 CPU cores.

11:09.000 --> 11:13.000
So there is a big discrepancy between what our gRPC implementation

11:13.000 --> 11:16.000
does and what Caddy, which is optimized, does.

11:16.000 --> 11:20.000
And so even though gRPC is a bit more of an involved protocol

11:20.000 --> 11:25.000
than plain HTTP/2, it would still be nice if we were able to reduce the CPU

11:25.000 --> 11:28.000
that we consume to do so.

11:28.000 --> 11:33.000
And the problem is even worse when we try the Put stream, when we put

11:33.000 --> 11:37.000
data to the server: when the server receives data it needs to consume

11:37.000 --> 11:43.000
1.6 CPU cores to saturate this network link.

11:43.000 --> 11:48.000
So the question is: where is the CPU consumed? What is our

11:48.000 --> 11:52.000
CPU doing on this workload to consume that much power.

11:52.000 --> 11:56.000
And the good thing about Go is that answering this question is fairly straightforward.

11:56.000 --> 12:02.000
You can just run a CPU profile on your running gRPC server to get an answer.
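
NOTE
One common way to expose CPU and memory profiles on a running Go server is net/http/pprof; this is a generic sketch, not necessarily the exact setup used in the talk.

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers on the default mux
)

func main() {
	// Alongside the gRPC server, expose the pprof endpoints on a side port.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// ... start and run the gRPC server here ...
	select {}
}

// Then, from a shell:
//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30   # CPU profile
//   go tool pprof http://localhost:6060/debug/pprof/heap                 # memory profile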

12:02.000 --> 12:06.000
And if you do that you get this kind of flame graph.

12:06.000 --> 12:11.000
There is one big tower in the middle, which is actually doing the syscalls

12:11.000 --> 12:12.000
to read data.

12:12.000 --> 12:15.000
So while it can be optimized I will not have a look at that today.

12:15.000 --> 12:17.000
This is kernel CPU time.

12:17.000 --> 12:22.000
But I will rather have a look at the other towers, and specifically this one.

12:22.000 --> 12:29.000
They all are related to garbage collection or memory allocation.

12:29.000 --> 12:34.000
And indeed if we look at Go runtime metrics you can see that

12:34.000 --> 12:41.000
our gRPC server, while under load, allocates at a rate of 11 gigabits per second of

12:41.000 --> 12:42.000
data.

12:42.000 --> 12:46.000
When we are serving 5 gigabit per second of throughput we are actually

12:46.000 --> 12:51.000
internally allocating 11 gigabits per second of data.

12:51.000 --> 12:56.000
And due to this high allocation rate we need to run the GC very often.

12:56.000 --> 13:01.000
During our stress test the GC was running between 4 and 5 times per second.

13:01.000 --> 13:06.000
These are also metrics coming from the Go runtime.

13:06.000 --> 13:11.000
And so while the Go GC is kind of efficient and fast, if you run it several

13:11.000 --> 13:17.000
times per second, it starts to add up and becomes a significant amount of CPU

13:17.000 --> 13:18.000
time.

13:18.000 --> 13:22.000
So now the question is where does this memory allocation come from and can we

13:22.000 --> 13:23.000
reduce it?

13:23.000 --> 13:28.000
Once again in Go it is very straightforward to get an answer on where it comes from.

13:28.000 --> 13:35.000
We can just run a memory profile on our running gRPC server and see where it comes from.

13:35.000 --> 13:42.000
And specifically it seems to come from a single function inside the gRPC library.

13:42.000 --> 13:46.000
The function that we see has two parts.

13:46.000 --> 13:52.000
Receive-and-decompress, which reads data from the network connection and puts it in a

13:52.000 --> 13:53.000
byte buffer.

13:53.000 --> 13:59.000
So in the first part, a buffer is allocated on the heap and it puts the wire representation

13:59.000 --> 14:00.000
in that place.

14:00.000 --> 14:02.000
And then in the second part it unmarshals.

14:02.000 --> 14:09.000
So it tries to turn this wire representation into a Go struct.

14:09.000 --> 14:15.000
So specifically, for the receive-and-decompress part, by default every time a

14:16.000 --> 14:19.000
protobuf message is received by the gRPC library, it

14:19.000 --> 14:25.000
allocates a byte slice and writes the wire format to it.

14:25.000 --> 14:30.000
And so for every message that is received by the gRPC library we need to

14:30.000 --> 14:32.000
heap allocate.

14:32.000 --> 14:37.000
It would be nice if there was a way to tell the gRPC library to actually not heap

14:37.000 --> 14:42.000
allocate for every new message that is being received, but instead try to reuse memory

14:42.000 --> 14:44.000
internally.

14:44.000 --> 14:46.000
And it's actually something that is possible.

14:46.000 --> 14:50.000
You can configure the grpc-go library to do so.

14:50.000 --> 14:57.000
If you are using grpc-go with a version that is lower than 1.66, there is an

14:57.000 --> 15:02.000
experimental option that can be used when creating a server or a client.

15:02.000 --> 15:07.000
If you use experimental.RecvBufferPool you will tell the gRPC library

15:07.000 --> 15:11.000
to try to use a memory buffer pool internally when receiving new messages.

15:11.000 --> 15:20.000
If you are using a recent enough version of grpc-go, which I would strongly advise, then

15:20.000 --> 15:27.000
you actually don't have to do anything and it is now the new behavior by default.

15:27.000 --> 15:31.000
With this memory pool we don't have to heap allocate for every new message that

15:31.000 --> 15:33.000
is being received by the gRPC library.

15:33.000 --> 15:37.000
This memory can be reused across different messages.
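
NOTE
A sketch of enabling the receive buffer pool on older grpc-go versions; the option names follow what the talk describes (an option in the experimental package), but they changed across grpc-go releases, so treat these exact identifiers as assumptions and check your version's documentation.

package server

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/experimental"
)

// recvPoolOptions returns the options enabling buffer pooling for receives,
// for grpc-go < 1.66; newer versions pool receive buffers by default.
func recvPoolOptions() (grpc.ServerOption, grpc.DialOption) {
	pool := grpc.NewSharedBufferPool() // assumed helper; shared pool of receive buffers
	return experimental.RecvBufferPool(pool), experimental.WithRecvBufferPool(pool)
}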

15:37.000 --> 15:44.000
And so if we run our benchmark setup again, you can see that our allocation rate is lower.

15:44.000 --> 15:51.000
We only allocate at around 5 gigabit per second compared to the 11 gigabit per second that we had previously.

15:51.000 --> 15:57.000
And as a result we consume less CPU time when serving the same amount of traffic.

15:57.000 --> 16:00.000
So this was one half of the answer.

16:00.000 --> 16:06.000
For the other half, we need to look at how to reduce memory allocation when we unmarshal.

16:07.000 --> 16:14.000
In our basic implementation, what we did was calling stream.Recv in a loop

16:14.000 --> 16:17.000
up until the gRPC stream gets closed.

16:17.000 --> 16:23.000
And what stream.Recv does is that it will return a new heap-allocated chunk

16:23.000 --> 16:30.000
that has a single data field that is a byte slice that will contain what we send through the network.

16:31.000 --> 16:36.000
Every time we call stream.Recv we heap allocate a chunk.

16:36.000 --> 16:41.000
And we also need to heap allocate when we unmarshal.

16:41.000 --> 16:43.000
We need to heap allocate the data slice,

16:43.000 --> 16:46.000
the slice that contains the actual payload.

16:46.000 --> 16:51.000
We need to heap allocate it so that we can link it to the data field.

16:51.000 --> 16:57.000
And it is possible to make a different implementation of Put that does not have this issue.

16:58.000 --> 17:03.000
To do that we will stop using the gRPC API stream.Recv.

17:03.000 --> 17:07.000
And we will rather use the API stream.RecvMsg.

17:07.000 --> 17:15.000
The big difference is that stream.RecvMsg takes a chunk as argument and will unmarshal into it.

17:15.000 --> 17:23.000
If we are able to pass a chunk that has a data field that is a byte slice of length zero

17:23.000 --> 17:31.000
but with free capacity, then we can unmarshal into this free capacity without having to do a heap allocation.

17:31.000 --> 17:40.000
And to do that we will pool our chunk messages by using some vtprotobuf helpers.

17:40.000 --> 17:44.000
When we get a chunk from the vtprotobuf pool there are two options.

17:44.000 --> 17:48.000
Either the pool was empty and we heap allocate a new chunk.

17:48.000 --> 17:53.000
The data field is the nil slice, length zero and capacity equals zero.

17:53.000 --> 18:01.000
So when we unmarshal we need to create a new slice and attach it to the data field, which is exactly what was happening before.

18:01.000 --> 18:09.000
Or, if we get lucky, when we get a chunk from the pool, it is possible that the data field is a slice of length zero

18:09.000 --> 18:14.000
but with some extra free capacity that was previously allocated.

18:14.000 --> 18:20.000
And if that is the case, when we unmarshal into this slice we can just write directly into the free capacity

18:20.000 --> 18:25.000
without having to ask the runtime to do a new heap allocation.

18:25.000 --> 18:33.000
And so with this thing we don't have to make a new heap allocation for every message that we receive on the stream.
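
NOTE
A sketch of the reworked Put handler that pools chunk messages and unmarshals into their existing capacity via RecvMsg; it reuses the illustrative names from the earlier sketch, and the pooling helpers (ChunkFromVTPool, ReturnToVTPool) follow vtprotobuf's pool feature but their exact names are an assumption.

package server

import (
	"io"

	pb "example.com/blobservice/pb" // hypothetical package, generated with vtprotobuf's pool feature
)

// Put, reworked to reuse pooled chunk messages instead of allocating on every Recv.
func (s *blobServer) Put(stream pb.BlobService_PutServer) error {
	for {
		// Take a chunk from the pool: its Data slice may still have capacity
		// from a previous message, so unmarshalling can write into it directly.
		chunk := pb.ChunkFromVTPool() // assumed helper from the pool feature
		err := stream.RecvMsg(chunk)
		if err != nil {
			chunk.ReturnToVTPool()
			if err == io.EOF {
				return stream.SendAndClose(&pb.PutResponse{})
			}
			return err
		}
		_ = chunk.Data // use the payload
		// Hand the message (and its backing slice) back to the pool for reuse.
		chunk.ReturnToVTPool()
	}
}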

18:34.000 --> 18:44.000
And so by running our benchmark again, we can see that with this change we have dramatically reduced the allocations of our gRPC server.

18:44.000 --> 18:49.000
We now only allocate around 20 to 30 megabit per second of data.

18:49.000 --> 18:53.000
Whereas we used to allocate several gigabits per second.

18:53.000 --> 18:57.000
Due to this change the GC rate is way more reasonable.

18:57.000 --> 19:00.000
It only runs around once per second.

19:00.000 --> 19:08.000
And so with that the CPU consumed when saturating the network is actually around 0.7 CPU core.

19:08.000 --> 19:21.000
It used to be at 1.6, so with this simple change we divided by more than 2 the amount of CPU that is spent on gRPC internals.

19:21.000 --> 19:24.000
We can now have a look at the Get workload.

19:24.000 --> 19:27.000
It suffers from roughly the same issue.

19:27.000 --> 19:34.000
On the Get workload, so when there is a stream from the server to the client, it allocates a lot of memory.

19:34.000 --> 19:39.000
It allocates around 5 gigabit per second when we serve 5 gigabit per second.

19:39.000 --> 19:44.000
And as we run this workload it consumes a lot of CPU, and it's the same issue as previously.

19:44.000 --> 19:48.000
But this time the memory allocation comes from a different place.

19:48.000 --> 20:00.000
All of the memory allocation happens when calling marshal to transform our Go structs into their wire representation and send them over the network.

20:00.000 --> 20:08.000
The reason for that can be found in the API that the gRPC library uses for the codec.

20:08.000 --> 20:14.000
Specifically, Marshal takes a Go struct and returns a byte slice.

20:14.000 --> 20:22.000
As we have seen previously in the talk, returning a byte slice means that you need to heap allocate it.

20:22.000 --> 20:30.000
There is no way for the caller and the callee of this API to cooperate to prevent memory allocation.

20:30.000 --> 20:40.000
So every time we marshal an object through this interface, we need to heap allocate as much memory as its wire representation takes.

20:40.000 --> 20:46.000
Thankfully gRPC recently introduced a CodecV2 interface.

20:46.000 --> 20:49.000
It's kind of the same thing as a codec interface.

20:49.000 --> 20:59.000
There is still a Marshal method, but this time rather than using a standard byte slice it uses a specific gRPC internal type

20:59.000 --> 21:02.000
called mem.BufferSlice.

21:02.000 --> 21:16.000
And what a mem.BufferSlice is, is basically a reference-counted byte slice that the gRPC library can reuse and pool internally, so that it does not have to heap allocate new buffers every time.

21:16.000 --> 21:24.000
And so we can make a codec that implements this CodecV2 interface.

21:24.000 --> 21:29.000
Once again we have the help of the vtprotobuf helpers.

21:29.000 --> 21:38.000
To implement marshalling we will first ask what would be the size of the wire representation of this object.

21:38.000 --> 21:47.000
We will then get a piece of memory from an internal gRPC memory buffer pool that can hold this wire representation.

21:47.000 --> 21:58.000
Then we can just marshal into this slice and return it wrapped in a mem.BufferSlice.

21:59.000 --> 22:11.000
And this means that the gRPC library can now send this slice of memory that contains the wire representation of our Go struct.

22:11.000 --> 22:22.000
And when it is done, it can just decrement the reference count and, if it reaches zero, put it back into the pool so that it can be reused later.
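
NOTE
A hedged sketch of a CodecV2 along these lines, combining grpc-go's mem package (1.66+) with vtprotobuf's sizing and marshalling helpers; the exact method names (SizeVT, MarshalToSizedBufferVT, Materialize, RegisterCodecV2) should be checked against your library versions.

package vtcodecv2

import (
	"fmt"

	"google.golang.org/grpc/encoding"
	"google.golang.org/grpc/mem"
)

// vtMessage lists the vtprotobuf-generated methods this sketch relies on.
type vtMessage interface {
	SizeVT() int
	MarshalToSizedBufferVT(data []byte) (int, error)
	UnmarshalVT(data []byte) error
}

type codecV2 struct{}

func (codecV2) Name() string { return "proto" }

func (codecV2) Marshal(v any) (mem.BufferSlice, error) {
	m, ok := v.(vtMessage)
	if !ok {
		return nil, fmt.Errorf("unsupported message type %T", v)
	}
	size := m.SizeVT()
	pool := mem.DefaultBufferPool()
	buf := pool.Get(size) // reuse a pooled byte slice for the wire form
	*buf = (*buf)[:size]  // make sure the slice length matches the wire size
	if _, err := m.MarshalToSizedBufferVT(*buf); err != nil {
		pool.Put(buf)
		return nil, err
	}
	// Wrap the pooled slice in a ref-counted buffer; gRPC frees it back to the
	// pool once the data has been written out on the wire.
	return mem.BufferSlice{mem.NewBuffer(buf, pool)}, nil
}

func (codecV2) Unmarshal(data mem.BufferSlice, v any) error {
	m, ok := v.(vtMessage)
	if !ok {
		return fmt.Errorf("unsupported message type %T", v)
	}
	return m.UnmarshalVT(data.Materialize())
}

func init() {
	encoding.RegisterCodecV2(codecV2{})
}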

22:22.000 --> 22:30.000
So with this change we don't have to make a heap allocation every time we call Marshal.

22:30.000 --> 22:39.000
And so it means that we can run our benchmark again, and we can see that once again our allocation rate has been dramatically reduced.

22:39.000 --> 22:54.000
We only allocate a few dozen megabytes per second when serving this gigabit-per-second load, so the GC rate is way lower and we consume less CPU.

22:54.000 --> 23:06.000
So as a small summary, we've seen that it's possible to make a few changes to how you use the gRPC library to reduce CPU consumption.

23:06.000 --> 23:21.000
For unary requests, using the vtprotobuf codec helps you to replace the default implementation that is based on reflection, and by doing so you can save around 10% of CPU time.

23:21.000 --> 23:42.000
For streams, on egress streams, so streams moving out of the gRPC server, you can reduce memory allocation and thus CPU consumption by a significant factor, by just using a recent gRPC version and having a CodecV2 implementation that does not heap allocate for every call.

23:42.000 --> 23:53.000
For ingress streams you can also reduce memory allocation and CPU usage by using either a recent gRPC version, or enabling an experimental option on older ones.

23:53.000 --> 24:00.000
And if you want to go further, you can change the handler implementation to pool the messages that you receive from the stream.

24:00.000 --> 24:11.000
Apart from this very last point, everything that I explained previously does not require you to change the actual implementation of your handlers.

24:11.000 --> 24:23.000
You just need to change a few lines of code in your compilation setup, in how you generate gRPC code from your proto files, and have one small codec implementation.

24:23.000 --> 24:31.000
So with a very small amount of work and code you can get a significant amount of CPU reduction.

24:31.000 --> 24:36.000
Thank you for listening. We have a few minutes, I think, for questions.

24:36.000 --> 24:45.000
I have space for one very short question.

