WEBVTT

00:00.000 --> 00:14.320
Hello everyone, so today's talk is kind of the sequel to the talk that I gave last year here

00:14.320 --> 00:20.160
in the networking room. Last year it was more of a prototype, and now it is a

00:20.160 --> 00:30.800
full-fledged project. The agenda for today is why we might want to run a router on our

00:30.800 --> 00:39.600
Kubernetes nodes, what this OpenPERouter project is about, which is a router that you can run on Kubernetes

00:39.600 --> 00:47.920
nodes, how you install it on your cluster, how you integrate it with the existing components

00:47.920 --> 00:56.320
of the cluster, and what's ahead. So first of all, why might we want to run something like

00:56.320 --> 01:04.800
a router on a Kubernetes node. The thing is, on cloud providers everything is simple. The network

01:04.800 --> 01:14.480
is abstracted away from us, we only have one big eth0 interface and that's it. But what I work on

01:14.560 --> 01:22.560
is the dark side of things. On bare metal, users want to do their own things, you have

01:22.560 --> 01:30.160
multiple networks, you want network segregation, you want multiple VLANs, secondary interfaces,

01:30.160 --> 01:37.760
and a lot of stuff that requires your fabric to be configured around your cluster and vice versa.

01:37.840 --> 01:44.880
You need to have your clusters talking to the fabric in a way where both are configured

01:44.880 --> 01:50.960
together. And also, I think it makes the life of our network engineers a bit less boring, but that's personal.

01:52.160 --> 01:58.320
So how is the fabric configured and most importantly who is configuring this fabric? So we said that

01:58.320 --> 02:04.480
we want multiple networks, and the most common way of doing this is with VLAN interfaces.

02:04.480 --> 02:14.560
So you go and configure some VLAN interfaces on your nodes. Those VLAN interfaces

02:14.560 --> 02:21.120
are translated into EVPN tunnels in some way in your top of the rack switch, you have multiple

02:21.120 --> 02:29.520
BGP sessions between your nodes and your top of the rack and everything works.

02:30.400 --> 02:40.400
From a data plane point of view, there is a mapping between the tunnels and the VLANs, and the

02:40.400 --> 02:46.240
fabric is configured to encapsulate the traffic coming into those VLANs into some kind of

02:46.240 --> 02:54.960
tunnels. If we focus on the host, the host sees multiple interfaces, it sees multiple BGP sessions,

02:54.960 --> 03:01.840
I'm a bit biased because I work on MetalLB, so I use it as an example, but the same happens with

03:01.840 --> 03:09.280
Calico or with any other component that talks BGP in your cluster. So what happens is that every

03:09.280 --> 03:15.200
single VLAN on every single interface becomes your access point to a different tunnel.

03:15.200 --> 03:27.360
But the elephant in the room is: who is configuring the top of the rack? And whenever I want

03:27.360 --> 03:34.080
to add a new network, who am I asking to configure that network into the fabric? And this

03:34.080 --> 03:41.760
brings a common theme that I observed talking to multiple users, which is this kind of tension

03:41.760 --> 03:46.320
between the networking folks that do not want to touch anything because they are afraid that

03:46.320 --> 03:52.880
the network will break, and the Kubernetes folks, who want to spin up a new tenant, who want to add

03:52.880 --> 04:00.560
a new type of network to our nodes. And so there is this tension, which results in discussions

04:00.560 --> 04:09.920
in poor usability and so on. So what we are doing with OpenPERouter is to take this concept,

04:10.000 --> 04:17.280
the EVPN termination, and we are moving that down to the node. So the same thing that was happening

04:17.280 --> 04:24.240
on the top of the rack, the translation between a given VLAN (or multiple VLANs) and a

04:24.240 --> 04:31.440
VXLAN tunnel, is now handled by this new OpenPERouter component, which runs on the node and which

04:31.440 --> 04:44.480
exposes the networks as veth legs toward the host. So OpenPERouter, again, is a separate network

04:44.480 --> 04:50.880
namespace; the network namespace isolates all the complexity required to set up some kind of

04:50.880 --> 04:59.440
VPN such as EVPN and VXLAN. It uses FRR as the control plane for the BGP part, and the

04:59.440 --> 05:06.000
idea is that it behaves like a router. So the same interaction that our components would have

05:06.000 --> 05:13.440
with a router now can happen towards this component. And of course the configuration is CRD-driven,

05:13.440 --> 05:19.440
so we can leverage all the stuff that comes with Kubernetes. And of course we can have multiple networks.

05:19.440 --> 05:29.040
So we can have multiple networks exposing multiple BGP sessions and acting as the access point to

05:29.040 --> 05:35.760
multiple tenants. And this should remind us of what we just discussed. So the interaction is

05:35.760 --> 05:41.760
exactly the same. The difference is that this is driven by CRDs and configured automatically,

05:42.640 --> 05:48.320
whereas before, we had to go to the networking folks and ask for a new configuration.

05:48.800 --> 05:59.440
Okay. I work on KubeVirt, pretty much. We run virtual machines in Kubernetes, and if something is

05:59.440 --> 06:04.320
our bread and butter, it would be live migration and layer 2. So I'm here to speak about

06:04.320 --> 06:12.400
everything related to layer 2. Okay. Thank you for that. So how would you typically do this without

06:12.480 --> 06:19.120
OpenPERouter? You need a new network, you plumb a new VLAN into it, you plumb that into your

06:19.120 --> 06:24.800
top of rack, your top of rack will pretty much have the IP address for the gateway.

06:25.840 --> 06:31.440
And then the network people would go and configure the top of rack with the EVPN, blah blah blah.

06:31.440 --> 06:38.320
So a lot of people have to coordinate between themselves. So, friction. This is not what we want to have.

06:38.960 --> 06:44.800
Hence, we want to be the router ourselves. What you essentially have is your Kubernetes node,

06:44.800 --> 06:50.320
you have your router running, and that blue thing over there is the network namespace of the

06:50.320 --> 06:58.880
router. For every overlay you're creating, you'll essentially create a dedicated bridge for it.

06:58.880 --> 07:04.160
So that number right there is the VNI you're using. So you have three networks. Each of them

07:04.240 --> 07:11.120
is encapsulated in a different VNI, you have a veth pair connected to that bridge, and you're

07:11.120 --> 07:19.200
left with a dangling veth on your host. Then you will plumb this veth pair in, you will find a way

07:19.200 --> 07:26.240
for this veth pair to extend into your workloads. Essentially, the fabric becomes just a number.
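As a rough sketch of that per-overlay plumbing, assuming made-up interface naming (the names OpenPERouter actually generates may differ): each VNI gets a dedicated bridge in the router namespace, a veth pair attached to it, and a dangling host-side veth.

```python
# Sketch only: one bridge + veth pair per overlay VNI.
# The naming convention here is invented for illustration,
# not OpenPERouter's real one.
def overlay_plumbing(vnis):
    plumbing = {}
    for vni in vnis:
        plumbing[vni] = {
            "bridge": f"br-vni{vni}",    # bridge inside the router netns
            "veth_router": f"pe-{vni}",  # veth end attached to that bridge
            "veth_host": f"host-{vni}",  # dangling veth end left on the host
        }
    return plumbing

if __name__ == "__main__":
    for vni, links in overlay_plumbing([100, 200, 300]).items():
        print(vni, links["bridge"], links["veth_host"])
```

Adding an overlay then really is "just a number": pick a free VNI and the rest of the plumbing follows mechanically.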

07:26.240 --> 07:31.680
So what happens to the fabric then? On the fabric side we don't need to configure

07:32.000 --> 07:38.080
tunnels anymore. We only need to configure the leaf routers to relay around the BGP routes

07:38.080 --> 07:44.000
related to the EVPN, and the traffic becomes completely routed. So it's plain IP traffic that

07:44.000 --> 07:50.000
is being routed by the fabric according to the EVPN constructs. So in this case there is no

07:50.000 --> 07:54.880
need to involve the networking people anymore. The configuration is dynamic, it can scale and it's

07:54.960 --> 08:01.760
very easy to add or remove overlays. And please note, the hand-drawn diagram is not completely right.

08:04.160 --> 08:11.280
The first kind of tunneling that we added is EVPN VXLAN, and in order to implement that we need

08:11.280 --> 08:19.600
two things. We need to configure the session between the node and the top of the rack where we

08:19.600 --> 08:27.680
can spread around our BGP routes and send out the VXLAN packets. We can do that in a couple of ways

08:27.680 --> 08:34.560
with OpenPERouter. One way is to steal an interface from the node. It can work in some scenarios

08:34.560 --> 08:40.320
and not work in others. So we take the interface from the network namespace of the host and we

08:40.320 --> 08:46.480
move it into this new network namespace. Or we can use Multus, because this runs as a pod.

08:47.120 --> 08:53.440
So we can use Multus and macvlan interfaces, for example, to have one interface that allows you

08:54.400 --> 08:57.280
to establish the BGP session with the top of the rack switch.

08:58.880 --> 09:05.120
And it's YAML, because it's Kubernetes. This is how you configure it, it's very simple. You need to

09:05.120 --> 09:12.560
have a session between the top of the rack and your PE router, and of course we have node selectors

09:12.560 --> 09:18.080
and all the nice things that come with Kubernetes and in this example we declare the interface

09:18.080 --> 09:26.720
that we want to steal. Then there is this L3VNI that enables the creation of the veth

09:26.720 --> 09:33.600
legs that I was talking about, with FRR listening to BGP, and what is going to happen is that all the

09:33.600 --> 09:42.240
routes received from and to the host are going to be translated into EVPN type 5 routes

09:42.240 --> 09:50.640
with the local VTEP as the tunnel endpoint. And from a traffic point of view, the traffic going

09:50.640 --> 09:59.360
into the fabric gets encapsulated. And this is how you do that: you declare the VNI, which is the

09:59.440 --> 10:05.120
identifier of the tunnel, one VRF that is used for the Linux VRF inside the namespace,

10:06.800 --> 10:17.520
a local CIDR. One thing to note, getting back here, is that the

10:17.520 --> 10:24.000
inside of the veth on the router side will always have the same IP regardless of the node, and

10:24.320 --> 10:29.440
the host side is going to be different, so your BGP configuration should be easier to template.
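To make that addressing property concrete, here is a minimal sketch under an assumed scheme (the project's real allocation logic may differ): the router-side end of the veth takes the first address of the local CIDR on every node, while the host side gets a per-node offset, so the host's BGP peer address is identical everywhere.

```python
import ipaddress

def veth_addresses(local_cidr, node_index):
    """Assumed scheme, for illustration only: router side gets the first
    host address of the CIDR (same on every node, since each router lives
    in its own netns), host side gets a unique per-node address."""
    net = ipaddress.ip_network(local_cidr)
    router_ip = net.network_address + 1               # identical on all nodes
    host_ip = net.network_address + 2 + node_index    # differs per node
    return str(router_ip), str(host_ip)

if __name__ == "__main__":
    for node in range(3):
        print(node, veth_addresses("192.169.10.0/24", node))
```

The payoff is that every host can point its BGP session at the same neighbor IP, regardless of which node it runs on.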

10:30.160 --> 10:39.120
Okay, L2 again. So essentially, what you've seen is how to provide an

10:39.120 --> 10:45.200
interconnected type of network. This is how you would have a stretched layer 2 network

10:45.200 --> 10:51.040
across your fabric, thus doing things like interconnecting different types of clusters.

10:51.680 --> 10:58.400
This is where you would use an L2VNI. The interesting thing is, as I said before, you will have

10:58.400 --> 11:05.520
one veth pair per VNI, that is connected to a bridge, a dedicated bridge for that VNI,

11:05.520 --> 11:14.800
optionally encapsulated in a VRF, so you can have network segregation in it. And on the

11:14.800 --> 11:22.480
host side what you'll have is a bridge. It can be an OVS bridge, it can be a Linux bridge, you choose

11:22.480 --> 11:30.800
what you want to plumb with and then the magic is that this bridge is what your workloads will

11:30.800 --> 11:39.120
attach to. And how do you do that? Well, this is how you create the L2VNI thing, and the most

11:39.200 --> 11:45.200
interesting things here are the VNI number that you choose and the VRF. This is

11:45.200 --> 11:53.840
essentially the key of the interconnection between your L2VNI and your L3VNI, in case you want to

11:54.720 --> 12:04.320
expose this stretched layer 2 to the outside world. Okay, all right, I have the mic. So there is also this

12:04.640 --> 12:10.560
passthrough mode, that was made more for testing and integration, and also if you have a process

12:10.560 --> 12:16.560
on the host that needs to access the fabric directly, but you still want to have a single session

12:16.560 --> 12:22.400
with the top of the rack switch. So that is still the same, you have something on this side that

12:22.400 --> 12:27.840
talks BGP with this. The only difference is that no encapsulation will happen, and

12:27.840 --> 12:33.920
no route translation will happen, it is just going to be relaying the BGP routes.
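Putting the two variants side by side, here is a hedged sketch of what such resources might look like, expressed as Python dicts (the kinds and field names are illustrative guesses, not OpenPERouter's actual CRD schema): an L2VNI carries the VNI number plus the optional VRF that keys the interconnection with an L3VNI, while a passthrough peering carries no VNI at all, since no encapsulation happens.

```python
import json

# Illustrative resources only; consult the project's documentation
# for the real CRD kinds and field names.
l2vni = {
    "kind": "L2VNI",
    "spec": {
        "vni": 110,    # VXLAN network identifier for the stretched L2 network
        "vrf": "red",  # optional: keys the interconnection with an L3VNI
        "hostmaster": {"type": "bridge"},  # what the host-side veth plumbs into
    },
}

passthrough = {
    "kind": "Passthrough",  # hypothetical name for the passthrough mode
    "spec": {},             # no VNI: routes are relayed, traffic stays plain IP
}

if __name__ == "__main__":
    print(json.dumps(l2vni, indent=2))
```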

12:36.080 --> 12:43.280
So the second part of this is how does this integrate with existing projects. And as I said before,

12:43.280 --> 12:51.040
one driving concept is that we want this to interact with our existing components in the same way

12:51.040 --> 12:58.640
that they would interact with a real router. On the official website there are a few examples

12:58.640 --> 13:06.400
that you can run on your laptop. I'm going to just talk briefly about a couple. There are layer 3 ones,

13:06.400 --> 13:13.120
so I'm going to talk about MetalLB again, because I'm quite familiar with the project. With MetalLB we can

13:13.120 --> 13:21.440
establish one BGP session per veth, and again each veth is the access point for you to

13:21.440 --> 13:27.680
that overlay, and you can advertise different services to different overlays. And this is what

13:27.680 --> 13:32.880
is going to happen when you deploy the example on your laptop. This is a containerlab instance

13:32.880 --> 13:39.920
connected to a kind cluster, and MetalLB will advertise the BGP routes related to the

13:39.920 --> 13:47.360
services to the OpenPERouter. OpenPERouter will translate them into type 5 routes and they will

13:47.360 --> 13:53.760
be spread across the fabric, with the leaves acting as the termination. So this host here

13:53.760 --> 13:58.480
will be able to access the service and again this is something that if you look at the website

13:58.480 --> 14:08.240
you can deploy on your laptop. And from a data plane point of view, you have TCP on the terminations and on

14:08.240 --> 14:15.840
this path, and all the rest is UDP packets sent to and from the VTEPs. Calico is another thing

14:15.840 --> 14:22.240
that I tried to play with, because I wasn't familiar with the project, so I wanted to see if

14:22.240 --> 14:28.880
we could effectively integrate with it. And in this case the BIRD instance running on each node is

14:28.880 --> 14:36.800
peering with the PE router and advertising the pod CIDR, and we can have pod-to-pod traffic,

14:36.800 --> 14:42.800
so east-west traffic happening through a VXLAN, with the routes spread around as

14:42.800 --> 14:51.120
EVPN routes, and we can even have those pods routable from a host connected on the same

14:51.120 --> 15:01.040
overlay. Okay, for L2 integrations we have two of them, with KubeVirt, pretty much, running

15:01.040 --> 15:09.520
virtual machines, and the equivalent but for pods. So both of these are using secondary interfaces to do

15:09.520 --> 15:15.120
that. And this is a slide that I thought we would have before. But remember that bridge we've seen?

15:15.120 --> 15:22.640
This is how you attach it to your workloads. We rely on Multus, so secondary networks, and you just

15:22.640 --> 15:30.800
use your network attachment definition of choice to attach your VM to that bridge which again could

15:30.800 --> 15:38.160
be an OVS bridge, could be a Linux bridge, could be an ipvlan or a macvlan on top of that.

15:40.000 --> 15:46.080
Interestingly enough, we also have a distributed gateway IP, so pretty much no matter

15:46.080 --> 15:51.920
where your VM runs, because your VM can migrate to a different node in the cluster, it will

15:51.920 --> 16:00.080
always point to the same gateway MAC and IP address, thus reducing the downtime while you migrate.
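The property being described can be sketched like this (the address values are invented): because the gateway is distributed, the MAC/IP a VM resolves for its gateway depends only on the network, not on the node, so a migrated VM's ARP cache stays valid.

```python
def gateway_for(network_vni, node_name):
    """Distributed gateway sketch: every node answers for the same
    gateway MAC/IP per network, so the lookup ignores node_name.
    Values below are invented for illustration."""
    gateways = {110: ("0a:58:c0:a8:0a:01", "192.168.10.1")}
    return gateways[network_vni]

# A VM migrating from node-a to node-b keeps the same gateway,
# so its ARP entry never needs to be re-learned:
assert gateway_for(110, "node-a") == gateway_for(110, "node-b")
```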

16:01.760 --> 16:07.200
From the KubeVirt perspective this is what it looks like. It's essentially the same, and our integration,

16:07.200 --> 16:14.800
like the demos we provide to showcase the integration, also allows for a host on a separate network

16:14.800 --> 16:23.680
to access the VMs, which run on a separate network. And for demos, yeah, basically we didn't have

16:23.840 --> 16:29.680
time for live demos, but this is a reference to a YouTube playlist with a bunch of

16:30.720 --> 16:36.960
demos of the stuff that we spoke about just now. I want to stress the fact that you can run

16:36.960 --> 16:42.480
those very same demos on your laptop by running a bunch of make commands from a checkout of the

16:42.480 --> 16:49.360
repository. So the last bit is what's next, what we are planning to do. So one thing is that all this

16:49.440 --> 16:54.960
stuff that we spoke about now works beautifully if you are leveraging your primary CNI,

16:54.960 --> 17:02.320
but whenever your application uses an SR-IOV interface, it is an SR-IOV interface that is completely

17:02.320 --> 17:09.760
bypassing the host, so it's like having one wire that bypasses your PE router and goes

17:09.760 --> 17:16.880
directly into the top of the rack, and so you are back at square one, you have to ask somebody

17:16.880 --> 17:23.680
to configure your fabric. So this is something that we'll try to solve in some way.

17:24.400 --> 17:29.840
The second bit is there is this chicken-and-egg problem, because in order to configure the

17:29.840 --> 17:35.600
router you need to access the API server but what if you want all your network to go through

17:35.600 --> 17:43.120
this thing? And so we are working, we are halfway in merging PRs, to have a way to run

17:43.120 --> 17:49.040
this as a systemd unit that can be bootstrapped with a static configuration, and then whenever

17:49.040 --> 17:56.080
the API server is available, it's going to start consuming the

17:56.080 --> 18:02.400
new configuration. So it's a fancy way to have the basic connectivity required for the

18:02.400 --> 18:09.440
cluster to come up at boot time, while being able to add and configure the networks dynamically.
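A minimal sketch of that bootstrap logic, with invented function names and configuration shape (the in-flight PRs may do this differently): start from a static configuration shipped on disk, then switch over to the API server's configuration once it becomes reachable.

```python
import json

def load_config(api_config, static_json):
    """Prefer the configuration coming from the API server; fall back to
    the static bootstrap configuration. Names and shapes are illustrative."""
    if api_config is not None:
        return api_config
    return json.loads(static_json)

# Static file contents used before the API server is up (invented shape):
static = '{"vnis": [100], "source": "static"}'

boot = load_config(None, static)                       # API server not reachable yet
running = load_config({"vnis": [100, 200], "source": "api"}, static)
```

The point is only the ordering: basic connectivity exists at boot from the static file, and richer, dynamic networks arrive later from the CRDs.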

18:10.400 --> 18:18.080
For SR-IOV, from the host point of view it's going to be just another type of VF, so if you're

18:18.080 --> 18:24.000
on Calico or MetalLB or KubeVirt here, nothing will change. What is going to change is how

18:24.000 --> 18:31.360
you configure those VFs and where those VFs lead. So in this case some VFs will

18:31.360 --> 18:38.080
lead to a userspace system, which I think is the nice part of this architecture.

18:40.320 --> 18:46.640
The data path is kernel based so we are investigating and discussing if there is a way to have

18:46.640 --> 18:53.520
a pluggable data path that can be faster and can also solve the SR-IOV interface problem.

18:54.560 --> 19:00.320
We are talking to the team behind grout, which is a DPDK-based graph router.

19:00.960 --> 19:07.680
It already has some integration with FRR, so the idea is to have a pluggable data path: one can

19:07.680 --> 19:13.520
be kernel, for those users for whom kernel speed is more than enough, and grout or any other

19:13.520 --> 19:24.560
possible data path available. Resources: we try to keep the documentation easy to consume, again with

19:24.560 --> 19:31.520
examples on how to configure OpenPERouter to integrate with the existing projects. There is

19:31.520 --> 19:37.200
this talk that I gave last year about how EVPN works and how I started prototyping this.

19:38.400 --> 19:47.040
And that's basically it. So, long story short, it's still a project that we are bootstrapping. It works,

19:47.040 --> 19:54.640
it has CI, and we are open to collecting feedback. If you have use cases that we didn't think about,

19:54.640 --> 20:01.280
just file an issue. And also we are trying to be contributor friendly; we already have some external

20:01.280 --> 20:08.880
contributions, which I think is very nice. And with that we are done, if there are any questions.

20:08.880 --> 20:14.400
So we don't have much time, but maybe just one question. And Maxine, the next speaker, if you're here,

20:15.040 --> 20:19.040
please come on up.

20:22.240 --> 20:27.920
All right, so thank you for the talk. The Calico integration, is it just BGP, or does

20:27.920 --> 20:34.000
Calico have something fancier that you are using for it? I don't know. The only thing that

20:34.000 --> 20:43.360
I tried, that we use, is BGP. Okay, great, yeah, thank you very much. Sorry, we're out of time.

20:43.360 --> 20:51.520
And again, the point is that Calico sees us as a router, so the configuration is identical with the

