WEBVTT

00:00.000 --> 00:11.840
So, now for something completely different, but two completely familiar things, put

00:11.840 --> 00:15.000
together in a new way, I think.

00:15.000 --> 00:21.320
Thank you very much.

00:21.320 --> 00:26.520
Just for a show of hands [unclear], who here has any experience or interest in

00:26.520 --> 00:31.520
cloning or genome editing, like making new constructs and organisms?

00:31.520 --> 00:40.960
Oh, the question was, who here has any experience with cloning plasmids or genome editing

00:40.960 --> 00:43.360
or engineering organisms? Hands?

00:43.360 --> 00:45.360
Okay, perfect.

00:45.360 --> 00:49.280
Then a lot of you are going to learn some new stuff as well, so that's great.

00:49.280 --> 00:55.660
What I'm here to present is Git for genomes. It's about a tool we made specifically for

00:55.660 --> 01:03.140
doing version control on sequences, DNA sequences, protein sequences, in an engineering context.

01:03.140 --> 01:08.140
And where this comes from, and actually the reason I got into synthetic biology, is this notion

01:08.140 --> 01:15.060
that DNA is like code for computers, meaning we can read it, and we can also create it and

01:15.060 --> 01:22.100
design new genetic scripts to have cells execute what we want them to do.

01:22.100 --> 01:25.340
But the thing is, it does not feel at all like software engineering.

01:25.340 --> 01:30.820
So once you're at it for a while, you'll notice it's much harder, and it's actually

01:30.820 --> 01:34.940
much more tricky than you would think coming into it.

01:34.940 --> 01:40.940
And to illustrate it with an example, let's say we have a project where we are tasked with

01:40.940 --> 01:47.140
making beer yeast that produces the hop flavor compounds, so that we can just make beer

01:47.140 --> 01:52.020
with yeast alone and then not have to worry about hops, like if you're on Mars or something

01:52.100 --> 01:55.940
or a space station, and you don't have access to hops.

01:55.940 --> 01:59.900
We are going to explain that in the context of the design-build-test cycle that synthetic

01:59.900 --> 02:06.420
biology people like to use. Starting with the design part, here I would start with

02:06.420 --> 02:10.660
sourcing genetic parts to express these genes, so parts that you put in front of your

02:10.660 --> 02:14.340
genes, or the genes themselves, to help the cell express them.

02:14.340 --> 02:18.980
iGEM is an amazing resource; it's an open-source collection of genetic parts.

02:18.980 --> 02:24.220
They do this massive competition every year for high schoolers, undergrads, and

02:24.220 --> 02:28.220
overgrads, where people come together to make genetically engineered machines.

02:28.220 --> 02:32.300
It's a lot of fun; I recommend checking them out.

02:32.300 --> 02:37.060
Then we would create combinations of all these variable parts, because I think we're not

02:37.060 --> 02:38.060
that good at it yet.

02:38.060 --> 02:43.020
So we always test a bunch in parallel, because it's a slow test process, so we might

02:43.020 --> 02:45.780
as well test quite a few at once.

02:45.780 --> 02:49.660
Then the second point is that we go to our build, so this is where we take our digital

02:49.660 --> 02:50.660
sequences.

02:50.660 --> 02:54.420
We're going to make them real, so that means like ordering synthetic sequences, putting

02:54.420 --> 02:58.300
them together, putting them into your host, and creating these organisms that have been

02:58.300 --> 03:03.900
engineered to execute these genetic programs.

03:03.900 --> 03:10.980
That part tends to be a little bit expensive, so you oftentimes, especially in academia, try

03:10.980 --> 03:14.300
to build a strategy so that you can reuse as many parts as possible.

03:14.300 --> 03:18.900
So it's like you're using little Lego bricks, and you're trying to use your synthetic

03:18.900 --> 03:23.140
DNA multiple times, so you get some more mileage out of that.

03:23.140 --> 03:25.180
That's where our first issue comes in.

03:25.180 --> 03:31.580
If you're trying to do this design-build-test on a large scale, then when you want

03:31.580 --> 03:36.020
to build an efficient build strategy, you kind of need to know what went into the

03:36.020 --> 03:37.020
design.

03:37.020 --> 03:38.020
You need some context.

03:38.020 --> 03:41.980
You can just give these build people, or a synthesis company, a list of sequences, and

03:41.980 --> 03:45.660
then go, like, "have at it", and that's going to cost you a lot of money.

03:45.660 --> 03:50.460
So there's this loss of context you sometimes see in industry between the design and

03:50.460 --> 03:53.900
the build stage.

03:53.900 --> 03:58.140
Then the third stage is where we would read back our genome, especially

03:58.140 --> 04:04.540
for [unclear], or if you're doing an insertion into a genome. The

04:04.540 --> 04:08.420
genome editing process is not perfect, so oftentimes it doesn't work.

04:08.420 --> 04:12.860
So you end up with these assays where you say, like, our genes got into our bug, or our genes

04:12.860 --> 04:18.460
did not get into our bug, but more often than not, you end up in a third category where

04:18.460 --> 04:22.660
your genes got into your genome, but there are some mutations,

04:22.660 --> 04:26.180
either in the stuff you put in there, or somewhere else completely.

04:26.180 --> 04:32.860
And that's a point where you really end up in a lot of trouble in the sense that we don't

04:32.860 --> 04:35.180
yet know well how to deal with that.

04:35.180 --> 04:39.940
So a lot of our software is based on this: either these genes get into the host, or they

04:39.940 --> 04:41.300
do not get into the host.

04:41.300 --> 04:45.820
And on the other side, like the variant callers, they're great, they tell you the variants,

04:45.820 --> 04:47.740
but you lose the engineering context.

04:47.740 --> 04:51.740
So then you don't know exactly: what is that variant, is it important to me, or what

04:51.740 --> 04:54.900
should I expect?

04:54.900 --> 05:01.380
So that in total basically leads to this issue, where closing this design-build-test

05:01.380 --> 05:11.220
loop is, yeah, [unclear], or doing it fast, that gets really tricky, and the whole

05:11.220 --> 05:13.860
engineering process becomes a lot harder than it should be.

05:13.860 --> 05:19.380
So, like, synthetic biology hasn't really taken off that much yet, and that's in part

05:19.380 --> 05:24.380
because we have some gaps that we can be inspired by software engineering to

05:24.380 --> 05:27.140
solve.

05:27.140 --> 05:29.580
So that's where our project comes in.

05:29.580 --> 05:35.420
So David, my co-worker, is sitting right there, and he's another fellow contributor,

05:35.420 --> 05:44.580
and we built a client called Gen that is heavily inspired by Git, and it is

05:44.580 --> 05:47.700
designed specifically for sequences.

05:47.700 --> 05:52.900
It's a Rust crate; we have a command-line interface, we have Python bindings, we have a nice

05:52.900 --> 06:01.460
tool as well, and what it is, is it's a way to organize your sequences in repositories,

06:01.460 --> 06:06.340
which are SQLite databases, where we track every change you make here in the design stage,

06:06.340 --> 06:12.740
or in the observation stage, when you sequence your samples, and we track those as operations,

06:12.740 --> 06:18.740
basically commits, and then we supply the familiar Git commands to initialize your repository,

06:18.980 --> 06:23.460
synchronize with your remote repository, make branches, basically the workflow that software

06:23.460 --> 06:30.500
engineers are used to, but on top of that, we add a whole bunch of sequence-specific commands

06:30.500 --> 06:32.220
and operations.

06:32.220 --> 06:38.660
So when you look back at our design-build-test, for example, here are some of the functions

06:38.660 --> 06:45.180
like I would use to reflect what I did in the lab, and what I saw in my, in my data,

06:45.180 --> 06:49.860
and load them into my repository, so I'm building the story of what happened to

06:49.860 --> 06:56.580
this strain, what went into the design, why did I do it, and sort of end up with a more

06:56.580 --> 07:04.300
reviewable PR for your collaborators, for government agencies, for anyone really.

07:04.300 --> 07:10.180
So here, for instance, when we're planning this build strategy, we can chunk up

07:10.180 --> 07:17.900
the sequences, we can stitch them back together and keep track of all that.

07:17.900 --> 07:23.620
Yeah, and then here's some other ones as well, where we have ways of interacting with

07:23.620 --> 07:31.180
these variant call files and loading them up in a way that we can view and study.

07:31.180 --> 07:35.780
Now why don't we just use Git? Like, why do we need this whole new system?

07:35.780 --> 07:40.100
if we are just working with sequences, why can't we use our existing tools?

07:40.100 --> 07:45.220
And the reason for that is that coordinate frames on genomes are really, really fragile.

07:45.220 --> 07:53.780
So what you end up with is I can be talking about a genome, and I say, like, base 1,255,

07:53.780 --> 07:56.740
and then you can think we're talking about the same thing, but if you have a different reference

07:56.740 --> 07:59.220
in mind, then we're screwed.
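
NOTE A minimal sketch (not spoken in the talk) of why a bare coordinate is fragile: the same position number points at different bases once one copy of the sequence has picked up an insertion. The sequences and the position are made up for illustration and do not come from the talk or from Gen.
```python
reference = "ATGGCCATTGTAATGGGCCGC"
# A hypothetical engineered copy with a 3-base insertion near the start
engineered = reference[:4] + "TTT" + reference[4:]
position = 10  # 1-based coordinate that two people believe they agree on
base_in_reference = reference[position - 1]    # "G"
base_in_engineered = engineered[position - 1]  # "A"
# The base that used to sit at position 10 now sits at position 13
shifted = engineered[position - 1 + 3]         # "G"
```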

07:59.220 --> 08:05.660
Now the references also have an issue, like there's no one here who is like the reference

08:05.660 --> 08:06.660
human, right?

08:06.660 --> 08:10.580
But we're still working with these reference genomes, and part of that is because that

08:10.580 --> 08:18.140
was just the best we had at the time, but it has also slowed down the engineering side,

08:18.140 --> 08:23.860
because we then also get into these issues of intended and observed variants,

08:23.860 --> 08:28.700
where if you are thinking that you're working with a certain sequence, your reference

08:28.700 --> 08:33.340
sequence, and then in the lab, you soon see your real data that tells you, there's actually

08:33.340 --> 08:37.380
a thousand differences compared to your reference, like how do you find like the relevant ones

08:37.380 --> 08:42.780
from that and how do you find your own engineering context there?

08:42.780 --> 08:44.780
So how do you solve that?

08:44.780 --> 08:48.500
We solved it a little earlier, so actually I'm going to ask again, how many people here

08:48.580 --> 08:53.580
are familiar with, like, pangenomes and graph genome representations?

08:53.580 --> 09:01.460
Okay, okay, so as a very brief introduction to that, it's a way of representing genome

09:01.460 --> 09:09.220
sequences, or DNA sequences, in a nonlinear fashion. What you do is: every linear sequence

09:09.220 --> 09:15.540
is a walk in a graph from a start to an end, and when you have a variant, you add

09:15.620 --> 09:20.260
an edge, so it's edges and nodes, and a node carries a subsequence.

09:20.260 --> 09:25.820
So in this case, what I'm showing here is when you have an AT that mutated to a TG, instead

09:25.820 --> 09:31.500
of following your initial reference, you hop onto another node and then hop back,

09:31.500 --> 09:40.500
and that allows you to very efficiently store a ton of variants in one object, let's see,
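
NOTE A minimal sketch (not spoken in the talk) of the graph idea described above: nodes carry subsequences, every linear sequence is a walk from start to end, and a variant is recorded by adding one node and two edges without rewriting anything that already exists. The class, method, and node names are hypothetical and are not Gen's actual data model or API; the flanking bases around AT/TG are invented.
```python
class SequenceGraph:
    """Toy additive sequence graph: nodes hold subsequences, edges link nodes."""
    def __init__(self):
        self.nodes = {}      # node id mapped to its subsequence
        self.edges = set()   # (from_id, to_id) pairs; only ever added to
    def add_node(self, node_id, seq):
        self.nodes[node_id] = seq
    def add_edge(self, src, dst):
        self.edges.add((src, dst))
    def spell(self, walk):
        """Return the linear sequence for a walk, checking every hop is an edge."""
        for src, dst in zip(walk, walk[1:]):
            if (src, dst) not in self.edges:
                raise ValueError(f"no edge {src} to {dst}")
        return "".join(self.nodes[n] for n in walk)
g = SequenceGraph()
# Reference, already split around the variable site: GGC | AT | CCA
for node_id, seq in [("left", "GGC"), ("ref", "AT"), ("right", "CCA")]:
    g.add_node(node_id, seq)
g.add_edge("left", "ref")
g.add_edge("ref", "right")
# The AT to TG variant: hop onto a new node, then hop back onto the reference
g.add_node("alt", "TG")
g.add_edge("left", "alt")
g.add_edge("alt", "right")
reference = g.spell(["left", "ref", "right"])   # "GGCATCCA"
variant = g.spell(["left", "alt", "right"])     # "GGCTGCCA"
```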

09:40.740 --> 09:53.900
and then what's also interesting there, where I spend a lot of effort, for the engineering

09:53.900 --> 09:59.460
side, more than the pangenome side, is that these points where you have forks can

09:59.460 --> 10:03.780
actually represent many things that are very useful for us.

10:03.780 --> 10:09.700
On the one hand, it can represent mixed populations and polyploidy, as in, these are real

10:09.700 --> 10:14.020
things that happen that you don't see in any of your sequences. On the other hand, it can

10:14.020 --> 10:18.660
give you historical variants: in our software, every variant that it sees, it adds. We use this

10:18.660 --> 10:24.260
additive data model so that we keep a change log like that, and then you can also use that to

10:24.260 --> 10:29.460
represent all of the variants you're screening. So let's say I'm testing 10 sequences; that

10:29.460 --> 10:35.860
can be 10, like, legs of that fork, and then we can work with this entire screening

10:35.940 --> 10:40.500
library at once, and that becomes really interesting once you start combining many, many

10:40.500 --> 10:44.980
parts, because that quickly balloons, so oftentimes we find people restricting their experiments

10:44.980 --> 10:52.100
based on what they think they can cover, but then this allows them to go much, much higher

10:52.100 --> 10:56.660
without having these huge data sets. And I like to call these Schrödinger's molecules, because these

10:56.660 --> 11:00.580
are molecules that can be anything until you sequence, and then, like, this superposition

11:00.580 --> 11:05.620
collapses, and it becomes this useful metaphor of working with the sequences where you think

11:05.620 --> 11:12.500
you know what it is, but you don't know yet. So what we're offering here is then, not just

11:12.500 --> 11:17.620
then a version control client, but also tools for working with these graphs, to make it easier to

11:17.620 --> 11:23.060
visualize and conceptualize and do some cutting and pasting, and we're hoping to get more people

11:23.060 --> 11:30.020
into the graph genome space, especially on the engineering side. We also have a web interface at

11:30.100 --> 11:36.180
general bio, that is also being worked on so that we have as many ways as possible to get people

11:36.180 --> 11:41.460
on this, and to get people familiar with this and working with this. And then the whole idea is that

11:41.460 --> 11:47.140
we want to promote collaborative engineering. So we support many common bioinformatics file formats,

11:47.140 --> 11:51.780
and anyone, if you want to get your format on here, let us know, and then we'll work on that.

11:52.980 --> 11:59.060
We have ways of doing this decentralized distribution via patch files, so you can just email patches

11:59.140 --> 12:03.220
to each other and keep your repositories in sync. We have remote repositories, like you're

12:03.220 --> 12:08.980
used to with Git, that we host ourselves at general.io, or that you can host yourself,

12:08.980 --> 12:14.500
and our Git repo is there; we're Apache 2 licensed, and we're accepting PRs.

12:18.500 --> 12:24.420
So basically our entire goal is going from a world where we're sharing sequences in

12:24.420 --> 12:29.860
Word files, or horrible patterns that are illegible, and getting to a world where it

12:29.860 --> 12:35.060
looks more like a Git repo, and we can get as fast as software engineering with engineering biology.

12:35.700 --> 12:40.340
Thank you.

12:43.220 --> 12:49.220
So it's time for questions. Do we have any questions? Yes, I get friends. How can you

12:49.220 --> 12:59.540
store your data? So the question was, how do you store your data? It's SQLite. We have per repository,

12:59.540 --> 13:06.260
we can make one or more SQLite databases as files. So it's a graph model on SQLite.

13:10.900 --> 13:15.940
So we import standard files. One of the things that we did, that wasn't there in the

13:15.940 --> 13:23.300
pangenomics space, is: when you have your GFA file from that space, and you add more variants with

13:23.300 --> 13:31.540
your changes, everything changes. It's a very fragile model in the sense that if you break up a

13:31.540 --> 13:36.900
node, then all edges connected to that node need to be changed. So the way we do it is we

13:38.020 --> 13:44.260
work with a slightly different model of graphs that's additive. So we set it up in a way that is

13:44.260 --> 13:49.220
completely additive, and then we put it in SQLite, so it scales well. But we import and

13:49.220 --> 13:54.740
export standard files.

13:58.420 --> 14:04.900
Yeah, so could you use this for genome alignments? We don't include an aligner, but we do interface with

14:04.900 --> 14:11.220
those aligners. So you can have your aligner, like Cactus is a common one, and then pipe it

14:11.300 --> 14:14.260
into Gen, and then have the results end up in your repo.

14:27.540 --> 14:34.100
Yeah, so this is, as I said, the question was, you have these files where you have

14:34.980 --> 14:43.780
instead of A, C, T, G, an N, or K, or whatever, like ambiguous bases. It's not in our main

14:44.580 --> 14:50.740
branch yet, but we have some scripts where we convert those on the fly into a graph. So you

14:50.740 --> 14:56.740
have these barcodes, that can be like millions of possible sequences, and you have like 20 nodes.

14:56.740 --> 15:00.740
And that's sort of what makes that exponential problem [unclear].
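
NOTE A minimal sketch (not spoken in the talk) of turning ambiguous bases into a small graph: each position becomes a layer of concrete-base nodes, consecutive layers are assumed fully connected, and the number of walks grows multiplicatively while the number of nodes grows additively. The IUPAC table is the standard one; the function names are hypothetical and this is not the script mentioned by the speaker.
```python
from itertools import product
# Standard IUPAC nucleotide ambiguity codes
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC", "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}
def to_layers(pattern):
    """One layer per position; each layer lists the concrete bases (nodes) allowed there."""
    return [list(IUPAC[base]) for base in pattern.upper()]
def count_nodes(layers):
    """Nodes add up across layers."""
    return sum(len(layer) for layer in layers)
def count_sequences(layers):
    """Walks multiply across layers."""
    total = 1
    for layer in layers:
        total *= len(layer)
    return total
def enumerate_sequences(layers):
    """Spell out every walk; only sensible for small patterns."""
    return ["".join(path) for path in product(*layers)]
layers = to_layers("ANK")
small = enumerate_sequences(layers)   # 1 * 4 * 2 = 8 sequences from 7 nodes
barcode = to_layers("N" * 10)         # 40 nodes stand in for 4**10 = 1,048,576 sequences
```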

15:04.180 --> 15:08.740
Yeah, one more, final. Is there a plan also to have, like, the human genome in a repo

15:08.740 --> 15:14.500
where we can put all our stuff? Yeah, so one of our developers, he stress-tested everything with like

15:14.500 --> 15:21.540
100,000 human genomes at once, so it's built to handle that scale, and it does, and it does it well.

15:21.540 --> 15:27.060
Yeah. Okay, so everybody, please, thank you, Bob.

