WEBVTT

00:00.000 --> 00:07.000
Thank you.

00:07.000 --> 00:10.000
Yes, so my name is Felix.

00:10.000 --> 00:12.000
I'm going to talk about Datavzrd,

00:12.000 --> 00:15.000
which is a project of our lab from the last,

00:15.000 --> 00:16.000
I don't know, four, five years.

00:16.000 --> 00:17.000
I would say it started.

00:17.000 --> 00:20.000
And it's about reporting your tabular data.

00:20.000 --> 00:24.000
And yes, all of you are probably involved in bioinformatics

00:24.000 --> 00:26.000
and run some sort of analysis.

00:26.000 --> 00:29.000
You have some output of them.

00:29.000 --> 00:33.000
And these are likely to be tables.

00:33.000 --> 00:34.000
Yeah.

00:34.000 --> 00:37.000
And people tend to use Excel for that,

00:37.000 --> 00:39.000
although it comes with some caveats,

00:39.000 --> 00:41.000
especially non-computational people use it,

00:41.000 --> 00:44.000
and yeah, it's not very reproducible.

00:44.000 --> 00:46.000
It also comes with some other,

00:46.000 --> 00:48.000
like, not so nice features.

00:48.000 --> 00:50.000
If you have, like, a column with genes in it,

00:50.000 --> 00:53.000
it tends to convert them into dates and this sort of stuff.

00:53.000 --> 00:55.000
And yeah.

00:55.000 --> 00:59.000
Also, you can use stuff like pandas or, more lately, Polars

00:59.000 --> 01:01.000
or, within R, the tidyverse.

01:01.000 --> 01:03.000
And yeah, this is great.

01:03.000 --> 01:04.000
We use this ourselves,

01:04.000 --> 01:06.000
but more within, like, the analysis.

01:06.000 --> 01:09.000
And not for, like, visualizing the final results

01:09.000 --> 01:12.000
and communicating the tables.

01:12.000 --> 01:15.000
to, for example, doctors in a molecular tumor

01:15.000 --> 01:16.000
board scenario.

01:16.000 --> 01:19.000
And additionally, also a single table

01:19.000 --> 01:21.000
might not even be enough for most of the use cases

01:21.000 --> 01:23.000
we have in bioinformatics.

01:23.000 --> 01:25.000
For example, you could have, like, an oncoprint

01:25.000 --> 01:28.000
and single variant calls, or differentially expressed genes

01:28.000 --> 01:29.000
and an expression matrix.

01:29.000 --> 01:32.000
And this means this is, like, hierarchically,

01:32.000 --> 01:35.000
like, structured data.

01:35.000 --> 01:37.000
So you have, like, one big overview table

01:37.000 --> 01:39.000
and then you have, like, multiple other tables

01:39.000 --> 01:41.000
where each row of the overview table,

01:41.000 --> 01:44.000
basically corresponds to, like, one table

01:44.000 --> 01:45.000
with more details in them.

01:45.000 --> 01:47.000
Or you could have something like a join

01:47.000 --> 01:50.000
in a database, where, like, one row corresponds to another row.

01:51.000 --> 01:54.000
And additionally, one might also want to include,

01:54.000 --> 01:55.000
like, plots in there.

01:55.000 --> 01:58.000
And this could be, like, a singular big plot

01:58.000 --> 02:01.000
or, like, a plot within each cell, yeah,

02:01.000 --> 02:03.000
of a single table.

02:03.000 --> 02:06.000
And to quickly recap the state of the art,

02:06.000 --> 02:08.000
we can have, like, individual TSV files,

02:08.000 --> 02:10.000
Excel files, and individual plots

02:10.000 --> 02:15.000
in, like, I don't know, SVG or PDF or PNG format.

02:15.000 --> 02:18.000
Yeah, these are very easy to publish, just single files.

02:18.000 --> 02:19.000
You can open them.

02:19.000 --> 02:21.000
But they come with limited interactivity,

02:21.000 --> 02:24.000
and also, for the TSV and Excel scenario,

02:24.000 --> 02:27.000
with, like, very limited visualization.

02:27.000 --> 02:28.000
Yeah.

02:28.000 --> 02:30.000
And, as I mentioned with the oncoprint, for example,

02:30.000 --> 02:32.000
the connections in between, like, different items

02:32.000 --> 02:34.000
in between different tables, really get lost

02:34.000 --> 02:36.000
in, like, plain CSV, or, I don't know,

02:36.000 --> 02:38.000
Excel files, yeah.

02:38.000 --> 02:42.000
And, on the other hand, one could do, like, a custom solution,

02:42.000 --> 02:45.000
using, like, stuff like Shiny or Lumen,

02:45.000 --> 02:49.000
like, these frameworks for, yeah, running, like, a web application,

02:49.000 --> 02:53.000
or you could even go ahead and, like, implement your own thing.

02:53.000 --> 02:56.000
And this comes with a big implementation overhead.

02:56.000 --> 02:58.000
You have to sit down, write some code,

02:58.000 --> 03:00.000
and, yeah, it just takes time.

03:00.000 --> 03:01.000
I don't have to tell you.

03:01.000 --> 03:04.000
And, yeah, you need to maintain that server.

03:04.000 --> 03:06.000
You need to make sure it keeps running over time.

03:06.000 --> 03:07.000
Yeah.

03:07.000 --> 03:11.000
And, therefore, the long-term maintenance for that is challenging.

03:11.000 --> 03:13.000
Out of this, we can formulate a problem

03:13.000 --> 03:16.000
where the input is basically a set of tables,

03:16.000 --> 03:18.000
and the relations between these tables,

03:18.000 --> 03:20.000
and a set of rendering definitions.

03:20.000 --> 03:23.000
And the output shall be portable,

03:23.000 --> 03:27.000
and also an interactive and visual representation of our data set.

03:27.000 --> 03:30.000
And, yeah, as I already mentioned in the beginning,

03:30.000 --> 03:32.000
for that, we developed Datavzrd.

03:32.000 --> 03:35.000
On the left, you see a configuration file.

03:35.000 --> 03:38.000
Datavzrd is invoked from the command line

03:38.000 --> 03:40.000
with that configuration file.

03:41.000 --> 03:43.000
And, in that configuration file,

03:43.000 --> 03:45.000
you define your data sets.

03:45.000 --> 03:48.000
Yeah, the configuration file is written in YAML format,

03:48.000 --> 03:51.000
so very easily readable and editable for humans.

03:51.000 --> 03:55.000
And, you define these, like, linkages in between different data sets,

03:55.000 --> 04:00.000
and also per column, define what sort of visualization you have.

04:00.000 --> 04:03.000
And then, when you call Datavzrd from the command line,

04:03.000 --> 04:05.000
it generates an HTML report,

04:05.000 --> 04:08.000
so it's a bit similar to something like MultiQC,

04:09.000 --> 04:12.000
that is self-contained, and is then openable,

04:12.000 --> 04:14.000
in the browser on any system.

04:14.000 --> 04:16.000
So, it's very portable.

04:16.000 --> 04:20.000
To show you a bit further how you configure this stuff.

04:20.000 --> 04:23.000
There's a few examples I want to show you here.

04:23.000 --> 04:26.000
You simply, yeah, give your data set,

04:26.000 --> 04:28.000
for example here, there's some movie stuff.

04:28.000 --> 04:30.000
You give it an arbitrary name, like oscars,

04:30.000 --> 04:34.000
and then give the file path where the CSV is located.

04:34.000 --> 04:37.000
We also support, like, JSON or Parquet,

04:37.000 --> 04:39.000
that sort of stuff.

04:39.000 --> 04:44.000
Yeah, and then you basically say what column is linked to another column,

04:44.000 --> 04:47.000
if you want to jump around in your data set.

04:47.000 --> 04:49.000
For example, Datavzrd will define for that,

04:49.000 --> 04:51.000
like, automatic link-outs.

04:51.000 --> 04:53.000
Yeah, so you click on one row,

04:53.000 --> 04:55.000
and then jump to your other data set immediately

04:55.000 --> 04:58.000
with the row that corresponds highlighted.

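NOTE
Editor's sketch of the dataset definitions and links just described, written from memory of the Datavzrd documentation rather than taken from the slide. The file paths, the link name, and the column names "movie" and "Title" are assumptions for illustration, and key names may differ between versions. This is only a fragment: a full configuration also needs a views section.
```yaml
# Hypothetical fragment; paths, link name and column names are illustrative.
datasets:
  oscars:                        # arbitrary dataset name
    path: "data/oscars.csv"      # assumed location of the CSV file
    links:
      link to movie:             # arbitrary link name
        column: movie            # column in this dataset that holds the key
        table-row: movies/Title  # jump to the matching row of dataset "movies"
  movies:
    path: "data/movies.csv"      # assumed location of the second table
```
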
04:58.000 --> 05:01.000
Yeah, that's about the data sets,

05:01.000 --> 05:05.000
and the same goes for basically visualizing any column.

05:05.000 --> 05:08.000
Yeah, for example, you type in the name of the column,

05:08.000 --> 05:10.000
then you say, oh, I want a plot.

05:10.000 --> 05:11.000
It shall be a tick plot.

05:11.000 --> 05:14.000
And what you get is what you see on the right here.

05:14.000 --> 05:16.000
And yeah, there are, like,

05:16.000 --> 05:19.000
a lot of pre-made options for this plot.

05:19.000 --> 05:23.000
For example, also bar plots or, as I showed you, a heat map.

05:23.000 --> 05:26.000
We have a link out, for example,

05:26.000 --> 05:28.000
if you want to link out to stuff, like,

05:28.000 --> 05:30.000
ClinVar, based on the value that's in the cell.

05:30.000 --> 05:32.000
Datavzrd will render these links.

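NOTE
Editor's sketch of per-column rendering as described above, assuming the key names recalled from the Datavzrd documentation (render-table, columns, plot, ticks, link-to-url). The view name, the dataset name "variants", the column names, and the ClinVar search URL pattern are illustrative assumptions, not taken from the talk.
```yaml
# Hypothetical fragment of a view; names are illustrative.
views:
  variants:
    dataset: variants            # assumed dataset defined under "datasets"
    render-table:
      columns:
        allele frequency:        # assumed numeric column
          plot:
            ticks:               # draw a tick plot inside each cell
              scale: linear
        variant:                 # assumed column holding a variant identifier
          link-to-url:
            ClinVar:             # label shown for the link-out
              url: "https://www.ncbi.nlm.nih.gov/clinvar/?term={value}"
```
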
05:32.000 --> 05:35.000
And then what I also want to highlight is that you can do,

05:35.000 --> 05:37.000
like, custom plots using the Vega-Lite library,

05:37.000 --> 05:39.000
simply pass the plot specs,

05:39.000 --> 05:42.000
and then it allows you to do, like,

05:42.000 --> 05:44.000
any plot in a single cell.

05:44.000 --> 05:46.000
And also, like, a full plot view of this,

05:46.000 --> 05:47.000
is what we can also do.

05:47.000 --> 05:48.000
And if that's not enough,

05:48.000 --> 05:52.000
we support, like, passing any arbitrary, like,

05:52.000 --> 05:55.000
JavaScript function that manipulates the content of a cell.

05:55.000 --> 05:58.000
So this means you can really do anything with it.

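NOTE
Editor's sketch of the custom JavaScript hook mentioned above, assuming a "custom" key that takes a function of the cell value and the row and returns the HTML to display. The key name and function signature are recalled from the documentation and should be verified. The view, dataset and column names are hypothetical.
```yaml
# Hypothetical fragment; view, dataset and column names are illustrative.
views:
  oscars:
    dataset: oscars
    render-table:
      columns:
        name:
          custom: |
            function(value, row) {
              // Return the HTML that should be rendered in this cell.
              return `<b>${value}</b>`;
            }
```
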
05:59.000 --> 06:00.000
Yeah.

06:00.000 --> 06:04.000
And, once again, coming to the portability of Datavzrd,

06:04.000 --> 06:07.000
so as I mentioned, you invoke it from the CLI,

06:07.000 --> 06:10.000
and then it outputs a single directory.

06:10.000 --> 06:13.000
And this means you can open it on any machine.

06:13.000 --> 06:15.000
No servers needed.

06:15.000 --> 06:18.000
And, yeah, this means also it's very easily,

06:18.000 --> 06:19.000
easily shareable.

06:19.000 --> 06:22.000
Yeah, you can just zip up the whole directory,

06:22.000 --> 06:24.000
send it around to a doctor,

06:24.000 --> 06:29.000
or, like, I don't know, upload it to GitHub Pages,

06:29.000 --> 06:33.000
which actually means that you also outsource the server stuff,

06:33.000 --> 06:36.000
basically, as it's only static HTML.

06:36.000 --> 06:38.000
And what's also very nice, you can attach it,

06:38.000 --> 06:40.000
like, to a publication.

06:40.000 --> 06:43.000
This means the reviewers, or, like,

06:43.000 --> 06:45.000
any other people who read your manuscript,

06:45.000 --> 06:47.000
can basically explore the data set,

06:47.000 --> 06:49.000
the same way you can do.

06:49.000 --> 06:53.000
And this also brings me to the end of my presentation.

06:53.000 --> 06:56.000
I want to thank my PI, Johannes Köster,

06:56.000 --> 06:57.000
and my whole group.

06:57.000 --> 07:00.000
If you want to dig further into Datavzrd and use it,

07:00.000 --> 07:04.000
we have it on Conda and all that sort of stuff.

07:04.000 --> 07:07.000
And also, we have a publication that we did last year

07:07.000 --> 07:09.000
that also explains a lot more.

07:09.000 --> 07:10.000
Thank you.

07:10.000 --> 07:11.000
Thank you.

07:12.000 --> 07:17.000
You've got time for maybe one question.

07:17.000 --> 07:19.000
Yes, for two minutes.

07:19.000 --> 07:20.000
Two minutes.

07:20.000 --> 07:21.000
Yeah, go ahead.

07:21.000 --> 07:22.000
I'd like to go ahead.

07:30.000 --> 07:35.000
So the question was, whether you can, like,

07:35.000 --> 07:37.000
apply data transformation,

07:38.000 --> 07:41.000
within that YAML config. So, yes and no.

07:41.000 --> 07:44.000
So you can do some manipulation,

07:44.000 --> 07:48.000
or basically add columns based on the values of other columns.

07:48.000 --> 07:51.000
This is possible, but if you want to do,

07:51.000 --> 07:54.000
like, stuff before that, I would advise

07:54.000 --> 07:57.000
putting this, like, earlier into the workflow.

07:57.000 --> 08:00.000
this way it also stays a bit more readable to the user

08:00.000 --> 08:02.000
and stays more reproducible.

08:02.000 --> 08:04.000
Another question.

08:04.000 --> 08:06.000
I can talk presenters.

08:06.000 --> 08:08.000
We are starting to assemble the S,

08:08.000 --> 08:09.000
which we're presenting.

08:09.000 --> 08:11.000
Please come up and try to hear.

08:11.000 --> 08:13.000
Is there another question?

08:13.000 --> 08:15.000
I can ask one.

08:15.000 --> 08:16.000
Yes.

08:16.000 --> 08:18.000
Do you have a schema for your YAML?

08:18.000 --> 08:21.000
Actually, yes, we can, we can derive that.

08:21.000 --> 08:23.000
Datavzrd is written in Rust.

08:23.000 --> 08:26.000
And we can simply derive that as we have this,

08:26.000 --> 08:28.000
as structs in Rust.

08:28.000 --> 08:29.000
Yes.

08:29.000 --> 08:30.000
Yeah.

08:30.000 --> 08:31.000
Have you actually published it?

08:31.000 --> 08:33.000
Because that's useful, rather than parsing the Rust.

