– Good morning. – Good morning. – Good morning, hello. – Hello everyone, a bright
and fresh new day, and the sun is shining here. So a very warm welcome to everyone
from around the world. Let us know where you're
watching from, by the way, in the chat, from all the different places. And it was a really, really cool
and exciting day yesterday. I really loved the talks. It's very, very impressive what people are building
with Neo4j technologies and the datasets and graphs in general, and I can't wait for today. How about you two?

– Absolutely, the same. I had an exciting day yesterday. I think it was really good seeing all these presentations; we touched on all kinds of topics yesterday, and I think today it continues. I'm really looking
forward to the sessions. So yeah, I think it's gonna be good. – I agree, completely agree
about the presentations yesterday. Exciting stuff in store for today. Let me share my screen and let's have a quick look
at the agenda for today. – Let's do that. – There we go. Sorry, I clicked the wrong button. I wanted to click here, perfect. So this is our agenda for today. After a short opening from us, we have a last-minute addition,
and that's an interesting and fantastic talk about
the Clinical Knowledge Graph. I suppose a lot of you know the publication, which was on bioRxiv, and Rita will describe this graph and its framework for translational clinical proteomics.

And from what I know about the Clinical Knowledge Graph, the scope of the data they've
integrated is massive. And the interesting thing
is that the whole framework to build and update this Knowledge Graph is public and available. We will follow with a super-interesting
talk from the German Center for Diabetes Research, also a very, very public and
very open project with a huge focus on publications
and on integrating everything in a translational
manner, from clinical research down to basic research. Then we switch focus a bit to microorganisms, which is, I think, super interesting.

And we have a talk about
Neo4j for bacterial genomes, and I'm really, really looking
forward to the data modeling of genomes and what kind
of queries they can run. We will follow with the talk
on Multi-Omics integration. I think this topic is very relevant and interesting for almost everyone. And what I found particularly
interesting yesterday is that in the OMOP integration talk, Multi-Omics integration was also mentioned as something for the future.

And one of the motivations, by the way, to go from the OMOP system or OMOP modeling to Neo4j was to be able to integrate
non-patient data with that same OMOP observation data, like microbiomics studies
of those patients. So this is also about microorganisms: microbiomics and Multi-Omics, independent of the actual organisms, right? So this is going to be an exciting talk.

We will follow with a bit of research: Davide Mottin from Aarhus University has maybe one of the best titles here, about reading the knowledge
in knowledge graphs. And what Davide does is, he
has a huge number of graph algorithms and graph data
analysis methods available that he can apply to Knowledge Graphs, and while the GDS is impressive and growing, there's so much more beyond that, that we can use to get more insights from our Knowledge Graphs. We'll get back to a healthcare orientation, with a speaker from
IQVIA going to describe the graph that
they use for evaluating drug safety data with graph databases. And I think this is a
very interesting topic, because this whole idea of understanding what a drug does after clinical studies, after clinical trials, is really growing. So everything in the area
of real-world evidence is just very, very
relevant for the industry, but obviously also for
patients and clinical practice. We'll finish with a talk on
natural language processing. In the life med presentation from Evelyn yesterday, we've seen how powerful
NLP can be in combination with the graph, because all the outputs and methods of NLP essentially produce graphs.

And if we can get it
into a Knowledge Graph, we can integrate it with so many different
layers of information. And finally, we have a presentation of
this Spark NLP pipeline and how he creates a clinical
knowledge graph on top of it. We have a closing session
with a summary and outlook, because, like we mentioned
maybe 500 times yesterday, the goal for this workshop is to foster collaboration and networking. So we wanna bring researchers
and industry together, we wanna bring the healthcare space and the life science research space together, and really fostering close
collaborations, networking, and sharing of data models and data is, I think, one of
the main goals of this event. Okay, that's our agenda for today. And without further ado, let's switch to Rita, who's going to present the
Clinical Knowledge Graph. – Hi, good morning everyone. Can you hear me fine? – Yes. – Perfect, so thank you to the organizers for inviting me for this talk. I'm really excited to be
amongst so many people that love graph databases and
have worked so hard on them. So today I'll just briefly
talk about this framework we developed for translational
clinical proteomics.

It's based on a graph
database, a Neo4j database. It's working, perfect. So we'll start with the why, then
the how: the package itself, the database and the
interfaces we created for it, and then where we envision
the next steps going. So as many of you probably
know, omics technologies like proteomics produce
very high resolution data, which allows for a more holistic
view of complex biological processes and complex diseases. And it's that exact same high
sensitivity and high accuracy that allows us to search for biomarkers or protein signatures
in health and disease and use them, hopefully, in the clinic.

The goal at Matthias' lab, both in Copenhagen and in Munich,
is to make this technology cost-efficient and high-throughput,
yielding quick, clear, interpretable, actionable
and reproducible results that can be more easily
translated into the clinic, and so closing this gap
between research and clinic. However, this is very complex, and the sheer amount of
data that is produced daily in multiple labs really highlighted the need
for specific analytics tools to gather, mine and integrate
this kind of information, which we found
was not so easy to get. So right now, a standard proteomics
workflow can take hours until you actually get the spectra and they are finally processed. And after this, you would find other roadblocks,
like long statistical analyses that are not
at all standardized.

So they kind of vary from lab to lab, or even within the same
lab, depending on who runs them; then very time-consuming and sometimes unfruitful
database searches and literature searches. And the biggest of all roadblocks is the actual manpower and time, as not all labs have
access to bioinformaticians, who are the experts in downstream analysis. So as part of the computational
clinical proteomics team, Alberto, myself, and Annelaura set out to create a tool
that would help standardize and automate this
downstream data analysis. So we basically took a page from Google and decided that a good first step would be to use a Knowledge Graph to help users resolve their analysis without having to navigate
through other websites, having dozens of tabs
open in their browser, or losing days or weeks
constantly searching PubMed and UniProt, for example. So we just wanted to
make their life easier and save them time that
they could be spending doing other experiments. For example, the CKG, our Clinical Knowledge
Graph automated framework, goes from MaxQuant or Spectronaut outputs, or other
outputs like FragPipe, or just mzTab files, in
minutes to actual knowledge.

And this includes the
automated analysis pipelines, so both the analysis
and data visualization; we have libraries for all of this. It includes data integration as well, and not only with the metadata we collect from all these publicly
available databases, but also if you have, for example, clinical data and proteomics, and you want to get information or knowledge from both of them combined. So that would be the
data integration part. And knowledge mining: sort of summarizing all the findings from the other tabs or the other analyses and showing them in a
very user-friendly way, like networks or Sankey plots, together with the
metadata that matches the data. So the workflow goes like this: we download these publicly
available databases and ontologies, and we have our experiments.

If you want, you can also
download some from PRIDE. And then we have our servers
that basically transform all this raw data into files that are very easy to load into the database. So basically just node
and relationship files with all the attributes we find relevant for the questions we will
later be asking the graph. So this is all loaded
into the Neo4j database with queries that we have predefined. So as long as these inputs
have the same format, data can be loaded from other
databases or ontologies. And then some other predefined
queries that we created will also mine the database
and feed that information into our analytics core
module, which runs in Python but also applies R functions. All the plotting and
analysis is dealt with in the background by a specific module, so we don't have to worry about it.

Everything is in the automated analysis, but you can access it in Jupyter notebooks and manipulate the data as you please, and use other functions that are not in the automated analysis, which is also shown in a Dash app that you can very easily navigate. Yeah, so this is how it works. The database itself, as I said, contains a lot of other
databases, 26 so far. We can always add more if
we find that it's relevant to answer our clinical questions, but it includes UniProt,
of course, PubMed, DrugBank, the Human Metabolome Database, Reactome, and a lot of other databases, also PhosphoSitePlus if you want
to look at modified proteins, and so on. Also ontologies, so we can
standardize and harmonize all the knowledge.

And that includes, for example, SNOMED CT because of
the clinical variables and clinical data we want to
analyze, the Disease Ontology, and of course Gene Ontology for
enrichment, amongst others. So these are the numbers at the
time we submitted the paper, so now they are higher. At the time we had a bit
more than 16 million nodes and 110 million relationships; now we are around 200 million. The data schema is a bit more complex, due to, of course, the questions we would like to answer and solve, but also because we want to
store all the information around the projects and the experiments from the researchers. So we actually attribute
unique identifiers
to projects, subjects and
the samples associated with them.

For now, it's analyzing mostly
proteomics and clinical data. You can see the links to
peptide, modified protein and protein, but it would be
very easy and straightforward to adapt it to other omics
data like transcriptomics or metabolomics by creating the processing and the links to transcript, gene or metabolite, for example. In terms of the interfaces,
we have a few of them; they all run in Dash. We are looking into
other, more advanced ones, but for now we have the homepage, right? Where we have basic stats on
the Neo4j database itself: how many nodes and relationships, how many of them came from
which database or ontology, the data schema, and also links to the other apps and to the project reports themselves. So we also have a small,
clever search drop-down menu. If you start typing the name of your project, it will show you the
ones that are available.

And by clicking, you can just go to the report
of the respective project. To make life easier for users, we also created this project creation app. It's basically a form. You just fill it in. And then by clicking on
"Create Project", it will create all these TSV or CSV files and load the information
into the graph database on its own in the background. And in the end, it will give you the option
to download templates for the data upload, and also give you an
internal project identifier that you also use for the data upload. The data upload is another app, where of course you use the identifier. It will just do a sanity check to make sure that the project you want to upload data to exists.

You would select the kind
of data you want to upload, the processing tool if we are talking about proteomics data, and then there is a drag-and-drop
field for the upload of the files themselves. In the end, you upload the data to the CKG and then you can run the default automated analysis
pipeline just by clicking the generate report link. It will run everything in the background. It will retrieve the proteomics data from the graph database.

It will run all the processing, statistics and correlations in the background. And then in the end, you can go to the
report page, and it will give you the tabs for the different kinds of knowledge: always the project information
tab, a knowledge tab with the summarized information
and metadata, one on multi-omics
if you have more than one type of data, and then one for each
type of data itself. In the proteomics tab, for example, you would get ready-for-publication plots. They can go from QC
sample plots, like proteins per group and peptides per group,
coefficient of variation and
abundance plots, to
stratification plots, PCA, heat maps, and functional
stratification as well.

The statistical tests: it will choose the appropriate tests according to your experimental
design. If it's, for example, paired or unpaired, if it needs t-tests or ANOVA, it will choose
the appropriate one itself. And then of course, the multi-omics: so correlation between, in our case, clinical and proteomics
data in the form of networks or WGCNA, for example. So to illustrate how
the CKG can accelerate both the analysis of the
data and its interpretation, and potentially, or hopefully, support clinical decision-making, we use this use case of
a urachal carcinoma patient with lung metastasis, from a previous study. In this study, in the end, we proposed the KDM1A protein as a possible druggable target, which was approved by the tumor board. Here we basically just
ran the entire analysis, the default one. And in that case, we got more than 300
significantly regulated proteins, which is not ideal if you want to prioritize
drug targets and candidates. So we extended this analysis
in Jupyter notebooks using only our analytics
core and the knowledge we have gathered in the graph database.

And so we started reducing
the list of significant or relevant proteins by taking only the up-regulated ones. And from these, we focused on the ones that
were already associated or were known to be
associated with lung cancer, by querying the graph database. Then we reduced this further by selecting the ones that were
targeted by inhibitory drugs, which gave us 19 proteins
targeted by 60 potential drugs. This is just a subset of
the network. Amongst them, we had KDM1A as a potential target and tranylcypromine
as the inhibitory drug, which was the one approved
originally by the tumor board for this patient.

But we also had another drug that could also work as an inhibitory drug. So already here, we can see that, in a
very short, concise way, we could reach not only
what was known and approved by the doctors and the tumor board, but potential alternatives as well. So to reduce this list further, we wanted to
avoid previously observed adverse reactions. So we retrieved all the side effects associated with these inhibitory drugs and with the drugs used in the patient's chemotherapy regimens. And we ranked them with a Jaccard index and we filtered for a very low cutoff.
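The ranking step described here can be sketched roughly as follows. This is a minimal illustration, not the CKG implementation: the function names, the drug names and the side-effect sets are hypothetical, and it assumes the Jaccard index is computed between each candidate drug's side effects and those of the earlier chemotherapy regimens, keeping only candidates at or below a low cutoff.

```python
from typing import Dict, List, Set, Tuple


def jaccard_index(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 for two empty sets)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def rank_by_overlap(
    candidates: Dict[str, Set[str]],  # hypothetical: drug name -> known side effects
    regimen_effects: Set[str],        # hypothetical: side effects seen under earlier regimens
    cutoff: float,
) -> List[Tuple[str, float]]:
    """Keep drugs whose overlap with the regimen is at most `cutoff`, lowest overlap first."""
    scored = [
        (drug, jaccard_index(effects, regimen_effects))
        for drug, effects in candidates.items()
    ]
    kept = [(drug, score) for drug, score in scored if score <= cutoff]
    return sorted(kept, key=lambda pair: pair[1])


# Made-up example data, for illustration only.
candidates = {
    "drug_a": {"nausea", "rash"},
    "drug_b": {"headache"},
}
regimen = {"nausea", "rash", "fatigue"}
ranking = rank_by_overlap(candidates, regimen, cutoff=0.5)
```

With this toy data, `drug_a` shares two of three side effects with the regimen and is filtered out, while `drug_b` shares none and is kept.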

So the fewer adverse reactions
shared with the earlier chemotherapeutic regimens, the better. And from those 60, we selected six, and we then looked at whether they were also known to be used already in lung cancer treatments, and whether, from these, any had by chance been described
as combination treatments in the literature already. And to our surprise and delight, we actually found Draco
stat in an arena stat that were mentioned in combination in more than 30 publications already. So in the end, we actually proposed these two candidates as a potential treatment for
this patient specifically. So this pipeline is completely reusable and can be applied to other studies. The functions are completely general, so as long as the inputs,
again, have the same format, it could run a similar
prioritization analysis on other studies. And this is only one of the
many applications of the CKG. All the knowledge was mined
from the graph database in a matter of a couple of hours, instead of going and searching
each one of the 19 proteins individually and looking for
drugs that could target them. If I haven't mentioned it before, by design the Clinical Knowledge Graph is an open source community project.

And we envision that the
community could directly contribute to its further development, helping extend the
database, the analyses, the visualizations and notebooks. So if you are curious,
please do go explore it, use it, contribute. Every aspect of it is very much welcomed. The reports and the notebooks can be shared across multiple labs. So we hope that in this way, it will contribute to more open science. And thinking long-term, we also envision that
having one knowledge graph connecting a vast amount
of proteomics experiments could benefit the community immensely by allowing direct
cross-project comparisons and leading to increasingly more robust and powerful analyses. And of course, with all the knowledge generation
that we could get from it, all of these opportunities
would really allow us to support expert panels
and medical doctors making clinical decisions,
never replacing them, but giving them a more
thorough knowledge base in a faster and more robust way.

And with this, we would really hope to move
to an era of translational evidence-based medicine. For online resources, if you are interested,
please go to GitHub. The repository is public. You can just download it, check it
out and then push changes if you want to contribute to it. The manual and the API
reference are on Read the Docs. We are continuously improving the docs, so if you also have suggestions for them, feel free to share them. And the manuscript is on bioRxiv and soon in Nature Biotechnology. With this, I would like
to thank the entire team, Matthias, Alberto and Annelaura; it was amazing to work with these guys, such talented researchers, and all the funding agencies of course, and all of you here today. Thank you. – Fantastic, thank you so much for this super interesting presentation.

And again, I think both the amount of data and data sources you have integrated, and the types of analysis you
can run, are super interesting, and obviously most
interesting for the community is that this is open source and usable. And let me look at the questions. So, someone was asking: how did you handle multi-target drugs in your type of prioritization? – That's a very good question. In this case, we were lucky; we had almost
one-to-one relationships. But with multi-target drugs, you could use other kinds
of data to prioritize them. We used adverse reactions; for example, you could use higher
or lower scores, or different kinds of evidence where that association came from. It all depends on what you prioritize. – So I think that already
answered the follow-up question: do you account for druggability
scores in any way? – Druggability scores, if they are
in the DrugBank database, yes. – And another question about
open source and accessibility.

So is it possible to run a local copy of your system, or how is it possible to reuse
your code and your project? – So you can. All of these notebooks, they are recipe notebooks, so they are on the GitHub repository. If you clone or download
it, you can install it; we have made the installation
a bit simpler. You can run a Docker image, for example, and then you can just open
it in a Jupyter notebook, and everything
I showed in this example will be in this notebook. – Fantastic, so it's really
easy to reuse and test. I think it's fantastic
that people can start reproducing this stuff independently. – If you are using the Jupyter notebooks, you just need to have a little
bit of knowledge of Python. That's why we also made the
automated analysis pipeline, so people don't have to code anything. But if you do have some knowledge, you can just manipulate
the data at will and mine the database in the notebook.

– Okay, so thank you so
much for your presentation. I think it was really interesting. We got a lot of feedback,
and I'm pretty sure that a lot of people will be interested in trying the Clinical Knowledge Graph. And good luck with the
publication, of course. So thank you, thank you so much. And with that, it's a pleasure
to announce the next speaker, Alexander Jarasch from the German Center for Diabetes Research. He will present how they went
from queries to algorithms to advanced machine learning. Your stage, Alex. – Good morning, everyone. Thanks for joining my session. And thanks also to all the
other speakers from yesterday and also the rest today. It's really fascinating
what you guys are doing. And I always think that
we are the ones doing graphs, but everybody's doing graphs. So I'm pretty much amazed
at all these applications.

So how can I share my screen here? Share my screen. Here we go. All right, so you should
see my screen now. I'm showing you today the applications that we are doing in the German Center
for Diabetes Research. And first of all, a little
summary of what we are. We are a nonprofit organization
and a federal institute in Germany, with five partners and several other associated partners, which sums up to roughly 450 researchers that have expertise
in basic research, but also in clinical trials
and clinical research. On the data level, we have a variety of data. Obviously we have very,
very unstructured data. It's very heterogeneous. Usually it's not connected, in a federated state like Germany. So we termed that "unFAIR". So here's an overview of my team. We have four developers and myself; we're sitting in Munich in Germany.

And our challenge is
that we have data silos. They are scattered around Germany, and usually people are
asking scientific questions that are usually very easy: how many data samples do we
have from blood in mouse or in human? It's a very easy question on the one hand, but to answer this
question on a data level is not that easy, because
the data is scattered, and this is what we are tackling. We also have another challenge because we have many different users here. For example, we have scientists who are usually not trained
as computer scientists, coming from the clinic or coming in as biochemists. So they have questions
at the scientific level or at the medical level. But we also have data scientists
who ask query questions or query-minded questions. And in the end, we want to give them all
access to the same database. Why are we using graphs? I think you're familiar
with questions like these: are human type 2 diabetic genes encoding enzymes acting on metabolites, which in turn are
regulated in the pig model?

This is the usual question
that a researcher would ask. What does it mean on a data level? Basically it means: is A connected to R? And I'm showing you here one Excel sheet, and I give you three seconds to see
if A is connected to R in terms of data points: 21, 22, 23. And I guess you didn't see it. I'm asking you the same question
now, is A connected to R, when we transfer our data
into a Knowledge Graph or into a graph database in general. And I think you can easily
see that A is connected to R. That means in the end, we have basically a graph problem, right? So we only have to
transfer our relational data or Excel sheet data into a graph so that we can ask scientific questions. I think I don't have to show
you why we use graphs here.
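The "is A connected to R" question can be sketched in a few lines once the spreadsheet rows are read as edges. This is only an illustration with made-up rows and hypothetical function names; it treats each row as an undirected link between two entities and uses a plain breadth-first search rather than a graph database.

```python
from collections import defaultdict, deque


def build_graph(rows):
    """Turn (source, target) rows, as they might appear in a sheet, into an adjacency map."""
    graph = defaultdict(set)
    for source, target in rows:
        graph[source].add(target)
        graph[target].add(source)  # assumption: links are read in both directions
    return graph


def is_connected(graph, start, goal):
    """Breadth-first search: True if `goal` can be reached from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


# Made-up rows standing in for the spreadsheet on the slide.
rows = [("A", "F"), ("F", "K"), ("K", "R"), ("B", "C")]
graph = build_graph(rows)
a_reaches_r = is_connected(graph, "A", "R")
```

In tabular form the same answer needs one self-join per hop, which is why the question is hard to read off a sheet but easy to read off a graph.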

Graphs are basically everything that we have in biology;
everything is connected. Everything is dependent on each other. We use Neo4j
because it's easy to model. It's extremely flexible
and also easy to adapt. So when we have to reshape the graph, it is very, very easy when we compare it to usual SQL models. I think we have quite a
huge graph database now, and we are still scaling on one server. Many of our queries have
cyclic dependencies, so that is very, very easy to answer. And last but not least, we're using the Graph Data Science
library and graph embeddings. I will show you that in a second. And that's why we are using Neo4j. Diabetes is a connected disease, and it's connected also
to other indications like cardiovascular diseases
such as stroke and heart attack. Diabetic patients get
Alzheimer's, diabetic patients get even liver cancer and
infectious diseases, and when we think about
COVID, also lung diseases.

So that's why I see diabetes
as a connected disease in a connected network. And that's why we have to
see not only diabetes data, but also connect other data
from other indications. The concept is pretty simple. In our case, in the center, we have a Knowledge Graph, which we built up from public data. We connect this by using
natural language processing and inferring knowledge; this is the second layer, doing
named entity recognition. And in the end, this is the outer layer, we connect our DZD data
to the Knowledge Graph so that we can ask scientific questions. Let me talk a little
bit about the stats. So currently we have a production server with roughly 330 million nodes and roughly a billion edges. On our hard drive, which is only SSDs, we have roughly 500 gigabytes. We have another development
server, which is way larger. We run all of that on a single
server machine with 60 cores and 256 gigabytes of RAM.

So nothing really special, I would say; no cloud instance. We are running Neo4j Enterprise because of live backups and the GDS library. We have some applications
running on top of our database; I will show you that in a second. And we use visualization and interactive browsing with SemSpect, which I'll also show you in a second. Some words on the data migration: most of our data comes
from SQL databases and also spreadsheets. And here we use py2neo and GraphIO, packages for Python basically. We have a Docker orchestration pipeline, which is open source, and you can also use that. We integrate data, and then we enrich the data,
normally using text matching algorithms, named entity recognition, natural language processing, you name it.

And I'm only focusing
here on this example: we integrate genes, RNA and proteins. I guess you're familiar with that. So genes are coding for RNA, the RNA is coding for protein, and this is where we
do inferring of knowledge, which in this case is
not really inferred knowledge;
it's straightforward. We are creating this red
relationship here, CODES, stating that a gene is coding
for a protein, in order to save some hops
in the graph database. And this is what we're doing many, many times in our database. You see the overview of our data model. You don't have to read that here.
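The idea of materialising a shortcut relationship can be sketched as follows. This is a plain-Python illustration over an edge list, not the DZD pipeline: the relationship type names `CODES_RNA`, `CODES_PROTEIN` and `CODES`, the function name and the identifiers are all hypothetical. For every gene-to-transcript edge followed by a transcript-to-protein edge, it emits one direct gene-to-protein edge, so later queries need one hop instead of two.

```python
def infer_shortcuts(edges, first_type, second_type, shortcut_type):
    """Return direct (start, shortcut_type, end) edges for every two-hop path
    start -[first_type]-> middle -[second_type]-> end."""
    # Index the second hop by its source node for quick lookup.
    second_hop = {}
    for source, rel_type, target in edges:
        if rel_type == second_type:
            second_hop.setdefault(source, []).append(target)

    inferred = set()
    for source, rel_type, middle in edges:
        if rel_type == first_type:
            for end in second_hop.get(middle, []):
                inferred.add((source, shortcut_type, end))
    return inferred


# Made-up edges: one gene with two transcripts, each coding for a protein.
edges = [
    ("gene_1", "CODES_RNA", "transcript_1"),
    ("gene_1", "CODES_RNA", "transcript_2"),
    ("transcript_1", "CODES_PROTEIN", "protein_1"),
    ("transcript_2", "CODES_PROTEIN", "protein_2"),
]
shortcuts = infer_shortcuts(edges, "CODES_RNA", "CODES_PROTEIN", "CODES")
```

The trade-off is the usual one for derived data: reads get cheaper, but the shortcut edges have to be regenerated whenever the underlying two-hop paths change.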

I'm showing you that with a little zoom, just to show you
the domains that we have. In yellow, we have literature; in orange, we have all the MeSH terms and the keywords from the literature, also affiliations and scientists. The big part is all the
molecular modalities like genes and proteins, metabolites,
gene variants, SNPs, interactions, also chemicals. We integrated some
ontologies in green here: disease ontology, gene
ontology, phenotype ontology. And we recently integrated
gene expression data as well, here in blue. Well, that's the zoom
on this data model. You see that the nodes
are pretty much connected to each other. We also have a relationship
pointing to the same node, meaning that we have synonyms
and mapping identifiers from different databases.

So we also cover that. Let me say a few words on our challenge with the users that we have
and the kind of input and output they need. So the challenge is, for example, we have a biochemist or biologist. They do Multi-Omics experiments, and they want to know
something about the output. And this is where we have a specific input and a specific output. For this, we developed a Flask app where you can enter your
multi-omics experiments; these can be transcriptomics or proteomics.

We have a pilot for lipidomics. And you get the results here; in this case, it's a very specific output: where is the gene mentioned,
in which publication, which kind of MeSH terms,
which kind of ontology is used, and so on. Then we have a second group of users, which are usually our medical
doctors in the clinic. They don't really know
where they want to start their research and they
don't know where to end. So they want to explore this network
very, very freely. And for this we integrated a tool called SemSpect by the
derivo company here in Germany. On the left side, you'll see something like in Neo4j: you see all our node labels here, and you can drag and drop them easily into the big field here and do
your exploratory search. And all the nodes that usually are wobbling
around in the Neo4j Browser are summarized here
very nicely as meta nodes.

So you can easily cluster them. You can filter them accordingly. And it's very, very handy to use that. The third user is the
data scientist. So we have people who can
read and write Cypher queries, Python code and so on. So we have Neo4j Browser
as well as Python interfaces, so you can also access our graph using Python or R. Let me show you some
use cases that we have in the German Center
for Diabetes Research. And we go from very simple
to more advanced queries. First, we want to query
a friend of a friend. So we want to have a gene that
is connected to its synonyms, but also to its mapped identifiers
from other databases. And this is very easy here
because we are defining a path using the Cypher query. We query a gene called TCF7L2, which is a typical diabetes-related gene. And we are mapping one
or two or three hops, here in this case two hops, to another gene, which is a synonym or
a mapping identifier. And this takes about one second.
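A query of this shape might look like the Cypher string below, followed by a plain-Python sketch of the same bounded-hop traversal. Both are illustrations only: the labels and relationship types (`Gene`, `SYNONYM`, `MAPS_TO`), the property name `sid`, the function name and the placeholder identifiers are assumptions, not the DZD data model.

```python
# Hypothetical Cypher using a variable-length pattern of one to three hops.
# Labels, relationship types and the `sid` property are assumed names.
HYPOTHETICAL_QUERY = """
MATCH (g:Gene {sid: 'TCF7L2'})-[:SYNONYM|MAPS_TO*1..3]-(other:Gene)
RETURN DISTINCT other.sid
"""


def within_hops(edges, start, allowed_types, max_hops):
    """Return {node: hop distance} for nodes reachable from `start` in at most
    `max_hops` steps, following only `allowed_types` in either direction."""
    adjacency = {}
    for source, rel_type, target in edges:
        if rel_type in allowed_types:
            adjacency.setdefault(source, set()).add(target)
            adjacency.setdefault(target, set()).add(source)

    distances = {start: 0}
    frontier = [start]
    for hop in range(1, max_hops + 1):
        next_frontier = []
        for node in frontier:
            for neighbour in adjacency.get(node, ()):
                if neighbour not in distances:
                    distances[neighbour] = hop
                    next_frontier.append(neighbour)
        frontier = next_frontier

    del distances[start]  # report only the other identifiers
    return distances


# Placeholder identifiers; real database IDs are deliberately not reproduced here.
edges = [
    ("TCF7L2", "SYNONYM", "synonym_1"),
    ("TCF7L2", "MAPS_TO", "id_db1"),
    ("id_db1", "MAPS_TO", "id_db2"),
    ("TCF7L2", "ASSOCIATED_WITH", "trait_1"),
]
identifiers = within_hops(edges, "TCF7L2", {"SYNONYM", "MAPS_TO"}, max_hops=2)
```

Restricting the traversal to synonym and mapping relationships is what keeps the result to identifiers of the same gene; the association edge to `trait_1` is ignored.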

And you get all the synonyms
and all the mapped identifiers for the same gene. So what you see here in
this graph is basically all talking about the same gene, and we have very, very
different gene identifiers. The second example is that,
let's say, we want to query the gene and we want to find information that is connected to this gene. And this is where we
integrated, as I said, gene variant data, so SNPs or mutations that
are associated with diabetes. And we were interested in what we can find in these databases for this gene. When we start from the very left, this is the type 2 diabetes trait or phenotype that we have, which is associated, here
in cyan, with a gene variant. This gene variant is associated with TCF7L2, the gene, as I showed
you in the last slide, which in turn has an
identifier in Ensembl, 6934.

And this identifier is coding for a transcript, so an RNA in blue, and this is coding for a protein in green. And this protein basically has a function, right, and this function is in Gene Ontology. So we have the function in orange and the ontology term, or the ontology itself, in green. And this should only show you that all the colors are representing single data silos that we are now connecting natively in our graph database. The third example is using the GDS library, and I will skip that slide because Martin showed that already. And I'm showing you here the example for COVID-19, where we could identify ACE2 as the most prominent gene using Google's PageRank algorithm. Unfortunately, I cannot show you here our PageRank results from the DZD, because this is still confidential. So we will skip that example. Another use case is that we now use graph embeddings, or node embeddings, to sub-phenotype our diabetic patients.
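The second example above follows a chain across data sources: trait, variant, gene, transcript, protein, function. A minimal sketch of such a layer-by-layer walk is shown below. The layer names follow the talk, but the edge list, the placeholder node names, and the function `layered_paths` are assumptions made for illustration.

```python
LAYERS = ["Trait", "Variant", "Gene", "Transcript", "Protein", "Function"]

# Placeholder nodes; only "type 2 diabetes" and "TCF7L2" come from the talk.
NODE_LAYER = {
    "type 2 diabetes": "Trait",
    "variant_1": "Variant",
    "TCF7L2": "Gene",
    "transcript_1": "Transcript",
    "protein_1": "Protein",
    "function_1": "Function",
    "other_gene": "Gene",
}

EDGES = [
    ("type 2 diabetes", "variant_1"),
    ("variant_1", "TCF7L2"),
    ("TCF7L2", "transcript_1"),
    ("transcript_1", "protein_1"),
    ("protein_1", "function_1"),
    ("other_gene", "transcript_1"),  # not reachable from the trait via a variant
]


def layered_paths(start, layers, node_layer, edges):
    """Return every path from `start` that visits the layers in the given order."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, []).append(b)
    paths = [[start]]
    for layer in layers[1:]:
        # Extend each partial path by one hop into the next expected layer.
        paths = [
            path + [nxt]
            for path in paths
            for nxt in neighbors.get(path[-1], [])
            if node_layer[nxt] == layer
        ]
    return paths


paths = layered_paths("type 2 diabetes", LAYERS, NODE_LAYER, EDGES)
for path in paths:
    print(" -> ".join(path))
```

Each hop in the path crosses from one former data silo into the next, which is the point the colors on the slide are making.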

Now, let me explain how we do that. In the DZD we not only have a knowledge graph with public knowledge, but we also have scientific data from basic research and clinical data here from our patients. And on the left side, you see that this is very relational data. We have Excel sheets where they keep their data, or some SQL databases. And we transformed that now into a graph, here very patient-centric, in red. So we have the patient, who has a medication, for whom we measured some biomarkers, who has a type of diabetes, maybe has had some surgeries, and so on and so forth. And then we are performing clinical studies. We are performing experiments, here in orange, and in these experiments we acquire parameters, in green on the very right. And sometimes we have continuous time data, so we also keep track of the time. So we integrate the data, and now we use molecular fingerprints, like lipidomics fingerprints. On the left side again, you see the Excel sheets where we measure specific lipids.

In our case, we have a platform that can specifically measure 116 lipids. And we represent them in the graph. The same is true for transcriptomics experiments. We have chips which have roughly 60,000 transcripts, and we connect them in Neo4j, also with the values for the different patients. And what we then do is we calculate fast random projections. So in an in-memory graph where we have node embeddings, we take the top 50 parameters and we represent our patients with their molecular fingerprints in a vector so that we can do machine learning. And that's what we do. We then apply k-nearest-neighbor clustering here with K equals five, because in research we have roughly five different subtypes of diabetes. And so we can cluster our patients and can then learn more about the different subgroups: whether they respond to a specific drug, whether they respond to a specific treatment or prevention strategy, or whether they do not respond to them. In the end, we connect that to our knowledge graph so that you can easily ask questions like: I have this specific transcript on my chip, what is known about this transcript? Is it coding for this or that gene? Who is publishing about this gene? What is it connected to? Is there another indication, like Alzheimer's? So we can easily query that.
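The talk describes grouping patient vectors with K equals five, one group per assumed diabetes subtype. As a rough stand-in, and explicitly not the DZD pipeline, the sketch below runs plain k-means on two-dimensional toy vectors. Swapping in k-means is an assumption, the toy points are invented, and the real vectors would have more dimensions (the talk mentions the top 50 parameters).

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means (Lloyd's algorithm) with a deterministic farthest-point start."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest already-chosen seed.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old centroid if a cluster went empty
                centroids[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


# Invented toy "patient vectors": five well-separated pairs.
PATIENTS = [
    (0, 0), (0, 1),
    (10, 0), (10, 1),
    (0, 10), (0, 11),
    (10, 10), (10, 11),
    (20, 20), (20, 21),
]

labels = kmeans(PATIENTS, k=5)
print(labels)
```

Once every patient has a cluster label, the label can be written back onto the patient node, so that questions about drug or treatment response can be asked per subgroup.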

We can also query from the other side. Coming from the right, we take an ontology term or a disease term, and we can cruise through our graph to our experimental data and ask the question: do we have data in house for this or that event? With this I'm at the end of my talk. I hope I could show you that we have quite a huge knowledge graph, which is our single point of truth. And we connect in-house data, but also public data. We also connect cross-species data. So we have integrated animal models, like from pigs and from mice, where we can do studies in animal models and then transfer the knowledge to the human.

I showed you some simple and more advanced use cases. And I showed you here the specific applications that we are running, like the Flask app, SemSpect, and the Browser, and with this, I'm already at the end of my talk. Oh yeah, there's one last comment here. We could increase our performance in research, or in data lookup, between 20 and 12,000 fold. So I think this is quite a remarkable number. Depending on your input, sometimes we are 20 to 25 times faster, but in the last use case that we had, we were 12,000 times faster than the normal approach before. So with this I'm at the end of my talk, and I'm happy to take questions. Thank you for your attention. And yeah, I'm really looking forward to the other talks. And if you have any questions, let me know. – Alex, fantastic, thank you.

It's hugely impressive to see a knowledge graph like this and to see the amount of data you have integrated there. Let's have a look at the questions section and the Q and A. There are a couple of questions, starting with Lissa from the Boondocks who presented yesterday, and the question is: which genetics database did you use for getting the SNPs and SNP interactions? – Basically we integrate, or we continuously integrate, GWAS data. So we download the GWAS Catalog. I think currently we have version 1.0, or 0.4. And in this catalog, there are the gene interactions, but also the SNP interactions. So sometimes we have a SNP interaction, like one SNP depending on the other, or we have a cross-inference like that.

I don't know the correct term for that. So this is coming from GWAS data. – Okay, there's another question. Hi Alex, from SemSpect, can you just select datasets and run more sophisticated analyses? – That depends on what you have integrated so far. If you integrate your data into nodes and relationships with specific measurements, I think you can do that. It shouldn't be a problem. Currently, we are not using SemSpect for that, but in general, I don't see any problem with it. – And I think what you can do is you can always export CSV files, right? – Exactly. – So if you have some
filters and everything. And Mike was pointing out that there is a public SemSpect instance for CovidGraph. If you're interested in trying it out, maybe he can post the link in the chat. Let me jump back to the chat and Q and A and see if there are more questions. While I look for that in the various different channels,

I would have a question. So, I mean, the Clinical Knowledge Graph has a lot of very interesting data integrated already, and they have their own kind of data loading pipeline. You also have lots of data and a fantastic data loading pipeline, but there's a lot of overlap. So could you maybe comment on how you would approach sharing resources and then integrating or combining this? – I thought about that many times, seeing all these nice graph use cases from other groups. And I would like to reuse, say, a data source or some knowledge that they have. And I think one approach could be exporting or dumping a subgraph of the one graph and then importing that into our graph, or the other way around, if we're not talking about the pipeline itself. But I think sharing subgraphs with each other would be a very interesting approach, so that data can be loaded in this or that way.
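The idea of dumping a subgraph from one graph and importing it into another can be sketched with plain dictionaries and JSON as the exchange format. Everything here is an assumption for illustration: the in-memory graph layout, the function names `export_subgraph` and `import_subgraph`, the relationship types, and all node names other than TCF7L2. A real exchange between Neo4j instances would use the database's own export and import tooling.

```python
import json


def export_subgraph(nodes, edges, keep):
    """Return a JSON string with the kept nodes and the edges among them."""
    kept_nodes = {name: props for name, props in nodes.items() if name in keep}
    kept_edges = [e for e in edges if e[0] in keep and e[2] in keep]
    return json.dumps({"nodes": kept_nodes, "edges": kept_edges}, sort_keys=True)


def import_subgraph(nodes, edges, payload):
    """Merge an exported subgraph into another graph without duplicating edges."""
    data = json.loads(payload)
    for name, props in data["nodes"].items():
        nodes.setdefault(name, {}).update(props)
    for edge in data["edges"]:
        edge = tuple(edge)  # JSON turns tuples into lists
        if edge not in edges:
            edges.append(edge)


# Graph A: the source graph (placeholder content).
nodes_a = {
    "TCF7L2": {"type": "Gene"},
    "variant_1": {"type": "Variant"},
    "paper_1": {"type": "Publication"},
}
edges_a = [
    ("variant_1", "ASSOCIATED_WITH", "TCF7L2"),
    ("paper_1", "MENTIONS", "TCF7L2"),
]

# Graph B: the receiving graph already knows the gene.
nodes_b = {"TCF7L2": {"source": "B"}}
edges_b = []

payload = export_subgraph(nodes_a, edges_a, keep={"TCF7L2", "variant_1"})
import_subgraph(nodes_b, edges_b, payload)
print(nodes_b)
print(edges_b)
```

Merging on a shared node name is what makes the two graphs line up. In practice that only works if both sides agree on identifiers, which is where the mapped identifiers from the first example come in.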

But we can share that on, let's say, a higher level or on a database level. – So the idea would be to share it in kind of its final form, as knowledge graph subsets. – Well, that would be one option, right? To dump it and import that into your own database. But I also see a possibility just to reuse the code, since most of what we do is open source: to reuse the Python scripts, or scripts in other languages, and just run them in another instance. – Okay, so let me look at the chat. There are a couple of questions around accessibility. So can you post the GitHub link and so on? You mentioned that your code and also the data loading pipeline are open source. How can people access that? – Yes, so we have put that into our presentation. So when we share our presentation, I will give you a link to our GitLab repository so that you can reuse that.

– Perfect, and let me have another look. So there's chatter in the chat, there's a follow-up about the SemSpect question: okay, thanks, I think it would be great if we could use tools like SemSpect to select sets for collecting data and then move to, for example, a Jupyter notebook for the analysis. So it looks like this is interesting to people in the chat.

– Yeah, as Martin mentioned, I think this is pretty easy. You can explore your graph and then export that into a CSV file and just import that into your Jupyter notebook. – Okay, so if anyone has another question, it's your chance to post it now. If there's nothing else: Alex, thank you for the presentation again. It's impressive, most impressive data. And I think there are interesting applications, from simple to complex, I like that. So thank you so much. And I think with that, we can jump to our next speaker. – Thanks for having me. – Thank you so much. Let's jump to our next speaker. Just a second. There we go. So the next talk is from Sixing Huang from MGI, who's joining us from China today, and he's going to talk about Neo4j for bacterial genomes. Sixing Huang, the stage is yours. I'm really looking forward to your presentation. – Okay, thank you, Martin.

To speak in this conference. Hello, everyone, a nice warm welcome from China. Today I'm going to talk about Neo4j for bacterial genomes. And the talk's overview is simple. In contrast to the talks from the previous day, this talk is about doing small projects. The data are from public databases, and their sizes are typically smaller than one GB, so all the analyses can be done on a local computer. In bioinformatic bacterial genome analysis, we typically look at three different levels: gene level, genome level, and phenotype level. Here I'm going to show you that Neo4j can cover all three levels. So I'm going to show three examples: Neo4j for genome analyses, for carbohydrate-active enzymes with CAZy, and for antibiotic resistance with the CARD database. So first about me: I am a bioinformatician who studied and worked in Bremen and Braunschweig. Currently I am employed at MGI in Shenzhen. My contact with Neo4j has been short, because my first project with this software platform was in 2019, but it was love at first sight.

Since then I have used Neo4j for my knowledge management, as a genome browser, and also as a backend database for many of my projects. And I also passionately write about Neo4j. So first, genomes: they are different from just a bag of words, because genes have structure. They are organized much like natural languages, so they have order. Traditionally, we saved this gene information in relational databases, which are very hairy and very difficult to understand. But using a graph database like Neo4j, we can model this gene structure very effectively. Many biologists may be familiar with the EMBL file, which is used to save genome information. An EMBL file is a big text file.

It is very difficult to read, to search, and so on. With a small import script, however, we can import all this information into a Neo4j database. And it's very easy to visualize these genes in the Neo4j Browser, as you can see on the right. So the red node represents the organism, and the green nodes represent the contigs inside this organism. And then each blue node represents one gene, and we can save a lot of information in this node, for example, the genome size, the annotation, and so on. It is about more than just beautiful pictures, because with this data structure, we can do a lot of things, for example, gene cluster analysis. Say you have a gene in mind, for example, GH16 from CAZy, which degrades many algal polysaccharides.

And you want to see the neighbors of this gene. If you save the genome data in Neo4j, it will be quite easy with just one query. As you can see here in the example, in this case, we not only find the 11 genes we are interested in. Because Neo4j is going to pull all the copies of the GH16 gene together, we have unexpected findings here. We can find some super gene clusters that were previously not expected.
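The neighborhood lookup described above relies on genes being stored in order along a contig. A minimal sketch of the same idea on a plain list is given below. The contig, every gene name other than GH16, and the helper `neighborhoods` are placeholders, and the fixed window size is an assumption rather than something stated in the talk.

```python
def neighborhoods(contig, target, window=2):
    """For every copy of `target` on an ordered contig, return the surrounding genes."""
    hits = []
    for index, gene in enumerate(contig):
        if gene == target:
            start = max(0, index - window)  # clamp at the contig start
            hits.append(contig[start:index + window + 1])
    return hits


# Placeholder contig with two copies of GH16.
CONTIG = ["geneA", "geneB", "GH16", "geneC", "geneD", "geneE", "GH16", "geneF"]

for cluster in neighborhoods(CONTIG, "GH16", window=1):
    print(cluster)
```

Because every copy of the target gene is returned with its neighbors, copies that sit close together show up as overlapping or adjacent windows, which is one way larger clusters can surface unexpectedly.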

Also, it is quite easy to edit the genome information in Neo4j. Previously, we used software like Artemis to edit genomes, but with Neo4j add-ons like Neo4j Commander, we can also edit the gene annotation inside Neo4j with a graphical interface, which is very nice. So that was the gene level. What about the genome? On a genome level, we are no longer interested in the order of the genes. We are more interested in how many copies of each gene there are. So in this case, we have a genome matrix, a very simple numeric representation of the gene content of each genome. Typically we are performing so-called gene annotation analyses. For example, in this case, we have two genomes, the blue one and the coral one. These genomes share some genes and they also have some unique genes. Together, the genes of these genomes are called a pangenome.
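The split into shared genes, unique genes, and the pangenome maps directly onto set operations. The sketch below assumes each genome is reduced to a set of gene identifiers. The identifiers and the function name `compare_genomes` are invented for illustration.

```python
def compare_genomes(genome_a, genome_b):
    """Split two gene sets into core, unique, and pangenome parts."""
    return {
        "core": genome_a & genome_b,      # genes present in both genomes
        "unique_a": genome_a - genome_b,  # genes only in the first genome
        "unique_b": genome_b - genome_a,  # genes only in the second genome
        "pan": genome_a | genome_b,       # every gene seen in either genome
    }


# Placeholder gene identifiers.
genome_a = {"gene_1", "gene_2", "gene_3", "gene_4"}
genome_b = {"gene_3", "gene_4", "gene_5"}

parts = compare_genomes(genome_a, genome_b)
print({name: sorted(genes) for name, genes in parts.items()})
```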

We are going to look at each part of these genomes so that we can learn the relationships between them. This is typically called ortholog analysis. And in Neo4j we can do this kind of analysis effectively. There is a unique challenge in genome analysis in this case, because things in biology are organized in hierarchies. For example, in taxonomy, we have phylum, class, order, and genus. We also have hierarchies in gene annotation. One of the most heavily used is the KEGG gene annotation, as you can see here in the example on the right: metabolism, carbohydrate metabolism, and so on. As you can see, we can already see the graph structure here, because we can see nodes.

We can see connections. So we can represent all these hierarchies together very effectively in our genome analyses. In this concrete example, I'm showing you how I represent a genome in Neo4j. On the left side, in the red nodes, you can see that I represent the taxonomy with the has-taxon relations. And then on the right side, I have the has-KO relations that summarize all the gene annotations of the particular genome. So in just one schema, I can have both the taxonomy and the gene annotation in one place. And with all this information in one place, we have great advantages. Let's say we want to do one small analysis of a bacterium called Chromobacterium ATCC 53434. There are in total six such Chromobacterium genomes in the database, but I'm only interested in this particular ATCC strain. And I want to see which genes are unique to this genome, that is, not shared by the other five. In Neo4j we can use Cypher effectively to do this. As you can see on the screen, the first three lines are about creating a gene filter.

So basically I just exclude our ATCC strain, and then I combine all the genes in the other five. And then I pass on the result using the WITH clause. And then in the last three lines, I list all the genes in my ATCC strain and compare them against the filter. The results are the unique genes in our ATCC strain. And then with these results, we can look at patterns and see whether there is anything interesting. And in this case, yes: we can find that there are many genes that are related to iron transport proteins. For example, I can find two groups of siderophore proteins. And then the last one is a ferric hydroxamate transport complex, and there is also one more gene, the ferric iron reductase protein FhuF. So as you can see, with a very simple query, we can discover something new that is not mentioned in the literature. We can also do phylogeny. In bioinformatic analyses, phylogeny is normally carried out based on 16S or other orthologs.
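The unique-gene query described above has two steps: build a filter from the genes of all other genomes, then keep only the target genes that are absent from that filter. A minimal sketch of that logic follows. The strain names other than ATCC 53434, the gene sets, and the function `unique_genes` are placeholders and do not reproduce the speaker's data or query.

```python
def unique_genes(genomes, target):
    """Return the genes of `target` that no other genome in `genomes` carries."""
    # Step 1: the filter is the union of all genes outside the target genome.
    others = set().union(
        *(genes for name, genes in genomes.items() if name != target)
    )
    # Step 2: keep only the target's genes that are not in the filter.
    return genomes[target] - others


# Invented toy annotation sets; FhuF is used as a label only.
GENOMES = {
    "ATCC 53434": {"gene_1", "gene_2", "gene_3", "FhuF"},
    "strain_2": {"gene_1", "gene_2"},
    "strain_3": {"gene_2", "gene_3"},
    "strain_4": {"gene_1", "gene_4"},
}

print(sorted(unique_genes(GENOMES, "ATCC 53434")))
```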

In our case, we can extend this analysis with Neo4j and also do a very small phylogeny very quickly, just by looking at how many shared KOs there are between different genomes. Again, taking our ATCC strain as an example, with these three lines of Cypher, we can quickly calculate the numbers of shared KOs between the six strains. And as you can see here, vaccinii has more than 1,800 shared KOs, so it should be very closely related to our ATCC strain, while the last one, sp. 257-1, is a remote relative. So we can compare this result to the 16S phylogeny to see whether these two phylogenies are congruent.
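Ranking genomes by the number of KOs they share with a reference strain is a set intersection followed by a sort. The sketch below uses invented KO sets and tiny counts, not the figures from the talk, and the function name `shared_counts` is hypothetical.

```python
def shared_counts(genomes, target):
    """Rank all other genomes by how many annotations they share with `target`."""
    counts = [
        (name, len(genomes[target] & kos))
        for name, kos in genomes.items()
        if name != target
    ]
    return sorted(counts, key=lambda item: item[1], reverse=True)


# Invented KO sets; only the strain labels are taken from the talk.
GENOMES = {
    "ATCC 53434": {"K1", "K2", "K3", "K4", "K5", "K6"},
    "vaccinii": {"K1", "K2", "K3", "K4", "K5", "K9"},
    "strain_x": {"K1", "K2", "K3", "K8"},
    "sp. 257-1": {"K1", "K7"},
}

for name, count in shared_counts(GENOMES, "ATCC 53434"):
    print(name, count)
```

A shared-annotation count is only a rough similarity measure, so comparing the resulting order with a 16S-based tree, as the speaker suggests, is a reasonable sanity check.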

We can extend the same process to other domains too, for example, carbohydrate annotation. The CAZy database collects a lot of carbohydrate-related enzymes. So we can look at the so-called CAZymes to see whether a particular bacterium can degrade certain kinds of polysaccharides. We can use Neo4j for this purpose too. In this example, I'll show you the annotation of the bacterial genome of Formosa agariphila. This genome was published more than 10 years ago, but since then we have had lots of new annotations.

So as you can see here in the graph, the central node represents the genome, while the orange nodes represent the CAZy annotations this genome possesses. And following the same procedure as with KEGG, I calculated the unique CAZy annotations of this genome, so those that only Formosa agariphila possesses, not its sister genomes. And as you can see, we can also find three groups of genes that have particular biological significance. The first group degrades ulvan, the second group degrades pectin, and the last one degrades sulfated polysaccharides. And they are all cell wall components of green algae. So again, this new annotation confirms the alga-associated lifestyle of Formosa agariphila. But it is more than that. We can extend this knowledge to other polysaccharides, not only those of green algae but also cellulose. As you know, cellulose plays a great role in biofuel and other industrial domains. So it would be of great interest for us to predict which bacteria can degrade cellulose. However, among all the CAZy-annotated genomes, only a small part.

As you can see here in the graph on the left-hand side, only a small part of the genomes has been experimentally validated for cellulose degradation, the blue dots, while for the vast majority of the genomes we have CAZy annotation but we do not know whether they can degrade cellulose. So this project is to use this small subset of genomes to train a graph model, and then use that model to predict the cellulose degradability across the whole of our data. In this case, I use the Neo4j Graph Data Science library. First, I embed the graph information in five dimensions in the embedding step.
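The workflow outlined here, an embedding per genome followed by a classifier trained on the experimentally labeled subset, can be illustrated with a small logistic regression written from scratch. This is a sketch under stated assumptions and not the Graph Data Science procedure used in the talk: the two-dimensional toy vectors stand in for the five-dimensional embeddings, the labels are invented, and the function names are hypothetical.

```python
import math


def train_logistic(features, labels, learning_rate=0.1, epochs=2000):
    """Fit logistic regression weights with batch gradient descent."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (degrader) if the linear score is positive, else 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


def accuracy(weights, bias, features, labels):
    hits = sum(predict(weights, bias, x) == y for x, y in zip(features, labels))
    return hits / len(labels)


# Invented toy embeddings: label 1 = degrades cellulose, 0 = does not.
TRAIN_X = [(0, 0), (0, 1), (1, 0), (3, 3), (3, 4), (4, 3)]
TRAIN_Y = [0, 0, 0, 1, 1, 1]
TEST_X = [(0.5, 0.5), (3.5, 3.5)]
TEST_Y = [0, 1]

weights, bias = train_logistic(TRAIN_X, TRAIN_Y)
print("train accuracy:", accuracy(weights, bias, TRAIN_X, TRAIN_Y))
print("test accuracy:", accuracy(weights, bias, TEST_X, TEST_Y))
```

The toy data is cleanly separable, so it says nothing about the scores to expect on real genomes. It only shows the shape of the train, test, and predict steps.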

And then afterwards, I use the Neo4j cross-validation procedure to train a logistic regression model. As you can see in the result here, the training score is 0.67 and the testing score is 0.65. At first glance, the scores are low. However, I was optimistic, because I have carried out the whole prediction too. I used this model to predict which genomes in our whole dataset can degrade cellulose, and the prediction returned 11 positive cellulose degraders. And then I looked at the literature to find confirmations. Only five of the genomes have been experimentally tested against cellulose, and three of them have been confirmed as positive. Streptomyces prasinus was listed as undetermined against cellulose, so this is a wild card. However, the second example here, Micromonospora sp. HM134, is a strange case, because this genome contains all the CAZymes of a cellulose degrader and more, yet the experimenters said this bacterium cannot degrade cellulose. So this is a big question mark for me too. Also in this project, I have implemented GraphQL, because not many colleagues in my group can use Cypher. It is a new language.

However, most of them can use GraphQL because the syntax is really easy. As we store all the genome information in Neo4j, we also need to provide access for all the colleagues, the non-Neo4j users. So GraphQL provides a very effective way to interface with all these users. And because GraphQL does not over-provide data, it saves bandwidth, and it is very efficient for data loading.

So finally, I'm going to present one more example. It is about antibiotic resistance. As you know, we, human society, have overused antibiotics, and that has led to widespread antibiotic resistance. It is an emerging global disaster, because it is predicted that in 2050 more people will be killed by antibiotic resistance. And it's going to make a lot of medical procedures impossible, for example, operations, or something as simple as device replacement. So it is very urgent for us to deal with this situation.

And the CARD database is dedicated to antibiotic resistance. In this project, I downloaded the data from the CARD database and then imported them into Neo4j. The schema, as you can see here, is simple. The lilac nodes represent bacteria, the orange nodes represent mechanisms, and the green nodes represent the antibiotics. The mechanism nodes confer resistance against the antibiotics. So if a bacterium possesses these nodes, then it can resist the corresponding antibiotics.
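A schema with three node types, bacteria, mechanisms, and antibiotics, can be mimicked with two lists of pairs. The sketch assumes that resistance is read off by following a bacterium to its mechanisms and each mechanism to the antibiotics it works against. All names are placeholders and nothing here is taken from CARD.

```python
# (bacterium, mechanism) pairs: which mechanisms a bacterium possesses.
POSSESSES = [
    ("bacterium_1", "mech_A"),
    ("bacterium_1", "mech_B"),
    ("bacterium_2", "mech_B"),
    ("bacterium_3", "mech_C"),
]

# (mechanism, antibiotic) pairs: which antibiotics a mechanism defeats.
CONFERS = [
    ("mech_A", "antibiotic_X"),
    ("mech_B", "antibiotic_X"),
    ("mech_B", "antibiotic_Y"),
    ("mech_C", "antibiotic_Y"),
]


def resisted_by(bacterium):
    """Antibiotics a bacterium can resist through any of its mechanisms."""
    mechanisms = {m for b, m in POSSESSES if b == bacterium}
    return {a for m, a in CONFERS if m in mechanisms}


def mechanisms_against(antibiotic):
    """Distinct mechanisms that would all have to be blocked for this antibiotic."""
    return {m for m, a in CONFERS if a == antibiotic}


print(sorted(resisted_by("bacterium_1")))
print(sorted(mechanisms_against("antibiotic_X")))
```

Counting the distinct mechanisms per antibiotic is the number that matters for the comparison that follows in the talk: the more mechanisms there are, the more routes have to be closed.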

Here you see two examples. On the left is fluoroquinolone resistance, and on the right is cephalosporin resistance. Even though both are widely resisted by many bacteria, across gram-positive and gram-negative, the situations and the landscapes are different. Fluoroquinolone resistance is like a nine-headed hydra: if you want to completely eliminate this resistance, you need to cut off all these resistance mechanisms to make that work. However, the situation is different for cephalosporin, because it seems there are only four mechanisms shared by many bacteria to evade the effect of cephalosporin. So in theory, we just need to plug all these four holes to eliminate this resistance. So you can see that with a simple analysis like this, we can see how we should handle different antibiotic resistances in practice. So after these three short examples, I hope I have given you the impression that Neo4j can serve as an all-in-one genome browser, as a bio data warehouse.

And also as a data mining tool. In comparison to relational databases and SQL, it can deliver insights more quickly, in my opinion. We can also use machine learning from the Graph Data Science library, and it can predict new connections. In our case, for example, it can make new annotations or fill in the missing annotations, and it can predict new properties. For example, in our CAZy example, these would be the phenotypes. And we can also set up GraphQL as an interface for non-Neo4j users.

So it is a database for effectively everybody. With this slide, I want to conclude my talk, and I want to thank the following people: my tutor Hanno Teeling at MPI Bremen, and the Neo4j community, because they have welcomed me warmly and provided me with ample support in my journey. And I also want to thank my current employer, MGI, because they have supported me in my research in graph theory and graph databases. And I thank you all for your attention. – Fantastic, thank you so much for the presentation. I think you showed some really, really interesting concepts.

I will jump over to the Q and A section and to the chat to see if we have questions. And while I do that, I would like to ask the first one. So, thinking about your data model and your, let's say, analysis concept and how you model annotations: can you see any limitations in using the same models, or transferring the same models, to higher organisms, like human and mouse? – At the moment I see there are several. First, because this annotation is really relation-rich data. So if we are just doing single-genome analyses, currently I cannot think of many scenarios for using this graph database model, because, as you can see in my cases, many of my examples are about comparisons.

So comparative genomics, that is what I want to emphasize. On the higher scale, with eukaryote genomes, I think you can compare different genomes, but at the moment it is still on a gene level. I don't know if you can do it on a SNP basis. So if you have single nucleotide polymorphisms, can you still use it? But maybe it is a very interesting area I can explore. So can we go deeper than the gene level that I showed in these slides? Can we do SNPs, mutations, and, yeah.

And maybe also on a higher level, not only phenotypes, but population genomics and so on. So yeah, can we show the overlaps between different populations? – Yeah, I think that would be really interesting. Let's go to the first question from the chat. And this is a question that we had a couple of times yesterday, and it seems to be interesting for a lot of people. So do you get new data frequently, and how do you handle that? – Right now, all three projects download the data from public databases: the genome analysis data is from KEGG, the CAZy data is from CAZy, and the antibiotic resistance data is from CARD. So my update schedule is completely dependent on these three databases. As long as the databases have an update and the curators have, yeah, the need, we can update it. Currently there is no, let's say, automated way to just stream or feed the data, and yeah, it is like a snapshot of these databases.

However, I can see the potential for adding our own data. For example, if we have new genomes from our sequencing, we can add our genomes to all these examples, so we can combine the public and the private data. So we can, yeah, crystallize some new knowledge from our private genomes with all this public knowledge. You can compare your new genome with the six known genomes, like in the first example. So you can have all the benefits of the public data too.

But yes, currently there is no live streaming or anything of that kind, because these databases do not provide me with a feed or something. But they provide APIs. KEGG, for example, provides APIs. So we can regularly update the data with their APIs. – I find those questions very interesting, because we had them after, I think, every talk yesterday, and not from the same person, always from someone else. Because I think in terms of data model and data processing, updating is a very interesting challenge. But at the same time, we know that the knowledge out there changes, it's not static. There's a bit of a follow-up to that question about the updates: how do you handle new data in terms of machine learning results? Do you have to do the embeddings and other analyses again?

– At the moment, with my current understanding of the Graph Data Science library, yes. The model needs to be trained again, because currently I have not found a way to update the model incrementally, so I cannot do something like a batch thing. But I think as soon as the Graph Data Science library provides us with this capability, we can do this. But I also want to mention that at the moment the data for all three projects are small, so the training will be lightning fast. Training is not the issue. I think it's more the data cleaning and the conception that take most of the time. So the training and the prediction will be just, like, two clicks away, but the model, yes, is static at the moment. – Okay, so the model is not updated, but the performance of GDS, at least on your dataset, is good enough that you can just rerun it.

– Yes. – Okay, let's maybe talk a second about the data. So is your database publicly available? Do you share the database? – Yes, they are all publicly available. I have also written up all three projects in my articles, with GitHub repositories containing all the code and data. And everybody is welcome to interact and to join me on this journey. So, yeah, I also want some input from the broader community. – Yeah, that's fantastic. So maybe you can share the Medium articles and blog posts, and I'll post them in the BigMarker chat so that everyone can have a look at the articles and the description, because I think they are very comprehensive, with a lot of figures and so on. Obviously the code is available, so that's really, really cool.

Let me have a look at the chat and the Q and A section to see if we have something else here. Maybe one last question. I found it very interesting that you said you use GraphQL to give everyone access because it's easy to use. We had a couple of mentions of GraphQL yesterday, but I think there are still a lot of Neo4j users who don't use a GraphQL API. Could you maybe comment on how difficult it was for you to set up a proper GraphQL API, and how easy it really is for end users, for non-IT, let's say non-graph-database users, to use GraphQL? – Actually, the setup was quite easy. There was an excellent post in the Neo4j forum, and I followed that post. And then I have also written up my whole experience in the same article. So the setup of GraphQL was relatively smooth. Everything from there, I think, depends on how we model the data in GraphQL. In my case, for example, I have modeled the whole data as I show in this graph.

So my colleagues can pull the taxonomy and the annotations from GraphQL, as they see them here, in the mental model. As long as they understand this graph, this graphic, they can pull data relatively easily with my GraphQL. And the usability of GraphQL is quite nice, because it can be accessed from different languages: not only through this interface, they can also pull the data from R, because many of my biologist colleagues use R and Python. So yeah, they don't need to stick around with this localhost 4000. They can directly pull the data without knowing Cypher and do their own analysis. And the graph folks can, yeah, keep writing their Cypher and doing amazing stuff. The non-Neo4j users, however, can still have access to the same data.

So I don't need to keep two different annotation databases. And there are even talks in the group about moving all their data storage into a graph database. And of course, if we proposed that, many people would protest. However, if you say we have GraphQL, then people say, okay, we can live with that, because they know how to pull data from GraphQL. So I think it is like a nice facade for many of our users who have not yet learned Neo4j, but for those Neo4j lovers, yeah.

They can use their Cypher as usual. So I think both can live nicely together, and we provide both, so that, yeah, you can keep just one database, but all the users can use it. – Yeah, thank you so much. I think your blog posts will be a very interesting resource for the community. So thank you so much for your presentation. Thank you so much for being here. It would be great if you would share the links again. – Yes. – And so thank you so much. And I think that we can move over to our next speaker. The next talk will be about modeling multi-omics data in Neo4j to identify targets for strain development. Peeyush Sahu, bioinformatician and data scientist at Clariant, will present his project.

So the stage is yours, looking forward to your presentation. I think you are muted, so we can't hear you. – Okay, can you hear me now? – Yes. – Okay, good. Thank you, Martin, first of all, for the invitation for this talk, and really, to be able to attend this workshop where people are working, you know, with Neo4j and graph databases in very different dimensions and also actually in a quite overlapping manner. So today, let me share my screen. Just let me know if you can see the screen. – I can see the screen. – And a proper slideshow, I mean in full screen? – We don't see that yet. – Let me try to correct that. Oh yeah, that makes sense. – Okay, now we can see the slideshow in full screen. – Okay, that sounds good. Okay, so today, what I'm going to talk about is something our preceding presenters have actually already touched upon and explained in detail in their use cases. So, just a little bit about Clariant. Clariant is a specialty chemical company. However, the biotechnology unit of Clariant is working towards bioethanol production plants.

And so we are setting up the process to convert cellulose and hemicellulose material, which is also referred to as 2G material, to bioethanol. And in this whole process, there is strain development, which is a very big focus of the research in that area, because you need to identify what is good and what could make a strain better with respect to the production of certain enzymes and other things. So we always come across questions from the scientists about how to make this process easier, how to identify something which is good for the process.

Okay, so this picture is actually quite nice for us. When we think about how data comes together or how data can make sense, it is very interesting, because this picture shows you how a perceived observation can be misleading in the absence of a complete overview. And this is very true, I think, for multi-omics and actually a lot of biological data, because without having all of the parts, an observation derived from a single technology, be it proteomics, transcriptomics or metabolomics, could be misleading. So our aim was to develop a platform where we could put all the data together and then ask questions which can give us meaningful results.

So the overall aim is to drive discovery, so that scientists can really identify the reason for a certain phenotype we are looking at. Okay, so what we have here at Clariant, and what we are working on, we call the Multi-Omics platform. It started with combining multiple omics together. However, it has grown further now, and we are incorporating not only omics data but also phenotypic data and other data types, but we kept the name. This Multi-Omics platform can be divided into two parts. The first part is data storage in a knowledge graph.

And then the second part is visualization and analytics, because in the end, once the data is in, you need to find a clever way to visualize it and make sense of it. So coming to the first part, the data storage. We had certain requirements when we were looking into how data could be stored, because we are not talking only about storing data for machine learning; it is a data storage platform. So, the requirements. We wanted to store metadata for each experiment and all the samples. This metadata could be process data from fermenters, it could be different time points, it could be any sort of comments which are important for scientists to interpret the results or draw conclusions at a later time point. Then we have, of course, all the different omics technologies and phenotypic data, which need to be integrated in the platform. We also wanted to have cross-data analysis, where genomics, transcriptomics, proteomics and all the other data can be extracted together and can talk to each other. Another very important aspect for us was to store multiple organisms in a single database and also multiple genomes in one database.

So when I say multiple genomes, I mean multiple versions, because we work with many organisms which are not that well annotated. So we do evolve the organism genome annotations, and the model has to make sure that all the data is there and that we can access it if we want to. Of course, another very big part of knowledge graphs is to enrich this whole data with available knowledge, expert annotations, and as much information as you can find about all the omics in terms of genes, proteins and transcripts. And of course, like I said, in the end everything has to be connected in a biological manner, so that it gives a sensible structure to the whole data.

So these are of course the requirements. For Neo4j, I just wanted to show you a very small, very abstract view of how we think it should be done. So you have metadata, which is connected to capture data, which is connected to biological entities and their annotation. And now, going a little bit into detail, I want to share with you how our data model works at this point. It took quite a lot of revisions. I think we are at the eighth or ninth revision of the data model, where we had to change quite big parts of the data modeling. So in this slide, we are talking about two parts of the data model.

First, here we have the metadata part of the data model and the capture data part of the data model. So, like I said, the metadata is a very important part for us, because we also store process data and we utilize that data. And just to say, it is still evolving. So what we have here: we are storing data from the experiment side. There are multiple experiments in a lab, and these experiments have metadata about them: what the conditions are, and other data which could be attached to them. And of course an experiment could be chained to other experiments if it is derived from them. Then, of course, experiments result in samples, and the samples could be transcriptomic samples, genomic samples, proteomic samples or phenotype-related samples. We also wanted to capture the case where we want to group these samples in some meaningful manner, if there are some conditions that we want to use to group them together. So we used a node for that; we call it a sample group here.

So you can easily put multiple samples from different experiments together, so that they can be easily extracted if the need comes in the future. Now, coming to the capture data part: this is where our experimental data lives. The first two nodes, measurement and analysis, store a little bit more of the metadata of the capture data, something like what type of technology it was and what measurement it is from. So, for example, you have variant data, where we are talking about whether there was a single nucleotide polymorphism or an insertion or deletion, and even structural variants could be modeled. With a similar data model, one can also store transcriptomics estimates, and we can further extend this to differential expression, because from the transcriptomics estimates you derive differential expression. What is very good, and what we think is quite clever here, is that you can combine different analysis nodes to create a new analysis, which can result in a different result or a different data point.

And similarly, other data types such as phenotypic data can be stored. What is very interesting for us at this point is that the overall data model is the same for all the different data types. So when you are querying it, you can use the same path with certain parameters, and you can extract all the different data in a single query. So this is about the capture data. Now, coming to the later part, a very interesting one; we have heard a lot about how knowledge is stored. We are coming more into the genetic background and how the biological entities and annotations are connected.

So I will go ahead, because this is a complex topic, so it needs at least its own slide here. Like I explained earlier, we want to store different genomes, different genome versions, multiple organisms and different strains. And if a strain has a vector inside it, we also want to store that information. So if you look here, we start our model from the organism, where an organism has multiple strains connected to it. The organism also has multiple reference genomes, and these reference genomes have genes, transcripts and proteins. And of course a gene could have multiple transcripts or multiple proteins, depending on the different splicing events or alternative selections. What else we have is that the strain is connected to a different part of the data model, which defines vectors. There you can store data about external DNA which has been introduced into your strain. This is a very powerful thing, because with this we get the flexibility to look into strains and even the non-native part of the genome. And coming to the second part, of course.

So, the annotation. Our genes and proteins have to be connected to what is known about them. For this, we have the annotation part of our data model, where we create additional nodes for different annotations. We have different sources here. You can get annotations from UniProt and NCBI. You can also get EC numbers describing enzyme function. We have GO annotations there. And a very good thing, we thought, is that annotations can connect to themselves as well as to the entities. Why connect to themselves? Because we want to store individual IDs in individual nodes, and if two different IDs come from the same database, together they can also mark what they are actually associated with: the gene, the transcript or the protein. This gives us more flexibility when we are getting the data out and deciding what we want to look at, and it gives scientists and data scientists a way to filter their annotations when looking for function.

Okay, so this was about the data model. After the data is in Neo4j, the next step is of course the extraction, analysis and visualization of this data. For this, we use Python and Jupyter to create a user interface which allows not only data scientists to query the data and make sense of it, but also the lab scientists, because in the end we want our scientists to be empowered. So they can look into a huge amount of data, or the data which they produced five years ago, and with just a couple of clicks they can extract this data and compare it with all the different processes which they are currently performing.

So how do we do it? This is just a schematic representation of how the different parts are connected. As we already saw, the omics data is already in the database. Biological relations are also already present in the Neo4j database. But in house, of course, we have a LIMS and electronic lab notebook system where all our experiment data and the strain inventory reside, and there is a lot of important data in that system which could be imported into Neo4j, or rather into our Multi-Omics platform, to make more sense of the overall data.

So in this process, as the schematic shows, we combine Neo4j with Python and Jupyter to create these user interfaces, which can be used to query and analyze data. Due to data sensitivity I am not able to show you a live view of how we play around with the data, but I would like to at least give you a feeling for how the notebook and Python really empower data extraction and data visualization. So this is one screenshot of the Jupyter notebook, where scientists can select what organism they want to look at and what reference they want to work with.

And then they are taken to another step, where they can select multiple projects together and different analyses together. Then they can select what different properties they want. This particular example is for variants, where they visualize what different mutations or variants are present between different strains and between different projects, and what information we can gather for them based on annotation. And of course there are some other things, where they can filter their initial observations by certain technical criteria, something like genic or intergenic mutations, or they can select structural variants and things like that. So this is on the variant side. We similarly have something for the transcript side, where they can combine variant data with the transcript data. There they can select which project they want to combine the variant data with, and they can select a different analysis. They can also select a certain cluster from the variants; I will show what I mean by that in a later slide. As I cannot show you the whole process because of the data sensitivity, I would like to give you a small glimpse of what it can do, because with our whole data in Neo4j, we are talking about the whole multi-omics data.
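
The selection steps described above (organism, reference, projects, variant filters) map naturally onto a parameterized Cypher query built from notebook widgets. The sketch below shows one way this could be done in Python; the node labels, relationship types and property names in the query are assumptions, not the schema used in the talk.

```python
def build_variant_query(organism, reference, projects, variant_types=None):
    """Build a parameterized Cypher query from the user's selections.

    Labels, relationship types and properties are hypothetical.
    """
    query = (
        "MATCH (o:Organism {name: $organism})"
        "-[:HAS_REFERENCE]->(r:ReferenceGenome {version: $reference})\n"
        "MATCH (r)-[:HAS_GENE]->(g:Gene)<-[:AFFECTS]-(v:Variant)"
        "<-[:REPORTS]-(a:Analysis)\n"
        "WHERE a.project IN $projects\n"
    )
    params = {
        "organism": organism,
        "reference": reference,
        "projects": list(projects),
    }
    # Optional filter, e.g. only structural variants.
    if variant_types:
        query += "AND v.type IN $variant_types\n"
        params["variant_types"] = list(variant_types)
    query += "RETURN g.name AS gene, v.type AS variant_type, a.project AS project"
    return query, params


query, params = build_variant_query("X", "v1", ["project_a", "project_b"], ["SNP"])
print(query)
print(params)
```

Passing values as parameters rather than formatting them into the query string keeps the query reusable and avoids injection problems. With the official Neo4j Python driver, the pair could then be handed to `session.run(query, params)`.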

So we can extract them together in a very easy manner. With a couple of clicks, we can get all the data out, and it can be visualized and data processing can be performed on it. So this is a heat map which has multiple different data types together. In the first row you have mutation information about certain genes. You have differential expression between different analyses. You even have RNA expression, if you want to look further into how the RNA expression was. And this heat map is already clustered, so it can show you some patterns through unsupervised clustering. This is actually an interactive heat map. Of course we cannot see that here, but when you hover over it, there will be a pop-up window which shows what gene it is, what kind of mutation it was, or in this case what the differential expression is.

There is also information about the function of the gene. So if the scientists are interested in some cluster, they can zoom in further to look at which genes are mutated and what kind of effects they can have on the function. So this is one way of looking at the data. Further, we also have a connection to the phenotypic data and the other parts of the omics data, the proteomics data.

So in this figure, what we have is a case where a fermentation was run, which resulted in certain measurements for certain substrates, as shown here on the top. What you can see there is the abundance of a certain substrate within different samples. Each column is one sample, and we can right away, on the fly, compare it with the proteomics data which was produced for the same fermentations. And then we can see it easily. And of course we have our user interface, and scientists can select different enzyme classes if they are interested or want to investigate what kind of enzymes could have differential effects in different processes.

And they can find different patterns of protein expression which could correlate with what we observe in the phenotype. So we have another pipeline which runs on top, which tries to identify correlations between the phenotype and enzyme classes or different enzymes and proteins. This all comes together in the end to give scientists a way to generate hypotheses, to identify important features in strain development, and much more.
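
The talk does not say which correlation measure the pipeline uses. As a minimal sketch, assuming a plain Pearson correlation between one phenotype measurement per sample and each protein's abundance across the same samples, the ranking step could look like this in Python. The sample values and enzyme names are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long lists; 0.0 if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def rank_proteins(phenotype, protein_abundance):
    """Rank proteins by absolute correlation with the phenotype, strongest first."""
    scores = {
        name: pearson(phenotype, values)
        for name, values in protein_abundance.items()
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# One value per sample (columns of the heat map); numbers are illustrative.
phenotype = [1.0, 2.0, 3.0, 4.0]
proteins = {
    "enzyme_a": [2.0, 4.0, 6.0, 8.0],   # rises with the phenotype
    "enzyme_b": [5.0, 5.0, 5.0, 5.0],   # constant, no signal
    "enzyme_c": [4.0, 3.0, 2.5, 1.0],   # falls with the phenotype
}
ranked = rank_proteins(phenotype, proteins)
for name, r in ranked:
    print(name, round(r, 3))
```

Ranking by absolute value keeps strongly negative correlations near the top, since an enzyme that falls as the phenotype rises can be as interesting as one that rises with it.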

So with this, I would like to go ahead and tell you what we think. We are currently, I would say, at an early point of our Multi-Omics platform development. However, we think we have achieved quite a lot with respect to connecting all the data together and extracting data in a meaningful manner, which can result in hypothesis generation and identification of specific features. But what are our future plans? We would of course like to integrate more information about annotations, like metabolic pathways, regulatory networks, expert annotations, and maybe information from publications, because a lot of the organisms we work with are not well annotated. So publications are a very important information source, and probably also information from patents and other things. We are also looking for a way to automate hypothesis generation. Currently scientists have to look into the data and try to identify patterns, but we are working towards having a rule-based system in place which can already propose some hypotheses based on the data they have selected.

And of course, machine learning and graph learning is one of the future goals for us, where we would like to use this highly connected data to further advance our strain optimization and other discoveries. So with this, I would like to acknowledge our team. Frank Wallrapp, who is our Head of Group Computational Biology. We work together with Martin and Jamie, who are consulting us on Neo4j and other technologies. We have a constant inflow of interns. One of our interns, who really helped us with the Neo4j development, has left us.

And after that, we have many different interns now who are currently on board and will be working on these projects to take them further. With this I would like to thank you all for your attention, and I'm open for questions. Thank you. – Fantastic, thank you. Thank you so much for the great presentation. I think it's just cool that we have now, with all these talks, kind of closed the loop. So we have now really covered everything, from patient-type data to variant annotation and microorganisms. I think this is really cool. Before we jump to the questions, I would like to announce that in our closing session today at 1:30 CT, we will have a small panel discussion with speakers from yesterday and from today, where we want to focus on the idea of sharing knowledge and resources again. So just exchanging your best ideas; I think that's going to be very interesting, because multi-omics modeling is obviously very interesting. So let me jump to the chat and the Q&A and see what we get here.

While they do that, I would like to reuse the update question that we had after every talk. So how do you handle data updates and changing data sources? – That is, I think, a question which is always worth looking into, because a data update is always a big issue when the resources that you are using are changing constantly. But I think it is kind of the same for all the presenters before, and for us as well, that you always have to track what changes are happening. And then of course you have to update either the data model, because if that is something which needs to be changed, the whole data model has to be updated; but in most cases it is only the data which needs to be updated. However, when we are talking about the update of the data itself, I think that is also something we are currently working towards: putting data into the already created data model should be automated, where we have an API or a user interface where you can easily throw in the data.

And it is sorted according to where it should be stored and updated. And of course, when you are updating data, there is one question which we already try to tackle: there has to be some sort of consistency test, which we do perform on our data after uploading. So the data is checked thoroughly, so that there are no wrong connections present and no wrong data has been put into nodes. This is something we are focusing on quite a lot, because we want to maintain data accuracy. – Yeah, I think it's always a problematic topic. And again, this question came up a lot. Let me again have a look at the chat.

So just as a side note, I posted the blog posts from the previous talk to BigMarker. Another question is maybe, and this is obviously also something that came up a lot: how do you take care of data quality? How do you make sure that certain relationships in the database, when you collect so many different datasets, are valid and true? – Yes, so data quality is important, because the whole Neo4j idea is connecting data, so wrong connections could lead to wrong conclusions. For that, we have developed a package in-house which expressly takes care of it. Because our data model is quite well defined, I think, we do not have, at least up to now, loosely connected parts. So we can really make sure that the data is connected in a proper manner and in an automated way. This is critical, because in the end we are giving the data to scientists who will extract it, and they take our word for it that the data they are extracting is correct.
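
The in-house package itself is not shown in the talk. As a hedged sketch of what a post-upload consistency test could check, the Python snippet below validates every relationship against a list of combinations the data model allows and flags relationships that point to unknown nodes. The `ALLOWED` set and all labels are hypothetical.

```python
# (source label, relationship type, target label) combinations the model allows.
# These names are assumptions for illustration only.
ALLOWED = {
    ("Organism", "HAS_STRAIN", "Strain"),
    ("Organism", "HAS_REFERENCE", "ReferenceGenome"),
    ("ReferenceGenome", "HAS_GENE", "Gene"),
    ("Gene", "HAS_TRANSCRIPT", "Transcript"),
    ("Transcript", "ENCODES", "Protein"),
}


def check_consistency(node_labels, edges):
    """Return a list of (src, rel, dst, reason) for every problematic edge.

    node_labels maps node id -> label; edges is a list of (src, rel, dst).
    """
    problems = []
    for src, rel, dst in edges:
        if src not in node_labels or dst not in node_labels:
            problems.append((src, rel, dst, "unknown node"))
        elif (node_labels[src], rel, node_labels[dst]) not in ALLOWED:
            problems.append((src, rel, dst, "relationship not allowed by the model"))
    return problems


nodes = {"o1": "Organism", "s1": "Strain", "g1": "Gene"}
uploaded = [
    ("o1", "HAS_STRAIN", "s1"),       # fine
    ("s1", "HAS_GENE", "g1"),         # wrong connection: strains do not own genes here
    ("o1", "HAS_STRAIN", "missing"),  # dangling reference
]
problems = check_consistency(nodes, uploaded)
for problem in problems:
    print(problem)
```

A check like this only works when the model is strictly defined, which matches the point made above about not having loosely connected parts.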

So data quality is of course of utmost priority here. And that is something we take care of quite well, I think, with automated scripts and by testing the data rigorously after each update. – Okay, so thank you so much for your presentation, for sharing your data model and your own approach. And again, I'm really happy that you will accompany us in one and a half hours, after three more talks, in our closing session and our discussion on how we can use, reuse, share and collaborate on all this. So thank you so much. And with that I would like to hand over to our next speaker, Davide Mottin from Aarhus University. He will talk about unveiling the knowledge in knowledge graphs, and this is going to be a bit more science focused, and data analysis and algorithm focused. So Davide, the stage is yours, really looking forward to your talk. – Thank you, Martin. Thank you for organizing and for inviting me here. It's really nice. So let me share my screen, okay.

I think it should be visible, right? – We can see your screen. – Perfect. Okay, so thanks again for the invitation. This talk will be on the algorithmic side, so on some of the challenges that many of the companies might face, and I'm very happy to continue the conversation later on, also by email. My name is Davide Mottin, and I'm an Assistant Professor at the Computer Science Department at Aarhus University. Today I would really like to talk about this: once you have the data, what can you do? How can you really unveil the most powerful information behind your knowledge graph? At least one or a couple of the algorithms are easily implementable in Neo4j, even though we didn't do that, because we basically play with prototypes. Very briefly about me: I was a postdoc at HPI in Berlin, where I met Martin, and I met Michael as well. I got my PhD in Italy, and I have worked with graphs for a long, long time under different aspects.

So for me, graphs are a very generic abstraction. Knowledge graphs especially can represent many different things, like social networks, biological networks, or co-citation networks. They hold a huge wealth of information, as they can be structured in a hierarchical manner and have multiple different types, so you can embed different files or different formats. And they are very expressive. This is what interested me from the beginning, because they can contain a lot of information which is hidden, which is not really accessible by a query, for instance. That's why I spend most of my time working on graph mining, which is a very broad area that combines multiple disciplines: data mining, of course, to find patterns or regularities; databases, to retrieve the information and store it in a proper way; and machine learning, to start predicting. So what happens if you read a lot of this data: can you predict the existence of something that you haven't seen before? I want to point out, very briefly, that my area goes in many different directions, and I have been investigating multiple different problems in graphs, such as alignment.

So if you have two networks, how can I find, for instance, that two entities are the same? Similarity, I think, is very important, and I will touch upon it briefly in this talk, but not go much into the details. I have worked a lot with exploratory analysis, which I think is a powerful tool that we should explore more, because most often scientists and domain experts really need to find something in the data. And what is that something? It is not well defined in many of the disciplines, in many of the sciences. It's not very clear. We are trying to find something interesting, but what is this? It's up to definition.

And of course, as was also mentioned in the previous talk, cleaning is a process which is very important. So I'm trying to make everything automatic and scalable to large amounts of data. In terms of similarities, I have done a few works in this area, mostly to figure out in an automatic manner whether two entities are similar. This is a very broad, very generic question. In some cases you really have to integrate domain expertise, so something about biology or something more, but in some other cases it is possible to extract quite a lot of information without knowing anything.

And this is what we did: knowing really nothing, not even having proper entities, but just having the graph structure. But okay, for today I want to talk at a very high level. So imagine that you have your own data, because I couldn't really fit this to specific types of data. I'll just start with something that everyone knows. These are smart devices that can currently answer questions like, what is the weather today? And of course it usually answers, okay.

The weather is like this; they can of course also be personalized. Another thing is, where is a company located, for instance, and this is yet another question that can be answered quite fast by these devices. There are other sorts of questions that are more exploratory, such as: can you recommend me a city like Copenhagen in Europe? This looks very similar to the previous one, but it's not, because here the user, or any scientist, is asking whether there is something similar to this entity, Copenhagen, in Europe. And unfortunately, currently the answer is unsatisfactory. Most of these devices will not return the right answer. They will just tell you, okay, I found something on the internet by querying in that way, but even the search engines cannot answer this. So ideally, in some foreseeable future, we would like a proper recommendation, but it's not an easy question.

We would like to have devices that answer like this: Venice, Marseille, Hamburg, which are recommendations of possible cities. Or even better, that it starts a dialogue with us. So if we are scientists or domain experts, we would like to be supported like: okay, is Venice perhaps something that you're looking for? And I believe that Venice is a right answer because of this and this reason. So somehow explanations. The idea is: how do these queries work? How do these simple queries work? Well, besides understanding natural language, understanding what is written here, behind it there is a knowledge graph. These devices operate on knowledge graphs, where nodes are entities, and these entities, for instance Copenhagen or Denmark, have some relationships with other entities. And this can be easily done in Neo4j with a Cypher query, right? You can just ask this and you will get the right answer. So once you have done the NLP work, it's just a matter of having your database behind it, and the answer will come.
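
A factual question of the "where is X located" kind reduces to following one stored relationship. The Python sketch below uses a tiny in-memory list of triples to stand in for the knowledge graph; the relationship names are assumptions, and the Cypher in the comment is only an illustration of how the same lookup might be phrased against a Neo4j database with a comparable schema.

```python
# A toy knowledge graph as (subject, relationship, object) triples.
TRIPLES = [
    ("Copenhagen", "LOCATED_IN", "Denmark"),
    ("Venice", "LOCATED_IN", "Italy"),
    ("Denmark", "MEMBER_OF", "EU"),
    ("Italy", "MEMBER_OF", "EU"),
]


def objects(subject, relation):
    """Answer a lookup question by following one stored relationship."""
    return [o for s, r, o in TRIPLES if s == subject and r == relation]


# Roughly equivalent Cypher, assuming hypothetical labels and properties:
#   MATCH (:City {name: "Venice"})-[:LOCATED_IN]->(c) RETURN c.name
print(objects("Venice", "LOCATED_IN"))
```

The lookup works only because the `LOCATED_IN` edge is materialized in the data.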

But when it comes to this question, to recommend some other cities like Copenhagen, well, things become a bit more complicated, because the Cypher query can still find a structure, but the answer is not there. This "like" is not really materialized. It's not something that exists. I'm not going to stress here how many knowledge graphs we have. You are probably all dealing with this, and they are getting bigger and bigger and more complex. These are very generic knowledge graphs, like YAGO, Freebase and DBpedia, that are publicly available, and we can download them.

That's why in research they are actually quite widely used. So my personal vision, and this is what I'm pushing a lot, working with companies and academia, doing fundamental research but also focusing on application scenarios, is that we should have a whole stack of technologies, starting from interactive algorithms. So algorithms that are really subjective in some sense, and algorithms that you can trust.

You should also have similarities among the entities that are explainable. And what we touched upon before, also in the previous talk, is the adaptive part, right? If I have a structure below, like a Neo4j database, how can it change over time? Because changes will happen. Plus, it should also be personalized: it should change according to the user's interest in some part of the data. And this is what I will talk about today, this personalized part. So we realized something very simple. The users typically ask questions, right? As I said at the beginning: is Norway in the EU?

Then you can basically imagine that you are highlighting some part of your graph. Where is Venice located? Well, Venice is located, for instance, in Italy. But you can also imagine that, starting from this specific node, you are actually interested in "capital of", "member of", and so forth. So other connections. The fundamental question is: can we find a summary, which is a subset of all the possible entities and relationships, such that we can serve queries not only now but also in the future? So can we somehow model the user's preference? That was a very interesting question that we asked ourselves, and we started by asking: okay, what even is a summary? How can we define a summary? I don't want to go into details, but you can imagine that you have your own knowledge graph, any knowledge graph, and a set of Cypher-like queries, so a set of queries asking for specific relationships, arbitrarily complex, we don't know, and you have a sort of Cypher engine.

So you have Neo4j processing the query and returning the answers. And at some point you put in a cost, right? You say, okay, what if I want to find a doctor and I don't have any connection to the server, or the server is extremely slow? Can I have a summary of that on my mobile? So the limit is K, which is the number of edges, the number of entity-relationship-entity triples. It can be formalized as saying: I would like the summary that approximates my queries in the best way possible, and the summary that will also answer queries in the future. Oh, sorry, this is not an easy question. How do I know about the future? So somehow it becomes learning from something I've seen before. And this is basically the same thing in a more complicated manner: how do I define this function? When do I say it is actually a good summary? The problem is that I don't know this probability. I don't know whether a node will serve a query or not.

Should that specific entity be part of the summary or not? That is my main question now. What we did is use something that is also in Neo4j: personalized PageRank. Personalized PageRank, and the "P" is really important, works in the following way. You start from any node in the graph, and with equal probability you choose the next node. You keep repeating this process a number of times, but also accounting for the fact that, with some probability, you jump back. So you can imagine that this is wandering around and then coming back and coming back. In the end, you count how many times you hit a certain node, and this gives you an indication of how important that node is according to the structure of your graph.
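
The random walk with restart described above can be sketched in a few lines of Python. This is a Monte Carlo approximation of personalized PageRank under stated assumptions: the restart probability of 0.15, the number of steps and the toy graph are illustrative choices, not values from the talk.

```python
import random
from collections import Counter


def personalized_pagerank(graph, seed, restart=0.15, steps=20000, rng_seed=0):
    """Estimate personalized PageRank by simulating one long random walk.

    graph maps each node to a list of neighbours. At every step the walker
    either jumps back to the seed node (with probability `restart`) or moves
    to a neighbour chosen uniformly at random. The score of a node is the
    fraction of steps spent on it.
    """
    rng = random.Random(rng_seed)  # fixed seed so the sketch is reproducible
    visits = Counter()
    current = seed
    for _ in range(steps):
        visits[current] += 1
        neighbours = graph.get(current, [])
        if not neighbours or rng.random() < restart:
            current = seed
        else:
            current = rng.choice(neighbours)
    return {node: count / steps for node, count in visits.items()}


# Toy undirected graph around the question "where is Venice located?".
graph = {
    "Venice": ["Italy"],
    "Italy": ["Venice", "Rome", "EU"],
    "Rome": ["Italy"],
    "EU": ["Italy", "Denmark"],
    "Denmark": ["EU", "Copenhagen"],
    "Copenhagen": ["Denmark"],
}
scores = personalized_pagerank(graph, seed="Venice")
for node, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(node, round(score, 3))
```

Because the walker keeps returning to the seed, nodes close to Venice collect far more visits than distant ones such as Copenhagen, which is the "personalized" part of the score. Graph analysis libraries usually compute the same quantity with an iterative method rather than by simulation.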

Of course, there is a very beautiful mathematical expression, but basically it can be done in any graph analysis tool. This is one of the basic things. And therefore we can estimate the probability that a certain entity v can serve the queries, can actually answer the queries. And this is given by two factors: historically, what we have seen before, what queries, what searches we've seen before, and the fact that we exploit the structure. And this is something that sometimes we forget about: when we construct our knowledge graph, we have a lot of different connections. So why not exploit these connections? So you sum it all together, and there is just a parameter that you set. But basically the idea is that you can freely walk over the graph, and by walking in a very free manner, really a random manner, you end up having something relevant. Well, what we lacked was a real query log. We didn't have sequences of queries from users. So we generated them, but we generated them in a way that simulated what a user could do: a user could actually stay in the same topic.
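
The scoring idea described above, combining what past queries touched with a structural score, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `select_summary_nodes`, the weight `alpha`, and the toy numbers are hypothetical, and the sketch ranks entities and keeps the top k, whereas the budget in the talk counts edges.

```python
def select_summary_nodes(history_counts, structure_scores, k, alpha=0.5):
    """Rank nodes by a blend of query history and graph structure, keep the top k.

    history_counts: how often each node appeared in answers to past queries.
    structure_scores: a structural importance score per node (for example
        personalized PageRank), assumed to be on a 0..1 scale.
    alpha: hypothetical weight between history (1.0) and structure (0.0).
    """
    total = sum(history_counts.values()) or 1  # avoid division by zero
    nodes = set(history_counts) | set(structure_scores)
    blended = {
        node: alpha * history_counts.get(node, 0) / total
        + (1 - alpha) * structure_scores.get(node, 0.0)
        for node in nodes
    }
    # Sort by descending score; break ties by name so the result is stable.
    ranked = sorted(nodes, key=lambda node: (-blended[node], node))
    return ranked[:k]


# Hypothetical inputs, only to exercise the function.
history = {"diabetes": 8, "patient": 2}
structure = {"patient": 0.4, "diabetes": 0.3, "metformin": 0.2, "doctor": 0.1}
summary = select_summary_nodes(history, structure, k=2)
```

With `alpha` close to 1 the summary follows the query log; with `alpha` close to 0 it follows the graph structure, which is what lets it cover entities that no past query has touched.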

So I'm searching for patients with diabetes and I stay with that, or I keep changing: I search for patients with diabetes, patients that have cancer, and so on and so forth. So I keep changing. And the more you vary these two parameters, the harder the problem becomes. This is also motivated by the fact that we have seen in other query logs, on knowledge graphs, that this was actually happening. So we have basically extracted this sort of information. So we asked ourselves: is this working? Well, it turns out that it does. And it does with very little. If I take 10% of our knowledge graph, we can answer nearly all the queries that we had, nearly perfectly. So this is basically a hundred percent. And if we take 1%, 80% of the time it is correct. So it means that out of five answers, four are correct. And it's also quite efficient. It's actually very efficient. Okay, I will not stress this too much here. These are the experiments that we did with few topics and with many topics, that is, whether you're interested in patients with more or less the same disease, or the same groups of proteins, or you keep changing over time. And you see that this is a harder scenario, of course, and the performance goes a bit lower.

So the beautiful thing about this method is that it is very simple. It can be easily implemented, because these summaries are actually virtual graphs. So you can load them and then create the virtual graph instead of creating the big graph, and it can be used with any type of query. We didn't implement it in Neo4j, but it is possible to do it. All the methods that we have used are already there; you just need to put them together. And we have the code, which is available and can actually be used by anyone. I'm not sure I will be able to talk about the second problem as well, but we are also working on something more specific, which is entity summarization. So if I'm interested in a protein, or if I'm interested in a specific patient, the amount of information that we have is enormous. Even for a single entity, we need a summary of that single entity.

So if you look up, for instance, Denmark, you have huge tables, even just looking at Wikipedia. And all of these are basically attributes or connections to other entities. For instance, it is in Europe, it is in the European Union. This is a problem known as entity summarization. And it would actually be very useful for many different sectors. We are currently working with doctors, and this is a practical scenario we have, but biologists could also be interested, and journalists, for instance, to find quick information about the news, or someone who needs a small amount of information on a protein. I don't have specific results, but we are getting some preliminary results now that are encouraging. And we are devising a more powerful method that also works if you haven't seen the entity at all. Okay, so I have only a couple of minutes. I want to just go very quickly over another problem that we solved. I mentioned similarity search. Similarity search asks: are there other cities like Copenhagen? Typically nowadays you fix an algorithm, assuming that we know what these algorithms are, and these algorithms return a score for each of the entities related to Copenhagen.

And, for instance, you can rank the cities and say, okay, I return Bristol and so on. But this, of course, doesn't involve the user. So the user doesn't know why the method returned these answers. There's another method, which is called active search. So the algorithm selects an extra item and asks the user: is this relevant? Do you think that this is relevant? And slowly it learns, very slowly.

So what we have asked ourselves is: can we do better? We have the static version, where everything is returned at once, and then we have a very dynamic version, where results are returned in a very, very slow manner. Imagine you as a scientist looking for some specific protein, and then it takes hundreds of questions to be answered. This is not possible.

So what we have done in a previous work, a very recent one, is to combine these two, okay. So if the user tells me they are interested in Copenhagen, well, I know that there are other cities around. So I use the similarity that I already know, and I use it as a second kind of feedback. I have the user telling me yes or no, and I also have a second opinion, which is of lower quality, which is the similarity that I already know, but that allows me to understand a bit better. And instead of looking around everywhere, I look at what the similarity search is suggesting, and this allows me to remove a lot, a lot, a lot of candidates. Technically speaking, it is basically blending two different sources of information. And this is done in an elegant manner with a probabilistic approach. And here I'm basically concluding.

What we could see is that it was possible to really dramatically reduce the number of interactions. You see here that this is the budget, which is the number of questions. The other methods required up to a hundred questions, but we were going down to 20. So it was quite an interesting improvement. Again, the code is available. This is a bit harder, though, to implement in Neo4j, or if there is a way, I don't have the full idea, but it can, for instance, be implemented in Java.

And then use Neo4j as a backend. So as I said, I'm working at a Morgan with other very interesting people in academia and industry alike. We are aiming to improve how people search these knowledge graphs, how people explore data in these knowledge graphs. It's not as crowded a field, and we are working on it. We are going further and further with this vision, and hopefully in a few years we will have a fully working prototype. For now, there are pieces of code that have to be put together. Okay, this concludes my talk. And I'm very happy now to take any questions from you. – Fantastic, thank you so much. Thank you so much for your presentation, really interesting approaches.

And I think especially the last one, collecting user input for searches, or kind of aggregating user input for searches and all that. I think this is relevant because the more users you have, the more brains you can eventually tap into and get expert knowledge out of. Since this whole workshop is about collaboration and networking and so on, and you mentioned that, next to the basic research, you have active collaborations already: do you have anything in the life sciences or healthcare space? – Yes, so one of the projects is exactly with hospitals here in Denmark. We are constructing a knowledge graph with patient records and the diseases and what we know. Currently, we have a bit of a problem with privacy here, so there are a lot of privacy issues, but we are trying to do our best with what we know and with whatever information they can provide us.

And the purpose there is to help the doctors, who are typically going around and clicking around, finding journals and hospitalizations, finding opinions on diseases and so on and so forth, by putting all of that together in a very compact manner, so that they can actually go around with their mobile phones and check the information. And another work is actually on biological data, on gene therapy. And there they are trying to reduce the number of experiments needed in order to construct a certain cure for the diseases.

So there they also have a knowledge graph, which is very similar to many in this workshop, connecting proteins, genes, and so on and so forth. And the other information is about these specific experiments. So actually, yeah, the applicability of this is very broad. It can go from general domains to life science in general, and I'm very happy, actually, to collaborate. So I'm slowly building up some knowledge about biology that I didn't have before. – So you have active collaborations in this area already. – Yes, and I'm very happy, of course, to take on others. They are kind of limited, also because one, the hospital project, was put in as a collaboration later on, but the area is really, really broad; the question is really big. – Yesterday we had a talk from Northwestern University about mapping OMOP data from the OMOP common data model to Neo4j; that's also interesting. So there was one question about your methods. For such a big knowledge graph, I think you have to precompute all the similarity coefficients, am I right? And if that's right, how do you update them when new entities come in? – Yeah, that's a very, very nice question.

We currently don't update the similarities. You can approximate them. Especially if you're using PageRank, you can use an easy approximation with sampling that allows some degree of updating, but for now we assume that the graph is static. It is quite challenging to update the weights, or somehow what you know. We are working, though, in the direction of updating everything, to make it more adaptive, but it requires quite a lot of, not engineering, but a lot of sampling methods. You have to imagine that you have to maintain the distribution, more or less. And if a new node comes in, the distribution changes. So there is quite a lot of work to do. – So also from the algorithmic and analytical perspective.

But the whole update question is especially relevant and still an open issue. I have one more question. You use a lot of graph and network analysis tools and frameworks and so on outside of Neo4j. So what do you think: how does the intersection of these existing data science tools with Neo4j work, and where would there be room for improvement? – Well, there are some tools that are written in Java, and those you can basically already interface with Neo4j, and that is what we have used in the past. Most of the times in which we need heavy computations, we really need to do that in Java. So we basically load the portion of the graph that we're interested in into memory. And then we run the analysis there, because otherwise making a lot of queries in Neo4j would be too slow from the computational point of view. So basically you use a two-step approach: you load some part of the data, you make your analysis, and then you save to Neo4j again.

So that would be the workflow that we use. – Okay, because we had that question a few times yesterday, and because the GDS library can do a lot, but obviously we have seen so many examples of other data analysis approaches. The question came up a couple of times. Okay, so thank you so much for your presentation and talk, thank you so much for being here. I hope that you will find new ideas for collaborations, new projects and so on. – Thank you, Martin. And of course, if anyone wants to reach out to me by email, you can just type my name and you will find my email.

– And with that, we can hand over to our next speaker. That's Zeshan Ghory from IQVIA, who will talk about a graph database that he uses to evaluate drug safety data. So I'm looking forward to your talk, back to more patient-level data. So the stage is yours, looking forward to your talk. – Great, thanks a lot. So let me just see if I can share my screen. All right, hopefully everyone can see my screen. So today, yeah, this talk is going to be about how we can use Neo4j to help us evaluate drug safety. So let's jump into it. So just to start, the first question obviously is: how do we evaluate drug safety? What do we mean by drug safety? So all the drugs that you and I use as patients, they've all been through extensive clinical trials, they've been approved by whoever your local regulator is.

So the FDA, for example, in the US. However, even once drugs are
approved and on the market, we still have to have
some way of identifying if anything's happening,
that shouldn't be happening, what we call adverse events. And these adverse events
could be very mild. Something like nausea or fatigue, or it could be something
much more serious, like a stroke or a heart attack. So in most jurisdictions, there is some kind of
spontaneous reporting system available for recording adverse events. So this is something, a system that allows
physicians or patients to be able to record if they took a drug and they had some kind
of unexpected side effect or adverse event taking place. And the best known of
these is the FAERS system, which is operated by the FDA. I think that stands for the FDA Adverse Event Reporting System. And that's the system that we're going to be looking at. So if a drug has adverse event reports, should we be concerned? And the reality is every drug is going to have some entries in the FAERS database.

So the real question is: how do we know if this represents some kind of potential problem? And the typical way we deal with this is to use odds ratios. I'm not going to get into the detail of the statistics, but essentially you build a two-by-two table. You want to know who took the drug and had the adverse event, and who took the drug and didn't have the adverse event. And then you look at all the patients that didn't take the drug. And again, you look at how many of those had the adverse event and how many of those didn't have the adverse event. And then you can look at a ratio of the two. That's one statistic. There are other statistics you can use as well, other ratios, to understand whether or not we think there's a link between taking the drug and an adverse event happening.
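
The two-by-two table can be made concrete with a short Python sketch. It computes two standard disproportionality statistics, the reporting odds ratio (ROR) and the proportional reporting ratio (PRR), from the four cell counts. The function names and the example counts are hypothetical and are not taken from FAERS or from the speaker's system.

```python
def reporting_odds_ratio(a, b, c, d):
    """Odds ratio from a two-by-two table.

    a: took the drug and had the adverse event
    b: took the drug and did not have the adverse event
    c: did not take the drug and had the adverse event
    d: did not take the drug and did not have the adverse event
    Raises ZeroDivisionError if b or c is zero.
    """
    return (a * d) / (b * c)


def proportional_reporting_ratio(a, b, c, d):
    """Share of events among drug takers divided by the share among everyone else."""
    return (a / (a + b)) / (c / (c + d))


# Hypothetical counts, only to exercise the functions.
a, b, c, d = 20, 80, 10, 190
ror = reporting_odds_ratio(a, b, c, d)
prr = proportional_reporting_ratio(a, b, c, d)
```

A value well above 1 for either statistic suggests the event is reported disproportionately often with the drug. As the talk notes, that says nothing by itself about causation.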

But there is a slight problem with this, which is that the FAERS data and any other spontaneous
reporting system data is going to have a certain amount of bias. And that bias is going to
be the fact that the system is only going to contain data where adverse events have occurred. So normal drug use where the drug is used, nothing goes wrong, is never really going to
turn up in the system, except if you happen to be taking the drug alongside another drug. So what we need to look at is
what we would typically call real-world evidence. And by real world evidence, I mean here information about
how the drug is being used in real life. So an example
here could be looking at hospital electronic health records. We can look at insurance
claims data as well.

And indeed, insurance claims data specifically is what we looked at with our solution. So the first problem that you run into, and a few of the previous talks have touched on this, is that every single database you use is likely to be using different vocabularies to record drugs and conditions. So for example, in the FAERS database, all of the adverse events that occur, the reactions when you take a drug, are recorded using a vocabulary called MedDRA. In our system, where we've got insurance claims data, we've got everything sitting in an ontology called ICD10. So we need a way of mapping between the two. And fortunately the OHDSI initiative provides a way of mapping between many different vocabularies.

So you can see here on the right-hand side, we have all of the MedDRA hierarchy; we can go from there to SNOMED, and from SNOMED we can get down to, in our case, ICD10CM. So this makes life a lot easier. What's important to note is that there is still a certain amount of messiness involved. It is not quite as clean as being able to map from one thing to another directly, and some of the mappings may be missing. So you can't rely on this a hundred percent, but it's probably better than anything else that's out there at the moment. So what did we do? We essentially took FAERS and the real-world evidence that we had. We combined this with the OHDSI ontologies, and we did this all using Neo4j to connect them. This gave us a solution which lets us do these odds ratio calculations over the FAERS data, and then find the corresponding real-world evidence, in this case insurance claims, to do the equivalent calculation there, to see how the two compare.

So let's talk a little bit about how we actually implemented this. You see here a model that we developed and populated with data in Neo4j. Just as an example, and it's probably too small to read, but in the center here we have a drug, in this case metformin, a common antidiabetic drug. This is encoded in RxNorm, which is a drug ontology. From here, in the dark blue, these are the entries in the FAERS database for people who are taking that drug. And you'll notice here, this drug could have various different brand names. Then for anyone who was taking this drug, this connects to the case record in FAERS. So each of these has got a different case number. From there you can jump to the reaction that happened as a result of, or potentially as a result of, taking that drug. Obviously, we don't know whether there's a causal relationship here.

All that's recorded is that somebody took this drug and they had these reactions occur. So you can see here, for example, pain is a very common one. We've got malaise here. So these are the different reactions that happened. And then from there, in the green, we translate this into the entry in MedDRA. And what I haven't included here, to save space, is that from MedDRA we would go to SNOMED, and from SNOMED we'd go to ICD10. And that's what lets us eventually connect to our real-world evidence.
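
The chain of hops described above, from one vocabulary through an intermediate one to a third, can be sketched in Python as repeated lookups. The code identifiers below are made-up placeholders, not real MedDRA, SNOMED, or ICD10 codes, and the function name `map_through` is hypothetical. The sketch also shows what a missing mapping looks like: the result is simply empty.

```python
def map_through(code, *mappings):
    """Follow a chain of vocabulary mappings and return the set of target codes.

    Each mapping is a dict from a source code to a list of target codes.
    A code with no entry in a mapping contributes nothing, so gaps in the
    mappings lead to an empty or partial result instead of an error.
    """
    current = {code}
    for mapping in mappings:
        reached = set()
        for source_code in current:
            reached.update(mapping.get(source_code, ()))
        current = reached
    return current


# Placeholder mappings, only to exercise the function.
meddra_to_snomed = {"MEDDRA:pt_a": ["SNOMED:s1", "SNOMED:s2"]}
snomed_to_icd10 = {"SNOMED:s1": ["ICD10CM:x1"], "SNOMED:s2": ["ICD10CM:x1", "ICD10CM:x2"]}

icd_codes = map_through("MEDDRA:pt_a", meddra_to_snomed, snomed_to_icd10)
unmapped = map_through("MEDDRA:pt_b", meddra_to_snomed, snomed_to_icd10)
```

In a graph database the same chain is a path traversal over mapping relationships; the dictionaries here only stand in for those relationships.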

So we just loaded the ontologies into Neo4j. So we've got all the FAERS data sitting in Neo4j, we've got all of the ontologies from the OHDSI project sitting in Neo4j.
data to those ontologies. Now we come to our real world evidence, our insurance claims. So the insurance claims
data is very, very large. We're talking billions of rows of data. We did initially look at
putting this into Neo4j and we figured out that really
wasn't a practical solution. I mean, you'd need a ridiculously
large amount of memory on your box to make this work.

And actually what we realized is we're better off leaving this data sitting in place in the Hadoop cluster where it lives. There's already a whole bunch of ETL processes that populate it, and we didn't want to have to replicate those either if we were pulling that into the graph. So yes, it didn't make sense to move that into Neo4j. So what we did in practice is we use the graph to generate the ICD10 codes, those are the condition codes, and to generate the corresponding NDC codes. NDC is the ontology we use with our insurance data to record the drugs that patients are taking. And so we take the FAERS data, we convert it into a series of codes in the ontologies we want, and then we feed that into Spark to query the Hadoop cluster, and all of that's orchestrated using Python. So that seemed for us to be the most efficient way to get this to work.

To be honest, we put the FAERS data into the graph database, but you could equally have the FAERS data sitting in a relational database and really use Neo4j to hold the ontologies, the graph-like data, which is what it's best at doing. And then once we had all of that in place, we built a React front end on top of this to give us a SaaS product, which we can use to run the calculations for any drug and adverse event combination. So let's have a look at some of the sample output from what we built.

So as you can see at the top here, and let me just switch to the laser pointer. Okay. So you can see at the top here, we can pick a drug. We can pick a date range. So for this drug here, we're using the RxNorm ontology, so you have to pick something from the RxNorm ontology, but you could equally use any other ontology if that made more sense. Once you select that particular drug, we're able to then bring back from the FAERS data all of the adverse events. And in this case, what we've done is we've grouped the adverse events using one particular level of the MedDRA hierarchy, which is HLT, high level terms, but you could equally go higher up.

You could go to SOC, which
is the system organ class. And that's part of the
beauty of having all of this connected with a graph. It's very easy for us to aggregate at any level of the ontology that we want. So this has brought everything back at the level of the high level terms. In this case, we picked aspirin, and we've picked renal failure and impairment as the high level term. Based on selecting that, we can then bring out all of the individual preferred terms. This is the level that's recorded in FAERS. And we've selected here the top one, which is acute kidney injury. So what we want to know is: what's the relationship, or what kind of odds ratios do we get, for patients who are taking aspirin and had acute kidney injury? So once we've done that, down here at the bottom, what we can do is the odds ratio calculations, both for FAERS.

So this data is all sitting in Neo4j, and we can then jump over to the data sitting in the Hadoop cluster, based on the codes that we're sending over, to get the same calculation for all of the data in the insurance claims. And then on the right-hand side here, you can see the comparison between the two. We've looked at two different ratios here, the PRR and the ROR. They're both different ratios that are used when analyzing drug safety. Now what you can see here, and this is quite a common thing that'll happen, is that the odds ratio comes out much higher in the FAERS database compared to what's happening in the real-world data. And this is not surprising. The FAERS database is obviously, as I mentioned earlier, biased towards people who are submitting details where they had some kind of suspected adverse event. Whereas the data we have sitting in our real-world database is showing much more typical usage of the drug, where there doesn't appear to be anywhere near as strong a link between taking this drug and this particular adverse event.

And if you go to PubMed, for example, you'll find dozens of papers from people who do this kind of analysis over FAERS data. And what a system like this lets us do is then be able to do that same comparison across real-world data sets. So, as I mentioned, we're looking at insurance claims data. We haven't done this yet, but you could equally connect it to, for example, electronic health record data or any other datasets that we wanted to, as long as there's a means to link from one ontology to another. Now, having everything sitting in the graph also allows us to do a couple of other quite interesting things.

One thing that we can do is we can, for a particular drug, we can compare that drug safety profile to other drugs in the same class. And you may have a question
about, what do we mean by class? Again, this is coming
back to the ontologies. We can, we can pick an ontology. I mean, ATC, for example, would be a useful one to use here. So we can then jump across
from whichever ontology you were using for the
drug NDC for example, we can go across to what
the entry for that drug is in the ATC ontology. We can go up one level
to find a drug class. We can come back down to find
all of the other children.

So the siblings of this drug, and then we can do the
exact same calculation that we've done here for
every other drug in the class. And indeed, we could do it for the class as a whole. And the reason you might want to do that is a drug may well have a certain risk of an adverse event, but maybe that risk is no worse than other drugs in the same class.
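
The "up one level, then back down" step for finding drugs in the same class can be sketched in Python over a simple child-to-parent map. The drug and class names are placeholders rather than real ATC entries, and the function name `siblings` is hypothetical.

```python
def siblings(child_to_parent, drug):
    """Return the other drugs that share a parent class with `drug`.

    child_to_parent: dict mapping each drug to its class in the hierarchy.
    Returns an empty list if the drug is not in the hierarchy.
    """
    parent = child_to_parent.get(drug)  # go up one level
    if parent is None:
        return []
    # Come back down: every other child of the same parent is a sibling.
    return sorted(
        other
        for other, other_parent in child_to_parent.items()
        if other_parent == parent and other != drug
    )


# Placeholder hierarchy, only to exercise the function.
hierarchy = {
    "drug_a": "class_x",
    "drug_b": "class_x",
    "drug_c": "class_x",
    "drug_d": "class_y",
}
same_class = siblings(hierarchy, "drug_a")
```

Each sibling returned this way could then be fed through the same ratio calculation, so that a drug's risk is read against the rest of its class.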

That's probably the more
useful thing to look at. And then the other thing that we can do,
linked sitting in the graph is that we're able to look
at combinations of drugs. So if we have reports where
multiple drugs are being used, we can do that same analysis. Looking at that in FAERS, we can look at the same thing again in our real-world data. So those are some of the more interesting areas that we can move on to after this. But the main advantage that the graph is bringing here is the
ability to be able to connect multiple heterogeneous data sets and then be able to do a single
analysis across all of them.

And so, yeah, that's, that's
about all I had to cover. I'd love to hear if anyone has any questions or comments
on any of what I've said. – Fantastic, thank you so much for your presentation and for showing your application and all this graph analysis. So let me jump over to the chat and see if we have questions. And while I do that, I would like to, again, reuse the question we have after every talk: how do you work with data updates? The question came up a couple of times when it was about genes and proteins and so on, and I think in a certain way the terminologies and vocabularies are very fundamental to everything in medical informatics, but they change. So how did you handle that? How do you deal with that? – Yep. No, that's a great question. And our team was chatting about
this this morning, actually, because we'd seen the same
discussion on the other talks.

And the reality is the easiest solution for us is just to wipe the database and rebuild. It is much easier than trying to come up with some kind of incremental process to fix it as we go along, is the simple answer. I suspect as the field matures, there may be better solutions in place to have incremental updates, but for now we don't have anything more sophisticated than that. – So you can pull the latest version from the OHDSI vocabularies. – And bear in mind, a lot of this data doesn't change that frequently, but yeah, if we need to, we'll just reload it all again. – And maybe a follow-up question about the vocabularies. So as far as I know, the MedDRA to SNOMED relationships in the OHDSI vocabularies are not complete yet.

So I think MedDRA is only used for classification, correct me if I'm wrong here. Did you use other mapping data sources on top of the OHDSI vocabulary? – So we didn't use additional data sources, although that is obviously an option. Typically when this kind of analysis is done, if you go to PubMed and look at these papers, what'll happen is someone will just manually go through and do the mapping between one coding system and another, and what we're trying to do is come up with a more general solution.

So, I mean, there's a decent level of coverage of MedDRA preferred terms, at the PT level, but it's not complete. So what we did is go to SNOMED, jump up a level, and then go back down again to try and get reasonable coverage of the relevant codes. But yeah, I think with the OHDSI model, that's probably about as good as you can do with the data that's there now; the alternative is to find other mappings. – But at least if it's in the graph, you get the option to extend it. Again, the same question, asked after almost every talk. I mean, there was the OMOP epic talk yesterday, and we had a bit of discussion after that talk.

Do you think that, for this very fundamental space of terminologies and vocabularies, there could be a kind of default way, a default graph representation of this that people could reuse? – Sorry, could there be a default representation of what? – So we had a talk about mapping OMOP data to Neo4j yesterday and had a bit of a discussion. The power of OMOP is that it is a common data model, right? All store data in the same way. Do you think there could be a 100% correct default way to represent data from the OMOP vocabularies plus events in a knowledge graph? – Yeah, it's a good question. You're right, in that it feels like there's a lot of us doing the same kind of implementation again and again and again. I suspect there are probably certain decisions made in modeling which are fairly application specific, is the reality right now. That said, it would be great if there was a default, off the shelf: okay, click a button and you get all of this pulled into a graph.

And I guess maybe that's something that might make sense for Neo4j or someone else to do. I wouldn't be surprised if people have to tweak it for application-specific reasons. I mean, one simple one, actually, that we were looking at, and this is probably not critical, but for example, with the OMOP model, you'll have all of the ancestors linked at every level for each ontology, which is done primarily, I'm assuming, because if you were loading this into a SQL database and you want to find the parents, grandparents or great-grandparents, you can do it with a single join.

Obviously if you've got it sitting in a graph database, you can just traverse
the links to do that. So it's not really necessary. So there are some possibly
simplifying things that could be done if you were
loading this into a graph. – So there needs to be more work and more discussion in the OHDSI community, essentially, on how this would work and how we could help with it. Okay, there's another really interesting question from LinkedIn: a great presentation.

Are you using this approach also for multilingual insurance data? And if yes, how did you do the mappings across taxonomies? – It's a good question. So everything we're looking at so far has been US only, so it's all been in English. I mean, in theory, if you're using one of these ontologies, one of these coding systems, I would have thought it would be fairly language independent, but I don't know for sure. And we haven't looked into that yet. – Yeah, I think you would be surprised about the different data sources, but I mean, this is what the OHDSI community is all about, right? Also including different languages. Okay, let me jump back to the chat and see if we have other questions. There is a question about the vocabularies: what about the use of UMLS? – So we haven't been using UMLS.

I mean, we essentially just took the OHDSI mappings off the shelf and took the simplest paths we could find between the vocabularies we needed. And I guess as we find holes, then we might look to use other solutions, as you mentioned. – Okay, let me quickly scroll through our chat and the Q&A section. But I think that was the last question. So a lot of questions about the vocabularies, also multilingual ones and so on. So thank you so much for your presentation. It was really interesting to see more approaches, more projects from the healthcare space, on, let's say, patient-level data. Thank you so much.

I hope you enjoyed the workshop and the other talks, and I hope that you can also find interesting ideas and input, and maybe a potential place to collaborate. So thank you so much. – Thanks so much. – And with that, I would like to hand over to our next speaker. Emre Varol from John Snow Labs will present, I think, an interesting topic that came up a couple of times yesterday: how to build a clinical knowledge graph with Spark NLP, implemented, obviously, in Neo4j.

So Emre Varol, welcome. I'm really looking forward to your presentation. The stage is yours. – Thank you, Martin. Let me start sharing my screen and test it. Am I sharing? – Yes, we can see the screen. – Okay, thank you very much for the great intro. I really appreciate it. I'm Emre, and I'm based in Istanbul, Turkey at the moment. I'm a data scientist at John Snow Labs, and here is the topic: creating a Clinical Knowledge Graph with Spark NLP and Neo4j. Here is the agenda. I'll start by explaining Spark NLP first, and then we will continue with Spark NLP for Healthcare, how it works and the pillars of it. The next step will be our other main component for creating a knowledge graph, Neo4j. I think most of you already know a lot about Neo4j, so after a quick reminder I will jump into the clinical knowledge graph: what the properties and features of a clinical knowledge graph should be, and the points to be considered by practitioners.

And then the last one is, I mean, the most important part of this talk: how we can create a clinical knowledge graph with Spark NLP and Neo4j. We'll do this part live, as a live demo. Let's start. So Spark NLP is an open-source NLP library released in 2017. Right now it has around 40K daily downloads from PyPI. Our monthly downloads are about 1.2 million, and in total we have already hit 10 million. We support four languages: Python, R, Java and Scala. The goal was to create a single unified library that requires no dependencies other than Spark itself. It should also work on clusters. As far as I know, there was no other NLP library that natively supported Spark clusters. We also wanted to take advantage of transfer learning and implement the latest and greatest state-of-the-art algorithms. As you can see, we have public and healthcare models. The healthcare models are licensed, which is not free, but the public version is totally free.

We also have another library, which is called Spark OCR. When it comes to clinical text, the data mostly starts out as PDFs or images, so EHRs (electronic health records) or medical records usually go through Spark OCR first, but I'm not going to cover that today. On the left-hand side, you see the public modules. We support more than 200 languages using the same NLP architecture, which is another nice point to underline, by the way, because we can switch from one language to another easily as long as we have the relevant word embeddings. For broad coverage of certain NLP tasks, we have deep-learning-based emotion detection and text classification algorithms, and pretrained models that you can just plug and play. So instead of trying to build your own pipeline, you can just download and use our pretrained pipelines and feed in your data frame. And so you end up with the latest state-of-the-art results, using various word embeddings, including GloVe and the latest cool kids in NLP town: BERT, ELMo, ALBERT, DistilBERT, and so on and so forth.

We support all of them, and you can use any model like a building block, plug and play. On the right-hand side, you'll see the clinical versions, which work like the public models. We have four main clinical models. We have more than that, but let's just cover four of them: clinical entity recognition, clinical entity linking, assertion status and relation extraction.

We have more than 50 different clinical NER models, which means models pretrained on the datasets used in clinical NLP challenges, like the i2b2 and n2c2 datasets. We call these models anatomy NER, posology NER, PHI NER, etc., and they extract the meaningful chunks for the given task. Clinical entity linking assigns SNOMED or ICD-10 codes to the entities detected through NER. Assertion status is a very important model in the healthcare domain. If the clinical note is talking about the patient's father, not the patient, we need to know that so that we can create different features accordingly, or, if the symptom does not apply to the patient, just filter it out. Then there is de-identification: in this part, we remove or mask the sensitive information.

According to HIPAA rules, we need to hide or conceal some of the information. This is called sensitive information, and we need to de-identify or obfuscate it. Obfuscating means replacing the real names or entities with fake ones, okay? We also have some pretrained models that you can plug and play, and then combine with the public ones. I mean, you can use the public models and combine them with the clinical models. We keep Spark NLP up to date with a release every two weeks; we have released more than 75 times. The main focus here is to have a single unified library for all NLP and NLU needs. We also have the John Snow Labs Slack channel and GitHub. On the right side, you see some NLP features and a comparison with the other libraries.

I mean, here are Hugging Face, CoreNLP, NLTK, spaCy and our product, Spark NLP. Let me show you an example to answer the question: what is the output of these Spark NLP models? Here is the clinical text: a 28-year-old female with a history of type 2 diabetes mellitus, diagnosed eight years ago, takes 500 milligram metformin three times per day. As you see, the first entity here is the age of the patient, and the second one is gender. Then you see the clinical finding, type 2 diabetes mellitus. We can use our assertion model and our resolution models to map these clinical findings to SNOMED, ICD-10-CM and UMLS codes. After that, you see a relative date here, eight years ago, which we can extract using our NER pipelines, and then the dosage of the drug, metformin 500 milligram.
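
As a rough picture of what that output looks like, the entities from the example sentence can be written down as plain data. This is a hand-written sketch, not the output of a Spark NLP model: the `Entity` class and the label names (`Age`, `Gender`, `Problem` and so on) are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    chunk: str  # text span found in the note
    label: str  # entity type; these label names are illustrative only


text = (
    "A 28 year old female with a history of type-2 diabetes mellitus "
    "diagnosed eight years ago, takes 500 milligram metformin "
    "three times per day"
)

# Entities as described in the talk, written out by hand.
entities = [
    Entity("28 year old", "Age"),
    Entity("female", "Gender"),
    Entity("type-2 diabetes mellitus", "Problem"),
    Entity("eight years ago", "RelativeDate"),
    Entity("500 milligram", "Strength"),
    Entity("metformin", "Drug"),
    Entity("three times per day", "Frequency"),
]

# Sanity check: every chunk is a literal span of the input text.
all_chunks_in_text = all(e.chunk in text for e in entities)

# Later steps (assertion status, code resolution, relation extraction)
# would work on these chunks rather than on the whole text.
by_label = {e.label: e.chunk for e in entities}
```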

And we can map these drugs to RxNorm codes and NDC codes, and you see the frequency of the medication. We can also calculate the HCC risk adjustment score from these kinds of clinical texts. Let's talk a little bit about NER on the healthcare side. Everything starts with NER because an entity is the minimal meaningful chunk of any clinical text. The importance of NER might not be that great in other domains, but in healthcare it is very important. Everything builds upon clinical NER models, that is, clinical entity recognition models. We have many other models that have nothing to do with NER, but when it comes to extracting knowledge and getting a sense of what is going on inside the documents, we start with a clinical NER model and then use the other models.

They use the output of NER as an input, instead of dealing with the entire text. Entity linking, assertion status, de-identification and relation extraction all use the output of the NER models. For example, the assertion status model assigns a status: if fever and sore throat are problem entities coming from NER, the assertion status model tries to find whether each one is absent or present, given the sentence that chunk lives in. Likewise, relation extraction models try to find the relations between NER chunks. So that is why NER is very important and highly valuable in clinical texts.

Here's another example. In my humble opinion, annotating the data and NER are the brain and the heart of the NLP body, which manage every other task and subtask. Recognition of named entities is basically classification of tokens. NER tries to locate and classify predefined categories, such as persons, locations, organizations, hospitals, medical centers, medical codes, measurements, units, monetary values, percentages, quantities, etc. We use NER and the related downstream tasks to answer real-world questions: Which hospital and department has the patient been admitted to? Which clinical tests have been applied to the patient? What are the test results? Which medication or procedure has been started? And here is an example of relation extraction. As you know, relation extraction is the task of predicting semantic relationships from text. Relationships usually occur between two or more NER chunks. Relation extraction (RE) is a core component for building a relational knowledge graph, and it is essential for NLP applications like question answering, summarization, etc. Clinical RE plays a key role in clinical NLP tasks that extract information from healthcare reports.

You can use it for multiple purposes, such as detecting temporal relationships between clinical events, drug-to-drug interactions, the relation between medical problems and treatments, medication interactions, etc. I will not go deeper into the importance of RE in medical studies today, but I think you know about it. In the healthcare world, automatic extraction of the patient's natural history from clinical texts or EHRs is a critical step. It helps to build intelligent systems that can reason about clinical variables and support decision-making.

Any intelligent system should be able to extract medical concepts, date expressions, temporal relations and the temporal ordering of medical events from clinical text. But this task is not easy. I mean, it is very hard to tackle due to the domain-specific nature of clinical data, such as the writing quality, the lack of structure and the presence of redundant information. More generally, I think we can handle this task by combining some rule-based methods, a unified NLP library and the power of a knowledge graph.

In my opinion, this combination is the most appropriate and the quickest way. Here is a brief introduction to Neo4j. Neo4j is a highly scalable native graph database. The core belief behind it is that connections between data are as important as the data itself. I'm pretty sure that using the connections between data will create a competitive advantage for producing actionable insights in healthcare. Here you see some information about what a graph database is, and I will skip this part quickly to save time for the live demo part. And you see some other information about Neo4j and the Neo4j graph.

I mean, Cypher is the Neo4j graph query language. It's a declarative pattern-matching language. This part is important: it's all about patterns. As it says, it has SQL-like syntax and it is designed for graphs. This is a basic pattern. What if we only want the relationships of a specific type? As you see, (a)-[:ACTED_IN]->(m): the actor acted in the movie. This is two nodes and one relationship. And we try to get these patterns from the clinical texts to reach our goals. Here you see the Cypher query example for this graph, okay? Now let me start to talk about knowledge graphs, and here's an important distinction: these are charts, not graphs, so it's good to start using the right terminology. And we will jump into what a clinical knowledge graph is.
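
The "two nodes and one relationship" pattern from the Cypher part can be illustrated without a database. In this sketch the Cypher text is only held in a string for reference, and a few lines of Python do the same kind of matching over an in-memory edge list. The people, movies and the `match` helper are hypothetical, invented for the example.

```python
# The pattern as it is commonly written in Cypher. It is kept as a string
# for reference only; nothing below sends it to a database.
CYPHER = "MATCH (a:Person)-[:ACTED_IN]->(m:Movie) RETURN a.name, m.title"

# Hypothetical toy graph as (start node, relationship type, end node).
edges = [
    ("Actor A", "ACTED_IN", "Movie X"),
    ("Actor B", "ACTED_IN", "Movie X"),
    ("Director C", "DIRECTED", "Movie X"),
]


def match(edges, rel_type):
    """Return (start, end) pairs for edges of one relationship type,
    i.e. every place the two-nodes-one-relationship pattern occurs."""
    return [(start, end) for (start, rel, end) in edges if rel == rel_type]


acted_in = match(edges, "ACTED_IN")
```

Restricting the pattern to one relationship type is what filters out the `DIRECTED` edge here.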

Healthcare data is complex. It exists in multiple places, with redundancy; it is both structured and unstructured; and you can find inconsistent definitions in the clinical text. Also, the regulations are changing, which brings new requirements. So on the clinical knowledge graph side, we need to create data points that accurately represent the patient history, rather than creating a patient history that contains a variety of data points. It should be a representation structure and method, and it should organize the entire body of clinical knowledge. It should contain highly relevant clinical data. It should be universally accessible. These are the properties and the features of the knowledge graph. This illustration is one of my conceptual graph models. Here's the tip of this talk, especially for clinical graphs: before creating a knowledge graph, you should spend some time working on conceptual graph models.

Otherwise it is very easy to create graphs that are irrelevant to your purposes. It will be a complete mess. I mean, you have a big chance of connecting irrelevant entities to each other. After creating the knowledge graph, you should validate the graph by querying it and crosschecking with the raw data, like an input-output analysis. And here is one of our knowledge graphs for the text on the right side. You'll see that text, and we create this knowledge graph from this clinical text. As I said before, I spent some time creating the conceptual graph model behind this, because it is not very easy to see. We have level one, level two, level three, level four and level five, and also this one, okay? I mean, you always have to keep in mind, before creating a knowledge graph, what the conceptual graph should be and how we can map the clinical texts onto our conceptual graph to create the knowledge graph.
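
The validation step mentioned above, crosschecking the graph against the raw data, could look something like the following. This is one possible sketch under simple assumptions: nodes are plain strings, edges are (start, relationship, end) tuples, and the `validate_graph` helper is hypothetical, not something shown in the talk.

```python
def validate_graph(raw_text, nodes, edges):
    """Cross-check a small graph against its source text.

    Returns a list of human-readable problems; an empty list means
    both checks passed.
    """
    problems = []
    lowered = raw_text.lower()
    node_names = {name.lower() for name in nodes}

    # Check 1: every node should be traceable to the raw text.
    for name in nodes:
        if name.lower() not in lowered:
            problems.append(f"node not found in source text: {name}")

    # Check 2: every edge should connect two known nodes.
    for start, rel, end in edges:
        for endpoint in (start, end):
            if endpoint.lower() not in node_names:
                problems.append(f"{rel} edge points at unknown node: {endpoint}")

    return problems


raw_text = "takes 500 milligram metformin three times per day"
nodes = ["metformin", "500 milligram", "aspirin"]
edges = [
    ("metformin", "STRENGTH", "500 milligram"),
    ("metformin", "FREQUENCY", "daily"),
]
problems = validate_graph(raw_text, nodes, edges)
```

Here the check flags the node that never appears in the text and the edge whose endpoint was never created as a node.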

I'm done with the presentation side. I will just reshare my screen after stopping it, to jump into the live demo. Just give me a minute, please. Okay, I think I'm sharing the screen, so we can start with the live demo part. You can find this notebook by looking at our workshop repo on GitHub. If you want, I can share the links with you, or you can just jump to our workshop here. Can I send a message to all people, all attendees here, maybe? So you can find our repo on GitHub, and you can find the knowledge graph part here. It's 10.2, okay? In this notebook you can find the details. I will cover one example for the public side and one example for the licensed version. First of all, the licensed version. After you get this notebook, you can also get a trial license for Spark NLP from our website, our official website. After filling in the form, you will get a 30-day free trial, and you have to load your license key before starting, because you will download the libraries according to these secrets.

Yeah, then just download it and we will install all these libraries. It's really easy. I'm just skipping this part because, I mean, we are all experienced with this part, I think. You will see I'm using Spark NLP public version 3.3.2 and licensed version 3.3.2, and we are still using PySpark version 3.1.2, okay? And here is the pipeline. First of all we have the document assembler annotator, and after that the sentence detector, tokenizer, word embeddings model, POS tagger, NER model and also the NER converter internal, which converts the NER output into chunks, I mean, combines the tokens into chunks, and then the dependency parser model and the relation extraction model. You create these inside the code snippet, and then you just write the pipeline. I mean, you put them in line like Spark ML modules, okay? And we will fit the pipeline with empty data.

After that, we will feed it our texts to get the results. This is get_relations_df, the helper function that gets the relations from the LightPipeline. Maybe you have a question: what is a LightPipeline? With a LightPipeline you can get the results very quickly. I think it's 10 times faster than the normal pipeline. Here is the text: a 20-year-old female, blah-blah-blah. I will create a LightPipeline from this pipeline, get the relations with this helper function, and show them to you in a pandas DataFrame. So when we run our pipeline... let's check what the pipeline is. This is the posology relation extraction pipeline. Posology, I mean, is for when you are working on drugs, right? So we will see relations between drugs and the modifiers of the drugs, like the duration, the route and so forth.

As you see, the first one is five days. The first chunks are "five-day" and "amoxicillin", the name of the drug, then "for six months" and insulin, and 40 units and insulin, and metformin with the strength and the frequency of this drug, okay? After that, we need to connect to Neo4j, and I will show you how you can open a Neo4j sandbox. I mean, I will use a sandbox on the Neo4j side, and you can just launch a free sandbox from here. After you launch it, you will see a window, and you will just choose new project. After that, you will see a blank sandbox here, which is not available for now, because I'm already using it. And you will see the username, password and Bolt URL here. After running this part, we will use this class for the Neo4j connection, and we need the helper function to create nodes and relationships in batches.
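
A batching helper of the kind mentioned here might look roughly like this. It is a sketch under assumptions, not the notebook's code: the `batches` function, the property names and the sample rows are made up, and only the `NER` node label comes from the talk. No database connection is opened; the example just prepares the per-batch parameters.

```python
def batches(rows, batch_size):
    """Yield consecutive slices of `rows`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# One parameterised statement reused for every batch. UNWIND turns the
# list parameter into one row per element on the database side.
QUERY = (
    "UNWIND $rows AS row "
    "MERGE (n:NER {name: row.name}) "
    "SET n.type = row.type"
)

# Hypothetical sample rows standing in for extracted entities.
rows = [{"name": f"chunk-{i}", "type": "DRUG"} for i in range(5)]

# Each payload is what would be handed to the driver together with QUERY,
# for example session.run(QUERY, payload) with the neo4j Python driver.
payloads = [{"rows": batch} for batch in batches(rows, batch_size=2)]
batch_sizes = [len(p["rows"]) for p in payloads]
```

Sending a few hundred or thousand rows per statement usually costs far fewer round trips than one statement per node.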

We have the function here, but I will skip it for now and return back here. So this is your URI, password and user. You will just copy and paste them from the right side to the left side, and you will create a connection here, okay? So we can talk about what the NERs and the relations are, and what our conceptual graph model will be for putting all this information properly into our knowledge graph, okay?

First of all, you will see chunk one and chunk two. These are related entities, right? And these are the entity types here: "five-day" is a duration, amoxicillin is a drug, and you will also see the relation here, okay? But I will not use this part. You can also relate all these entities using dynamic relationship types with APOC, a library of procedures for Neo4j. What am I doing here? First of all, I'm moving line by line. When I get chunk one, I create an NER node here, and for chunk two another NER node, and I relate them using the LINKS keyword, okay? And as you'll see, I also use the relation here. I just check what the type of the relation is between these two chunks, okay? I think creating this kind of function, I mean, to map the result of the NLP pipeline onto the knowledge graph, is the key point.
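
A mapping function along these lines could be sketched as follows. It is not the function from the notebook: the column names (`chunk1`, `entity1`, `chunk2`, `entity2`, `relation`), the sample rows and the `to_params` helper are assumptions. The `NER` label and the `LINKS` relationship follow the talk, and the extracted relation is stored as a property because plain Cypher cannot take a relationship type from a parameter.

```python
# Two NER nodes joined by a LINKS relationship, one statement for all rows.
QUERY = """
UNWIND $rows AS row
MERGE (a:NER {name: row.chunk1})
MERGE (b:NER {name: row.chunk2})
MERGE (a)-[r:LINKS]->(b)
SET a.type = row.entity1, b.type = row.entity2, r.relation = row.relation
"""


def to_params(rows):
    """Clean relation rows and wrap them as parameters for QUERY.

    Rows with an empty chunk are dropped, and repeated
    (chunk1, chunk2, relation) combinations are kept only once.
    """
    seen = set()
    cleaned = []
    for row in rows:
        chunk1 = row["chunk1"].strip()
        chunk2 = row["chunk2"].strip()
        if not chunk1 or not chunk2:
            continue
        key = (chunk1.lower(), chunk2.lower(), row["relation"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({
            "chunk1": chunk1,
            "entity1": row["entity1"],
            "chunk2": chunk2,
            "entity2": row["entity2"],
            "relation": row["relation"],
        })
    return {"rows": cleaned}


# Hypothetical rows shaped like a relation-extraction result table.
rows = [
    {"chunk1": "5-day", "entity1": "DURATION",
     "chunk2": "amoxicillin", "entity2": "DRUG", "relation": "DURATION-DRUG"},
    {"chunk1": "5-day ", "entity1": "DURATION",
     "chunk2": "amoxicillin", "entity2": "DRUG", "relation": "DURATION-DRUG"},
    {"chunk1": "", "entity1": "DOSAGE",
     "chunk2": "insulin", "entity2": "DRUG", "relation": "DOSAGE-DRUG"},
]
params = to_params(rows)
```

Using `MERGE` rather than `CREATE` means rerunning the load does not duplicate nodes or relationships.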

And I will also run this function. After that, first of all, I will drop all constraints and all nodes inside this sandbox, and I will create constraints on the NER nodes. After that, I will run my main function to put all the entities into our graph as NER nodes, okay? I'm sorry. So we are done: all our NERs and relations are stored in our knowledge graph here, in the graph database, okay? You can check it out here, and you'll see the relations and the entities. As you see, for this patient, she used metformin two times a day, 1000 milligram, okay? And you can also run some queries on this knowledge graph.

As you see, you can extract useful information from this knowledge graph. Maybe you can use these results in a Q&A application in NLP; it's a very hot topic now. And we will filter these NERs, I mean entities, just on Advil. You get this one. So we can say that the patient used Advil, I mean, 12 units, 40 units, for five days, and one unit, okay? I will skip this part quickly because we are running out of time, I think. And this is another part on creating a knowledge graph.

But this is the public one. Let me just run this part. I mean, you can walk through the components of this pipeline. Again, we have the document assembler, sentence detector, tokenizer, embeddings, NER tagger, and the NER chunker, POS tagger and dependency parser. And here are the components that differ from the first one: the typed dependency parser, okay? And the graph extraction; we will use graph extraction here. As I said before, this is the public one, so we will not use any licensed version of Spark NLP, okay? And again we have some helper functions, like the one that gets the graph results. I will create the graph and I will feed these texts to my knowledge graph, okay? You can check out the text here: some guys, where they were born, where they live now and where they work, and some other kinds of information about the related countries, okay? We get the results, and we also use our helper function to put all the results into our pandas DataFrame.

Okay, while it's running, let me show you the results, because I think it will take a little bit of time. I didn't use a LightPipeline here, to show you how the usual pipeline, the base pipeline, works. When you run all these code cells here, you get this kind of information: John Snow was born in England, John Snow lives in New York, Peter lives in New York, and so on and so forth, okay? And we will again run these helper functions to put all this graph information into our knowledge graph.

And as you see, the same function works. I didn't change anything in this function, because it works with this structure; it can work with the output of any kind of relation extraction model. As you see, it's the same kind of thing again: we just reproduce the same steps again and again, okay? Oh, it's done; you'll see the results here, okay? I think there's something off on the Colab side, because I couldn't see it on the other side.

It only shows if I click on the other side, and now you will see the results here. And also we will check from the other parts. These are all running, and this is also the same. I will just run this part. I'm sorry. I will drop all nodes, create the constraint on the NERs, and rerun this part to put everything into the knowledge graph here. As you know, we're using the same sandbox, so we will just erase everything, I mean, remove all the relationships and the NERs, and just create the NERs again in this knowledge graph, okay? You'll see the results here. Okay, I'm done. I hope everything is clear for everyone. – Fantastic, thank you. Thank you so much for your presentation. With the talk from Davide a bit earlier, we had a good look at the broader area of data analysis and data extraction, and the data exploration that comes after you get your data from the knowledge graphs.

And I think you showed us a couple of really interesting examples from the NLP area of how we can enrich data, use data and map it to a knowledge graph. We don't have much time for questions, but just one quick question: there is a free version of Spark NLP, right? – I'm sorry, I couldn't hear you well. Could you just repeat that again?

– Just a quick question: is Spark NLP free? – We have a public version and a healthcare version. The healthcare version is the licensed version; it's not free. But the public version is totally free, and you can use this graph extraction for non-clinical text, okay? I mean, this part is the example for the public version. – Okay, and then the healthcare version, which you've said is licensed, okay. – The public version is free. – Okay, fantastic, so thank you so much for your presentation.

Thanks as well for the live demo. I think this topic is actually relevant for almost every other presentation that we have, because we all need data from texts. It was nicely summarized in a presentation yesterday: we need it because so much information is just hidden in PDFs, in presentations, in text in general. So that was our last presentation for today, and for the entire workshop, of course. And like we announced a bit earlier, what we would like to do in our closing session is to get a couple of speakers from yesterday and from today back on stage and discuss with them a bit how we can now take the next steps, because we said at least 500,000 times that we want to do networking and collaborations.

Maybe share data, share analysis approaches, but how do we do that now? So what are our next steps? I would like to ask Alexander Jarasch, Rica Colaco, Peeyush Sahu and Sixing Huang back on stage to discuss a couple of ideas on how we can do that. The reason why we ask you is that you represent very different organizations. So we now have someone from a pharmaceutical company, someone fully engaged in academic research, and somebody from the data management perspective, so somehow sitting in between these two parts, and if the host joins, we even have a biotech and chemical company. So join us. Let's maybe start with a very, I think, very obvious question. We all have genes and proteins in our knowledge graphs, but we all have our own data loaders and data pipelines.

So what do you think? This is a practical and, I think, ideal suggestion: do you think it will be possible to agree on a kind of standard model for this super basic stuff, the basic biology, and actually share something like our data loading pipelines? Just shout, you're all live. – [Instructor] I think sharing is caring, right? And I think we should really talk about sharing knowledge and also sharing the way we deal with the same entities, as Martin mentioned, like genes, transcripts and proteins, as well as ontologies. I think the overlap is, let's say, 80 to 90%, and there are only some minor differences in the way we see genes, for example. I think it's pretty straightforward to work on a common data model for the molecular entities, for example. I guess that would be a good start, because we are all talking about the same genes and the same proteins. And if we look at it across species, it's the same thing, right? We are talking about genes in bacteria, genes in viruses and genes in humans.

And I think we could sit together and at least propose a graph model for these entities, so that we can reuse it and not reinvent the wheel at each and every company or academic institution. – Oh, sorry, please continue. – I completely agree with Alexander here. I mean, of course the basic idea is always the same; the central dogma is there. Of course, an organism can be less complicated. So if you start with humans, or even more complicated organisms, you can always go lower in terms of complexity, which will not harm the overall project.

And what I see in this workshop is that what is very important is, of course, bringing in knowledge, right? So after making this central dogma model, gene, transcript, protein and other things, I think there should be a standard way of showing relationships between these entities. And I think it was also mentioned already: how someone models a protein interaction network, how someone models a regulatory network. Because we would like, in the end, to get something out of it, right? So a standard way of modeling would really help, and would actually give us a head start if somebody wants to build something. – I think that really comes down, like Alexander mentioned, to the data model.

If we have one that can accommodate most of the general databases, UniProt, pathways, something like that, it can always be extracted as subgraphs and people can build from it. So I think it really comes down to what we think the basic common unit is for most people and how we can design that. – You built and presented a huge knowledge graph with a lot of applications and a lot of the things that Peeyush asked of the graphs. What do you think: do you need the flexibility to model and do everything in your own way? Or do you think the core can be reused? – Sorry, is the question for me? – It's for Sixing Huang. – It's okay, right now? – Yes, everybody can hear you. – So it is a basic question, but it's not straightforward to model it.

It's basically a gene, a transcript, a protein, because for the notion of gene, you can have your information from UniProt, NCBI, Ensembl, GENCODE and so on, which each have their own identifier system. So what we ended up with is that we have a gene entity coming from UniProt with cross-references to the same gene coming from NCBI, Ensembl and so on. And we do the same thing for the genes, the transcripts and the proteins. So in our graph, we don't have one big Gene entity with plenty of metadata; instead we split the notion of gene into gene nodes that come from different databases. And one thing we are thinking about concerns the data model.

You showed during the presentation yesterday that a gene can give a protein through a process of transcription and translation, but that's not always true. It depends on the cellular context, on the particular tissue, on a given step during development, and so on. More than that, I think in all of your graphs you have about 20,000 protein nodes, but there are plenty of protein isoforms, so the question of how we can model a complex object such as an isoform is also an important one, I think.

And yes, in plenty of databases, such as ChEMBL, we have the notion that a chemical can have an activity on a particular target. But when we look at the real scientific data that describes the assay, the target is not always a monomeric form; it can be an aggregated form of the protein, and so on. So really, the basic question of how we can model the genes, the transcripts and the proteins, which are the basic notions of biology, is not straightforward. – So with all this complexity that is needed in these specific projects, Jeremy, what do you think about Alex's suggestion to not reuse the data loading or data parsing, but to come up with ways to share subgraphs? I think the idea is to give a subgraph away, maybe with literature references, and there could be a way that I can just put this into my graph. So you do your data loading, however that works, and I just connect to a graph and pull the data. – Yeah, so in the beginning we used that for debugging purposes, because our database was way too big to debug on and to reinitialize when we broke it. So we came up with the idea of just exporting a subgraph, which can be only a part of the genes, a part of the proteins, a part of whatever.

And then you import that, let's say, on your local machine. But I could also see the same here. Let's say, if we have the same understanding of a gene, a transcript and a protein, like in textbooks or in any other scientific paper, I guess the graph model could be reused in other areas too, not only in healthcare but also in biotech, right? We had a very nice presentation by Clariant, who have a completely different use case, but we are talking about the same things. And I think you mentioned a really good point: post-translational modification would be one of the more complex things to represent for a protein, right? This is something we do not tackle so far, but it is a very important point. If you talk about active proteins or inactive proteins, that's a big difference for a protein, right? But if you have some expertise there, relationships and data integration processes on the protein entity could sit on top, let's say as another level in the graph model. It would be an interesting idea to discuss that, and maybe have several layers in one graph database, so that depending on your query or your application, we traverse one or the other subgraph, or only specific relationships.

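The idea of several layers in one graph, where a query traverses only specific relationships, can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the node names, the relationship types (`CODES`, `HAS_MODIFICATION`, `INTERACTS_WITH`) and the layer names are hypothetical, and a real graph database would express this in its own query language.

```python
from collections import deque

# Hypothetical relationships as (start, type, end) triples.
RELS = [
    ("gene:A", "CODES", "transcript:A1"),
    ("transcript:A1", "CODES", "protein:A1"),
    ("protein:A1", "HAS_MODIFICATION", "ptm:1"),
    ("protein:A1", "INTERACTS_WITH", "protein:B1"),
]

# Each layer is just the set of relationship types a query may follow.
LAYERS = {
    "core": {"CODES"},
    "ptm": {"CODES", "HAS_MODIFICATION"},
}


def reachable(start, allowed_types):
    """Breadth-first traversal that follows only allowed relationship types."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for source, rel_type, target in RELS:
            if source == node and rel_type in allowed_types and target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


core_view = reachable("gene:A", LAYERS["core"])
ptm_view = reachable("gene:A", LAYERS["ptm"])
```

With this layout the modification layer sits on top of the protein node: `core_view` never touches it, while `ptm_view` picks it up through the extra relationship type.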
So that could also be a possibility. – I agree that it is a complex topic in the end, when you are considering post-translational modification on proteins and epigenetics, and there are more and more things which you can add on to the graph. And like I said, of course the structure of the graph is always there. But adding on to it is something which we are currently working on at Clariant, because we would like to bring the sequence information into the graph. And of course we have the granularity that one gene results in multiple transcripts, which could result in multiple proteins, and variants can attach to different transcripts.

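The granularity described here (one gene, several transcripts, several proteins, variants attached at the transcript level) can be sketched as a small Python data model. The class names, field names and example values are hypothetical and are not the schema used at Clariant or in any other project mentioned in the discussion.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Protein:
    name: str


@dataclass
class Variant:
    name: str


@dataclass
class Transcript:
    name: str
    proteins: List[Protein] = field(default_factory=list)
    variants: List[Variant] = field(default_factory=list)


@dataclass
class Gene:
    name: str
    transcripts: List[Transcript] = field(default_factory=list)


def proteins_affected_by(gene, variant_name):
    """Names of proteins produced by transcripts that carry the variant."""
    return [
        protein.name
        for transcript in gene.transcripts
        if any(v.name == variant_name for v in transcript.variants)
        for protein in transcript.proteins
    ]


# Hypothetical example: the variant sits on the second transcript only.
gene = Gene("gene A", [
    Transcript("transcript A1", [Protein("protein A1")]),
    Transcript("transcript A2", [Protein("protein A2")], [Variant("variant 1")]),
])
affected = proteins_affected_by(gene, "variant 1")
```

Because the variant hangs off a transcript rather than the gene, the lookup reaches only the protein made from that transcript.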
So we have that kind of granularity already, but that is indeed a good point about modification, because we are incorporating proteomics data there, from which it could be identified whether a protein is methylated, or any kind of information on where the modification is. I think a very good idea, which we have discussed, is always creating a separate node which connects to that protein and gives the information there. So you have the information accessible, but it will not hinder if
you want to query something which is not that fine-grained. – So we identified a couple of ways and overlapping areas in our graphs where this makes sense, and also ways how we can engineer it. So the next thing to figure out is sharing. It should be simple for you, because you work for public research institutions, and you work for a pharma company and a chemical company.

So maybe you have some insights, some general ideas. How easy is it for you to share details of your graph, and how much do the business needs hold you back? – Basically, I have to ask internally if it's possible, but for me there are no problems, because it's by sharing that we are growing, basically. And it was the same question in all the talks: about refreshing data, about data ingestion and so on. We are currently using a Python framework, which is named, which is really interesting for separating the ingestion into different pipelines, each with a specific way of ingesting data. And for the graph data model, it's not an easy question. Is it better to try to create a unified knowledge graph that represents the biology and connects to domains such as chemicals and so on, or is it better to create a smaller graph that can answer specific questions?

Here at Servier, let's
say we go this way. So we have specific therapeutic questions. We are building a data model, and if the new data model can be incorporated easily into the bigger one, that's okay. But we can also see a separate graph to answer separate questions. – I can give my two cents here also, about sharing in the field. Of course, I'm coming from industry. It is a bit harder sometimes to share all the information, but this is something which we are looking forward to and looking into. I was having this talk with Frank, our head of group CB, computational biology. And this sounds like a good idea, because at least I know it from another project, where they have been in a consortium in which they give their ideas and share ideas.

So I think something like this could be possible, of course, unless it's sensitive information for the company. Everything else should be possible. I mean, we came here today and presented how we are working with our graphs and the idea of how the graph looks. So I think this is something which should be fine, but of course this has to be seen. Maybe we can create a
consortium around that. – Yeah, definitely. I hope that the data community might become a place where we can connect, share those ideas and discuss methods. So I would like to thank all four of you for coming back, taking the next step and discussing specific ideas on how we can share, what we can do and how we can approach that.

And I think we had a very, very good workshop here. We heard talks on lots of interesting topics, and obviously there is a lot of overlap. So, like I mentioned at the beginning, we definitely wanna turn this into a yearly event and continue. We will also follow up with a survey for our presenters and attendees, of course, because we wanna collect some input. How did you like the workshop? Are there any unmet needs, things we have to discuss? And I think for us the next step would be to figure out which area is interesting for you, GDS, international data sharing and so on, and see if we can provide more workshops and insights. So thanks again to all four of you, thanks to all the speakers, and thanks for listening. I think we had a great workshop.

Alex, do you wanna come in again? We had a fantastic workshop, so many interesting talks and so many interesting applications. I don't know what to say. – Yeah, no, I think, yeah. Thank you all again, from Neo4j, to all of you for speaking, sharing your experiences and sharing your work with the wider community. I think these two half days were really insightful. Lots of great presentations today. I think the topic was more on the knowledge graph side of things. Yesterday, it was about sharing, visualization and the connected data approaches, from Lifelike, for example. One session from yesterday that I still kept in mind until today was Alexander presenting the DZD knowledge graph, as well as others, Davide for example, on how to boost a knowledge graph and how to get more information out of it. So I think that was really good. I had a good time. I enjoyed it a lot. We saw a lot of questions, which is also always a good sign of people interacting.

So I hope you all enjoyed this time. Yeah, and like you said, Martin, I think we wanna try to keep doing this and do another session next year, maybe even on site somewhere if possible. That would be cool, to do even more sharing and exchanging, maybe have working groups or something like this. I don't know, but there could be ways to make this happen. So I would definitely like to see it again. – Yeah, absolutely, so I have nothing more to add. It was great. – All right, well then everybody, thank you again for joining, for watching, for presenting, and we will see you again at another session soon. – And all the recordings will be available in the text file, right? – Exactly, yes, yes. – Fantastic, thank you so much. – Bye. – Have a nice day.


