Neo4j data integration pipeline


We’ve been using Neo4j for around five years in a variety of projects, sometimes as the main database and sometimes as part of a larger platform. We find creating queries with Cypher intuitive and query performance to be good. However, integrating data into a graph is still a challenge, especially when combining data from many sources. Our latest project, EpiGraphDB, uses data from over 20 independent sources, most of which require cleaning and QC before they can be incorporated. In addition, each build of the graph needs to contain information on the versions of the data, the schema of the graph, and so on.

Most tutorials and guides focus on analysing an existing graph, not on how the graph was created. The process of bringing all the data together is often overlooked or assumed to be straightforward. We are keen to provide access and transparency across the entire process, and designed this pipeline to support our own projects, but believe it could be of use to others too.

Our data integration pipeline aims to create a working graph from raw data, whilst running checks on each data set and automating the build process. These checks include:

  • Data profiling reports with pandas-profiling to help identify issues in a data set
  • Comparing each node and relationship property against a defined schema
  • Merging overlapping node data into single node files
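The schema check above can be sketched as follows. This is a minimal illustration, assuming a simple in-memory schema dict; the actual pipeline defines its schema in its own configuration files, and the names here (`NODE_SCHEMA`, `check_node`, the `Gene` label) are hypothetical.

```python
# Hypothetical schema: expected properties and types per node label.
NODE_SCHEMA = {
    "Gene": {"id": str, "name": str, "chromosome": str},
}

def check_node(label, properties, schema=NODE_SCHEMA):
    """Return a list of problems found for one node against the schema."""
    problems = []
    expected = schema.get(label)
    if expected is None:
        return [f"unknown node label: {label}"]
    # Missing or mistyped properties
    for prop, prop_type in expected.items():
        if prop not in properties:
            problems.append(f"{label}: missing property '{prop}'")
        elif not isinstance(properties[prop], prop_type):
            problems.append(f"{label}: '{prop}' should be {prop_type.__name__}")
    # Properties not declared in the schema
    for prop in properties:
        if prop not in expected:
            problems.append(f"{label}: unexpected property '{prop}'")
    return problems

# A node missing its 'chromosome' property is flagged
print(check_node("Gene", {"id": "ENSG00000139618", "name": "BRCA2"}))
```

Running every node and relationship file through a check like this before the build catches schema drift early, rather than at import time.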

The data are formatted for use with the neo4j-admin import tool, as this keeps build time for large graphs reasonable. By creating this pipeline, we can provide complete provenance of a project, from raw data to finished graph.
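For context, neo4j-admin import expects the column layout to be declared separately from the data, using reserved fields such as `:ID` and `:LABEL`. The sketch below shows that layout with illustrative file names; the pipeline's actual output names and properties will differ.

```python
import csv

# Header file: property names plus the reserved :ID and :LABEL fields.
with open("genes-header.csv", "w", newline="") as f:
    csv.writer(f).writerow(["id:ID", "name", ":LABEL"])

# Data file: rows only, no header, matching the header's column order.
with open("genes.csv", "w", newline="") as f:
    csv.writer(f).writerow(["ENSG00000139618", "BRCA2", "Gene"])

# The graph is then built offline with something like:
#   neo4j-admin import --nodes=genes-header.csv,genes.csv ...
```

Keeping headers in separate files means the (large) data files never need rewriting when the schema changes, which is part of what keeps build times reasonable.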

The pipeline

The code and documentation for the pipeline are available in the neo4j-build-pipeline repository.

Below is a figure representing how this might fit into a production environment, with the pipeline running on a development server and shared data on a storage server.


The project comes with a set of test data that can be used to quickly demonstrate the pipeline and create a basic graph. This requires only a few steps, e.g.

# clone the repo (use https if necessary)
git clone
cd neo4j-build-pipeline

# create the conda environment
conda env create -f environment.yml
conda activate neo4j_build

# create a basic environment variable file for test data
# works ok for this test, but needs modifying for real use
cp example.env .env

# run the pipeline
snakemake -r all --cores 4


For a new project, the steps to create a graph from scratch are detailed in the pipeline documentation and proceed as follows:

  1. Create a set of source data.
    • These can be local to the graph or on an external server
    • Scripts that created them should be added to the code base
  2. Set up a local instance of the pipeline
  3. Create a graph schema
  4. Create processing scripts to read in raw data and modify it to match the schema
  5. Test the build steps of individual or all data files and visualise data summary
  6. Run the pipeline
    1. Raw data are checked against schema and processed to produce clean node and relationship CSV and header files
    2. Overlapping node data are merged
    3. Neo4j graph is created using neo4j-admin import
    4. Constraints and indices are added
    5. Clean data are copied back to specified location
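The node-merging step (6.2) can be sketched as below: nodes with the same ID arriving from different sources are collapsed into one record, here keeping the first non-empty value seen for each property. This is an illustrative simplification, not the pipeline's actual merge logic, and all names are hypothetical.

```python
def merge_nodes(sources):
    """Collapse node records that share an 'id' into single merged records.

    sources: iterable of dicts, each with an 'id' key.
    Returns a dict mapping id -> merged property dict.
    """
    merged = {}
    for node in sources:
        target = merged.setdefault(node["id"], {})
        for key, value in node.items():
            # First non-empty value wins; later sources only fill gaps.
            if value not in (None, "") and key not in target:
                target[key] = value
    return merged

# Two sources contribute different properties for the same gene
rows = [
    {"id": "ENSG00000139618", "name": "BRCA2"},
    {"id": "ENSG00000139618", "chromosome": "13"},
]
print(merge_nodes(rows))
```

A real implementation also has to decide what to do when sources disagree on a value; "first source wins" is just one policy.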

Future plans

We think the work we have done here may be of interest to others. If you would like to get involved in this project, we would love to collaborate and work together towards refining and publishing the method. Comments are also welcome.



Twitter: @elswob



Reducing drug development costs

This short animation explains how we use Mendelian randomization and colocalization to help prioritise drug targets. One of our aims in both programme 4 of the MRC IEU and the Integrative Cancer Epidemiology Programme is to integrate such prioritisations with other data to help inform drug development.

The animation is based on recent work by Dr Jie (Chris) Zheng, a Vice-Chancellor's Fellow in programme 4 of the MRC IEU, who recently published an innovative Mendelian randomization and colocalization study of plasma protein levels in Nature Genetics, demonstrating how genetic data can be used to support drug target prioritisation by identifying the causal effects of proteins on diseases.

Using a set of genetic epidemiology approaches, including Mendelian randomization and genetic colocalization, we built a causal network of 1002 plasma proteins on 225 human diseases. In doing so, we identified 111 putatively causal effects of 65 proteins on 52 diseases, covering a wide range of disease areas. The results of this study are accessible via EpiGraphDB. 



Exploring Elasticsearch architectures with Oracle Cloud

The IEU GWAS Database

The MRC Integrative Epidemiology Unit (MRC IEU) at the University of Bristol hosts the IEU GWAS Database, one of the world’s largest open collections of Genome-Wide Association Study (GWAS) data. As of April 2019, the database contains over 250 billion genetic association records from more than 20,000 analyses of human traits.

The IEU GWAS database underpins the IEU's flagship MR-Base analytical platform, which is used by researchers all over the world to carry out analyses that identify causal relationships between risk factors and diseases, and to prioritize potential drug targets. The use of MR-Base by hundreds of unique users per week generates a high volume of queries to the IEU GWAS database (typically 1–4 million queries per week).

Objectives of the Oracle Cloud/MRC IEU collaboration

Both the IEU GWAS database and the MR-Base platform are open to the entire scientific community at no cost, supporting rapid knowledge discovery, and open and collaborative science. However, the scale of the database and the volume of use by academics, major pharma companies and others, create significant compute resource challenges for a non-profit organization. The primary objective of the collaboration was to explore ways to use Oracle Cloud Infrastructure to improve database performance and efficiency.


Indexing 200 billion records in 2 days

Previously, we had successfully run GWAS on almost all of the UKBiobank traits. Our next job was to make these searchable at scale. This post explains how we did this and how you can access the data.


MR-Base is a platform for performing two-sample Mendelian randomization to infer causal relationships between phenotypes. In its early days, GWAS were manually curated and loaded into a database for use with the platform. Each GWAS (after QC) consists of a set number of columns (9) and a variable number of rows (typically in the millions; around 10 million for UKBiobank). This process quickly became problematic due to the large number of GWAS and increasing numbers of users. For those interested in how we successfully indexed over 20,000 GWAS, including each problem we encountered along the way, the details of the journey are below. For those who aren’t, you can skip to the end.
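At this scale, association records are typically loaded in batches rather than one document at a time. The sketch below shows one way a GWAS file's rows might be turned into bulk-index actions for Elasticsearch; the field names, index name, and `rows_to_actions` helper are assumptions for illustration, not the IEU's actual schema. The resulting actions could then be sent with a bulk helper such as `elasticsearch.helpers.bulk()`.

```python
def rows_to_actions(gwas_id, rows, index="gwas"):
    """Yield one bulk-index action per genetic association record.

    rows: iterable of dicts with (hypothetical) keys 'snp', 'beta', 'p'.
    """
    for row in rows:
        yield {
            "_index": index,
            "_source": {
                "gwas_id": gwas_id,            # which study the record belongs to
                "snp": row["snp"],             # variant identifier
                "beta": float(row["beta"]),    # effect size
                "p": float(row["p"]),          # association p-value
            },
        }

# One record from a hypothetical UKBiobank GWAS
actions = list(rows_to_actions("ukb-a-1", [{"snp": "rs123", "beta": "0.02", "p": "5e-8"}]))
```

Generating actions lazily like this keeps memory use flat even when a single GWAS contains millions of rows.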


Collaboratively analysing thousands of phenotypic traits using UK Biobank data, Google APIs and HPC

Around 18 months ago we began developing a platform to provide a collaborative environment for running GWAS using the UKBiobank data. To date we have completed 20,000 GWAS on automated phenotypes and 400 GWAS on curated phenotypes. These are being made available via the MR-Base platform with regular updates and an anticipated full release date of December 2018.