Exploring Elasticsearch architectures with Oracle Cloud

The IEU GWAS Database

The MRC Integrative Epidemiology Unit (MRC IEU) at the University of Bristol hosts the IEU GWAS Database, one of the world’s largest open collections of Genome-Wide Association Study (GWAS) data. As of April 2019, the database contains over 250 billion genetic association records from more than 20,000 analyses of human traits.

The IEU GWAS database underpins the IEU's flagship MR-Base analytical platform (www.mrbase.org), which is used by people all over the world to carry out analyses that identify causal relationships between risk factors and diseases and to prioritize potential drug targets. Hundreds of unique users work with MR-Base each week, generating a high volume of queries to the IEU GWAS database (typically 1-4 million queries per week).

Objectives of the Oracle Cloud/MRC IEU collaboration

Both the IEU GWAS database and the MR-Base platform are open to the entire scientific community at no cost, supporting rapid knowledge discovery and open, collaborative science. However, the scale of the database and the volume of use by academics, major pharma companies and others create significant compute resource challenges for a non-profit organization. The primary objective of the collaboration was to explore ways to use Oracle Cloud Infrastructure to improve database performance and efficiency.


Indexing 200 billion records in 2 days

Previously we had successfully run GWAS on almost all of the UKBiobank traits (https://ieup4.blogs.bristol.ac.uk/2018/10/01/ukb_gwas/). Our next job was to make these searchable at scale. This post explains how we have done this and how you can access the data.


MR-Base (http://www.mrbase.org/) is a platform for performing 2-sample Mendelian Randomization to infer causal relationships between phenotypes. Initially, GWAS were manually curated and loaded into a database for use with the platform. Each GWAS (after QC) consists of a fixed number of columns (9) and a variable number of rows (typically in the millions; around 10 million for UKBiobank). This process quickly became problematic as the number of GWAS and the number of users grew. For those interested in how we managed to index over 20,000 GWAS, including each problem we encountered along the way, the details of the journey are below. For those who aren’t, you can skip to the end.
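As a rough sanity check on the scale involved, the figures above can be turned into a required indexing rate. This is only a back-of-the-envelope sketch: it assumes every GWAS has about 10 million rows (the UKBiobank figure) and that indexing runs continuously for the full 2 days, neither of which will hold exactly in practice.

```python
# Back-of-the-envelope estimate of the sustained indexing rate implied by
# the figures quoted in this post.
# Assumption: every GWAS has about 10 million rows (the UKBiobank figure);
# real row counts vary from study to study.
GWAS_COUNT = 20_000
ROWS_PER_GWAS = 10_000_000  # assumed uniform across all GWAS
INDEXING_DAYS = 2           # assumed continuous, with no downtime

total_records = GWAS_COUNT * ROWS_PER_GWAS
seconds = INDEXING_DAYS * 24 * 60 * 60
records_per_second = total_records / seconds

print(f"total records: {total_records:,}")
print(f"required rate: {records_per_second:,.0f} records/second")
```

Under these assumptions the total comes to 200 billion records, which means sustaining a rate of over a million records per second across the whole cluster.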


Collaboratively analysing thousands of phenotypic traits using UK Biobank data, Google APIs and HPC

Around 18 months ago we began developing a platform to provide a collaborative environment for running GWAS using the UKBiobank data. To date we have completed 20,000 GWAS on automated phenotypes and 400 GWAS on curated phenotypes. These are being made available via the MR-Base platform with regular updates and an anticipated full release date of December 2018.