Human Genome Informatics

The Australian BioCommons is committed to ensuring that the infrastructure implemented in Australia for human genome data warehousing, sharing and analysis adheres to global best practice standards.

This work will benefit human health and medicine, and is in alignment with our mission to actively support life science research communities with community scale digital infrastructure that is developed and maintained in concert with international peer infrastructures.

Context

Affordable DNA sequencing at scale has enabled the genomes of hundreds of thousands of people across the world to be sequenced. This has led to a better understanding of the causes of complex diseases, better diagnosis and earlier disease detection, and more options for tailoring treatment.

To achieve these outcomes, genomic information from one individual needs to be compared with many other genomes from similar cases, forming cohorts large enough to produce statistically meaningful results. This is often done across multiple efforts and jurisdictions, at a national or global scale, and requires the genomic data to be findable, searchable, shareable, and linkable to analytical capabilities.

Due to the sensitive nature of genomic information, the privacy of individuals must always be protected, and any data processing must always be done ethically, securely and safely.

How can sharing human genome data help accelerate research into understanding disease causes, treatments and prevention?

**Cancers of Unknown Primary (CUP)** are metastatic cancers where the cancer’s tissue of origin remains unknown despite extensive investigation. There are 2,700 new cases of CUP in Australia annually, and the lack of a definitive primary site can restrict treatment options: patients have a five-year survival rate of only 14%. Researchers seek to develop new techniques that can help identify the primary site.
Genomic sequencing of CUPs uses the molecular fingerprint of the cancer to identify a likely primary tumour site/type. For the most accurate fingerprinting, the CUP’s genome sequence must be compared against very large numbers of genome sequences from other already diagnosed cancers.

**Childhood cancers** are a heterogeneous collection of rare diseases and the leading cause of disease-related mortality in Australian children.
Due to the relatively low prevalence of childhood cancers, matched data for Australian children is likely to be from cases diagnosed elsewhere, and national/international data discoverability and sharing of matched genomic datasets is critical to accelerate research into understanding causes, treatments and prevention.

**Rare diseases** each affect a small percentage of the population. There are a large number of rare genetic diseases (an estimated 10,000) and globally 350 million people live with a rare disease; however, because each specific disease is so rare, affected individuals are often undiagnosed. This limits treatment options and can affect mental health, since undiagnosed individuals are unable to identify with a group of people who share their condition.
Genomic sequencing of undiagnosed cases can be used to determine the molecular fingerprint of these individuals, which can aid in diagnosis. For the most accurate genetic diagnoses, the sequence must be compared against sequences from relevant similar cases, including those that have been diagnosed.

Human genome analysis across Australia: scale and challenges

Large-scale human genome sequencing and analysis efforts in Australia include those undertaken by ZERO Childhood Cancer, Australian Genomics, the University of Melbourne Centre for Cancer Research (UMCCR), the Garvan Institute of Medical Research and QIMR-Berghofer Medical Research Institute.

As of Q3 2020, these and other groups across Australia have sequenced and analysed the genomes of tens of thousands of people. Thanks to the Federal Government's recent investment of $500M over 10 years for a Genomics Health Futures Mission to support new and expanded studies in rare disease, cancer, and complex conditions, this number is predicted to increase more than 10-fold by 2025.

To date, human genome sequencing and analysis efforts across Australia have developed in-house solutions, based on different technologies, for storing/warehousing genome data and describing the content of these collections. Most rely on largely manual, laborious systems for managing data and providing access to bona fide researchers. The content of each collection is largely not transparent to outside users. Although there is a desire to share data wherever possible for research use, most efforts have no efficient way to expose the collection content to researchers or to distribute the data, so sharing currently imposes a substantial burden. All need to operate scalable infrastructure that is easily administered and that allows for the efficient management of data files and metadata. This management needs to include storage, security, access control, findability and shareability with relevant authorised parties.

World’s best practice infrastructure enables faster and easier human genome research

Much work is being done globally to build for a future where responsible genomic data sharing for the benefit of human health will be routine.

This includes the groundbreaking efforts of the Global Alliance for Genomics and Health (GA4GH) to create frameworks, policies and standards that can be deployed by genome efforts around the world to enable the responsible, voluntary, and secure sharing of genomic and health-related data at scale. An animation from GA4GH succinctly explains GA4GH’s goals and mission (credit: SciAni).

Other significant global efforts to build human genome data sharing infrastructure include that of the US National Institutes of Health (NIH) to develop Gen3, a cloud-based software platform for managing, analysing, harmonising, and sharing large human genomic datasets. Gen3 underpins several very large NIH-funded genomic data resources that collectively house and describe data derived from hundreds of thousands of human samples (e.g. NCI Genomic Data Commons, BioData Catalyst, BloodPAC, BrainCommons).

Additionally, the European Genome-phenome Archive (EGA), the global repository of human genomes for research purposes, continues to be developed into a federated and globally distributed resource (Federated EGA). This is building towards a future where data assets remain securely in one jurisdictional location but will be findable and ultimately analysable in situ by others in other jurisdictions (i.e. by moving compute to the data).

Establishing appropriate infrastructure for human genome sharing and analysis in Australia

The Australian BioCommons Human Genome Informatics initiative is working towards establishing infrastructure for human genome data warehousing, sharing and analysis in Australia that adheres to various global best practice standards, and building the necessary foundations so that Australia can participate fully in the global ecosystem of responsible human genomics data analysis.

The aims of the initiative are:

To build for a future where Australian human genomics and health data is stored in a global federated network of public clouds, and to enable a smooth process and remove friction and artificial barriers between researchers and insights they can glean from the data.
To support Australian human genomics sequencing and analysis efforts to deploy and operate scalable and globally compatible infrastructure that is easily administered and allows for the efficient management of data files and metadata (storage, security, access control, findability and shareability).
To support the ZERO and AGHA Flagships, through enabling infrastructure to: (a) share data within their consortia, (b) share data to build virtual cohorts internationally, and (c) enable collaborative analysis of these data, nationally and internationally.

These aims will be achieved by working with a range of research partners (including ZERO, AGHA, UMCCR, Garvan, QIMR-Berghofer and others), multiple infrastructure partners (including AAF, NCI, AARNet and others), as well as expert international groups establishing relevant systems (including GA4GH, the developers of Gen3, the ELIXIR Federated Human Data Community, Children’s Hospital of Philadelphia D3b, Seven Bridges Genomics and others).

The impact of this work will be that genomic data from thousands of Australians will be able to be shared securely and responsibly on national and global scales, enabling comparison with very large numbers of other genomes to ensure their full research value can be realised.

The GA4GH animation (credit: SciAni) explains much of this global human data sharing vision, and how the adoption of global standards (such as those for data sharing and security developed by GA4GH) can make this happen.

Activity areas

Systems to support virtual cohort assembly (underpinned by Gen3 technology)

User facing (public) interfaces to enable querying data held in participating genome repositories
Common data dictionaries and agreed minimum information standards applied across participating genome repositories
Systems to enable identification of virtual cohorts across multiple participating genome repositories
Interfacing with secure sequence file data storage at each genomics data repository
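
As a rough illustration of the virtual cohort idea in the list above, the following Python sketch counts matching records across several repositories that share a common data dictionary. Everything in it is hypothetical: the repository names, the field names (`diagnosis`, `age_at_diagnosis`) and the in-memory records stand in for real services, and it does not use the Gen3 API. Only aggregate counts leave each repository, in keeping with the privacy requirement described earlier.

```python
from dataclasses import dataclass

# Hypothetical shared data dictionary: every participating repository
# describes its records with the same field names.
DATA_DICTIONARY = {"diagnosis", "age_at_diagnosis"}


@dataclass
class Repository:
    name: str
    records: list  # list of dicts; stands in for a real metadata store

    def count_matching(self, criteria: dict) -> int:
        """Return only an aggregate count, never record-level data."""
        unknown = set(criteria) - DATA_DICTIONARY
        if unknown:
            raise ValueError(f"fields not in data dictionary: {sorted(unknown)}")
        return sum(
            all(rec.get(k) == v for k, v in criteria.items())
            for rec in self.records
        )


def virtual_cohort_size(repos, criteria):
    """Combine per-repository counts into a virtual cohort summary."""
    per_repo = {r.name: r.count_matching(criteria) for r in repos}
    return per_repo, sum(per_repo.values())


# Illustrative data only.
repos = [
    Repository("repo_a", [{"diagnosis": "CUP", "age_at_diagnosis": 61},
                          {"diagnosis": "neuroblastoma", "age_at_diagnosis": 4}]),
    Repository("repo_b", [{"diagnosis": "CUP", "age_at_diagnosis": 58}]),
]
per_repo, total = virtual_cohort_size(repos, {"diagnosis": "CUP"})
```

Rejecting criteria that fall outside the shared dictionary is one way to show why agreed minimum information standards matter: a query is only meaningful if every repository interprets its fields the same way.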

Providing/expediting safe and secure access to genomics data

Deploying systems to semi-automate User Approvals by Data Access Committees (e.g. DUOS, REMS)
User Authentication (AuthN) and Authorisation (AuthZ) systems, with assurance levels appropriate for human genome data (e.g. GA4GH Passports and GA4GH Authentication and Authorization Infrastructure [AAI])
Systems to:
- Send approved data to approved users
- Provide access to approved data through association with approved user’s cloud-based storage
- Move compute to the approved dataset(s)
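
To make "semi-automated" approval more concrete, here is a minimal Python sketch of an automatic pre-screen that runs before a Data Access Committee sees a request. It is an assumption-laden illustration, not how DUOS, REMS or GA4GH Passports work internally: the policy labels (`general_research`, `disease_specific`), the class names and the `pre_screen` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetPolicy:
    dataset_id: str
    permitted_use: str              # hypothetical labels: "general_research" or "disease_specific"
    disease: Optional[str] = None   # only meaningful for "disease_specific"


@dataclass(frozen=True)
class AccessRequest:
    user_id: str
    dataset_id: str
    research_disease: str
    authenticated: bool             # result of AuthN by an identity provider


def pre_screen(policy: DatasetPolicy, request: AccessRequest) -> str:
    """Return 'reject' for clear mismatches, otherwise 'refer' to the committee."""
    if not request.authenticated or request.dataset_id != policy.dataset_id:
        return "reject"
    if (policy.permitted_use == "disease_specific"
            and policy.disease != request.research_disease):
        return "reject"
    # The final decision stays with people; the code only filters obvious mismatches.
    return "refer"


# Illustrative data only.
policy = DatasetPolicy("ds-001", "disease_specific", "childhood cancer")
matching = AccessRequest("user-1", "ds-001", "childhood cancer", True)
mismatched = AccessRequest("user-2", "ds-001", "diabetes", True)
unauthenticated = AccessRequest("user-3", "ds-001", "childhood cancer", False)
decisions = [pre_screen(policy, r) for r in (matching, mismatched, unauthenticated)]
```

The sketch deliberately never returns "approve": automation reduces the committee's workload, while the approval itself remains a human decision.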

Providing access to connected Cloud analysis platform(s) by:

Linking cloud-based data analysis platform(s) (e.g. Cavatica, Illumina Analytics Platform, Terra) to approved data
Deployment of globally harmonised analysis pipelines
Globally federated compute

File and Metadata submission to International EGA Human Genome Data Repository

Ensuring participating repositories can easily produce structured phenotype information that includes the metadata required by EGA
Systems to automate or semi-automate a production feed of metadata in the format required by EGA
Systems for streamlined encryption and uploading of genome files to the EGA repository (be it Central or Local)
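
A semi-automated metadata feed of the kind listed above might begin with a completeness check, sketched below in Python. The required field names are placeholders chosen for illustration; the actual required metadata and format are defined by EGA's own submission documentation, which this sketch does not reproduce.

```python
import json

# Hypothetical required fields, for illustration only.
REQUIRED = ("sample_id", "phenotype", "sex")


def to_submission_feed(records):
    """Split records into JSON lines ready to send and a list of problems."""
    ready, problems = [], []
    for rec in records:
        missing = [f for f in REQUIRED if not rec.get(f)]
        if missing:
            # Incomplete records go back to a curator instead of being sent.
            problems.append((rec.get("sample_id", "<unknown>"), missing))
        else:
            ready.append(json.dumps({f: rec[f] for f in REQUIRED}, sort_keys=True))
    return ready, problems


# Illustrative data only.
records = [
    {"sample_id": "S1", "phenotype": "CUP", "sex": "female"},
    {"sample_id": "S2", "phenotype": ""},
]
ready, problems = to_submission_feed(records)
```

Complete records flow through automatically, while incomplete ones are returned with the names of the missing fields, which is the "semi-automated" part.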

Exploring Local EGA Node(s) in Australia

Study to assess the Local EGA and the feasibility of Local EGA node deployment(s) in Australia from a technical, policy and funding perspective

Community Engagement and Workforce Transition

Resources (including Documentation, Training Materials and Events) to enable:
- Researchers and Clinicians to use the systems
- IT infrastructure providers elsewhere to deploy the systems