About CaVa

CaVa data platform overview

The CaVa data platform, together with its associated research governance framework, is an effort to streamline research into unwarranted variation in cancer treatment and outcomes. The primary measure of its success is its ability to improve the efficiency and depth of cancer health services research across NSW.

Currently, researchers who wish to access the wealth of data held in clinical cancer care information systems must apply to multiple Human Research Ethics Committees (HRECs) before they can even determine whether the research is feasible with the available data. Once the necessary approvals are granted and feasibility is established, custom data extracts must be generated from each source data system, a highly manual and costly process. Researchers then embark on the arduous task of data cleaning and preparation. Furthermore, these analyses cannot be repeated in other comparable datasets without a significant harmonisation phase, which may be infeasible in the face of undocumented, ad hoc processes.

The CaVa platform aims to address these issues in two ways.

  1. Researcher-ready data

Extracting all historical clinical data with potential research utility from the primary oncology information management systems in two participating local health districts (LHDs) into a shared, secure, remote-access research repository makes it possible to bring the data under a single unified governance model. It also decouples the extraction work from the core IT functionality of the LHDs, so that automated mechanisms can periodically update the platform with new and changed data.

These data will be harmonised and normalised, with additional pertinent research items extracted via a validated, repeatable natural language processing mechanism.
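As an illustration only, harmonising records from two source systems into a shared schema might look like the sketch below. The site labels, field names, date formats and code mappings are all hypothetical and are not taken from the CaVa codebase or from either LHD's systems.

```python
from datetime import datetime

# Hypothetical per-source field mappings: source column -> shared column.
FIELD_MAPS = {
    "site_a": {"pt_id": "patient_id", "dx_date": "diagnosis_date", "sex": "sex"},
    "site_b": {"PatientID": "patient_id", "DiagnosisDt": "diagnosis_date", "Gender": "sex"},
}

# Hypothetical per-source date formats.
DATE_FORMATS = {"site_a": "%Y-%m-%d", "site_b": "%d/%m/%Y"}

# Hypothetical normalisation of coded values to one vocabulary.
SEX_CODES = {"m": "male", "male": "male", "f": "female", "female": "female"}


def harmonise(record: dict, source: str) -> dict:
    """Rename fields to the shared schema and normalise their values."""
    mapping = FIELD_MAPS[source]
    out = {shared: record[raw] for raw, shared in mapping.items()}
    # Parse the source-specific date string into a date object.
    out["diagnosis_date"] = datetime.strptime(
        out["diagnosis_date"], DATE_FORMATS[source]
    ).date()
    # Map free-form codes to a single vocabulary, defaulting to "unknown".
    out["sex"] = SEX_CODES.get(str(out["sex"]).strip().lower(), "unknown")
    # Keep provenance so downstream analyses can stratify by source.
    out["source"] = source
    return out
```

Keeping the mappings in data rather than in code is one way to make the harmonisation step documented and repeatable when a new source system is added.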

  2. Data-ready research tools

Although every research question and associated data analysis must be shaped to suit its target, there are many core steps that can be generalised in such a way that they can be offered as packaged, parameterised tools.

CaVa will provide templates that researchers can use to bootstrap their analyses, so they can focus their time on the unique aspects of their research. This infrastructure uses the cloud-based ERICA secure data analysis environment to further streamline the research process.
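To make the idea of a packaged, parameterised tool concrete, the sketch below separates the parameters a researcher would set from a generic cohort-selection step. It is a minimal illustration under assumed names: `CohortParams`, `cohort_counts` and the record fields are hypothetical and do not describe the actual CaVa templates or the ERICA environment.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CohortParams:
    """Hypothetical parameters a researcher sets for their own question."""
    cancer_site: str
    diagnosed_from: date
    diagnosed_to: date
    group_by: str = "source"


def cohort_counts(records: list[dict], params: CohortParams) -> Counter:
    """Generic step: select the cohort, then count records per group."""
    cohort = [
        r for r in records
        if r["cancer_site"] == params.cancer_site
        and params.diagnosed_from <= r["diagnosis_date"] <= params.diagnosed_to
    ]
    return Counter(r[params.group_by] for r in cohort)
```

In this arrangement, changing the research question means changing only the `CohortParams` values, while the selection and counting logic stays fixed and can be reused across datasets that share the same schema.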


If you see mistakes or want to suggest changes, please create an issue on the source repository.


Text and figures are licensed under Creative Commons Attribution CC BY-NC-SA 4.0. Source code is available at https://github.com/AustralianCancerDataNetwork/cava, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".