Insight by: Mamatha Shekar
Variety lends beauty, strength, and excitement to many things: food, flowers, plants, music, movie genres, weather, geography, and so on. In some areas, however, variety has made life complicated. In the electronic world, for example, don’t we all dream of a single (or adjustable) voltage standard, power plug, earphone jack, and audio-visual cable that would work across all electrical and digital appliances and across the world? In my opinion, the two biggest advantages of standardization are convenience for everyone, especially while traveling, and affordability (buy less and share) for people in emerging economies. Business opportunities may have driven this kind of variety. Or did we simply miss an opportunity to standardize, so that everyone could design and manufacture against established standards? With limited knowledge of this matter, I am not in a position to answer.
In the clinical world, the impact of non-standardization is far-reaching, going well beyond inconvenience and affordability. Non-standardization has slowed discovery, delaying the benefits of the data collected from biotechnologies such as genomics, proteomics, and metabolomics. Here I attempt to bring out the impact in the field of precision medicine, which today is mostly genomic medicine.
The draft human genome sequence was published in 2001. This, the largest scientific project of its time, took teams across the world around 15 years and 3 billion dollars. Since then, sequences of hundreds of thousands of complete human genomes have become publicly available (GEO, dbGaP, etc.), and hundreds of thousands more are privately owned. This influx of data has led to amazing discoveries: roughly 30,000 genes, 80,000 to 100,000 transcripts (different avatars of the same gene), the existence of haplotypes, one SNP every 500 to 600 bases, and more. The question is, is this good enough? Or, the bigger question: what more can be done with this data?
The basic but interesting statistics published so far have limited utility. To unlock the full power of the genome, genomic data needs to be linked to clinical data to unravel its relevance to the human body and to disease development. That understanding advances the development of new therapeutics. Having been among the first to attempt linking genomic data with clinical data over the last 15 years, I can confidently say it is a daunting effort. The core tenet of the big data platform we developed in the early 2000s was to aggregate hundreds of thousands of multi-omics datasets and integrate them with harmonized clinical data. Querying such a large aggregated dataset paves the way to understanding the common mechanisms, pathways, mutations, and regions of the genome involved in development and disease. Very soon into our effort to harmonize the clinical data, we were paralyzed by the three big Vs of clinical data: variety, velocity, and volume. The data received from pharma companies, premier institutes, and top-tier hospitals came in different flavors: unstructured, not harmonized, and so on.
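To make the idea of linking genomic and clinical data concrete, here is a minimal sketch, not the platform described above: per-sample variant calls are joined to harmonized clinical records so that mutations can be queried against a phenotype. The column names and example values are illustrative assumptions.

```python
import pandas as pd

# Illustrative genomic data: one row per variant call per sample (assumed schema).
variants = pd.DataFrame({
    "sample_id": ["S1", "S1", "S2", "S3"],
    "gene":      ["TP53", "BRCA1", "TP53", "EGFR"],
    "variant":   ["p.R175H", "c.68_69del", "p.R248Q", "p.L858R"],
})

# Illustrative clinical data: one row per sample, with the diagnosis already
# harmonized to a controlled vocabulary (assumed schema).
clinical = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3"],
    "diagnosis": ["lung adenocarcinoma", "lung adenocarcinoma", "lung adenocarcinoma"],
    "smoker":    [True, False, False],
})

# Integrate: link each variant to the clinical context of its sample.
linked = variants.merge(clinical, on="sample_id", how="inner")

# Query: which genes are recurrently mutated in non-smokers with this diagnosis?
hits = (
    linked[(linked["diagnosis"] == "lung adenocarcinoma") & (~linked["smoker"])]
    .groupby("gene")["sample_id"]
    .nunique()
    .sort_values(ascending=False)
)
print(hits)
```

The join only works because both sides share the same sample identifiers and the clinical terms are already harmonized; without that, no amount of querying recovers a comparable dataset.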
We started looking for a standard on which to build the framework for clinical data and tackle the variety, and stumbled across the term “ontology”. There were far too many accredited ontologies, developed by many scientific societies, all serving the same purpose; the field of standards suffered from the existence of too many standards. Realizing that even a widely accepted ontology was not sufficient on its own, we developed an ontology application that made it extensible in a controlled manner, with audit trails. It served as the single source of truth for all the big data applications built thereafter. We were following the FAIR principles well before they were published in 2016: we leveraged existing accredited biomedical ontologies where available and created new ones where they were not. This proved to be one of the best decisions and greatest investments we made, as it served as a solid foundation for all the other applications, and I would recommend it to anyone starting on the journey of health data capture and collection.
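The ontology application itself is not public, but the idea of controlled extensibility with an audit trail can be sketched roughly as below. The class names, fields, and term codes are hypothetical, not the actual application.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Term:
    code: str     # identifier from an accredited ontology, or a local extension code
    label: str
    source: str   # "accredited" for imported terms, "local" for controlled extensions

@dataclass
class OntologyRegistry:
    """Single source of truth: terms may only be added, never silently changed."""
    terms: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def add_term(self, term: Term, requested_by: str) -> None:
        if term.code in self.terms:
            raise ValueError(f"{term.code} already exists; changes require a new code")
        self.terms[term.code] = term
        # Every extension is recorded: what was added, by whom, and when.
        self.audit_log.append({
            "action": "add",
            "code": term.code,
            "source": term.source,
            "requested_by": requested_by,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

registry = OntologyRegistry()
registry.add_term(Term("ONT:0001", "Hypertensive disorder", "accredited"), "curator_a")
registry.add_term(Term("LOCAL:0001", "Study-specific arm label", "local"), "curator_b")
print(len(registry.audit_log), "audited changes")
```

The point of the design is that downstream applications never invent their own vocabulary; they consume terms from one registry, and every local extension is traceable.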
For interoperability, data integrators and adapters had to be built to address the variety and convert it to a single standard. It was an enormous effort: roughly 60% of a data scientist’s time was spent preparing data. Nevertheless, this big data platform was used to break down data silos within the organization and make the data widely available, so that scientists and clinicians could test their hypotheses and make new discoveries at a faster pace.
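As a rough illustration of what such adapters do (the field names and source formats below are invented, not those of the actual platform), each incoming flavor of data gets its own small mapping into one common schema:

```python
from datetime import date

# Common target schema: every adapter must emit exactly these keys.
TARGET_FIELDS = ("patient_id", "sex", "age_years", "diagnosis")

def adapt_hospital_a(record: dict) -> dict:
    """Hypothetical hospital feed: age in years, sex coded 'M'/'F'."""
    return {
        "patient_id": record["mrn"],
        "sex": {"M": "male", "F": "female"}.get(record["sex"], "unknown"),
        "age_years": int(record["age"]),
        "diagnosis": record["dx"].strip().lower(),
    }

def adapt_pharma_b(record: dict) -> dict:
    """Hypothetical pharma feed: birth year and free-text gender."""
    return {
        "patient_id": record["subject"],
        "sex": record["gender"].strip().lower(),
        "age_years": date.today().year - int(record["birth_year"]),
        "diagnosis": record["condition"].strip().lower(),
    }

raw_a = {"mrn": "A-001", "sex": "F", "age": "54", "dx": "Lung Adenocarcinoma "}
raw_b = {"subject": "B-17", "gender": "Female", "birth_year": "1970",
         "condition": "lung adenocarcinoma"}

harmonized = [adapt_hospital_a(raw_a), adapt_pharma_b(raw_b)]
assert all(set(row) == set(TARGET_FIELDS) for row in harmonized)
print(harmonized)
```

Writing and maintaining one such mapping per source is exactly where the 60% of preparation time goes; shared standards would make most of these adapters unnecessary.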
What would it take to reduce, or avoid entirely, similar challenges in the mobile health applications now being developed to collect our health data? Standards need to be defined, developed, and published, and mobile health applications have to be built using those standards. For example, mobile applications that track bowel movements would use the Bristol Stool Chart for users to record entries; this is not the case today. Standards like this render the data collected from different applications interoperable, and interoperability allows the data to be integrated into a meaningful, comprehensive dataset. Such a large dataset will empower AI and ML; otherwise the result is junk in, junk out. No AI can help if the data is not harmonized.
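A hedged sketch of what “built using the standard” could mean in practice: if every app records a bowel-movement entry as a Bristol type 1–7 (the enum labels below are abbreviated paraphrases of the chart, and the record fields are assumptions), entries from different apps can be pooled directly.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import IntEnum

class BristolType(IntEnum):
    """Bristol Stool Chart types 1-7 (labels abbreviated)."""
    SEPARATE_HARD_LUMPS = 1
    LUMPY_SAUSAGE = 2
    CRACKED_SAUSAGE = 3
    SMOOTH_SAUSAGE = 4
    SOFT_BLOBS = 5
    MUSHY_PIECES = 6
    ENTIRELY_LIQUID = 7

@dataclass
class BowelMovementEntry:
    user_id: str
    recorded_at: datetime
    bristol_type: BristolType   # the shared standard, not app-specific free text

# Entries captured by two different hypothetical apps remain directly comparable.
entries = [
    BowelMovementEntry("user-1", datetime(2024, 5, 1, 8, 30), BristolType(4)),
    BowelMovementEntry("user-2", datetime(2024, 5, 1, 9, 10), BristolType.SOFT_BLOBS),
]
print([e.bristol_type.name for e in entries])
```

Because both apps record against the same seven-point scale instead of their own wording, the pooled data needs no adapter layer before it can feed analysis or AI and ML models.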