07 Dec Translating our Genomes into (hella) bytes of data
Insight by: Debbie Lin
As genomic sequencing becomes prevalent in our lives, each individual’s genetic response to disease and the environment will be uploaded to the cloud, not just once but perhaps many times, as snapshot copies of our genomes are analyzed to understand how they respond over time. That is a massive amount of stored data on genetic effects on human health. Here the term genomics is used broadly to refer to all “omics” information, such as RNA, DNA, and epigenetic data. At the end of November this year, Amazon debuted Amazon Omics, a new purpose-built service designed to support large-scale analysis and collaborative research. It helps healthcare and life science organizations store, query, and analyze genomic data and generate insights to advance health and scientific discovery. This foreshadows the tremendous resources in data, data infrastructure and analytics that will be needed to help us better predict health outcomes and better tailor treatments based on individual genetic profiles.
In 2010, Google adopted the term hella as an unofficial prefix, giving us the hellabyte, to extend the scale beyond the gigabyte, terabyte and yottabyte. It represented 10 to the power of 27 bytes. This unofficial term has recently been superseded. In November, representatives at the General Conference on Weights and Measures in Paris voted to adopt the new prefixes ronna and quetta, giving us the ronnabyte and the quettabyte. Ronna is now the official prefix for 10 to the power of 27, and quetta for 10 to the power of 30.
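To make the scale of these prefixes concrete, here is a minimal Python sketch that expresses a byte count using the largest decimal prefix that fits. The function name `describe_bytes` and the table layout are my own illustration, not part of any standard library.

```python
# Decimal (SI) prefixes as (symbol, power of ten).
# R (ronna) and Q (quetta) are the newly adopted prefixes discussed above.
SI_PREFIXES = [
    ("Q", 30), ("R", 27), ("Y", 24), ("Z", 21), ("E", 18),
    ("P", 15), ("T", 12), ("G", 9), ("M", 6), ("k", 3),
]


def describe_bytes(n_bytes: int) -> str:
    """Express a byte count using the largest decimal prefix that fits.

    Hypothetical helper for illustration only.
    """
    for symbol, exponent in SI_PREFIXES:
        if n_bytes >= 10 ** exponent:
            return f"{n_bytes / 10 ** exponent:g} {symbol}B"
    return f"{n_bytes} B"


print(describe_bytes(10 ** 27))       # what "hella" used to stand for
print(describe_bytes(200 * 10 ** 9))  # a 200 gigabyte file
```

The table is ordered from largest to smallest so that the first match is always the largest prefix that applies.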
As precision medicine and bioinformatics develop, the new prefixes will soon be put to good use. Precision medicine, most widely known today for profiling patients for cancer treatment or detecting infectious disease (e.g. SARS, COVID-19), is the use of genomic information taken from tissue or liquid samples (e.g. blood, saliva or even sweat) to help clinicians and researchers decide which treatments might be optimal based on an individual’s disease status. Liquid biopsy technologies, in which bodily fluids, primarily blood, are sampled to detect molecular markers of disease, are expanding. The ease of sending low-cost, less invasive, faster-to-process diagnostic kits out to patients for testing (e.g. for COVID-19) makes these technologies accessible across a wide range of patient populations. Thus far, such technologies have been used primarily in research and translational medicine settings. However, companies such as Guardant Health, Grail, Freenome and Caris Life Sciences, among many others, are racing to validate and launch products that will allow wide-scale adoption.
The data that comes from such products is already enormous and will grow exponentially once everyone can have their genomic information sequenced even more cheaply, easily and longitudinally. We need to think now about how this data will be stored, what infrastructure will process it, how it will be stitched together, and how it will be associated meaningfully with health outcomes. Bioinformatics companies are already positioning themselves to provide plug-and-play infrastructure that lets pharmaceutical, diagnostic and biotech companies store, process and analyze the data once it is acquired. Other companies, in the natural language processing space, focus on parsing physician notes and health records to extract clinically useful information, and still others on analysis and algorithm development so that clinicians and patients can make actionable decisions.
We are at the tip of the iceberg. To give you an idea of the potential scale: today, a single human genome sequence alone takes about 200 gigabytes of data. According to the National Human Genome Research Institute, we will need 40 exabytes (an exabyte is 10 to the power of 18 bytes) to store the genome sequence information of the world’s population by 2025. One can imagine that the amount of data will increase exponentially as our population grows, as individual access to genomic sequencing increases, and as sequencing becomes even more granular in depth and breadth. It took approximately thirteen years to sequence the first human genome. Today it takes approximately 3 days using next-generation sequencing, and sequencing technologies are getting even faster and cheaper.
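For a rough sense of what these numbers imply, here is a back-of-the-envelope sketch in Python that takes the two figures quoted above at face value. The world population of roughly 8 billion is my own assumption and does not come from the sources cited here, and the variable names are purely illustrative.

```python
# Figures quoted in the text above.
GENOME_BYTES = 200 * 10**9      # about 200 gigabytes per genome sequence
PROJECTED_BYTES = 40 * 10**18   # 40 exabytes projected by 2025

# Assumption (not from the article): roughly 8 billion people.
ASSUMED_POPULATION = 8 * 10**9

# How many 200 GB genomes fit into 40 exabytes?
genomes_in_projection = PROJECTED_BYTES // GENOME_BYTES

# How much storage would one 200 GB genome per person require?
bytes_for_everyone = GENOME_BYTES * ASSUMED_POPULATION

print(f"Genomes that fit in 40 EB: {genomes_in_projection:,}")
print(f"One genome per person: {bytes_for_everyone / 10**21:g} zettabytes")
print(f"Share of a ronnabyte: {bytes_for_everyone / 10**27:.1e}")
```

Under these assumptions, 40 exabytes holds about 200 million genomes at 200 gigabytes each, while a single 200 gigabyte snapshot for every person would need about 1.6 zettabytes. The two quoted figures therefore probably rest on different assumptions, for example about compression or about how many people are sequenced. Even the larger number is still a tiny fraction of a ronnabyte, which suggests how much repeated, longitudinal sequencing it would take to reach the newest prefixes.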
I recently attended a bioinformatics roundtable hosted by the Alliance for SoCal Innovation and San Diego State University, powered by Techstars. There, healthcare and tech leaders from Southern California gathered to discuss topics such as data privacy, the difficulty of integrating data across silos, the ineffectiveness of current data integration with electronic health records, data usability issues for clinicians and patients, healthcare delivery challenges in which clinicians’ time is consumed by data entry and analysis, and more. While there are many challenges to be solved, current projections show bioinformatics to be a growing market: it was $10.2B in 2021 and is projected to hit $39.7B globally by 2030. In the future, leaders who understand the challenges at the intersection of biology, medicine, healthcare and information technology, and who know how to lead a workforce that supports data infrastructure and analysis for massive amounts of biological and medical information, will be essential to our life sciences and healthcare ecosystem. For anyone interested in the intersection of informatics, data, healthcare and life sciences, this is a great area of opportunity to watch closely.