Genomics, the study of an individual's complete set of DNA - including all its genes and its hierarchical, three-dimensional structure - is making it possible to predict, diagnose, and treat diseases more precisely than ever before. Genomics research works with the DNA data of organisms, and because almost every cell in a person's body contains a complete copy of the genome, the field has become highly data-intensive.
“We have been observing an exponentially increasing trend in single-cell genomics data volume for many years. In the autumn of 2019, we started handling single-cell chromatin accessibility (scATAC-Seq) datasets and that’s when we realized that we needed to develop a solution for memory efficiency,” explained Parashar Dhapola, Ph.D. student and first author of the study.
As the treasure trove of DNA data generated by researchers pursuing new medical discoveries and breakthroughs continues to grow, scientists need powerful tools to study the genome. This is already true today, as researchers routinely need to integrate their data with previously published datasets, or at least compare against them. However, some of the richest datasets for such work are large-scale and have to be processed on supercomputing infrastructure.
So a team of Lund University researchers set out to develop a solution - a new, memory-efficient tool for single-cell genomic analysis called Single Cell Atlas Refreshed, or Scarf. Their software, which enables researchers to handle large-scale, single-cell genomics data, was published earlier this week in Nature Communications as an open access article, available to all.
By applying the latest innovations in data science and optimizing multiple sets of algorithms, Scarf helps ensure that single-cell genomics data can be analyzed with very high memory efficiency. “Scarf works to overcome a major bottleneck when handling large data - the computer’s memory, also known as RAM, while simultaneously making the tasks of comparison and integration of large-scale datasets easier,” said Parashar.
“This means that researchers can now analyze atlas-scale, or large datasets on their laptops, without needing dedicated servers to run their analysis. Also, Scarf provides a unique opportunity to try various data analysis parameters and run them in parallel when large servers are available,” noted Dr. Göran Karlsson, Principal Investigator and leader of the research group on Stem Cells and Leukemia at the Lund Stem Cell Center.
This new bioinformatics tool presents a unique opportunity for researchers to accelerate the research process and enhance the quality of their results. With Scarf in the hands of scientists around the world, answers to questions we all have been waiting for – and others not yet thought of – might be around the corner, according to Göran.
“The possibility of analyzing and integrating atlas-scale single-cell data independent of computational infrastructure means that many more scientists have the possibility to produce high-throughput single-cell data, or use and recycle publicly available datasets to answer new scientific questions. In our case, we have already extensively used Scarf for analysis of scRNA-Seq and scATAC-Seq datasets in another manuscript from the lab that is currently under revision,” he concluded.
Declaration of Interests
Parashar Dhapola and Göran Karlsson have submitted a patent application (No. 2051077-2) to the Swedish patent office (PRV). The application is under review and claims a patent on the part of the manuscript concerning the down/sub-sampling of cells (TopACeDo algorithm).
This study was supported by grants from the Swedish Cancer Society, The Ragnar Söderberg Foundation, the Knut and Alice Wallenberg Foundation, the Swedish Research Council, the Swedish Society for Medical Research, and the Swedish Childhood Cancer Foundation.
This study was conducted in collaboration with Thomas Bonald, Institut Polytechnique de Paris, Paris, France. Thomas is at the forefront of developing network-based data structures and is leading the development of the scikit-network package.