Introducing Scarf: a memory efficient solution for single-cell genomic analysis

Graphical illustration of Scarf.

A team of researchers from Lund University and the Lund Stem Cell Center has developed a new memory-efficient tool for single-cell genomic analysis called Scarf. Now published in Nature Communications, this innovative bioinformatics software has the potential to help researchers navigate a growing treasure trove of data and set them on the path to answering new scientific questions related to human health.

Genomics, the study of an individual's complete set of DNA - including all its genes and its hierarchical, three-dimensional structural configuration - is making it possible to predict, diagnose, and treat diseases more precisely than ever before. The raw material of genomics research is the DNA of organisms, and because almost every cell in a person's body contains a complete copy of the genome, genomics has become a data-intensive field.

“We have been observing an exponentially increasing trend in single-cell genomics data volume for many years. In the autumn of 2019, we started handling single-cell chromatin accessibility (scATAC-Seq) datasets and that’s when we realized that we needed to develop a solution for memory efficiency,” explained Parashar Dhapola, Ph.D. student and first author of the study. 

As the treasure trove of DNA data generated by researchers pursuing new medical discoveries and breakthroughs continues to grow, scientists need powerful tools to study the genome. This is already true today as researchers routinely need to integrate their data with previously published datasets or at least compare against them. However, some of the richest datasets for such endeavors are large-scale and processed using supercomputing infrastructures. 

So the team of Lund University researchers set out to develop a solution - a new, memory-efficient tool for single-cell genomic analysis called Single Cell Atlas Refreshed, or Scarf. Their software, which can enable researchers to handle large-scale, single-cell genomics data, was published earlier this week in Nature Communications as an open access article, available to all.

From left: Parashar Dhapola and Göran Karlsson. Photo credit: Isak Simonsson.

By applying the latest innovations in data science and optimizing multiple sets of algorithms, Scarf helps ensure that single-cell genomics data is analyzed with very high memory efficiency. “Scarf works to overcome a major bottleneck when handling large data - the computer’s memory, also known as RAM - while simultaneously making the tasks of comparison and integration of large-scale datasets easier,” described Parashar.

“This means that researchers can now analyze atlas-scale, or large, datasets on their laptops, without needing dedicated servers to run their analysis. Also, Scarf provides a unique opportunity to try various data analysis parameters and run them in parallel when large servers are available,” noted Dr. Göran Karlsson, Principal Investigator and leader of the research group on Stem Cells and Leukemia at the Lund Stem Cell Center.
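
For readers curious about what memory-efficient analysis can look like in practice, here is a minimal Python sketch of the general idea of processing data in small chunks, so that only a few cells sit in RAM at any moment. It is not Scarf's actual code or API: the function names (`iter_chunks`, `per_gene_mean`, `simulated_cells`) and the tiny three-cell dataset are hypothetical and serve only as an illustration.

```python
from typing import Iterable, Iterator, List


def iter_chunks(rows: Iterable[List[float]], chunk_size: int) -> Iterator[List[List[float]]]:
    """Yield successive blocks of at most `chunk_size` rows (one row per cell)."""
    chunk: List[List[float]] = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # final, possibly smaller, block
        yield chunk


def per_gene_mean(rows: Iterable[List[float]], n_genes: int, chunk_size: int = 2) -> List[float]:
    """Mean value per gene (column), keeping only one chunk of cells in memory."""
    totals = [0.0] * n_genes
    n_cells = 0
    for chunk in iter_chunks(rows, chunk_size):
        for row in chunk:
            for j, value in enumerate(row):
                totals[j] += value
        n_cells += len(chunk)
    if n_cells == 0:
        raise ValueError("no cells provided")
    return [total / n_cells for total in totals]


def simulated_cells() -> Iterator[List[float]]:
    """Hypothetical stand-in for cells read lazily from a file on disk."""
    yield [1.0, 0.0, 2.0]
    yield [3.0, 0.0, 2.0]
    yield [2.0, 3.0, 2.0]


means = per_gene_mean(simulated_cells(), n_genes=3, chunk_size=2)
print(means)  # [2.0, 1.0, 2.0]
```

Because the cells arrive from a generator and are consumed block by block, peak memory use depends on the chunk size rather than on the total number of cells, which is the basic reason a chunked approach can bring very large datasets within reach of a laptop.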

This new bioinformatics tool presents a unique opportunity for researchers to accelerate the research process and enhance the quality of their results. With Scarf in the hands of scientists around the world, answers to questions we all have been waiting for – and others not yet thought of – might be around the corner, according to Göran. 

“The possibility of analyzing and integrating atlas-scale single-cell data independent of computational infrastructure means that many more scientists have the possibility to produce high-throughput single-cell data, or use and recycle publicly available datasets to answer new scientific questions. In our case, we have already extensively used Scarf for analysis of scRNA-Seq and scATAC-Seq datasets in another manuscript from the lab that is currently under revision,” he concluded.


Declaration of Interests

Parashar Dhapola and Göran Karlsson have submitted a patent application (No. 2051077-2) to the Swedish patent office (PRV). The application is under review and covers the part of the manuscript concerning the down/sub-sampling of cells (the TopACeDo algorithm).

This study was supported by grants from the Swedish Cancer Society, The Ragnar Söderberg Foundation, the Knut and Alice Wallenberg Foundation, the Swedish Research Council, the Swedish Society for Medical Research, and the Swedish Childhood Cancer Foundation.

This study was conducted in collaboration with Thomas Bonald, Institut Polytechnique de Paris, Paris, France. Thomas is at the forefront of developing network-based data structures and is leading the development of the scikit-network package.

Contacts:

Göran Karlsson

Principal Investigator
PhD, Associate Professor
Division of Molecular Hematology

Phone: +46 46 222 12 61
Email: Goran [dot] Karlsson [at] med [dot] lu [dot] se


Parashar Dhapola

Ph.D. Student
Division of Molecular Hematology

Email: Parashar [dot] dhapola [at] med [dot] lu [dot] se

 

Find out more about the Research Group on Stem Cells and Leukemia

Publication:

Read the full scientific article “Scarf enables a highly memory-efficient analysis of large-scale single-cell genomics data” in Nature Communications, 10 May 2022 (open access).

Key Facts:

Bioinformatics: a subdiscipline of science which involves using computer technology to collect, store and analyze complex biological data such as genetic codes.

Chromatin: a mixture of DNA and protein found in eukaryotic cells (cells with a nucleus).

Genome: the complete set of DNA (or RNA in RNA viruses) of an organism - including all its genes and its hierarchical, three-dimensional structural configuration.

Single-Cell RNA-Seq: a type of analysis which provides profiling of thousands of individual cells. It enables researchers to understand at the single-cell level what genes are expressed, in what quantities, and how they differ across thousands of cells within a diverse sample.

Single-Cell ATAC-Seq: a popular method which enables the study of highly diverse cell samples. It enables researchers to identify unique subpopulations of cell types based on their chromatin profiles. This could include cells at different developmental stages.