REVIEWS
Nature Reviews | Nephrology

Tools for the analysis of high-dimensional single-cell RNA sequencing data

Yan Wu and Kun Zhang

Department of Bioengineering, University of California at San Diego, La Jolla, CA, USA.
e-mail: kzhang@bioeng.ucsd.edu
https://doi.org/10.1038/s41581-020-0262-0

Abstract | Breakthroughs in the development of high-throughput technologies for profiling transcriptomes at the single-cell level have helped biologists to understand the heterogeneity of cell populations, disease states and developmental lineages. However, these single-cell RNA sequencing (scRNA-seq) technologies generate an extraordinary amount of data, which creates analysis and interpretation challenges. Additionally, scRNA-seq datasets often contain technical sources of noise owing to incomplete RNA capture, PCR amplification biases and/or batch effects specific to the patient or sample. If not addressed, this technical noise can bias the analysis and interpretation of the data. In response to these challenges, a suite of computational tools has been developed to process, analyse and visualize scRNA-seq datasets. Although the specific steps of any given scRNA-seq analysis might differ depending on the biological questions being asked, a core workflow is used in most analyses. Typically, raw sequencing reads are processed into a gene expression matrix that is then normalized and scaled to remove technical noise. Next, cells are grouped according to similarities in their patterns of gene expression, which can be summarized in two or three dimensions for visualization on a scatterplot. These data can then be further analysed to provide an in-depth view of the cell types or developmental trajectories in the sample of interest.

In a single organism, most cells have the same genome, but specific gene expression varies across different tissues and cell types. Any given tissue or cell type expresses 11,000–13,000 genes, of which 3,000–5,000 have a cell-type-specific expression pattern, whereas the remaining genes are ubiquitously expressed¹. These unique patterns of gene expression translate to differences at the protein level between different cell types and result in the vast array of cellular phenotypes found throughout the body. Therefore, a snapshot of the gene expression profile of a cell can be indicative of its phenotype.

Owing to the limited amount of RNA present in each cell, gene expression profiling was historically performed on pooled cells, but this bulk sequencing approach obscured the potential cell heterogeneity in a sample or tissue². For example, in a pool of developing progenitor cells, different cells might be primed to make distinct fate decisions, but these transcriptional programmes are indistinguishable in a bulk analysis of the average gene expression in the progenitor pool.

The development of technologies that can isolate thousands to tens of thousands of cells and assess their gene expression profiles at the single-cell level has enabled researchers to dissect this cellular heterogeneity and work towards a better understanding of physiology, biological development and disease²⁻⁶. For example, researchers generated an improved quantitative map of the cell types present in the developing human kidney, which has provided insights into renal physiology⁷. Another single-cell study demonstrated the similarities between fetal human kidney and human kidney organoids, reaffirming the utility of kidney organoids as a model for the study of disease and for drug screening⁸. However, deriving biological insights from single-cell RNA sequencing (scRNA-seq) methods demands that researchers handle the large volume of data generated by these technologies and their accompanying sources of technical noise⁹. Addressing the scale and complexity of these datasets thus requires a complex ecosystem of computational methods.

Beyond scRNA-seq analysis, other available technologies can profile genomes¹⁰, methylation patterns¹¹ and chromatin accessibility patterns¹²,¹³ at the single-cell level. Each type of single-cell profiling comes with its own challenges in terms of data analysis. Additionally, the development of multi-omics approaches, in which multiple types of biological molecules are profiled in the same cell, has advanced substantially in recent years. For example, some methods simultaneously profile RNA and chromatin accessibility¹⁴, RNA and methylation¹⁵, or even a combination of chromatin accessibility, RNA and methylation, albeit at a lower throughput¹⁶.

In this Review, we provide the non-expert reader with a broad overview of the different steps required for scRNA-seq analysis, including pre-processing of data and downstream analysis (Fig. 1). We discuss challenges that are typically encountered in every step of scRNA-seq data analysis and examine the different computational tools and approaches developed to address these issues, including their strengths and their limitations. We also explore how experimental design choices can affect downstream data analyses. In-depth, technical explanations of specific scRNA-seq analysis steps are available elsewhere¹⁷⁻¹⁹.

Data pre-processing

The raw data obtained from scRNA-seq platforms must first go through several pre-processing steps before they can be used to assess biologically relevant changes in gene expression. These pre-processing steps transform the raw data into a more usable format and address issues related to sample quality, the wide range of gene expression levels and variance. Additionally, these steps can reduce the impact of technical batch effects if multiple datasets are to be analysed simultaneously.

Generating a gene expression matrix

The initial output FASTQ file (or files) generated in an scRNA-seq experiment consists of complementary DNA (cDNA) reads. Each read contains an RNA sequence, a cell barcode that identifies the cell from which the read was generated and a unique molecular index (UMI) that identifies the exact mRNA molecule³⁻⁶. The first step of scRNA-seq analysis is to process these reads into a counts matrix that
summarizes the number of molecules of each gene detected in each cell in the dataset⁴,²⁰,²¹. The counts matrix serves as the input for the remaining analysis steps and is also an efficient way of storing and sharing information on gene expression (Box 1).

The creation of a counts matrix typically involves aligning the cDNA sequence in each read to a reference genome to identify the specific gene that the read originated from and then assigning each read to its cell of origin through its cell barcode⁴,²⁰,²¹ (Fig. 2a). scRNA-seq technologies use PCR to exponentially amplify cDNA molecules, and UMIs enable users to identify and collapse duplicate reads that might be generated during this amplification step, thus reducing technical noise²². Of note, sequencing errors in the UMI can artificially inflate gene expression, as duplicate reads that should be collapsed are treated as distinct molecules²⁰,²³. Conversely, distinct molecules might be incorrectly labelled with the same UMI sequence and thus be treated as one molecule²⁰.

For most sequencing technologies, background RNA contamination and sequencing errors result in a large number of cell barcodes that have a low number of reads but do not correspond to real cells. These empty barcodes can be detected and removed by setting a minimum number of reads or a UMI threshold for cell barcodes. More sophisticated methods such as dropEst are also available⁴,²⁰.

Several tools can be used for read processing (Table 1), including CellRanger, which accompanies the 10X Genomics Chromium scRNA-seq platform. CellRanger handles cDNA reads, runs sequence alignment, collapses duplicate reads by their UMIs and outputs a counts matrix along with quality control (QC) statistics⁴. CellRanger can also perform secondary analyses such as clustering (that is, grouping cells according to similarities in their patterns of gene expression) and visualization (discussed in more detail later), albeit using a rather basic pipeline⁴. However, CellRanger can be fairly slow and memory
intensive, using a maximum of 30 GB of RAM and taking 22 h to process 784 million reads (approximately equivalent to 50,000 cells at a depth of 15,000 reads per cell)²¹. Nevertheless, the integration of CellRanger with the Loupe Cell Browser, another piece of 10X Genomics software, offers non-expert users an interactive browser that can be used to visualize the results of clustering and the expression of marker genes⁴.

In the past few years, researchers have developed scRNA-seq methods that can profile hundreds of thousands to millions of cells in a single experiment by using combinatorial indexing. Such methods include split-pool ligation-based transcriptome sequencing and single-cell combinatorial indexing RNA-seq⁵,⁶. Given these technological advances and considering the amount of memory and processing time required by CellRanger²¹, alternative computational pipelines for processing cDNA reads into single-cell gene expression counts have also been developed⁵,⁶. The dropEst pipeline, for example, has faster runtimes and lower memory usage than CellRanger, and provides more accurate gene expression estimates by correcting sequencing errors in the cell barcodes and UMIs²⁰ (Table 1). DropEst also improves data recovery by using a machine learning model to identify empty barcodes, enabling the recovery of cell types that are smaller than average in size, and cell types with low RNA content that might otherwise be excluded from the analysis²⁰. UMI-tools is another pipeline that corrects sequencing errors in the cell barcodes and UMIs to provide more accurate quantification of gene expression²³.

One of the slowest steps in the CellRanger pipeline is the alignment of cDNA reads to the reference genome²⁴. The Kallisto pseudo-aligner, used alongside the BUStools suite of methods for storing and manipulating scRNA-seq data, is a highly efficient alternative to CellRanger because it creates a list of compatible transcripts for each read (pseudo-alignment) instead of aligning individual reads to an exact position in the transcriptome (alignment)²¹,²⁴. The combined Kallisto–BUStools method is up to 51 times faster than CellRanger and uses a maximum of 12 GB of RAM when processing 50,000 cells²¹ (Table 1). However, Kallisto–BUStools does not remove empty cell barcodes. STARSolo and Alevin are extensions of two alignment and pseudo-alignment methods, respectively, that can also be used for processing of scRNA-seq data²⁵,²⁶ (Table 1). Both STARSolo and Alevin have significantly faster runtimes than CellRanger, but STARSolo has a higher maximum RAM usage²¹.

Key points
• As single-cell RNA sequencing datasets increase in scale and complexity, faster and more efficient computational tools for processing and analysis are required.
• New computational tools that correct technical and batch effects can unlock additional heterogeneity and enable higher-resolution clustering and trajectory inference.
• Graph-based methods for clustering and trajectory inference allow for the scalable analysis of large single-cell RNA sequencing datasets.
• Visualization methods can distort the structure of the data and batch correction methods can reduce cell-type resolution; both methods should therefore be used with care and might require specific parameter tuning for each dataset.
• High-level biological interpretation, such as cell-type annotation, remains challenging and time-consuming; new automated methods, alongside the creation of single-cell reference atlases, promise to address these issues.

FASTQ file
A text file that stores DNA sequences and their associated quality metrics and metadata; a single sequence in a FASTQ file is called a read.

Counts matrix
An integer matrix (that is, numerical data arranged in a set of columns and rows) in which the columns typically correspond to cells, whereas the rows correspond to genes; each entry represents the number of molecules of that gene expressed in that cell.

In summary, the first step of scRNA-seq analysis is to process raw reads into a matrix of single-cell
gene expression counts. For users of the 10X Genomics scRNA-seq platform, CellRanger offers a convenient, albeit slow and memory-intensive, method for this processing. CellRanger also runs basic clustering and marker gene analysis that can be visualized with the Loupe Cell Browser. DropEst, Kallisto–BUStools, UMI-tools, STARSolo and Alevin are alternative read-processing methods that offer substantial runtime and memory improvements, enabling users to process their scRNA-seq runs without having to invest as much in computational infrastructure. Additionally, the enhanced correction of UMI and cell barcode errors available with DropEst, UMI-tools and Kallisto–BUStools can improve gene expression estimates compared with CellRanger.

Quality control and doublet detection

All scRNA-seq methods generate technical biases and noise; some basic QC addresses these issues before downstream analysis. Protocols used for single-cell dissociation and sequencing, for example, can induce cellular stress and result in cell death, which biases gene expression and can result in artificial clusters of dead cells in downstream analyses²⁷. Filtering out cells with either a low cDNA read or UMI count, as well as cells with a large number of mitochondrial reads per total number of UMIs (also known as the mitochondrial fraction), can help to remove dead cells². Unlike cytoplasmic RNA, the presence of mitochondrial RNA is indicative of cell death. The appropriate thresholds for the number of reads or UMIs and for the mitochondrial read fraction depend on the cell types present in the dataset and the scRNA-seq method being used. Setting a threshold for the minimum number of cells in which a gene is detected can also help to exclude genes that are only expressed in a small number of cells and are unlikely to be informative. However, users should ensure that this threshold is not too high, as rare cell types might otherwise be missed in the downstream analysis.

For most scRNA-seq methods, the presence of doublets, generated when two or more
cells are assigned to the same cell barcode, can create artificial clusters in the downstream analysis, as merging the gene expression patterns of two distinct cell types might create a unique expression signature that is not found in any real cell type. However, manually differentiating doublet clusters from true clusters can be challenging, especially for large datasets with many cell types²⁸. One common strategy for identifying doublets involves generating simulated doublets by combining cells from different clusters in the dataset and assessing which cells have similar expression profiles to the simulated doublet cells²⁸. However, this strategy is only feasible when the dataset contains discrete cell types, rather than continuous cellular trajectories²⁸.

QC thresholds might differ between datasets and some exploratory data analysis, such as histograms of the distribution of UMIs per cell or gene, can help to set thresholds for each dataset. In some cases, such as

Figure (workflow panel labels): Raw scRNA-seq data; Generate single-cell counts matrix; Run QC checks and normalize counts; Variance stabilization and feature selection; Batch effect correction and data integration; Dimensionality reduction; Visualization; Cel