scRNA-seq optimized transcriptomic references
We have generated scRNA-seq optimized transcriptomic references as outlined in our 2023 Nature Methods manuscript “Recovery of missing single-cell RNA-sequencing data with optimized transcriptomic references”. One common frustration in using 3’ scRNA-seq, such as that provided by the 10x Genomics platform, is missing or barely detectable expression of genes that can readily be observed with other methods such as in situ hybridization. We show that this hidden transcriptome does not stem from a lack of sensitivity of scRNA-seq but rather from sequencing data being discarded due to suboptimal transcriptomic references. Specifically, we show that sequencing data from thousands of genes are discarded because of same-strand gene overlaps (multigene reads), poor annotation of 3’ UTRs/exons (intergenic reads), and the precise approach used for incorporating intronic reads. We have optimized the mouse and human genome annotations to recover these data (including a hybrid intronic read recovery approach) and have generated whole-genome scRNA-seq optimized transcriptomic references for use with Cell Ranger, STARsolo or other platforms:
Mouse mm10 scRNA-seq optimized transcriptomic reference v.2.0 (see summary of mouse annotation modifications and the input files used; gtf file only: optimized mouse annotation v.2.0)
Human GRCh38 scRNA-seq optimized transcriptomic reference v.2.0 (see summary of human annotation modifications and the input files used; gtf file only: optimized human annotation v.2.0)
These are optimized versions of the latest 10x Genomics transcriptomic references (as of 6/1/2023) for the GRCm38/mm10 mouse and GRCh38 human genome builds. You can download and use them in lieu of the latter. For best results, use the following command in your terminal to map sequencing reads to the reference:
cellranger count --id <output folder> --transcriptome <reference transcriptome location> --fastqs <location of fastqs> --include-introns --sample <sample>
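If you have several samples, the command above can be wrapped in a small shell helper. The following is a minimal dry-run sketch, not part of the published workflow: the paths in `REF` and `FASTQS`, the sample names, and the `_optimized` suffix on the output folder are all hypothetical placeholders. It only prints the commands so you can inspect them before running anything. The flags mirror the command above; check `cellranger count --help` for the exact flag syntax in your Cell Ranger version.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print one `cellranger count` command per sample.
# REF, FASTQS and the sample names are hypothetical placeholders.
set -eu

REF="/path/to/optimized_reference"   # assumed location of the unpacked reference
FASTQS="/path/to/fastqs"             # assumed folder holding the FASTQ files

# Build the command for one sample as a single printable line.
build_count_cmd() {
  local sample="$1"
  printf 'cellranger count --id %s_optimized --transcriptome %s --fastqs %s --include-introns --sample %s\n' \
    "$sample" "$REF" "$FASTQS" "$sample"
}

# Print the commands for two hypothetical samples.
for s in sampleA sampleB; do
  build_count_cmd "$s"
done
```

Once the printed commands look right, they can be copied into the terminal or piped to `bash` to run them.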
You can also access the code for optimizing the genome annotation of your favorite species, or for fine-tuning the mouse/human references for your own datasets, from our lab’s GitHub page.
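As a quick first look at whether your own annotation contains the same-strand gene overlaps that cause multigene reads, a short script can list candidate gene pairs from a GTF file. This is only an illustrative sketch and not the optimization code from our repository: the function name `find_same_strand_overlaps` is made up for this example, and it assumes a standard tab-separated GTF with `gene` feature rows and a `gene_id` attribute. For each gene it reports an overlap only with the furthest-reaching preceding gene on the same chromosome and strand, so it does not enumerate every overlapping pair.

```shell
#!/usr/bin/env bash
# Sketch: list same-strand genes with overlapping coordinates in a GTF file.
# Assumes tab-separated GTF rows with feature type "gene" and a gene_id attribute.
set -eu

find_same_strand_overlaps() {
  local gtf="$1"
  # Step 1: reduce gene rows to chromosome, strand, start, end, gene_id.
  awk -F'\t' '$3 == "gene" {
        id = $9
        sub(/.*gene_id "/, "", id)   # drop everything up to the gene_id value
        sub(/".*/, "", id)           # drop everything after the value
        print $1 "\t" $7 "\t" $4 "\t" $5 "\t" id
      }' "$gtf" |
    # Step 2: sort by chromosome, then strand, then numeric start position.
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2 -k3,3n |
    # Step 3: report a gene that starts before the furthest end seen so far
    # on the same chromosome and strand.
    awk -F'\t' '{
        key = $1 FS $2
        if (key == prev_key && $3 + 0 <= max_end) {
          print max_id "\t" $5
        }
        if (key != prev_key || $4 + 0 > max_end) {
          max_end = $4 + 0
          max_id = $5
        }
        prev_key = key
      }'
}

# Usage (hypothetical file name): find_same_strand_overlaps genes.gtf
```

Each output line is a tab-separated pair of gene IDs that share a strand and overlap in coordinates.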
Finally, we have also generated an R package, “ReferenceEnhancer” (see manual), that can be used to optimize any unoptimized genome reference (.gtf file) for scRNA-seq analysis or to discover obscured genes. This resource can be accessed via its GitHub repository or installed in R as follows:
``` r
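# Install devtools, which is needed to install packages from GitHub
install.packages("devtools")
library(devtools)
# Install ReferenceEnhancer from the PoolLab GitHub repository
install_github("PoolLab/ReferenceEnhancer")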
```
Updates
03/01/2023: We have uploaded v2.0 of the scRNA-seq optimized genome references. The optimized annotations are now compatible with 10x Multiome / Cell Ranger ARC workflows. We have also resolved several issues (fixed self-overlapping genes, reanalyzed hundreds of new genes for unannotated 3’ UTRs, and removed problematic gene extensions). The older v1.1 human annotation can still be accessed here, and v1 of the mouse annotation is available here.
07/21/2022: Human scRNA-seq optimized reference v1.1 is uploaded. This is a minor update in which the coordinates of two genes (IGHD, CPEB1) were reverted from RefSeq to GENCODE coordinates, significantly improving intergenic read registration from chromosomes 14 and 15. Older v1.0 files can still be accessed here: human v1.0 reference and annotation.