SPRING2 Assay Reference: Construction and Rationale¶
Overview¶
This reference was constructed to support fast, high-confidence assay type inference from small FASTQ alignment sketches in SPRING2. Rather than aligning against a full genome or transcriptome, the reference consists of carefully selected, diagnostic genomic subsets that expose assay-specific signals with minimal computational cost.
The design emphasizes:
- robustness across library preparations,
- resistance to sequencing depth and GC bias,
- biological interpretability,
- long-term reproducibility.
Input resources¶
- Genome: Human GRCh38 / hg38
- Gene annotation: GENCODE v49 (GTF)
- Coordinate system: UCSC-style chrN contigs
- Sequence extraction tool: bedtools getfasta (strand-aware where appropriate)
Reference structure¶
The final reference FASTA (spring2_assay_ref_hg38_gencode49.fa) is a concatenation of four biologically and statistically distinct blocks:
- RNA exon block (RNA detection)
- ATAC promoter block (chromatin accessibility detection)
- Intron / intergenic control block (background normalization)
- Genome backbone block (fragment geometry and periodicity)
Each block was constructed independently and validated before concatenation.
- RNA exon block (RNA vs DNA discrimination)
Purpose The RNA exon block robustly distinguishes RNA-derived libraries (bulk RNA-seq, scRNA-seq) from DNA-derived libraries using exonic enrichment and splicing signal.
Gene selection A curated set of ubiquitously expressed housekeeping genes was selected: ACTB, GAPDH, EEF1A1, RPLP0, RPS18, RPL13A, HPRT1, B2M, PABPC1, HNRNPK, MALAT1
Transcript selection
- For protein-coding genes: MANE Select transcripts were used when available.
Construction method
- GENCODE v49 GTF was parsed to extract exon coordinates for selected transcripts.
- Exons were strand-aware, sorted, and concatenated to form spliced exon-only sequences.
Result
-
11 FASTA records, one per transcript, representing continuous spliced mRNA sequences.
-
ATAC promoter block (ATAC vs non-ATAC discrimination)
Purpose
To detect chromatin accessibility libraries (ATAC-seq, scATAC-seq) via sharp enrichment at transcription start sites (TSSs).
Gene selection ACTB, GAPDH, EEF1A1, RPLP0, RPL13A, RPS18, HNRNPA1, HSP90AA1, CFL1, TUBA1B, B2M
TSS determination
- Protein-coding transcripts from GENCODE v49 were used.
- One representative TSS per gene was selected after collapsing transcript isoforms.
Window definition
- Strand-aware ±1,000 bp windows around the TSS.
Result
-
11 FASTA records, each exactly 2,000 bp long.
-
Intron / intergenic control block (background normalization)
Purpose
To provide a neutral background DNA reference for normalization and ratio-based scoring.
Region selection
- 14 intergenic regions, approximately 200 kb each.
- Distributed across chromosomes: chr1, chr3, chr5, chr7, chr12, chr17, chr19.
Result
-
14 FASTA records (~2.8 Mb total).
-
Genome backbone block (fragment geometry and periodicity)
Purpose
To capture fragment-length distributions and nucleosome periodicity for refining DNA, ATAC, and ChIP discrimination.
Region selection
- 7 intergenic regions (400 kb each) across chr1, chr3, chr5, chr7, chr12, chr17, chr19.
Result
- 7 FASTA records (2.8 Mb total).
Final reference assembly¶
Blocks were concatenated in the following order:
- RNA exon block
- ATAC promoter block
- Intron / intergenic control block
- Genome backbone block
Summary statistics
- RNA exons: 11 records
- ATAC promoters: 11 records
- Intron / intergenic controls: 14 records
- Genome backbone: 7 records
- Total: 43 FASTA records
The reference was indexed using samtools faidx for efficient random access.
Intended use in SPRING2¶
This reference supports very small alignment sketches (10–50k reads) to enable fast and robust assay inference, including RNA vs DNA and ATAC vs non-ATAC discrimination, and complements barcode-based single-cell detection.
Versioning¶
Genome: hg38 (GRCh38) Annotation: GENCODE v49