SPRING2 Assay Reference: Construction and Rationale

Overview

This reference was constructed to support fast, high-confidence assay type inference from small FASTQ alignment sketches in SPRING2. Rather than aligning against a full genome or transcriptome, the reference consists of carefully selected, diagnostic genomic subsets that expose assay-specific signals with minimal computational cost.

The design emphasizes:

robustness across library preparations,
resistance to variation in sequencing depth and to GC bias,
biological interpretability,
long-term reproducibility.

Input resources

Genome: Human GRCh38 / hg38
Gene annotation: GENCODE v49 (GTF)
Coordinate system: UCSC-style chrN contigs
Sequence extraction tool: bedtools getfasta (strand-aware where appropriate)

Reference structure

The final reference FASTA (spring2_assay_ref_hg38_gencode49.fa) is a concatenation of four biologically and statistically distinct blocks:

RNA exon block (RNA detection)
ATAC promoter block (chromatin accessibility detection)
Intron / intergenic control block (background normalization)
Genome backbone block (fragment geometry and periodicity)

Each block was constructed independently and validated before concatenation.

RNA exon block (RNA vs DNA discrimination)

Purpose

To distinguish RNA-derived libraries (bulk RNA-seq, scRNA-seq) from DNA-derived libraries using exonic enrichment and splice-junction signal.

Gene selection

A curated set of ubiquitously expressed housekeeping genes was selected: ACTB, GAPDH, EEF1A1, RPLP0, RPS18, RPL13A, HPRT1, B2M, PABPC1, HNRNPK, MALAT1

Transcript selection

For protein-coding genes: MANE Select transcripts were used when available.

Construction method

GENCODE v49 GTF was parsed to extract exon coordinates for selected transcripts.
Exons were sorted in transcript order (strand-aware) and concatenated to form spliced, exon-only sequences.
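
The two construction steps above can be sketched in Python. This is a minimal illustration, not the script used to build the reference: the function names (`parse_exons`, `spliced_sequence`) are hypothetical, and the genome is assumed to be available as an in-memory dict of contig name to sequence. Exons are concatenated in genomic order and the result is reverse-complemented for minus-strand transcripts, which is equivalent to joining the exons in 5'-to-3' transcript order.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def parse_exons(gtf_lines, transcript_ids):
    """Collect (chrom, start, end, strand) exon tuples for selected transcripts.

    GTF coordinates are 1-based and inclusive.
    """
    exons = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        # Attribute column looks like: key "value"; key "value"; ...
        tid = None
        for attr in fields[8].split(";"):
            attr = attr.strip()
            if attr.startswith("transcript_id "):
                tid = attr.split(" ", 1)[1].strip('"')
                break
        if tid in transcript_ids:
            exons[tid].append((fields[0], int(fields[3]), int(fields[4]), fields[6]))
    return exons


def spliced_sequence(exon_list, genome):
    """Join exon sequences into one spliced, strand-corrected sequence.

    `genome` is an assumed dict mapping contig name -> sequence string.
    """
    ordered = sorted(exon_list, key=lambda e: e[1])  # genomic order
    strand = ordered[0][3]
    # Convert 1-based inclusive GTF coordinates to Python slices.
    seq = "".join(genome[chrom][start - 1:end] for chrom, start, end, _ in ordered)
    return reverse_complement(seq) if strand == "-" else seq
```

In practice the sequences were extracted with bedtools getfasta, as listed under Input resources; the sketch only illustrates the ordering and strand logic.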

Result

11 FASTA records, one per transcript, representing continuous spliced mRNA sequences.

ATAC promoter block (ATAC vs non-ATAC discrimination)

Purpose

To detect chromatin accessibility libraries (ATAC-seq, scATAC-seq) via sharp enrichment at transcription start sites (TSSs).

Gene selection

ACTB, GAPDH, EEF1A1, RPLP0, RPL13A, RPS18, HNRNPA1, HSP90AA1, CFL1, TUBA1B, B2M

TSS determination

Protein-coding transcripts from GENCODE v49 were used.
One representative TSS per gene was selected after collapsing transcript isoforms.

Window definition

Strand-aware ±1,000 bp windows around the TSS.
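
As an illustration of the window definition, the sketch below derives a 2,000 bp BED interval from a transcript's GTF coordinates. The function names and the exact placement of the TSS base inside the window are assumptions made for this example; the source only states that windows are strand-aware, span ±1,000 bp, and are exactly 2,000 bp long. Here each window holds 1,000 bases upstream of the TSS, the TSS base itself, and 999 bases downstream, mirrored for minus-strand genes.

```python
def tss_window(chrom, tx_start, tx_end, strand, flank=1000):
    """Return (chrom, start0, end0, strand) as a 0-based, half-open interval.

    tx_start and tx_end are 1-based inclusive GTF transcript coordinates.
    The TSS is the transcript start on '+' and the transcript end on '-'.
    """
    if strand == "+":
        tss0 = tx_start - 1                 # 0-based TSS position
        start0, end0 = tss0 - flank, tss0 + flank
    elif strand == "-":
        tss0 = tx_end - 1
        # Mirror the plus-strand layout: upstream lies at higher coordinates.
        start0, end0 = tss0 - flank + 1, tss0 + flank + 1
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    if start0 < 0:
        raise ValueError("window would extend past the start of the contig")
    return chrom, start0, end0, strand


def to_bed_line(window, name):
    """Format a window as a 6-column BED line (score fixed at 0)."""
    chrom, start0, end0, strand = window
    return f"{chrom}\t{start0}\t{end0}\t{name}\t0\t{strand}"
```

A 6-column BED file of this shape can be passed to bedtools getfasta with the -s option so that minus-strand windows are reverse-complemented on extraction.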

Result

11 FASTA records, each exactly 2,000 bp long.

Intron / intergenic control block (background normalization)

Purpose

To provide a neutral background DNA reference for normalization and ratio-based scoring.

Region selection

14 intergenic regions, approximately 200 kb each.
Distributed across chromosomes: chr1, chr3, chr5, chr7, chr12, chr17, chr19.

Result

14 FASTA records (~2.8 Mb total).

Genome backbone block (fragment geometry and periodicity)

Purpose

To capture fragment-length distributions and nucleosome periodicity for refining DNA, ATAC, and ChIP discrimination.

Region selection

7 intergenic regions (400 kb each) across chr1, chr3, chr5, chr7, chr12, chr17, chr19.

Result

7 FASTA records (2.8 Mb total).
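
To make the idea of fragment geometry concrete, the following sketch summarizes the template lengths of read pairs aligned to the backbone block. It is not SPRING2's actual scoring: the function name and the length bands (`nfr_max`, `mono_range`) are illustrative assumptions, chosen only to show how short sub-nucleosomal fragments and roughly mono-nucleosomal fragments might be separated.

```python
from collections import Counter


def fragment_profile(template_lengths, nfr_max=120, mono_range=(150, 250)):
    """Summarize a fragment-length distribution.

    template_lengths: signed template lengths (e.g. SAM TLEN values);
    zeros (unpaired or unplaced mates) are ignored.
    nfr_max and mono_range are assumed, illustrative cut-offs in bp.
    """
    lengths = [abs(t) for t in template_lengths if t != 0]
    if not lengths:
        return None
    hist = Counter(lengths)
    n = len(lengths)
    short = sum(c for length, c in hist.items() if length <= nfr_max)
    mono = sum(c for length, c in hist.items()
               if mono_range[0] <= length <= mono_range[1])
    return {
        "n": n,
        "nfr_fraction": short / n,    # share of short, sub-nucleosomal fragments
        "mono_fraction": mono / n,    # share of ~mono-nucleosomal fragments
        "histogram": hist,
    }
```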

Final reference assembly

Blocks were concatenated in the following order:

RNA exon block
ATAC promoter block
Intron / intergenic control block
Genome backbone block

Summary statistics

RNA exons: 11 records
ATAC promoters: 11 records
Intron / intergenic controls: 14 records
Genome backbone: 7 records
Total: 43 FASTA records

The reference was indexed using samtools faidx for efficient random access.
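
The record counts above can be checked against the .fai index that samtools faidx writes, whose first two tab-separated columns are the record name and its length. The sketch below is a hypothetical sanity check, not part of SPRING2. It relies only on the concatenation order and per-block counts stated in this document, and on promoter records being exactly 2,000 bp; the block labels in `BLOCK_LAYOUT` are made up for the example.

```python
# Block order and record counts as listed in this document; labels are hypothetical.
BLOCK_LAYOUT = [
    ("rna_exon", 11),
    ("atac_promoter", 11),
    ("control", 14),
    ("backbone", 7),
]


def read_fai(lines):
    """Parse .fai lines into (name, length) tuples."""
    records = []
    for line in lines:
        if not line.strip():
            continue
        name, length = line.split("\t")[:2]
        records.append((name, int(length)))
    return records


def check_layout(records, promoter_len=2000):
    """Return a list of human-readable problems (empty if the layout matches)."""
    problems = []
    expected_total = sum(count for _, count in BLOCK_LAYOUT)
    if len(records) != expected_total:
        problems.append(f"expected {expected_total} records, found {len(records)}")
        return problems
    offset = 0
    for block, count in BLOCK_LAYOUT:
        chunk = records[offset:offset + count]
        if block == "atac_promoter":
            for name, length in chunk:
                if length != promoter_len:
                    problems.append(f"{name}: {length} bp, expected {promoter_len}")
        offset += count
    return problems
```

Typical use would be `check_layout(read_fai(open(path)))` with `path` pointing at the index file next to the reference FASTA.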

Intended use in SPRING2

This reference is designed for very small alignment sketches (10k–50k reads). It enables fast, robust assay inference, including RNA vs DNA and ATAC vs non-ATAC discrimination, and complements barcode-based single-cell detection.
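
One way to picture ratio-based inference on such a sketch is shown below. This is a simplified illustration under stated assumptions, not SPRING2's classifier: the function names, the block labels, the pseudocount, and the thresholds `rna_min` and `atac_min` are all hypothetical. Reads are counted per block, converted to reads per kb, and compared with the control block.

```python
def block_densities(read_counts, block_lengths_bp):
    """Reads per kb for each block; blocks without reads get density 0."""
    return {
        block: read_counts.get(block, 0) / (length / 1000)
        for block, length in block_lengths_bp.items()
    }


def enrichment(densities, block, control="control", pseudocount=1e-3):
    """Density of `block` relative to the control block."""
    return (densities[block] + pseudocount) / (densities[control] + pseudocount)


def infer_assay(read_counts, block_lengths_bp, rna_min=20.0, atac_min=5.0):
    """Return a coarse label from exon and promoter enrichment.

    rna_min and atac_min are assumed, illustrative thresholds.
    """
    densities = block_densities(read_counts, block_lengths_bp)
    if enrichment(densities, "rna_exon") >= rna_min:
        return "RNA"
    if enrichment(densities, "atac_promoter") >= atac_min:
        return "ATAC"
    return "DNA-like"


# Example block sizes: promoter and control totals follow this document
# (11 x 2,000 bp and about 2.8 Mb); the exon total is a made-up placeholder.
EXAMPLE_LENGTHS = {
    "rna_exon": 25_000,
    "atac_promoter": 22_000,
    "control": 2_800_000,
}
```

Normalizing by block length matters because the blocks differ in size by roughly two orders of magnitude, so raw read counts are not comparable across them.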

Versioning

Genome: hg38 (GRCh38)
Annotation: GENCODE v49