Impact of Insert Length on Variant Calling Quality

ASHG 2022, Los Angeles, CA

The impact of insert length on variant calling quality in whole genome sequencing.

B. Lajoie, A. Altomare, K. Blease, R. Kelley, E. Miller, J. Moreno, C. Thompson, J. Zhao, S. Kruglyak;
Element Biosciences, San Diego, CA

Accurate variant calling is critical for whole genome sequencing applications, including rare disease and oncology. The availability of high-quality truth sets provided by NIST has enabled various benchmarking efforts across both sequencing platforms and NGS algorithms. Based on these benchmarking results, we have an understanding of the most accurate variant calling methods, the impact of greater read length, and the properties of the remaining difficult regions. However, we do not yet have a careful examination of the impact that varying insert length distributions have on accuracy. This is in part because amplifying long inserts is challenging for certain sequencing chemistries.

Our hypothesis is that longer inserts, like longer reads, could improve alignment in certain genomic regions, thus improving overall variant calling accuracy. The intuition is that a short insert may match several repetitive locations in the genome, whereas a longer insert may have at least one end outside of the repetitive region, thus providing an anchor for the alignment.
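The anchoring intuition can be illustrated with a toy geometric model. The sketch below is not the authors' analysis. It assumes a single repeat interval on a linear genome, fixed-length paired-end reads at the two ends of each insert, and treats a pair as "anchored" when at least one mate is not fully contained in the repeat. All function names and coordinates are hypothetical.

```python
def mate_intervals(start, insert_len, read_len):
    """Return half-open intervals covered by the two mates of one insert."""
    mate1 = (start, start + read_len)
    mate2 = (start + insert_len - read_len, start + insert_len)
    return mate1, mate2


def fully_inside(interval, repeat):
    """True if the interval lies entirely within the repeat interval."""
    return repeat[0] <= interval[0] and interval[1] <= repeat[1]


def is_anchored(start, insert_len, read_len, repeat):
    """A pair is anchored if at least one mate is not fully inside the repeat."""
    mate1, mate2 = mate_intervals(start, insert_len, read_len)
    return not (fully_inside(mate1, repeat) and fully_inside(mate2, repeat))


def anchored_fraction(insert_len, read_len, repeat, genome_len):
    """Among inserts with at least one mate fully inside the repeat,
    return the fraction whose other mate provides an anchor."""
    at_risk = 0
    anchored = 0
    for start in range(genome_len - insert_len + 1):
        mate1, mate2 = mate_intervals(start, insert_len, read_len)
        if fully_inside(mate1, repeat) or fully_inside(mate2, repeat):
            at_risk += 1
            if is_anchored(start, insert_len, read_len, repeat):
                anchored += 1
    return anchored / at_risk if at_risk else 1.0


# Hypothetical 600 bp repeat on a 5,000 bp toy genome, 150 bp reads.
REPEAT = (1000, 1600)
short_frac = anchored_fraction(350, 150, REPEAT, 5000)
long_frac = anchored_fraction(1000, 150, REPEAT, 5000)
```

In this toy setting, an insert longer than the repeat can never have both mates inside it, so every at-risk pair is anchored, whereas a 350 bp insert can fall entirely within the 600 bp repeat. Real genomes contain many repeat copies of varying similarity, so this only conveys the geometry, not actual mapping behavior.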

We began with a simulation study to determine the appropriate conditions to try experimentally. The open source NEAT simulation framework (https://github.com/ncsa/NEAT) was first used to generate inserts with a typical mean insert length of ~350 bp. Benchmarking of the simulated reads produced SNP and indel metrics similar to those we observed when benchmarking public sequencing data from the precisionFDA Truth Challenges. We then synthetically varied the insert length distributions and repeated the benchmarking. As we increased the insert length, the total number of variant calling errors decreased, particularly in the false negative category for positions that span the “difficult regions.” This supported our initial hypothesis.
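As a rough illustration of what varying an insert length distribution can mean, the sketch below draws insert lengths from a normal distribution truncated at a minimum length. This is an assumption made for illustration only: the abstract does not state the distribution shape, the standard deviations, or the NEAT settings used, and the function name and parameters here are hypothetical rather than part of NEAT.

```python
import random
import statistics


def sample_insert_lengths(mean, stdev, n, min_len=150, seed=0):
    """Draw n insert lengths (in bp) from a normal distribution,
    rejecting draws shorter than min_len (assumed read length)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    lengths = []
    while len(lengths) < n:
        value = round(rng.gauss(mean, stdev))
        if value >= min_len:
            lengths.append(value)
    return lengths


# Hypothetical distributions spanning the range of mean insert sizes
# mentioned in the abstract; the standard deviations are illustrative guesses.
distributions = {
    350: sample_insert_lengths(350, 50, 2000),
    1000: sample_insert_lengths(1000, 150, 2000),
}
for target_mean, lengths in distributions.items():
    print(target_mean, round(statistics.mean(lengths)), min(lengths), max(lengths))
```

Lengths sampled this way could then be fed to a read simulator to place mates at either end of each insert.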

Next, we generated sequencing libraries with mean insert sizes ranging from 350 bp to 1000 bp. We then benchmarked alignment and variant calling quality as a function of insert length. Consistent with the simulation, as insert length increased we saw a higher fraction of reads aligned at high mapping quality and a lower total number of variant calling errors. We conclude that some of the benefits of longer reads can be attained through the use of longer inserts.
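The quantities compared across insert lengths can be summarized with standard benchmarking formulas. The sketch below assumes that true positive, false positive, and false negative counts are already available from a comparison against a truth set, and that "high mapping quality" means MAPQ at or above a chosen threshold. The threshold of 30 and the function names are assumptions, not values taken from the abstract.

```python
def benchmark_metrics(tp, fp, fn):
    """Compute recall, precision, F1, and total errors from variant counts."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "total_errors": fp + fn,  # false positives plus false negatives
    }


def high_mapq_fraction(mapqs, threshold=30):
    """Fraction of reads whose mapping quality is at least the threshold."""
    if not mapqs:
        return 0.0
    return sum(q >= threshold for q in mapqs) / len(mapqs)
```

For example, with made-up counts, `benchmark_metrics(90, 10, 10)` reports 20 total errors, and `high_mapq_fraction([60, 60, 0, 10])` evaluates to 0.5. Computing these per library would give one row per insert length to compare.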
