Measuring the Accuracy of Element AVITI™ Sequencing Data

At Element Biosciences, we are committed to setting new standards for accuracy while driving applications that leverage that accuracy to achieve new or more efficient results. The Element AVITI System combines exceptional accuracy with a high number of reads and low operating costs.

In this article, we describe the accuracy of the AVITI System and how that accuracy is measured.

View Q40+ on AVITI Presentation from AGBT ’22

What Are Q-Scores?

The field of genomics has developed and evolved tremendously over the last 25 years, but a few central pillars have remained largely intact.

Quality scores (Q-scores) based on the Phred scale are a great example. In 1992, the concept of Q-scores was proposed as part of the SCF file format. A formal proposal of a numerical scoring method soon followed in a 1995 Nucleic Acids Research paper authored by James Bonfield and Rodger Staden at the University of Cambridge MRC Laboratory. The new ability to generate machine-readable data from sequencing traces for the ABI 373A and Pharmacia ALF instruments motivated the paper.

Until the arrival of automated sequencing, base calls were subjective and based on autoradiographs. Bonfield and Staden proposed the novel idea that base-calling algorithms can produce numerical estimates of accuracy at each base, building on earlier error correction methods. Their simple concept flourished, becoming the cornerstone of competitive positioning for a variety of sequencing platforms generating billions of reads per run.

How Are Q-Scores Determined?

Modern FASTQ files encode Q-scores for each base in each sequencing read.
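
As a minimal sketch of what that encoding looks like, the snippet below decodes a FASTQ quality line into integer Q-scores. It assumes the Phred+33 ASCII offset that is conventional for current FASTQ files (the article itself does not name the offset). The function name and the example record are illustrative only.

```python
def decode_phred33(quality_line: str) -> list[int]:
    """Convert a FASTQ quality string to integer Q-scores.

    Assumption: Phred+33 encoding, i.e. Q = ASCII code - 33.
    """
    return [ord(ch) - 33 for ch in quality_line]


# A made-up four-line FASTQ record: header, bases, separator, qualities.
record = ["@example_read", "ACGT", "+", "I?5+"]
scores = decode_phred33(record[3])
print(scores)  # [40, 30, 20, 10]
```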

The field has generally settled on Q30—or one error in every 1000 bases—as a reasonable accuracy standard for sequencing. Some long-read platforms consider one error in every 100 bases (Q20) as a reasonable standard. Although Q20 and Q30 are often mentioned, base call values of Q40 and above have not been broadly considered due to the accuracy limitations of current sequencing technologies.

Accuracy is typically described in terms of Q-scores, which are the log-transformed error probabilities.
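
Concretely, the Phred scale relates a Q-score to an error probability P by Q = -10 * log10(P), so each step of 10 in Q corresponds to a tenfold drop in the expected error rate. The short sketch below shows the conversion in both directions. The helper names are illustrative, not part of any Element software.

```python
import math


def q_to_error_prob(q: float) -> float:
    """Phred scale: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_q(p: float) -> float:
    """Phred scale: Q = -10 * log10(P). Requires 0 < p <= 1."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40, 50):
    # The expected spacing between errors is the reciprocal of P.
    print(f"Q{q}: about 1 error in {round(1 / q_to_error_prob(q)):,} bases")
```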

Table 1 presents the interpretation of common Q-scores. The quality specification for the AVITI System is at least 90% Q30 computed on a 2 x 150 bp run. The minimum output on each of two independently operated flow cells is 800 million read pairs. In practice, both the accuracy and number-of-reads specifications are often significantly exceeded. We begin by defining Q-scores and explaining how they are assigned. We then demonstrate that the Q-scores accurately represent the underlying data quality.

Table 1: Interpretation of Q-scores

Quality Score    Interpretation
Q10              1 error in 10 bases
Q20              1 error in 100 bases
Q30              1 error in 1000 bases
Q40              1 error in 10,000 bases
Q50              1 error in 100,000 bases
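
A figure such as "90% Q30" is conventionally read as the fraction of base calls whose assigned Q-score is 30 or higher. The sketch below computes that fraction under this reading. The function name and the toy input are hypothetical, and this is not a description of how Element's own run metrics are produced.

```python
def fraction_at_or_above(reads_q_scores, threshold=30):
    """Fraction of all base calls with Q >= threshold.

    reads_q_scores: iterable of per-read lists of integer Q-scores.
    """
    total = 0
    passing = 0
    for read in reads_q_scores:
        total += len(read)
        passing += sum(1 for q in read if q >= threshold)
    return passing / total if total else 0.0


# Toy example: 6 of the 8 base calls are at Q30 or above.
toy_reads = [[40, 30, 20, 10], [44, 44, 44, 44]]
print(f"{fraction_at_or_above(toy_reads):.0%} of bases at Q30 or above")
```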

How Element Biosciences Assigns Q-Scores

To assign Q-scores, Element follows the methods of Ewing et al. but with custom predictors. Briefly, we generated 20 runs of human whole-genome sequencing (WGS) data for training and used Covaris-sheared, PCR-free library prep to limit upstream errors. The base calls were labeled as either correct or erroneous based on alignment data. The base calls and labels served as input for the training process. In turn, the output of the training process was a table that maps predictor values to Q-scores. The table was applied to an independent run (not used in the training) to generate Figure 1, which compares the predicted and observed Q-scores for that run. The run is available for download from our Sequencing Datasets page.
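
The training step can be sketched roughly as follows. This is a deliberately simplified illustration, not Element's implementation and not the full Ewing et al. procedure. Predictor values are assumed to be already discretized into bins, the bin labels are hypothetical, and the add-one pseudocount and the Q50 cap are assumptions made here so that bins with no observed errors still receive a finite score.

```python
import math
from collections import defaultdict


def build_q_table(observations, max_q=50):
    """Map each predictor bin to a Q-score from labeled base calls.

    observations: iterable of (predictor_bin, is_correct) pairs, where
    is_correct comes from comparing the base call to the alignment.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for predictor_bin, is_correct in observations:
        totals[predictor_bin] += 1
        if not is_correct:
            errors[predictor_bin] += 1

    table = {}
    for predictor_bin, n in totals.items():
        # Assumed smoothing: add-one pseudocount on the error rate.
        error_rate = (errors[predictor_bin] + 1) / (n + 2)
        table[predictor_bin] = min(max_q, round(-10 * math.log10(error_rate)))
    return table


# Hypothetical training labels for two predictor bins.
training = (
    [("clean_signal", True)] * 998
    + [("noisy_signal", True)] * 89
    + [("noisy_signal", False)] * 9
)
q_table = build_q_table(training)
print(q_table)  # {'clean_signal': 30, 'noisy_signal': 10}
```

Once built, such a table is applied to new base calls by looking up each call's predictor bin, which is why a held-out run is needed to check that the assigned scores match observed error rates.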

Figure 1: Predicted and observed quality scores for a 2x150 bp human genome sequencing run
  • The histogram shows that most of the base calls exceed Q40, with Q44 being the most frequent assignment.
  • The blue dots show that the predicted Q-scores match the recalibrated Q-scores well.

To obtain the recalibrated (empirical) Q-scores, GATK BaseRecalibrator is paired with publicly available known-sites files so that bases overlapping known variant positions are not counted as sequencing errors. Figure 2 details how GATK BaseRecalibrator was applied to generate the data.
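
The idea behind the known-sites masking can be illustrated with a small sketch. This is not GATK's code or its exact statistical model. It only shows the principle: group base calls by their reported Q-score, skip positions listed as known variants, and convert the remaining mismatch rate to an empirical Q-score. The input tuple layout, the function name, and the add-one pseudocount are assumptions.

```python
import math
from collections import defaultdict


def empirical_q_by_reported(calls, known_sites):
    """Empirical Q-score for each reported Q-score.

    calls: iterable of (position, reported_q, matches_reference).
    known_sites: set of positions to exclude, since a mismatch there
    may be a real variant rather than a sequencing error.
    """
    observed = defaultdict(int)
    mismatched = defaultdict(int)
    for position, reported_q, matches_reference in calls:
        if position in known_sites:
            continue
        observed[reported_q] += 1
        if not matches_reference:
            mismatched[reported_q] += 1
    # Assumed smoothing: add-one pseudocount so zero mismatches stays finite.
    return {
        q: -10 * math.log10((mismatched[q] + 1) / (n + 2))
        for q, n in observed.items()
    }


# Toy data: 999 calls reported at Q30; the only mismatch sits on a known variant.
toy_calls = [(pos, 30, pos != 5) for pos in range(999)]
masked = empirical_q_by_reported(toy_calls, known_sites={5})
unmasked = empirical_q_by_reported(toy_calls, known_sites=set())
print(round(masked[30], 1), round(unmasked[30], 1))
```

In the toy data, masking the known variant leaves no mismatches, while the unmasked calculation charges the variant as an error and reports a lower empirical score.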

Figure 2: Application of GATK BaseRecalibrator to the data

Data Filtering

Beyond reporting the Q-scores and verifying their accuracy, it is important to describe any data filtering:

  • First, we eliminate read pairs that have poor quality in the early cycles of either read. This procedure typically removes ~5% of the overall data. For the cited sequencing run, the number of read pairs that pass filter is ~1.02 billion, far exceeding the 800 million read pair specification.
  • Second, we trim adapter sequences by matching the ends of the reads to the known adapter sequence. The amount of filtering this step performs depends on the insert length distribution and read length. Both the adapter-trimmed and untrimmed data are available. Notably, we do not perform any filtering based on alignment or any other knowledge that secondary analysis generates.
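
The two filtering steps above can be sketched as follows. The article does not give the actual criteria, so every threshold here (number of early cycles, minimum mean quality, minimum adapter overlap) is a hypothetical placeholder, as are the function names and the adapter string used in the example. Exact string matching is also a simplification; real trimmers usually tolerate mismatches.

```python
def passes_early_cycle_filter(q_scores, n_cycles=25, min_mean_q=20):
    """True if the mean Q over the first n_cycles meets the threshold.

    n_cycles and min_mean_q are assumed values, not Element's settings.
    """
    early = q_scores[:n_cycles]
    return bool(early) and sum(early) / len(early) >= min_mean_q


def keep_pair(r1_q, r2_q):
    """A pair is kept only if both reads pass the early-cycle check."""
    return passes_early_cycle_filter(r1_q) and passes_early_cycle_filter(r2_q)


def trim_adapter(read, adapter, min_overlap=8):
    """Remove the adapter and everything after it from the 3' end."""
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Otherwise look for a partial adapter (a prefix of it) at the read end,
    # trying the longest candidate overlap first.
    for k in range(min(len(adapter), len(read)) - 1, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


placeholder_adapter = "GATTACAGATTACA"  # made-up sequence for illustration
insert = "TTTTGGGGCCCCAAAA"
print(trim_adapter(insert + placeholder_adapter[:10], placeholder_adapter))
```

Neither step consults alignments, which matches the point made above that no filtering relies on secondary analysis.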

Achieving Q40+ on AVITI

In conclusion, Element data generated from PCR-free libraries has a high proportion of Q40 data. The assigned Q-scores are accurate across the entire range of scores, as determined by open-source third-party tools. Minimal filtering is applied to the raw data, and our website offers the trimmed and untrimmed data in FASTQ format.

To learn more about what the Element AVITI System can do for you, reach out to our team today.
