FASTQ Files and Index Reads

Typically, our service will deliver one or two FASTQ files for your reads, one for each main part of the read sequence. We do not as a matter of course deliver FASTQ files for index reads. There are exceptions (discussed below), but for most cases there is no value in doing so and these files can significantly increase the size of your lane on disk for no benefit.

Consider what difference having the index quality scores would make to your analysis. Would you really demultiplex again based on the quality scores of the index reads? Doing so would probably only allow you to tighten the quality criteria, resulting in fewer reads. BCL Convert has already passed your reads through the purity filter and allocated them to an index, so would you go to the lengths of reinterpreting those reads in order to demultiplex in a different way?
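To make the point concrete, here is a minimal Python sketch of what a stricter, quality-based rule on index reads would look like. It is an illustration only, not part of our pipeline: the function names, the Q30 threshold and the example quality strings are all hypothetical, and Phred+33 encoding is assumed.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def passes_index_quality(index_quality, min_phred=30):
    """Hypothetical stricter rule: every index base must reach min_phred."""
    return all(q >= min_phred for q in phred_scores(index_quality))


# Made-up index quality strings for three reads that have already been
# assigned to the same sample by demultiplexing.
index_qualities = ["FFFFFFFF", "FFFF,FFF", "FFFFFFF#"]

# Applying the stricter rule can only discard reads; it never adds any.
kept = [q for q in index_qualities if passes_index_quality(q)]
print(f"{len(kept)} of {len(index_qualities)} reads kept")
```

A filter of this kind can only ever return a subset of the reads it is given, which is why re-demultiplexing on index quality would leave you with fewer reads rather than different ones.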

"Unspecified Index"

This is our “sequencing something novel at your own risk” option. With this option we cannot be sure what is in the index reads, or indeed whether they are indexes at all, so we generate FASTQ files for the index reads of this index type. See "Custom Indexing" on the Genomics Indexes page.

10x Single Cell (excluding ATAC)

We do not generate FASTQ files for index reads for most single cell library types. We have been in communication with 10x support, and they have confirmed that not having the index read FASTQ files makes no difference to the analysis. As such, the cost of the additional storage outweighs the benefit of having quality scores for the index reads. The same logic as described above applies: would you re-demultiplex in a different way with index quality scores?

10x ATAC

Applies to both single cell ATAC and multiome ATAC.

The two ATAC methods are the exception to the rule: for these we do provide index read FASTQ files. 10x have advised that, while the analysis will work without these files, their presence can help Cell Ranger ARC.

From 10x support:

If it helps, it is only the I2 (i5 index read) that is required for the single cell and Multiome ATAC products, because of the placement of the 10x barcode in these reads. The Index 1 file is only used for demultiplexing (like the Index files for the other 10x products) and the Index 1 files are not needed. The quality score information is important to be preserved for not only barcode correction part of the cellranger-arc analysis algorithm, but also for long term data preservation/storage purposes. After the Cell Ranger ARC data analysis pipeline is run to process the raw FASTQ files, the critical barcode information is lost in the output BAM files. Original FASTQ files can be generated from the BAM, but if the quality scores are lost from the index reads in the case of Single Cell/Multiome ATAC files, then FASTQ cannot be re-generated and the barcode information is lost because the headers are not preserved. In raw data file storage repositories such as EBI and NCBI, the FASTQ headers are not preserved after files are uploaded. If the Index reads are not a separate FASTQ file, then the 10x barcode is lost and the single cell properties of the data are lost.

“CAGACGCG” is the known Multiome spacer sequence. This is expected, constant sequence that is expected to be found in the first 8 bp of the Index 2 (i5) read and is just a part of the chemistry of this product, related to being able to connect the Multiome GEX product barcodes to the Multiome ATAC product barcodes. This sequence is found in the Multiome product user guide on page 81. For analysis, we only care about the subsequent 16 bp 10x barcode. So if you really want to mask out the spacer sequence and save a bit of extra space by eliminating those 8 bp from the Index 2 file, then that is fine. They are not needed by the analysis software. But yes, the Cell Ranger ARC software will automatically read specifically the 16 bp 10x barcode sequence directly after the spacer sequence, so there is no need to do any kind of manipulation to extract it on your end.

Yes, the cellranger-arc pipeline will run just fine when the 10x barcode is located in the header. If it does happen by accident that the Index 2 (i5) read is sent to the header as in this case, it will not be fatal to the data analysis software. As you have seen, there is little difference in the data analysis output. However for long term and high throughput considerations, it is important to be aware that this is not the optimal configuration for this product for the reasons that I stated above. If something had gone wrong during sequencing of the Index 2 read, then this would not be able to be diagnosed without this Index 2 FASTQ file. Storing the 10x barcode in the header as a UMI without quality score information is not equivalent to storing the 10x barcode in a separate Index 2 FASTQ file with quality score information.

So the optimal solution for ATAC sequencing would be to provide a FASTQ file for the second index read but not the first. One cannot create a FASTQ file for one of the index reads and not the other, so FASTQ files are produced for both. We could delete the i7 FASTQ file after creating it, but this introduces a special case that is not, on balance, worth the complication.
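For reference, the Index 2 layout that 10x support describes for Multiome ATAC (an 8 bp spacer followed by the 16 bp 10x barcode) can be sketched in Python as follows. This is an illustration of the layout only: Cell Ranger ARC locates the barcode itself, so no such step is needed before analysis. The function name and the example read are hypothetical; only the spacer sequence and the two lengths come from the 10x reply above.

```python
SPACER = "CAGACGCG"   # Multiome spacer sequence quoted by 10x support
BARCODE_LENGTH = 16   # length of the 10x barcode that follows the spacer


def split_index2(sequence, quality):
    """Split an Index 2 (i5) read into its spacer and 10x barcode parts.

    Hypothetical helper for illustration; it assumes the spacer occupies
    the first 8 bp of the read, as described for the Multiome product.
    """
    start = len(SPACER)
    end = start + BARCODE_LENGTH
    return {
        "spacer_ok": sequence[:start] == SPACER,
        "barcode": sequence[start:end],
        # The per-base qualities travel with the barcode only when the
        # index read is kept as its own FASTQ record.
        "barcode_quality": quality[start:end],
    }


# Made-up 24 bp Index 2 read: spacer plus an invented 16 bp barcode.
example_sequence = "CAGACGCG" + "AAACAGCCAAGGAATC"
example_quality = "F" * 24

parts = split_index2(example_sequence, example_quality)
print(parts["spacer_ok"], parts["barcode"], parts["barcode_quality"])
```

The barcode quality string is available here only because the index read exists as a FASTQ record. When the index is carried in the read header instead, the bases survive but the quality scores do not, which is the distinction 10x support draws above.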

CRUK-CI Genomics Help