Illumina Sequencing Data Files

Your sequencing data will be made available in the standard FASTQ file format. We also provide some smaller files giving information about the FASTQ files, and a report for each lane of sequencing.

The sequenced run folders are processed with Illumina's BCL Convert program. Further details on this tool can be found on the Illumina web site.

FASTQ Files

Your data will be demultiplexed according to the information supplied in your submission for sequencing. What you receive will depend on the type of sequencing done.

If you are downloading files to a location outside the CRUK-CI institute, we strongly recommend checking that the files were not damaged during transfer (truncation is more common than corruption). We provide checksums for the FASTQ files, which can be used to confirm that every file has transferred completely.

Regular Sequencing

You will normally receive one (single read) or two (paired end) FASTQ files per sample for each lane of sequencing, plus one or two additional FASTQ files that contain the reads that the demultiplexer (Illumina's BCL Convert) cannot assign to any sample (what we call “lost reads”). The files will be named according to the pattern:

<SLX>.<barcode>.<flowcell>.s_<lane>.r_<read>.fq.gz

A small subset of kits (library types) also return an index read: a separate FASTQ file or pair of files containing the index reads with quality scores. These files are named with an “i” in place of the “r”, followed by the index read number:

<SLX>.<barcode>.<flowcell>.s_<lane>.i_<index read>.fq.gz
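
When handling many files it can help to split these names back into their parts. The following is a minimal sketch rather than a tool we supply: the function name parse_fastq_name and the dictionary keys are made up for illustration, and it assumes that no part of the name (SLX identifier, barcode or flowcell) contains a full stop.

```python
import re

# Matches sample read ("r") and index read ("i") FASTQ names as described above.
# Assumption: no individual component contains a "." character.
FASTQ_RE = re.compile(
    r"^(?P<slx>[^.]+)\.(?P<barcode>[^.]+)\.(?P<flowcell>[^.]+)"
    r"\.s_(?P<lane>\d+)\.(?P<kind>[ri])_(?P<read>\d+)\.fq\.gz$"
)


def parse_fastq_name(name):
    """Return a dict of the file name's parts, or None if it does not match."""
    match = FASTQ_RE.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    parts["lane"] = int(parts["lane"])
    parts["read"] = int(parts["read"])
    # "i" marks an index read file, "r" a regular read file.
    parts["is_index"] = parts.pop("kind") == "i"
    return parts


if __name__ == "__main__":
    # The identifiers in this example name are invented.
    print(parse_fastq_name("SLX-12345.SIGAH9.HXXXXDRXY.s_1.r_2.fq.gz"))
```

Names that do not follow either pattern, such as the lost reads file described next, return None.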

The lost reads file will have a different naming pattern:

<SLX>.<flowcell>.s_<lane>.lostreads.fq.gz

There will also be checksum files for each sample and the lost reads.

<SLX>.<barcode>.<flowcell>.s_<lane>.md5sums.txt
<SLX>.<flowcell>.s_<lane>.lostreads.md5sums.txt

External users of our service in particular should use the checksums to make sure their data files have been copied to their local systems without corruption or truncation. See the section below for instructions on how to do this.
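
As an illustration of what such a check involves, here is a minimal Python sketch. It assumes the checksum files use the same layout as the output of the md5sum command (a hexadecimal digest, whitespace, then the file name) and that the FASTQ files sit in the same directory as the checksum file. The function names md5_of and verify_md5sums are invented for this example.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5sums(checksum_file):
    """Return {file name: True/False} for every entry in the checksum file.

    Assumes md5sum-style lines: "<digest>  <file name>".
    A file that is missing counts as a failure.
    """
    checksum_file = Path(checksum_file)
    results = {}
    for line in checksum_file.read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # md5sum marks binary mode with a leading "*"
        target = checksum_file.parent / name
        results[name] = target.is_file() and md5_of(target) == expected.lower()
    return results


if __name__ == "__main__":
    import sys

    for file_name, ok in verify_md5sums(sys.argv[1]).items():
        print("OK    " if ok else "FAILED", file_name)
```

If the checksum files do follow the md5sum layout, running md5sum -c on the checksum file on a Linux system performs the same comparison.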

Custom Indexing

If your submission requested custom indexing, you will receive one set of files containing all your reads. The barcode will be “UnspecifiedIndex”. You will always receive one or two index FASTQ files with custom indexing. There is no lost reads file (all the reads are in the sample FASTQ). The files you can expect are:

<SLX>.UnspecifiedIndex.<flowcell>.s_<lane>.r_<read>.fq.gz
<SLX>.UnspecifiedIndex.<flowcell>.s_<lane>.i_<index read>.fq.gz
<SLX>.UnspecifiedIndex.<flowcell>.md5sums.txt

No Indexing or Inline Indexing

As with custom indexing, you will receive one set of files containing all your reads. The barcode will be “NoIndex” or “INLINE”. There is no index information, so there will be neither index reads nor lost reads. The files you can expect are:

<SLX>.NoIndex.<flowcell>.s_<lane>.r_<read>.fq.gz
<SLX>.NoIndex.<flowcell>.md5sums.txt

10x Sequencing

As of November 2020 we supply 10x data as a set of FASTQ files in the same manner as any other regular multiplexed library type, for both single index (SI-GA and SI-NA) and dual index (TT/TS/TN/NT/NN) kits. If you have received your single cell data files as compressed FASTQ files, refer to the regular sequencing section. Also refer to the section on file name conversion if these files need to be renamed for 10x downstream pipelines.

The FASTQ files you receive for SI-GA and SI-NA indexing will have all four index sequences for each barcode combined in the set of FASTQ files for the sample. A quick run through a demultiplexer will confirm the presence of the four indexes at approximately 25% of reads for each one.
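
A rough version of this check can be done without a full demultiplexer by tallying the index sequence recorded in each read header. The sketch below assumes the delivered FASTQ headers follow the usual Illumina layout, in which the text after the final colon is the index sequence. The function name index_fractions and its max_reads parameter are made up for this example.

```python
import gzip
from collections import Counter


def index_fractions(fastq_path, max_reads=100_000):
    """Return {index sequence: fraction of reads} from a gzipped FASTQ file.

    Assumes Illumina-style headers where the index follows the last ":".
    Only the first max_reads reads are examined, to keep the check quick.
    """
    counts = Counter()
    with gzip.open(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 != 0:
                continue  # each FASTQ record is four lines; the header is the first
            if line_number // 4 >= max_reads:
                break
            counts[line.rstrip().rsplit(":", 1)[-1]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {index: n / total for index, n in counts.most_common()}


if __name__ == "__main__":
    import sys

    for index, fraction in index_fractions(sys.argv[1]).items():
        print(f"{index}\t{fraction:.1%}")
```

For a single index 10x sample one would hope to see four sequences, each at roughly a quarter of the reads.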

The rest of this section refers to single cell sequencing before November 2020, when single cell data was delivered as a TAR file containing twelve or sixteen FASTQ files. If you have received a TAR file per sample, continue reading.

With the single index SI-GA and SI-NA kits (and indeed previous, now discontinued, 10x kits), each barcode is actually four individual barcode sequences. Demultiplexing therefore produces four sets of FASTQ files for the single sample given in the submission, and to keep them together the FASTQ files are put into a tar archive. There will be a checksum file for the archive. The files will be:

<SLX>.<barcode>.<flowcell>.s_<lane>.tar
<SLX>.<barcode>.<flowcell>.s_<lane>.md5sums.txt
<SLX>.<flowcell>.s_<lane>.lostreads.tar
<SLX>.<flowcell>.s_<lane>.lostreads.md5sums.txt

Inside each archive there will be twelve or sixteen FASTQ files. These will be named as they were produced by BCL Convert, which should allow 10x's Cell Ranger and Long Ranger to read the files as if they had been produced by these tools' demux pipelines. There will be FASTQ files for the index reads too.

For example, for a sample that was labelled with the 10x SI-GA H9 barcode, the tar archive will contain:

SIGAH9_S5_L001_I1_001.fastq.gz
SIGAH9_S5_L001_R1_001.fastq.gz
SIGAH9_S5_L001_R2_001.fastq.gz
SIGAH9_S6_L001_I1_001.fastq.gz
SIGAH9_S6_L001_R1_001.fastq.gz
SIGAH9_S6_L001_R2_001.fastq.gz
SIGAH9_S7_L001_I1_001.fastq.gz
SIGAH9_S7_L001_R1_001.fastq.gz
SIGAH9_S7_L001_R2_001.fastq.gz
SIGAH9_S8_L001_I1_001.fastq.gz
SIGAH9_S8_L001_R1_001.fastq.gz
SIGAH9_S8_L001_R2_001.fastq.gz

The sample number (the second part of the file name, “S5” etc) is an internal sample number generated by BCL Convert that has no significance.
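
To check what an archive holds without unpacking it, the members can be listed and grouped by that sample number. This is a minimal sketch using Python's standard tarfile module. The function name group_tar_members is invented, and the regular expression is our reading of the file names shown above rather than a formal specification.

```python
import re
import tarfile
from collections import defaultdict

# BCL Convert style name: <sample>_S<number>_L<lane>_<I or R><read>_001.fastq.gz
ILLUMINA_RE = re.compile(
    r"^(?P<sample>.+)_S(?P<number>\d+)_L(?P<lane>\d{3})_(?P<read>[IR]\d)_001\.fastq\.gz$"
)


def group_tar_members(tar_path):
    """Return {sample number: sorted list of reads, e.g. ["I1", "R1", "R2"]}."""
    groups = defaultdict(list)
    with tarfile.open(tar_path) as archive:
        for member in archive.getmembers():
            # Ignore any directory prefix stored in the archive.
            name = member.name.rsplit("/", 1)[-1]
            match = ILLUMINA_RE.match(name)
            if member.isfile() and match:
                groups[int(match["number"])].append(match["read"])
    return {number: sorted(reads) for number, reads in sorted(groups.items())}


if __name__ == "__main__":
    import sys

    for number, reads in group_tar_members(sys.argv[1]).items():
        print(f"S{number}: {', '.join(reads)}")
```

For the example archive above this would report four sample numbers (S5 to S8), each with I1, R1 and R2.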

Supporting Files

There will be some additional files delivered with the FASTQ data. These hold some statistics and QC test results that may be useful.

<SLX>.<flowcell>.s_<lane>.bclconvert.zip: This is a zip archive of BCL Convert's Reports folder and the sample sheet we used to create the FASTQ files. The statistics are CSV files that can be parsed to extract the numbers of reads, yield, unknown barcodes and so forth. The archive's contents are created by BCL Convert, so the keen reader can refer to Illumina's documentation for the tool for details. For as long as it remains supported, we also include the same reports in bcl2fastq format under the legacy folder.
<SLX>.<flowcell>.s_<lane>.contents.csv: A small CSV file detailing the barcoding of samples in the pool. Gives the sample name, barcode name and barcode sequence for the pool. May not be available for those sequencing methods that do not have multiplexing information in Clarity (no index, inline barcode, custom index).
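
The CSV files inside the zip archive can be read without unpacking it. The sketch below makes no assumptions about which reports are present or what their columns are called: it loads every member whose name ends in .csv and uses the first row of each as the column names. The function name read_csv_reports is made up for this example.

```python
import csv
import io
import zipfile


def read_csv_reports(zip_path):
    """Map each CSV member name in the archive to a list of row dictionaries."""
    reports = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue
            with archive.open(name) as raw:
                # Zip members are binary streams; wrap them so csv can read text.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                reports[name] = list(csv.DictReader(text))
    return reports


if __name__ == "__main__":
    import sys

    for name, rows in read_csv_reports(sys.argv[1]).items():
        columns = list(rows[0]) if rows else []
        print(f"{name}: {len(rows)} rows, columns {columns}")
```

Note that an Illumina sample sheet is divided into sections rather than being a single table, so although it has a .csv extension the rows returned for it here will not be meaningful.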

The QC Report

We deliver with the data a report, which we also use ourselves to check that the sequencing has gone as expected. It contains:

Our demultiplexing and barcode balance reports. These give charts and numbers for the reads created from the sequencing. If there are indexing problems they'll show up here.
Our single cell reports. These are only present for 10x single cell lanes and are produced per sample.
FastQC: Assorted sequencing metrics. Where the lane is regular paired end sequencing, there will be a separate FastQC report for each read (FastQC analyses each FASTQ file on its own and is not paired end aware).
Multi Genome Alignment: A contaminant screen for the pool.

This report will be named <SLX>.<flowcell>.s_<lane>.html.

10x Visium Data

If any of your samples have been through the 10x CytAssist as part of the Visium work flow you will receive the image files alongside your sequencing data. They will be fetched automatically by the download tool (CRUK-CI researchers) or will be put into your FTP directory (external users). The CytAssist image file follows the naming convention:

<SLX>.<barcode>.<flowcell>.s_<lane>.cytassist.tif

There is an additional information file produced for Visium containing sample information (name and barcode label) together with our Visium identifier for each sample, the slide identifiers and the position on the slides.

<SLX>.<flowcell>.s_<lane>.visiuminfo.csv

10x OCM / GEM-X Flex

Libraries made with the 10x On-Chip Multiplexing (OCM) and GEM-X Flex protocols are pooled after an early stage barcoding step, then pooled again for sequencing. The data as delivered is based on the latter pooling step, which is the same step that all other sequencing work flows go through. Your lane of data is therefore demultiplexed on this barcoding only; the early stage indexing and pooling you will have to demultiplex yourself. To help with this, we provide an additional file for any lane that contains libraries of these types. The file will be:

<SLX>.<flowcell>.s_<lane>.pooling.csv

This file will contain the Flex pool name (the name given to the first pool made in the protocol, which carries no real significance), the Flex number (barcode), and your sample's name, repeated for every sample in the lane.

Olink Sequencing

If the Library Type of your library is “Olink” we will run the program ngs2counts on each Olink lane to get the counts across the flowcell for all lanes containing the same SLX pool. The additional file produced will be:

<SLX>.<flowcell>.s_<lane>.olink_counts.zip

The lane number will be the lowest lane number in which the SLX pool was sequenced, though the counts will be for all lanes containing that pool.

This is a change from the early Olink runs of 2025. Please see the page Olink Sequencing Counts for full details.

File Name Conversion

Some tools require the FASTQ files to be named as they would be when delivered by BCL Convert, not as we provide them. 10x's pipelines are a notable example. To this end, the script crukci_to_illumina.py renames the FASTQ files in a directory from our naming scheme to the names they would have had immediately after BCL Convert.

It changes our file names to the pattern:

<barcode>_S<number>_L00<lane>_[IR]<read>_001.fastq.gz

barcode is the barcode as it appears in our file name; number is an arbitrary sample number that BCL Convert adds to the file name (it corresponds to a row in the sample sheet but really it counts for little); lane is the lane number; [IR] is either 'R' for regular reads or 'I' for index reads; read is the read number.

The tool can be run with the command:

python3 crukci_to_illumina.py [<fastq directory>]

If fastq directory is not given, the script will look at files in the current directory. It does not recurse into subdirectories.
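
To make the mapping concrete, here is a minimal sketch of the name translation for a single file. It is not the crukci_to_illumina.py script itself and does not reproduce how that script chooses sample numbers. The function name to_illumina_name is invented, the caller supplies the sample number, and the sketch assumes that no part of our file name contains a full stop.

```python
import re

# Our naming scheme: <SLX>.<barcode>.<flowcell>.s_<lane>.[ri]_<read>.fq.gz
CRUKCI_RE = re.compile(
    r"^[^.]+\.(?P<barcode>[^.]+)\.[^.]+"
    r"\.s_(?P<lane>\d+)\.(?P<kind>[ri])_(?P<read>\d+)\.fq\.gz$"
)


def to_illumina_name(name, sample_number):
    """Translate one of our FASTQ file names to the BCL Convert style pattern.

    sample_number is arbitrary and chosen by the caller in this sketch.
    """
    match = CRUKCI_RE.match(name)
    if match is None:
        raise ValueError(f"Not a recognised FASTQ file name: {name}")
    return "{barcode}_S{number}_L00{lane}_{kind}{read}_001.fastq.gz".format(
        barcode=match["barcode"],
        number=sample_number,
        lane=match["lane"],
        kind=match["kind"].upper(),  # "r" becomes "R", "i" becomes "I"
        read=match["read"],
    )


if __name__ == "__main__":
    # The identifiers in this example name are invented.
    print(to_illumina_name("SLX-12345.SIGAH9.HXXXXDRXY.s_1.r_2.fq.gz", 5))
```

For real data, use the supplied script, which handles a whole directory at once.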
