Differences

support:demultiplexing [2024/04/11 12:12] – Copy from existing help with additions and changes Richard Bowers
support:demultiplexing [2025/08/06 09:10] (current) Richard Bowers
Line 3: Line 3:
 The data files the CRUK-CI sequencing service provides after sequencing your samples are demultiplexed according to the barcodes provided in the submission spreadsheet. It sometimes happens that users of the service make a mistake when filling in this spreadsheet and put an incorrect index against one or more samples, or samples are omitted from the submission spreadsheet. This manifests as those samples' FASTQ files being much smaller than one might expect, or not present at all.

-The good news is that the reads are still present in the data: they've just not been allocated to the sample. There is a file for each lane of sequencing called SLX-????.<flow cell id>.s_?.r_?.lostreads.fq.gz that contains all the reads that could not be allocated to any of the indexes listed in the submission spreadsheet.
+The good news is that the reads are still present in the data: they've just not been allocated to the sample. There is a file for each lane of sequencing called ''SLX-????.<flow cell id>.s_?.r_?.lostreads.fq.gz'' that contains all the reads that could not be allocated to any of the indexes listed in the submission spreadsheet.

 The bad news is that correcting the submission information regarding the indexes attached to the samples is troublesome to the point of impossibility in our Clarity LIMS system. We cannot therefore fix your submission to rerun demultiplexing and attach those files so you can fetch them with the download tool or on the FTP site. Rerunning demultiplexing is also not an automatic process: a little time and effort will be needed to reprocess the FASTQ files.
Line 13: Line 13:
 ===== Getting Started =====

-The initial creation and demultiplexing of the FASTQ files is done with [[https://support.illumina.com/sequencing/sequencing_software/bcl-convert.html|Illumina'//BCL Convert// program]]. This can only work from the intermediate proprietary files produced by the sequencer, so for any demultiplexing from FASTQ to FASTQ we use our own program, //demuxFQ//. You can download the tool using the links below:
+The initial creation and demultiplexing of the FASTQ files is done with [[https://support.illumina.com/sequencing/sequencing_software/bcl-convert.html|Illumina's BCL Convert program]]. This can only work from the intermediate proprietary files produced by the sequencer, so for any demultiplexing from FASTQ to FASTQ we use our own program, //demuxFQ//. You can download the tool using the links below:

-  * [[https://genomicshelp.cri.camres.org/tools/demultiplexer.rhel.tar.gz|Redhat / CentOS binary]] (RHEL 7 or newer)
-  * [[https://genomicshelp.cri.camres.org/tools/demultiplexer.macos.tar.gz|MAC OS X binary]]
-  * [[https://genomicshelp.cri.camres.org/tools/demultiplexer.cygwin64.zip|Windows Cygwin binary]] (requires Cygwin 64 bit)
-  * [[https://genomicshelp.cri.camres.org/tools/demultiplexer_src.tar.gz|Source tar ball]]
+  * [[https://genomicshelp.cruk.cam.ac.uk/tools/demultiplexer.rhel.tar.gz|Redhat / CentOS binary]] (RHEL 7 or newer)
+  * [[https://genomicshelp.cruk.cam.ac.uk/tools/demultiplexer.macos.tar.gz|MAC OS X binary]]
+  * [[https://genomicshelp.cruk.cam.ac.uk/tools/demultiplexer_src.tar.gz|Source tar ball]]

 If you build from source, the INSTALL file in the tar ball gives instructions on how to build the program.
Line 50: Line 49:
 The first column is the sequence of the first barcode of the pair. The second column is the sequence of the second barcode of the pair. The third column is the name of the file to write reads matching the barcode pair to. This should be a simple file name (not a path), and can be any name you wish. The output will be FASTQ and, optionally, gzip compressed. The fourth column is optional and is the sample name or some other user friendly identifier. It is used solely to make the summary report easier to read.
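
The column layout described above can be checked before running the demultiplexer. The following is only a sketch in Python, not part of //demuxFQ//: the names ''IndexEntry'' and ''parse_dual_index_line'' are hypothetical, the barcode pairing in the usage note is purely illustrative, and it assumes the columns are separated by whitespace, which this page does not state explicitly.

```python
from typing import NamedTuple, Optional


class IndexEntry(NamedTuple):
    """One line of a dual index file (hypothetical helper type)."""
    barcode1: str
    barcode2: str
    file_name: str
    label: Optional[str]


def parse_dual_index_line(line: str) -> IndexEntry:
    """Split one dual index line into its three or four columns.

    Assumption: columns are separated by runs of whitespace.
    """
    fields = line.split()
    if len(fields) not in (3, 4):
        raise ValueError(f"expected 3 or 4 columns, found {len(fields)}")
    barcode1, barcode2, file_name = fields[:3]
    # The third column should be a simple file name, not a path.
    if "/" in file_name:
        raise ValueError(f"file name must not be a path: {file_name}")
    # The fourth column (sample name or other identifier) is optional.
    label = fields[3] if len(fields) == 4 else None
    return IndexEntry(barcode1, barcode2, file_name, label)
```

For instance, ''parse_dual_index_line("AGTTCC  CAGATC  SLX-9214.Sample01.r_1.fq.gz  Sample_01")'' would yield an entry whose ''file_name'' is ''SLX-9214.Sample01.r_1.fq.gz'' and whose ''label'' is ''Sample_01''.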
  
-Note that there are complications when demultiplexing FASTQ files with dual indexes on some sequencing platforms. Refer to the section Complications with Dual Index Barcoding for details.
+Note that there are complications when demultiplexing FASTQ files with dual indexes on some sequencing platforms. Refer to the section [[support:demultiplexing#complications_with_dual_index_barcoding|Complications with Dual Index Barcoding]] for details.

 ==== All Kits ====

-The demultiplexer must be run with the ''-e'' option on the command line if the additional identifier column is used (see below). The identifiers must not contain spaces.
+The demultiplexer must be run with the ''-e'' option on the command line if the additional identifier column is used ([[support:demultiplexing#running_the_demultiplexer|see below]]). The identifiers must not contain spaces.

 === Paired End ===

 Regardless of the type of barcoding kit, you will need to produce a configuration file per read for your data. If your data is single end, one file is sufficient. If it is paired end, you will need a second configuration file with different file names for read two. In our naming conventions (as above), this is the "//r_?//" part of the name. So we have //r_1// for read one and //r_2// for read two. The easiest way to produce the configuration files is to prepare for one read, check it is correct, make a copy and tweak the file names for the second read, keeping the same index sequences.
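
The "copy and tweak" step for read two can be scripted. This is a minimal sketch in Python rather than anything supplied with //demuxFQ//: the function names ''to_read_two'' and ''read_two_config'' are hypothetical, and it assumes that the file names follow the ''r_1'' convention above, that columns are whitespace separated, and that the file name is the first column that is not a pure nucleotide sequence.

```python
import re

# A barcode column is assumed to consist only of nucleotide letters.
BARCODE = re.compile(r"[ACGTN]+")


def to_read_two(line: str) -> str:
    """Return the read two version of one read one index file line.

    Only the file name column is changed; barcodes, any label and the
    original spacing are kept exactly as they were.
    """
    # Splitting with a capture group keeps the whitespace separators,
    # so fields sit at even positions and separators at odd positions.
    parts = re.split(r"(\s+)", line)
    for i in range(0, len(parts), 2):
        field = parts[i]
        if field and not BARCODE.fullmatch(field):
            # First non-barcode column: assumed to be the file name.
            parts[i] = field.replace(".r_1.", ".r_2.")
            break
    return "".join(parts)


def read_two_config(read_one_text: str) -> str:
    """Convert the whole text of a read one index file."""
    return "\n".join(to_read_two(line) for line in read_one_text.splitlines())
```

It is still worth checking the resulting file by eye, since a file name that does not contain ''.r_1.'' is left unchanged.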
 +
 +==== Two or More Barcodes per Sample ====
 +
 +It is possible to demultiplex in such a way as to put reads from more than one barcode into a single file. One can do this by listing each barcode sequence in the index file with the same file name.
 +
 +<code>
 +AGTTCC  SLX-9214.Sample01.r_1.fq.gz  Sample_01
 +AGTCAA  SLX-9214.Sample01.r_1.fq.gz  Sample_01
 +CAGATC  SLX-9214.Sample03.r_1.fq.gz  Sample_03
 +CTTGTA  SLX-9214.Sample03.r_1.fq.gz  Sample_03
 +</code>
 +
 +The example above will create two FASTQ files, each containing reads tagged with two different barcodes.
 +
 +The most likely real world example of this is the old style single index 10x SIGA/SINA barcoding. These have four barcodes per sample, so do need to be combined into a single file for their relevant sample. For example, to demultiplex and extract the SINAA1 and SINAA2 barcodes from a multiplexed file one would use an index file similar to below.
 +
 +<code>
 +AAACGGCG  SLX-1000.SINAA1.r_1.fq.gz
 +CCTACCAT  SLX-1000.SINAA1.r_1.fq.gz
 +GGCGTTTC  SLX-1000.SINAA1.r_1.fq.gz
 +TTGTAAGA  SLX-1000.SINAA1.r_1.fq.gz
 +AGCCCTTT  SLX-1000.SINAA2.r_1.fq.gz
 +CAAGTCCA  SLX-1000.SINAA2.r_1.fq.gz
 +GTGAGAAG  SLX-1000.SINAA2.r_1.fq.gz
 +TCTTAGGC  SLX-1000.SINAA2.r_1.fq.gz
 +</code>
 +
 +The examples here are for single index, but dual index works in exactly the same way with the additional column for the i5 sequence.
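
To see which barcodes will end up in the same output file, the index file can be grouped by file name. The sketch below is an illustration in Python, not part of the demultiplexer: ''barcodes_by_file'' is a hypothetical helper, and it assumes the single index layout shown above (barcode first, file name second) with whitespace separated columns.

```python
from collections import defaultdict
from typing import Dict, List


def barcodes_by_file(index_text: str) -> Dict[str, List[str]]:
    """Map each output file name to the barcodes that write to it.

    Assumes a single index file: column one is the barcode and
    column two is the output file name.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for line in index_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        barcode, file_name = fields[0], fields[1]
        groups[file_name].append(barcode)
    return dict(groups)


example = """\
AGTTCC  SLX-9214.Sample01.r_1.fq.gz  Sample_01
AGTCAA  SLX-9214.Sample01.r_1.fq.gz  Sample_01
CAGATC  SLX-9214.Sample03.r_1.fq.gz  Sample_03
CTTGTA  SLX-9214.Sample03.r_1.fq.gz  Sample_03
"""

for file_name, barcodes in barcodes_by_file(example).items():
    print(file_name, ", ".join(barcodes))
```

Run on the first example in this section, this lists two file names with two barcodes each, which is a quick way to confirm that the grouping is what was intended.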
  
 ===== Running the Demultiplexer =====
Line 76: Line 103:
 | ''-d'' | Actually perform the demultiplexing. This causes the FASTQ files to be written. Without the option, the source FASTQ is analysed as if it were being demultiplexed but no output files are produced, giving just a report of what indexes were found in the source files. |
 | ''-e'' | Add user friendly labels in the demultiplexing reports. This option tells the demultiplexer that the final column in the index file is a user friendly name associated with the barcode (typically a sample name). If the column is present in the index file, this option must be provided on the command line. |
-| ''-i'' | The read headers are in (now) standard Illumina format. See the section Read Header Formats. |
-| ''-l [number]'' | Explicitly tell the demultiplexer the length of the first index in a dual index kit. Sometimes more index cycles are created than the length of the barcode and this causes problems with dual index kits. Again, see the section Read Header Formats for further details. |
+| ''-i'' | The read headers are in (now) standard Illumina format. See the section [[support:demultiplexing#read_header_formats|Read Header Formats]]. |
+| ''-l [number]'' | Explicitly tell the demultiplexer the length of the first index in a dual index kit. Sometimes more index cycles are created than the length of the barcode and this causes problems with dual index kits. Again, see the section [[support:demultiplexing#read_header_formats|Read Header Formats]] for further details. |
 | ''-m'' | Generate MD5 checksums for the output files. |
 | ''-n'' | Prevents the addition of the "''.gz''" suffix to output file names even when the ''-c'' option is used. |
 | ''-o [directory]'' | Specify an output directory for the demultiplexed files. Without this option, the files are written to the current working directory. |
-| ''-R'' | Use the reverse complement of the sequence given in the sample sheet for the second index of a dual index pair. This can be necessary depending on the sequencing technology used. See the section Complications with Dual Index Barcoding. |
+| ''-R'' | Use the reverse complement of the sequence given in the sample sheet for the second index of a dual index pair. This can be necessary depending on the sequencing technology used. See the section [[support:demultiplexing#complications_with_dual_index_barcoding|Complications with Dual Index Barcoding]]. |
 | ''-r [number]'' | Report barcodes in the summary that appear with a frequency greater than this number. This is a fraction of the total number of reads. Barcodes that appear frequently that are not in the index file can indicate a mislabelling. The default is 0.001 (or 0.1%). |
 | ''-S'' | Summarise the barcodes in the FASTQ files without a sample sheet. This gives a report that will tell you what indexes are found in the source files regardless of what indexes are expected (or indeed if you do not know the expected indexes). |
-| ''-s [file]'' | Write a summary report to the given file (use "''-''" as the file name to write to standard out). The report gives details of the indexes found, their frequency and so forth. It is the report you can find in the sequencing section of this web site. See the section Report Information. |
+| ''-s [file]'' | Write a summary report to the given file (use "''-''" as the file name to write to standard out). The report gives details of the indexes found, their frequency and so forth. It is the report you can find in the sequencing section of this web site. See the section [[support:demultiplexing#report_information|Report Information]]. |
 | ''-t [number]'' | Allow up to the given number of mismatched nucleotides compared to the expected index (Hamming distance). |
 | ''-v'' | Print the version of the demultiplexer and exit. |
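
Two of the ideas in the table, the reverse complement used by ''-R'' and the Hamming distance used by ''-t'', can be illustrated with a short Python sketch. This is not the //demuxFQ// implementation: the function names are hypothetical, and the rule of rejecting a read that matches more than one expected barcode is an assumption made for this example only.

```python
from typing import Iterable, Optional

# Translation table for complementing a nucleotide sequence.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence: str) -> str:
    """Complement each base, then reverse the sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))


def best_match(observed: str, expected: Iterable[str],
               max_mismatches: int) -> Optional[str]:
    """Return the single expected barcode within the mismatch limit.

    Assumption for this sketch: if zero or several expected barcodes
    are within the limit, the read is treated as unassigned (None).
    """
    hits = [
        barcode for barcode in expected
        if len(barcode) == len(observed)
        and hamming_distance(observed, barcode) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None
```

With these definitions, ''reverse_complement("AGTTCC")'' gives ''GGAACT'', and an observed index of ''AGTTCA'' matches the expected ''AGTTCC'' when one mismatch is allowed but not when none are.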