Oxford Nanopore Sequencing Data Files
File Types
Data files from Oxford Nanopore Technologies (ONT) sequencing are not quite as straightforward as those from Illumina sequencing. The format you receive depends on the type of experiment.
BAM Files
If your experiment requires base modifications to be captured, you will receive your data as unaligned BAM files.
FASTQ Files
If your experiment doesn't require the capture of base modifications, you will receive gzip-compressed FASTQ files.
Pod5 Files
Pod5 is a file format developed by Oxford Nanopore. It can be considered an intermediate format but, unlike the Illumina intermediate files, it can be reprocessed. For this reason we always deliver the Pod5 files as produced by the sequencer alongside the BAM or FASTQ data.
File Naming
The ONT data files have an additional component to them, here referred to as <hash>. This is an arbitrary hexadecimal number generated by the sequencer and put into the run identifier. For example, in 20240521_1826_1C_PAW28166_38d23152 the hash is 38d23152. This is required because an ONT run can be stopped and restarted with the same pool and flowcell, and in such a case the files produced would overwrite one another on the FTP site and when downloaded with clarity-tools.jar.
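As an illustration, the hash can be pulled out of a run identifier by taking its final underscore-separated field. This is a minimal sketch that assumes the hash is always the last field, as it is in the example above; the function name run_hash is hypothetical and not part of any delivered tool.

```python
import string


def run_hash(run_id: str) -> str:
    """Return the trailing hash field of an ONT run identifier."""
    # Assumption: the hash is the last underscore-separated field.
    _, separator, hash_part = run_id.rpartition("_")
    if not separator or not hash_part:
        raise ValueError(f"no hash field in run identifier: {run_id!r}")
    # The hash is described as hexadecimal, so reject anything else.
    if any(char not in string.hexdigits for char in hash_part):
        raise ValueError(f"hash is not hexadecimal: {hash_part!r}")
    return hash_part


print(run_hash("20240521_1826_1C_PAW28166_38d23152"))  # prints 38d23152
```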
In all other respects, the files will be named using the same pattern as files from Illumina sequencing:
<SLX>.<barcode>.<flowcell>.<hash>.s_<lane>.r_<read>.bam
<SLX>.<barcode>.<flowcell>.<hash>.s_<lane>.md5sums.txt
<SLX>.<flowcell>.<hash>.s_<lane>.lostreads.bam
<SLX>.<flowcell>.<hash>.s_<lane>.lostreads.md5sums.txt
Runs on the PromethION are more likely than Illumina runs to have no barcoding, in which case the files will be:
<SLX>.NoIndex.<flowcell>.<hash>.s_<lane>.r_<read>.bam
<SLX>.NoIndex.<flowcell>.<hash>.md5sums.txt
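The naming pattern for the per-barcode BAM files can be taken apart with a regular expression, which may be handy when sorting a download by barcode or by run. The sketch below assumes that no field contains a dot and that the lane and read fields are numeric; the file name in the example (including the SLX identifier) is made up for illustration. It covers only the per-barcode .bam pattern: the lost reads files have no barcode field and will not match.

```python
import re

# Matches <SLX>.<barcode>.<flowcell>.<hash>.s_<lane>.r_<read>.bam
# Assumptions: fields never contain a dot; lane and read are numeric.
DATA_FILE = re.compile(
    r"^(?P<slx>[^.]+)\.(?P<barcode>[^.]+)\.(?P<flowcell>[^.]+)\."
    r"(?P<hash>[0-9a-fA-F]+)\.s_(?P<lane>\d+)\.r_(?P<read>\d+)\.bam$"
)


def parse_data_file(name: str) -> dict:
    """Split a delivered BAM file name into its named fields."""
    match = DATA_FILE.match(name)
    if match is None:
        raise ValueError(f"not a recognised data file name: {name!r}")
    return match.groupdict()


# Hypothetical file name, for illustration only.
info = parse_data_file("SLX-12345.NoIndex.PAW28166.38d23152.s_1.r_1.bam")
print(info["barcode"], info["hash"])  # prints NoIndex 38d23152
```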
The Pod5 files are delivered in a TAR file. The structure inside this file is the directory structure of the run's pod5 directory.
<SLX>.<flowcell>.<hash>.s_<lane>.pod5.tar
<SLX>.<flowcell>.<hash>.s_<lane>.pod5.tar.md5sums.txt
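After downloading, it is worth checking the files against their checksums and looking inside the TAR file before unpacking it. The following is a sketch only, using the Python standard library. It assumes the md5sums.txt files use the usual md5sum layout (a hexadecimal digest, white space, then the file name) and that the named files sit in the same directory as the checksum file; the function names are hypothetical.

```python
import hashlib
import tarfile
from pathlib import Path


def md5_of(path, chunk_size=1024 * 1024):
    """Compute the MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_md5sums(md5sums_file):
    """Return {file name: True/False} for each entry in a checksum file.

    Assumption: lines look like "<digest>  <file name>", as written by md5sum.
    """
    md5sums_file = Path(md5sums_file)
    results = {}
    for line in md5sums_file.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # md5sum prefixes binary-mode names with '*'
        results[name] = md5_of(md5sums_file.parent / name) == expected.lower()
    return results


def list_tar_files(tar_path):
    """List the regular files inside a TAR archive without extracting it."""
    with tarfile.open(tar_path) as archive:
        return [member.name for member in archive.getmembers() if member.isfile()]
```

Calling verify_md5sums on a downloaded md5sums.txt file returns a dictionary with False against any file whose checksum does not match, and list_tar_files shows the run's pod5 directory structure before anything is unpacked.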
File Retention
This section is only relevant for CRUK-CI researchers. External users of the service will be able to retrieve their files from the FTP site for thirty days as normal.
Until now we have kept CRUK-CI groups' sequencing data in the archive indefinitely. This will not be entirely the case for ONT sequencing.
We will keep the BAM or FASTQ files indefinitely.
The Pod5 files will be kept only until the tier two storage fills up and we need to move older runs to the archive. This is a minimum of one month and normally a maximum of seven months, depending on the volume of data coming out of the sequencing service. If you want to keep the Pod5 files for longer, it is your responsibility to take a copy and arrange your own storage for them. You should assume that they will be available for one month only: they may be around for longer, but this is not guaranteed.
The Pod5 files are large: typically nine times the size of the equivalent BAM. It is not practical to keep these files in the archive just in case they might be needed later.
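If you plan to keep your own copy, the factor of nine gives a rough way to check whether a destination has enough room. This is a back-of-the-envelope sketch: the ratio is only the typical figure quoted above and real runs will vary, and the function names are hypothetical.

```python
import shutil

# "Typically nine times the size of the equivalent BAM"; real runs will vary.
POD5_TO_BAM_RATIO = 9


def estimated_pod5_bytes(bam_bytes, ratio=POD5_TO_BAM_RATIO):
    """Rough estimate of Pod5 size from the size of the equivalent BAM."""
    return bam_bytes * ratio


def has_room_for_pod5(bam_bytes, destination, ratio=POD5_TO_BAM_RATIO):
    """True if the destination file system has space for the estimate."""
    free_bytes = shutil.disk_usage(destination).free
    return estimated_pod5_bytes(bam_bytes, ratio) <= free_bytes
```

For example, has_room_for_pod5(bam_size, "/path/to/storage") compares nine times the BAM size in bytes with the free space reported for that (placeholder) path.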
Supporting Tools
Oxford Nanopore have produced a library for reading and processing Pod5 files. The documentation for this library is at https://pod5-file-format.readthedocs.io and its home on GitHub is https://github.com/nanoporetech/pod5-file-format.
The library comes with some Python tools around the C++ core that allow you to manipulate the files. There is one shortcoming in the tool set though: it lacks an easy way to split a large Pod5 file into chunks of a fixed size (by number of reads). We have created a tool for this job, which is available at https://github.com/crukci-bioinformatics/pod5split.
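To show what splitting by number of reads means, the sketch below groups a sequence of read identifiers into fixed-size chunks, with the last chunk holding whatever is left over. It is not how pod5split is implemented and it does not touch Pod5 files at all: it works on plain identifiers, and the identifiers in the example are made up.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(read_ids: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` read identifiers."""
    if size < 1:
        raise ValueError("chunk size must be at least 1")
    iterator = iter(read_ids)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Hypothetical read identifiers, for illustration only.
ids = [f"read-{number}" for number in range(7)]
sizes = [len(chunk) for chunk in chunked(ids, 3)]
print(sizes)  # prints [3, 3, 1]
```

Because the function consumes an iterator lazily, it never needs the full list of identifiers in memory, which matters when a run contains many millions of reads.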