Oxford Nanopore Sequencing Data Files

File Types

The data files from Oxford Nanopore Technologies (ONT) sequencing are not quite as straightforward as those from Illumina sequencing. The format you receive depends on the type of experiment.

BAM Files

If your experiment needs to capture base modifications, you will receive your data as unaligned BAM files.

FASTQ Files

If your experiment doesn't require the capture of base modifications, you will receive gzip-compressed FASTQ files.
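As a quick sanity check on a delivered FASTQ file, something like the following minimal sketch could be used to count reads and bases. It uses only the Python standard library and assumes the usual four-line FASTQ record layout (header, sequence, separator, quality); the function names are illustrative and not part of any delivered tooling.

```python
import gzip
from itertools import islice


def iter_fastq(path):
    """Yield (name, sequence, quality) tuples from a gzip-compressed FASTQ file."""
    with gzip.open(path, "rt") as handle:
        while True:
            # Assumption: each record is exactly four lines
            # (header, sequence, separator, quality).
            record = list(islice(handle, 4))
            if not record:
                break
            if len(record) < 4 or not record[0].startswith("@"):
                raise ValueError(f"Malformed FASTQ record in {path}")
            header, sequence, _, quality = (line.rstrip("\n") for line in record)
            yield header[1:], sequence, quality


def summarise_fastq(path):
    """Return (number of reads, total number of bases) for the file."""
    reads = 0
    bases = 0
    for _, sequence, _ in iter_fastq(path):
        reads += 1
        bases += len(sequence)
    return reads, bases
```

Calling `summarise_fastq` on a delivered file returns the read count and total base count as a tuple.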

Pod5 Files

Pod5 is a file format developed by Oxford Nanopore. It can be considered an intermediate format but, unlike the Illumina intermediate files, it can be reprocessed. For this reason we always deliver the Pod5 equivalent of the BAM or FASTQ files from a sequencing run.

File Naming

The files will be named using the same pattern as files from Illumina sequencing:

<SLX>.<barcode>.<flowcell>.s_<lane>.r_<read>.bam
<SLX>.<barcode>.<flowcell>.s_<lane>.md5sums.txt
<SLX>.<barcode>.<flowcell>.s_<lane>.r_<read>.pod5
<SLX>.<barcode>.<flowcell>.s_<lane>.pod5.md5sums.txt
<SLX>.<flowcell>.s_<lane>.lostreads.bam
<SLX>.<flowcell>.s_<lane>.lostreads.md5sums.txt
<SLX>.<flowcell>.s_<lane>.lostreads.pod5
<SLX>.<flowcell>.s_<lane>.lostreads.pod5.md5sums.txt

Runs on the PromethION are more likely to have no barcoding than runs on the Illumina instruments. In that case the files will be named:

<SLX>.NoIndex.<flowcell>.s_<lane>.r_<read>.bam
<SLX>.NoIndex.<flowcell>.md5sums.txt
<SLX>.NoIndex.<flowcell>.s_<lane>.r_<read>.pod5
<SLX>.NoIndex.<flowcell>.pod5.md5sums.txt
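If you need to pick the fields out of these names in a script, a regular expression along the following lines may help. This is a minimal sketch based only on the BAM and Pod5 patterns listed above; it does not cover the md5sums files, and the example file name in the usage note is made up for illustration.

```python
import re

# <SLX>.<barcode>.<flowcell>.s_<lane>.r_<read>.bam (or .pod5)
DATA_FILE = re.compile(
    r"^(?P<slx>[^.]+)\.(?P<barcode>[^.]+)\.(?P<flowcell>[^.]+)"
    r"\.s_(?P<lane>[^.]+)\.r_(?P<read>[^.]+)\.(?P<type>bam|pod5)$"
)

# <SLX>.<flowcell>.s_<lane>.lostreads.bam (or .pod5)
LOST_READS = re.compile(
    r"^(?P<slx>[^.]+)\.(?P<flowcell>[^.]+)"
    r"\.s_(?P<lane>[^.]+)\.lostreads\.(?P<type>bam|pod5)$"
)


def parse_name(filename):
    """Return a dict of name fields, or None if the name fits neither pattern."""
    match = DATA_FILE.match(filename)
    if match:
        return match.groupdict()
    match = LOST_READS.match(filename)
    if match:
        fields = match.groupdict()
        # Lost reads files carry no barcode or read number in their names.
        fields["barcode"] = None
        fields["read"] = None
        return fields
    return None
```

For example, `parse_name("SLX-12345.NoIndex.PAK00001.s_1.r_1.bam")` (a hypothetical name) gives a dictionary with `barcode` set to `"NoIndex"` and `type` set to `"bam"`.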

File Retention

This section is only relevant for CRUK-CI researchers. External users of the service will be able to retrieve their files from the FTP site for thirty days as normal.

Until now we have kept CRUK-CI groups' sequencing data in the archive indefinitely. This will not be entirely the case for ONT sequencing.

We will keep the BAM or FASTQ files indefinitely.

The Pod5 files will be kept only until the tier two storage fills up and we need to move older runs to the archive. This is a minimum of one month and normally a maximum of seven months, depending on the volume of data coming out of the sequencing service. If you want to keep the Pod5 files for longer, it is your responsibility to take a copy and arrange your own storage for them. You should assume that they will be available for only one month: they may be around for longer, but this is not guaranteed.
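When taking your own copy of the Pod5 files, it may be worth checking the copy against the accompanying md5sums file. The following is a minimal sketch using only the Python standard library. It assumes the md5sums file uses the layout written by the `md5sum` command (a checksum, whitespace, then a file name on each line) and that the data files sit in the same directory as the md5sums file; both are assumptions that should be checked against your delivery.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5sums(md5sums_file):
    """Return {file name: True/False} for each entry in an md5sums file.

    Assumption: each non-blank line is "<checksum> <file name>", as written
    by the md5sum command, and files are located next to the md5sums file.
    """
    md5sums_file = Path(md5sums_file)
    results = {}
    for line in md5sums_file.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # md5sum marks binary mode with a leading '*'
        target = md5sums_file.parent / name
        results[name] = target.is_file() and md5_of(target) == expected.lower()
    return results
```

Any entry reported as `False` is either missing or differs from the checksum recorded at delivery, and should be copied again.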

The Pod5 files are large: typically nine times the size of the equivalent BAM. It is not practical to keep these files in the archive just in case they might be needed later.

Supporting Tools

Oxford Nanopore have produced a library for reading and processing Pod5 files. The documentation for this library is at https://pod5-file-format.readthedocs.io and its home on GitHub is https://github.com/nanoporetech/pod5-file-format.

The library comes with some Python tools around the C++ core that allow you to manipulate the files. There is one shortcoming in the tool set though: it lacks an easy way to split a large Pod5 file into chunks of a fixed size (by number of reads). We have created a tool for this job, which is available at https://github.com/crukci-bioinformatics/pod5split.

The PromethION creates many Pod5 files as it runs. It is impractical for us to distribute such a large collection of files, so these small files are merged into one very big Pod5 file (using the pod5 merge tool). The ONT pipeline runs very slowly with a single big file though, so we recommend splitting the file again using the tool we have written.
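To illustrate what splitting by number of reads means, the sketch below groups a sequence of read identifiers into fixed-size chunks and assigns each chunk an output file name. It is not how pod5split is implemented and does not read or write Pod5 files itself; the read identifiers would in practice come from the Pod5 library, and the output naming scheme here is purely hypothetical.

```python
from itertools import islice


def chunk_read_ids(read_ids, reads_per_chunk):
    """Yield successive lists of at most reads_per_chunk read IDs."""
    if reads_per_chunk < 1:
        raise ValueError("reads_per_chunk must be at least 1")
    iterator = iter(read_ids)
    while True:
        chunk = list(islice(iterator, reads_per_chunk))
        if not chunk:
            return
        yield chunk


def plan_split(read_ids, reads_per_chunk, prefix="chunk"):
    """Map output file names (hypothetical naming) to the read IDs for each."""
    return {
        f"{prefix}.{index:04d}.pod5": chunk
        for index, chunk in enumerate(
            chunk_read_ids(read_ids, reads_per_chunk), start=1
        )
    }
```

Every chunk except possibly the last holds exactly `reads_per_chunk` reads, so ten reads split into chunks of four would give files holding four, four and two reads.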