--- data:ont [2024/04/11 10:52] – Created the first draft Richard Bowers
+++ data:ont [2024/06/10 11:03] (current) – Richard Bowers
@@ Line 15: / Line 15: @@
 ==== Pod5 Files ====
-Pod5 is a proprietary format developed by Oxford Nanopore. It can be considered an intermediate format, but can also be reprocessed in a way that the Illumina intermediate files cannot. Thus we always deliver the Pod5 equivalent of the BAM or FASTQ from a sequencing run.
+Pod5 is a proprietary format developed by Oxford Nanopore. It can be considered an intermediate format, but can also be reprocessed in a way that the Illumina intermediate files cannot. Thus we always deliver the Pod5 files as produced by the sequencer alongside the BAM or FASTQ data.
 ===== File Naming =====
-The files will be named using the same pattern as files from [[data:illumina|Illumina sequencing]]:
+ONT data file names have an additional component, referred to here as ''<hash>''. This is an arbitrary hexadecimal number generated by the sequencer and included in the run identifier. For example, in ''20240521_1826_1C_PAW28166_38d23152'' the //hash// is ''38d23152''. It is required because an ONT run can be stopped and restarted with the same pool and flowcell; without it, the files from the two runs would overwrite one another on the FTP site and when downloaded with //clarity-tools.jar//.
+In all other respects, the files will be named using the same pattern as files from [[data:illumina|Illumina sequencing]]:
 <code>
-<SLX>.<barcode>.<flowcell>.s_<lane>.r_<read>.bam
+<SLX>.<barcode>.<flowcell>.<hash>.s_<lane>.r_<read>.bam
-<SLX>.<barcode>.<flowcell>.s_<lane>.md5sums.txt
+<SLX>.<barcode>.<flowcell>.<hash>.s_<lane>.md5sums.txt
-<SLX>.<barcode>.<flowcell>.s_<lane>.r_<read>.pod5
+<SLX>.<flowcell>.<hash>.s_<lane>.lostreads.bam
-<SLX>.<barcode>.<flowcell>.s_<lane>.pod5.md5sums.txt
+<SLX>.<flowcell>.<hash>.s_<lane>.lostreads.md5sums.txt
-<SLX>.<flowcell>.s_<lane>.lostreads.bam
-<SLX>.<flowcell>.s_<lane>.lostreads.md5sums.txt
-<SLX>.<flowcell>.s_<lane>.lostreads.pod5
-<SLX>.<flowcell>.s_<lane>.lostreads.pod5.md5sums.txt
 </code>
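As a rough illustration of the naming pattern above, the following Python sketch splits a per-barcode BAM file name into its components and pulls the hash out of a run identifier. The function names, the regular expression and the sample file name in the usage note are assumptions made for this example; only the run identifier is taken from the text.

```python
import re

# Assumed reading of <SLX>.<barcode>.<flowcell>.<hash>.s_<lane>.r_<read>.bam:
# no component contains a dot, and the hash is a hexadecimal string.
BAM_NAME = re.compile(
    r"^(?P<slx>[^.]+)\.(?P<barcode>[^.]+)\.(?P<flowcell>[^.]+)"
    r"\.(?P<hash>[0-9a-fA-F]+)\.s_(?P<lane>[^.]+)\.r_(?P<read>[^.]+)\.bam$"
)


def parse_bam_name(name):
    """Return the name components as a dict, or None if the name does not fit."""
    match = BAM_NAME.match(name)
    return match.groupdict() if match else None


def hash_from_run_id(run_id):
    """Return the final underscore-separated field of a run identifier."""
    return run_id.rsplit("_", 1)[-1]
```

With these definitions, ''hash_from_run_id("20240521_1826_1C_PAW28166_38d23152")'' gives ''38d23152'', and ''parse_bam_name'' on a hypothetical name such as ''SLX-00000.barcode01.PAW28166.38d23152.s_1.r_1.bam'' returns the six components by name. Lost-reads files and names without a hash do not fit the pattern and yield ''None''.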
@@ Line 35: / Line 33: @@
 <code>
-<SLX>.NoIndex.<flowcell>.s_<lane>.r_<read>.bam
+<SLX>.NoIndex.<flowcell>.<hash>.s_<lane>.r_<read>.bam
-<SLX>.NoIndex.<flowcell>.md5sums.txt
+<SLX>.NoIndex.<flowcell>.<hash>.md5sums.txt
-<SLX>.NoIndex.<flowcell>.s_<lane>.r_<read>.pod5
+</code>
-<SLX>.NoIndex.<flowcell>.pod5.md5sums.txt
+The Pod5 files are delivered in a TAR file. The directory structure inside this file mirrors that of the run's //pod5// directory.
+<code>
+<SLX>.<flowcell>.<hash>.s_<lane>.pod5.tar
+<SLX>.<flowcell>.<hash>.s_<lane>.pod5.tar.md5sums.txt
 </code>
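After downloading, it may be useful to check the TAR file against its checksum file and to see which Pod5 files it contains before unpacking. The sketch below uses only the Python standard library. It assumes the ''md5sums.txt'' file follows the usual ''md5sum'' layout (hex digest, whitespace, file name), which is not confirmed here, and all function names are made up for the example.

```python
import hashlib
import os
import tarfile


def md5_of_file(path, block_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def expected_md5s(md5sums_path):
    """Read 'digest  filename' lines (assumed md5sum layout) into a dict."""
    expected = {}
    with open(md5sums_path) as handle:
        for line in handle:
            parts = line.split(None, 1)
            if len(parts) == 2:
                digest, name = parts
                expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def tar_matches_checksum(tar_path, md5sums_path):
    """True if the TAR file's digest equals the one listed under its name."""
    listed = expected_md5s(md5sums_path).get(os.path.basename(tar_path))
    return listed is not None and listed == md5_of_file(tar_path)


def pod5_members(tar_path):
    """List the .pod5 files inside the TAR without extracting anything."""
    with tarfile.open(tar_path) as archive:
        return [m.name for m in archive.getmembers()
                if m.isfile() and m.name.endswith(".pod5")]
```

Listing the members first shows the paths that will be created on extraction, which follow the run's //pod5// directory layout.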
@@ Line 52: / Line 55: @@
 The Pod5 files are large: typically nine times the size of the equivalent BAM. It is not practical to keep these files in the archive just in case they might be needed later.
+===== Supporting Tools =====
+Oxford Nanopore have produced a library for reading and processing Pod5 files. The documentation for this library is at [[https://pod5-file-format.readthedocs.io]] and its home on GitHub is [[https://github.com/nanoporetech/pod5-file-format]].
+The library comes with some Python tools around the C++ core that allow you to manipulate the files. The tool set has one shortcoming, though: it offers no easy way to split a large Pod5 file into chunks of a fixed size (by number of reads). We have created a tool for this job, which is available at [[https://github.com/crukci-bioinformatics/pod5split]].
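To make the idea of splitting by number of reads concrete, here is a minimal Python sketch. It is not how //pod5split// is implemented. ''chunked'' is a plain standard-library helper; ''split_pod5'' is a hypothetical function that assumes the ''pod5'' Python package is installed and that its ''Reader'', ''Writer'', ''reads()'', ''to_read()'' and ''add_read()'' calls behave as described in the library's documentation. The output naming scheme is also invented for the example.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def split_pod5(source, prefix, reads_per_file):
    """Write the reads of `source` to numbered files of limited read count.

    Assumption: relies on the pod5 package's Reader/Writer API; the
    '<prefix>.<index>.pod5' output names are hypothetical.
    """
    # Imported here so that chunked() can be used without pod5 installed.
    import pod5

    outputs = []
    with pod5.Reader(source) as reader:
        for index, records in enumerate(chunked(reader.reads(), reads_per_file)):
            target = f"{prefix}.{index:04d}.pod5"
            with pod5.Writer(target) as writer:
                for record in records:
                    writer.add_read(record.to_read())
            outputs.append(target)
    return outputs
```

Each chunk is held in memory while its file is written, so a smaller ''reads_per_file'' also lowers peak memory use. For real data, the //pod5split// tool linked above is the intended route.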