Differences

This shows you the differences between two versions of the page.

data:ont [2024/04/11 10:52] – Created the first draft Richard Bowers
data:ont [2024/06/10 11:03] (current) Richard Bowers
Line 15: Line 15:
 ==== Pod5 Files ====
  
-Pod5 is a proprietary format developed by Oxford Nanopore. It can be considered an intermediate format, but can also be reprocessed in a way that the Illumina intermediate files cannot. Thus we always deliver the Pod5 equivalent of the BAM or FASTQ from a sequencing run.
+Pod5 is a proprietary format developed by Oxford Nanopore. It can be considered an intermediate format, but can also be reprocessed in a way that the Illumina intermediate files cannot. Thus we always deliver the Pod5 files as produced by the sequencer with the BAM or FASTQ data.
  
 ===== File Naming =====
  
-The files will be named using the same pattern as files from [[data:illumina|Illumina sequencing]]:
+The ONT data files have an additional component in their names, here referred to as ''<hash>''. This is an arbitrary hexadecimal number generated by the sequencer and put into the run identifier. For example, in ''20240521_1826_1C_PAW28166_38d23152'' the //hash// is ''38d23152''. This is required because an ONT run can be stopped and restarted with the same pool and flowcell. Without the hash, the files from the two runs would overwrite one another on the FTP site and when downloaded with //clarity-tools.jar//.
+
+In all other respects, the files will be named using the same pattern as files from [[data:illumina|Illumina sequencing]]:
  
 <code>
-<SLX>.<barcode>.<flowcell>.s_<lane>.r_<read>.bam
-<SLX>.<barcode>.<flowcell>.s_<lane>.md5sums.txt
-<SLX>.<barcode>.<flowcell>.s_<lane>.r_<read>.pod5
-<SLX>.<barcode>.<flowcell>.s_<lane>.pod5.md5sums.txt
-<SLX>.<flowcell>.s_<lane>.lostreads.bam
-<SLX>.<flowcell>.s_<lane>.lostreads.md5sums.txt
-<SLX>.<flowcell>.s_<lane>.lostreads.pod5
-<SLX>.<flowcell>.s_<lane>.lostreads.pod5.md5sums.txt
+<SLX>.<barcode>.<flowcell>.<hash>.s_<lane>.r_<read>.bam
+<SLX>.<barcode>.<flowcell>.<hash>.s_<lane>.md5sums.txt
+<SLX>.<flowcell>.<hash>.s_<lane>.lostreads.bam
+<SLX>.<flowcell>.<hash>.s_<lane>.lostreads.md5sums.txt
 </code>
  
Line 35: Line 33:
  
 <code>
-<SLX>.NoIndex.<flowcell>.s_<lane>.r_<read>.bam
-<SLX>.NoIndex.<flowcell>.md5sums.txt
-<SLX>.NoIndex.<flowcell>.s_<lane>.r_<read>.pod5
-<SLX>.NoIndex.<flowcell>.pod5.md5sums.txt
+<SLX>.NoIndex.<flowcell>.<hash>.s_<lane>.r_<read>.bam
+<SLX>.NoIndex.<flowcell>.<hash>.md5sums.txt
+</code>
+
+The Pod5 files are delivered in a TAR file. The structure inside this file is the directory structure of the run's //pod5// directory.
+
+<code>
+<SLX>.<flowcell>.<hash>.s_<lane>.pod5.tar
+<SLX>.<flowcell>.<hash>.s_<lane>.pod5.tar.md5sums.txt
 </code>
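
As an illustration of the current naming pattern, the following is a minimal Python sketch that splits a barcoded BAM file name into its components. It is not part of any delivered tooling. The function name ''parse_bam_name'' and the sample names ''SLX-12345'' and ''barcode01'' are hypothetical, and the sketch assumes that no component contains a dot, that the hash is lowercase hexadecimal, and that lane and read are numeric.

```python
import re

# Assumptions (not confirmed by the page): components never contain a dot,
# <hash> is lowercase hexadecimal, <lane> and <read> are integers.
BAM_NAME = re.compile(
    r"^(?P<slx>[^.]+)"
    r"\.(?P<barcode>[^.]+)"
    r"\.(?P<flowcell>[^.]+)"
    r"\.(?P<hash>[0-9a-f]+)"
    r"\.s_(?P<lane>\d+)"
    r"\.r_(?P<read>\d+)"
    r"\.bam$"
)


def parse_bam_name(name):
    """Return the name components as a dict, or None if the name does not match."""
    match = BAM_NAME.match(name)
    return match.groupdict() if match else None


if __name__ == "__main__":
    # Hypothetical file name built from the example run identifier on this page.
    print(parse_bam_name("SLX-12345.barcode01.PAW28166.38d23152.s_1.r_1.bam"))
```

A name in the older pattern, without the ''<hash>'' component, does not match and yields ''None'', which gives a quick way to tell the two naming schemes apart.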
  
Line 52: Line 55:
  
 The Pod5 files are large: typically nine times the size of the equivalent BAM. It is not practical to keep these files in the archive just in case they might be needed later.
 +
 +===== Supporting Tools =====
 +
 +Oxford Nanopore have produced a library for reading and processing Pod5 files. The documentation for this library is at [[https://pod5-file-format.readthedocs.io]] and its home on GitHub is [[https://github.com/nanoporetech/pod5-file-format]].
 +
 +The library comes with some Python tools around the C++ core that allow you to manipulate the files. The tool set has one shortcoming, though: it lacks an easy way to split a large Pod5 file into chunks of a fixed size (by number of reads). We have created a tool for this job, which is available at [[https://github.com/crukci-bioinformatics/pod5split]].
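
The idea of splitting by a fixed number of reads can be sketched in plain Python. This is only an illustration of the batching logic, not how //pod5split// is implemented. It works on any iterable and deliberately leaves out the reading and writing of Pod5 files. The function name ''fixed_size_chunks'' is hypothetical.

```python
from itertools import islice


def fixed_size_chunks(reads, chunk_size):
    """Yield successive lists of at most chunk_size items from reads.

    Every chunk except possibly the last holds exactly chunk_size items.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(reads)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # Stand-in read identifiers; a real tool would iterate over Pod5 read records.
    read_ids = [f"read_{i}" for i in range(10)]
    for index, chunk in enumerate(fixed_size_chunks(read_ids, 4)):
        print(index, len(chunk))
```

Because the input is consumed lazily, only one chunk is held in memory at a time, which matters when the source file is large.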