Differences

This shows you the differences between two versions of the page.

--- data:retrieval [2024/04/09 10:49] – ↷ Page moved and renamed from retrieving_data to data:retrieval Bioinformatics service admin
+++ data:retrieval [2026/06/18 10:09] (current) – [External Sequencing Service Users] Johanna Barbieri
@@ Line 1: / Line 1: @@
 ====== Retrieving Your Sequencing Data ======
-===== Data Files =====
-Your sequencing data will be made available in the standard FASTQ file format. We also provide some smaller files giving information about those files, and a report for each lane of sequencing.
-Processing the sequenced run folders is done using Illumina's //bcl2fastq// program. Further details on this tool can be found on the [[https://support.illumina.com/sequencing/sequencing_software/bcl2fastq-conversion-software.html|Illumina web site]].
-==== FASTQ Files ====
-Your data will be demultiplexed according to the information supplied in your submission for sequencing. What you receive will depend on the type of sequencing done.
-If you are downloading files from outside CRUK-CI, we strongly recommend checking that the files were not damaged during transfer (the usual problem is truncation rather than corruption). We provide checksums for the FASTQ files, which can be used to make sure all the files have transferred properly.
-=== Regular Sequencing ===
-You will normally receive one (single read) or two (paired end) FASTQ files per sample for each lane of sequencing, plus one or two additional FASTQ files that contain the reads that the demultiplexer (Illumina's //bcl2fastq//) cannot assign to any sample (what we call "lost reads"). The files will be named according to the pattern:
-<code>
-<SLX>.<barcode>.<flowcell>.s_<lane>.r_<read>.fq.gz
-</code>
-A small subset of kits (library types) also return an index read: a separate FASTQ file or pair of files containing the index reads with quality scores. These files will be named with an "i" instead of the "r" for the index read number:
-<code>
-<SLX>.<barcode>.<flowcell>.s_<lane>.i_<index read>.fq.gz
-</code>
-The lost reads file will have a different naming pattern:
-<code>
-<SLX>.<flowcell>.s_<lane>.lostreads.fq.gz
-</code>
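The naming patterns above can be picked apart programmatically. The following is a minimal sketch in Python, assuming that none of the name components (SLX id, barcode, flowcell) contains a dot; the function name ''parse_fastq_name'' and the example file names are hypothetical and are not part of our tooling.

```python
import re
from typing import Optional

# Regular expressions written from the naming patterns described above.
# Assumption: no component of the name contains a "." character.
SAMPLE_RE = re.compile(
    r"^(?P<slx>[^.]+)\.(?P<barcode>[^.]+)\.(?P<flowcell>[^.]+)"
    r"\.s_(?P<lane>\d+)\.(?P<kind>[ri])_(?P<read>\d+)\.fq\.gz$"
)
LOST_RE = re.compile(
    r"^(?P<slx>[^.]+)\.(?P<flowcell>[^.]+)\.s_(?P<lane>\d+)\.lostreads\.fq\.gz$"
)


def parse_fastq_name(name: str) -> Optional[dict]:
    """Split a delivered FASTQ file name into its parts, or return None."""
    match = SAMPLE_RE.match(name)
    if match:
        info = match.groupdict()
        # "r" marks a regular read, "i" marks an index read.
        info["type"] = "index" if info.pop("kind") == "i" else "read"
        return info
    match = LOST_RE.match(name)
    if match:
        info = match.groupdict()
        info["type"] = "lostreads"
        return info
    return None


# Hypothetical file names, for illustration only.
print(parse_fastq_name("SLX-12345.NEBNext01.HABCDEFXX.s_1.r_2.fq.gz"))
print(parse_fastq_name("SLX-12345.HABCDEFXX.s_1.lostreads.fq.gz"))
```

All values are returned as strings; convert the lane and read numbers with ''int()'' if you need to sort on them.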
-There will also be checksum files for each sample and the lost reads.
-<code>
-<SLX>.<barcode>.<flowcell>.s_<lane>.md5sums.txt
-<SLX>.<flowcell>.s_<lane>.lostreads.md5sums.txt
-</code>
-External users of our service in particular should use the checksums to make sure their data files have copied to their local systems without corruption or truncation. See the section below for instructions on how to do this.
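As an illustration of what such a check involves, here is a minimal sketch in Python. It assumes the ''md5sums.txt'' files use the same layout as the output of the ''md5sum'' command (hex digest, whitespace, file name) and that the FASTQ files sit in the same directory as the checksum file; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file: Path) -> dict:
    """Return {file name: True or False} for each entry in a checksum file.

    Assumes md5sum-style lines: "<hex digest>  <file name>".
    A missing file counts as a failure.
    """
    results = {}
    for line in checksum_file.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # md5sum marks binary mode with a leading "*"
        target = checksum_file.parent / name
        results[name] = target.is_file() and md5_of(target) == expected.lower()
    return results
```

Any entry reported as ''False'' should be downloaded again before the data is used.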
-=== Custom Indexing ===
-If your submission requested custom indexing, you will receive one set of files containing all your reads. The barcode will be "UnspecifiedIndex". You will always receive one or two index FASTQ files with custom indexing. There is no lost reads file (all the reads are in the sample FASTQ). The files you can expect are:
-<code>
-<SLX>.UnspecifiedIndex.<flowcell>.s_<lane>.r_<read>.fq.gz
-<SLX>.UnspecifiedIndex.<flowcell>.s_<lane>.i_<index read>.fq.gz
-<SLX>.UnspecifiedIndex.<flowcell>.md5sums.txt
-</code>
-=== No Indexing or Inline Indexing ===
-Similar to custom indexing, you will receive one set of files containing all your reads. The barcode will be "NoIndex" or "INLINE". There is no index information so there will be no index reads, nor lost reads. The files you can expect are:
-<code>
-<SLX>.NoIndex.<flowcell>.s_<lane>.r_<read>.fq.gz
-<SLX>.NoIndex.<flowcell>.md5sums.txt
-</code>
-=== 10x Sequencing ===
-As of November 2020 we supply 10x data as a set of FASTQ files in the same manner as any other regular multiplexed library type, for both single index (SI-GA and SI-NA) and dual index (TT/NT/TN) kits. If you have received your single cell data files as compressed FASTQ files, refer to the regular sequencing section. Also refer to the section on file name conversion if these files need to be renamed for 10x downstream pipelines.
-The FASTQ files you receive for SI-GA and SI-NA indexing will have all four index sequences for each barcode combined in the set of FASTQ files for the sample. A quick run through a demultiplexer will confirm the presence of the four indexes at approximately 25% of reads for each one.
-//The rest of this section refers to single cell sequencing before November 2020, when single cell data was delivered as a TAR file containing twelve or sixteen FASTQ files. If you have received a TAR file per sample, continue reading.//
-With the single index SI-GA and SI-NA indexing (and indeed previous, now discontinued, 10x kits), each barcode is actually four individual barcode sequences. To associate the resulting four sets of FASTQ files with the single sample given in the submission, the FASTQ files are put into a tar archive. There will be a checksum file for the archive. The files will be:
-<code>
-<SLX>.<barcode>.<flowcell>.s_<lane>.tar
-<SLX>.<barcode>.<flowcell>.s_<lane>.md5sums.txt
-<SLX>.<flowcell>.s_<lane>.lostreads.tar
-<SLX>.<flowcell>.s_<lane>.lostreads.md5sums.txt
-</code>
-Inside each archive there will be twelve or sixteen FASTQ files. These will be named as they were produced by //bcl2fastq//, which should allow 10x's Cell Ranger and Long Ranger to read the files as if they had been produced by these tools' demux pipelines. There will be FASTQ files for the index reads too.
-For example, for a sample that was labelled with the 10x SINA H9 barcode, the tar archive will contain:
-<code>
-SIGAH9_S5_L001_I1_001.fastq.gz
-SIGAH9_S5_L001_R1_001.fastq.gz
-SIGAH9_S5_L001_R2_001.fastq.gz
-SIGAH9_S6_L001_I1_001.fastq.gz
-SIGAH9_S6_L001_R1_001.fastq.gz
-SIGAH9_S6_L001_R2_001.fastq.gz
-SIGAH9_S7_L001_I1_001.fastq.gz
-SIGAH9_S7_L001_R1_001.fastq.gz
-SIGAH9_S7_L001_R2_001.fastq.gz
-SIGAH9_S8_L001_I1_001.fastq.gz
-SIGAH9_S8_L001_R1_001.fastq.gz
-SIGAH9_S8_L001_R2_001.fastq.gz
-</code>
-The sample number (the second part of the file name, "S5" etc) is an internal sample number generated by //bcl2fastq// that has no significance.
-==== Supporting Files ====
- There will be some additional files delivered with the FASTQ data. These hold some statistics and QC test results that may be useful.
-  - ''<SLX>.<flowcell>.s_<lane>.bcl2fastq.zip'': This is a zip archive of //bcl2fastq//'s ''Stats'' folder and the sample sheet we used to create the FASTQ files. The statistics are XML and JSON files that can be parsed to extract the numbers of reads, yield, unknown barcodes and so forth. It is created by //bcl2fastq// so the keen reader can refer to Illumina's documentation for the tool for details.
-  - ''<SLX>.<flowcell>.s_<lane>.contents.csv'': A small CSV file detailing the barcoding of samples in the pool. Gives the sample name, barcode name and barcode sequence for the pool. May not be available for those sequencing methods that do not have multiplexing information in Clarity (no index, inline barcode, custom index).
-==== The QC Report ====
-We deliver with the data a report that we also use to ensure the sequencing has gone as expected. As of summer 2020, this is a MultiQC report containing the individual reports previously delivered separately.
-  - Our demultiplexing and barcode balance reports. These give charts and numbers for the reads created from the sequencing. If there are indexing problems they'll show up here.
-  - Our single cell reports. These are only present for 10x single cell lanes and are produced per sample.
-  - [[https://www.bioinformatics.babraham.ac.uk/projects/fastqc|FastQC]]: Assorted sequencing metrics. Where the lane is regular paired end sequencing, there will be a FastQC report for each read (FastQC is not a tool that handles paired end).
-  - [[https://github.com/crukci-bioinformatics/MGA|Multi Genome Alignment]]: A contaminant screen for the pool.
-This report will be named ''<SLX>.<flowcell>.s_<lane>.html''.
-==== File Name Conversion ====
-Some tools require the FASTQ files to be named as they would be when delivered by //bcl2fastq//, not as we provide them. 10x's pipelines are a notable example. To this end, the script [[https://genomicsequencing.cruk.cam.ac.uk/sequencing/resources/crukci_to_illumina.py|crukci_to_illumina.py]] will convert a directory containing our FASTQ files to names as they would have been immediately after //bcl2fastq//.
-It changes our file names to the pattern:
-<code>
-<barcode>_S<number>_L00<lane>_[IR]<read>_001.fastq.gz
-</code>
-**barcode** is the barcode as it appears in our file name; **number** is an arbitrary sample number that //bcl2fastq// adds to the file name (it corresponds to a row in the sample sheet but really it counts for little); **lane** is the lane number; **[IR]** is either 'R' for regular reads or 'I' for index reads; **read** is the read number.
-The tool can be run with the command:
-<code>
-python3 crukci_to_illumina.py [<fastq directory>]
-</code>
-If //fastq directory// is not given, the script will look at files in the current directory. It does not recurse into subdirectories.
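To make the mapping between the two naming schemes concrete, here is a minimal sketch in Python of the rename for a single file. It is not the ''crukci_to_illumina.py'' script and may differ from it in detail: the function name is hypothetical, the sample number is simply passed in by the caller, and name components are assumed not to contain dots.

```python
import re

# Matches <SLX>.<barcode>.<flowcell>.s_<lane>.[ri]_<read>.fq.gz
CRUKCI_RE = re.compile(
    r"^[^.]+\.(?P<barcode>[^.]+)\.[^.]+\.s_(?P<lane>\d+)"
    r"\.(?P<kind>[ri])_(?P<read>\d+)\.fq\.gz$"
)


def to_illumina_name(crukci_name: str, sample_number: int) -> str:
    """Build a bcl2fastq-style name from one of our FASTQ file names.

    sample_number is the arbitrary "S<number>" part; here the caller
    chooses it (an assumption made for this sketch).
    """
    match = CRUKCI_RE.match(crukci_name)
    if match is None:
        raise ValueError(f"Not a recognised FASTQ file name: {crukci_name}")
    return "{barcode}_S{number}_L00{lane}_{kind}{read}_001.fastq.gz".format(
        barcode=match["barcode"],
        number=sample_number,
        lane=match["lane"],
        kind=match["kind"].upper(),  # "r" becomes "R", "i" becomes "I"
        read=match["read"],
    )


# Hypothetical input name, for illustration only.
print(to_illumina_name("SLX-12345.SIGAH9.HABCDEFXX.s_1.r_2.fq.gz", 5))
```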
-===== Retrieving Files =====
 There are two methods we employ to allow users of the sequencing service to retrieve their sequencing data. Persons working inside CRUK-CI should use our data download tool; everyone else must fetch their data from our FTP site.
@@ Line 143: / Line 5: @@
 ==== CRUK-CI Researchers ====
-We provide a tool for downloading files for projects, libraries and runs that you can use from the command line. The full user manual for the download tool can be found [[https://internal-bioinformatics.cruk.cam.ac.uk/docs/clarity/internalapi/downloadtool.html|on this internal web page]]. Please visit the page to download the tool, find instructions for how to use it and also how to install Java on your personal machine (link requires you to be in the building or running the VPN).
+We provide a tool for downloading files for projects, libraries and runs that you can use from the command line. The full user manual for the download tool can be found [[https://internal-bioinformatics.cruk.cam.ac.uk/docs/clarity/downloadtool/userguide.html|on this internal web page]]. Please visit the page to download the tool, find instructions for how to use it and also how to install Java on your personal machine (link requires you to be in the building or running the VPN).
 ==== External Sequencing Service Users ====
@@ Line 149: / Line 11: @@
 Outside of CRUK-CI, data is delivered through the FTP site: ''ftp1.cruk.cam.ac.uk''.
-This is a vanilla FTP site that should be accessible by any FTP client you choose. Your group will have been provided a user name and password for accessing the site when your group arranged to use the CRUK-CI sequencing service. Your data will be in a private region of the server only accessible with your group's credentials. The site is read only.
+This is an FTP site that uses TLS encryption (sometimes known as ''FTPS''). You should be able to connect to the site using any up-to-date FTP client you choose. Your group will have been provided a user name and password for accessing the site when your group arranged to use the CRUK-CI sequencing service. Your data will be in a private region of the server only accessible with your group's credentials. The site is read only.
 **Files are available on the FTP site for a guaranteed thirty (30) days after sequencing. You MUST fetch the files in this time period. The files will be removed from the FTP site once this time has elapsed.**
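For scripted downloads, an FTP over TLS connection can also be made from Python's standard library. The sketch below is untested against our server and makes several assumptions: that the server accepts explicit TLS (which is what ''ftplib.FTP_TLS'' negotiates), that ''remote_dir'' is a directory you can see after logging in, and that it contains only plain files. The function name and its parameters are hypothetical.

```python
from ftplib import FTP_TLS
from pathlib import Path

FTP_HOST = "ftp1.cruk.cam.ac.uk"


def download_all(user: str, password: str, remote_dir: str, local_dir: Path) -> list:
    """Fetch every file in remote_dir over FTP with TLS.

    Assumes remote_dir holds plain files only (no subdirectories).
    Returns the list of local paths written.
    """
    local_dir.mkdir(parents=True, exist_ok=True)
    fetched = []
    with FTP_TLS(FTP_HOST) as ftp:
        ftp.login(user, password)  # credentials are sent after TLS is negotiated
        ftp.prot_p()               # encrypt the data connection as well
        ftp.cwd(remote_dir)
        for name in ftp.nlst():
            target = local_dir / name
            with target.open("wb") as handle:
                ftp.retrbinary(f"RETR {name}", handle.write)
            fetched.append(target)
    return fetched
```

A script like this does not detect a truncated transfer by itself, so the downloaded files still need to be checked against the checksums.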
@@ Line 167: / Line 29: @@
 === FTP Clients ===
- There are many FTP clients available on the web one can use to fetch files from our FTP site. Some clients are:
+There are many FTP clients available on the web one can use to fetch files from our FTP site. We officially support two of them:
   * [[https://filezilla-project.org|Filezilla]] (desktop)
+  * [[https://lftp.yar.ru|LFTP]] (command line)
+There are others you can use, though we have not properly tested them:
   * [[https://cyberduck.io|Cyberduck]] (desktop)
   * [[https://www.coffeecup.com/free-ftp|CoffeeCup Free FTP]] (desktop)
-  * [[https://lftp.yar.ru|LFTP]] (command line)
   * [[https://www.ncftp.com/ncftp|NcFTP]] (command line)
-On Linux, you might find that these programs are available through the platform's package management system. Most web browsers will also allow you to navigate the FTP site if you use the FTP protocol in the address bar: [[ftp://ftp1.cruk.cam.ac.uk/]]. The browser will prompt for your user name and password. The web browser is handy for having a look at your area of the server and previewing the reports but a proper FTP program is recommended for fetching the files.
+On Linux, you might find that these programs are available through the platform's package management system.
 We have become aware of some users using the Mac's //Finder// application to connect to the FTP server and copy the files. While convenient, it appears that //Finder// can silently truncate files while copying if the connection to the FTP server drops. Thus we do not recommend using //Finder// or //Windows Explorer// to copy the files: use a proper FTP program that will report errors. Above all, and regardless of the program used, **you must check your files against the checksums after downloading** as described above.
@@ Line 181: / Line 46: @@
 === Troubleshooting ===
-Our FTP server is pretty much as vanilla as they come and should not cause any problems. Nonetheless, occasionally people do tell us they cannot connect, and so far this has always been problems at the client end. Here are some things to check.
+Most FTP clients, and certainly //lftp// and //FileZilla//, handle the TLS encryption automatically. Other clients may need to be told explicitly to use an encrypted connection.
+Occasionally people tell us they cannot connect, and so far this has always been a problem at the client end. Here are some things to check.
-  - Make sure any encryption option is turned off. Some clients have encryption, sometimes called SSL or FTPS, turned on by default.
+  - ''FTPS'' is not the same as ''SFTP''. The former is the FTP protocol with encryption, the latter is file transfer over the secure shell protocol. You cannot use ''sftp'' or ''scp'' with our FTP server.
+  - Make sure the option for ''FTPS'' or ''SSL'' encryption is turned on in your client if it has explicit options for this.
   - Make sure your computer can see our server. Try "pinging" the FTP server with "''ping ftp1.cruk.cam.ac.uk''" from the command line. The server will echo back a reply if your pings are getting through. If ping says the packets are not being returned, please ask **your** IT department to check network connectivity. The problems have never yet been at the CRUK-CI end; if our FTP site does need to be taken out of commission for a while we will let all our collaborators know beforehand.
+  - If you are able to log in but get errors when trying to download, please ask **your** IT team to check they are allowing access for FTP ephemeral ports. The range we use is 41698-41707.
+  - If you still have problems after checking all of the above, please contact the CRUK-CI IT team on ''[[mailto:ithelpdesk@cruk.cam.ac.uk|ithelpdesk@cruk.cam.ac.uk]]'' and cc Genomics ''[[mailto:Core-Genomics-Staff@cruk.cam.ac.uk|Core-Genomics-Staff@cruk.cam.ac.uk]]''. Note this is a different address to the usual Genomics help desk and should only be used for queries about connectivity problems to our FTP server; all other queries need to go to the Genomics help desk as normal.
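If ping succeeds but your client still cannot connect, it can help to test whether a TCP connection to the FTP control port opens at all. This is a minimal sketch in Python, assuming the server listens on the standard FTP control port 21 (the port is not stated on this page); the function name is hypothetical.

```python
import socket


def can_connect(host: str, port: int = 21, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Port 21 is the conventional FTP control port and is an assumption here.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    print(can_connect("ftp1.cruk.cam.ac.uk"))
```

A result of ''False'' suggests that something between your machine and the server is blocking the connection, which is worth passing on to your IT team. This only tests the control connection, not the ephemeral data ports mentioned above.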