Guides

How to use the DANIO-CODE trackhub on the UCSC genome browser

1. Connect the genome browser to the trackhub

  • Go to the UCSC genome browser
  • Click on the "trackhub" button underneath the tracks
  • Paste the URL of the DANIO-CODE trackhub, https://danio-code.zfin.org/trackhub/DANIO-CODE.hub.txt, into the "My Hubs" section.
  • This sends you back to the genome browser connected to the DANIO-CODE trackhub
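
The manual steps above can also be expressed as a single browser link. The following is a minimal sketch, assuming the UCSC `hgTracks` page is used with its `db` and `hubUrl` URL parameters; the assembly name `danRer10` is only an illustrative assumption and should be replaced by an assembly the hub actually provides.

```python
from urllib.parse import urlencode

# Hub URL as given in this guide.
HUB_URL = "https://danio-code.zfin.org/trackhub/DANIO-CODE.hub.txt"


def browser_link(assembly, hub_url=HUB_URL):
    """Build a UCSC genome browser link that attaches the track hub on load.

    `assembly` is a UCSC assembly name (assumed here, e.g. "danRer10").
    """
    query = urlencode({"db": assembly, "hubUrl": hub_url})
    return "https://genome.ucsc.edu/cgi-bin/hgTracks?" + query


print(browser_link("danRer10"))
```

Opening the printed link should have the same effect as pasting the hub URL into "My Hubs" by hand.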

2. Add a track to the genome browser

  • Click on the desired assay type underneath the tracks
  • Click on the assay type name again, to get to the track settings
  • Select the desired tracks, e.g. based on developmental stage, by ticking the respective box
  • Click on submit to update the genome browser tracks

Track hub organisation

Tracks are grouped together based on assay. When possible, regions of enrichment are generated and shown (ChIP-seq and ATAC-seq peaks, CAGE-seq tag clusters and methylation level of cytosines in each context for BS-seq).

Inside assay type composites, the tracks are organised in subgroups based on developmental stages. This enables all tracks from the same stages to be shown in bulk. Tracks are by default ordered based on developmental time course. BS-seq and ChIP-seq have additional grouping, corresponding to cytosine context and ChIP-seq target respectively.

The tracks are named and labeled according to their respective DANIO-CODE DCC sequencing ID. The long label shown above the track in the browser gives the developmental stage, the group which generated the data, the tissue, and the GEO/SRA reference number. Missing information is represented by `nan`. Further information about how the tracks were generated can be found on each composite page.

How to download the data from the DANIO-CODE DCC

There are two different data types you can download, annotation and sequencing data.

The annotation data are provided as csv files, which follow the terms of the annotation system.

You can either download it using the data export tool and clicking "Download annotation" or from the details page of a series.

The sequencing/processed files can be downloaded either using the data export tool, the individual download buttons on the series details page or via the URLs provided in the annotation csv files.
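
As an illustration of the last option, the sketch below collects the file URLs from an annotation CSV file. The column names `data__primary_file_URL` and `data__secondary_file_URL` are assumptions borrowed from the batch upload format described further down; the exported annotation files may use different names, so check the header of your file first.

```python
import csv
import io

# Assumed column names; adjust them to the header of the exported CSV file.
URL_COLUMNS = ("data__primary_file_URL", "data__secondary_file_URL")


def file_urls(csv_text, columns=URL_COLUMNS):
    """Return the distinct, non-empty URLs found in the given columns."""
    urls = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column in columns:
            url = (row.get(column) or "").strip()
            if url and url not in urls:
                urls.append(url)
    return urls


# Hypothetical two-line annotation file used only for demonstration.
example = (
    "data__primary_file_URL,data__secondary_file_URL\n"
    "http://example.org/a_1.fastq.gz,http://example.org/a_2.fastq.gz\n"
    "http://example.org/b.fastq.gz,\n"
)
for url in file_urls(example):
    print(url)
```

The resulting list can then be passed to any download tool of your choice.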

How to upload data and annotations

Video Tutorial

Series

  • The title of the series
  • The series type, i.e. if the series is a case-control or a survey study
  • A short description of the series, similar to an abstract
  • If the data is public
optional:
  • The DOI to a publication containing this series, if the series was part of a scientific paper.
  • The GEO ID, if the series was uploaded to GEO

Biosample

The following information is needed for each biosample and its replicates

  • The lab which handled the biosamples
  • The genetic background of the Zebrafish (the backgrounds are given as a list, see appendix)
  • The type of the sample, i.e. does it consist of the whole organism, a specific tissue (the anatomical term of which also has to be specified), stem cells or cell line (the specific cell line type has to be specified as well)
  • If the sample type is whole organism or tissue the developmental stage has to be stated (list of controlled vocabulary terms)
  • If the sample type is whole organism or tissue the hours post fertilization has to be given
  • If the sample type is tissue, the anatomical term describing the sample has to be given
  • If it is a case control study, the control biosample and its replicates (for annotation uploads using the web interface the first group is always considered as the control)
optional:
  • The description of the mutation, e.g. the alleles.
  • The treatment applied on the zebrafish
  • The source of zebrafish, e.g. a zebrafish distribution center
  • The sex of the sample

Assay

The following information is needed for each assay.

  • The lab which applied the assay to the biosample.
  • The specific assay type (given as a list of possible techniques, see appendix).
  • If the technique uses a specific target, e.g. a protein or a Histone modification, this target is also needed.
  • For RNA-based techniques, information of the library preparation e.g., poly(A)+, poly(A)-, rRNA depletion is also needed.
  • The description of the Assay, e.g. a concise version of its protocol.

Applied Assay

  • The information on which assay was applied to which biological replicate is needed for the Applied Assay section.

Sequencing

The following information is needed for each biosample-assay pair.

  • The lab which performed the sequencing
  • The platform the sequencing was performed on
  • The specific instrument of the platform which was used for sequencing
  • Whether the sequencing was single-end or paired-end
  • The date of the sequencing (YYYY-MM-DD)
  • For RNA-seq, whether the sequencing was unstranded, forward stranded or reverse stranded
optional:
  • The maximal read length of the sequences
  • The chemistry version used for sequencing

Data

  • The file name. If the file is inside a folder, this field should contain the file path from the user directory. For files on web-accessible servers, their full URL is needed here. Only demultiplexed FASTQ files are accepted: exactly one for single-end sequencing and exactly two for paired-end sequencing.

Data Transfer

Only demultiplexed FASTQ files are allowed: exactly one file per sequencing in single-end mode and exactly two for paired-end sequencing, each with a web-accessible URL to download the data from (entered in the fields described below). For large amounts of data, write to daniocode@gmail.com to request preloading onto the server.

Data Annotation Protocol

The data annotation tool is available at danio-code.zfin.org. You need to create a user account to be able to view data. After a security check your account will have access to the annotation upload tools.

Data producers are requested to start gathering their metadata according to the nomenclatures listed below, and collect them in a format which will allow them to use CSV files during data annotation and batch annotation using the online DCC tool.

CSV Batch Annotation

In addition to the web application annotation procedure, we have provided a batch CSV uploading procedure. This will be useful for users that have a large amount of data for which the web application may be cumbersome or error prone.

Batch uploads are accomplished using CSV files that describe the annotation details (series, biosamples, assays, sequencing) and associated data files. CSV files can be created in Excel and saved via "Save As" in any of the provided CSV formats. CSV files can also be generated automatically, but they must adhere exactly to the guidelines below. The CSV file can be uploaded and processed through the batch upload page at danio-code.zfin.org.

An Excel document with an example upload can be found at this link.
In order to use this as a batch upload CSV file, the first column must be removed and the file must be saved from Excel in any supported CSV format.
You can find an example of a correctly formatted CSV file at this link. If you open this file in Excel, please reformat the sequencing__sequencing_date column to the correct date format (YYYY-MM-DD).

Each row of the uploaded CSV file can create new annotation records or refer to already existing records via their DANIO-CODE IDs. Links between annotation records (e.g. that a biosample belongs to a series) are created between all annotation/data records on the same line of the CSV file. Thus each row must contain columns describing the annotation details (series, biosample, assay, and sequencing) and the associated data. Once an annotation record has been entered on a line of the CSV file, all following lines may simply reference its internal_id field to link that record to several data files. If additional annotation record attributes are provided, they must match those in the first record with that internal_id. The requirements for all annotation record attributes are listed below.
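
The matching rule for repeated internal_id values can be checked before uploading. The following is a minimal sketch of such a check, not the validation performed by the DCC itself; the function name and the way problems are reported are illustrative assumptions.

```python
def check_repeated_records(rows, record="series"):
    """Find attributes that contradict the first row with the same internal_id.

    `rows` is a list of dicts as produced by csv.DictReader. Returns a list
    of (line number, column, value) tuples; line 1 is the CSV header.
    """
    prefix = record + "__"
    id_column = prefix + "internal_id"
    first_seen = {}
    problems = []
    for line_no, row in enumerate(rows, start=2):
        record_id = row.get(id_column, "")
        if not record_id:
            continue
        # Empty cells are allowed on later lines, so only keep filled ones.
        attributes = {
            key: value
            for key, value in row.items()
            if key.startswith(prefix) and key != id_column and value
        }
        if record_id not in first_seen:
            first_seen[record_id] = attributes
            continue
        for key, value in attributes.items():
            if first_seen[record_id].get(key) != value:
                problems.append((line_no, key, value))
    return problems


rows = [
    {"series__internal_id": "s1", "series__title": "Early development"},
    {"series__internal_id": "s1", "series__title": ""},
    {"series__internal_id": "s1", "series__title": "Another title"},
]
print(check_repeated_records(rows))
```

In this example the second line is accepted because it only references the internal_id, while the third line is reported because its title differs from the first record.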

Unless otherwise noted fields may contain free text (most fields allow a maximum of 200 characters). Unless otherwise noted each field is required (unless a previous line has defined the annotation record attributes associated with this internal_id).

The header for the CSV files is strictly controlled to ensure correct annotation of each sample. The header field for each attribute must consist of exactly the annotation record name (series, biosample, assay, etc.) followed by two underscores (“__”) and the attribute name (internal_id, series_type, etc.). Thus the header field for a series id would be “series__series_id” and the header field for a biosample type would be “biosample__biosample_type”. Optional fields need not be included in the CSV file if the attribute will not be defined for any sample.
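
The naming rule can be sketched in a few lines of Python. The record prefixes are taken from the field sections of this document; the helper names are hypothetical, and the check only covers the prefix, not whether the attribute name itself is known.

```python
# Record prefixes as used in the field sections of this document.
RECORDS = (
    "series", "biosample", "biosample_replicate", "assay",
    "applied_assay", "technical_replicate", "sequencing", "data",
)


def header_field(record, attribute):
    """Join a record name and an attribute name with two underscores."""
    return record + "__" + attribute


def malformed_headers(header):
    """Return the header fields that do not follow record__attribute."""
    bad = []
    for name in header:
        record, separator, attribute = name.partition("__")
        if not separator or record not in RECORDS or not attribute:
            bad.append(name)
    return bad


print(header_field("series", "series_id"))
print(malformed_headers(["biosample__biosample_type", "series_id"]))
```

Here `series_id` is reported because it lacks the record prefix and the double underscore.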

The following terms and their restrictions may change, please always check the most recent version of this document.

Series ("series__" fields)

internal_id
This field identifies a record within a single CSV file. It is needed to tell records apart, as two different records may be identical in all other fields. If two lines reference the same series__internal_id, all those after the first need not include additional fields describing the series; if such fields are provided, they must match the fields from the first record in the CSV file.
series_id
This field indicates that this line is adding information to a series already within the DANIO-CODE DCC. This series_id must match a record already found within the DANIO-CODE DCC.
series_type
This is a controlled vocabulary field.
is_public
This is a boolean text field, enter either “True” or “False”
title
description
publication (optional)
geo_sra_id (optional)

Biosample ("biosample__" fields)

internal_id
This field identifies a record within a single CSV file. It is needed to tell records apart, as two different records may be identical in all other fields. If two lines reference the same biosample__internal_id, all those after the first need not include additional fields describing the biosample; if such fields are provided, they must match the fields from the first record in the CSV file.
biosample_id
This field indicates that this line is adding information to a biosample already within the DANIO-CODE DCC. This biosample_id must match a record already found within the DANIO-CODE DCC.
biosample_type
This is a controlled vocabulary field.
biosample_lab
This is a controlled vocabulary field.
genetic_background
This is a controlled vocabulary field.
anatomical_term (required for biosample_type tissue)
This is a controlled vocabulary field.
stage (required for biosample_type tissue or whole organism)
This is a controlled vocabulary field.
cell_line_type (required for biosample_type cell line)
post_fertilization (required for biosample_type tissue or whole organism)
The number of hours after fertilization.
controlled_by (optional)
This field indicates that this biosample has a biological control. This field must be either an existing DANIO-CODE biosample_id or a biosample__internal_id from the same submitted CSV file.
sex (optional)
This is a controlled vocabulary field.
treatment (optional)
description (optional)
source (optional)
mutation_description (optional)
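
The conditional requirements of the biosample fields can be summarised in a small check. This is only a sketch of the rules listed above, assuming that biosample_type, biosample_lab and genetic_background are always required for a new record; it does not verify the controlled vocabularies, and the function name is hypothetical.

```python
ALWAYS_REQUIRED = ("biosample_type", "biosample_lab", "genetic_background")

# Additional required attributes per biosample type, as listed above.
REQUIRED_BY_TYPE = {
    "whole organism": ("stage", "post_fertilization"),
    "tissue": ("anatomical_term", "stage", "post_fertilization"),
    "cell line": ("cell_line_type",),
    "stem cells": (),
}


def missing_biosample_fields(row):
    """Return the required biosample attributes that are empty in a CSV row."""
    def filled(attribute):
        return bool(row.get("biosample__" + attribute, "").strip())

    sample_type = row.get("biosample__biosample_type", "").strip()
    required = ALWAYS_REQUIRED + REQUIRED_BY_TYPE.get(sample_type, ())
    return [attribute for attribute in required if not filled(attribute)]


row = {
    "biosample__biosample_type": "tissue",
    "biosample__biosample_lab": "Lab A",  # hypothetical lab name
    "biosample__genetic_background": "AB",
    "biosample__stage": "Shield",
    "biosample__post_fertilization": "6",
}
print(missing_biosample_fields(row))
```

The example row describes a tissue sample without an anatomical term, so `anatomical_term` is reported as missing.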

Biosample replicate ("biosample_replicate__" fields)

internal_id

Assay ("assay__" fields)

internal_id
This field identifies a record within a single CSV file. It is needed to tell records apart, as two different records may be identical in all other fields. If two lines reference the same assay__internal_id, all those after the first need not include additional fields describing the assay; if such fields are provided, they must match the fields from the first record in the CSV file.
assay_id
This field indicates that this line is adding information to an assay already within the DANIO-CODE DCC. This assay_id must match a record already found within the DANIO-CODE DCC.
assay_type
This is a controlled vocabulary field
assay_lab
This is a controlled vocabulary field
description
target
(required if assay_type is one of the following: ChIP-seq, SELEX-seq, RIP-seq, PAR-Clip-seq, iCLIP-seq, ChIP-exo-seq) To indicate technical controls, use the target “mock”.
library_prep
(required if assay_type is RNA-based: RNA-seq, RIP-seq, etc)

Applied Assay (“applied_assay__” fields)

internal_id
This field identifies a record within a single CSV file. It is needed to tell records apart, as two different records may be identical in all other fields. If two lines reference the same applied_assay__internal_id, all those after the first need not include additional fields describing the applied assay; if such fields are provided, they must match the fields from the first record in the CSV file.
controlled_by
This field indicates that this applied assay has a technical control. This field must be either an existing DANIO-CODE applied_assay_id or an applied_assay__internal_id from the same submitted CSV file.

Technical replicate ("technical_replicate__" fields)

internal_id

Sequencing ("sequencing__" fields)

internal_id
This field identifies a record within a single CSV file. It is needed to tell records apart, as two different records may be identical in all other fields. If two lines reference the same sequencing__internal_id, all those after the first need not include additional fields describing the sequencing; if such fields are provided, they must match the fields from the first record in the CSV file.
sequencing_lab
This is a controlled vocabulary field.
platform
This is a controlled vocabulary field.
lab
This is a controlled vocabulary field.
instrument
This is a controlled vocabulary field.
paired_end
This is a boolean field. Enter “True” if paired-end sequencing was performed or “False” otherwise.
sequencing_date
The date the sequencing was performed, in ISO 8601 format, i.e. YYYY-MM-DD. If the exact date is unknown, enter at least the year and use 01 for the unknown month or day.
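
A short sketch of this date rule, using only the Python standard library; the function name is hypothetical.

```python
from datetime import date


def sequencing_date(year, month=None, day=None):
    """Format a sequencing date as YYYY-MM-DD, using 01 for unknown parts."""
    return date(year, month or 1, day or 1).isoformat()


print(sequencing_date(2014, 4))      # month known, day unknown
print(sequencing_date(2014))         # only the year is known
print(sequencing_date(2014, 4, 17))  # exact date
```

Because `datetime.date` validates its arguments, an impossible date such as month 13 raises a `ValueError` instead of producing a malformed field.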
strand_mode (required if assay_type is RNA-based: RNA-seq, RIP-seq, etc)
This is a controlled vocabulary field.
max_read_length (optional)
Positive integer field.
chemistry_version (optional)

Data ("data__" fields)

internal_id
This field identifies a record within a single CSV file. It is needed to tell records apart, as two different records may be identical in all other fields. If two lines reference the same data__internal_id, all those after the first need not include additional fields describing the data; if such fields are provided, they must match the fields from the first record in the CSV file.
primary_file_name
This field should contain the filename of the sequencing file in the case of single end sequencing or the first reads for paired end sequencing (see section Data Transfer for details on the file name structure)
primary_file_URL
This field should contain the publicly accessible URL for the sequencing file in the case of single-end sequencing, or for the first reads file in the case of paired-end sequencing.
secondary_file_name
Leave blank for single-end sequencing. For paired end please provide the name or path to the second reads file here.
secondary_file_URL
Leave blank for single-end sequencing. For paired end please provide the URL to the second reads file here.
If you need any help, please contact matthias.hortenhuber@ki.se.
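
The relation between the paired_end field and the data fields above can be checked per CSV row. The following is a minimal sketch under the assumption that the row is a dict keyed by the full header names; the wording of the messages is illustrative.

```python
def data_file_problems(row):
    """Check the data__ file fields of one CSV row against paired_end."""
    paired = row.get("sequencing__paired_end", "").strip() == "True"
    problems = []
    for attribute in ("primary_file_name", "primary_file_URL"):
        if not row.get("data__" + attribute, "").strip():
            problems.append("data__" + attribute + " is required")
    for attribute in ("secondary_file_name", "secondary_file_URL"):
        value = row.get("data__" + attribute, "").strip()
        if paired and not value:
            problems.append("data__" + attribute + " is required for paired-end sequencing")
        elif not paired and value:
            problems.append("data__" + attribute + " must be blank for single-end sequencing")
    return problems


row = {
    "sequencing__paired_end": "True",
    "data__primary_file_name": "sample_1.fastq.gz",
    "data__primary_file_URL": "http://example.org/sample_1.fastq.gz",
    "data__secondary_file_name": "sample_2.fastq.gz",
    "data__secondary_file_URL": "",
}
print(data_file_problems(row))
```

The example row is paired-end but lacks the URL of the second reads file, so one problem is reported.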

Example of data annotation

In order to study the promoter and histone modification shift in early development, Lab A used CAGE-seq and H3K36me3 ChIP-seq. Lab A used whole organism samples of the AB strain at six time steps (2-cell, 8-cell, 16-cell, 64-cell, 256-cell, 1k-cell), with one sample per time step for each assay and three technical replicates per time step; the ChIP-seq experiment additionally includes one mock IP per time step. Lab B sequenced all of these libraries on Illumina’s HiSeq 2000 sometime in April 2014: CAGE-seq in single-end mode and ChIP-seq in paired-end mode, both with a maximal read length of 58 bp. This results in (6 time steps * 3 replicates * 1 single-end =) 18 FASTQ files for CAGE-seq and (6 time steps * (3 replicates + 1 mock) * 2 paired-ends =) 48 FASTQ files for ChIP-seq.
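
The library and file counts of this example can be reproduced with a few lines of arithmetic. The variable names are illustrative only.

```python
time_steps = 6
replicates = 3

# CAGE-seq: three technical replicates per time step, single-end.
cage_libraries = time_steps * replicates
cage_files = cage_libraries * 1

# ChIP-seq: three replicates plus one mock IP per time step, paired-end.
chip_libraries = time_steps * (replicates + 1)
chip_files = chip_libraries * 2

print(cage_libraries, chip_libraries, cage_libraries + chip_libraries)  # 18 24 42
print(cage_files, chip_files, cage_files + chip_files)                  # 18 48 66
```

These totals correspond to the 42 libraries and 66 FASTQ files referred to in the Sequencing and Data parts of the example.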

Series

Series Title
The promoter and methylation shift in early development
Publication (DOI)
12345
Series type
Survey
Number of Biosamples
6
Description
a nucleotide-resolution map of transcription initiation events in the zebrafish genome, generated by CAGE and ChIP-seq across 6 stages
Number of replicates
3
Number of different Assays
2

Biosamples

Developmental Stage
2-cell, 8-cell, 16-cell, 64-cell, 256-cell, 1k-cell
Lab
A
Gen. background
AB
Sample type
whole organism

Assay

Lab
A
Assay Type
CAGE-seq
Lab
A
Assay Type
ChIP-seq
Target
H3K36me3

Applied Assay

18 CAGE-seq libraries

24 ChIP-seq libraries (six of them mock IPs)

Sequencing

To each of the 42 libraries the following is assigned:

Lab
B
Platform
Illumina
Instrument
HiSeq 2000
Sequencing mode
single end for CAGE-seq, paired end for ChIP-seq
Maximal read length
58
Sequencing date
2014-04-01

Data

To each of the 66 data sets (18 CAGE-Seq and 48 ChIP-Seq) the following is assigned:

File type
FASTQ

URL to each of the individual FASTQ files

Example CSV Files

An Excel document with an example upload can be found at this link.
In order to use this as a batch upload CSV file, the first column must be removed and the file must be saved from Excel in any supported CSV format.
You can find an example of a correctly formatted CSV file at this link. If you open this file in Excel, please reformat the sequencing__sequencing_date column to the correct date format (YYYY-MM-DD).

Terms

Series

A series corresponds to a research question/experiment, which motivates the generation of the data. It is thereby connecting biosamples with each other.
Title
A title given to the series.
Series Type
The Series Type is either Survey or Case Control. A survey is a collection of samples without any further order to them, i.e. the samples were not ordered into cases. In Case Control, biosamples are ordered into cases, e.g. treatment and control.
Description
The description of the series. This can entail the research question, the aim of the experiments, a description of the different cases in case control studies, etc.
Reference
The DOI of the published results of this series.
GEO/SRA ID
The ID to the uploaded Series in GEO or SRA.
Public
If the series should be publicly available or only to members of DANIO-CODE

Biosample

A biosample corresponds to the physical entity which is the source of the genetic material. One biosample entity includes all its biological replicates.
Lab
The lab which handled the biosample and the extraction of the genetic material.
Sample Type
The type describes the organic material of origin for the biosample.
Anatomical Term
The site or tissue type of the sample (controlled; only required for sample type tissue).
Stage
The developmental stage of the organism (required for sample type whole organism and tissue).
Post Fertilization
The number of hours between fertilisation and the death of the organism or the extraction of the sample (only applicable for sample type whole organism and tissue).
Genetic Background
The specific biological background of the biosample (controlled).
Description
A description of the sample (optional).
Mutation description
The mutation of the biosample, e.g. the targeted gene or the alleles (optional).
Source
The origin/distributor of the specific fish strain (optional).
Treatment
The treatment applied on the sample (optional).
Sex
The sex of the biosample (optional).
Cell line type
The type of the cell line (only applicable for the biosample type “cell line”).

Assay

Assay Type
The technique used to produce the sequenced library, e.g. RNA-seq, ChIP-seq (controlled)
Lab
The lab which applied the assay on the biosample
Description
A more detailed description of the assay, e.g. the protocol.
Target
The target of the assay, e.g. a protein or a histone modification (required for ChIP-seq, SELEX-seq, RIP-seq, PAR-Clip-seq, iCLIP-seq, ChIP-exo-seq, Methyl-seq)
Library Preparation
The library preparation used for RNA-based assays, e.g. poly(A)+, poly(A)-, rRNA depletion, etc. (required for RNA-seq, short-RNA-seq, miRNA-seq, Ribo-seq, RIP-seq, PAR-Clip-seq, iCLIP-seq, GRO-seq, CAGE-seq)

Applied assay

An applied assay is the library which will be sequenced. It is the product of an assay applied to a biosample entity, i.e. the group of replicates.

Sequencing

The specific sequencing techniques used to produce the data sets.
Platform
The sequencing platform used to produce the data (controlled).
Instrument
The sequencing instrument used to produce the data (controlled).
Lab
The lab which performed the sequencing (controlled).
Strand Mode
The direction the reads have, with respect to the target mRNA (required for RNA based assays, controlled)
Sequencing Date
The date the sequencing was performed, in ISO 8601 format, i.e. YYYY-MM-DD. If the exact date is unknown, enter at least the year and use 01 for the unknown month or day.
Chemistry Version
The specific chemistry version used for the sequencing (optional).
Maximum read length
The longest read length of the sequencing (optional).
Sequencing mode
The sequencing mode, i.e. single-end or paired-end (controlled)

Data

The data which is the result of all the steps above.
File
The name of the sequencing file, if it is inside a folder this field should contain the file path from the user directory (only one filename per sequencing for single ended and two for paired end sequenced files)

Name
initial entry
complete upload
with processed files
Name
whole organism
tissue
stem cells
cell line
Name
abdominal musculature
abducens motor nucleus
abductor hyohyoid
abductor muscle
abductor profundus
absorptive cell
accessory chamber of the maxillary blood sinus
accessory pretectal nucleus
acellular anatomical structure
acid secreting cell
acinar cell
actinotrichium
adaxial cell
adductor
adductor arcus palatini
adductor hyohyoid
adductor hyomandibulae
adductor mandibulae
adductor mandibulae complex
adductor muscle
Lab Name | Principal Investigator | Institution | Country
Johns Hopkins Deep Sequencing and Microarray Core Facility | | | USA
Novogene | - | - | China
Liu Lab | Jiang Liu | Beijing Institute of Genomics, CAS | China
Zon Lab | Leonard Zon | Boston Children's Hospital | USA
Skarmeta Lab | José Luis Gómez-Skarmeta | Centro Andaluz de Biologia del Desarrollo | Spain
Gomez Lab | Manuel J. Gomez | CNIC | Spain
Chatterjee Lab | Aniruddha Chatterjee | Department of Pathology, Dunedin School of Medicine, University of Otago | New Zealand
Horsfield Lab | Julia Horsfield | Department of Pathology, Dunedin School of Medicine, University of Otago | New Zealand
de Wit Lab | Elzo de Wit | Division Of Gene Regulation, Netherlands Cancer Institute | The Netherlands
Rotterdam Genomics core | | Erasmus MC | The Netherlands
Schier Lab | Alex Schier | Harvard University | USA
Cairns Lab | Brad Cairns | HHMI | USA
Ferrer lab | Jorge Ferrer | IDIBAPS | Spain
Winata Lab | Cecilia Winata | IIMCB Warsaw | Poland
Lenhard Lab | Boris Lenhard | Imperial College London | UK
Shkumatava Lab | Alena Shkumatava | Institut Curie | France
Postlethwait Lab | John H. Postlethwait | Institute of Neuroscience, University of Oregon | USA
Pandey Lab | Akhilesh Pandey | Johns Hopkins University School of Medicine | USA
Strahle Lab | Uwe Strähle | Karlsruhe Institute of Technology | Germany
Kere Lab | Juha Kere | Karolinska Institute | Sweden
Name
RNA-seq
ChIP-seq
DNAse-seq
FAIRE-seq
ATAC-seq
CAGE-seq
Bru-seq
PAS-seq
Ribo-seq
SELEX-seq
STARR-seq
RIP-seq
PAR-Clip-seq
iCLIP-seq
ChIP-exo-seq
BS-seq
TAB-seq
GRO-seq
MeDIP-seq
4C-seq
Name
1-cell
2-cell
4-cell
8-cell
16-cell
32-cell
64-cell
128-cell
256-cell
512-cell
1k-cell
High
Oblong
Sphere
Dome
30%-epiboly
50%-epiboly
Germ-ring
Shield
75%-epiboly
Name
AB
AB/C32
AB/EKW
AB/TL
AB/TU
C32
KOLN
DAR
EKW
HK/AB
HK/SING
HK
IND
INDO
SPF 5-D
SPF AB
NA
RW
SAT
SING
Name
AB SOLiD System
ABI 377 automated sequencer
Genome Sequence 20
Genome Sequence FLX+ / FLX
HiSeq X Ten
HiSeq 2000
HiSeq 2500
HiSeq 3000
HiSeq 4000
HiSeq 5000
NextSeq 500 High-Output
NextSeq 500 Mid-Output
HiSeq High-Output v4
HiSeq High-Output v3
HiSeq Rapid Run
HiScanSQ
GAIIx
Li-Cor 4300 DNA Analysis System
MiSeq v3
HiSeq 1500
Name
Illumina
Ion
PacBio
Roche 454
SOLiD
Name
m
f
Name
unstranded
forward
reverse