Data acquisition
Raw sequencing data
Raw data, including sequences in FASTQ format, were obtained from the following public databases:
- ENA (European Nucleotide Archive) at EBI (European Bioinformatics Institute) of EMBL (European Molecular Biology Laboratory),
- SRA (Sequence Read Archive) database at NCBI (National Center for Biotechnology Information),
- HMP (Human Microbiome Project), and
- AGP (American Gut Project).
Data were downloaded using enaBrowserTools and SRA-Tools, with transfers accelerated by Aspera (a high-speed data transfer tool).
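As a rough illustration of the download step, the sketch below only assembles command lines for the two tools; it does not run them. The helper names, the output directory and the accessions are hypothetical placeholders. The flags shown (`-f fastq` and `-d` for `enaDataGet`; `--outdir` and `--split-files` for `fasterq-dump`) are assumptions that should be checked against the installed tool versions, and the Aspera configuration is omitted.

```python
import shlex
from typing import List


def ena_fastq_command(accession: str, dest: str) -> List[str]:
    """Build an enaBrowserTools command fetching FASTQ files for one accession."""
    # Assumed CLI form: enaDataGet -f fastq -d <dest> <accession>
    return ["enaDataGet", "-f", "fastq", "-d", dest, accession]


def sra_fastq_commands(accession: str, dest: str) -> List[List[str]]:
    """Build the two SRA-Tools commands for one run accession."""
    return [
        # Step 1: download the run archive
        ["prefetch", accession],
        # Step 2: convert the archive to FASTQ, one file per mate
        ["fasterq-dump", accession, "--outdir", dest, "--split-files"],
    ]


if __name__ == "__main__":
    # Placeholder accessions, used only to show the resulting command lines
    commands = [ena_fastq_command("ERR000001", "raw")]
    commands += sra_fastq_commands("SRR000001", "raw")
    for cmd in commands:
        print(shlex.join(cmd))
```

Keeping the commands as argument lists makes them straightforward to pass to `subprocess.run` once the flags have been verified.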
Meta-data
Meta-data were first extracted using in-house Perl/R/Python scripts and then manually curated in at least two rounds to ensure quality. Curation was laborious because such information was often incomplete, misplaced or missing altogether. We frequently had to consult the sample descriptions, the supplementary data of related publications or, in some cases, the authors themselves.
Technical meta-data extracted include:
- experiment type (16S or Metagenomics),
- sequencing devices / instruments, and
- number of obtained sequencing reads.
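One way a read count could be derived directly from a downloaded file is sketched below. This is not necessarily how the in-house scripts work: the function name is hypothetical, and the sketch assumes an unwrapped FASTQ file with exactly four lines per record, optionally gzip-compressed.

```python
import gzip
from pathlib import Path


def count_fastq_reads(path: str) -> int:
    """Count reads in a FASTQ file, assuming four lines per record."""
    # Choose a gzip-aware opener for compressed files
    opener = gzip.open if Path(path).suffix == ".gz" else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    # A line count that is not a multiple of four suggests truncation or wrapping
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4
```

The check on the line count gives an early warning for truncated downloads, which would otherwise produce a silently wrong count.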
Host-related, biologically relevant meta-data extracted include:
- disease or health status of the host (referred to as phenotype in our database),
- age,
- sex,
- BMI (body mass index), and
- antibiotic usage.
More meta-data will be added in the future.
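The fields listed above can be pictured as a single per-sample record. The following is a minimal sketch, not the database's actual schema: the class name, field names and types are assumptions chosen for illustration, and `missing_fields` is a hypothetical helper showing how incomplete records might be flagged for manual curation.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class SampleMetadata:
    # Technical meta-data
    experiment_type: Optional[str] = None    # "16S" or "Metagenomics"
    instrument: Optional[str] = None         # sequencing device
    read_count: Optional[int] = None         # number of obtained reads
    # Host-related meta-data
    phenotype: Optional[str] = None          # disease or health status
    age: Optional[float] = None
    sex: Optional[str] = None
    bmi: Optional[float] = None              # body mass index
    antibiotic_usage: Optional[bool] = None


def missing_fields(record: SampleMetadata) -> List[str]:
    """Return the names of fields that still need manual curation."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


if __name__ == "__main__":
    # Illustrative values only, not taken from any real sample
    record = SampleMetadata(experiment_type="16S", read_count=1000, phenotype="Health")
    print(missing_fields(record))
```

Using `None` as the default for every field keeps "not yet curated" distinct from legitimate values such as `False` for antibiotic usage.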