Read2Tree

Related links

GitHub: https://github.com/DessimozLab/read2tree

Authors' blog post: https://lab.dessimoz.org/blog/2023/04/23/read2tree-infers-trees-from-raw-reads-behind-the-paper

Paper: Inference of phylogenetic trees directly from raw sequencing reads using Read2Tree | Nature Biotechnology

Installation

Installing with conda (the method actually used here)

conda create -n r2t python=3.10
conda activate r2t
conda install -c conda-forge mamba
mamba install -c bioconda read2tree

Installing with Docker

docker pull dessimozlab/read2tree:latest

Installing from source

Install the dependencies with conda, then install read2tree itself from GitHub.

conda install -c conda-forge biopython numpy Cython ete3 lxml tqdm scipy pyparsing requests natsort pyyaml filelock
conda install -c bioconda dendropy pysam
conda install -c bioconda mafft iqtree ngmlr nextgenmap samtools

git clone https://github.com/DessimozLab/read2tree.git
cd read2tree
python setup.py install

Input files

Input 1:

Sequencing reads in FASTQ format

Input 2: marker genes

Procedure taken from: https://github.com/DessimozLab/read2tree/wiki/obtaining-marker-genes

Coronavirus marker genes: https://corona.omabrowser.org/oma/export_markers

Running

Single-species mode

read2tree --tree --standalone_path marker_genes/ --reads read_1.fastq read_2.fastq --output_path output

Multi-species mode

read2tree --standalone_path marker_genes/ --output_path output --reference  # this creates just the reference folder 01 - 03
read2tree --standalone_path marker_genes/ --output_path output --reads species1_R1.fastq species1_R2.fastq
read2tree --standalone_path marker_genes/ --output_path output --reads species2_R1.fastq species2_R2.fastq
read2tree --standalone_path marker_genes/ --output_path output --reads species3_R1.fastq species3_R2.fastq
read2tree --standalone_path marker_genes/ --output_path output --merge_all_mappings --tree
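
The multi-species workflow above follows a fixed pattern: one reference run, one mapping run per sample, then a final merge with tree inference. As a minimal sketch, that pattern can be generated from a sample table in Python. The function name `build_multi_species_commands` and the sample file names are hypothetical; only the read2tree flags shown in the commands above are used.

```python
import shlex


def build_multi_species_commands(marker_dir, output_dir, samples):
    """Return the read2tree invocations for multi-species mode as argv lists.

    `samples` maps a sample label to its list of FASTQ files
    (labels and file names here are illustrative assumptions).
    """
    base = ["read2tree", "--standalone_path", marker_dir, "--output_path", output_dir]
    # Step 1: build only the reference folders (01 - 03).
    commands = [base + ["--reference"]]
    # Step 2: one mapping run per sample against the same reference.
    for fastqs in samples.values():
        commands.append(base + ["--reads", *fastqs])
    # Step 3: merge all mappings and infer the tree.
    commands.append(base + ["--merge_all_mappings", "--tree"])
    return commands


samples = {
    "species1": ["species1_R1.fastq", "species1_R2.fastq"],
    "species2": ["species2_R1.fastq", "species2_R2.fastq"],
    "species3": ["species3_R1.fastq", "species3_R2.fastq"],
}
commands = build_multi_species_commands("marker_genes/", "output", samples)
for argv in commands:
    # Print the commands for inspection instead of executing them.
    print(shlex.join(argv))
```

To actually run the steps, each `argv` list could be passed to `subprocess.run(argv, check=True)` in order. Generating the paired file names from one table also avoids mixing up the R1/R2 files of different samples.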

Example

Use the test dataset shipped in the tests folder of the GitHub repository.

git clone https://github.com/DessimozLab/read2tree.git
cd read2tree/tests/
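read2tree --threads 10 --tree --standalone_path ./marker_genes/ --reads ./sample_1.fastq ./sample_2.fastq --species_name sample1 --output_path ./
  • --threads: number of threads used for mapping
  • --tree: compute the phylogenetic tree; without it, only the alignment is output.
  • --standalone_path: folder containing the marker gene files
  • --reads: sequencing read files
  • --species_name: species name for the analysed sample; defaults to the name of the reads file. It determines the name of the resulting nwk file and the label of the sample's branch inside it.
  • --output_path: output folder; it is created if it does not exist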

Phylogenetic tree file tree_sample1.nwk:

(sample1:0.0505889663,((MNELE:0.9367936021,XENLA:0.1449402450):0.1376429337,(HUMAN:0.0039311745,GORGO:0.0103980892):0.0495577264):0.0629576039,RATNO:0.0173433314);
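
To check quickly where the new sample sits, the leaf labels and their branch lengths can be pulled out of the Newick string. The following is a minimal sketch using only the Python standard library; the helper name `leaf_branch_lengths` is hypothetical, and the regular expression assumes a simple Newick string like the one above (unquoted labels, every leaf followed by `:length`). For anything more involved, a real tree library such as ete3 or Biopython (both in the dependency list above) is the more robust choice.

```python
import re

# A leaf is a label that directly follows "(" or "," and is followed by ":length".
# Internal branch lengths follow ")" and are therefore not matched.
LEAF_PATTERN = re.compile(r"[(,]([^():,;]+):([0-9.eE+-]+)")


def leaf_branch_lengths(newick):
    """Map each leaf label to the length of its terminal branch."""
    return {name: float(length) for name, length in LEAF_PATTERN.findall(newick)}


tree = (
    "(sample1:0.0505889663,((MNELE:0.9367936021,XENLA:0.1449402450):0.1376429337,"
    "(HUMAN:0.0039311745,GORGO:0.0103980892):0.0495577264):0.0629576039,"
    "RATNO:0.0173433314);"
)
leaves = leaf_branch_lengths(tree)
for name, length in leaves.items():
    print(f"{name}\t{length}")
```

The label `sample1` in the output is the value passed to --species_name; the other five labels are the reference species from the marker gene set.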

All result files (entries starting with a number are folders):

mplog.log
tree_sample_1.nwk
concat_sample_1_dna.phy
concat_sample_1_aa.phy
06_align_sample_1_dna/
06_align_sample_1_aa/
05_ogs_map_sample_1_dna/
05_ogs_map_sample_1_aa/
sample_1_all_cov.txt
sample_1_all_sc.txt
04_mapping_sample_1/
03_align_aa/
03_align_dna/
02_ref_dna/
01_ref_ogs_aa/
01_ref_ogs_dna/
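
Because the folder names carry a two-digit step prefix, a listing of the output directory can be grouped by pipeline step. This is a small sketch under the assumption that the naming matches the listing above; the helper name `group_by_step` is hypothetical.

```python
import re
from collections import defaultdict

# Folders written by the pipeline start with a two-digit step number and "_".
STEP_PREFIX = re.compile(r"^(\d{2})_")


def group_by_step(names):
    """Split output entries into step folders (keyed by step number) and plain files."""
    steps = defaultdict(list)
    files = []
    for name in names:
        match = STEP_PREFIX.match(name)
        if match:
            steps[match.group(1)].append(name.rstrip("/"))
        else:
            files.append(name)
    return dict(steps), files


listing = [
    "mplog.log", "tree_sample_1.nwk", "concat_sample_1_dna.phy", "concat_sample_1_aa.phy",
    "06_align_sample_1_dna/", "06_align_sample_1_aa/",
    "05_ogs_map_sample_1_dna/", "05_ogs_map_sample_1_aa/",
    "sample_1_all_cov.txt", "sample_1_all_sc.txt", "04_mapping_sample_1/",
    "03_align_aa/", "03_align_dna/", "02_ref_dna/", "01_ref_ogs_aa/", "01_ref_ogs_dna/",
]
steps, files = group_by_step(listing)
for step in sorted(steps):
    print(step, steps[step])
print("files:", files)
```

For a real run, `listing` could be replaced by `os.listdir(output_path)`.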

Full list of options

usage: read2tree [-h] [--version] [--output_path OUTPUT_PATH]
                 --standalone_path STANDALONE_PATH [--reads READS [READS ...]]
                 [--read_type READ_TYPE] [--threads THREADS] [--split_reads]
                 [--split_len SPLIT_LEN] [--split_overlap SPLIT_OVERLAP]
                 [--split_min_read_len SPLIT_MIN_READ_LEN] [--sample_reads]
                 [--genome_len GENOME_LEN] [--coverage COVERAGE]
                 [--min_cons_coverage MIN_CONS_COVERAGE]
                 [--dna_reference DNA_REFERENCE] [--sc_threshold SC_THRESHOLD]
                 [--ngmlr_parameters NGMLR_PARAMETERS] [--check_mate_pairing]
                 [--debug] [--sequence_selection_mode SEQUENCE_SELECTION_MODE]
                 [-s SPECIES_NAME] [--tree] [--merge_all_mappings] [-r]
                 [--min_species MIN_SPECIES] [--single_mapping SINGLE_MAPPING]
                 [--ref_folder REF_FOLDER]
                 [--remove_species_mapping REMOVE_SPECIES_MAPPING]
                 [--remove_species_ogs REMOVE_SPECIES_OGS] [--keep_all_ogs]
                 [--ignore_species IGNORE_SPECIES]

read2tree is a pipeline allowing to use read data in combination with an OMA
standalone output run to produce high quality trees.

optional arguments:
  -h, --help            show this help message and exit
  --version             Show programme's version number and exit.
  --output_path OUTPUT_PATH # output folder
                        [Default is current directory] Path to output
                        directory.
  --standalone_path STANDALONE_PATH # marker genes dataset
                        [Default is current directory] Path to the folder where marker genes
                        (i.e. reference orthologous groups) in fasta format are located.
  --reads READS [READS ...] # sequencing reads
                        [Default is none] Reads to be mapped to reference. If
                        paired end add separated by space.
  --read_type READ_TYPE # read type: long reads or short reads
                        [Default is "short" reads] Type of reads to use for
                        mapping, either "short" or "long". Either ngm for short reads or ngmlr for long
                        will be used.
  --threads THREADS     [Default is 1] Number of threads for the mapping using
                        ngm / ngmlr! # number of threads
  --split_reads         [Default is off] Splits reads as defined by split_len
                        (200) and split_overlap (0) parameters.
  --split_len SPLIT_LEN
                        [Default is 200] Parameter for selection of read split
                        length can only be used in combinationwith with long
                        read option.
  --split_overlap SPLIT_OVERLAP
                        [Default is 0] Reads are split with an overlap defined
                        by this argument.
  --split_min_read_len SPLIT_MIN_READ_LEN
                        [Default is 200] Reads longer than this value are cut
                        into smaller values as defined by --split_len.
  --sample_reads        [Default is off] Splits reads as defined by split_len
                        (200) and split_overlap (0) parameters.
  --genome_len GENOME_LEN
                        [Default is 2000000] Genome size in bp.
  --coverage COVERAGE   [Default is 10] coverage in X. Only considered if
                        --sample reads is selected.
  --min_cons_coverage MIN_CONS_COVERAGE
                        [Default is 1] Minimum number of nucleotides at
                        column.
  --dna_reference DNA_REFERENCE
                        [Default is None] Reference file that contains
                        nucleotide sequences (fasta, hdf5). If not given it
                        will usethe RESTapi and retrieve sequences from
                        http://omabrowser.org directly. NOTE: internet
                        connection required!
  --sc_threshold SC_THRESHOLD
                        [Default is 0.25; Range 0-1] Parameter for selection
                        of sequences from mapping by completeness compared to
                        its reference sequence (number of ACGT basepairs vs
                        length of sequence). By default, all sequences are
                        selected.
  --ngmlr_parameters NGMLR_PARAMETERS
                        [Default is none] In case this parameters need to be
                        changed all 3 values have to be changed [x,subread-
                        length,R]. The standard is: ont,256,0.25.
                        Possibilities for these parameter can be found in the
                        original documentation of ngmlr.
  --check_mate_pairing  Check whether in case of paired end reads we have
                        consistent mate pairing. Setting this option will
                        automatically select the overlapping reads and do not
                        consider single reads.
  --debug               [Default is false] Changes to debug mode: * bam files
                        are saved!* reads are saved by mapping to OG
  --sequence_selection_mode SEQUENCE_SELECTION_MODE
                        [Default is sc] Possibilities are cov and cov_sc for
                        mapped sequence.
  -s SPECIES_NAME, --species_name SPECIES_NAME
                        [Default is name of read 1st file] Name of species for
                        mapped sequence.
  --tree                [Default is false] Compute tree, otherwise just output
                        concatenated alignment!
  --merge_all_mappings  [Default is off] In case multiple species were mapped
                        to the same reference this allows to merge this
                        mappings and build a tree with all included species!
  -r, --reference       [Default is off] Just generate the reference dataset
                        for mapping.
  --min_species MIN_SPECIES
                        Min number of species in selected orthologous groups.
                        If not selected it will be estimated such that around
                        1000 OGs are available.
  --single_mapping SINGLE_MAPPING
                        [Default is none] Single species file allowing to map
                        in a job array.
  --ref_folder REF_FOLDER
                        [Default is none] Folder containing reference files
                        with sequences sorted by species.
  --remove_species_mapping REMOVE_SPECIES_MAPPING
                        [Default is none] Remove species present in data set
                        after mapping step completed and only do analysis on
                        subset. Input is comma separated list without spaces,
                        e.g. XXX,YYY,AAA.
  --remove_species_ogs REMOVE_SPECIES_OGS
                        [Default is none] Remove species present in data set
                        after mapping step completed to build OGs. Input is
                        comma separated list without spaces, e.g. XXX,YYY,AAA.
  --keep_all_ogs        [Default is on] Keep all orthologs after addition of
                        mapped seq, which means also the OGs that have no
                        mapped sequence. Otherwise only OGs are used that have
                        the mapped sequence for alignment and tree inference.
  --ignore_species IGNORE_SPECIES
                        [Default is none] Ignores species part of the OMA
                        standalone pipeline. Input is comma separated list
                        without spaces, e.g. XXX,YYY,AAA.

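Other notes

In the protein sequence file oma-seqs.fa.gz from the database source files, the > in each header line is immediately followed by a space.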
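
The space matters for any script that takes the characters directly after > as the sequence identifier, since it may end up with an empty or space-prefixed ID. A minimal sketch of a tolerant header reader is shown below; the function names `fasta_ids` and `read_ids` and the example header contents are hypothetical, and only the leading space after > comes from the observation above.

```python
import gzip


def fasta_ids(lines):
    """Yield the first whitespace-delimited token of every FASTA header line."""
    for line in lines:
        if line.startswith(">"):
            # split() without arguments discards the space that follows ">".
            fields = line[1:].split()
            if fields:
                yield fields[0]


def read_ids(path):
    """Collect sequence IDs from a plain or gzip-compressed FASTA file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return list(fasta_ids(handle))


# Illustrative in-memory records; real headers in oma-seqs.fa.gz may look different.
example = ["> ID0001 some description\n", "MKVL\n", ">ID0002\n", "MAAT\n"]
print(list(fasta_ids(example)))
```

For the real file, `read_ids("oma-seqs.fa.gz")` would return the identifiers with the leading space already removed.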