Three types of input files are supported:
Read-based
Raw reads (clean reads) or merged reads (forward and reverse reads merged into longer sequences by a program such as PEAR).
Assembly-based
Contigs generated through metagenome assembly (e.g., with MEGAHIT, MetaSPAdes, or SPAdes).
Tabular Files
BLAST-style table (tab-delimited, "\t") generated by sequence similarity search tools (e.g., BLAST, USEARCH, DIAMOND).
You can download the database directly from https://zenodo.org/records/10045943 or with a third-party download tool.
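As a quick sanity check before using the tabular input type, a table can be parsed and validated with a short script. The sketch below assumes the standard 12-column layout of BLAST `-outfmt 6` (which is also DIAMOND's default tabular output); the README only states that the table is tab-delimited, so the column layout is an assumption, and `read_blast_table` is a hypothetical helper, not part of CCycDB.

```python
import csv
import io

# Standard 12 columns of BLAST "-outfmt 6" (also DIAMOND's default tabular output).
# Assumption: the tabular input follows this layout; the README only says tab-delimited.
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def read_blast_table(handle):
    """Yield one dict per alignment from a tab-delimited BLAST-style table."""
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if len(row) != len(BLAST6_COLUMNS):
            raise ValueError(
                f"line {line_no}: expected {len(BLAST6_COLUMNS)} "
                f"tab-separated fields, got {len(row)}"
            )
        hit = dict(zip(BLAST6_COLUMNS, row))
        # Convert the numeric fields that are typically used for filtering.
        hit["pident"] = float(hit["pident"])
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        yield hit


# Made-up example row, for illustration only.
example = "read1\tgeneA_ref\t85.0\t50\t7\t0\t1\t150\t10\t59\t1e-20\t95.1\n"
hits = list(read_blast_table(io.StringIO(example)))
print(hits[0]["qseqid"], hits[0]["pident"], hits[0]["evalue"])
```

For a real file, replace the `io.StringIO` example with `open("sample.diamond", newline="")` (the file name here is hypothetical).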
$ git clone https://github.com/ccycdb/CCycDB.PL
Usage:
perl GetFun_CCycdb.pl [-situation read-based|assembly-based|tabular] [-wd work_directory] [-m diamond|usearch|blast] [-f filetype] [-s seqtype] [-id] [-e] [-tpm] [-norm xx] [-rs xx] [-thread xx] [-od xx]
[Options:]
-situation | The situation for input files (read-based|assembly-based|tabular) |
-wd | Work directory. Ensure that the files downloaded in Step 1 and your input files are in this directory. |
-od | Output directory. This directory may or may not already exist. |
-m | Database searching program you plan to use (diamond|usearch|blast). |
-f | Extension of your sequence files (e.g., fastq, fastq.gz, fasta, fasta.gz, fq, fq.gz, fa, fa.gz) or (faa, fna) or (diamond|usearch|blast). With "-situation tabular", -f supports "diamond|usearch|blast". Ensure that the file type is supported by the tool selected with -m (e.g., with "-m usearch" the supported file types for -f are "fastq|fasta", and with "-m blast" they are "fasta|fa"). |
-s | (nucl|prot) Sequence type. |
-tpm | (0|1) "1" requires that $sample.tpm exists in the work directory (default: 0). This option requires "-situation assembly-based". |
-id | Minimum identity to report an alignment (default: 30). |
-e | Maximum e-value to report alignments (default: 1e-5). |
-norm | (0|1) 0: no random subsampling; 1: perform random subsampling. |
-rs | The number of sequences for random subsampling (default: the lowest number of sequences). Note: this parameter requires "-norm 1". |
-thread | Number of threads (default: 2). |
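The prerequisites stated in the option descriptions can be checked before launching a run. The following is a minimal sketch of such a check; `check_options` is a hypothetical helper, not part of GetFun_CCycdb.pl, and the `-f`/`-m` compatibility table contains only the two pairings the option text names explicitly.

```python
# Allowed values copied from the option descriptions above.
SITUATIONS = {"read-based", "assembly-based", "tabular"}
METHODS = {"diamond", "usearch", "blast"}
# Only the two -m/-f pairings mentioned in the text; other tools are not checked.
FILETYPES_BY_METHOD = {
    "usearch": {"fastq", "fasta"},
    "blast": {"fasta", "fa"},
}


def check_options(situation, method, filetype, tpm=0, norm=0, rs=None):
    """Return a list of problems; an empty list means no stated rule is violated."""
    problems = []
    if situation not in SITUATIONS:
        problems.append(f"unknown -situation: {situation}")
    if method not in METHODS:
        problems.append(f"unknown -m: {method}")
    if situation == "tabular":
        # With tabular input, -f names the tool that produced the table.
        if filetype not in METHODS:
            problems.append("-situation tabular needs -f diamond|usearch|blast")
    elif method in FILETYPES_BY_METHOD and filetype not in FILETYPES_BY_METHOD[method]:
        allowed = "|".join(sorted(FILETYPES_BY_METHOD[method]))
        problems.append(f"-m {method} supports -f {allowed}")
    if tpm == 1 and situation != "assembly-based":
        problems.append("-tpm 1 requires -situation assembly-based")
    if rs is not None and norm != 1:
        problems.append("-rs requires -norm 1")
    return problems


print(check_options("read-based", "diamond", "fasta", norm=1, rs=10000000))
print(check_options("read-based", "blast", "fastq", tpm=1))
```

Returning a list of messages rather than raising on the first problem makes it possible to report every conflicting option at once.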
Read-based
$ perl GetFun_CCycdb.pl -situation read-based -wd ./ -m diamond -f fasta -s nucl -norm 0 -thread 10 -od ./output
$ perl GetFun_CCycdb.pl -situation read-based -wd ./ -m diamond -f fasta -s nucl -norm 1 -rs 10000000 -thread 10 -od ./output
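The idea behind "-norm 1" and "-rs" (bringing every sample to the same number of sequences) can be illustrated in a few lines. This is a sketch of random subsampling in general, not the script's actual implementation: the function names, the fixed seed, and the choice to keep all sequences when a sample has fewer than the requested depth are assumptions.

```python
import random


def subsample(sequence_ids, depth, seed=0):
    """Draw `depth` IDs without replacement; keep everything if fewer are available."""
    ids = list(sequence_ids)
    if depth >= len(ids):
        return ids
    # A seeded generator keeps the draw reproducible between runs.
    return random.Random(seed).sample(ids, depth)


def subsample_all(samples, depth=None, seed=0):
    """Subsample every sample to `depth`; default is the smallest sample size."""
    if depth is None:
        depth = min(len(ids) for ids in samples.values())
    return {name: subsample(ids, depth, seed) for name, ids in samples.items()}


# Made-up sequence IDs, for illustration only.
samples = {
    "SampleA": [f"a{i}" for i in range(8)],
    "SampleB": [f"b{i}" for i in range(5)],
}
even = subsample_all(samples)
print({name: len(ids) for name, ids in even.items()})
```

Calling `subsample_all(samples, depth=3)` mirrors passing an explicit value with "-rs", while omitting `depth` mirrors the documented default of the lowest number of sequences.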
Output:
FunProfile_read-based_$method_random.txt OR FunProfile_read-based_$method_norandom.txt:
Gene Mean identity SampleA SampleB
geneA 70 5 20
geneB 80 10 12
SEQ2GENE/$sample.SEQ2G.txt :
Query sequence Gene
k141_433371_length_91162_1 geneA
k141_455489_length_11328_1 geneB
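The two outputs are related: a per-sample sequence-to-gene table can be tallied into a gene-by-sample profile. The sketch below shows one way to do that for downstream checks. It assumes that each SEQ2GENE file is tab-delimited with two columns and no header line, and that profile values are plain sequence counts; neither is confirmed by the README, and `count_genes` and `build_profile` are hypothetical helpers.

```python
import csv
import io
from collections import Counter


def count_genes(handle):
    """Count how many query sequences map to each gene in one two-column table."""
    counts = Counter()
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        counts[row[1]] += 1
    return counts


def build_profile(per_sample):
    """Turn {sample: Counter} into {gene: [count per sample, in insertion order]}."""
    genes = sorted(set().union(*per_sample.values()))
    return {
        gene: [counts.get(gene, 0) for counts in per_sample.values()]
        for gene in genes
    }


# Made-up mapping for one sample, using the sequence names from the example above.
sample_a = count_genes(io.StringIO(
    "k141_433371_length_91162_1\tgeneA\n"
    "k141_455489_length_11328_1\tgeneB\n"
))
profile = build_profile({"SampleA": sample_a, "SampleB": Counter({"geneB": 4})})
for gene, counts in profile.items():
    print(gene, *counts, sep="\t")
```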
Assembly-based
$ perl GetFun_CCycdb.pl -situation assembly-based -wd ./ -m diamond -f fasta -s nucl -norm 0 -thread 10 -od ./output
$ perl GetFun_CCycdb.pl -situation assembly-based -wd ./ -m diamond -f fasta -s nucl -tpm 1 -norm 0 -thread 10 -od ./output
Output:
FunProfile_read-based_$method_random.txt OR FunProfile_read-based_$method_norandom.txt
ORF2GENE/$sample.ORF2GENE.txt
ORF2GENE.tpm (only if "-tpm 1" is set and "$sample.tpm" exists)
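To illustrate how ORF-level TPM values could be combined with an ORF-to-gene table, the sketch below sums TPM per gene. The file layouts are assumptions: both `$sample.tpm` and the ORF2GENE table are taken to be tab-delimited with two columns (ORF ID and TPM, ORF ID and gene) and no header, which the README does not specify. `read_pairs` and `gene_tpm` are hypothetical helpers and may not match what the script writes to ORF2GENE.tpm.

```python
import csv
import io
from collections import defaultdict


def read_pairs(handle):
    """Read a two-column tab-delimited table into a dict (first column -> second)."""
    return {
        row[0]: row[1]
        for row in csv.reader(handle, delimiter="\t")
        if len(row) >= 2
    }


def gene_tpm(orf2gene, orf_tpm):
    """Sum TPM over all ORFs assigned to each gene; ORFs without a TPM count as 0."""
    totals = defaultdict(float)
    for orf, gene in orf2gene.items():
        totals[gene] += float(orf_tpm.get(orf, 0.0))
    return dict(totals)


# Made-up ORF IDs and TPM values, for illustration only.
orf2gene = read_pairs(io.StringIO("orf1\tgeneA\norf2\tgeneA\norf3\tgeneB\n"))
orf_tpm = read_pairs(io.StringIO("orf1\t1.5\norf2\t2.5\norf3\t10\n"))
print(gene_tpm(orf2gene, orf_tpm))
```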
Tabular Files
$ perl GetFun_CCycdb.pl -situation tabular -wd ./ -m diamond -f diamond -norm 0 -thread 10 -od ./output
$ perl GetFun_CCycdb.pl -situation tabular -wd ./ -m diamond -f diamond -norm 1 -thread 10 -od ./output
Output:
FunProfile_read-based_$method_random.txt OR FunProfile_read-based_$method_norandom.txt
SEQ2GENE/$sample.SEQ2GENE.txt
Depending on the tools used, you may also want to cite:
DIAMOND: Buchfink B, Xie C, Huson D H. Fast and sensitive protein alignment using DIAMOND. Nature Methods, 2015, 12(1): 59-60.
BLASTX: Boratyn G M, Camacho C, Cooper P S, et al. BLAST: a more efficient report with usability improvements. Nucleic Acids Research, 2013, 41(W1): W29-W33.
USEARCH: Edgar R C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 2010, 26(19): 2460-2461.
CSVTK: Csvtk—CSV/TSV Toolkit. Available online: https://bioinf.shenwei.me/csvtk/
SEQKIT: Shen W, Le S, Li Y, et al. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS ONE, 2016, 11(10): e0163962.