Starting from the --inputdir folder, BiG-SCAPE will recursively look for files with the .gbk extension. The following files are excluded:
Files whose names include any of the strings specified in --exclude_gbk_str (default: 'final'). The default excludes the summary GenBank file produced by antiSMASH, whose name ends with <clustername>.final.gbk
Files with spaces in their path (including the filename), because HMMER does not handle spaces well
Files with the string '_ORF', which is used internally by BiG-SCAPE
Files with duplicated names (e.g. in different folders)
Files where no protein sequences could be extracted
Files whose total sequence length (summed over all records) is shorter than min_bgc_size
Files with format issues that prevent Biopython from parsing them
By default, only the following files are included:
Files with 'cluster' in their name (antiSMASH 4)
Files with 'region' in their name (antiSMASH 5)
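The name-based rules above can be sketched in a few lines of Python. This is only an illustration and not BiG-SCAPE's actual code: the function name `select_gbk_files` and its parameters are hypothetical, keeping the first file when two share a name is an assumption, and the checks that require parsing the file (protein extraction, minimum size, Biopython parse errors) are left out.

```python
from pathlib import Path


def select_gbk_files(input_dir, include_strs=("cluster", "region"),
                     exclude_strs=("final",)):
    """Return {bgc_name: path} for .gbk files passing the name-based filters.

    Hypothetical helper: it mirrors the rules listed above, not BiG-SCAPE's
    real implementation.
    """
    selected = {}
    # rglob walks the input folder recursively, like the --inputdir search.
    for path in sorted(Path(input_dir).rglob("*.gbk")):
        bgc_name = path.stem  # file name without the .gbk extension
        if " " in str(path):
            continue  # spaces anywhere in the path
        if "_ORF" in path.name:
            continue  # string reserved for internal use
        if any(s in path.name for s in exclude_strs):
            continue  # e.g. the antiSMASH summary file <clustername>.final.gbk
        if not any(s in path.name for s in include_strs):
            continue  # neither 'cluster' nor 'region' in the name
        if bgc_name in selected:
            continue  # duplicated name (assumption: the first one found wins)
        selected[bgc_name] = path
    return selected
```

Calling `select_gbk_files("my_input_folder")` returns a dictionary keyed by BGC name, so duplicated file names in different subfolders collapse to a single entry.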
If you need to exclude or include files with certain strings in their name, use the --exclude_gbk_str and
If two CDS features overlap (e.g. because of splicing events), BiG-SCAPE tolerates an overlap of up to 10% of the length of the shortest CDS. If the overlap is larger, BiG-SCAPE discards the smallest feature from the analysis.
The file's name (without extension) is used as the BGC name from here on.
Note that, for the time being, BiG-SCAPE does not perform any taxon-specific analysis (i.e. bacterial, archaeal, fungal and plant BGCs are all treated the same).
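As a rough illustration of the overlap rule, the sketch below compares two CDS features given as `(start, end)` coordinate pairs. The function name `resolve_overlap` and the half-open coordinate convention are assumptions made for this example. It also ignores strand and assumes the longer feature is kept on a tie, so it should not be read as BiG-SCAPE's exact behaviour.

```python
def resolve_overlap(cds_a, cds_b, max_fraction=0.10):
    """Return the CDS features to keep, given two (start, end) pairs.

    Hypothetical helper; coordinates are treated as half-open intervals.
    """
    (a_start, a_end), (b_start, b_end) = cds_a, cds_b
    len_a = a_end - a_start
    len_b = b_end - b_start
    # Number of positions shared by the two features (0 if they are disjoint).
    overlap = max(0, min(a_end, b_end) - max(a_start, b_start))
    # The tolerated overlap is a fraction of the shortest feature's length.
    if overlap <= max_fraction * min(len_a, len_b):
        return [cds_a, cds_b]
    # Too much overlap: drop the smaller feature, keep the larger one.
    return [cds_a] if len_a >= len_b else [cds_b]
```

For example, features `(0, 100)` and `(95, 200)` share 5 positions, which is within 10% of the shorter feature (100), so both are kept. Features `(0, 100)` and `(50, 300)` share 50 positions, so only `(50, 300)` is kept.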