Starting from the --inputdir folder, BiG-SCAPE will recursively look for files with the .gbk extension. Currently, BiG-SCAPE is optimized to work with files from antiSMASH v4. The following files are excluded:
Files whose names include any of the strings specified in --exclude_gbk_str (default: 'final', which excludes the summary GenBank file produced by antiSMASH, whose name ends with <clustername>.final.gbk)
Files with spaces in their path (including the filename). Spaces don't work well with hmmer
Files with the string '_ORF', which is used internally by BiG-SCAPE
Files with duplicated names (e.g. in different folders)
Files where no protein sequences could be extracted
Files whose total sequence length (summed across all records) is less than min_bgc_size
Files with format issues that prevent BioPython from parsing them
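As a rough illustration of the name-based rules above, the following Python sketch walks a folder recursively and applies the filters that do not require parsing the file. It is not BiG-SCAPE's actual code: the function name select_gbk_files is hypothetical, the '_ORF' check is assumed to apply to the file name, and keeping the first file of a duplicated name is an assumption. The checks that need the file contents (protein sequences, minimum size, BioPython parsing) are left out.

```python
from pathlib import Path


def select_gbk_files(inputdir, exclude_gbk_str=("final",)):
    """Return {bgc_name: path} for .gbk files that pass the name-based filters.

    Hypothetical helper for illustration; not part of BiG-SCAPE.
    """
    selected = {}
    # Sorted so that the result does not depend on directory listing order.
    for path in sorted(Path(inputdir).rglob("*.gbk")):
        if not path.is_file():
            continue
        # Skip files whose name contains an excluded string (e.g. 'final').
        if any(s in path.name for s in exclude_gbk_str):
            continue
        # Skip files with spaces anywhere in their path.
        if " " in str(path):
            continue
        # Skip files containing '_ORF' (assumed here to mean the file name).
        if "_ORF" in path.name:
            continue
        bgc_name = path.stem  # file name without the extension
        # Assumption: on duplicated names, keep the first file encountered.
        if bgc_name in selected:
            continue
        selected[bgc_name] = path
    return selected
```

Calling select_gbk_files("my_antismash_results") would return a dictionary mapping each BGC name to the path of the file that was kept.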
If two CDS features overlap (e.g. splicing events), BiG-SCAPE tolerates an overlap of at most 10% of the length of the shortest CDS. If a larger overlap is detected, BiG-SCAPE discards the shorter of the two features from the analysis.
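The overlap rule can be sketched as follows. This is a minimal illustration, not BiG-SCAPE's implementation: the function name resolve_overlap is hypothetical, each CDS is assumed to be a (start, end) pair with end exclusive, and when both features have the same length the first one is arbitrarily chosen for removal.

```python
def resolve_overlap(cds_a, cds_b, max_fraction=0.1):
    """Return the CDS to discard, or None if the overlap is tolerated.

    Hypothetical helper; each CDS is a (start, end) tuple, end exclusive.
    """
    start_a, end_a = cds_a
    start_b, end_b = cds_b
    len_a = end_a - start_a
    len_b = end_b - start_b
    # Length of the shared region; zero or negative means no overlap.
    overlap = min(end_a, end_b) - max(start_a, start_b)
    # Tolerate overlaps up to max_fraction of the shortest feature.
    if overlap <= max_fraction * min(len_a, len_b):
        return None
    # Otherwise discard the shorter feature (cds_a on a tie, by assumption).
    return cds_a if len_a <= len_b else cds_b
```

For example, a CDS spanning (0, 100) and another spanning (80, 300) share 20 positions, which is more than 10% of the shorter feature (100 positions), so the sketch returns (0, 100) as the feature to drop.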
The file's name (without extension) is used as the BGC name from here on.
Note that, at present, BiG-SCAPE does not perform any taxon-specific analysis (i.e. bacterial, archaeal, fungal or plant BGCs are all treated the same).