To select the domains predicted in each BGC for the distance calculation, BiG-SCAPE uses one of two strategies. In global mode, all domains from both BGCs are taken into account, whereas in glocal mode, only a subset of the domains is chosen. The logic of glocal mode is as follows:
During pairwise distance calculation, genes are represented as a concatenation of (Pfam) domain ids, and each BGC in the pair is represented as a list of those domain concatenations (strandedness is not taken into account).
BiG-SCAPE then uses the SequenceMatcher class from Python's difflib library to find the longest match between the two lists (internally called the LCS, or "Longest Common Subcluster"). The second BGC is also tried in the reverse orientation, and the orientation with the largest LCS is kept.
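The LCS search described above can be sketched as follows. This is a minimal illustration, not BiG-SCAPE's actual code: the function name `longest_common_subcluster`, its return tuple, and the example Pfam ids are assumptions made for this example. Only `difflib.SequenceMatcher` and its `find_longest_match` method come from the standard library.

```python
from difflib import SequenceMatcher


def longest_common_subcluster(genes_a, genes_b):
    """Find the longest run of consecutive genes shared by two BGCs.

    Each BGC is a list of strings, one per gene, where a string is the
    concatenation of that gene's Pfam domain ids.
    Returns (start_a, start_b, length, b_reversed). When b_reversed is True,
    start_b is an index into the reversed copy of genes_b.
    """
    def longest(a, b):
        matcher = SequenceMatcher(None, a, b)
        return matcher.find_longest_match(0, len(a), 0, len(b))

    forward = longest(genes_a, genes_b)
    reverse = longest(genes_a, genes_b[::-1])

    # Keep the orientation of the second BGC that gives the largest match.
    if reverse.size > forward.size:
        return reverse.a, reverse.b, reverse.size, True
    return forward.a, forward.b, forward.size, False


# Hypothetical BGCs; bgc_b carries most of bgc_a's genes in reverse order.
bgc_a = ["PF00109PF02801", "PF00698", "PF08659", "PF00550", "PF00975"]
bgc_b = ["PF00975", "PF00550", "PF08659", "PF00698", "PF13561"]

result = longest_common_subcluster(bgc_a, bgc_b)
print(result)  # (1, 1, 4, True)
```

In this example the forward comparison only finds single-gene matches, while the reversed second BGC shares a run of four genes with the first, so the reverse orientation is kept.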
To proceed to the next step, the LCS must either be 3 genes long or contain at least one gene marked by antiSMASH as "Core Biosynthetic".
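The acceptance rule can be expressed as a small predicate. This sketch rests on stated assumptions: the function name and parameters are hypothetical, core biosynthetic genes are given as a set of gene positions, and "3 genes long" is read here as "at least 3 genes", which the text does not state explicitly.

```python
def lcs_allows_extension(lcs_start, lcs_length, core_gene_positions, min_length=3):
    """Decide whether an LCS qualifies for the extension stage.

    lcs_start / lcs_length describe the LCS within one BGC's gene list;
    core_gene_positions is the set of gene indices annotated as
    "Core Biosynthetic" in that same BGC.
    """
    lcs_positions = range(lcs_start, lcs_start + lcs_length)
    has_core_gene = any(pos in core_gene_positions for pos in lcs_positions)
    # Assumption: "3 genes long" is interpreted as "at least 3 genes".
    return lcs_length >= min_length or has_core_gene


long_enough = lcs_allows_extension(1, 4, set())
short_with_core = lcs_allows_extension(2, 1, {2})
short_without_core = lcs_allows_extension(0, 2, {5})
```

Here a four-gene LCS passes on length alone, a single-gene LCS passes because it covers a core gene, and a two-gene LCS without a core gene is rejected.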
In the extension stage, the domain selection is first extended for the BGC with the fewest genes up(down)stream of the LCS. If both BGCs have the same number of genes up(down)stream, each expansion is attempted and scored, and the one with the best score is kept. The domain selection of the remaining BGC is then expanded (per side) according to the following scoring algorithm in the Alignment Stage: for every gene in the reference BGC, a gene with the same domain organization is searched for in the remaining BGC. If such a gene is found, a bonus is added to the score (match=5), together with a penalty proportional to the distance from the current position (number of genes * gap penalty, where gap=-2), and the current position is moved to the position of the matching gene. If no gene with the same domain organization is found, the score is decreased by a penalty (