... | ... | @@ -4,21 +4,21 @@ |
|
|
|
|
|
## What is BiG-SCAPE
|
|
|
|
|
|
Bioinformatically, mining (meta)genomes for **Biosynthetic Gene Clusters** (BGCs) encoding Secondary Metabolites entails first identifying and annotating BGCs on the genome.
|
|
|
Bioinformatically, mining (meta)genomes for **Biosynthetic Gene Clusters** (BGCs) encoding the production of Secondary Metabolites has become a key strategy for Natural Product discovery. On a single-genome basis, this process is performed by tools such as [antiSMASH](https://antismash.secondarymetabolites.org).
|
|
|
|
|
|
BiG-SCAPE (Biosynthetic Gene Similarity Clustering and Prospecting Engine) is a tool that takes additional steps to define a distance between BGCs in order to map the BGC diversity in similarity networks which are then processed for automated grouping. These similarity networks graphically summarize the diversity of the BGCs, as well as contain multiple annotations to help identify novel compounds, make ecological correlations and so on.
|
|
|
When studying large sets of genomes and metagenomes, it becomes essential to perform analyses at a large scale. **BiG-SCAPE** (Biosynthetic Gene Similarity Clustering and Prospecting Engine) is a tool that calculates distances between BGCs in order to map the BGC diversity onto sequence similarity networks, which are then processed for automated reconstruction of Gene Cluster Families: groups of gene clusters that encode the biosynthesis of highly similar or identical molecules. BiG-SCAPE's interactive visualizations of these similarity networks allow effective exploration of the diversity of BGCs, linking them to knowledge from reference data within the [MIBiG repository](https://mibig.secondarymetabolites.org/).
|
|
|
|
|
|
## How does it work in a nutshell
|
|
|
|
|
|
BiG-SCAPE tries to (recursively) read BGC information stored as GenBank files from the [input](input) folder (which, preferably, correspond to gene clusters identified with a tool like [antiSMASH](https://antismash.secondarymetabolites.org/)).
|
|
|
BiG-SCAPE (recursively) reads BGC information stored as GenBank files from the [input](input) folder (which, preferably, correspond to gene clusters identified with a tool like [antiSMASH](https://antismash.secondarymetabolites.org/)).
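
The recursive search can be pictured with a minimal sketch. This is not BiG-SCAPE's own code: the function name `find_genbank_files` and the accepted suffixes (`.gbk`, `.gb`) are assumptions made for illustration only.

```python
from pathlib import Path


def find_genbank_files(input_dir, suffixes=(".gbk", ".gb")):
    """Return all GenBank files below input_dir, including subfolders.

    The suffix list is an assumption for this sketch; check the input
    page for the file names BiG-SCAPE actually accepts.
    """
    root = Path(input_dir)
    return sorted(
        path
        for path in root.rglob("*")  # rglob walks every subfolder
        if path.is_file() and path.suffix.lower() in suffixes
    )
```

Calling `find_genbank_files("input")` would return the matching paths in sorted order, regardless of how deeply they are nested.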
|
|
|
|
|
|
BiG-SCAPE then uses the [Pfam database](http://pfam.xfam.org/) and `hmmscan` from the HMMER (v3.1b2) suite to [predict Pfam domains](domain_prediction) in each sequence.
|
|
|
BiG-SCAPE then uses the [Pfam database](http://pfam.xfam.org/) and `hmmscan` from the HMMER suite to [predict Pfam domains](domain_prediction) in each sequence, thus summarizing each BGC as a linear string of Pfam domains.
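
To illustrate what "a linear string of Pfam domains" means, here is a small sketch that orders domain hits by gene position and then by position within the protein. The `DomainHit` record and `domain_string` function are hypothetical names, and the accessions in the usage note are only example values; BiG-SCAPE's real data structures may differ.

```python
from typing import NamedTuple


class DomainHit(NamedTuple):
    """One predicted domain (hypothetical record for this sketch)."""
    gene_index: int  # position of the gene within the cluster
    start: int       # start of the hit within the protein sequence
    pfam_id: str     # Pfam accession of the matched domain


def domain_string(hits):
    """Order hits along the cluster and return their Pfam accessions."""
    ordered = sorted(hits, key=lambda hit: (hit.gene_index, hit.start))
    return [hit.pfam_id for hit in ordered]
```

For example, three hits given in arbitrary order, such as `DomainHit(1, 10, "PF00698")`, `DomainHit(0, 250, "PF02801")` and `DomainHit(0, 5, "PF00109")`, are returned in cluster order, starting with the two domains of gene 0.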
|
|
|
|
|
|
For every pair of BGCs in the set, the pairwise distance between these BGCs is calculated as the weighted combination of the [Jaccard](distance#jaccard), [AI](distance#ai) and [DSS](distance#dss) indices. Two types of output are generated: text files which include [Network files](output#network_files) and an [Interactive visualization](output#interactive_visualization). This is done taking into account different cutoff values for the distances (i.e. only pairs with Raw Distance < `cutoff` are written in the final `.network` file).
|
|
|
For every pair of BGCs in the set, the pairwise distance between them is calculated as the weighted combination of the [Jaccard](distance#jaccard), [Adjacency Index (AI)](distance#ai) and [Domain Sequence Similarity (DSS)](distance#dss) indices. Two types of output are generated: text files which include [Network files](output#network_files) and an [Interactive visualization](output#interactive_visualization). Different cutoff values for the distances can be taken into account in one or multiple runs (i.e. only pairs with Raw Distance < `cutoff` are written in the final `.network` file).
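
A rough sketch of this step follows. It assumes that the three indices are similarities in the range 0 to 1, that the weights sum to 1, and that the distance is one minus the weighted sum; the exact formula and the tuned weights are described on the distance pages and may differ. Only the Jaccard index is computed here; AI and DSS are passed in as precomputed numbers, and all function names are hypothetical.

```python
def jaccard_index(domains_a, domains_b):
    """Shared distinct domains divided by all distinct domains."""
    set_a, set_b = set(domains_a), set(domains_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def combined_distance(jaccard, ai, dss, weights):
    """Assumed form: distance = 1 - weighted sum of the three indices."""
    w_jaccard, w_ai, w_dss = weights
    similarity = w_jaccard * jaccard + w_ai * ai + w_dss * dss
    return 1.0 - similarity


def filter_pairs(distances, cutoff):
    """Keep only the pairs whose distance is strictly below the cutoff."""
    return {pair: d for pair, d in distances.items() if d < cutoff}
```

With a cutoff of 0.3, a pair at distance 0.25 would be kept and a pair at distance 0.30 would be dropped, matching the strict `<` comparison described above.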
|
|
|
|
|
|
The distances for each cutoff value will be used to try to [automatically define](GCFs and GCCs) 'Gene Cluster Families' (GCFs) and 'Gene Cluster Clans' (GCCs).
|
|
|
The distances for each cutoff value will be used to [automatically define](GCFs and GCCs) 'Gene Cluster Families' (GCFs) and 'Gene Cluster Clans' (GCCs).
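
As a simplified illustration only, the sketch below groups BGCs by connected components of the cutoff-filtered network, using a small union-find. This is a stand-in technique chosen for brevity, not the procedure BiG-SCAPE uses to define GCFs and GCCs (see the linked page for that); the function name is hypothetical.

```python
def connected_components(nodes, edges):
    """Group nodes that are linked, directly or indirectly, by an edge."""
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent links to the representative, shortening the path.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for node in nodes:
        groups.setdefault(find(node), set()).add(node)
    # Sort for a stable, reproducible order of groups.
    return sorted(groups.values(), key=lambda group: sorted(group))
```

Here `edges` would be the pairs that survived a given cutoff, so a stricter cutoff yields fewer edges and therefore smaller groups.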
|
|
|
|
|
|
By default, BiG-SCAPE tries to use the `/product` information of antiSMASH-processed GenBank files to separate the analysis into eight [BiG-SCAPE classes](BiG-SCAPE classes). Each has different (tuned) sets of [weights](distance indices weights) for the distance components. You can also choose to combine all BGC classes in one network file (`--mix`) and deactivate the default classification (`--no_classify`). It is also possible to prevent analysis of any of the BiG-SCAPE classes by using the `--banned_classes` parameter.
|
|
|
By default, BiG-SCAPE uses the `/product` information of antiSMASH-processed GenBank files to separate the analysis into eight [BiG-SCAPE classes](BiG-SCAPE classes). Each has different (tuned) sets of [weights](distance indices weights) for the distance components. You can also choose to combine all BGC classes into a single network file (`--mix`) and deactivate the default classification (`--no_classify`). It is also possible to prevent analysis of any of the BiG-SCAPE classes by using the `--banned_classes` parameter.
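
The following sketch assembles a command line from the options named above. It only builds the argument list and does not run anything. The helper name is hypothetical, input and output options are left out on purpose, and passing the class names as space-separated values after `--banned_classes` is an assumption; `python bigscape.py -h` shows the accepted syntax and class names.

```python
def build_bigscape_args(mix=False, no_classify=False, banned_classes=()):
    """Build an argument list using only the options described above."""
    args = ["python", "bigscape.py"]
    if mix:
        args.append("--mix")          # combine all classes in one network
    if no_classify:
        args.append("--no_classify")  # deactivate the default classification
    if banned_classes:
        # Assumption: class names follow the flag as separate values.
        args.append("--banned_classes")
        args.extend(banned_classes)
    return args
```

The returned list could be handed to `subprocess.run` once the input and output options have been added.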
|
|
|
|
|
|
Learn more about the BiG-SCAPE options with `python bigscape.py -h` or by going to the specific [wiki page](parameters).
|
|
|
|
... | ... | |