SADA is a structural analogue-based method for assembling protein domain structures, assisted by deep learning. It proceeds in 5 steps: (1) detects full-chain structural analogues in the constructed multi-domain protein structure database (MPDB) according to the input protein domain models; (2) constructs an initial model from the top-ranked analogue; (3) uses a deep learning network to predict the inter-residue distance distribution; (4) builds a multi-domain protein-specific force field to guide domain assembly, based on the predicted inter-residue distance distribution and the properties of multi-domain proteins; (5) assembles the domain models into the final full-chain model, starting from the initial model, using the proposed two-stage differential evolution algorithm. (See example for a SADA assembly result of a 2-domain protein.)
SADA also provides 2 other functions: (1) structural analogue detection; (2) culling the whole MPDB according to input criteria.
Culling proteins from MPDB
The multi-domain protein structure database (MPDB) is constructed in 3 steps: (1) CD-HIT is used to remove redundant protein structures in PDB at a sequence identity cutoff of 100%, and the remaining structures, which share less than 100% sequence identity, are fetched from PDB; (2) DomainParser is then used to determine whether each of these proteins is a multi-domain protein; (3) proteins that DomainParser classifies as single-domain are further checked against CATH and SCOPe to confirm whether they are in fact multi-domain proteins. All multi-domain proteins selected in these 3 steps are collected to construct the MPDB.
As of September 2021, MPDB contains 48225 multi-domain proteins: 37495 with 2 domains, 7539 with 3 domains, 2182 with 4 domains, and 1009 with more than 4 domains.
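As a quick consistency check on these figures, the short Python sketch below sums the per-category counts and derives each category's share of the database. The counts are copied from the text; the dictionary keys and variable names are illustrative only.

```python
# Domain-count breakdown of MPDB as reported for September 2021.
counts = {
    "2 domains": 37495,
    "3 domains": 7539,
    "4 domains": 2182,
    "more than 4 domains": 1009,
}

# The four categories should add up to the reported database size.
total = sum(counts.values())

# Share of each category, in percent of the whole database.
shares = {label: 100.0 * n / total for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label:>20}: {n:6d} ({shares[label]:.1f}%)")
print(f"{'total':>20}: {total:6d}")
```

The categories add up to the stated total of 48225, and proteins with 2 domains make up a little over three quarters of the database.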
In this function, users can cull the whole MPDB according to input criteria and produce a subset of multi-domain protein structures from MPDB together with an info.txt file. The pdbinfo.txt file records, for each selected protein, its size, experiment type, resolution, R-factor, number of domains, the method used to decompose it into domains, and the domain boundaries (see example).
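The kind of filtering this function performs can be sketched in Python as follows. This is not the server's implementation: the record type MpdbEntry, its field names, the cull function, and its parameters are all hypothetical, chosen only to mirror the attributes listed above, and the sample entries are made up.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class MpdbEntry:
    """Hypothetical record mirroring the attributes listed in the text."""
    entry_id: str
    length: int                  # protein size in residues
    experiment: str              # e.g. "X-RAY", "NMR"
    resolution: Optional[float]  # in angstroms; None if not applicable
    r_factor: Optional[float]    # None if not applicable
    n_domains: int


def cull(entries: Iterable[MpdbEntry], *, min_length: int = 0,
         max_length: Optional[int] = None,
         max_resolution: Optional[float] = None,
         max_r_factor: Optional[float] = None,
         n_domains: Optional[int] = None,
         experiment: Optional[str] = None) -> List[MpdbEntry]:
    """Keep only the entries that satisfy every criterion that was given."""
    kept = []
    for e in entries:
        if e.length < min_length:
            continue
        if max_length is not None and e.length > max_length:
            continue
        if n_domains is not None and e.n_domains != n_domains:
            continue
        if experiment is not None and e.experiment != experiment:
            continue
        # An entry without a resolution or R-factor cannot pass a cutoff on it.
        if max_resolution is not None and (
                e.resolution is None or e.resolution > max_resolution):
            continue
        if max_r_factor is not None and (
                e.r_factor is None or e.r_factor > max_r_factor):
            continue
        kept.append(e)
    return kept


# Made-up entries, for illustration only.
demo = [
    MpdbEntry("demo1", 320, "X-RAY", 1.8, 0.19, 2),
    MpdbEntry("demo2", 510, "X-RAY", 3.2, 0.27, 3),
    MpdbEntry("demo3", 150, "NMR", None, None, 2),
]
subset = cull(demo, max_resolution=2.5, max_r_factor=0.25, n_domains=2)
```

Treating each criterion as optional keeps the call site close to a web form in which the user fills in only the fields of interest.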
Detecting structural analogues from MPDB
In this function, users can detect structural analogues for query domains. During detection, the individual query domains are aligned to each protein in the whole MPDB, with no overlap allowed between the alignments of different domains. The harmonic mean of the TM-scores of all domains is defined as the global score (LSscore) for each protein in MPDB, and the top 200 multi-domain structural analogues with the highest LSscores are output. The query domains and related information for the 200 structural analogues are recorded in a *.txt file (see example for query domains of 1fx7A).
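The scoring and ranking step described above can be illustrated with a minimal Python sketch. It assumes the per-domain TM-scores for each MPDB protein have already been computed by a structural alignment program; the function names ls_score and top_analogues, the dictionary layout, and the sample identifiers and scores are hypothetical and not part of SADA.

```python
from typing import Dict, List, Sequence, Tuple


def ls_score(tm_scores: Sequence[float]) -> float:
    """Harmonic mean of the per-domain TM-scores for one MPDB protein."""
    if not tm_scores:
        raise ValueError("at least one domain TM-score is required")
    # A domain that does not align at all (TM-score 0) drives the score to 0.
    if any(s <= 0.0 for s in tm_scores):
        return 0.0
    return len(tm_scores) / sum(1.0 / s for s in tm_scores)


def top_analogues(scores_by_protein: Dict[str, Sequence[float]],
                  k: int = 200) -> List[Tuple[str, float]]:
    """Rank proteins by LSscore, highest first, and return the top k."""
    scored = [(pid, ls_score(s)) for pid, s in scores_by_protein.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Made-up TM-scores for a 2-domain query against three database proteins.
example = {
    "protA": [0.90, 0.90],
    "protB": [1.00, 0.50],
    "protC": [0.20, 0.20],
}
ranking = top_analogues(example, k=2)
```

Because the harmonic mean is dominated by the smallest value, a protein that matches one query domain well but another poorly (protB here) ranks below one that matches both domains reasonably well (protA), even though protB's arithmetic mean is not far behind.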