Michel PETITJEAN / Procrustes Methods, Shape Recognition, Similarity and Docking

PROCRUSTES METHODS, SHAPE RECOGNITION, SIMILARITY AND DOCKING

© Michel Petitjean (retired since Jan 1st, 2023)

Author's most recent professional address:
INSERM ERL U1133 (BFA, CNRS UMR 8251), Université Paris Cité
35 rue Hélène Brion, 75205 Paris Cedex 13, France.

Formerly (2010-2018): MTi, INSERM UMR-S 973, Université Paris 7.
Formerly (2007-2009): CEA/DSV/iBiTec-S/SB2SM (CNRS URA 2096), Saclay, France.
Formerly (1987-2006): ITODYS, CNRS UMR 7086, Université Paris 7.

Contact: petitjean.chiral@gmail.com

SHAPE RECOGNITION AND PROCRUSTES METHODS

Shape recognition, i.e. geometric pattern recognition, is a problem commonly encountered in various areas: data analysis, robotics, molecular modeling, biochemistry, etc.
There are two basic problems:

Similarity: the experimentalist looks for identical parts among a family of at least two objects. Most of the time, this is done by computer via an optimal superposition of the objects. In the simplest situation, the objects are finite sets of points. Data analysis methods of optimal superposition of two sets of points are known as Procrustes methods. In chemistry and biochemistry, the optimal superposition of sets of points via least squares techniques are known as RMS (Root Mean Square) methods: calculations of RMSD (Root Mean Square Deviation).

Docking: the experimentalist looks for the geometric complementarity of two objects, as in the key-lock model. This recognition of shape complementarity may be done by computer, but it occurs spontaneously in the real life, such as for enzymatic recognition.

Both problems may be solved with the same tool: the objects are represented (modelized) by colored mixtures.

THE COLORED MIXTURE MODEL

The idea is to work with distributions of random variables in a space which is little more complicated than the ordinary euclidean space (see [1] for a rigorous presentation).
A colored mixture is a random variable X taking values in the product of the d-dimensional euclidean space by a measurable space called the space of colors. The marginal distribution of X in R^d is a mixture of d-variate distributions, each of them being attached to a color. The color may be viewed as a supplementary value on a (d+1)^th pseudo axis, in a space having a numeric part (d axis) and a non numeric part.
In order to understand the use of this concept, we consider some simple situations. First imagine three equally weighted gaussian distributions in R: a red one, a green one, and a blue one. It means that the marginal distribution of X in the space of colors is Prob(red)=1/3, Prob(green)=1/3, and Prob(blue)=1/3. These three marginal distributions in R exist together, and constitute a mixture [2]. An other example is the molecular model of the charges distributions in a neutral molecule: it is a mixture of two equally probable distributions: the distribution of the negative charges, and the distribution of the positive charges (additivity of the effects of the charge are handled via an adequate operator). A last example is the case of a set of n points in R^d, not two of them having the same color: each marginal distribution in R^d concentrates all the mass on a single point in R^d, and the values taken by X are couples (color,point).

Now, for physical applications, we consider two or more than two colored mixtures such that the marginal components of the colored mixtures in the space of colors are almost surely equal: this is the colored mixture model.
In other words, it means that, for all mixtures of marginal distributions in R^d, we have created a correspondence between all the marginal distributions associated to a given color. E.g., we consider two sets of n=3 points (x_r,x_g,x_b) and (y_r,y_g,y_b), not two points in each set having the same color: in this situation, the colored mixture model establish a pairwise correspondence (x_r<->y_r), (x_g<->y_g), and (x_b<->y_b). If the same sets of points were considered without any color (i.e., all colors are identical), we would have to consider all n! possible pairwise correspondences between the two sets.

SIMILARITY AND WASSERSTEIN DISTANCE

The best way to evaluate the similarity between two objects is to define a distance in the space of the objects. The Wasserstein metric is a well-known distance between distributions in R^d [3]. This probability metric is extended in the context of the colored mixture model. Considering two colored mixtures X and Y and any of their joint distribution W, the Wasserstein distance is minimized for all rotations R and translations t of Y:

D² = Inf_{W,R,t} E[(X-Y)'·(X-Y)]

The distance D has the following properties in the case of two finite sets of n colored points [1]:

It is the Procrustes distance in the Procrustes method of optimal superposition of the two sets.
Note that minimizing for orthogonal transformations or linear transformations rather than for rotations and translations are just trivial changes to the method.
It is equal to the RMS, in the RMS method of spatial alignment of the two sets (RMSD calculations).
When all colors are identical, we get both the Procrustes and the RMS methods in the case of an unspecified correspondence. It means that we are looking for the optimal correspondence (among n!) which minimizes D. This correspondence is associated to the permutation matrix [nW].

A major consequence is that all the least squares methods above (Procrustes and RMS, RMSD calculations) are extended to any sets and distributions, discarding if they are continuous and/or infinite, but provided that their inertia are finite.

The Monge-Kantorovitch transportation problem, in which a distribution of mass is transportated at minimal cost onto an other distribution of mass, is known to have deep relations with the Wasserstein metric [4]. There is now a formal link between the transportation problem and the optimal superposition problems, in which we move optimally a distribution onto an other one.

A very important application of the similarity theory occurs when the two objects are a pair of inverted images in a mirror: see the The Mathematical Theory of Chirality.

An analytical solution of the optimization problem has been given in some situations, discarding if the distributions of X and Y are inverted mirror images or not [5].

The freeware ARMS performs the optimal RMS superposition of two rigid conformers, provided that the pairwise correspondence is known or implicit. The computing time is linearly related to the size of the data.
The freeware CSR detects automatically the maximal common 3D motif of two rigid conformers and performs the optimal RMS superposition [6]. The computing time is quadratically related to the size of the data.

SHAPE COMPLEMENTARITY AND DOCKING

The shape complementarity could be viewed from the generalization of the parallelism concept to non trivial situations. Although the parallelism could be obvious for lines, planes, concentric cylinders, concentric spheres, etc., providing a general definition would require drastic conditions on the existence and the properties of the surfaces. Unfortunately, many practical situations involve sets or distributions for which the concept of surface does not make sense, such as for a finite discrete set of points. Note that in this latter situation, the polyhedral convex hull has little use in a docking context. So, there is a need to exhibit a general docking criterion, working for various kinds of sets and distributions.

The variance criterion hereafter suffices in many cases to measure the degree of complementarity of two distributions. Considering two colored mixtures X and Y and any of their joint distribution W, the variance is minimized for all rotations R and translations t of Y:

Z = Inf_{W,R,t} Var[(X-Y)'·(X-Y)]

The docking coefficient Z is formally analog to the minimized squared Wasserstein distance D², except that the variance operator is used rather than the expectation operator. The distributions must have finite four-order moments rather than finite inertias.

An analytical solution of the optimization problem has been given in some situations [1], including the case of two sets of n colored points in R³ [7]. This situation is encountered in chemistry, and thus the freeware DOG has been built to perform geometric docking. Despite that there is actually no automatic detection of the optimal correspondance, it is useful in simple situations, such as when two single complementary DNA strands have to be tested for geometric matching.

REFERENCES

PETITJEAN M.
From Shape Similarity to Shape Complementarity: Toward a Docking Theory.
J. Math. Chem. 2004, 35[3], 147-158 (DOI 10.1023/B:JOMC.0000033252.59423.6b).
Free version for readers: https://rdcu.be/bzt83 (link created by ; downloads, prints and copies are for subscribers only)

EVERITT B.S., HAND D.J.
Finite Mixture Distributions., chap. 1.
Chapman and Hall, London 1981.

RACHEV S.T.
Probability Metrics and the Stability of Stochastic Models.
Wiley, New-York 1981.

RACHEV S.T., RÜSCHENDORF L.
Mass Transportation Problems. Vol. I and II.
Springer-Verlag, New-York 1998.

PETITJEAN M.
Chiral Mixtures.
J. Math. Phys. 2002, 43[8], 4147-4157 (DOI 10.1063/1.1484559).
A free copy for personal use only (©AIP, American Institute of Physics) is available from the HAL repository: hal-02122882; copyright rules apply.

PETITJEAN M.
Interactive Maximal Common 3D Substructure Searching with the Combined SDM/RMS Algorithm.
Comput. Chem. 1998, 22[6], 463-465 (DOI 10.1016/S0097-8485(98)00017-5).

PETITJEAN M.
Solving the Geometric Docking Problem for Planar and Spatial Sets.
Internet Electron. J. Mol. Des. 2002, 1[4], 185-192.
Download PDF paper in open access on Publisher's site, or a free copy deposited on the HAL repository: hal-02115007 (with permission of the Publisher: copyright rules apply).