Michel PETITJEAN / The Chemoinformatics Page

THE CHEMINFORMATICS PAGE
(with a few bioinformatics)

Mining Data in Graph Databases

The Parity Phenomenon in Large Chemical Databases

The Graphic Mendeleev Table

The Radius Diameter Diagram

2D Chemical Databases: Miscellaneous

Virtual Screening of 3D Chemical Databases

Molecular Symmetry and Chirality
[6.6]Chiralane: The Chemical Nomenclature Challenge

Molecular Shape Descriptors: the Cylindrical Model

Proteins: geometric descriptors (cavities, channels, concavities, pores, tunnels)

© Michel Petitjean (retired since Jan 1st, 2023)

Author's most recent professional address:
INSERM ERL U1133 (BFA, CNRS UMR 8251), Université Paris Cité
35 rue Hélène Brion, 75205 Paris Cedex 13, France.

Formerly (2010-2018): MTi, INSERM UMR-S 973, Université Paris 7.
Formerly (2007-2009): CEA/DSV/iBiTec-S/SB2SM (CNRS URA 2096), Saclay, France.
Formerly (1987-2006): ITODYS, CNRS UMR 7086, Université Paris 7.

Contact: petitjean.chiral@gmail.com

1. MINING DATA IN GRAPH DATABASES

Databanks contain various kinds of data: numerical, categorical, ordinal, text, and so on. Most kinds of data can be handled by the numerous statistical and data mining tools available. Nevertheless, there is a lack of tools to handle populations of graphs. Consider, for example, a population of thousands of graphs. How can it be "summarized", numerically or graphically? For real values, we know many descriptors: mean, median, standard deviation, quantiles, etc. Populations of vectors are also easily summarized with the mean, the variance matrix, convex hulls, etc. But what about graphs? How can a "mean" graph be computed? What are the "extreme" graphs? How are graphs "dispersed" (diversity)? How could two populations of graphs be quickly compared? And, much more difficult: how can clusters be detected in a space of graphs?

This problem has appeared in chemistry, because the structural formula of a molecule is most often represented by a colored graph in chemical databanks. For example, the water molecule H₂O is a graph containing three nodes and two edges. The graph associated with the formula of a chemical compound is undirected, simple and not necessarily connected, and both its nodes and its edges carry non-numerical values (colors). Some chemical databanks, such as Chemical Abstracts and Beilstein, contain millions of graphs.
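
As a minimal sketch, the water example can be encoded as an undirected simple graph whose nodes and edges carry colors. The data layout and the helper name `is_simple_undirected` are assumptions made for illustration; they do not correspond to any databank format.

```python
# Hypothetical encoding of a colored molecular graph (illustration only).
# Node colors are element symbols; edge colors are bond types.
water_nodes = {0: "O", 1: "H", 2: "H"}
water_edges = {
    frozenset((0, 1)): "single",
    frozenset((0, 2)): "single",
}


def is_simple_undirected(nodes, edges):
    """Return True if every edge joins two distinct, known nodes.

    Storing edges as frozenset keys makes them undirected and rules out
    multiple edges between the same pair of nodes.
    """
    return all(len(e) == 2 and all(v in nodes for v in e) for e in edges)


print(len(water_nodes), len(water_edges),
      is_simple_undirected(water_nodes, water_edges))
```

A disconnected graph (a salt, for instance) fits the same layout: its components simply share no edge.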

Some of the exploratory statistical tools used on large chemical databases are presented below. They can be used as parts of data mining tools for other graph databases. Detailed methods and extensive results are available in the cited papers.

2. THE PARITY PHENOMENON IN LARGE CHEMICAL DATABASES

Plotting the number of molecules having a given number of carbon atoms shows that, in large chemical files, even values occur more often than odd values. This parity phenomenon was discovered in 1990 [1,2]. It does not mean that chemists around the world cooperated to obtain this result. A semi-quantitative explanation of the phenomenon has been proposed [2]. It is based on graph theory and on the statistical occurrences of the elements of the periodic table.

Posted with permission from [2].
© 1990 American Chemical Society.
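
The counting behind such a plot can be sketched as follows. The function names and the tiny list of formulas are hypothetical and serve only to show the bookkeeping; they are not data from [1,2], and the sketch says nothing about the explanation proposed in [2].

```python
import re
from collections import Counter


def carbon_count(formula):
    """Number of carbon atoms in a molecular formula such as 'C6H6'."""
    total = 0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol == "C":  # 'Cl', 'Ca', ... are distinct symbols
            total += int(count) if count else 1
    return total


def parity_histogram(formulas):
    """Histogram of carbon counts, plus totals for even and odd counts."""
    histogram = Counter(carbon_count(f) for f in formulas)
    even = sum(n for c, n in histogram.items() if c % 2 == 0)
    odd = sum(n for c, n in histogram.items() if c % 2 == 1)
    return histogram, even, odd


# Toy input invented for illustration; it is not taken from any databank.
sample = ["CH4", "C2H6O", "C6H6", "C7H8", "C8H10", "C2H5Cl"]
histogram, even, odd = parity_histogram(sample)
print(sorted(histogram.items()), even, odd)
```

Note that a formula without carbon is counted here as an even value (zero); a real study would have to decide how to treat such entries.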

3. THE GRAPHIC MENDELEEV TABLE

The distribution of the chemical elements (atoms) in a chemical database can be visualized graphically. The periodic table of the elements is viewed as a rectangular array, and each element of this array contains the occurrence of its associated atom. The array is subjected to a factorial correspondence analysis. Thus, each atom of the Mendeleev table gets coordinates in a 6-dimensional space: this is the "Graphic Mendeleev Table" [1,3]. In order to visualize the Graphic Mendeleev Table, the successive convex hulls of the points are drawn in the principal planes [4,5].
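
The drawing of successive convex hulls ("peeling") in one principal plane can be sketched as below. This is a minimal illustration using the standard monotone-chain hull algorithm; it is an assumption of this sketch, not necessarily the peeling algorithm studied in [4,5], and the point coordinates are invented rather than factorial coordinates from [1,3].

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Vertices of the 2D convex hull (monotone chain, counter-clockwise)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def peel(points):
    """Successive convex hulls: remove the hull vertices, then repeat."""
    remaining = set(points)
    layers = []
    while remaining:
        hull = convex_hull(list(remaining))
        layers.append(hull)
        remaining -= set(hull)
    return layers


# Invented coordinates: two nested squares and a central point.
points = [(0, 0), (4, 0), (4, 4), (0, 4),
          (1, 1), (3, 1), (3, 3), (1, 3), (2, 2)]
layers = peel(points)
print([len(layer) for layer in layers])
```

Points lying on a hull edge without being vertices are left for a later layer in this sketch; other conventions are possible.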

4. THE RADIUS DIAMETER DIAGRAM

There are few statistical tools to perform exploratory data analysis in graph databases. The basic observation is the graph, but it may also be a connected component, or simply a node, an edge, a cycle, a path, a concentric layer, etc. [2]. Many univariate and multivariate distributions may be generated from these populations. One of the most useful is the distribution of the number of connected components having a given radius R and a given diameter D. Recall that the eccentricity of a node is the largest distance from this node to any node of its connected component; the radius is the smallest eccentricity over the nodes of the component, and the diameter is the largest. It is known that D takes values between R and 2R, so that the bivariate distribution in the (R,D) plane lies in the angular sector bounded by the two lines D=R and D=2R. Displaying the clusters in this bivariate distribution offers a schematic graphical summary of the population, which is called the Radius Diameter Diagram [6,7].
The quantity I=(D-R)/R takes values in [0,1]. It is used as a shape index, and its distribution can be plotted (see example in [7]).

Posted with permission from [7].
© 1992 American Chemical Society.
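
The definitions of R, D and I recalled above can be sketched with breadth-first searches on an unweighted graph. The function names and the adjacency-dictionary layout are assumptions of this sketch, and the two example graphs are hydrogen-suppressed skeletons chosen for illustration.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Shortest path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def radius_diameter(adjacency):
    """List of (R, D) pairs, one per connected component."""
    seen, result = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component = list(bfs_distances(adjacency, start))
        seen.update(component)
        # Eccentricity of u: largest distance from u within its component.
        ecc = [max(bfs_distances(adjacency, u).values()) for u in component]
        result.append((min(ecc), max(ecc)))
    return result


def shape_index(radius, diameter):
    """I = (D - R) / R; undefined for a single-node component (R = 0)."""
    if radius == 0:
        raise ValueError("shape index is undefined when R = 0")
    return (diameter - radius) / radius


# Two components: a 5-node path (n-pentane skeleton) and a 6-node ring
# (cyclohexane skeleton), the latter with nodes numbered 10..15.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
graph.update({10 + i: [10 + (i - 1) % 6, 10 + (i + 1) % 6] for i in range(6)})
pairs = radius_diameter(graph)
print(pairs, [shape_index(r, d) for r, d in pairs])
```

The path reaches the upper line D=2R (I=1) and the ring lies on the lower line D=R (I=0), which are the two boundaries of the angular sector.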

5. 2D CHEMICAL DATABASES: MISCELLANEOUS

As mentioned above, there are many distributions of interest in graph databases. Most of them have been computed from a large chemical database containing more than three million chemical compounds [1,2,7,8]. The shape of some of these distributions has not yet been interpreted, such as the number of concentric layers in the plane: number of nodes per layer and depth of the layer [8].

These statistical tools are part of technological survey tools, which help to detect unexplored areas [9].

Finally, it should be mentioned that, in chemistry, searching for a partial subgraph in a graph database is performed with a highly efficient tool called the DARC system. It quickly retrieves compounds from very large chemical databases [10,11]. It works on Markush formulas, which are commonly encountered in reaction databases and patent databases (see the Questel website).

Software:
Enumerating the graph automorphisms associated with the components of a chemical graph is done in the QCM freeware. QCM is devoted to the computation of the chiral index, but it may also be used to enumerate the graph automorphisms.
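
To make the notion concrete, here is a brute-force sketch that enumerates the automorphisms of a small colored graph by backtracking. It is not the method implemented in QCM, which is not described here; the function name and the data layout are assumptions, and the exponential cost makes it suitable for tiny graphs only.

```python
def automorphisms(colors, edges):
    """All node permutations preserving node colors, adjacency and edge colors.

    colors: dict node -> color
    edges:  dict frozenset({u, v}) -> color (undirected simple graph)
    """
    nodes = list(colors)
    found = []

    def extend(mapping):
        if len(mapping) == len(nodes):
            found.append(dict(mapping))
            return
        u = nodes[len(mapping)]
        for image in nodes:
            if image in mapping.values() or colors[image] != colors[u]:
                continue
            # The pair (u, w) and its image must be both non-adjacent,
            # or both adjacent with the same edge color.
            consistent = all(
                edges.get(frozenset((u, w)))
                == edges.get(frozenset((image, mapping[w])))
                for w in mapping
            )
            if consistent:
                mapping[u] = image
                extend(mapping)
                del mapping[u]

    extend({})
    return found


# Water: swapping the two hydrogens is the only non-trivial automorphism.
water = automorphisms(
    {0: "O", 1: "H", 2: "H"},
    {frozenset((0, 1)): "single", frozenset((0, 2)): "single"},
)
# Methane: the four hydrogens can be permuted freely.
methane = automorphisms(
    {0: "C", 1: "H", 2: "H", 3: "H", 4: "H"},
    {frozenset((0, h)): "single" for h in range(1, 5)},
)
print(len(water), len(methane))
```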

6. VIRTUAL SCREENING OF 3D CHEMICAL DATABASES

The virtual screening of a chemical database can be based on the geometries of the molecules. It is thus better to store the geometry of each molecule in the database than to recompute it at each screening. Most molecules are flexible and have an infinite number of low-energy conformations. It is therefore necessary to store an optimal number of conformers, rather than storing only one conformer or systematically storing, say, 50 or 100 conformers per molecule. It is difficult to generate an optimal set of conformers for a flexible molecule directly. This is why a two-step procedure has been built [12]: (a) an excess of conformers is generated algorithmically in order to cover the conformational space sufficiently, and (b) the conformational redundancy is reduced with a clustering algorithm computing an optimal partition of the initial set of conformers. Note that few clustering algorithms are able to compute the number of classes ONLY from the input data. Step (a) may be performed with most of the usual conformer generators. Step (b) is implemented in the DIVCF freeware.
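
The idea of step (b) can be illustrated by a toy sketch in which the number of classes comes from the data alone. The stop rule used here (single-linkage hierarchical clustering cut at the largest jump between successive merge heights) is an assumption chosen for simplicity; it is not the stop criterion of [12] nor the DIVCF algorithm. The conformers are stood in for by invented 2D descriptor vectors, and all names are hypothetical.

```python
import math


def cluster_by_largest_gap(points):
    """Single-linkage agglomerative clustering, cut at the largest jump
    between successive merge heights (an assumed, illustrative stop rule)."""
    clusters = [[i] for i in range(len(points))]
    # history[k] = (height of the k-th merge, partition after k merges)
    history = [(0.0, [list(c) for c in clusters])]

    def linkage(a, b):
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        height, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
        history.append((height, [list(c) for c in clusters]))

    if len(history) == 1:
        return history[0][1]
    gaps = [history[k + 1][0] - history[k][0] for k in range(len(history) - 1)]
    best = max(range(len(gaps)), key=gaps.__getitem__)
    return history[best][1]  # partition just before the largest jump


def medoid(cluster, points):
    """Member with the smallest total distance to the rest of its cluster."""
    return min(cluster,
               key=lambda i: sum(math.dist(points[i], points[j]) for j in cluster))


# Invented descriptors for six hypothetical conformers.
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
partition = cluster_by_largest_gap(points)
representatives = sorted(medoid(c, points) for c in partition)
print(partition, representatives)
```

One representative per class is then kept in place of the redundant conformers. This simple rule can never return a single class and costs far too much for large sets, so it should be read only as an illustration of the principle.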

7. MOLECULAR SYMMETRY AND CHIRALITY
[6.6]CHIRALANE: THE CHEMICAL NOMENCLATURE CHALLENGE

[6.6]Chiralane (C₂₇H₂₈) is a polycycloalkane designed in 2004 by A. Schwartz [13].
Its molecular graph is non-planar and has 768 automorphisms.
Its hydrogen-suppressed graph has 12 automorphisms.
[6.6]Chiralane is both symmetric and chiral.
There are three orthogonal axes of order 2 and four axes of order 3.
The central atom is an undistorted tetrahedral sp³-hybridized carbon bearing four rigorously identical homochiral substituents.
[6.6]Chiralane has all the direct rotational symmetries of the regular tetrahedron, but none of its mirror symmetries or higher order improper symmetries.

The chiral index of [6.6]chiralane is 0.9824 and the chiral index of its skeleton is 1.0000: it is the maximal value allowed for the chiral index.
The predicted optical rotation of [6.6]chiralane is [α]_D=692 degrees (molar rotation 244000 degrees) [13].

The nomenclature challenge:
Until now, nomenclature assignment programmes have failed with [6.6]chiralane, although this polycycloalkane (C₂₇H₂₈) is smaller than cholesterol (C₂₇H₄₆O).
Cheminformaticians willing to run their own nomenclature assignment programme with [6.6]chiralane are welcome to download its connection table and email me the result when successful.

More about [6.6]chiralane can be found in [13].

Download 3D connection table: mol2 format, hin format.
H-suppressed 3D connection table: mol2 format, hin format.

8. MOLECULAR SHAPE DESCRIPTORS: THE CYLINDRICAL MODEL

It is usual to describe the shape of a molecule with a sphere. Calculating the four parameters defining a sphere (center, radius) is easy (most often it is the smallest sphere enclosing the atomic centers), and the handling of spheres in the framework of shape analysis calculations is generally simple. But it appears that the spherical shape model is too crude and that a cylindrical model is more accurate [14]. The latter is also more suitable in the context of calculations of protein cavities, channels and pockets (see [14] for a review).
Seven parameters are needed to define a right circular cylinder: two shape parameters (radius and height), and five location parameters (center and axis direction). There is no unique way to define a cylindrical molecular shape. The two simplest ways to operate are:

Compute the smallest height enclosing cylinder (needs to compute first the two closest enclosing slabs)
Compute the smallest radius enclosing cylinder (i.e. compute first the axis and the radius, then the height)

Although there is a simple algorithm to compute the smallest height enclosing cylinder, computing the smallest radius enclosing cylinder is more difficult [15]. The latter computation is done by the CYL freeware, and both computations are done by the RADI freeware, which is devoted to the computation of many molecular shape descriptors.
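
The seven parameters can be made concrete with a sketch that, for a prescribed axis, returns the radius, height and center of the tightest enclosing cylinder around a set of atomic centers. It deliberately does not search for the optimal axis, which is the difficult part treated in [15]; the function name and the coordinates are assumptions made for illustration.

```python
import math


def enclosing_cylinder(points, axis_point, axis_direction):
    """Tightest cylinder around `points` for a GIVEN axis (no optimization).

    Returns (radius, height, center): radius is the largest distance from
    a point to the axis, height is the extent of the projections on the axis.
    """
    norm = math.sqrt(sum(c * c for c in axis_direction))
    u = [c / norm for c in axis_direction]  # unit vector along the axis
    projections, radius = [], 0.0
    for p in points:
        v = [p[k] - axis_point[k] for k in range(3)]
        t = sum(v[k] * u[k] for k in range(3))          # coordinate along axis
        perp_sq = sum(c * c for c in v) - t * t         # squared distance to axis
        radius = max(radius, math.sqrt(max(perp_sq, 0.0)))
        projections.append(t)
    low, high = min(projections), max(projections)
    center = [axis_point[k] + 0.5 * (low + high) * u[k] for k in range(3)]
    return radius, high - low, center


# Invented atomic centers, enclosed by a cylinder whose axis is the z axis.
atoms = [(1.0, 0.0, 0.0), (0.0, 1.0, 2.0), (-1.0, 0.0, 4.0), (0.0, -1.0, 1.0)]
radius, height, center = enclosing_cylinder(atoms, (0.0, 0.0, 0.0), (0.0, 0.0, 2.0))
print(radius, height, center)
```

Radius and height are the two shape parameters; the center (three coordinates) and the unit axis direction (two independent angles) are the five location parameters.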

9. PROTEINS: GEOMETRIC DESCRIPTORS
(cavities, channels, concavities, pores, tunnels)

The major difficulty in defining the shape of a protein is deciding where the boundary lies between the interior and the exterior of the protein.
Examples: imagine a large flat cavity at the surface of a protein (this surface is an unclear concept, too): should we flag it as being part of the protein domain, or as being part of its exterior? The same question can be asked for a deep and narrow cavity opening at the surface of the protein. Many people would agree that the large flat surface cavity is at the exterior of the protein and that the narrow deep cavity is in its interior. But it is hard to provide an answer for situations intermediate between these two.
Another question: when two cavities intersect somewhere in a protein, should we count two cavities or only one?
There is a crucial need for robust definitions of these concepts, but no definitive answer exists.
Some tools and software are proposed below. They do not provide a definitive answer to the questions above, but they may be helpful for modeling purposes.

CCCPP computes channels in proteins. It is a non-convex polyhedral model of channels based on a Delaunay triangulation. It was particularly useful in the case of the CYP3A4 (cytochrome P450 3A4): see ref. [14] (in which there is a historical survey of channel calculation algorithms); see further applications to the CYP3A4 in refs [16], [17], [18].
CONICA computes the largest enclosed cone in a protein [19]. It is a conical model of protein concavities. For ligands, the smallest enclosing cone is computable with CONE: see ref. [20], in which algorithms for calculating circumscribed cones and best fitting cones are described (and a conicity index is defined).
PCI computes the Pocket Convexity Index and the Pocket Sphericity Index [21].
PPIC computes, without any parameter, the interface between the partners of a complex, such as a protein-ligand complex or a protein-protein complex [22], or between protein chains [23]. For the latter, it was used to follow the evolution of a molecular dynamics simulation without performing spatial alignments [23].

REFERENCES

[1] PETITJEAN M., DUBOIS J.E.
Statistical Study of a large Structural File Based on the Mendeleev Table.
Collec. Czech. Chem. Comm. 1990, 55[6], 1404-1415.
A free scanned copy deposited with permission of the Publisher is available from the HAL repository: hal-02123580 (copyright rules apply).

[2] PETITJEAN M., DUBOIS J.E.
Topological Statistics on a Large Structural File.
J. Chem. Inf. Comput. Sci. 1990, 30[3], 332-343. DOI 10.1021/ci00067a019.

[3] PETITJEAN M., SAPORTA G.
A Possible Comparison between Large Chemical Databases: Generation of the Graphic Mendeleiev Table with Correspondence Analysis.
Bulletin de l'Institut International de Statistique, Communications libres, livraison 2, pp. 196-197,
47th session of the I.I.S., Paris, 29 Aug. - 6 Sept. 1989.

[4] PETITJEAN M., SAPORTA G.
Peeling a Large Set of Individuals in the Factorial Space coming from Correspondence Analysis.
Data Analysis, Learning Symbolic and Numeric Knowledge, pp. 321-325,
E. Diday Ed., Nova Science Publ., New-York 1989.
(Proceedings of the Conference on Data Analysis, Learning Symbolic and Numeric Knowledge, INRIA, Antibes, France, 11-14 Sept. 1989).

[5] PETITJEAN M., SAPORTA G.
On the Performance of Peeling Algorithms.
Appl. Stoch. Models Data Anal. 1992, 8[2], 91-98. DOI 10.1002/asm.3150080204.

[6] PETITJEAN M., SAPORTA G.
The Radius-Diameter Diagram: a Statistical Tool to Evaluate Complex Data in Metric Spaces. Applications to Euclidian Solids and Graphs.
Symbolic-Numeric, Data Analysis and Learning, pp. 57-64,
E. Diday & Y. Lechevallier Eds., Nova Science Publ., New-York 1991.
(Proceedings of the Conference on Symbolic-Numeric, Data Analysis and Learning, INRIA, Versailles, France, 18-20 Sept. 1991).

[7] PETITJEAN M.
Applications of the Radius-Diameter Diagram to the Classification of Topological and Geometrical Shapes of Chemical Compounds.
J. Chem. Inf. Comput. Sci. 1992, 32[4], 331-337. DOI 10.1021/ci00008a012.

[8] ATTIAS R., PETITJEAN M.
Statistical Analysis of Atom Topological Neighborhoods and Multivariate Representations of a Large Chemical File.
J. Chem. Inf. Comput. Sci. 1993, 33[5], 649-656. DOI 10.1021/ci00015a001.

[9] PETITJEAN M.
Sur la caractérisation des domaines inexplorés en Chimie.
Revue Française de Bibliométrie 1990, 8, 16-26.

[10] ATTIAS R.
Substructure Systems and Structural Retrieval Systems.
Encyclopedia of Library and Information Science, Volume 50, Supplement 13, pp. 308-363.
Allan Kent Ed., Marcel Dekker Inc., New-York 1992.

[11] ATTIAS R.
Generative Graphs and Representation by Induction of Orders: the RIO General Model of EURECAS.
J. Chem. Inf. Comput. Sci. 1993, 33[3], 415-420. DOI 10.1021/ci00013a019.

[12] MESLAMANI J.E., ANDRÉ F., PETITJEAN M.
Assessing the Geometric Diversity of Cytochrome P450 Ligand Conformers by Hierarchical Clustering with a Stop Criterion.
J. Chem. Inf. Model. 2009, 49[2], 330-337. DOI 10.1021/ci800275k.

[13] SCHWARTZ A., PETITJEAN M.
[6.6]Chiralane: A Remarkably Symmetric Chiral Molecule.
Symmetry: Culture and Science 2008, 19[4], 307-316.
Download PDF paper from the HAL repository: hal-01941526 (deposited with permission from Symmetrion; copyright rules apply).

[14] BENKAIDALI L., ANDRÉ F., MAOUCHE B., SIREGAR P., BENYETTOU M., MAUREL F., PETITJEAN M.
Computing Cavities, Channels, Pores and Pockets in Proteins from non Spherical Ligands Models.
Bioinformatics 2014, 30[6], 792-800. DOI 10.1093/bioinformatics/btt644.

[15] PETITJEAN M.
About the Algebraic Solutions of Smallest Enclosing Cylinders Problems.
Appl. Alg. Eng. Comm. Comp. 2012, 23[3-4], 151-164. DOI 10.1007/s00200-012-0171-y.
Free version for readers: https://rdcu.be/bzTJy (link created by ; downloads, prints and copies are for subscribers only)

[16] BENKAIDALI L.
Etude et applications de nouveaux modèles géométriques des canaux d'accès au site actif de certains cytochromes P450 humains par des ligands volumineux.
PhD Thesis, University Paris 6, 15 September 2016 (in French; available on the HAL repository: tel-01483643).
Awarded in 2016 by the L'Oréal-UNESCO committee: FWIS (For Women in Science) regional award for the Maghreb.

[17] BENKAIDALI L., ANDRÉ F., MOROY G., TANGOUR B., MAUREL F., PETITJEAN M.
The cytochrome P450 3A4 has three major conformations: new clues to drug recognition by this promiscuous enzyme.
Mol. Inf. 2017, 36[10], 1700044. DOI 10.1002/minf.201700044.

[18] BENKAIDALI L., ANDRÉ F., MOROY G., TANGOUR B., MAUREL F., PETITJEAN M.
Four Major Channels Detected in the Cytochrome P450 3A4: a Step Toward Understanding its Multispecificity.
Int. J. Mol. Sci. 2019, 20[4], 987. DOI 10.3390/ijms20040987.

[19] PETITJEAN M.
A fast algorithm to compute conical pockets in proteins. Application to the structural characterization of gamma-carbonic anhydrases.
Mol. Inf. 2017, 36[10], 1600155. DOI 10.1002/minf.201600155.

[20] PETITJEAN M.
Analytical Algorithms for Ligand Cone Angles Calculations. Application to Triphenylphosphine Palladium Complexes.
Compt. Rend. Chim. 2015, 18[6], 678-684. DOI 10.1016/j.crci.2015.04.004.

[21] CERISIER N., REGAD L., TRIKI D., CAMPROUX A.-C., PETITJEAN M.
Cavity versus ligand shape descriptors: Application to urokinase binding pockets.
J. Comput. Biol. 2017, 24[11], 1134-1137. DOI 10.1089/cmb.2017.0061.

[22] LAVILLE P., MARTIN J., LAUNAY G., REGAD L., CAMPROUX A.-C., DE VRIES S., PETITJEAN M.
A non-parametric method to compute protein-protein and protein-ligands interfaces. Application to HIV-2 protease-inhibitors complexes.
bioRxiv/2018/498923 2018. DOI 10.1101/498923.

[23] OZEEL V., PERRIER A., VANET A., PETITJEAN M.
The Symmetric Difference Distance: A New Way to Evaluate the Evolution of Interfaces along Molecular Dynamics Trajectories; Application to Influenza Hemagglutinin.
Symmetry 2019, 11[5], 662. DOI 10.3390/sym11050662.
