THE CHEMINFORMATICS PAGE

  1. Mining Data in Graph Databases
  2. The Parity Phenomenon in Large Chemical Databases
  3. The Graphic Mendeleev Table
  4. The Radius Diameter Diagram
  5. 2D Chemical Databases: Miscellaneous
  6. Virtual Screening of 3D Chemical Databases
  7. Molecular Symmetry and Chirality
    [6.6]Chiralane: The Chemical Nomenclature Challenge
  8. Molecular Shape Descriptors: the Cylindrical Model


© Michel Petitjean, 2014
Author's professional address:
MTi, INSERM UMR-S 973, Université Paris 7
35 rue Hélène Brion, 75205 Paris Cedex 13, France.
petitjean.chiral@gmail.com
Formerly: CEA/DSV/iBiTec-S/SB2SM (CNRS URA 2096), Saclay, France
(and formerly: ITODYS, UMR 7086, CNRS, Université Paris 7).






1. MINING DATA IN GRAPH DATABASES


Databanks contain various kinds of data: numerical, categorical, ordinal, text, and so on. Most kinds of data can be handled by the numerous available statistical and data mining tools. Nevertheless, there is a lack of tools to handle populations of graphs. Consider, for example, a population of thousands of graphs. How can it be "summarized", numerically or graphically? For real values, we know many descriptors: mean, median, standard deviation, quantiles, etc. Populations of vectors are also easily summarized with the mean, the variance matrix, convex hulls, etc. But what about graphs? How can a "mean" graph be computed? What are the "extreme" graphs? How are graphs "dispersed" (diversity)? How can two populations of graphs be quickly compared? And, much more difficult: how can clusters be detected in a space of graphs?

This problem has appeared in chemistry, because the structural formula of a molecule is most often represented by a colored graph in chemical databanks. For example, the water molecule H2O is a graph containing three nodes and two edges. The graph associated with the formula of a chemical compound is undirected, simple, and not necessarily connected, and both nodes and edges carry non-numerical values (colors). There are chemical databanks containing millions of graphs, such as Chemical Abstracts and Beilstein.
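
As an illustration of the colored-graph representation described above, here is a minimal Python sketch. The dictionary layout and the names (water_nodes, water_edges, connected_components) are assumptions made for this example and do not correspond to any databank format. Representing a salt as two isolated nodes is also only an illustrative convention.

```python
# Hypothetical in-memory layout, for illustration only:
# nodes map an index to an element symbol (node color),
# edges map an unordered pair of indices to a bond label (edge color).
water_nodes = {0: "O", 1: "H", 2: "H"}
water_edges = {frozenset({0, 1}): "single", frozenset({0, 2}): "single"}

# A graph with two components and no edge (illustrative salt-like record).
salt_nodes = {0: "Na", 1: "Cl"}
salt_edges = {}


def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as sets of node ids."""
    adjacency = {n: set() for n in nodes}
    for edge in edges:
        a, b = tuple(edge)
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            n = stack.pop()
            if n in component:
                continue
            component.add(n)
            stack.extend(adjacency[n] - component)
        seen |= component
        components.append(component)
    return components
```

With these definitions, the water graph has three nodes, two edges and a single component, while the salt-like graph has two components.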

Some of the exploratory statistical tools used on large chemical databases are presented below. They can be used as parts of data mining tools for other graph databases. Detailed methods and extensive results are available in the cited papers.




2. THE PARITY PHENOMENON IN LARGE CHEMICAL DATABASES


Plotting the number of molecules having a given number of carbon atoms shows that even values are preferred to odd values in large chemical files. This parity phenomenon was discovered in 1990 [1,2]. It does not mean that most chemists around the world have cooperated in order to obtain this result. A semi-quantitative explanation of the phenomenon has been proposed [2]. It is based on graph theory and on the statistical occurrences of the elements of the periodic table.

Posted with permission from [2].
© 1990 American Chemical Society.
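
The tally behind such a plot can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the input is assumed to be a list of plain molecular formula strings, and the six formulas in the example are toy values, not data from [1,2].

```python
import re
from collections import Counter

# A capital C not followed by a lowercase letter is a carbon atom
# (this excludes Cl, Ca, Cu, ...); the optional digits are its multiplicity.
CARBON = re.compile(r"C(?![a-z])(\d*)")


def carbon_count(formula):
    """Number of carbon atoms in a plain molecular formula such as 'C27H46O'."""
    return sum(int(digits) if digits else 1 for digits in CARBON.findall(formula))


def parity_tally(formulas):
    """Count the compounds with an even and with an odd number of carbon atoms."""
    tally = Counter()
    for formula in formulas:
        n = carbon_count(formula)
        if n > 0:  # compounds without carbon are ignored
            tally["even" if n % 2 == 0 else "odd"] += 1
    return tally


# Toy input, for illustration only.
toy_formulas = ["CH4", "C2H6O", "C6H6", "CCl4", "C27H46O", "NaCl"]
toy_tally = parity_tally(toy_formulas)
```

On a real file, the same loop would fill a histogram indexed by the carbon count itself rather than by its parity.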





3. THE GRAPHIC MENDELEEV TABLE


The distribution of the chemical elements (atoms) in a chemical database can be visualized graphically. The periodic table of the elements is viewed as a rectangular array, and each cell of this array contains the number of occurrences of its associated atom. The array is submitted to a factorial correspondence analysis. Thus, each atom of the Mendeleev table gets coordinates in a 6-dimensional space: this is the "Graphic Mendeleev Table" [1,3]. In order to visualize the Graphic Mendeleev Table, the successive convex hulls of the points are drawn in the principal planes [4,5].
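
The peeling into successive convex hulls can be sketched as follows for points in one plane. This is a generic illustration, not the algorithms evaluated in [4,5]: it uses Andrew's monotone chain for each hull, the function names are hypothetical, and the coordinates are made up.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Vertices of the convex hull (Andrew's monotone chain), counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def peel(points):
    """Successive convex hulls: remove the hull vertices, then repeat.

    Points lying on a hull edge without being vertices are left for a later layer.
    """
    remaining = set(points)
    layers = []
    while remaining:
        hull = convex_hull(list(remaining))
        layers.append(hull)
        remaining -= set(hull)
    return layers


# Made-up coordinates: an outer square, an inner square and a central point.
toy_points = [(0, 0), (4, 0), (4, 4), (0, 4),
              (1, 1), (3, 1), (3, 3), (1, 3), (2, 2)]
toy_layers = peel(toy_points)
```

For these toy points, the peeling yields three layers: the outer square, the inner square and the central point.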






4. THE RADIUS DIAMETER DIAGRAM


There are few statistical tools to perform exploratory data analysis in graph databases. The basic observation is the graph, but it may also be a connected component, or simply a node, an edge, a cycle, a path, a concentric layer, etc. [2]. Many univariate and multivariate distributions may be generated from these populations. One of the most useful distributions is the number of connected components having a given radius R and a given diameter D. Recall that the radius is the minimum of the eccentricities of the nodes of a connected component, and the diameter is the maximum, the eccentricity of a node being the maximum of the distances from the node to all the nodes of the connected component. It is known that D takes values between R and 2R, so that the bivariate distribution in the (R,D) plane lies in an angular sector bounded by the two lines D=R and D=2R. Displaying the clusters in this bivariate distribution offers a schematic graphical summary of the population, which is called the Radius Diameter Diagram [6,7].
The quantity I=(D-R)/R takes values in [0,1]. It is used as a shape index, and its distribution can be plotted (see example in [7]).

Posted with permission from [7].
© 1992 American Chemical Society.
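
The definitions of R, D and I given above translate directly into a breadth-first search. The following Python sketch assumes a connected graph given as an adjacency dictionary; the function names and the two small test graphs are illustrative assumptions.

```python
from collections import deque


def eccentricities(adjacency):
    """Eccentricity of every node of a connected graph, by breadth-first search."""
    result = {}
    for source in adjacency:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        result[source] = max(dist.values())
    return result


def radius_diameter_index(adjacency):
    """Return (R, D, I) with I = (D - R) / R, or I = None for a single node (R = 0)."""
    ecc = eccentricities(adjacency)
    r, d = min(ecc.values()), max(ecc.values())
    index = (d - r) / r if r > 0 else None
    return r, d, index


# A path of five nodes and a cycle of six nodes, as adjacency dictionaries.
path5 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
cycle6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
```

The path reaches the upper line D=2R (R=2, D=4, I=1), while the cycle lies on the lower line D=R (R=3, D=3, I=0).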





5. 2D CHEMICAL DATABASES: MISCELLANEOUS


As mentioned above, there are many distributions of interest in graph databases. Most of them have been computed from a large chemical database containing more than three million chemical compounds [1,2,7,8]. The shape of some of these distributions has not yet been interpreted, such as that of the concentric layers in the plane: number of nodes per layer and depth of the layer [8].

These statistical tools are part of technology watch tools, which help to detect unexplored areas [9].

Finally, it should be mentioned that searching for a partial subgraph in a graph database is performed in chemistry with a highly efficient tool called the DARC system. It allows fast retrieval of compounds in very large chemical databases [10,11]. It works on Markush formulas, which are commonly encountered in reaction databases and patent databases (see the Questel website).

Software:
The graph automorphisms associated with the components of a chemical graph can be enumerated with the QCM freeware. QCM is devoted to the computation of the chiral index, but it may also be used to enumerate graph automorphisms.
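
To make the notion concrete, the automorphisms of a very small colored graph can be enumerated by brute force. The sketch below is not the method used by QCM: it simply tests every permutation of the nodes, so it is only usable on a handful of atoms. The function name and the input layout are assumptions made for this example.

```python
from itertools import permutations


def automorphisms(colors, edges):
    """Brute-force enumeration of the automorphisms of a small colored graph.

    colors: list of node colors (element symbols), indexed by node.
    edges:  dict mapping frozenset({i, j}) to an edge color (bond label).
    A permutation is kept when it preserves node colors and maps the
    colored edge set onto itself.
    """
    n = len(colors)
    found = []
    for perm in permutations(range(n)):
        if any(colors[perm[i]] != colors[i] for i in range(n)):
            continue
        mapped = {frozenset(perm[i] for i in edge): color
                  for edge, color in edges.items()}
        if mapped == edges:
            found.append(perm)
    return found


# Water: the two hydrogens can be exchanged.
water = (["O", "H", "H"],
         {frozenset({0, 1}): "single", frozenset({0, 2}): "single"})

# Methane: the four hydrogens can be permuted freely.
methane = (["C", "H", "H", "H", "H"],
           {frozenset({0, k}): "single" for k in range(1, 5)})
```

The water graph has 2 automorphisms and the methane graph has 24. The cost grows factorially with the number of nodes, which is why dedicated algorithms are needed for real molecules.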




6. VIRTUAL SCREENING OF 3D CHEMICAL DATABASES


The virtual screening of a chemical database can be based on the geometries of the molecules. It is thus better to store the geometry of each molecule in the database rather than to recompute it at each screening. Most molecules are flexible and have an infinite number of low-energy conformations. It is therefore necessary to store an optimal number of conformers, rather than storing only one conformer or systematically storing, say, 50 or 100 conformers per molecule. It is difficult to generate an optimal set of conformers for a flexible molecule directly. This is why a two-step procedure has been built [12]: (a) an excess of conformers is generated algorithmically in order to cover the conformational space sufficiently, and (b) the conformational redundancy is reduced with a clustering algorithm computing an optimal partition of the initial set of conformers. It should be pointed out that few clustering algorithms are able to compute the number of classes ONLY from the input data. Step (a) of the procedure may be performed with most usual conformer generators. Step (b) of the procedure is implemented in the DIVCF freeware.
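
The idea of step (b), a clustering whose number of classes is derived from the data alone, can be illustrated with a generic sketch. It is not the stop criterion of [12] nor the DIVCF algorithm: it uses single-linkage agglomerative clustering and cuts the hierarchy at the largest jump between successive merge distances, which is a simple heuristic chosen for this example. The distance matrix is assumed to be precomputed (for conformers it could hold pairwise RMSD values); the one-dimensional values below are made up.

```python
def single_linkage_merges(dist):
    """Single-linkage agglomeration; returns a list of (merge distance, partition)."""
    clusters = [frozenset([i]) for i in range(len(dist))]
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        history.append((d, list(clusters)))
    return history


def cut_at_largest_gap(dist):
    """Partition obtained just before the largest jump in merge distances.

    Illustrative heuristic only. With fewer than two merges no gap exists,
    and the last available partition is returned.
    """
    singletons = [frozenset([i]) for i in range(len(dist))]
    history = single_linkage_merges(dist)
    heights = [h for h, _ in history]
    states = [singletons] + [partition for _, partition in history]
    if len(heights) < 2:
        return states[-1]
    gaps = [heights[k + 1] - heights[k] for k in range(len(heights) - 1)]
    k = max(range(len(gaps)), key=gaps.__getitem__)
    return states[k + 1]


# Made-up one-dimensional descriptor values standing in for six conformers.
toy_values = [0.0, 1.0, 2.0, 50.0, 51.0, 100.0]
toy_dist = [[abs(a - b) for b in toy_values] for a in toy_values]
toy_partition = cut_at_largest_gap(toy_dist)
```

On the toy values, three classes are found without any user-supplied threshold: {0, 1, 2}, {3, 4} and {5}. One representative per class would then be kept in the database.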




7. MOLECULAR SYMMETRY AND CHIRALITY
[6.6]CHIRALANE: THE CHEMICAL NOMENCLATURE CHALLENGE


[6.6]Chiralane (C27H28) is a polycycloalkane designed in 2004 by A. Schwartz [13].
Its molecular graph is non-planar and has 768 automorphisms.
Its hydrogen-suppressed graph has 12 automorphisms.
[6.6]Chiralane is both symmetric and chiral.
There are three orthogonal axes of order 2 and four axes of order 3.
The central atom is an undistorted tetrahedral sp3-hybridized carbon bearing four rigorously identical homochiral substituents.
[6.6]Chiralane has all the direct rotational symmetries of the regular tetrahedron, but none of its mirror symmetries or higher order improper symmetries.

The chiral index of [6.6]chiralane is 0.9824 and the chiral index of its skeleton is 1.0000: it is the maximal value allowed for the chiral index.
The predicted optical rotation of [6.6]chiralane is [α]D=692 degrees (molar rotation 244000 degrees) [13].

The nomenclature challenge:
Until now, nomenclature assignment programmes have failed with [6.6]chiralane, a polycycloalkane smaller than cholesterol (C27H28 and C27H46O, respectively).
Cheminformaticians willing to run their own nomenclature assignment programme with [6.6]chiralane are welcome to download its connection table and email me the result when successful.

More about [6.6]chiralane can be found in [13].

Download 3D connection table: mol2 format, hin format.
H-suppressed 3D connection table: mol2 format, hin format.





8. MOLECULAR SHAPE DESCRIPTORS: THE CYLINDRICAL MODEL


It is usual to describe the shape of a molecule with a sphere. Calculating the four parameters defining a sphere (center, radius) is easy (most often it is the smallest sphere enclosing the atomic centers), and handling spheres in shape analysis calculations is generally simple. But it appears that the spherical shape model is too rough and that a cylindrical model is more accurate [14]. The latter is also more suitable for computing protein cavities, channels and pockets (see [14] for a review).
Seven parameters are needed to define a straight cylinder: two shape parameters (radius and height) and five location parameters (center and axis direction). There is no unique way to define a cylindrical molecular shape. The two simplest ways to operate are to compute the smallest height enclosing cylinder or the smallest radius enclosing cylinder. Although there is a simple algorithm to compute the smallest height enclosing cylinder, computing the smallest radius enclosing cylinder is more difficult [15]. The latter computation is done by the CYL freeware, and both computations are done by the RADI freeware, which is devoted to the computation of many molecular shape descriptors.
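
The seven parameters can be made concrete with a small Python sketch. It does not solve either of the smallest enclosing cylinder problems of [15], and it is not the CYL or RADI algorithm: it only computes an enclosing cylinder when the axis direction is imposed and the axis is assumed to pass through the centroid of the points. The function name and the coordinates are illustrative assumptions.

```python
import math


def enclosing_cylinder(points, axis):
    """Enclosing cylinder of 3D points for an imposed axis direction.

    Assumption: the axis passes through the centroid of the points, so the
    result encloses the points but is not a smallest enclosing cylinder.
    Returns the center (3 values), the unit axis direction (2 degrees of
    freedom), the radius and the height: seven parameters in total.
    """
    norm = math.sqrt(sum(a * a for a in axis))
    u = tuple(a / norm for a in axis)
    n = len(points)
    centroid = tuple(sum(p[k] for p in points) / n for k in range(3))
    along, radial = [], []
    for p in points:
        v = tuple(p[k] - centroid[k] for k in range(3))
        t = sum(v[k] * u[k] for k in range(3))           # coordinate along the axis
        perp_sq = sum(v[k] ** 2 for k in range(3)) - t * t
        along.append(t)
        radial.append(math.sqrt(max(perp_sq, 0.0)))      # distance to the axis
    t_min, t_max = min(along), max(along)
    mid = (t_min + t_max) / 2
    center = tuple(centroid[k] + mid * u[k] for k in range(3))
    return {"center": center, "axis": u,
            "radius": max(radial), "height": t_max - t_min}


# Made-up atomic centers, with the axis imposed along z.
toy_atoms = [(0.0, 0.0, 0.0), (0.0, 0.0, 4.0), (1.0, 0.0, 2.0), (-1.0, 0.0, 2.0)]
toy_cylinder = enclosing_cylinder(toy_atoms, (0.0, 0.0, 1.0))
```

For the toy coordinates, the cylinder has radius 1 and height 4 and is centered at (0, 0, 2). Finding the axis direction that minimizes the height or the radius is the actual optimization problem discussed above.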




REFERENCES


  1. PETITJEAN M., DUBOIS J.E.
    Statistical Study of a large Structural File Based on the Mendeleev Table.
    Collec. Czech. Chem. Comm. 1990,55[6],1404-1415.
    Download scanned paper in PDF (posted with permission from the Publisher)

  2. PETITJEAN M., DUBOIS J.E.
    Topological Statistics on a Large Structural File.
    J. Chem. Inf. Comput. Sci. 1990,30[3],332-343.
    (DOI 10.1021/ci00067a019)

  3. PETITJEAN M., SAPORTA G.
    A Possible Comparison between Large Chemical Databases: Generation of the Graphic Mendeleiev Table with Correspondence Analysis.
    Bulletin de l'Institut International de Statistique, Communications libres, livraison 2, pp. 196-197,
    47th session of the I.I.S., Paris, 29 Aug. - 6 Sept. 1989.

  4. PETITJEAN M., SAPORTA G.
    Peeling a Large Set of Individuals in the Factorial Space coming from Correspondence Analysis.
    Data Analysis, Learning Symbolic and Numeric Knowledge, pp. 321-325,
    E. Diday Ed., Nova Science Publ., New-York 1989.
    (Proceedings of the Conference on Data Analysis, Learning Symbolic and Numeric Knowledge, INRIA, Antibes, France, 11-14 Sept. 1989).

  5. PETITJEAN M., SAPORTA G.
    On the Performance of Peeling Algorithms.
    Appl. Stoch. Models Data Anal. 1992,8[2],91-98.
    (DOI 10.1002/asm.3150080204)

  6. PETITJEAN M., SAPORTA G.
    The Radius-Diameter Diagram: a Statistical Tool to Evaluate Complex Data in Metric Spaces. Applications to Euclidian Solids and Graphs.
    Symbolic-Numeric, Data Analysis and Learning, pp. 57-64,
    E. Diday & Y. Lechevallier Eds., Nova Science Publ., New-York 1991.
    (Proceedings of the Conference on Symbolic-Numeric, Data Analysis and Learning, INRIA, Versailles, France, 18-20 Sept. 1991).

  7. PETITJEAN M.
    Applications of the Radius-Diameter Diagram to the Classification of Topological and Geometrical Shapes of Chemical Compounds.
    J. Chem. Inf. Comput. Sci. 1992,32[4],331-337.
    (DOI 10.1021/ci00008a012)

  8. ATTIAS R., PETITJEAN M.
    Statistical Analysis of Atom Topological Neighborhoods and Multivariate Representations of a Large Chemical File.
    J. Chem. Inf. Comput. Sci. 1993,33[5],649-656.
    (DOI 10.1021/ci00015a001)

  9. PETITJEAN M.
    Sur la caractérisation des domaines inexplorés en Chimie.
    Revue Française de Bibliométrie, 1990,8,16-26.

  10. ATTIAS R.
    Substructure Systems and Structural Retrieval Systems.
    Encyclopedia of Library and Information Science, Volume 50, Supplement 13, pp 308-363.
    Allan Kent Ed., Marcel Dekker Inc., New-York 1992.

  11. ATTIAS R.
    Generative Graphs and Representation by Induction of Orders: the RIO General Model of EURECAS.
    J. Chem. Inf. Comput. Sci. 1993,33[3],415-420.
    (DOI 10.1021/ci00013a019)

  12. MESLAMANI J.E., ANDRÉ F., PETITJEAN M.
    Assessing the Geometric Diversity of Cytochrome P450 Ligand Conformers by Hierarchical Clustering with a Stop Criterion.
    J. Chem. Inf. Model. 2009,49[2],330-337.
    (DOI 10.1021/ci800275k)

  13. SCHWARTZ A., PETITJEAN M.
    [6.6]Chiralane: A Remarkably Symmetric Chiral Molecule.
    Symmetry: Culture and Science 2008,19[4],307-316.
    Download PDF paper (posted with permission from Symmetrion)

  14. BENKAIDALI L., ANDRÉ F., MAOUCHE B., SIREGAR P., BENYETTOU M., MAUREL F., PETITJEAN M.
    Computing Cavities, Channels, Pores and Pockets in Proteins from non Spherical Ligands Models.
    Bioinformatics 2014,30[6],792-800.
    (DOI 10.1093/bioinformatics/btt644)

  15. PETITJEAN M.
    About the Algebraic Solutions of Smallest Enclosing Cylinders Problems.
    Appl. Alg. Eng. Comm. Comp. 2012,23[3-4],151-164.
    (DOI 10.1007/s00200-012-0171-y)