- Computing an Optimal Partition from Weighted Categorical Variables or from an Array of Signed Dissimilarities
- Computing an Optimal Partition from Numerical Variables or from an Array of Dissimilarities or Distances

Author's professional address:

INSERM ERL U1133 (BFA, CNRS UMR 8251), Université Paris 7

35 rue Hélène Brion, 75205 Paris Cedex 13, France.

petitjean.chiral@gmail.com

Formerly (2010-2018): MTi, INSERM UMR-S 973, Université Paris 7.

Formerly (2007-2009): CEA/DSV/iBiTec-S/SB2SM (CNRS URA 2096), Saclay, France

Formerly (1987-2006): ITODYS, CNRS UMR 7086, Université Paris 7.

Other topics:

- The Mathematical Theory of Chirality.
- Symmetry, Chirality, Symmetry Measures and Chirality Measures: General Definitions
- An Asymmetry Coefficient for Multivariate Distributions
- A Simple Dataset to Test and Compare Chirality Measures
- The Random-Disorder Paradox
- Is There a Most Scalene Triangle?
- Mining Data in Graph Databases
- The Parity Phenomenon in Large Chemical Databases
- The Graphic Mendeleev Table
- The Radius Diameter Diagram
- Virtual Screening of 3D Chemical Data Bases
- [6.6]Chiralane: The chemical nomenclature challenge
- Molecular Symmetry and Chirality
- Molecular shape descriptors: the cylindrical model
- Spheres Unions and Intersections
- Analytical Computation of van der Waals Surfaces and Volumes
- Procrustes Methods, Shape Recognition, Similarity and Docking
- Freewares

1. COMPUTING AN OPTIMAL PARTITION FROM WEIGHTED CATEGORICAL VARIABLES OR FROM AN ARRAY OF SIGNED DISSIMILARITIES

There are many methods of classification of

The data are converted into an array of signed similarities before solving the optimization problem. It follows that an optimal partition may be computed from any signed similarity array with the same algorithm.

Depending on the data, the optimal partition may be not unique. The optimal cost of the partition is unique.

The content of this section can be retrieved in my paper in french [2]. The essential of the theory is also available in french in the doctoral thesis of Marcotorchino [1]. An English document is also available [5].

Notations:

- P is the partition matrix of order
**n**, i.e. the element at the intersection of the line i and the column j is P_{ij}=1 when the individuals i and j are in the same class, and P_{ij}=0 when i and j are not in the same class. All the diagonal elements of P are equal to 1 because an individual is in the same class than itself (reflexivity). P is symmetric because, if i and j are in the same class, then j and i are also in the same class. The transitivity conditions are: [P_{ij}=1 and P_{jk}=1] => P_{ik}=1, meaning that when i and j are in the same class and j and k are in the same class, then i and k are in the same class.

The individuals can be permuted such that the partition matrix is block diagonal, the blocks being squares containing only the value 1, each block being associated to a class. The size of a block is the number of individuals in its corresponding class. - S is the square array of order
**n**of signed similarities. The element S_{ij}is assumed to be strongly positive when the individuals i and j are "very similar", and to be strongly negative when the individuals i and j are "very dissimilar".

The signed similarity between i and j being assumed equal to the signed similarity between j and i, and the diagonal elements of P being known, we get a constrained optimization problem with

Anyway, the optimal partition and its number of classes is computed only from the data matrix S.

The optimal partition P may be not unique (e.g. when all elements of S are null): only the optimal cost Z is unique.

The difficulty attached to this classification method is to convert the input data into a signed similarity array, following a non arbitrary way.

Such a conversion is possible when the

In this situation, each signed similarity S

The methods extends to weighted categorical variables.

Marcotorchino noticed in his thesis that the transitivity constraints are linear, i.e., for the individuals i,j,k, the linear combination P

The algorithm described in [2] is a dynamic programming one, formally related to the Faure and Malgrange boolean algorithm, but it is working only as O(

The freeware POP is an implementation of this dynamic programming algorithm, and is downloadable from the software page.

An other version of POP is available in the amap package of the R statistical freeware.

2. COMPUTING AN OPTIMAL PARTITION FROM NUMERICAL VARIABLES OR FROM AN ARRAY OF DISSIMILARITIES OR DISTANCES

There are many methods of classification of

The original method was defined for points with coordinates in the euclidean space. The method has been modified and extended to non-euclidean spaces, so that it operates on an (

The freeware MCG is an implementation of the modified clustering gain algorithm, and is downloadable from the software page.

An application to the generation of 3D chemical data banks for virtual screening has been provided [4].

- MARCOTORCHINO F.

*Agrégation des similarités en classification automatique.*

Thèse de Doctorat d'Etat en Mathématiques,

Université Paris 6, 25 June 1981.

- PETITJEAN M.

*Agrégation des similarités: une solution oubliée.*

RAIRO Oper. Res. 2002,**36**[1], 101-108. DOI 10.1051/ro:2002001.

A free copy deposited with permission from EDP Sciences (see copyright rules), is available from the HAL repository: hal-02123085.

- JUNG Y., PARK H., DU D.-Z., DRAKE B.L.

*A Decision Criterion for the Optimal Number of Clusters in Hierarchical Clustering.*

J. Glob. Opt. 2003,**25**[1], 91-111.

- MESLAMANI J.E., F. ANDRÉ F., PETITJEAN M.

*Assessing the Geometric Diversity of Cytochrome P450 Ligand Conformers by Hierarchical Clustering with a Stop Criterion.*

J. Chem. Inf. Model. 2009,**49**[2], 330-337. DOI 10.1021/ci800275k.

- PETITJEAN M.

*Optimal Partitioning and Clustering: How Many Classes ?*

Invited seminar, ABI (Atelier de Bioinformatique), Université Paris 6, 12 May 2010.

Download PDF file of the lecture from the HAL repository: hal-02124947.