- Computing an Optimal Partition from Weighted Categorical Variables or from an Array of Signed Dissimilarities
- Computing an Optimal Partition from Numerical Variables or from an Array of Dissimilarities or Distances
© Michel Petitjean, 2016
Author's professional address:
MTi, INSERM UMR-S 973, Université Paris 7
35 rue Hélène Brion, 75205 Paris Cedex 13, France.
Formerly: CEA/DSV/iBiTec-S/SB2SM (CNRS URA 2096), Saclay, France
(and formerly: ITODYS, UMR 7086, CNRS, Université Paris 7).
1. COMPUTING AN OPTIMAL PARTITION FROM WEIGHTED CATEGORICAL VARIABLES OR FROM AN ARRAY OF SIGNED DISSIMILARITIES
There are many methods for classifying n individuals, but few of them
are able to compute the number of classes K from the data alone.
In 1981, François Marcotorchino published in his doctoral thesis
 a method computing the optimal partition
of n individuals described by p categorical variables.
A major advantage of the method is that the number K of classes is computed
from the data, and ONLY from the data.
The data are converted into an array of signed similarities before solving
the optimization problem. It follows that an optimal partition may be computed
from any signed similarity array with the same algorithm.
Depending on the data, the optimal partition may not be unique.
The optimal cost of the partition is unique.
The content of this section can be found in my paper
in French (Petitjean, RAIRO Oper. Res. 2002). The essentials of the theory are also
available in French in the doctoral thesis of Marcotorchino (1981).
An English document is also available.
Ideally, the best partition should be such that Pij=1 when Sij
is positive, and Pij=0 when Sij is negative.
Unfortunately, such a matrix may not satisfy all the transitivity constraints,
and thus is not, in general, a partition matrix. The optimal partition
is therefore defined as one maximizing the sum Z of all the quantities
PijSij, taken over all n² couples (i,j).
- P is the partition matrix of order n, i.e. the element at the intersection
of row i and column j is Pij=1 when the individuals i and j
are in the same class, and Pij=0 when they are not.
All the diagonal elements of P are equal to 1 because an individual is in the same
class as itself (reflexivity). P is symmetric because, if i and j are in the same
class, then j and i are also in the same class. The transitivity conditions are:
[Pij=1 and Pjk=1] => Pik=1, meaning that
when i and j are in the same class and j and k are
in the same class, then i and k are in the same class.
The individuals can be permuted so that the partition matrix is block diagonal,
the blocks being square blocks containing only the value 1, each block associated with
a class. The size of a block is the number of individuals in its corresponding class.
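The properties above (reflexivity, symmetry, transitivity, and the block-diagonal form under permutation) can be sketched in plain Python; the class labels below are illustrative, not taken from the text:

```python
# Build the n x n partition matrix described above: Pij = 1 iff the
# individuals i and j share a class (illustrative labels).

def partition_matrix(labels):
    """Partition matrix of order n from a list of class labels."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]

def is_partition_matrix(P):
    """Check reflexivity, symmetry and transitivity of a boolean matrix."""
    n = len(P)
    if any(P[i][i] != 1 for i in range(n)):
        return False                      # reflexivity
    if any(P[i][j] != P[j][i] for i in range(n) for j in range(n)):
        return False                      # symmetry
    return all(P[i][k] == 1               # transitivity
               for i in range(n) for j in range(n) for k in range(n)
               if P[i][j] == 1 and P[j][k] == 1)

labels = ["a", "b", "a", "b", "c"]
P = partition_matrix(labels)
print(is_partition_matrix(P))   # True

# Permuting the individuals by class label makes P block diagonal:
order = sorted(range(len(labels)), key=lambda i: labels[i])
B = [[P[i][j] for j in order] for i in order]
for row in B:
    print(row)
```

Each printed block of 1s corresponds to one class, its size being the class cardinality.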
- S is the square array of order n of signed similarities. The element
Sij is assumed to be strongly positive when the individuals i and j
are "very similar", and to be strongly negative when the individuals i and j
are "very dissimilar".
The signed similarity between i and j being assumed equal to the signed similarity
between j and i, and the diagonal elements of P being known, we get a constrained
optimization problem with n(n-1)/2 unknown boolean values. Note that when
S is not symmetric, it is equivalent to work with the symmetric matrix
(S+S')/2, where S' is the transpose of S.
In any case, the optimal partition and its number of classes are computed only from
the data matrix S.
The optimal partition P may not be unique (e.g. when all elements of S are zero):
only the optimal cost Z is unique.
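As a sanity check of the definition above, here is a brute-force sketch, feasible only for small n (the signed similarity matrix S is illustrative): enumerate every partition of {0,...,n-1} and keep one maximizing Z.

```python
# Enumerate all set partitions of a small set, compute
# Z = sum of Pij*Sij over all n^2 couples (i,j), keep the maximum.

def all_partitions(items):
    """Yield every set partition of `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in all_partitions(rest):
        for k in range(len(smaller)):
            yield smaller[:k] + [[first] + smaller[k]] + smaller[k + 1:]
        yield [[first]] + smaller

def cost(blocks, S):
    """Z = sum of Sij over all couples (i,j) classified together."""
    return sum(S[i][j] for b in blocks for i in b for j in b)

# illustrative signed similarities: {0,1} and {2,3} are "very similar"
S = [[ 0,  3, -2, -2],
     [ 3,  0, -2, -2],
     [-2, -2,  0,  4],
     [-2, -2,  4,  0]]

best = max(all_partitions(list(range(4))), key=lambda b: cost(b, S))
print(sorted(map(sorted, best)), cost(best, S))   # [[0, 1], [2, 3]] 14
```

Note that the number K of classes (here 2) is a by-product of the optimization, not an input.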
The difficulty attached to this classification method is to convert the
input data into a signed similarity array in a non-arbitrary way.
Such a conversion is possible when the n individuals are described by p
categorical variables. Note that each categorical variable defines a classification,
but the p classifications are in general quite different. It means that
we are looking for an optimal partition as close as possible to the p
input partitions P1, P2, ..., Pp.
A solution could be to consider the mean partition matrix
Q = (P1+P2+...+Pp)/p, but unfortunately Q is not, in general, a partition matrix.
Thus, we look for the optimal partition P which minimizes the distance between
the mean partition matrix Q and P, in the sense of the Schur norm of P-Q.
It is recalled that the Schur norm generalizes the
ordinary Euclidean norm from vectors to rectangular matrices.
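The equivalence between the two formulations can be checked numerically. Since Pij is boolean, ||P-Q||² = sum(Pij - 2·Pij·Qij) + sum(Qij²), and with Sij = p·(2·Qij - 1) this gives ||P-Q||² = sum(Qij²) - Z/p: minimizing the squared Schur distance to Q maximizes Z. A quick check on illustrative random data:

```python
# Verify numerically that ||P - Q||^2 + Z/p is the same constant
# (namely sum of Qij^2) for every partition P, on illustrative data.
import random

random.seed(1)
n, p = 6, 3
# p illustrative categorical variables observed on n individuals
data = [[random.choice("xyz") for _ in range(n)] for _ in range(p)]

# Q = mean of the p partition matrices; S = p*(2*Q - 1)
Q = [[sum(v[i] == v[j] for v in data) / p for j in range(n)] for i in range(n)]
S = [[p * (2 * Q[i][j] - 1) for j in range(n)] for i in range(n)]

def check(labels):
    """Squared Schur distance to Q, plus Z/p, for the partition `labels`."""
    P = [[int(labels[i] == labels[j]) for j in range(n)] for i in range(n)]
    dist2 = sum((P[i][j] - Q[i][j]) ** 2 for i in range(n) for j in range(n))
    Z = sum(P[i][j] * S[i][j] for i in range(n) for j in range(n))
    return dist2 + Z / p          # should not depend on P

const = sum(Q[i][j] ** 2 for i in range(n) for j in range(n))
for _ in range(5):
    labels = [random.randrange(3) for _ in range(n)]
    assert abs(check(labels) - const) < 1e-9
print("squared Schur distance + Z/p is constant:", round(const, 3))
```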
In this situation, each signed similarity Sij is the number of categorical variables
such that i and j are classified together, minus the number of categorical variables
such that i and j are not classified together. The maximization of Z is thus the
maximization of the number of agreements minus the number of disagreements.
For each couple (i,j), the number of variables such that i and j are together
plus the number of variables such that i and j are not together is equal to p.
Thus, the partition maximizing Z maximizes the number of agreements.
The method extends to weighted categorical variables.
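The conversion described above can be sketched as follows (illustrative data; unit weights recover the agreements-minus-disagreements count, since agreements plus disagreements equal p):

```python
# Build the signed similarity array from weighted categorical variables:
# Sij = sum over the variables v of w[v] if v classifies i and j together,
# and -w[v] otherwise. With unit weights, Sij = agreements - disagreements
# = 2*aij - p, where aij is the number of agreements.

def signed_similarity(data, weights=None):
    """data[v][i] = category of individual i for categorical variable v."""
    p, n = len(data), len(data[0])
    w = weights if weights is not None else [1.0] * p
    return [[sum(w[v] * (1 if data[v][i] == data[v][j] else -1)
                 for v in range(p))
             for j in range(n)] for i in range(n)]

# three illustrative categorical variables on four individuals
data = [["r", "r", "g", "g"],
        ["r", "r", "g", "b"],
        ["r", "g", "g", "g"]]
S = signed_similarity(data)
print(S[0][1])   # 2 agreements, 1 disagreement -> 1
print(S[0][2])   # 0 agreements, 3 disagreements -> -3
```

The resulting S feeds directly into the optimization problem of the previous paragraphs.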
Numerical Solution of the Optimization Problem
Marcotorchino noticed in his thesis that the transitivity constraints are linear,
i.e., for the individuals i,j,k, the linear combination
Pij+Pjk-Pik should be less than or equal to 1.
It follows that an optimal partition may be computed by linear programming.
The algorithm described in my 2002 paper is a dynamic programming one,
formally related to the boolean algorithm of Faure and Malgrange, but working
in only O(n²) space. It can either compute one optimal partition
or enumerate all the optimal partitions, whereas the linear programming approach
is rather devoted to the computation of only one optimal partition.
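The linear formulation can be made concrete: one boolean unknown per pair i<j, the three linear transitivity constraints per triple, and a linear objective. A real implementation would hand these constraints to an LP/ILP solver (POP itself uses dynamic programming); the sketch below simply enumerates the feasible boolean vectors, which is only viable at toy size.

```python
# Linear formulation of the partitioning problem: variables x[(i,j)] for
# i<j, constraints Pij + Pjk - Pik <= 1 for all triples, objective
# sum of 2*Sij*x[(i,j)] (each pair counted twice, as in the n^2 couples;
# the diagonal terms of Z are fixed). Exhaustive search, toy size only.
from itertools import combinations, product

S = [[ 0,  3, -2, -2],     # illustrative signed similarities
     [ 3,  0, -2, -2],
     [-2, -2,  0,  4],
     [-2, -2,  4,  0]]
n = 4
pairs = list(combinations(range(n), 2))

def feasible(x):
    """Check the linear transitivity constraints for all distinct triples."""
    g = lambda a, b: x[(min(a, b), max(a, b))]
    return all(g(i, j) + g(j, k) - g(i, k) <= 1
               for i, j, k in product(range(n), repeat=3)
               if len({i, j, k}) == 3)

best, best_z = None, None
for bits in product((0, 1), repeat=len(pairs)):
    x = dict(zip(pairs, bits))
    if not feasible(x):
        continue
    z = sum(2 * S[i][j] * x[(i, j)] for i, j in pairs)
    if best_z is None or z > best_z:
        best, best_z = x, z
print(best_z)                                      # 14
print(sorted(pr for pr in pairs if best[pr] == 1))  # [(0, 1), (2, 3)]
```

Only vectors satisfying all the transitivity constraints correspond to partition matrices, which is why the constraints cannot simply be dropped.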
The freeware POP is an implementation of this dynamic programming algorithm,
and is available for download.
Another version of POP is available for the R statistical freeware.
2. COMPUTING AN OPTIMAL PARTITION FROM NUMERICAL VARIABLES OR FROM AN ARRAY OF DISSIMILARITIES OR DISTANCES
There are many methods for the classification of n individuals described by p numerical variables
or by an (n,n) array of dissimilarities or distances, but computing the number
of classes ONLY from the data has long been a challenge.
One approach to this problem consists in defining a stop criterion for the hierarchical clustering.
In 2003, Jung, Park, Du and Drake defined the clustering gain, which takes the value zero when there
are either n clusters or one unique cluster, and takes non-negative values at each step of the ascending hierarchical clustering.
It follows that the clustering gain is ensured to reach a maximum, thus providing an optimal number of classes.
The original method was defined for points with coordinates in a Euclidean space.
The method has been modified and extended to non-Euclidean spaces,
so that it operates on an (n,n) array of dissimilarities or non-Euclidean distances.
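The clustering-gain formula of Jung et al. is not reproduced here; as an illustration of the stop-criterion scheme (evaluate a criterion at every level of the ascending hierarchy and keep the level maximizing it), the sketch below runs a naive average-linkage agglomeration on a dissimilarity array with a simple stand-in criterion, mean between-cluster minus mean within-cluster dissimilarity. Both the data and the criterion are illustrative, not the actual clustering gain.

```python
# Ascending hierarchical clustering on an (n,n) dissimilarity array,
# keeping the level that maximizes a stand-in stop criterion.

def mean_link(D, a, b):
    """Average dissimilarity between two clusters (lists of indices)."""
    return sum(D[i][j] for i in a for j in b) / (len(a) * len(b))

def criterion(D, clusters):
    """Stand-in criterion: between-cluster minus within-cluster mean."""
    within = [D[i][j] for c in clusters for i in c for j in c if i < j]
    between = [D[i][j] for a in range(len(clusters)) for b in range(a)
               for i in clusters[a] for j in clusters[b]]
    w = sum(within) / len(within) if within else 0.0
    bt = sum(between) / len(between) if between else 0.0
    return bt - w

def cluster_with_stop(D):
    clusters = [[i] for i in range(len(D))]
    best, best_val = list(clusters), criterion(D, clusters)
    while len(clusters) > 1:
        # merge the two closest clusters (average linkage)
        a, b = min(((a, b) for a in range(len(clusters)) for b in range(a)),
                   key=lambda ab: mean_link(D, clusters[ab[0]], clusters[ab[1]]))
        clusters = ([c for k, c in enumerate(clusters) if k not in (a, b)]
                    + [clusters[a] + clusters[b]])
        val = criterion(D, clusters)
        if val > best_val:
            best, best_val = [list(c) for c in clusters], val
    return best

# two well-separated groups: small within, large between dissimilarities
D = [[0, 1, 9, 9],
     [1, 0, 9, 9],
     [9, 9, 0, 2],
     [9, 9, 2, 0]]
print(sorted(map(sorted, cluster_with_stop(D))))   # [[0, 1], [2, 3]]
```

As in the clustering-gain method, the number of classes is an output of the criterion, not a user-supplied parameter.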
See also the PDF of my lecture at the ABI (2010).
The freeware MCG is an implementation of the modified clustering gain algorithm, and is available for download.
An application to the generation of 3D chemical data banks for
virtual screening has been provided.
- MARCOTORCHINO F.
Agrégation des similarités en classification automatique.
Thèse de Doctorat d'Etat en Mathématiques,
Université Paris 6, 25 June 1981.
- PETITJEAN M.
Agrégation des similarités: une solution oubliée.
RAIRO Oper. Res. 2002,36,101-108.
- JUNG Y., PARK H., DU D.-Z., DRAKE B.L.
A Decision Criterion for the Optimal Number of Clusters in Hierarchical Clustering.
J. Glob. Opt. 2003,25,91-111.
- MESLAMANI J.E., ANDRÉ F., PETITJEAN M.
Assessing the Geometric Diversity of Cytochrome P450 Ligand Conformers by Hierarchical Clustering with a Stop Criterion.
J. Chem. Inf. Model. 2009,49,330-337.
- PETITJEAN M.
Optimal Partitioning and Clustering: How Many Classes ?
Invited seminar, ABI (Atelier de Bioinformatique), Université Paris 6, 12 May 2010.