CLASSIFICATION


  1. Computing an Optimal Partition from Weighted Categorical Variables or from an Array of Signed Dissimilarities
  2. Computing an Optimal Partition from Numerical Variables or from an Array of Dissimilarities or Distances


© Michel Petitjean (retired since Jan 1st, 2023)

Author's most recent professional address:
INSERM ERL U1133 (BFA, CNRS UMR 8251), Université Paris Cité
35 rue Hélène Brion, 75205 Paris Cedex 13, France.

Formerly (2010-2018): MTi, INSERM UMR-S 973, Université Paris 7.
Formerly (2007-2009): CEA/DSV/iBiTec-S/SB2SM (CNRS URA 2096), Saclay, France.
Formerly (1987-2006): ITODYS, CNRS UMR 7086, Université Paris 7.

Contact: petitjean.chiral@gmail.com




Other topics:


1. COMPUTING AN OPTIMAL PARTITION FROM WEIGHTED CATEGORICAL VARIABLES OR FROM AN ARRAY OF SIGNED DISSIMILARITIES



There are many methods of classification of n individuals, but little of them are able to compute the number of classes K from the data only. In 1981, François Marcotorchino published in his doctoral thesis [1] a method computing the optimal partition of n individuals described by p categorical variables. A major advantage of the method is that the number K of classes is computed from the data, and ONLY from the data.

The data are converted into an array of signed similarities before solving the optimization problem. It follows that an optimal partition may be computed from any signed similarity array with the same algorithm.

Depending on the data, the optimal partition may be not unique. The optimal cost of the partition is unique.


Theory.


The content of this section can be retrieved in my paper in french [2]. The essential of the theory is also available in french in the doctoral thesis of Marcotorchino [1]. An English document is also available [5].

Notations: Ideally, the best partition should be such that Pij=1 when Sij is positive, and Pij=0 when Sij is negative. Unfortunately, such a matrix may not satisfy all the transitivity constraints, and thus is not, in general, a partition matrix. Thus, the optimal partition is defined to be maximizing the sum Z of all the quantities PijSij, for all the n2 couples (i,j).
The signed similarity between i and j being assumed equal to the signed similarity between j and i, and the diagonal elements of P being known, we get a constrained optimization problem with n(n-1)/2 unknowns boolean values. Note that when S is not symmetric, it is equivalent to work with the symmetric matrix (S+S')/2, where S' is the transposed of S.

Anyway, the optimal partition and its number of classes is computed only from the data matrix S.
The optimal partition P may be not unique (e.g. when all elements of S are null): only the optimal cost Z is unique.

The difficulty attached to this classification method is to convert the input data into a signed similarity array, following a non arbitrary way.

Such a conversion is possible when the n individuals are decsribed by p categorical variables. Note that each categorical variable defines a classification, but the p classifications are in general quite different. It means that we are looking for an optimal partition, as close as possible from the p input partitions P1 ,P2,... Pp. A solution could be to consider the mean partition matrix Q=(P1+P2+...+Pp)/p, but unfortunately Q is not, in general, a partition matrix. Thus, we look for the optimal partition P which minimizes the distance between the mean partition matrix Q and P, in the sense of the Schur norm of P-Q. It is recalled than the Schur norm generalizes the ordinary euclidean norm from vectors to rectangular matrices.

In this situation, each signed similarity Sij is the number of categorical variables such that i and j are classified together, minus the number of categorical variables such that i and j are not classified together. The maximization of Z is thus the maximization of the number of agreements minus the number of disagreements. For each couple (i,j), the number of variables such that i and j are together plus the number of variables such that i and j are not together is equal to p. Thus, the partition maximizing Z maximizes the number of agreements.
The methods extends to weighted categorical variables.


Numerical Solution of the Optimization Problem


Marcotorchino noticed in his thesis that the transitivity constraints are linear, i.e., for the individuals i,j,k, the linear combination Pij+Pjk-Pik should be less than or equal to 1. It follows that an optimal partition may be computed by linear programming [1].

The algorithm described in [2] is a dynamic programming one, formally related to the Faure and Malgrange boolean algorithm, but it is working only as O(n2) in space. It permits either to compute an optimal partition, or to enumerate all the optimal partitions, although the linear programming approach is rather devoted to the computation of only one optimal partition.

The freeware POP is an implementation of this dynamic programming algorithm, and is downloadable from the software page.
An other version of POP is available in the amap package of the R statistical freeware.





2. COMPUTING AN OPTIMAL PARTITION FROM NUMERICAL VARIABLES OR FROM AN ARRAY OF DISSIMILARITIES OR DISTANCES



There are many methods of classification of n individuals described by p numerical variables or by an (n,n) array of dissimilarities or distances, but it has been since a long time a challenge to compute the number of classes ONLY from the data. One of the approachs of this problem consists to define a stop criterion to the hierarchical clustering. In 2003, Jung, Park, Du and Drake [3] defined the clustering gain, which takes the value zero when there are either n clusters or one unique cluster, and takes non negative values at each step of the ascending hierarchical clustering. It follows that the clustering gain is ensured to reach a maximum, thus providing an optimal number of classes.

The original method was defined for points with coordinates in the euclidean space. The method has been modified and extended to non-euclidean spaces, so that it operates on an (n,n) array of dissimilarities or non-euclidean distances [4]. See also the pdf of my lecure at the ABI [5].

The freeware MCG is an implementation of the modified clustering gain algorithm, and is downloadable from the software page.
An application to the generation of 3D chemical data banks for virtual screening has been provided [4].




REFERENCES


  1. MARCOTORCHINO F.
    Agrégation des similarités en classification automatique.
    Thèse de Doctorat d'Etat en Mathématiques,
    Université Paris 6, 25 June 1981.

  2. PETITJEAN M.
    Agrégation des similarités: une solution oubliée.
    RAIRO Oper. Res. 2002, 36[1], 101-108. DOI 10.1051/ro:2002001.
    A free copy deposited with permission from EDP Sciences (see copyright rules), is available from the HAL repository: hal-02123085.

  3. JUNG Y., PARK H., DU D.-Z., DRAKE B.L.
    A Decision Criterion for the Optimal Number of Clusters in Hierarchical Clustering.
    J. Glob. Opt. 2003, 25[1], 91-111.

  4. MESLAMANI J.E., F. ANDRÉ F., PETITJEAN M.
    Assessing the Geometric Diversity of Cytochrome P450 Ligand Conformers by Hierarchical Clustering with a Stop Criterion.
    J. Chem. Inf. Model. 2009, 49[2], 330-337. DOI 10.1021/ci800275k.

  5. PETITJEAN M.
    Optimal Partitioning and Clustering: How Many Classes ?
    Invited seminar, ABI (Atelier de Bioinformatique), Université Paris 6, 12 May 2010.
    Download PDF file of the lecture from the HAL repository: hal-02124947.