On Entropy-Based Dependence Measures for Two and Three Dimensional Categorical Variable Distributions

Ilkka Virtanen and Jaakko Astola

Abstract

It is well known that the entropy-based concept of mutual information provides a measure of dependence between two discrete random variables. There are several ways to normalize this measure in order to obtain a coefficient similar, for example, to Pearson's coefficient of contingency.

In our paper we propose and study one way of normalizing the mutual information. Two factors make our normalization attractive.

First, the resulting coefficient behaves consistently over a family of test distributions. When we generate random variables with a "prescribed amount of dependence" among them, the entropy-based correlation coefficient agrees closely with this a priori amount of dependence.

Secondly, the definition of the information and the normalization procedure generalize directly to three dimensions. They produce a measure of total dependence among the three variables that can also reveal inverse association, or negative dependence, between the random variables (even for purely categorical variables).

In the two-dimensional case we define the entropy correlation coefficient r(H) by

(1)    r(H) = \left( \frac{2\,I(X,Y)}{H(X)+H(Y)} \right)^{1/2} = \left( 2\left( 1 - \frac{H(X,Y)}{H(X)+H(Y)} \right) \right)^{1/2},

where H(X,Y) is the joint entropy of X and Y, H(X) and H(Y) are the entropies of X and Y, respectively, and I(X,Y) = H(X) + H(Y) - H(X,Y) is the mutual information between X and Y.

The entropy correlation coefficient is shown to have, for example, the following properties: r(H) is scaled to [0, 1], with 0 indicating full independence and 1 complete dependence between the two variables. Further, r(H) increases almost linearly from 0 to 1 as the amount of dependence between X and Y increases.
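As an illustration (our own sketch, not part of the original paper; the function names and example tables are assumptions), the coefficient in (1) can be computed from a joint probability table of two categorical variables as follows. Entropies are taken in nats; since r(H) is a ratio of entropies, the choice of logarithm base cancels out.

import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a probability array; zero cells are ignored."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def entropy_corr_2d(pxy):
    """Entropy correlation coefficient (1): r(H) = sqrt(2 I(X,Y) / (H(X)+H(Y)))."""
    pxy = np.asarray(pxy, dtype=float)
    pxy = pxy / pxy.sum()              # counts -> joint probabilities
    hx = entropy(pxy.sum(axis=1))      # H(X) from the row marginal
    hy = entropy(pxy.sum(axis=0))      # H(Y) from the column marginal
    hxy = entropy(pxy)                 # joint entropy H(X,Y)
    i_xy = max(hx + hy - hxy, 0.0)     # I(X,Y), guarded against rounding error
    return np.sqrt(2.0 * i_xy / (hx + hy))

# An independent table gives r(H) = 0; a one-to-one (diagonal) table gives r(H) = 1.
print(entropy_corr_2d(np.outer([0.5, 0.5], [0.5, 0.5])))   # 0.0
print(entropy_corr_2d(np.diag([0.5, 0.5])))                # 1.0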

The entropy correlation coefficient for a three-dimensional distribution is defined as

(2)    r(H) = \left( \frac{3\,I(X,Y,Z)}{H(X)+H(Y)+H(Z)} \right)^{1/3},

where the total information I(X,Y,Z) among the three random variables X, Y and Z is defined using the entropies of different orders:

(3) I(X,Y,Z) = H(X,Y,Z) - H(X,Y) - H(Y,Z) - H(Z,X) + H(X) + H(Y) + H(Z).

It can be shown that r(H) in (2) is scaled to [-1, 1], with 0 indicating independence, 1 complete dependence, and -1 complete inverse dependence.
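A corresponding sketch for the three-dimensional coefficient (again our own illustration, with assumed names and an assumed example distribution) evaluates (3) from a joint table p(x, y, z) and normalizes it as in (2); taking the signed (real) cube root is the reading consistent with the [-1, 1] range, so negative values of I(X,Y,Z) remain visible. For two independent fair bits X, Y and Z = X XOR Y, every pair of variables is independent while the triple is completely interdependent, and the coefficient comes out at -1.

import numpy as np

def entropy(p):
    """Shannon entropy in nats, as in the two-dimensional sketch above."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def entropy_corr_3d(pxyz):
    """Three-dimensional coefficient (2), with I(X,Y,Z) computed as in (3)."""
    p = np.asarray(pxyz, dtype=float)
    p = p / p.sum()                                # counts -> joint probabilities
    hx = entropy(p.sum(axis=(1, 2)))               # first-order entropies
    hy = entropy(p.sum(axis=(0, 2)))
    hz = entropy(p.sum(axis=(0, 1)))
    hxy = entropy(p.sum(axis=2))                   # second-order entropies
    hyz = entropy(p.sum(axis=0))
    hzx = entropy(p.sum(axis=1))
    hxyz = entropy(p)                              # third-order (joint) entropy
    i_xyz = hxyz - hxy - hyz - hzx + hx + hy + hz  # equation (3)
    return np.cbrt(3.0 * i_xyz / (hx + hy + hz))   # equation (2), signed cube root

# X and Y independent fair bits, Z = X XOR Y: pairwise independent, jointly dependent.
p_xor = np.zeros((2, 2, 2))
for x in (0, 1):
    for y in (0, 1):
        p_xor[x, y, x ^ y] = 0.25
print(entropy_corr_3d(p_xor))                      # -1.0, complete inverse dependence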

(Proceedings of the First World Congress of the Bernoulli Society, Tashkent, 8.-14.9.1986, 4 pp.)