The first choice that must be made is how similarity (or alternatively, distance) between gene expression data is to be defined. There are many ways to compute how similar two series of numbers are. Cluster provides eight options.
The most commonly used similarity metrics are based on Pearson correlation. The Pearson correlation coefficient between any two series of numbers x = {x1, x2, ..., xn} and y = {y1, y2, ..., yn} is defined as
$$ r = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{\sigma_x} \right) \left( \frac{y_i - \bar{y}}{\sigma_y} \right) $$

where $\bar{x}$ is the average of the values in x and $\sigma_x$ is the standard deviation of these values.
There are many ways of conceptualizing the correlation coefficient. If you were to make a scatterplot of the values of x against y (pairing x1 with y1, x2 with y2, etc.), then r reports how well a straight line can be fitted to the plotted points.
The simplest way to think about the correlation coefficient is to plot x and y as curves, with r telling you how similar the shapes of the two curves are. The Pearson correlation coefficient is always between -1 and 1, with 1 meaning that the two series are identical, 0 meaning they are completely uncorrelated, and -1 meaning they are perfect opposites. The correlation coefficient is invariant under linear transformation of the data. That is, if you multiply all the values in y by 2, or add 7 to all the values in y, the correlation between x and y will be unchanged. Thus, two curves that have identical shape, but different magnitude, will still have a correlation of 1.
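To make the definition concrete, here is a minimal Python sketch of the centered Pearson correlation as given by the formula above; the function name pearson is our own, not part of Cluster, and the standard deviations are population ones (dividing by n), matching the formula:

```python
import math

def pearson(x, y):
    # Centered Pearson correlation, following the formula above.
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Population standard deviations (divide by n, not n - 1).
    sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / n)
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / n)
    return sum((a - mean_x) * (b - mean_y)
               for a, b in zip(x, y)) / (n * sd_x * sd_y)

print(pearson([2.3, 6.7, 4.5, 20.8], [2.1, 5.9, 4.4, 4.2]))  # ~0.23
```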
Cluster actually uses four different flavors of the Pearson correlation. The textbook
Pearson correlation coefficient, given by the formula above, is used if you select
Correlation (centered) in the Similarity Metric dialog box.
Correlation (uncentered) uses the following modified equations:
$$ r = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i}{\sigma_x^{(0)}} \right) \left( \frac{y_i}{\sigma_y^{(0)}} \right) $$

where

$$ \sigma_x^{(0)} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2} \qquad \text{and} \qquad \sigma_y^{(0)} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} y_i^2} $$

This is the same as the Pearson correlation except that the means of x and y are taken to be zero, which may be appropriate when there is a meaningful zero reference state. The remaining two flavors, Absolute Correlation (centered) and Absolute Correlation (uncentered), use the absolute value of the corresponding correlation coefficient.
The Spearman rank correlation and Kendall’s τ are two additional metrics, which are non-parametric versions of the Pearson correlation coefficient. These methods are more robust against outliers.
The Spearman rank correlation calculates the correlation between the ranks of the data values in the two vectors. For example, for the two data vectors x = {2.3, 6.7, 4.5, 20.8} and y = {2.1, 5.9, 4.4, 4.2} (the same data as in the table below), the ranks are {1, 3, 2, 4} and {1, 4, 3, 2}, and the Spearman rank correlation is the Pearson correlation computed between these two rank vectors, which evaluates to 0.4. Replacing the data values by their ranks reduces the effect of the outlier 20.8.
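As a sketch (ignoring ties, which a full implementation would handle by averaging tied ranks), the Spearman rank correlation can be computed by ranking both vectors and applying the Pearson correlation to the ranks; here we use statistics.correlation from the Python 3.10+ standard library:

```python
from statistics import correlation  # Pearson correlation, Python 3.10+

def spearman(x, y):
    def ranks(v):
        # Rank of each value: 1 for the smallest, n for the largest.
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    return correlation(ranks(x), ranks(y))

print(spearman([2.3, 6.7, 4.5, 20.8], [2.1, 5.9, 4.4, 4.2]))  # 0.4
```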
Kendall’s τ goes a step further by using only the relative ordering of x and y to calculate the correlation (Snedecor & Cochran). To calculate Kendall’s τ, consider all pairs of data points (xi, yi) and (xj, yj). We call a pair concordant if

$$ x_i < x_j \text{ and } y_i < y_j, \qquad \text{or} \qquad x_i > x_j \text{ and } y_i > y_j, $$

and discordant if

$$ x_i < x_j \text{ and } y_i > y_j, \qquad \text{or} \qquad x_i > x_j \text{ and } y_i < y_j. $$
We can represent this by a table:
              (2.3, 2.1)   (6.7, 5.9)   (4.5, 4.4)   (20.8, 4.2)
(2.3, 2.1)        -            <<           <<            <<
(6.7, 5.9)        >>           -            >>            <>
(4.5, 4.4)        >>           <<           -             <>
(20.8, 4.2)       >>           ><           ><            -

Here each cell compares the row pair (xi, yi) with the column pair (xj, yj); << and >> denote concordant pairs, while <> and >< denote discordant pairs. From this table, we count four concordant pairs and two discordant pairs. Kendall’s τ is then given by
$$ \tau = \frac{n_c - n_d}{n(n-1)/2}, $$

where $n_c$ is the number of concordant pairs and $n_d$ the number of discordant pairs among the $n(n-1)/2$ pairs in total. For the example above, τ = (4 − 2)/6 ≈ 0.33.
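A direct O(n²) sketch of this computation (the function name is ours; pairs that tie in either variable are counted as neither concordant nor discordant):

```python
def kendall_tau(x, y):
    n = len(x)
    concordant = discordant = 0
    # Compare every pair (i, j); the sign of the product tells us
    # whether x and y order the pair the same way.
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

print(kendall_tau([2.3, 6.7, 4.5, 20.8], [2.1, 5.9, 4.4, 4.2]))  # 0.333...
```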
A newly added distance function is the Euclidean distance, which is defined as
$$ d = \frac{1}{n} \sum_{i=1}^{n} (x_i - y_i)^2 $$
Unlike the correlation-based distance measures, the Euclidean distance takes the magnitude of changes in the gene expression levels into account. An example of the Euclidean distance applied to k-means clustering can be found in De Hoon, Imoto, and Miyano (2002).
The city-block distance, alternatively known as the Manhattan distance, is related to the Euclidean distance. Whereas the Euclidean distance corresponds to the length of the shortest path between two points, the city-block distance is the sum of distances along each dimension:
$$ d = \frac{1}{n} \sum_{i=1}^{n} |x_i - y_i| $$
As with the Euclidean distance, the expression data are subtracted directly from each other, and we should therefore make sure that they are properly normalized.
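Both distance measures translate directly into code. A minimal sketch, assuming the 1/n averaging shown in the formulas above (the function names are ours):

```python
def euclidean(x, y):
    # Mean squared difference; note that no square root is taken.
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def cityblock(x, y):
    # Mean absolute difference along each dimension.
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)
```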
When either x or y has missing values, only observations present for both x and y are used in computing similarities.
With any specified metric, the first step in the hierarchical clustering routines described below is to compute the distance (the opposite of similarity; for all correlation metrics, distance = 1.0 - correlation) between all pairs of items to be clustered (e.g. the set of genes in the current dataset). This can often be time-consuming and, except for pairwise single-linkage clustering, memory-intensive (the maximum amount of memory required is 4 x N x N bytes, where N is the number of items being clustered). The algorithm for pairwise single-linkage hierarchical clustering is less memory-intensive (linear in N).
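To illustrate where the time and memory go, here is a sketch of the all-pairs distance computation; distance_matrix and its dist argument are our own names, with a correlation-based metric plugged in as distance = 1.0 - correlation:

```python
def distance_matrix(items, dist):
    # All-pairs distances: O(N^2) time, and an N x N matrix in memory,
    # which is what makes clustering large datasets costly.
    n = len(items)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = dist(items[i], items[j])
    return d

# Example: correlation-based distance, reusing the pearson() sketch above.
# matrix = distance_matrix(genes, lambda x, y: 1.0 - pearson(x, y))
```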