Cluster Analysis: Tutorial with R
Jari Oksanen January 26, 2014
Contents
1 Introduction
2 Hierarchic Clustering
2.1 Description of Classes
2.2 Numbers of Classes
2.3 Clustering and Ordination
2.4 Reordering a Dendrogram
2.5 Minimum Spanning Tree
2.6 Cophenetic Distance
3 Interpretation of Classes
3.1 Environmental Interpretation
3.2 Community Summaries
4 Optimized Clustering at a Given Level
4.1 Optimum Number of Classes
5 Fuzzy Clustering
1 Introduction
In this tutorial we inspect classification. Classification and ordination are alternative strategies of simplifying data. Ordination tries to simplify data into a map showing similarities among points. Classification simplifies data by putting similar points into the same class. The task of describing a large number of points is thereby reduced to the easier task of describing a small number of classes.
2 Hierarchic Clustering
The classification methods are available in standard R packages. The vegan package does not have many support functions for classification, but we still load vegan to have access to its data sets and some of its support functions.¹

¹ If you do not have a package and get an error message, you must install the package using install.packages("vegan") or the installation menu.
Figure 1: Distance between two clusters A and B defined by single, complete and average linkage. Mark each of the linkage types in the connecting line. The fusion level in the cluster dendrogram would be the length of the corresponding connecting line of the linkage type.
R> library(vegan)
R> data(dune)
Hierarchic clustering (function hclust) is in standard R and available without loading any specific libraries. Hierarchic clustering needs dissimilarities as its input. Standard R has function dist to calculate many dissimilarity indices, but for community data we may prefer vegan function vegdist with ecologically useful dissimilarity indices. The default index in vegdist is Bray-Curtis:
R> d <- vegdist(dune)
Ecologically useful indices in vegan have an upper limit of 1 for absolutely different sites (no shared species), and they are based on differences of abundances. In contrast, the standard Euclidean distance has no upper limit, but varies with the sum of total abundances of compared sites when there are no shared species, and uses squares of differences of abundances. There are many other ecologically useful indices in vegdist, but Bray-Curtis is usually not a bad choice.
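The contrast between a bounded index and Euclidean distance can be checked by hand. The sketch below uses only base R; the helper name bray_curtis and the two site vectors are hypothetical, and the formula is the usual abundance form of Bray-Curtis (sum of absolute differences divided by the sum of all abundances).

```r
# Bray-Curtis dissimilarity between two abundance vectors:
# sum of absolute differences divided by the sum of all abundances.
# (Hypothetical helper for illustration; vegdist() does this for whole tables.)
bray_curtis <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Two made-up sites with no shared species
site1 <- c(5, 3, 0, 0)
site2 <- c(0, 0, 2, 4)

# Same pair of sites at the original and at tenfold abundances
bc_small <- bray_curtis(site1, site2)
bc_large <- bray_curtis(10 * site1, 10 * site2)

# Euclidean distance from base R's dist()
eu_small <- as.numeric(dist(rbind(site1, site2)))
eu_large <- as.numeric(dist(rbind(10 * site1, 10 * site2)))

print(c(bray_small = bc_small, bray_large = bc_large))
print(c(euclid_small = eu_small, euclid_large = eu_large))
```

With no shared species the Bray-Curtis value stays at its upper limit whatever the total abundances are, whereas the Euclidean distance grows with them.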
There are several alternative clustering methods in the standard function hclust. We shall inspect three basic methods: single linkage, complete linkage and average linkage. All these start in the same way: they fuse the two most similar points into a cluster. They differ in the way they combine clusters with each other, or new points with existing clusters (Fig. 1). In single linkage (a.k.a. nearest neighbour, or neighbour joining tree in genetics) the distance between two clusters is the shortest possible distance among members of the clusters, or the best of the friends. In complete linkage (a.k.a. furthest neighbour) the distance between two clusters is the longest possible distance between the groups, or the worst among the friends. In average linkage, the distance between the clusters is the distance between cluster centroids. There are several alternative ways of defining the average and defining the closeness, and hence a huge number of average linkage methods. We only use one of these methods, commonly known as UPGMA. The lecture slides discuss the methods in more detail.
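The three between-cluster distances can be worked out directly from a dissimilarity matrix. This is a minimal sketch with four hypothetical points on a line and hand-picked clusters A and B. For the average it assumes the group-average rule (mean of all pairwise distances between the clusters), which is how UPGMA is commonly defined. Other averaging rules would give other values.

```r
# Four made-up points on a line, already split into two clusters
x <- c(a1 = 0, a2 = 1, b1 = 4, b2 = 6)
d <- as.matrix(dist(x))          # full symmetric distance matrix

A <- c("a1", "a2")
B <- c("b1", "b2")
between <- d[A, B]               # all pairwise distances between A and B

single   <- min(between)         # nearest pair: "the best of the friends"
complete <- max(between)         # furthest pair: "the worst among the friends"
average  <- mean(between)        # group average (assumed UPGMA definition)

print(between)
print(c(single = single, complete = complete, average = average))
```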
In the following we will compare three different clustering strategies. If you want to plot three graphs side by side, you can divide the screen into three panels by
R> par(mfrow=c(1,3))
This defines three panels side by side. You probably want to stretch the plotting window if you are using this option. Alternatively, you can have three panels above each other with
R> par(mfrow=c(3,1))
You can get back to the single panel mode with
R> par(mfrow=c(1,1))
You may also wish to use narrower empty margins for the panels:
R> par(mar=c(3,4,1,1)+.1)
The mar parameter defines plot margins in the order bottom, left, top, right, using row height (text height) as the unit.
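Because par returns the previous values of the parameters it changes, the old layout can be saved and restored in one step instead of resetting each parameter by hand. The sketch below illustrates that idiom. The null PDF device and the placeholder plots are assumptions made only so that the example runs without opening a window or writing a file.

```r
pdf(NULL)                         # null device: nothing is written to disk

# par() returns the old values of the parameters it changes
op <- par(mfrow = c(1, 3), mar = c(3, 4, 1, 1) + 0.1)

for (i in 1:3) plot(1:10)         # placeholder plots filling the three panels
panels_during <- par("mfrow")     # layout while the three panels are active

par(op)                           # restore the saved layout and margins
panels_after <- par("mfrow")

invisible(dev.off())
```

In an interactive session you would drop the pdf(NULL) and dev.off() calls and draw your dendrograms between par(...) and par(op).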
The single linkage clustering can be found with:
R> csin <- hclust(d, method="single")
R> csin
The dendrogram can be plotted with:
R> plot(csin)
The default is to plot an inverted tree with the root at the top and branches hanging down. You can force the branches down to the base line by giving a negative hang argument:
R> plot(csin, hang=-1)
If you plotted the csin tree twice, you consumed two of your three panels, and there will not be space for the next two trees in the same plot. In that case you can start a new plot by issuing the mfrow command again and then drawing csin again.
The complete linkage and average linkage methods are found in the same way:
R> ccom <- hclust(d, method="complete")
R> plot(ccom, hang=-1)
R> caver <- hclust(d, method="average")
R> plot(caver, hang=-1)
The vertical axes of the cluster dendrograms show the fusion level. The two most similar observations are combined first, and they are at the same level in all dendrograms. At the upper fusion levels, the scales diverge: they are the shortest dissimilarities among cluster members in single linkage, the longest possible dissimilarities in complete linkage, and the distances among cluster centroids in average linkage (Fig. 1).
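The fusion levels can also be read as numbers from the height component of an hclust result, which makes the shared first level and the diverging upper levels easy to verify. The following is a small sketch on hypothetical one-dimensional data with base R's dist. The same comparison should work on trees built from any dissimilarity object.

```r
# Made-up positions of five observations on a line
x <- c(0, 1, 4, 6, 11)
d <- dist(x)

# Fusion levels, in the order the merges were made
h_single   <- hclust(d, method = "single")$height
h_complete <- hclust(d, method = "complete")$height
h_average  <- hclust(d, method = "average")$height

# One row per method, one column per fusion
print(rbind(single = h_single, complete = h_complete, average = h_average))
```

The first column is identical for all three methods, and in the last column the single linkage level is the lowest and the complete linkage level the highest.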
Figure 2: Vegemite is an Australian national delicacy made of yeast extract. The vegemite function was named because its output is just as dense as Vegemite.
2.1 Description of Classes
One problem with hierarchic clustering is that it gives a classification of observations (plots, sampling units), but it does not tell how these classes differ from each other. For community data, there is no information on how the species composition differs between classes (we return to this subject in Section 3.2).
The vegan package has function vegemite (Fig. 2) that can produce compact community tables ordered by a dendrogram, ordination or environmental variables. With the help of these tables it is possible to see which species differ among the classes:
R> vegemite(dune, caver)
The vegemite command always uses one-character columns. If the observed values do not fit in one character, vegemite refuses to work. With argument scale you can recode the values to one-character width. The vegemite function has a graphical sister function tabasco that is described in Section 2.4.
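The dune data already fit in one character, so the effect of scale is easier to see with data that do not. The sketch below assumes that vegan is installed and uses its varespec data set, whose cover values need more than one character. The choice of the "Hult" cover scale is only an illustration, and other scales listed in the vegemite help page could be used instead.

```r
library(vegan)        # assumes the vegan package is installed
data(varespec)        # cover values that do not fit in one character

# Without scale the call is expected to stop with an error;
# tryCatch() captures the message instead of aborting the script
res <- tryCatch(vegemite(varespec),
                error = function(e) conditionMessage(e))
print(res)

# Recode the values to a one-character cover scale and print the table
vegemite(varespec, scale = "Hult")
```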
2.2 Numbers of Classes
The hierarchic clustering methods produce all possible levels of classification. The extremes are all observations in a single class, and each observation in its own private class. The user normally wants a clustering into a certain number of classes. The fixed classification can be visually demonstrated with the rect.hclust function:
R> plot(csin, hang=-1)
R> rect.hclust(csin, 3)
R> plot(ccom, hang=-1)
R> rect.hclust(ccom, 3)
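The rectangles only show the classes on the plot. If the class membership is needed as a vector, the standard function cutree cuts a tree into a given number of classes. It is not used in the text above, so the following is only a sketch on hypothetical one-dimensional data rather than on the dune trees.

```r
# Made-up positions forming three obvious groups
x <- c(0, 1, 4, 6, 11, 12)
tree <- hclust(dist(x), method = "average")

# Cut the tree into three classes: one integer class label per observation
cl <- cutree(tree, k = 3)

print(cl)
print(table(cl))      # number of observations in each class
```

With the trees of this tutorial the analogous call would be cutree(ccom, 3), which gives the same three classes that rect.hclust(ccom, 3) draws.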