R> plot(caver, hang=-1)
R> rect.hclust(caver, 3)
Single linkage has a tendency to chain observations: the most common case is to fuse a single observation to an existing class. The single link is the nearest neighbour, and a close neighbour is more likely to be found in a large group than in a small group or a lonely point. Complete linkage has a tendency to produce compact bunches: the complete link minimizes the spread within the cluster. Average linkage lies between these two extremes.
We can extract a classification at a certain level using function cutree:
R> cl <- cutree(ccom, 3)
R> cl
This gives a numeric classification vector of cluster identities. The clusters are numbered in the order the observations appear in the data: the first item will always belong to cluster 1, and the numbering does not match the dendrogram. We can tabulate the numbers of observations in each cluster:
R> table(cl)
We can compare two clustering schemes by cross-tabulation, which gives us a confusion matrix:
R> table(cl, cutree(csin, 3))
R> table(cl, cutree(caver, 3))
The confusion matrix tabulates the classifications against each other. The rows give the first classification, and the columns the second classification. If the classifications match, each row and each column has only one non-zero entry, but if a class is divided between several classes in the second classification, its row has several non-zero entries.
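The cross-tabulation can be tried without the tutorial's data. The following is a minimal sketch in base R; the built-in USArrests table and the names d_demo, h_single, h_complete and confusion are assumptions made for illustration and do not come from the tutorial.

```r
# Illustrative data: the built-in USArrests table stands in for the
# tutorial's community data (an assumption made for this sketch).
d_demo <- dist(scale(USArrests))

# Two hierarchical clusterings of the same dissimilarities
h_single   <- hclust(d_demo, method = "single")
h_complete <- hclust(d_demo, method = "complete")

# Cut both trees into three classes
cl_complete <- cutree(h_complete, 3)
cl_single   <- cutree(h_single, 3)

# Rows: complete linkage classes; columns: single linkage classes
confusion <- table(complete = cl_complete, single = cl_single)
print(confusion)

# Number of non-zero cells per row: a value of 1 means that the class
# falls entirely within one class of the other classification
nonzero_per_row <- rowSums(confusion > 0)
print(nonzero_per_row)
```

A row with more than one non-zero cell marks a complete linkage class that single linkage splits between several classes.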
2.3 Clustering and Ordination
We can use ordination to display the observed dissimilarities among points.
A natural choice is metric scaling, also known as principal coordinates analysis (PCoA), which maps the observed dissimilarities linearly onto a low-dimensional graph using the same dissimilarities we had in our clustering.
The metric scaling can be performed with standard R function cmdscale:
R> ord <- cmdscale(d)
We can display the results using vegan function ordiplot that can plot results of any vegan ordination function and many non-vegan ordination functions, such as cmdscale, prcomp and princomp (the latter for principal components analysis):
R> ordiplot(ord)
We got a warning because ordiplot tries to plot both species and sites in the same graph, and the cmdscale result has no species scores. We do not need to care about this warning.
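What a linear mapping of dissimilarities means can be checked in the special case of Euclidean distances, where metric scaling with enough axes reproduces the distances exactly. This is a sketch under that assumption, again using the built-in USArrests data rather than the tutorial's dissimilarities; the names d_demo, ord_full and ord_2d are illustrative.

```r
# Euclidean distances among 50 observations described by 4 variables
# (illustrative data, not the dissimilarities used in the tutorial)
d_demo <- dist(scale(USArrests))

# With as many axes as variables, metric scaling of Euclidean
# distances recovers the distances up to numerical precision
ord_full <- cmdscale(d_demo, k = 4)
max_err <- max(abs(as.vector(dist(ord_full)) - as.vector(d_demo)))
print(max_err)

# A two-dimensional solution keeps only part of the variation;
# the eigenvalues show how much
ord_2d <- cmdscale(d_demo, k = 2, eig = TRUE)
pos_eig <- ord_2d$eig[ord_2d$eig > 0]
explained <- sum(ord_2d$eig[1:2]) / sum(pos_eig)
print(explained)
```

For non-Euclidean dissimilarities some eigenvalues can be negative, and the distances cannot be reproduced exactly in any number of real axes.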
There are many vegan functions to overlay classification onto ordination. For distinct, non-overlapping classes convex hulls are practical:
Figure 3: A dendrogram is similar to a mobile: branches can turn around, but the mobile is the same. Alexander Calder: Red Mobile.
R> ordihull(ord, cl, lty=3)
R> ordispider(ord, cl, col=\
R> ordiellipse(ord, cl, col=\\
For overlapping classes we can use ordispider. If we are interested not in individual points but in class centroids and average dispersions, we can use ordiellipse. Both take the same arguments as ordihull. The other clustering results can be seen with:
R> ordiplot(ord, dis="sites")
R> ordihull(ord, cutree(caver, 3))
R> ordiplot(ord, dis="sites")
R> ordicluster(ord, csin)
Here we explicitly set the display argument to display = "sites" to avoid the annoying and useless warnings. The contrasting clustering strategies (nearest vs. furthest vs. average neighbour) are evident in the shapes of the clusters. Single linkage clusters are chained, complete linkage clusters are compact, and average linkage clusters are between these two.
The vegan package has a special function to display the cluster fusions in ordination. The ordicluster function combines sites and cluster centroids in the same way as the average linkage method:
R> ordiplot(ord, dis="sites")
R> ordicluster(ord, caver)
We can prune the top level fusions to highlight the clustering:
R> ordiplot(ord, dis="sites")
R> ordicluster(ord, caver, prune=2)
2.4 Reordering a Dendrogram
The leaves of a dendrogram do not have a natural order. You can take a branch and turn around its root, and the tree is the same (see Fig. 3).
R has two alternative dendrogram presentations: the hclust result object and a general dendrogram object. The cluster type can be changed with:
R> den <- as.dendrogram(caver)
The dendrograms are more general, and several methods are available for their manipulation and analysis. It is possible to re-order the leaves of a dendrogram so that they match as closely as possible an external variable.
In the following we rearrange the dendrogram so that the ordering of leaves corresponds as closely as possible with the first ordination axis:
R> x <- scores(ord, display = \
R> oden <- reorder(den, x)
We plot the dendrograms together:
R> par(mfrow=c(2,1))
R> plot(den)
R> plot(oden)
R> par(mfrow=c(1,1))
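The reordering can be sketched in base R alone. The example below assumes the built-in USArrests data and uses the first metric scaling axis as the external variable; the names den_demo, x_demo and oden_demo, and the choice agglo.FUN = mean, are illustrative assumptions and not the tutorial's settings.

```r
# Illustrative data and an average linkage tree
d_demo <- dist(scale(USArrests))
h_aver <- hclust(d_demo, method = "average")
den_demo <- as.dendrogram(h_aver)

# External variable: scores on the first metric scaling axis,
# in the same order as the observations in the data
x_demo <- cmdscale(d_demo, k = 1)[, 1]

# Turn the branches around so that they are sorted by increasing weight;
# with agglo.FUN = mean a branch is valued by the mean of its sub-branch values
oden_demo <- reorder(den_demo, x_demo, agglo.FUN = mean)

# Leaf order (indices of observations) before and after reordering
leaf_before <- order.dendrogram(den_demo)
leaf_after  <- order.dendrogram(oden_demo)

# Correlation between leaf position and the external variable
rho_before <- cor(seq_along(leaf_before), x_demo[leaf_before], method = "spearman")
rho_after  <- cor(seq_along(leaf_after),  x_demo[leaf_after],  method = "spearman")
print(c(before = rho_before, after = rho_after))
```

Only the left-to-right order of the leaves changes: the fusions and their heights are those of the original tree.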
The reordered dendrogram may also give a more regularly structured community table:
R> vegemite(dune, oden)
The vegemite function has a graphical sister function tabasco² that can also display the dendrogram. Moreover, it defaults to rearranging the dendrogram by the first axis of Correspondence Analysis:
R> tabasco(dune, caver)
Correspondence Analysis packs similar species next to each other, and similar sites next to each other and gives a good diagonal representation of the data. If you want to see the original ordering of the sample plots, you must set Rowv = FALSE:
R> tabasco(dune, caver, Rowv = FALSE)
R> tabasco(dune, oden, Rowv = FALSE)
The tabasco function also defaults to ordering species to match the ordering of sites unless you set Colv = FALSE.
2.5 Minimum Spanning Tree
In mathematical graph theory, a tree is a connected graph without loops, a spanning tree is a tree connecting all points, and a minimum spanning tree is the shortest of such trees. The minimum spanning tree (mst) is closely related to single linkage clustering, which also connects all points with minimum total connecting distance. However, mst really combines points, whereas R representations of single linkage clustering hierarchically connect clusters instead of single points. mst can be found with vegan function spantree³:
R> mst <- spantree(d)
mst can be overlaid onto an ordination with the lines command:
R> ordiplot(ord, dis="sites")
R> lines(mst, ord)
² So called because it is similar to vegemite but hotter.
³ There are many other implementations of MST in R, but the implementation in vegan probably is the fastest.
Alternatively, the tree can be displayed with a plot command that tries to find a locally optimal configuration for points to reproduce the distances among points along the tree. Internally the function uses Sammon scaling (sammon function in the MASS package) to find the configuration of points. Sammon scaling is a variant of metric scaling that tries to reproduce relative distances among points, and it is optimal for showing the local (vs. global) structure of points.
R> plot(mst, type=\
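The link between the minimum spanning tree and single linkage clustering can be checked numerically. The sketch below does not use spantree: it is a plain base R version of Prim's algorithm, written only for illustration on the built-in USArrests data, and all object names in it are assumptions of this example.

```r
# Illustrative dissimilarities as a full symmetric matrix
d_demo <- dist(scale(USArrests))
dm <- as.matrix(d_demo)
n <- nrow(dm)

# Prim's algorithm: grow the tree from point 1, always adding the
# outside point that is closest to any point already in the tree
in_tree   <- c(TRUE, rep(FALSE, n - 1))
best_dist <- unname(dm[1, ])   # shortest known link from the tree to each point
best_from <- rep(1L, n)        # tree point at the other end of that link
from <- integer(n - 1)
to   <- integer(n - 1)
len  <- numeric(n - 1)

for (k in seq_len(n - 1)) {
  cand <- which(!in_tree)
  j <- cand[which.min(best_dist[cand])]
  from[k] <- best_from[j]
  to[k]   <- j
  len[k]  <- best_dist[j]
  in_tree[j] <- TRUE
  # Update the shortest links for points that are closer to the new member
  closer <- !in_tree & dm[j, ] < best_dist
  best_dist[closer] <- dm[j, closer]
  best_from[closer] <- j
}
edges <- data.frame(from = from, to = to, dist = len)

# The n - 1 edge lengths of the tree are the fusion levels of single linkage
h_single <- hclust(d_demo, method = "single")
print(all.equal(sort(edges$dist), sort(h_single$height)))
```

The two methods use the same set of connecting distances; they differ in what they report, a list of point-to-point links versus a hierarchy of cluster fusions.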
2.6 Cophenetic Distance
The estimated distance between two points is the level at which they are fused in the dendrogram, or the height of the root. A good clustering method correctly reproduces the actual dissimilarities. The distance estimated from a dendrogram is called cophenetic distance. The name echoes the origins of hierarchic clustering in old-fashioned numeric taxonomy. Standard R function cophenetic estimates the distances among all points from a dendrogram.
We can visually inspect the cophenetic distances against observed dissimilarities. In the following, abline adds a line with zero intercept and slope of one, or the equivalence line. We also set equal scaling on x and y axes (asp = 1):
R> plot(d, cophenetic(csin), asp=1)
R> abline(0, 1)
R> plot(d, cophenetic(ccom), asp=1)
R> abline(0, 1)
R> plot(d, cophenetic(caver), asp=1)
R> abline(0, 1)
In single linkage clustering, the cophenetic distances are as long as or shorter than the observed distances: the distance between groups is the shortest possible distance between its members. In complete linkage clustering, the cophenetic distances are as long as or longer than observed distances: the distance between two groups is the longest possible distance between groups. In average linkage clustering, the cophenetic distance is the average of observed distances (cf. Fig. 1).
The correlation between observed and cophenetic distances is called the cophenetic correlation:
R> cor(d, cophenetic(csin))
R> cor(d, cophenetic(ccom))
R> cor(d, cophenetic(caver))
The ranking of these cophenetic correlations is not entirely random: it is guaranteed that the average linkage (UPGMA) method maximizes the cophenetic correlation.
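The inequalities described above for single and complete linkage can be verified directly. This is a sketch on the built-in USArrests data, not on the tutorial's dissimilarities; the names d_demo, trees, coph and coph_cor are illustrative, and the small tolerance only guards against rounding.

```r
# Illustrative dissimilarities and the three linkage methods
d_demo <- dist(scale(USArrests))
linkages <- c("single", "complete", "average")
trees <- lapply(linkages, function(m) hclust(d_demo, method = m))
names(trees) <- linkages

# Cophenetic distances of each tree as plain numeric vectors
coph <- lapply(trees, function(tr) as.vector(cophenetic(tr)))
obs <- as.vector(d_demo)

# Single linkage: cophenetic distance never exceeds the observed one
single_never_longer <- all(coph$single <= obs + 1e-12)
# Complete linkage: cophenetic distance is never below the observed one
complete_never_shorter <- all(coph$complete >= obs - 1e-12)
print(c(single_never_longer, complete_never_shorter))

# Cophenetic correlations of the three methods
coph_cor <- sapply(coph, function(x) cor(obs, x))
print(round(coph_cor, 3))
```

The correlations depend on the data, so the printed values apply only to this example.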
3 Interpretation of Classes
We commonly want to use classes for prediction of external variables: this is the idea of the Finnish forest type system, the EU Water Framework Directive and