Cluster Analysis of Microarray Data
4/13/2021
Copyright 2021 Dan Nettleton

1. Clustering
- Group objects that are similar to one another together in a cluster.
- Separate objects that are dissimilar from each other into different clusters.
- The similarity or dissimilarity of two objects is determined by comparing the objects with respect to one or more attributes that can be measured for each object.

2. Data for Clustering

                    attribute
   object      1      2      3    ...     m
      1       4.7    3.8    5.9   ...    1.3
      2       5.2    6.9    3.8   ...    2.9
      3       5.8    4.2    3.9   ...    4.4
     ...       .      .      .    ...     .
      n       6.3    1.6    4.7   ...    2.0

3-6. Microarray Data for Clustering
- The same n x m layout arises for microarray data, where the entries are estimated expression levels and
  - the objects are genes and the attributes are time points (slide 3),
  - the objects are genes and the attributes are tissue types (slide 4),
  - the objects are genes and the attributes are treatment conditions (slide 5), or
  - the objects are samples and the attributes are genes (slide 6).

7. Clustering: An Example Experiment
- Researchers were interested in studying gene expression patterns in developing soybean seeds.
- Seeds were harvested from soybean plants at 25, 30, 40, 45, and 50 days after flowering (daf).
- One RNA sample was obtained for each level of daf.

8. An Example Experiment (continued)
- Each of the 5 samples was measured on two two-color cDNA microarray slides using a loop design.
- The entire process was repeated on a second occasion to obtain a total of two independent biological replications.

9. Diagram Illustrating the Experimental Design
[Figure: the loop design connecting the 25, 30, 40, 45, and 50 daf samples, shown for Rep 1 and Rep 2.]

10. Normalized Data for One Example Gene
- The daf means estimated for each gene from a mixed linear model analysis provide a useful summary of the data for cluster analysis.
[Figure: normalized log signal vs. daf, and estimated means +/- 1 SE vs. daf, for one example gene.]

11. 400 genes exhibited significant evidence of differential expression across time.

[Slides 12-54 are not included in the extracted text.]

55. The Gap Statistic Suggests K = 3 Clusters
- The selected K is the smallest k for which G(k) >= G(k+1) - SE; here G(3) >= G(4) - SE.

56. Gap Analysis for Two-Color Array Data (N = 100)
[Figure: log Wk and log W*k plotted against k (number of clusters), and G(k) = log W*k - log Wk plotted against k with +/- 1 standard error bars.]

57. Gap Analysis for Two-Color Array Data (N = 100)
- The gap analysis estimates K = 11 clusters.
[Figure: "zoomed in" version of the previous G(k) plot.]

[Slides 58-68: figures only.]

69. Plot of Cluster Medoids
[Figure: profiles of the cluster medoids.]
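The slides do not include code, but a gap-statistic analysis along these lines can be sketched in R with the cluster package. Here `expr` is a hypothetical n x m matrix of the gene profiles to be clustered (e.g., the 400 x 5 matrix of estimated daf means); the choices of K.max, B, and pam() as the clustering routine are illustrative assumptions, not taken from the slides.

```r
## Sketch only: gap-statistic analysis for choosing the number of clusters K.
## `expr` is a hypothetical n x m data matrix (rows = genes, columns = conditions).
library(cluster)

set.seed(1)
gap <- clusGap(expr,
               FUNcluster = function(x, k)
                 list(cluster = pam(x, k, cluster.only = TRUE)),
               K.max = 15,   # largest K considered (illustrative)
               B = 100)      # number of reference data sets (illustrative)

plot(gap)   # G(k) with +/- 1 standard error bars, plotted against k

## Smallest k with G(k) >= G(k+1) - SE(k+1), the selection rule described above
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "Tibs2001SEmax")
```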
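Similarly, a K-medoid clustering and a plot of the medoid profiles (in the spirit of the "Plot of Cluster Medoids" slide) can be sketched as follows. Again `expr` is the hypothetical gene-by-daf matrix, and K = 11 echoes the estimate from the gap analysis on slide 57.

```r
## Sketch only: K-medoid clustering with pam() and a plot of the medoid profiles.
library(cluster)

fit <- pam(expr, k = 11)        # partitioning around medoids
table(fit$clustering)           # cluster sizes

daf <- c(25, 30, 40, 45, 50)    # days after flowering (the five columns)
matplot(daf, t(fit$medoids), type = "l", lty = 1,
        xlab = "daf", ylab = "estimated mean expression",
        main = "Cluster medoids")
```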
70. Principal Components
- Principal components can be useful for providing low-dimensional views of high-dimensional data.
- The data matrix (data set) X has n rows (observations, or objects) and m columns (variables, or attributes):

      X = [ x11  x12  ...  x1m ]
          [ x21  x22  ...  x2m ]
          [  .    .          . ]
          [ xn1  xn2  ...  xnm ]

71. Principal Components (continued)
- Each principal component of a data set is a variable obtained by taking a linear combination of the original variables in the data set.
- A linear combination of m variables x1, x2, ..., xm is given by c1 x1 + c2 x2 + ... + cm xm.
- For the purpose of constructing principal components, the vector of coefficients is restricted to have unit length, i.e., c1^2 + c2^2 + ... + cm^2 = 1.

72. Principal Components (continued)
- The first principal component is the linear combination of the variables that has maximum variation across the observations in the data set.
- The jth principal component is the linear combination of the variables that has maximum variation across the observations, subject to the constraint that its vector of coefficients is orthogonal to the coefficient vectors of principal components 1, ..., j-1.

73. The Simple Data Example
[Figure: scatterplot of the simple example data, x2 vs. x1.]

74. The First Principal Component Axis
[Figure: the first principal component axis drawn through the scatterplot.]

75. The First Principal Components
- The first PC for a point is the signed distance between the projection of the point onto the first PC axis and the origin.

76. The Second Principal Component Axis
[Figure: the second principal component axis.]

77. The Second Principal Component
- The second PC for a point is the signed distance between the projection of the point onto the second PC axis and the origin.

78. Plot of PC1 vs. PC2
[Figure: the simple example data plotted in principal component coordinates.]

79. Compare the PC plot to the plot of the original data.
- Because there are only two variables here, the plot of PC2 vs. PC1 is just a rotation of the original plot.

80.
- There is more to be gained when the number of variables is greater than 2.
- Consider the principal components for the 400 significant genes from our two-color microarray experiment.
- Our data matrix has n = 400 rows and m = 5 columns.
- We have looked at these data using parallel coordinate plots. What would it look like if we projected the data points to 2 dimensions?

81-84. Projection of Two-Color Array Data with 11-Medoid Clustering
[Figures: the 400 genes plotted on pairs of principal components (PC1 vs. PC2, PC1 vs. PC3, and so on), with each gene labeled by its cluster: a = 1, b = 2, c = 3, d = 4, e = 5, f = 6, g = 7, h = 8, i = 9, j = 10, k = 11.]
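A low-dimensional view like the projections above can be sketched with prcomp(); `expr` and `fit` are the hypothetical data matrix and pam() fit from the earlier sketches, and the centering and scaling choices are illustrative.

```r
## Sketch only: project the genes onto their first principal components and
## label each point with its cluster (letters a-k for clusters 1-11).
pc <- prcomp(expr, center = TRUE, scale. = FALSE)

plot(pc$x[, 1], pc$x[, 2], type = "n", xlab = "PC1", ylab = "PC2",
     main = "Projection of two-color array data")
text(pc$x[, 1], pc$x[, 2], labels = letters[fit$clustering], cex = 0.7)

## Other pairs of components, e.g. PC1 vs. PC3, use pc$x[, c(1, 3)] the same way.
```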
85. Hierarchical Clustering Methods
- Hierarchical clustering methods build a nested sequence of clusters that can be displayed using a dendrogram.
- We will begin with some simple illustrations and then move on to a more general discussion.

86. The Simple Example Data with Observation Numbers
[Figure: scatterplot of the simple example data with each point labeled by its observation number.]

87. Dendrogram for the Simple Example Data: Tree Structure
- A dendrogram is a tree made up of nodes. The root node is at the top, and the terminal nodes, or leaves, correspond to the individual objects.
- Each internal node is a parent node with two daughter nodes below it; daughter nodes with the same parent are sister nodes.

88. A Hierarchical Clustering of the Simple Example Data
[Figure: scatterplot of the data (x1, x2) alongside the corresponding dendrogram.]
- The hierarchy consists of clusters within clusters within clusters.

89. Dendrogram for the Simple Example Data
- The height of a node represents the dissimilarity between the two clusters merged together at the node.
- In the figure, the two clusters merged at the marked node have a dissimilarity of about 1.75.

90. The appearance of a dendrogram is not unique.
- Any two sister nodes could trade places without changing the meaning of the dendrogram.
- Thus, observation 14 appearing next to observation 7 does not imply that these objects are similar.

91. The appearance of a dendrogram is not unique.
- By convention, R dendrograms show the lower sister node on the left, with ties broken by observation number (e.g., 13 is to the left of 14).

92. The appearance of a dendrogram is not unique.
- The lengths of the branches leading to terminal nodes have no particular meaning in R dendrograms.

93-96. Cutting the tree at a given height corresponds to a partitioning of the data into k clusters.
[Figures: the dendrogram cut to give k = 2, 3, 4, and 10 clusters.]

97. Agglomerative (Bottom-Up) Hierarchical Clustering
- Define a measure of distance between any two clusters. (An individual object is considered a cluster of size one.)
- Find the two nearest clusters and merge them together to form a new cluster.
- Repeat until all objects have been merged into a single cluster.

98. Common Measures of Between-Cluster Distance
- Single linkage (a.k.a. nearest neighbor): the distance between any two clusters A and B is the minimum of all distances from an object in cluster A to an object in cluster B.
- Complete linkage (a.k.a. farthest neighbor): the distance between any two clusters A and B is the maximum of all distances from an object in cluster A to an object in cluster B.

99. Common Measures of Between-Cluster Distance (continued)
- Average linkage: the distance between any two clusters A and B is the average of all distances from an object in cluster A to an object in cluster B.
- Centroid linkage: the distance between any two clusters A and B is the distance between the centroids of clusters A and B. (The centroid of a cluster is the componentwise average of the objects in the cluster.)

100. Agglomerative Clustering Using Average Linkage for the Simple Example Data Set
[Figure: scatterplot of the data and the average-linkage dendrogram, with the merges labeled A through P.]

101. Agglomerative Clustering Using Average Linkage for the Simple Example Data Set
- The merges, in order: A. 1-2; B. 9-10; C. 3-4; D. 5-6; E. 7-(5,6); F. 13-14; G. 11-12; H. (1,2)-(3,4); I. (9,10)-(11,12); and so on through merge P.
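None of the dendrogram slides show code; a minimal R sketch of agglomerative clustering for the simple example data, assuming they sit in a hypothetical two-column matrix `xy`, would be:

```r
## Sketch only: agglomerative hierarchical clustering and dendrogram cutting.
d <- dist(xy)                          # Euclidean distances between objects

hc <- hclust(d, method = "average")    # also "single", "complete", "centroid"
plot(hc, main = "Average linkage")     # dendrogram (lower sister node on the left)

groups <- cutree(hc, k = 3)            # cut the tree into k = 3 clusters
rect.hclust(hc, k = 3)                 # outline the k = 3 clusters on the plot

## Note: hclust() with method = "centroid" is normally applied to squared
## Euclidean distances, and the resulting tree can show the inversions
## (non-monotone merges) discussed below.
```

The same pattern with method = "complete" and cutree(hc, k = 2) covers the complete-linkage exercise that appears later in the slides.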
102-103. Agglomerative Clustering Using Single and Complete Linkage for the Simple Example Data Set
[Figures: single-linkage and complete-linkage dendrograms.]

104. Agglomerative Clustering Using Centroid Linkage for the Simple Example Data Set
- Centroid linkage is not monotone, in the sense that later cluster merges can involve clusters that are more similar to each other than earlier merges.

105. Agglomerative Clustering Using Centroid Linkage for the Simple Example Data Set
- The merge between 4 and (1,2,3,5) creates a cluster whose centroid is closer to the (6,7) centroid than 4 was to the centroid of (1,2,3,5).

106-109. Agglomerative Clustering of the Two-Color Microarray Data Set
[Figures: dendrograms using single, complete, average, and centroid linkage.]

110. Which Between-Cluster Distance is Best?
- It depends, of course, on what is meant by "best".
- Single linkage tends to produce long, stringy clusters.
- Complete linkage produces compact, spherical clusters but might result in some objects that are closer to objects in clusters other than their own (see the next example).
- Average linkage is a compromise between single and complete linkage.
- Centroid linkage is not monotone.

111. Exercise
1. Conduct agglomerative hierarchical clustering for this data set using Euclidean distance and complete linkage.
2. Display your results using a dendrogram.
3. Identify the k = 2 clustering using your results.

112. Results of Complete-Linkage Clustering
[Figures: the complete-linkage dendrogram and the resulting k = 2 clusters.]

113. Divisive (Top-Down) Hierarchical Clustering
- Start with all of the data in one cluster and divide it into two clusters (using, e.g., 2-means or 2-medoids clustering).
- At each subsequent step, choose one of the existing clusters and divide it into two clusters.
- Repeat until there are n clusters, each containing a single object.

114. Potential Problem with Divisive Clustering
[Figure: an example data set illustrating the problem.]

115. Macnaughton-Smith et al. (1965)
1. Start with all objects in one cluster A.
2. Find the object with the largest average dissimilarity to all other objects in A and move that object to a new cluster B.
3. Find the object in cluster A whose average dissimilarity to the other objects in cluster A minus its average dissimilarity to the objects in cluster B is maximum. If this difference is positive, move the object to cluster B.
4. Repeat step 3 until no objects satisfying step 3 are found.
5. Apply steps 1 through 4 repeatedly to one of the existing clusters (e.g., the one with the largest average within-cluster dissimilarity) until n clusters of 1 object each are obtained.
(A code sketch of the splitting step appears after slide 122 below.)

116-120. Macnaughton-Smith Divisive Clustering
[Figures: step-by-step illustration of the split, with objects moving from cluster A to cluster B.]
- Next, continue to split each of these clusters until each object is in a cluster by itself.

121. Dendrogram for the Macnaughton-Smith Approach
[Figure: the resulting dendrogram.]

122. Agglomerative vs. Divisive Clustering
- Divisive clustering has not been studied as extensively as agglomerative clustering.
- Divisive clustering may be preferred if only a small number of large clusters is desired.
- Agglomerative clustering may be preferred if a large number of small clusters is desired.
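The splitting step of the Macnaughton-Smith procedure (steps 1-4 on slide 115) is easy to express directly. The sketch below is written from that description rather than taken from the slides; it assumes `d` is a symmetric dissimilarity matrix with a zero diagonal, e.g., as.matrix(dist(xy)) for the hypothetical matrix `xy` used earlier.

```r
## Sketch only: one Macnaughton-Smith "splinter" split of a cluster into A and B.
ms_split <- function(d) {
  A <- seq_len(nrow(d))
  ## Step 2: move the object with the largest average dissimilarity
  ## to the other objects in A into a new cluster B.
  avg_A <- sapply(A, function(i) mean(d[i, setdiff(A, i)]))
  B <- A[which.max(avg_A)]
  A <- setdiff(A, B)
  ## Steps 3-4: keep moving the object whose average dissimilarity to the rest
  ## of A exceeds its average dissimilarity to B by the largest positive amount.
  while (length(A) > 1) {
    gain <- sapply(A, function(i) mean(d[i, setdiff(A, i)]) - mean(d[i, B]))
    if (max(gain) <= 0) break
    move <- A[which.max(gain)]
    B <- c(B, move)
    A <- setdiff(A, move)
  }
  list(A = A, B = B)
}

## Example: one split of the hypothetical simple example data
## ms_split(as.matrix(dist(xy)))
```

Repeating this split on the cluster with the largest average within-cluster dissimilarity (step 5) gives the full divisive hierarchy; the diana() function in the cluster package implements a closely related divisive algorithm.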