Data Mining: Concepts and Techniques (3rd ed.), Chapter 10
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University
© 2011 Han, Kamber & Pei. All rights reserved.

Chapter 10. Cluster Analysis: Basic Concepts and Methods
- Cluster Analysis: Basic Concepts
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Evaluation of Clustering
- Summary

What Is Cluster Analysis?
- Cluster: a collection of data objects
  - similar (or related) to one another within the same group
  - dissimilar (or unrelated) to the objects in other groups
- Cluster analysis (also called clustering or data segmentation): finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
- Unsupervised learning: no predefined classes (i.e., learning by observation, as opposed to supervised learning by examples)
- Typical applications
  - As a stand-alone tool to gain insight into the data distribution
  - As a preprocessing step for other algorithms

Clustering for Data Understanding and Applications
- Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus, and species
- Information retrieval: document clustering
- Land use: identification of areas of similar land use in an Earth observation database
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- City planning: identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continental faults
- Climate: understanding Earth's climate; finding patterns in atmospheric and ocean data
- Economic science: market research

Clustering as a Preprocessing Tool (Utility)
- Summarization: preprocessing for regression, PCA, classification, and association analysis
- Compression: image processing, e.g., vector quantization
- Finding k-nearest neighbors: localizing the search to one or a small number of clusters
- Outlier detection: outliers are often viewed as those "far away" from any cluster

Quality: What Is Good Clustering?
- A good clustering method will produce high-quality clusters with
  - high intra-class similarity: cohesive within clusters
  - low inter-class similarity: distinctive between clusters
- The quality of a clustering method depends on
  - the similarity measure used by the method,
  - its implementation, and
  - its ability to discover some or all of the hidden patterns

Measure the Quality of Clustering
- Dissimilarity/similarity metric
  - Similarity is expressed in terms of a distance function, typically a metric d(i, j)
  - The definitions of distance functions are usually rather different for interval-scaled, Boolean, categorical, ordinal, ratio, and vector variables
  - Weights should be associated with different variables based on the application and data semantics
- Quality of clustering
  - There is usually a separate "quality" function that measures the "goodness" of a cluster
  - It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective

Considerations for Cluster Analysis
- Partitioning criteria: single-level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
- Separation of clusters: exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
- Similarity measure: distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)
- Clustering space: full space (often when low-dimensional) vs. subspaces (often in high-dimensional clustering)
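To make the distance function d(i, j) from the "Measure the Quality of Clustering" slide concrete, here is a minimal sketch, not from the original slides, of the Minkowski family of metrics for numeric vectors; the function name and the test vectors are illustrative choices.

```python
import numpy as np

def minkowski(x, y, p=2):
    """Minkowski distance between two numeric vectors.
    p=2 gives the Euclidean metric, p=1 the Manhattan metric."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

i, j = [1.0, 2.0], [4.0, 6.0]
print(minkowski(i, j, p=2))  # Euclidean: 5.0
print(minkowski(i, j, p=1))  # Manhattan: 7.0
```

Both satisfy the metric properties (non-negativity, symmetry, triangle inequality); which one is appropriate, and how variables are weighted, depends on the application and data semantics as noted above.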
Requirements and Challenges
- Scalability: clustering all the data instead of only samples
- Ability to deal with different types of attributes: numerical, binary, categorical, ordinal, linked, and mixtures of these
- Constraint-based clustering
  - The user may give inputs in the form of constraints
  - Use domain knowledge to determine input parameters
- Interpretability and usability
- Others
  - Discovery of clusters with arbitrary shape
  - Ability to deal with noisy data
  - Incremental clustering and insensitivity to input order
  - High dimensionality

Major Clustering Approaches (I)
- Partitioning approach: construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
  - Typical methods: k-means, k-medoids, CLARANS
- Hierarchical approach: create a hierarchical decomposition of the set of data (or objects) using some criterion
  - Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
- Density-based approach: based on connectivity and density functions
  - Typical methods: DBSCAN, OPTICS, DenClue
- Grid-based approach: based on a multiple-level granularity structure
  - Typical methods: STING, WaveCluster, CLIQUE

Major Clustering Approaches (II)
- Model-based: a model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to the given model
  - Typical methods: EM, SOM, COBWEB
- Frequent pattern-based: based on the analysis of frequent patterns
  - Typical methods: p-Cluster
- User-guided or constraint-based: clustering by considering user-specified or application-specific constraints
  - Typical methods: COD (obstacles), constrained clustering
- Link-based clustering: objects are often linked together in various ways, and massive links can be used to cluster objects
  - Typical methods: SimRank, LinkClus

Chapter 10. Cluster Analysis: Basic Concepts and Methods
- Cluster Analysis: Basic Concepts
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Evaluation of Clustering
- Summary

Partitioning Algorithms: Basic Concept
- Partitioning method: partition a database D of n objects into a set of k clusters such that the sum of squared distances is minimized, where c_i is the centroid or medoid of cluster C_i:
  E = Σ_{i=1}^{k} Σ_{p ∈ C_i} d(p, c_i)²
- Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
  - Globally optimal: exhaustively enumerate all partitions
  - Heuristic methods: the k-means and k-medoids algorithms
  - k-means (MacQueen 1967; Lloyd 1957/1982): each cluster is represented by the center of the cluster
  - k-medoids or PAM (Partitioning Around Medoids; Kaufman & Rousseeuw 1987): each cluster is represented by one of the objects in the cluster

The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four steps (a runnable sketch follows the example below):
1. Partition the objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., the mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2; stop when the assignment no longer changes

An Example of K-Means Clustering
[Figure: with K = 2, the objects are arbitrarily partitioned into k groups, the cluster centroids are updated, objects are reassigned to their nearest centroid, and the loop continues until assignments stabilize.]
- Partition the objects into k nonempty subsets
- Repeat
  - Compute the centroid (i.e., mean point) of each partition
  - Assign each object to the cluster of its nearest centroid
- Until no change
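The following is a minimal NumPy sketch of the four-step loop above (Lloyd's algorithm). It is not from the slides: the function name and the initialization by sampling k seed objects (a common variant of the slides' "arbitrary partition" in Step 1) are illustrative choices.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal k-means: assign each object to its nearest centroid,
    recompute centroids as cluster means, repeat until stable."""
    rng = np.random.default_rng(seed)
    # Initialization variant: pick k distinct objects as initial seed points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each object to the cluster of the nearest seed point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean point of its cluster
        # (keep the old centroid if a cluster becomes empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the assignment (hence the centroids) no longer changes
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example: two well-separated 2-D blobs, K = 2
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)  # roughly [0, 0] and [5, 5]
```

Each iteration costs O(kn) distance computations, giving the O(tkn) total cost discussed on the next slide; the result depends on the initial seeds, which is why the algorithm often terminates at a local optimum.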
Comments on the K-Means Method
- Strength: efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations; normally k, t << n
  - For comparison: PAM is O(k(n-k)²) and CLARA is O(ks² + k(n-k))
- Comment: often terminates at a local optimum
- Weaknesses
  - Applicable only to objects in a continuous n-dimensional space; use the k-modes method for categorical data (in comparison, k-medoids can be applied to a wide range of data)
  - Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k; see Hastie et al., 2009)
  - Sensitive to noisy data and outliers
  - Not suitable for discovering clusters with non-convex shapes

Variations of the K-Means Method
- Most variants of k-means differ in
  - the selection of the initial k means
  - the dissimilarity calculations
  - the strategies for calculating cluster means
- Handling categorical data: k-modes
  - Replaces the means of clusters with modes
  - Uses new dissimilarity measures to deal with categorical objects
  - Uses a frequency-based method to update the modes of clusters
- A mixture of categorical and numerical data: the k-prototype method

What Is the Problem of the K-Means Method?
- The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data
- K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster

PAM: A Typical K-Medoids Algorithm
[Figure: with K = 2, a small 2-D data set illustrates one PAM iteration; the total cost is 20 before a candidate swap and 26 after, so the swap is rejected.]
1. Arbitrarily choose k objects as the initial medoids
2. Assign each remaining object to its nearest medoid
3. Randomly select a non-medoid object, O_random
4. Compute the total cost of swapping a medoid O with O_random
5. Swap O and O_random if the quality is improved
6. Repeat until no change

The K-Medoids Clustering Method
- K-medoids clustering: find representative objects (medoids) in clusters
- PAM (Partitioning Around Medoids; Kaufmann & Rousseeuw, 1987)
  - Starts from an initial set of medoids and iteratively replaces one of the medoids with one of the non-medoids if doing so improves the total distance of the resulting clustering
  - Works effectively for small data sets but does not scale well to large data sets (due to its computational complexity)
- Efficiency improvements on PAM
  - CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
  - CLARANS (Ng & Han, 1994): randomized re-sampling

Chapter 10. Cluster Analysis: Basic Concepts and Methods
- Cluster Analysis: Basic Concepts
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Evaluation of Clustering
- Summary

Hierarchical Clustering
- Uses a distance matrix as the clustering criterion
- Does not require the number of clusters k as an input, but needs a termination condition

AGNES (Agglomerative Nesting)
- Introduced in Kaufmann and Rousseeuw (1990)
- Implemented in statistical packages, e.g., S-Plus
- Uses the single-link method and the dissimilarity matrix
- Merges the nodes that have the least dissimilarity
- Continues in a non-descending fashion
- Eventually all nodes belong to the same cluster
(A runnable single-link sketch appears at the end of this section.)

Dendrogram: Shows How Clusters Are Merged
- Decomposes the data objects into several levels of nested partitionings (a tree of clusters), called a dendrogram
- A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster

DIANA (Divisive Analysis)
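Complementing the AGNES and dendrogram slides above, here is a minimal sketch, not from the slides, of single-link agglomerative clustering and a dendrogram cut using SciPy's hierarchy module; the toy data and the choice of cutting into two clusters are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data: two loose groups
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])

# Single-link agglomerative clustering, as in AGNES: repeatedly merge
# the two clusters with the least dissimilarity until one cluster remains.
Z = linkage(X, method="single", metric="euclidean")

# Cut the dendrogram at the level that yields 2 clusters; each
# connected component below the cut becomes a cluster.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g., [1 1 1 2 2 2]
```

The linkage matrix Z encodes the full merge tree, so the same run can be cut at any level, which is exactly the flexibility the dendrogram slide describes: no k is needed until the cut is chosen.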