07 FPAdvanced
Data Mining: Concepts and Techniques (3rd ed.), Chapter 7
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University
© 2011 Han, Kamber & Pei. All rights reserved.

Chapter 7: Advanced Frequent Pattern Mining
- Pattern Mining: A Road Map
- Pattern Mining in Multi-Level, Multi-Dimensional Space
- Constraint-Based Frequent Pattern Mining
- Mining High-Dimensional Data and Colossal Patterns
- Mining Compressed or Approximate Patterns
- Pattern Exploration and Application
- Summary

Research on Pattern Mining: A Road Map
(road-map figure: taxonomy of research on pattern mining)

Chapter 7: Advanced Frequent Pattern Mining
- Pattern Mining: A Road Map
- Pattern Mining in Multi-Level, Multi-Dimensional Space
  - Mining Multi-Level Association
  - Mining Multi-Dimensional Association
  - Mining Quantitative Association Rules
  - Mining Rare Patterns and Negative Patterns
- Constraint-Based Frequent Pattern Mining
- Mining High-Dimensional Data and Colossal Patterns
- Mining Compressed or Approximate Patterns
- Pattern Exploration and Application
- Summary

Mining Multiple-Level Association Rules
- Items often form hierarchies
- Flexible support settings: items at the lower level are expected to have lower support
- Exploration of shared multi-level mining (Agrawal & Srikant, VLDB'95; Han & Fu, VLDB'95)

Multi-Level Association: Flexible Support and Redundancy Filtering
- Flexible min-support thresholds: some items are more valuable but less frequent
  - Use non-uniform, group-based min-support
  - E.g., diamond, watch, camera: 0.05%; bread, milk: 5%
- Redundancy filtering: some rules may be redundant due to "ancestor" relationships between items
  - milk => wheat bread [support = 8%, confidence = 70%]
  - 2% milk => wheat bread [support = 2%, confidence = 72%]
  - The first rule is an ancestor of the second rule
  - A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor
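The redundancy test above can be sketched as follows. The tolerance band and the assumption that 2% milk accounts for about a quarter of milk sales are illustrative, not taken from the slides:

```python
def is_redundant(child_support, ancestor_support, child_share, tol=0.25):
    """A descendant rule is redundant if its support is close to the
    'expected' support: the ancestor rule's support scaled by the
    descendant item's share of its parent item's sales."""
    expected = ancestor_support * child_share
    return abs(child_support - expected) <= tol * expected

# "milk => wheat bread" has 8% support; if 2% milk is ~25% of milk sales,
# "2% milk => wheat bread" is expected at 0.08 * 0.25 = 2% support,
# so its observed 2% support carries no new information.
print(is_redundant(child_support=0.02, ancestor_support=0.08, child_share=0.25))
```

With a support far from the expected value (say 6%), the same check reports the descendant rule as informative rather than redundant.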
Mining Multi-Dimensional Association
- Single-dimensional rules: buys(X, "milk") => buys(X, "bread")
- Multi-dimensional rules: 2 or more dimensions or predicates
  - Inter-dimension association rules (no repeated predicates): age(X, "19-25") ∧ occupation(X, "student") => buys(X, "coke")
  - Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ∧ buys(X, "popcorn") => buys(X, "coke")
- Categorical attributes: finite number of possible values, no ordering among values — data cube approach
- Quantitative attributes: numeric, implicit ordering among values — discretization, clustering, and gradient approaches

Mining Quantitative Associations
- Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
  - Static discretization based on predefined concept hierarchies (data cube methods)
  - Dynamic discretization based on data distribution (quantitative rules, e.g., Agrawal & Srikant, SIGMOD'96)
  - Clustering: distance-based association (e.g., Yang & Miller, SIGMOD'97) — one-dimensional clustering, then association
  - Deviation (e.g., Aumann & Lindell, KDD'99): Sex = female => Wage: mean = $7/hr (overall mean = $9)

Static Discretization of Quantitative Attributes
- Discretized prior to mining using a concept hierarchy: numeric values are replaced by ranges
- In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans
- A data cube is well suited for mining: the cells of an n-dimensional cuboid correspond to the predicate sets, and mining from data cubes can be much faster
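Static discretization as described above can be sketched in a few lines; the age ranges below are a hypothetical concept hierarchy chosen to match the "19-25" range used in the rules:

```python
# Map a numeric age onto predefined hierarchy ranges so that mining can
# treat the attribute as categorical. Bin boundaries are illustrative.
AGE_BINS = [(0, 18, "0-18"), (19, 25, "19-25"), (26, 40, "26-40"), (41, 120, "41+")]

def discretize_age(age):
    for lo, hi, label in AGE_BINS:
        if lo <= age <= hi:
            return label
    raise ValueError(f"age out of range: {age}")

print(discretize_age(21))  # -> "19-25"
```

After this replacement, a predicate such as age(X, "19-25") behaves exactly like any other categorical predicate during mining.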
Quantitative Association Rules Based on Statistical Inference Theory (Aumann & Lindell, DMKD'03)
- Goal: find extraordinary, and therefore interesting, phenomena, e.g., (Sex = female) => Wage: mean = $7/hr (overall mean = $9)
  - LHS: a subset of the population; RHS: an extraordinary behavior of this subset
- The rule is accepted only if a statistical test (e.g., Z-test) confirms the inference with high confidence
- Subrule: highlights the extraordinary behavior of a subset of the population of the super rule, e.g., (Sex = female) ∧ (South = yes) => mean wage = $6.3/hr
- Two forms of rules: categorical => quantitative, or quantitative => quantitative, e.g., Education in 14-18 (yrs) => mean wage = $11.64/hr
- Open problem: efficient methods for an LHS containing two or more quantitative attributes

Negative and Rare Patterns
- Rare patterns: very low support but interesting, e.g., buying Rolex watches
  - Mining: set individual-based or special group-based support thresholds for valuable items
- Negative patterns: since it is unlikely that one buys a Ford Expedition (an SUV) and a Toyota Prius (a hybrid car) together, Ford Expedition and Toyota Prius are likely negatively correlated patterns
- Negatively correlated patterns that are infrequent tend to be more interesting than those that are frequent

Defining Negatively Correlated Patterns (I)
- Definition 1 (support-based): if itemsets X and Y are both frequent but rarely occur together, i.e., sup(X ∪ Y) << sup(X) × sup(Y), then X and Y are negatively correlated
- Where is the problem? Null transactions, i.e., the support-based definition is not null-invariant!
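The null-invariance problem can be demonstrated numerically. The setup below is the "needle package" example referred to on the next slide; the specific counts (items A and B each in 100 transactions, co-occurring in exactly 1) are an assumption, since the slide's worked numbers were lost in extraction:

```python
# Why Definition 1 is not null-invariant: only the total number of
# transactions N (i.e., the number of null transactions) changes below,
# yet the support-based verdict flips.
def support_based_negative(n_total, n_a=100, n_b=100, n_ab=1):
    s_a, s_b, s_ab = n_a / n_total, n_b / n_total, n_ab / n_total
    return s_ab < s_a * s_b  # "X and Y rarely occur together"

print(support_based_negative(200))      # True:  0.005 < 0.25
print(support_based_negative(100_000))  # False: 1e-5 > 1e-6
```

Adding null transactions changed nothing about how A and B relate, yet the definition stops calling them negatively correlated.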
Defining Negatively Correlated Patterns (II)
- Definition 2 (negative-itemset-based):
  - X is a negative itemset if (1) X = Ā ∪ B, where B is a set of positive items and Ā is a set of negative items, |Ā| >= 1, and (2) s(X) >= μ
  - Itemset X is negatively correlated if …
  - This definition suffers a similar null-invariance problem
- Definition 3 (Kulczynski-measure-based): if itemsets X and Y are frequent but (P(X|Y) + P(Y|X))/2 <= ε, where ε is a negative-pattern threshold, then X and Y are negatively correlated
  - Ex.: for the same needle-package problem, no matter whether there are 200 or 10^5 transactions, if ε = 0.01 we have (P(A|B) + P(B|A))/2 = (0.01 + 0.01)/2 <= ε

Constraint-Based (Query-Directed) Mining
- Finding all the patterns in a database autonomously? Unrealistic! The patterns could be too many, and not focused
- Data mining should be an interactive process: the user directs what is to be mined using a data mining query language (or a graphical user interface)
- Constraint-based mining
  - User flexibility: the user provides constraints on what is to be mined
  - Optimization: the system explores such constraints for efficient mining — constraint-pushing, similar to pushing selections first in DB query processing
  - Note: we still find all the answers satisfying the constraints, not some answers as in "heuristic search"

Constraints in Data Mining
- Knowledge type constraint: classification, association, etc.
- Data constraint — using SQL-like queries: find product pairs sold together in stores in Chicago this year
- Dimension/level constraint: in relevance to region, price, brand, customer category
- Rule (or pattern) constraint: small sales (price < $200)
- Interestingness constraint: strong rules (min_support >= 3%, min_confidence >= 60%)
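The Kulczynski measure in Definition 3 is null-invariant: it depends only on the counts of A, B, and their co-occurrence, never on the total transaction count. A minimal sketch, using the same assumed needle-package counts as before:

```python
# Kulczynski measure: (P(A|B) + P(B|A)) / 2.
# Note that the total number of transactions N does not appear at all,
# so null transactions cannot change the verdict.
def kulczynski(n_a, n_b, n_ab):
    return (n_ab / n_b + n_ab / n_a) / 2

eps = 0.01  # negative-pattern threshold
k = kulczynski(n_a=100, n_b=100, n_ab=1)
print(k, k <= eps)  # same result whether N = 200 or N = 100000
```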
Meta-Rule Guided Mining
- A meta-rule can be in rule form with partially instantiated predicates and constants: P1(X, Y) ∧ P2(X, W) => buys(X, "iPad")
- The resulting rule derived can be: age(X, "15-25") ∧ profession(X, "student") => buys(X, "iPad")
- In general, it can be in the form P1 ∧ P2 ∧ … ∧ Pl => Q1 ∧ Q2 ∧ … ∧ Qr
- Method to find meta-rules:
  - Find frequent (l + r)-predicate sets (based on a min-support threshold)
  - Push constants deeply into the mining process when possible (see the discussion of constraint-pushing techniques)
  - Use confidence, correlation, and other measures when possible

Constraint-Based Frequent Pattern Mining
- Pattern space pruning constraints:
  - Anti-monotonic: if constraint c is violated, further mining can be terminated
  - Monotonic: if c is satisfied, no need to check c again
  - Succinct: c must be satisfied, so one can start with the data sets satisfying c
  - Convertible: c is neither monotonic nor anti-monotonic, but it can be converted into one if the items in a transaction can be properly ordered
- Data space pruning constraints:
  - Data succinct: the data space can be pruned at the initial pattern mining process
  - Data anti-monotonic: if a transaction t does not satisfy c, t can be pruned from further mining

Pattern Space Pruning with Anti-Monotonicity Constraints
- A constraint C is anti-monotone if, whenever a pattern satisfies C, all of its sub-patterns do too
- In other words: if an itemset S violates the constraint, so does any of its supersets
- Ex. 1: sum(S.price) <= v is anti-monotone
- Ex. 2: range(S.profit) <= 15 is anti-monotone — itemset ab violates C, and so does every superset of ab
- Ex. 3: sum(S.price) >= v is not anti-monotone
- Ex. 4: support count is anti-monotone — the core property used in Apriori
(transaction table TDB, min_sup = 2)

Pattern Space Pruning with Monotonicity Constraints
- A constraint C is monotone if, whenever a pattern satisfies C, we do not need to check C in subsequent mining
- Alternatively: if an itemset S satisfies the constraint, so does any of its supersets
- Ex. 1: sum(S.price) >= v is monotone
- Ex. 2: min(S.price) <= v is monotone
- Ex. 3: C: range(S.profit) >= 15 — itemset ab satisfies C, and so does every superset of ab
(transaction table TDB, min_sup = 2)
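Pushing an anti-monotone constraint into Apriori-style candidate generation can be sketched as below. The item prices and budget are illustrative, not the deck's TDB, and only the constraint check is shown (support counting is elided):

```python
from itertools import combinations

# Illustrative prices; constraint sum(S.price) <= 60 is anti-monotone:
# once an itemset exceeds the budget, every superset does too.
PRICE = {"a": 10, "b": 20, "c": 30, "d": 40}
BUDGET = 60

def candidates(items, k):
    out = []
    for combo in combinations(sorted(items), k):
        if sum(PRICE[i] for i in combo) <= BUDGET:
            out.append(combo)
        # else: pruned — no superset can ever satisfy the constraint
    return out

print(candidates("abcd", 2))  # ('c', 'd') costs 70 and is pruned
```

Because the violated candidate is dropped at generation time, none of its supersets are ever generated or counted, which is exactly the "terminate further mining" behavior described above.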
Data Space Pruning with Data Anti-Monotonicity
- A constraint c is data anti-monotone if, when a pattern p cannot satisfy a transaction t under c, no superset of p can satisfy t under c either
- The key to data anti-monotonicity is recursive data reduction
- Ex. 1: sum(S.price) >= v is data anti-monotone
- Ex. 2: min(S.price) <= v is data anti-monotone
- Ex. 3: C: range(S.profit) >= 25 is data anti-monotone — in itemset {b, c}'s projected DB (T10: {d, f, h}; T20: {d, f, g, h}; T30: {d, f, g}), since C cannot be satisfied in T10, T10 can be pruned
(transaction table TDB, min_sup = 2)

Pattern Space Pruning with Succinctness
- Succinctness: given A1, the set of items satisfying a succinctness constraint C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1
- Idea: whether an itemset S satisfies constraint C can be determined from the selection of items, without looking at the transaction database
- min(S.price) <= v is succinct; sum(S.price) >= v is not succinct
- Optimization: if C is succinct, C is pre-counting pushable

Apriori + Constraint
(figure: Apriori run on database D — scans producing C1, L1, C2, L2, C3, L3 — under the constraint sum(S.price) < 5)

Constrained Apriori: Push a Succinct Constraint Deep
(figure: the same Apriori run with the constraint min(S.price) <= 1, which is not immediately to be used)

Constrained FP-Growth: Push a Succinct Constraint Deep
(figure: constraint min(S.price) <= 1 — remove infrequent length-1 items, build the FP-tree and the 1-projected DB; no need to project on 2, 3, or 5)

Constrained FP-Growth: Push a Data Anti-Monotonic Constraint Deep
(figure: constraint min(S.price) <= 1 — remove items from the data, build the FP-tree; a single branch remains, so we are done)

Constrained FP-Growth: Push a Data Anti-Monotonic Constraint Deep
(figure: constraint range(S.price) > 25, min_sup = 2 — FP-tree, B-projected DB, B's FP-tree, recursive data pruning; a single branch bcdfg:2 remains)

Convertible Constraints: Ordering Data in Transactions
- Convert tough constraints into anti-monotone or monotone ones by properly ordering the items
- Examine C: avg(S.profit) >= 25
- Order items in value-descending order
- If an itemset afb violates C, so do afbh and afb* — C becomes anti-monotone!
(transaction table TDB, min_sup = 2)
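The value-descending trick can be checked in code. The profit values below are illustrative (the deck's actual TDB profit table was lost in extraction); the key invariant is that, in descending order, extending a prefix can only append smaller values, so a failing average can never recover:

```python
# Hypothetical profits, listed so that descending order is a, f, g, d, b, h.
PROFIT = {"a": 40, "f": 30, "g": 20, "d": 10, "b": 0, "h": -10}

def prune_point(items, threshold=25):
    """Return the longest prefix (in profit-descending order) whose running
    average still satisfies avg(profit) >= threshold. Once the average dips
    below the threshold, the constraint behaves anti-monotonically and we stop."""
    ordered = sorted(items, key=lambda i: PROFIT[i], reverse=True)
    prefix, total = [], 0
    for item in ordered:
        total += PROFIT[item]
        if total / (len(prefix) + 1) < threshold:
            break  # every extension appends smaller values -> cannot recover
        prefix.append(item)
    return prefix

print(prune_point("afgdbh"))  # -> ['a', 'f', 'g', 'd']
```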
Strongly Convertible Constraints
- avg(X) >= 25 is convertible anti-monotone w.r.t. the item-value-descending order R: if an itemset af violates constraint C, so does every itemset with af as a prefix, such as afd
- avg(X) >= 25 is convertible monotone w.r.t. the item-value-ascending order R^-1: if an itemset d satisfies constraint C, so do the itemsets df and dfa, which have d as a prefix
- Thus, avg(X) >= 25 is strongly convertible

Can Apriori Handle Convertible Constraints?
- A convertible constraint that is neither monotone, anti-monotone, nor succinct cannot be pushed deep into an Apriori mining algorithm
- Within the level-wise framework, no direct pruning based on the constraint can be made
  - Itemset df violates constraint C: avg(X) >= 25
  - Since adf satisfies C, Apriori needs df to assemble adf, so df cannot be pruned
- But the constraint can be pushed into the frequent-pattern growth framework!

Pattern Space Pruning with Convertible Constraints
- C: avg(X) >= 25, min_sup = 2
- List the items in every transaction in value-descending order R: C is convertible anti-monotone w.r.t. R
- Scan TDB once to remove infrequent items: item h is dropped; itemsets a and f are good
- Projection-based mining: by imposing an appropriate order on item projection, many tough constraints can be converted into (anti-)monotone ones
(transaction table TDB, min_sup = 2)

Handling Multiple Constraints
- Different constraints may require different, or even conflicting, item orderings
- If there exists an order R such that both C1 and C2 are convertible w.r.t. R, then there is no conflict between the two convertible constraints
- If the item orders conflict: try to satisfy one constraint first, then use the order for the other constraint to mine frequent itemsets in the corresponding projected databases

Constraint-Based Mining: A General Picture
(summary table lost in extraction)

What Constraints Are Convertible?
(summary table lost in extraction)
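The df/adf argument above can be verified numerically; the profit values here are illustrative:

```python
# Why level-wise Apriori cannot prune on avg(X) >= 25 directly:
# a violating itemset can still extend to a satisfying superset,
# so the violator must be kept as a building block.
PROFIT = {"a": 40, "d": 10, "f": 30}

def satisfies(itemset, threshold=25):
    return sum(PROFIT[i] for i in itemset) / len(itemset) >= threshold

print(satisfies("df"))   # False: avg = 20, violates C
print(satisfies("adf"))  # True:  avg ~ 26.7 -> df cannot be pruned
```

This is exactly the asymmetry the ordering trick removes: in value-descending order, a can only appear *before* d and f, so a failing prefix really is a dead end.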
Mining Colossal Frequent Patterns
- F. Zhu, X. Yan, J. Han, P. S. Yu, and H. Cheng, "Mining Colossal Frequent Patterns by Core Pattern Fusion", ICDE'07
- We have many algorithms, but can we mine large (i.e., colossal) patterns, say of size around 50 to 100? Unfortunately, not!
- Why not? The curse of the "downward closure" of frequent patterns
  - The "downward closure" property: any sub-pattern of a frequent pattern is frequent
  - Example: if (a1, a2, …, a100) is frequent, then a1, a2, …, a100, (a1, a2), (a1, a3), …, (a1, a100), (a1, a2, a3), … are all frequent! There are about 2^100 such frequent itemsets!
- Whether we use breadth-first search (e.g., Apriori) or depth-first search (FP-growth), we have to examine that many patterns
- Thus the downward closure property leads to an explosion!

- Closed/maximal patterns may partially alleviate the problem but do not really solve it: we often need to mine scattered large patterns!
- Let the minimum support threshold = 20. There are frequent patterns of size 20. Each is
