Mining Decision Trees from Data Streams

Tong Suk Man Ivy
CSIS DB Seminar, February 12, 2003

Contents

- Introduction: problems in mining data streams
- Classification of stream data
- VFDT algorithm
- Window approach
- CVFDT algorithm
- Experimental results
- Conclusions
- Future work

Data Streams

Characteristics:
- Large volume of ordered data points, possibly infinite
- Arrive continuously
- Fast changing

An appropriate model for many applications:
- Phone call records
- Network and security monitoring
- Financial applications (stock exchange)
- Sensor networks

Problems in Mining Data Streams

Traditional data mining techniques usually require:
- the entire data set to be present
- random access (or multiple passes) to the data
- much time per data item

Challenges of stream mining:
- Impractical to store the whole stream
- Random access is expensive
- Only simple calculations per item are possible, due to time and space constraints

Classification of Stream Data

- VFDT algorithm: "Mining High-Speed Data Streams", KDD 2000. Pedro Domingos, Geoff Hulten.
- CVFDT algorithm (window approach): "Mining Time-Changing Data Streams", KDD 2001. Geoff Hulten, Laurie Spencer, Pedro Domingos.

Hoeffding Trees

Definitions

A classification problem is defined as follows:
- N is a set of training examples of the form (x, y)
- x is a vector of d attributes
- y is a discrete class label

Goal: produce from the examples a model y = f(x) that predicts the classes y of future examples x with high accuracy.

Decision Tree Learning

- One of the most effective and widely used classification methods
- Induces models in the form of decision trees
- Each internal node contains a test on an attribute
- Each branch from a node corresponds to a possible outcome of the test
- Each leaf contains a class prediction
- A decision tree is learned by recursively replacing leaves with test nodes, starting at the root

Challenges

- Classic decision tree learners assume all training data can be stored simultaneously in main memory
- Disk-based decision tree learners repeatedly read the training data from disk sequentially, which is prohibitively expensive when learning complex trees
- Goal: design decision tree learners that read each example at most once and use a small constant time to process it

Key Observation

To find the best attribute at a node, it may be sufficient to consider only a small subset of the training examples that pass through that node. Given a stream of examples, use the first ones to choose the root attribute. Once the root attribute is chosen, the succeeding examples are passed down to the corresponding leaves and used to choose the attributes there, and so on recursively. The Hoeffding bound decides how many examples are enough at each node.

Hoeffding Bound

Consider a random variable a whose range is R. Suppose we have n observations of a, with sample mean ā. The Hoeffding bound states that, with probability 1 - δ, the true mean of a is at least ā - ε, where

    ε = sqrt( R² ln(1/δ) / (2n) )

How Many Examples Are Enough?

Let G(Xi) be the heuristic measure used to choose test attributes (e.g. information gain or the Gini index). After seeing n examples, let Xa be the attribute with the highest evaluation value and Xb the attribute with the second highest, and let ΔG = G(Xa) - G(Xb) ≥ 0 be the observed difference. Given a desired δ, if ΔG > ε after seeing n examples at a node, the Hoeffding bound guarantees that the true ΔG > 0 with probability 1 - δ. The node can then be split using Xa, and the succeeding examples will be passed down to the new leaves.
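A minimal sketch of this test in Python (the function name and the sample numbers are mine, not the papers'; for information gain over c classes the range is R = log2(c), so R = 1 in the two-class case):

    import math

    def hoeffding_epsilon(value_range, delta, n):
        # epsilon = sqrt(R^2 * ln(1/delta) / (2n))
        return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

    # Example: two-class information gain (R = 1), delta = 1e-7, n = 1000
    # gives epsilon of roughly 0.09, so an observed gap
    # G(Xa) - G(Xb) = 0.30 - 0.18 = 0.12 > epsilon justifies splitting on Xa.
    eps = hoeffding_epsilon(1.0, 1e-7, 1000)
    assert (0.30 - 0.18) > eps

Note how ε shrinks as n grows: a node that has seen few examples demands a large gap before committing, while a heavily visited node can split on a small but statistically reliable one.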
Algorithm

- Calculate the information gain of the attributes and determine the best two, Xa and Xb
- Pre-pruning: also consider a "null" attribute that consists of not splitting the node
- At each node, check the condition ΔG > ε
- If the condition is satisfied, create child nodes based on the test at the node
- If not, stream in more examples and repeat the calculation until the condition is satisfied

Performance Analysis

Let p be the probability that an example passed through the tree to level i falls into a leaf at that point. The expected disagreement between the tree produced by the Hoeffding tree algorithm and the tree produced using infinite examples at each node is no greater than δ/p. Required memory is O(leaves × attributes × values × classes).

VFDT

VFDT (Very Fast Decision Tree)

A decision-tree learning system based on the Hoeffding tree algorithm, with several refinements (sketched in code below):
- Ties: it is wasteful to keep waiting to decide between nearly identical attributes, so split on the current best attribute once ε falls below a user-specified threshold
- Compute G and check the split condition only periodically, rather than after every example
- Memory management: memory is dominated by the sufficient statistics, so deactivate or drop the less promising leaves when needed
- Bootstrap with a traditional learner
- Rescan old data when time is available

VFDT (2)

- Scales better than pure memory-based or pure disk-based learners
- Accesses data sequentially
- Uses subsampling to potentially require much less than one scan
- VFDT is incremental and anytime: new examples are quickly incorporated as they arrive, and a usable model is available after the first few examples and is then progressively refined

Experiment Results (VFDT vs. C4.5)
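To tie together the "Algorithm" and "VFDT (Very Fast Decision Tree)" slides above, here is a condensed Python sketch of the per-example loop, assuming discrete attributes; the Leaf class, the grace_period parameter, and the helper names are my own illustrations, not code from the VFDT papers:

    import math
    from collections import defaultdict

    class Leaf:
        """One growing leaf; nijk holds the sufficient statistics (the count
        of class k for value j of attribute i), which is what dominates
        VFDT's memory: O(attributes * values * classes) per leaf."""
        def __init__(self, n_attrs):
            self.n_attrs = n_attrs          # assumes n_attrs >= 2, all discrete
            self.nijk = defaultdict(int)    # (attr, value, cls) -> count
            self.n = 0                      # examples seen at this leaf

        def update(self, x, y):
            self.n += 1
            for i, v in enumerate(x):
                self.nijk[(i, v, y)] += 1

    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c)

    def info_gain(leaf, attr):
        # G(X) = H(class) - expected H(class) after splitting on attr,
        # computed from the leaf's counts alone -- no stored examples.
        by_cls, by_val = defaultdict(int), defaultdict(lambda: defaultdict(int))
        for (i, v, k), c in leaf.nijk.items():
            if i == attr:
                by_cls[k] += c
                by_val[v][k] += c
        rem = sum(sum(d.values()) / leaf.n * entropy(list(d.values()))
                  for d in by_val.values())
        return entropy(list(by_cls.values())) - rem

    def maybe_split(leaf, delta, value_range, tau, grace_period=200):
        # Check only every grace_period examples (the periodic G computation);
        # split when Xa clearly beats Xb, or when eps < tau flags a tie.
        if leaf.n == 0 or leaf.n % grace_period != 0:
            return None
        gains = sorted(((info_gain(leaf, i), i) for i in range(leaf.n_attrs)),
                       reverse=True)
        (ga, xa), (gb, _) = gains[0], gains[1]
        eps = math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * leaf.n))
        if ga - gb > eps or eps < tau:
            return xa                       # attribute index to split on
        return None

In the full system, a successful split replaces the leaf with a test node and fresh child leaves, and the less promising leaves are deactivated when memory runs short, as described on the VFDT slides.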