SUPERVISED VERSUS UNSUPERVISED METHODS

METHODOLOGY FOR SUPERVISED MODELING

BIAS–VARIANCE TRADE-OFF

CLASSIFICATION TASK

k-NEAREST NEIGHBOR ALGORITHM

DISTANCE FUNCTION

COMBINATION FUNCTION

QUANTIFYING ATTRIBUTE RELEVANCE: STRETCHING THE AXES

DATABASE CONSIDERATIONS

k-NEAREST NEIGHBOR ALGORITHM FOR ESTIMATION AND PREDICTION

CHOOSING k

**SUPERVISED VERSUS UNSUPERVISED METHODS**

Data mining methods may be categorized as either supervised or unsupervised. In unsupervised methods, no target variable is identified as such. Instead, the data mining algorithm searches for patterns and structure among all the variables. The most common unsupervised data mining method is clustering, our topic in Chapters 8 and 9. For example, political consultants may analyze congressional districts using clustering methods to uncover the locations of voter clusters that may be responsive to a particular candidate’s message. In this case, all appropriate variables (e.g., income, race, gender) would be input to the clustering algorithm, with no target variable specified, in order to develop accurate voter profiles for fund-raising and advertising purposes.

Another data mining method, which may be supervised or unsupervised, is association rule mining. In market basket analysis, for example, one may simply be interested in “which items are purchased together,” in which case no target variable would be identified. The problem here, of course, is that there are so many items for sale that searching for all possible associations may present a daunting task, due to the resulting combinatorial explosion. Nevertheless, certain algorithms, such as the Apriori algorithm, attack this problem cleverly, as we shall see when we cover association rule mining in Chapter 10.

Most data mining methods are supervised methods, however, meaning that (1) there is a particular prespecified target variable, and (2) the algorithm is given many examples where the value of the target variable is provided, so that the algorithm may learn which values of the target variable are associated with which values of the predictor variables. For example, the regression methods of Chapter 4 are supervised methods, since the observed values of the response variable y are provided to the least-squares algorithm, which seeks to minimize the squared distance between these y values and the y values predicted given the x-vector. All of the classification methods we examine in Chapters 5 to 7 are supervised methods, including decision trees, neural networks, and k-nearest neighbors.
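The contrast can be made concrete with a minimal Python sketch. Everything in it is an illustrative assumption and does not come from the text: the function names `count_item_pairs` and `fit_least_squares`, the toy baskets, and the toy (x, y) values are all invented. The first function has no target variable and simply counts which items occur together; the second cannot run without the observed y values, which is what makes it supervised. The last line shows how quickly the number of candidate item pairs grows.

```python
from collections import Counter
from itertools import combinations
from math import comb


def count_item_pairs(baskets):
    """Unsupervised: count how often each pair of items is bought together.

    No target variable is involved; every item is treated alike.
    """
    pair_counts = Counter()
    for basket in baskets:
        # Sort so that (bread, milk) and (milk, bread) count as the same pair.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return pair_counts


def fit_least_squares(xs, ys):
    """Supervised: fit y = intercept + slope * x by minimizing squared error.

    The observed target values ys must be supplied alongside the predictor xs.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical toy data, invented for illustration only.
baskets = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["milk", "eggs"],
]
pair_counts = count_item_pairs(baskets)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # constructed to lie exactly on y = 1 + 2x
intercept, slope = fit_least_squares(xs, ys)

# Number of candidate item pairs in a store that sells 1,000 items.
pairs_for_1000_items = comb(1000, 2)
```

Even restricted to pairs, a catalog of 1,000 items yields 499,500 candidate associations, and larger item sets grow far faster. That growth is the combinatorial explosion mentioned above.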

**METHODOLOGY FOR SUPERVISED MODELING**

Most supervised data mining methods apply the following methodology for building and evaluating a model. First, the algorithm is provided with a training set of data, which includes the preclassified values of the target variable in addition to the predictor variables. For example, if we are interested in classifying income bracket, based on age, gender, and occupation, our classification algorithm would need a large pool of records, containing information that is as complete as possible about every field, including the target field, income bracket. In other words, the records in the training set need to be preclassified. A provisional data mining model is then constructed using the training samples provided in the training data set.

However, the training set is necessarily incomplete; that is, it does not include the “new” or future data that the data modelers are really interested in classifying. Therefore, the algorithm needs to guard against “memorizing” the training set and blindly applying all patterns found in the training set to the future data. For example, it may happen that all customers named “David” in a training set are in the high-income bracket. We would presumably not want our final model, to be applied to new data, to include the pattern “If the customer’s first name is David, the customer has a high income.” Such a pattern is a spurious artifact of the training set and needs to be verified before deployment.

Therefore, the next step in supervised data mining methodology is to examine how the provisional data mining model performs on a test set of data. In the test set, a holdout data set, the values of the target variable are hidden temporarily from the provisional model, which then performs classification according to the patterns and structure it learned from the training set. The efficacy of the classifications is then evaluated by comparing them against the true values of the target variable. The provisional data mining model is then adjusted to minimize the error rate on the test set.
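The “David” problem can be illustrated with a small Python sketch. It is a deliberately naive toy, not the book's procedure: the records, the function names (`train_name_memorizer`, `classify`, `error_rate`), and the choice to use first name as the only predictor are all assumptions made for illustration. The provisional model memorizes the income bracket seen for each first name in the training set, and the error rate is then measured on both the training set and a holdout test set whose target values are hidden during classification.

```python
from collections import Counter


def train_name_memorizer(training_set):
    """Build a provisional model that memorizes the bracket seen for each name.

    Deliberately naive: it reproduces the spurious "David implies high income"
    pattern that a holdout test set is meant to expose.
    """
    counts_by_name = {}
    for name, bracket in training_set:
        counts_by_name.setdefault(name, Counter())[bracket] += 1
    name_to_bracket = {
        name: counts.most_common(1)[0][0]
        for name, counts in counts_by_name.items()
    }
    # Fall back to the most common bracket for names never seen in training.
    default = Counter(b for _, b in training_set).most_common(1)[0][0]
    return name_to_bracket, default


def classify(model, name):
    """Predict an income bracket from the first name alone."""
    name_to_bracket, default = model
    return name_to_bracket.get(name, default)


def error_rate(model, labeled_set):
    """Classify each record without its target, then compare with the truth."""
    wrong = sum(1 for name, true in labeled_set if classify(model, name) != true)
    return wrong / len(labeled_set)


# Hypothetical preclassified records: (first name, income bracket).
training_set = [
    ("David", "high"), ("David", "high"), ("Maria", "low"), ("Chen", "low"),
    ("Aisha", "high"), ("Tom", "low"), ("Lena", "low"),
]
# Holdout records the provisional model never saw during training.
test_set = [
    ("David", "low"), ("David", "high"), ("Maria", "high"), ("Omar", "low"),
]

model = train_name_memorizer(training_set)
train_error = error_rate(model, training_set)
test_error = error_rate(model, test_set)
print(f"training error: {train_error:.2f}, test error: {test_error:.2f}")
```

On this toy data the memorized patterns fit the training set perfectly but misclassify half of the holdout records. A gap of this kind between training and test error is the signal that a provisional model has picked up spurious artifacts.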

From: Discovering Knowledge in Data: An Introduction to Data Mining (Wiley-Interscience)
