Missing data is quite common in real-world datasets. When some predictor values are missing, there are several ways to keep prediction accuracy up without discarding the entire observation. This example shows how decision trees with surrogate splits can be used to improve prediction accuracy in the presence of missing data. A surrogate split is a backup split on a different predictor that mimics the primary split, so a tree can still route an observation when the primary predictor is missing.
rng(5); % For reproducibility
load ionosphere;
labels = unique(Y);
cv = cvpartition(Y,'holdout',0.3);
Xtrain = X(training(cv),:);
Ytrain = Y(training(cv));
Xtest = X(test(cv),:);
Ytest = Y(test(cv));
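Bagging (bootstrap aggregating) is an ensemble approach that trains several weak learners, each on a bootstrap resample of the training data, and combines their predictions to create a strong classifier. Here two bagged ensembles of 150 classification trees are trained, one without and one with surrogate splits.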
% Classification Tree is chosen as the learner
mdl1 = ClassificationTree.template('NVarToSample','all');
RF1 = fitensemble(Xtrain,Ytrain,'Bag',150,mdl1,'type','classification');

% Classification Tree with surrogate splits is chosen as the learner
mdl2 = ClassificationTree.template('NVarToSample','all','surrogate','on');
RF2 = fitensemble(Xtrain,Ytrain,'Bag',150,mdl2,'type','classification');
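To make the "bootstrap" part of bagging more concrete, here is a minimal sketch of how one bootstrap resample might be drawn for a single tree. It is only an illustration of the idea, not what fitensemble does internally; the training-set size `n` and the variable names are assumptions made up for this sketch.

```matlab
% Sketch of one bootstrap resample (illustrative only).
% n is a hypothetical training-set size, not taken from the example above.
rng(5);                               % seed for reproducibility
n = 245;                              % hypothetical number of training rows

idx = randi(n, n, 1);                 % draw n row indices WITH replacement
inBag = unique(idx);                  % distinct rows that made it into the sample
oobFraction = 1 - numel(inBag) / n;   % share of rows left out ("out of bag")

fprintf('Distinct rows in bag: %d of %d\n', numel(inBag), n);
fprintf('Out-of-bag fraction:  %.3f\n', oobFraction);
```

Each tree in a bagged ensemble sees a different resample like `idx`, which is what makes the trees differ from one another. On average a bit more than a third of the rows are left out of any given resample.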
Suppose about half of the values in the test set, chosen at random, are missing:
Xtest(rand(size(Xtest))>0.5) = NaN;
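The masking line above can be sanity-checked on a stand-in matrix. This is a small sketch under assumed dimensions (105 rows by 34 columns, chosen only for illustration); `Xdemo` and `fracMissing` are hypothetical names, not part of the original example.

```matlab
% Sketch: check how many entries the random mask actually removes.
% Xdemo is a made-up stand-in matrix; its size is an assumption.
rng(5);                                   % seed for reproducibility
Xdemo = rand(105, 34);                    % stand-in for a numeric test matrix

mask = rand(size(Xdemo)) > 0.5;           % each entry is selected with probability 0.5
Xdemo(mask) = NaN;                        % selected entries become missing

fracMissing = mean(isnan(Xdemo(:)));      % overall share of missing entries
rowsAllMissing = sum(all(isnan(Xdemo), 2)); % rows with no observed value at all

fprintf('Fraction missing: %.3f\n', fracMissing);
fprintf('Rows with every value missing: %d\n', rowsAllMissing);
```

Because each entry is masked independently, the missing fraction is close to, but not exactly, one half, and the missing values are spread across all predictors rather than concentrated in a few columns.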
y_pred1 = predict(RF1,Xtest);
confmat1 = confusionmat(Ytest,y_pred1);
y_pred2 = predict(RF2,Xtest);
confmat2 = confusionmat(Ytest,y_pred2);
disp('Confusion Matrix - without surrogates')
disp(confmat1)
disp('Confusion Matrix - with surrogates')
disp(confmat2)
Confusion Matrix - without surrogates
    67     1
    24    13
Confusion Matrix - with surrogates
    65     3
     4    33
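The two confusion matrices are easier to compare once they are reduced to a couple of numbers. The sketch below recomputes overall accuracy and the second-row recall from the matrices printed above; the values are typed in by hand from that output, and the variable names `acc1`, `acc2`, `recall1` and `recall2` are just illustrative.

```matlab
% Sketch: summarize the printed confusion matrices (rows = true class,
% columns = predicted class, as returned by confusionmat).
confmat1 = [67 1; 24 13];   % without surrogates, copied from the output above
confmat2 = [65 3;  4 33];   % with surrogates, copied from the output above

% Overall accuracy = correctly classified (diagonal) / all test observations
acc1 = sum(diag(confmat1)) / sum(confmat1(:));
acc2 = sum(diag(confmat2)) / sum(confmat2(:));

% Recall for the class in row 2 = correct in row 2 / all observations in row 2
recall1 = confmat1(2,2) / sum(confmat1(2,:));
recall2 = confmat2(2,2) / sum(confmat2(2,:));

fprintf('Accuracy without surrogates: %.4f\n', acc1);    % 0.7619
fprintf('Accuracy with surrogates:    %.4f\n', acc2);    % 0.9333
fprintf('Row-2 recall without surrogates: %.4f\n', recall1); % 0.3514
fprintf('Row-2 recall with surrogates:    %.4f\n', recall2); % 0.8919
```

Most of the gain comes from the second row: without surrogates, 24 of its 37 observations are misclassified, while with surrogates only 4 are.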
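The plot below shows the cumulative test classification error as trees are added to each ensemble. An error that decreases as the number of trees grows indicates good performance.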
figure
subplot(2,2,1:2)
plot(loss(RF1,Xtest,Ytest,'mode','cumulative'),'LineWidth',3);
hold on;
plot(loss(RF2,Xtest,Ytest,'mode','cumulative'),'r','LineWidth',3);
legend('Regular trees','Trees with surrogate splits');
xlabel('Number of trees');
ylabel('Test classification error','FontSize',12);
subplot(2,2,3)
[hImage, hText, hXText] = heatmap(confmat1, labels, labels, 1,'Colormap','red','ShowAllTicks',1);
title('Confusion Matrix - without surrogates')
subplot(2,2,4)
heatmap(confmat2, labels, labels, 1,'Colormap','red','ShowAllTicks',1);
title('Confusion Matrix - with surrogates')