Analyzing the red wine dataset with Python

In an earlier analysis I used random forest regression, with data standardization and hyperparameter tuning. Here I use a random forest classifier to perform binary classification of good versus not-so-good wines.

First, import the packages:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

Load the data:

```python
data = pd.read_csv("winequality-red.csv")
data.head()
data.describe()
```

The twelve columns are: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality. Every column has 1599 values, so there are no missing values. Let's see whether there are duplicates:

```python
extra = data[data.duplicated()]
extra.shape
```

There are 240 duplicate rows, but we will not drop them for now, because the quality ratings were given by different tasters.

Data visualization

```python
sns.set()
data.hist(figsize=(10, 10), color="red")
plt.show()
```

Only quality is a discrete variable, concentrated mainly at 5 and 6. Next, look at the correlations between the variables:

```python
colormap = plt.cm.viridis
plt.figure(figsize=(12, 12))
plt.title("Correlation of Features", y=1.05, size=15)
sns.heatmap(data.astype(float).corr(), linewidths=0.1, vmax=1.0,
            square=True, cmap=colormap, linecolor="white", annot=True)
```

Observation: alcohol has the strongest correlation with wine quality, followed by the various acidities, sulphates, density, and chlorides.

Using a classifier: split the wines into two groups, counting quality > 5 as "good wine".

```python
y = data.quality                  # set quality as target
X = data.drop("quality", axis=1)  # the rest are features
print(y.shape, X.shape)           # check correctness

# Create a new binary target y1
y1 = (y > 5).astype(int)
y1.head()

# Plot its histogram
ax = y1.plot.hist(color="green")
ax.set_title("Wine quality distribution", fontsize=14)
ax.set_xlabel("aggregated target value")
```

Training a prediction model with a random forest classifier

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, log_loss
from sklearn.metrics import confusion_matrix
```

Split the data into training and test sets:

```python
seed = 8  # set seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y1, test_size=0.2, random_state=seed)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Train and evaluate the random forest classifier with cross-validation:

```python
# Instantiate the Random Forest Classifier
RF_clf = RandomForestClassifier(random_state=seed)
RF_clf

# Compute k-fold cross-validation on the training set
# and look at the mean accuracy score
cv_scores = cross_val_score(RF_clf, X_train, y_train, cv=10,
                            scoring="accuracy")
print("The accuracy scores for the iterations are {}".format(cv_scores))
print("The mean accuracy score is {}".format(cv_scores.mean()))
```

Run the prediction:

```python
RF_clf.fit(X_train, y_train)
pred_RF = RF_clf.predict(X_test)

# Print 5 results to see
for i in range(0, 5):
    print("Actual wine quality is", y_test.iloc[i],
          "and predicted is", pred_RF[i])
```

Among the first five, there is one error. Let's look at the metrics:

```python
print(accuracy_score(y_test, pred_RF))
print(log_loss(y_test, pred_RF))
print(confusion_matrix(y_test, pred_RF))
```

There are 81 misclassifications in total. Compared with a logistic regression classifier, the random forest classifier performs better. Let's tune the random forest classifier's hyperparameters:

```python
from sklearn.model_selection import GridSearchCV

grid_values = {"n_estimators": [50, 100, 200],
               "max_depth": [None, 30, 15, 5],
               "max_features": ["auto", "sqrt", "log2"],
               "min_samples_leaf": [1, 20, 50, 100]}
grid_RF = GridSearchCV(RF_clf, param_grid=grid_values,
                       scoring="accuracy")
grid_RF.fit(X_train, y_train)
grid_RF.best_params_
```

Apart from the number of estimators, the recommended values are the defaults:

```python
RF_clf = RandomForestClassifier(n_estimators=100, random_state=seed)
RF_clf.fit(X_train, y_train)
pred_RF = RF_clf.predict(X_test)
print(accuracy_score(y_test, pred_RF))
print(log_loss(y_test, pred_RF))
print(confusion_matrix(y_test, pred_RF))
```

With hyperparameter tuning, the RF classifier's accuracy improves to 82.5%, and the log-loss drops correspondingly. The number of misclassifications also falls to 56. Using the random forest classifier as a basic recommender that labels a red wine "recommended" (quality 6 and above) or "not recommended" (quality 5 and below), a prediction accuracy of 82.5% seems reasonable.
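Beyond the correlation heatmap, a fitted random forest offers a complementary view of which inputs matter through its `feature_importances_` attribute. This is a minimal sketch of that idea, not part of the original analysis: the two-column synthetic dataset (an "alcohol"-like signal column and a noise column) is invented for illustration and does not come from winequality-red.csv.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(8)
n = 300

# Synthetic stand-in data: "alcohol" genuinely drives the label,
# "chlorides" is pure noise (hypothetical values, for illustration only)
X = pd.DataFrame({
    "alcohol": rng.normal(10.0, 1.0, n),
    "chlorides": rng.normal(0.08, 0.02, n),
})
y = (X["alcohol"] + rng.normal(0.0, 0.5, n) > 10.0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=8)
clf.fit(X, y)

# Importances sum to 1; higher means the feature split more impurity away
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

On the real wine data, the same two lines applied to the fitted `RF_clf` would rank all eleven physicochemical features at once, which can corroborate (or challenge) the correlation-based observation that alcohol matters most.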