Analyzing the red wine dataset with Python

In an earlier analysis I used random forest regression, with data standardization and hyperparameter tuning. Here I use a random forest classifier to perform binary classification of good versus not-so-good wines.

First, import the packages:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

Load the data:

```python
data = pd.read_csv("winequality-red.csv")
data.head()
data.describe()
```

The twelve columns are: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality. Every column has 1599 values, so there are no missing values. Let's see whether there are duplicates:

```python
extra = data[data.duplicated()]
extra.shape
```

There are 240 duplicate rows, but we will not drop them for now, because the quality ratings were given by different tasters.

Data visualization

```python
sns.set()
data.hist(figsize=(10, 10), color="red")
plt.show()
```

Only quality is a discrete variable, concentrated mainly at 5 and 6. Next, look at the correlations between the variables:

```python
colormap = plt.cm.viridis
plt.figure(figsize=(12, 12))
plt.title("Correlation of Features", y=1.05, size=15)
sns.heatmap(data.astype(float).corr(), linewidths=0.1, vmax=1.0,
            square=True, cmap=colormap, linecolor="white", annot=True)
```

Observation: alcohol has the strongest correlation with wine quality, followed by the various acidities, sulphates, density, and chlorides.

Using a classifier: split the wines into two groups, counting quality > 5 as "good wine".

```python
y = data.quality                  # set quality as target
X = data.drop("quality", axis=1)  # the rest are features
print(y.shape, X.shape)           # check correctness

# Create a new binary target y1
y1 = (y > 5).astype(int)
y1.head()

# Plot its histogram
ax = y1.plot.hist(color="green")
ax.set_title("Wine quality distribution", fontsize=14)
ax.set_xlabel("aggregated target value")
```

Training a prediction model with a random forest classifier

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, log_loss
from sklearn.metrics import confusion_matrix
```

Split the data into training and test sets:

```python
seed = 8  # set seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y1, test_size=0.2, random_state=seed)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Train and evaluate the random forest classifier with cross-validation:

```python
# Instantiate the Random Forest Classifier
RF_clf = RandomForestClassifier(random_state=seed)
RF_clf

# Compute k-fold cross-validation on the training set
# and look at the mean accuracy score
cv_scores = cross_val_score(RF_clf, X_train, y_train, cv=10,
                            scoring="accuracy")
print("The accuracy scores for the iterations are {}".format(cv_scores))
print("The mean accuracy score is {}".format(cv_scores.mean()))
```

Run the prediction:

```python
RF_clf.fit(X_train, y_train)
pred_RF = RF_clf.predict(X_test)

# Print 5 results to see
for i in range(0, 5):
    print("Actual wine quality is", y_test.iloc[i],
          "and predicted is", pred_RF[i])
```

Among the first five, there is one error. Let's look at the metrics:

```python
print(accuracy_score(y_test, pred_RF))
print(log_loss(y_test, pred_RF))
print(confusion_matrix(y_test, pred_RF))
```

There are 81 misclassifications in total. Compared with a logistic regression classifier, the random forest classifier performs better. Let's tune the random forest classifier's hyperparameters:

```python
from sklearn.model_selection import GridSearchCV

grid_values = {"n_estimators": [50, 100, 200],
               "max_depth": [None, 30, 15, 5],
               "max_features": ["auto", "sqrt", "log2"],
               "min_samples_leaf": [1, 20, 50, 100]}
grid_RF = GridSearchCV(RF_clf, param_grid=grid_values,
                       scoring="accuracy")
grid_RF.fit(X_train, y_train)
grid_RF.best_params_
```

Apart from the number of estimators, the recommended values are the defaults:

```python
RF_clf = RandomForestClassifier(n_estimators=100, random_state=seed)
RF_clf.fit(X_train, y_train)
pred_RF = RF_clf.predict(X_test)
print(accuracy_score(y_test, pred_RF))
print(log_loss(y_test, pred_RF))
print(confusion_matrix(y_test, pred_RF))
```

With hyperparameter tuning, the RF classifier's accuracy improves to 82.5%, and the log-loss drops correspondingly. The number of misclassifications also falls to 56. Using the random forest classifier as a basic recommender that labels a red wine "recommended" (quality 6 and above) or "not recommended" (quality 5 and below), a prediction accuracy of 82.5% seems reasonable.
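Beyond the correlation heatmap, a fitted random forest offers a complementary view of which inputs matter through its `feature_importances_` attribute. This is a minimal sketch of that idea, not part of the original analysis: the two-column synthetic dataset (an "alcohol"-like signal column and a noise column) is invented for illustration and does not come from winequality-red.csv.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(8)
n = 300

# Synthetic stand-in data: "alcohol" genuinely drives the label,
# "chlorides" is pure noise (hypothetical values, for illustration only)
X = pd.DataFrame({
    "alcohol": rng.normal(10.0, 1.0, n),
    "chlorides": rng.normal(0.08, 0.02, n),
})
y = (X["alcohol"] + rng.normal(0.0, 0.5, n) > 10.0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=8)
clf.fit(X, y)

# Importances sum to 1; higher means the feature split more impurity away
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

On the real wine data, the same two lines applied to the fitted `RF_clf` would rank all eleven physicochemical features at once, which can corroborate (or challenge) the correlation-based observation that alcohol matters most.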