1. Histogram of Predicted Probabilities
We can inspect the distribution of a model's predicted probabilities by plotting a histogram.
The histogram bins the samples by their predicted probability along the x-axis, and plots the number (or density, when normalized) of samples falling into each bin along the y-axis.
The code is as follows:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression as LR
import matplotlib.pyplot as plt

data = load_breast_cancer()
X = data.data
print(X.shape)  # (569, 30)
y = data.target
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, test_size=0.3, random_state=420)

fig, ax1 = plt.subplots(figsize=(8, 6))
estimators = [GaussianNB().fit(Xtrain, Ytrain)
              , LR(solver='lbfgs', max_iter=5000, multi_class='auto').fit(Xtrain, Ytrain)
              , SVC(kernel='rbf', probability=True).fit(Xtrain, Ytrain)
              ]
name = ['GaussianNB', 'LogisticRegression', 'SVC']
for i, estimator in enumerate(estimators):
    # predicted probability of class 0 for every test sample
    proba = estimator.predict_proba(Xtest)[:, 0]
    ax1.hist(proba
             , bins=10
             , label=name[i]
             , histtype="step"  # draw only the outline of the bars, no fill
             , lw=2             # line width of the bar outlines
             , density=True     # normalize counts to a density
             )
ax1.set_xlabel("Predicted probability of class 0")
ax1.set_ylabel("Density")
ax1.set_xticks([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
ax1.legend()
plt.show()
The resulting histogram is:
2. Calibration Reliability Curves
Probability outputs are usually corrected (calibrated) by fitting a calibration regression on top of the classifier's scores.
Two kinds of calibration regression are commonly used: a parametric method based on Platt's sigmoid model, and a non-parametric method based on isotonic regression (isotonic calibration). Probability calibration must be performed on data the model has not seen during training, i.e. on a held-out set; the sketch below illustrates the idea behind the sigmoid (Platt) method.
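To make Platt's idea concrete, here is a minimal sketch, not the exact procedure sklearn runs internally: it fits a one-dimensional logistic regression on an SVC's decision_function scores computed on a held-out calibration split. The split and all variable names are chosen just for this illustration.

# A minimal sketch of the idea behind Platt (sigmoid) calibration,
# assuming a held-out calibration split; illustration only.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
Xtrain, Xcalib, Ytrain, Ycalib = train_test_split(X, y, test_size=0.3, random_state=420)

svc = SVC(kernel='rbf').fit(Xtrain, Ytrain)             # uncalibrated classifier
scores = svc.decision_function(Xcalib).reshape(-1, 1)   # raw scores on data the model has not seen

# Platt scaling: fit a sigmoid, P(y=1) = 1 / (1 + exp(-(A*score + B))), to the held-out labels
platt = LogisticRegression().fit(scores, Ycalib)
calibrated = platt.predict_proba(scores)[:, 1]           # calibrated probabilities of class 1
print(calibrated[:5])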
In sklearn, probability calibration is usually done with this class:
class sklearn.calibration.CalibratedClassifierCV(base_estimator=None, method='sigmoid', cv='warn')
base_estimator: the classifier whose output needs to be calibrated. It must expose a predict_proba or decision_function interface; if cv='prefit', the classifier must already have been fitted to the data.
cv: determines the cross-validation strategy. Possible inputs are:
None: use the default 3-fold cross-validation.
Any integer: the number of folds. For integer and None inputs, if the task is binary classification, sklearn.model_selection.StratifiedKFold is used automatically to split the folds; if y is continuous, sklearn.model_selection.KFold is used instead.
'prefit': assume the classifier has already been fitted to the data.
method: the calibration method, with two possible values:
'sigmoid': calibrate using Platt's sigmoid model.
'isotonic': calibrate using isotonic regression. (Isotonic regression is not recommended when the calibration sample is small, i.e. no more than about 1000 test samples, because it tends to overfit.)
This is a probability calibration class with built-in cross-validation. Using a cross-validation generator, for each split it estimates the model parameters on the training portion and calibrates the probabilities on the test portion, then returns the best set of parameter estimates and calibration results. The predicted probabilities from the different splits are averaged.
Note that CalibratedClassifierCV has no decision_function interface; to inspect the probabilities produced by the calibrated model, you must call its predict_proba interface, as in the short sketch below.
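For reference, here is a minimal usage sketch of CalibratedClassifierCV under the same assumptions as before (breast cancer data, an RBF SVC as the base estimator, names chosen for illustration). It shows both the internal cross-validation path and the cv='prefit' path, with probabilities read through predict_proba.

# Minimal usage sketch of CalibratedClassifierCV; data split and estimator are for illustration.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV

X, y = load_breast_cancer(return_X_y=True)
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, test_size=0.3, random_state=420)

# Path 1: let CalibratedClassifierCV run the cross-validation itself
calibrated = CalibratedClassifierCV(SVC(kernel='rbf'), method='sigmoid', cv=10)
calibrated.fit(Xtrain, Ytrain)
print(calibrated.predict_proba(Xtest)[:5])   # predict_proba is the only probability interface

# Path 2: calibrate an already fitted classifier on held-out data (cv='prefit')
svc = SVC(kernel='rbf').fit(Xtrain, Ytrain)
prefit_cali = CalibratedClassifierCV(svc, method='sigmoid', cv='prefit')
prefit_cali.fit(Xtest, Ytest)                # ideally a separate calibration set, not the final test set
print(prefit_cali.predict_proba(Xtest)[:5])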
The probability calibration code is:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression as LR
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
import pandas as pd
from sklearn.metrics import brier_score_loss
from sklearn.calibration import CalibratedClassifierCV

def plot_calib(estimators, names, Xtrain, Xtest, Ytrain, Ytest):
    # column 0 of the one-hot encoding marks the samples whose true class is 0,
    # matching the class-0 probabilities taken from predict_proba below
    Ytest_onehot = pd.get_dummies(Ytest).astype(int)
    fig, ax = plt.subplots(figsize=(8, 5))
    for i, estimator in enumerate(estimators):
        estimator = estimator.fit(Xtrain, Ytrain)
        prob = estimator.predict_proba(Xtest)[:, 0]           # predicted probability of class 0
        trueproba, predproba = calibration_curve(Ytest_onehot[0], prob, n_bins=5)
        bls = brier_score_loss(Ytest, prob, pos_label=0)      # Brier score with class 0 as positive
        ax.plot(predproba, trueproba, 'o-', label="{}'s brier_score_loss:{:.4f}".format(names[i], bls))
    ax.set_xlabel('Mean predicted probability')
    ax.set_ylabel('Fraction of positives (class 0)')
    ax.plot([0, 1], [0, 1], '--', label='Perfectly calibrated')
    ax.legend()
    fig.tight_layout()
    plt.show()

# Probability calibration
if __name__ == '__main__':
    data = load_breast_cancer()
    X = data.data
    y = data.target
    Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, test_size=0.3, random_state=420)

    GaussianNB_cali = [GaussianNB()
                       , CalibratedClassifierCV(GaussianNB(), method='sigmoid', cv=10)
                       , CalibratedClassifierCV(GaussianNB(), method='isotonic', cv=10)
                       ]
    LR_cali = [LR(solver='lbfgs', max_iter=5000, multi_class='auto')
               , CalibratedClassifierCV(LR(solver='lbfgs', max_iter=5000, multi_class='auto'), method='sigmoid', cv=10)
               , CalibratedClassifierCV(LR(solver='lbfgs', max_iter=5000, multi_class='auto'), method='isotonic', cv=10)
               ]
    SVC_cali = [SVC(kernel='rbf', probability=True)
                , CalibratedClassifierCV(SVC(kernel='rbf'), method='sigmoid', cv=10)
                , CalibratedClassifierCV(SVC(kernel='rbf'), method='isotonic', cv=10)
                ]
    estimators_cali = [GaussianNB_cali, LR_cali, SVC_cali]
    GaussianNB_name = ['GaussianNB', 'GaussianNB_cali(sigmoid)', 'GaussianNB_cali(isotonic)']
    LR_name = ['LogisticRegression', 'LogisticRegression_cali(sigmoid)', 'LogisticRegression_cali(isotonic)']
    SVC_name = ['SVC', 'SVC_cali(sigmoid)', 'SVC_cali(isotonic)']
    names = [GaussianNB_name, LR_name, SVC_name]
    # one figure per base estimator: uncalibrated vs. sigmoid vs. isotonic calibration
    for i in range(len(estimators_cali)):
        plot_calib(estimators_cali[i], names[i], Xtrain, Xtest, Ytrain, Ytest)
The results are:
As an exercise, you can try plotting the probability histograms of the calibrated models in the same way.
This concludes the discussion of evaluation metrics for probability-based models.