Machine Learning in Practice: the Breast-Cancer Dataset

In this post, we take a dataset from the UCI repository, the Breast-Cancer dataset, and apply existing machine learning methods to build a classifier.

Link to the code for this post

Dataset overview

The dataset is available at: link

On that page, Data Set Description leads to the documentation for the data, and the other link, Data Folder, leads to the download location.

The files we use here are:

breast-cancer-wisconsin.data
breast-cancer-wisconsin.names

Of these two files, the first (link) is the data file and the second (link) is the documentation describing the data.

With data like this, we should first read the documentation to get a basic understanding of it.

Preprocessing the data

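From the documentation we learn that the file has 11 columns: column 1 is the ID, columns 2-10 are the features, and column 11 is the label (2 for benign, 4 for malignant). The documentation describes each feature, but we do not need to care about the medical meaning here. The relevant part of the documentation reads: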

7. Attribute Information: (class attribute has been moved to last column)
   #  Attribute                     Domain
   -- -----------------------------------------
   1. Sample code number            id number
   2. Clump Thickness               1 - 10
   3. Uniformity of Cell Size       1 - 10
   4. Uniformity of Cell Shape      1 - 10
   5. Marginal Adhesion             1 - 10
   6. Single Epithelial Cell Size   1 - 10
   7. Bare Nuclei                   1 - 10
   8. Bland Chromatin               1 - 10
   9. Normal Nucleoli               1 - 10
  10. Mitoses                       1 - 10
  11. Class:                        (2 for benign, 4 for malignant)

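The documentation also tells us a few other things:

The dataset contains 699 instances
The dataset has 16 missing values, denoted by "?"
The dataset has 458 benign instances and 241 malignant instances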
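Facts like these are easy to check once the file is loaded. Here is a minimal sketch of that check. It assumes the same comma-separated layout as the data file, and it runs on five made-up rows (the `SAMPLE` string and its values are hypothetical) so that it works without the download. Pointing `read_csv` at the real file should let you compare the counts with the documentation.

```python
import io

import pandas as pd

COLUMNS = [
    'Sample code number', 'Clump Thickness', 'Uniformity of Cell Size',
    'Uniformity of Cell Shape', 'Marginal Adhesion',
    'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
    'Normal Nucleoli', 'Mitoses', 'Class',
]

# Hypothetical rows in the file's layout (not taken from the real dataset).
SAMPLE = """1,5,1,1,1,2,1,3,1,1,2
2,5,4,4,5,7,10,3,2,1,2
3,8,10,10,8,7,10,9,7,1,4
4,6,8,8,1,3,?,3,7,1,2
5,8,7,5,10,7,9,5,5,4,4
"""

df = pd.read_csv(io.StringIO(SAMPLE), names=COLUMNS)

# Count the cells that hold the "?" placeholder.
n_missing = int(df.astype(str).eq("?").sum().sum())

# Count the rows per class label (2 = benign, 4 = malignant).
class_counts = df['Class'].value_counts().to_dict()

print("rows:", len(df))
print("missing values:", n_missing)
print("class counts:", class_counts)
```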

Handling missing values and splitting the dataset

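Because only a few rows have missing data (16 of them), for now we simply drop every row that contains a "?". Together with reading the data and adding a header, the code is: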

# import the packages
import numpy as np
import pandas as pd
DATA_PATH = "breast-cancer-wisconsin.data"
# create the column names
columnNames = [
    'Sample code number',
    'Clump Thickness',
    'Uniformity of Cell Size',
    'Uniformity of Cell Shape',
    'Marginal Adhesion',
    'Single Epithelial Cell Size',
    'Bare Nuclei',
    'Bland Chromatin',
    'Normal Nucleoli',
    'Mitoses',
    'Class'
]
data = pd.read_csv(DATA_PATH, names = columnNames)
# show the shape of the data
print(data.shape)
# replace "?" with the standard missing-value marker
data = data.replace(to_replace = "?", value = np.nan)
# then drop every row that has a missing value
data = data.dropna(how = 'any')
print(data.shape)

The output is:

(699, 11)
(683, 11)

As we can see, the rows with missing values have all been dropped.
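One detail is worth knowing here: a column that contained "?" is read as text, and dropping the affected rows does not turn it back into numbers. A possible alternative, sketched below on hypothetical sample rows, is to tell `read_csv` about the placeholder through its `na_values` parameter, so the column is parsed as numeric from the start. This is only a sketch of that option, not the approach used in this post.

```python
import io

import pandas as pd

COLUMNS = [
    'Sample code number', 'Clump Thickness', 'Uniformity of Cell Size',
    'Uniformity of Cell Shape', 'Marginal Adhesion',
    'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
    'Normal Nucleoli', 'Mitoses', 'Class',
]

# Hypothetical rows in the file's layout (not taken from the real dataset).
SAMPLE = """1,5,1,1,1,2,1,3,1,1,2
2,5,4,4,5,7,10,3,2,1,2
3,8,10,10,8,7,10,9,7,1,4
4,6,8,8,1,3,?,3,7,1,2
5,8,7,5,10,7,9,5,5,4,4
"""

# Default parsing: the "?" forces the whole column to be read as text.
raw = pd.read_csv(io.StringIO(SAMPLE), names=COLUMNS)

# With na_values, "?" becomes NaN and the column is parsed as numbers.
parsed = pd.read_csv(io.StringIO(SAMPLE), names=COLUMNS, na_values="?")

# Drop incomplete rows, as in the post.
clean = parsed.dropna(how='any')

print(raw['Bare Nuclei'].dtype, parsed['Bare Nuclei'].dtype)
print(parsed.shape, clean.shape)
```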

We can access a specific attribute with an expression such as data['Class'], as shown in the figure below:

[Figure: output of data['Class']]

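Next we split the dataset into two parts, a training set and a test set. We use train_test_split, which already takes care of the random split (see the function documentation).

The split looks like this: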

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    data[ columnNames[1:10] ], # features
    data[ columnNames[10]   ], # labels
    test_size = 0.25,
    random_state = 33
)

The resulting variables are:

X_train: features of the training set
X_test: features of the test set
y_train: labels of the training set
y_test: labels of the test set

Because this is supervised learning, every sample has a label, and we assume the labels are 100% accurate.
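To see what the split produces, here is a small sketch that calls train_test_split with the same `test_size` and `random_state` on placeholder arrays of the same shape as our cleaned data (683 rows, 9 features). The arrays `X` and `y` are made up for illustration. Only the sizes and the reproducibility are the point.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the cleaned dataset (683 rows, 9 features).
n_samples, n_features = 683, 9
X = np.arange(n_samples * n_features).reshape(n_samples, n_features)
y = np.where(np.arange(n_samples) % 3 == 0, 4, 2)  # made-up labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33
)

# A second call with the same random_state reproduces the same split.
X_train2, X_test2, _, _ = train_test_split(
    X, y, test_size=0.25, random_state=33
)

print(X_train.shape, X_test.shape)  # (512, 9) (171, 9)
print(np.array_equal(X_test, X_test2))  # True
```

The test set of 171 rows matches the total support shown in the classification report at the end of this post.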

Applying machine learning models

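Before applying a model, we should rescale each feature to zero mean and unit variance, so that the trained model is not dominated by features that happen to have large values.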

Here we use the StandardScaler class from scikit-learn (doc link).

from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train = ss.fit_transform(X_train) # fit_transform for train data
X_test = ss.transform(X_test)
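What the scaler does can be written by hand: for each feature, subtract the training mean and divide by the training standard deviation, z = (x - mean) / std. The sketch below checks this against StandardScaler on a tiny made-up array. Note that the test data is transformed with the statistics of the training data, which is why the code above calls fit_transform on the training set but only transform on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up arrays, for illustration only.
X_train = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 4.0], [7.0, 8.0]])
X_test = np.array([[4.0, 8.0]])

scaler = StandardScaler()
scaled_train = scaler.fit_transform(X_train)  # learns mean and std from the training data
scaled_test = scaler.transform(X_test)        # reuses the training mean and std

# The same computation by hand (population standard deviation, ddof=0).
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
manual_train = (X_train - mean) / std
manual_test = (X_test - mean) / std

print(np.allclose(scaled_train, manual_train))  # True
print(np.allclose(scaled_test, manual_test))    # True
```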

Then we build the machine learning models. Here we use Logistic Regression and SVM:

# use logistic regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr_y = lr.predict(X_test)
# use svm
from sklearn.svm import LinearSVC
lsvc = LinearSVC()
lsvc.fit(X_train, y_train)
svm_y = lsvc.predict(X_test)
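As a side note, scaling and fitting can also be chained into one object with scikit-learn's make_pipeline, which makes sure the scaler is only ever fitted on the data passed to fit. The sketch below is one possible way to arrange the steps above, not the code used in this post. It runs on synthetic data from make_classification (an assumption for illustration), so its scores say nothing about the Breast-Cancer dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data with 9 features (not the Breast-Cancer dataset).
X, y = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=33
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33
)

# Each pipeline standardizes the features, then fits a classifier.
models = {
    'LogisticRegression': make_pipeline(StandardScaler(), LogisticRegression()),
    'LinearSVC': make_pipeline(StandardScaler(), LinearSVC()),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # scaler statistics come from X_train only
    scores[name] = model.score(X_test, y_test)  # X_test is scaled with those statistics
    print(name, scores[name])
```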

Evaluating the classifiers

First we print the accuracy using the classifiers' built-in .score method:


# now we will check the performance of the classifiers
from sklearn.metrics import classification_report
# classification_report presents a more detailed result
# the `.score` method can be used to test the accuracy
print('Accuracy of the LogisticRegression: {}'.format(lr.score(X_test, y_test)))
# print('Accuracy on the train dataset: {}'.format(lr.score(X_train, y_train)))
# print('Accuracy on the predict result (should be 1.0): {}'.format(lr.score(X_test, lr_y)))
print('Accuracy of the SVM: {}'.format(lsvc.score(X_test, y_test)))

The output is:

Accuracy of the LogisticRegression: 0.953216374269
Accuracy of the SVM: 0.959064327485
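For a classifier, .score returns the accuracy: the fraction of test samples whose predicted label equals the true label. The sketch below computes it by hand on a few made-up labels and compares the result with scikit-learn's accuracy_score. It also shows that the two numbers above correspond to 163 and 164 correct predictions out of 171 test samples, which is an inference from the printed values rather than something stated in the output.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels, for illustration only (2 = benign, 4 = malignant).
y_true = np.array([2, 2, 4, 4, 2, 4, 2, 2])
y_pred = np.array([2, 2, 4, 2, 2, 4, 4, 2])

# Accuracy by hand: the share of positions where prediction and truth agree.
manual_accuracy = float(np.mean(y_true == y_pred))
library_accuracy = float(accuracy_score(y_true, y_pred))

print(manual_accuracy, library_accuracy)  # 0.75 0.75

# The accuracies reported above, expressed as fractions of the 171 test samples.
lr_fraction = 163 / 171
svm_fraction = 164 / 171
print(round(lr_fraction, 12), round(svm_fraction, 12))
```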

Beyond that, we can use classification_report to get a more detailed performance report for a classifier:

print(classification_report(y_test, svm_y, target_names = ['Benign', 'Malignant']))

The result is:

             precision    recall  f1-score   support
     Benign       0.96      0.98      0.97       111
  Malignant       0.96      0.92      0.94        60
avg / total       0.96      0.96      0.96       171
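To make the columns of the report concrete, the sketch below recomputes precision, recall and F1 from prediction counts. The counts are an assumption: they are inferred from the rounded report and the SVM accuracy above (109 of 111 benign and 55 of 60 malignant samples classified correctly), not read from the actual run. For each class, precision is the share of predictions of that class that are correct, recall is the share of true samples of that class that are found, and F1 is their harmonic mean.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Counts inferred from the rounded report (an assumption, not the actual predictions):
# 111 benign samples, 109 predicted benign and 2 predicted malignant;
# 60 malignant samples, 5 predicted benign and 55 predicted malignant.
y_true = np.array([2] * 111 + [4] * 60)
y_pred = np.array([2] * 109 + [4] * 2 + [2] * 5 + [4] * 55)

precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[2, 4]
)

for name, p, r, f, s in zip(['Benign', 'Malignant'], precision, recall, f1, support):
    print('{:>9}  {:.2f}  {:.2f}  {:.2f}  {}'.format(name, p, r, f, s))

# Benign precision by hand: correct benign predictions / all benign predictions.
benign_precision = 109 / (109 + 5)
```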