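The Iris training set is a dataset that botanists compiled by measuring the sepals, petals, and related parameters of three species of iris.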
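In supervised learning, one of the most common exercises is to learn from this training set and obtain a classifier. When the measurements of a brand-new iris specimen are fed in, the classifier decides which species it belongs to. For example, one row of the training set reads:

5.1,3.5,1.4,0.2,Iris-setosa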
Each row of the training set contains four measurements of the flower (floats) and a species name (string).
In actual processing, we need to convert the species name from a string into a numeric value, because a classifier is essentially a mapping between numbers. For the iris problem, it maps a 1×4 vector to a single number. After preprocessing the training set, we get:

5.1,3.5,1.4,0.2,1
4.9,3.0,1.4,0.2,1
4.7,3.2,1.3,0.2,1
4.6,3.1,1.5,0.2,1
5.0,3.6,1.4,0.2,1
5.4,3.9,1.7,0.4,1
4.6,3.4,1.4,0.3,1
5.0,3.4,1.5,0.2,1
4.4,2.9,1.4,0.2,1
4.9,3.1,1.5,0.1,1
5.4,3.7,1.5,0.2,1
4.8,3.4,1.6,0.2,1
4.8,3.0,1.4,0.1,1
4.3,3.0,1.1,0.1,1
5.8,4.0,1.2,0.2,1
5.7,4.4,1.5,0.4,1
5.4,3.9,1.3,0.4,1
5.1,3.5,1.4,0.3,1
5.7,3.8,1.7,0.3,1
5.1,3.8,1.5,0.3,1
5.4,3.4,1.7,0.2,1
5.1,3.7,1.5,0.4,1
4.6,3.6,1.0,0.2,1
5.1,3.3,1.7,0.5,1
4.8,3.4,1.9,0.2,1
5.0,3.0,1.6,0.2,1
5.0,3.4,1.6,0.4,1
5.2,3.5,1.5,0.2,1
5.2,3.4,1.4,0.2,1
4.7,3.2,1.6,0.2,1
4.8,3.1,1.6,0.2,1
5.4,3.4,1.5,0.4,1
5.2,4.1,1.5,0.1,1
5.5,4.2,1.4,0.2,1
4.9,3.1,1.5,0.1,1
5.0,3.2,1.2,0.2,1
5.5,3.5,1.3,0.2,1
4.9,3.1,1.5,0.1,1
4.4,3.0,1.3,0.2,1
5.1,3.4,1.5,0.2,1
5.0,3.5,1.3,0.3,1
4.5,2.3,1.3,0.3,1
4.4,3.2,1.3,0.2,1
5.0,3.5,1.6,0.6,1
5.1,3.8,1.9,0.4,1
4.8,3.0,1.4,0.3,1
5.1,3.8,1.6,0.2,1
4.6,3.2,1.4,0.2,1
5.3,3.7,1.5,0.2,1
5.0,3.3,1.4,0.2,1
7.0,3.2,4.7,1.4,2
6.4,3.2,4.5,1.5,2
6.9,3.1,4.9,1.5,2
5.5,2.3,4.0,1.3,2
6.5,2.8,4.6,1.5,2
5.7,2.8,4.5,1.3,2
6.3,3.3,4.7,1.6,2
4.9,2.4,3.3,1.0,2
6.6,2.9,4.6,1.3,2
5.2,2.7,3.9,1.4,2
5.0,2.0,3.5,1.0,2
5.9,3.0,4.2,1.5,2
6.0,2.2,4.0,1.0,2
6.1,2.9,4.7,1.4,2
5.6,2.9,3.6,1.3,2
6.7,3.1,4.4,1.4,2
5.6,3.0,4.5,1.5,2
5.8,2.7,4.1,1.0,2
6.2,2.2,4.5,1.5,2
5.6,2.5,3.9,1.1,2
5.9,3.2,4.8,1.8,2
6.1,2.8,4.0,1.3,2
6.3,2.5,4.9,1.5,2
6.1,2.8,4.7,1.2,2
6.4,2.9,4.3,1.3,2
6.6,3.0,4.4,1.4,2
6.8,2.8,4.8,1.4,2
6.7,3.0,5.0,1.7,2
6.0,2.9,4.5,1.5,2
5.7,2.6,3.5,1.0,2
5.5,2.4,3.8,1.1,2
5.5,2.4,3.7,1.0,2
5.8,2.7,3.9,1.2,2
6.0,2.7,5.1,1.6,2
5.4,3.0,4.5,1.5,2
6.0,3.4,4.5,1.6,2
6.7,3.1,4.7,1.5,2
6.3,2.3,4.4,1.3,2
5.6,3.0,4.1,1.3,2
5.5,2.5,4.0,1.3,2
5.5,2.6,4.4,1.2,2
6.1,3.0,4.6,1.4,2
5.8,2.6,4.0,1.2,2
5.0,2.3,3.3,1.0,2
5.6,2.7,4.2,1.3,2
5.7,3.0,4.2,1.2,2
5.7,2.9,4.2,1.3,2
6.2,2.9,4.3,1.3,2
5.1,2.5,3.0,1.1,2
5.7,2.8,4.1,1.3,2
6.3,3.3,6.0,2.5,3
5.8,2.7,5.1,1.9,3
7.1,3.0,5.9,2.1,3
6.3,2.9,5.6,1.8,3
6.5,3.0,5.8,2.2,3
7.6,3.0,6.6,2.1,3
4.9,2.5,4.5,1.7,3
7.3,2.9,6.3,1.8,3
6.7,2.5,5.8,1.8,3
7.2,3.6,6.1,2.5,3
6.5,3.2,5.1,2.0,3
6.4,2.7,5.3,1.9,3
6.8,3.0,5.5,2.1,3
5.7,2.5,5.0,2.0,3
5.8,2.8,5.1,2.4,3
6.4,3.2,5.3,2.3,3
6.5,3.0,5.5,1.8,3
7.7,3.8,6.7,2.2,3
7.7,2.6,6.9,2.3,3
6.0,2.2,5.0,1.5,3
6.9,3.2,5.7,2.3,3
5.6,2.8,4.9,2.0,3
7.7,2.8,6.7,2.0,3
6.3,2.7,4.9,1.8,3
6.7,3.3,5.7,2.1,3
7.2,3.2,6.0,1.8,3
6.2,2.8,4.8,1.8,3
6.1,3.0,4.9,1.8,3
6.4,2.8,5.6,2.1,3
7.2,3.0,5.8,1.6,3
7.4,2.8,6.1,1.9,3
7.9,3.8,6.4,2.0,3
6.4,2.8,5.6,2.2,3
6.3,2.8,5.1,1.5,3
6.1,2.6,5.6,1.4,3
7.7,3.0,6.1,2.3,3
6.3,3.4,5.6,2.4,3
6.4,3.1,5.5,1.8,3
6.0,3.0,4.8,1.8,3
6.9,3.1,5.4,2.1,3
6.7,3.1,5.6,2.4,3
6.9,3.1,5.1,2.3,3
5.8,2.7,5.1,1.9,3
6.8,3.2,5.9,2.3,3
6.7,3.3,5.7,2.5,3
6.7,3.0,5.2,2.3,3
6.3,2.5,5.0,1.9,3
6.5,3.0,5.2,2.0,3
6.2,3.4,5.4,2.3,3
5.9,3.0,5.1,1.8,3
The last value in each row (1, 2, or 3) indicates the species of the flower.
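The conversion from species name to number can be sketched roughly as follows. This is only a minimal illustration using the standard library: the helper name encode_rows and the LABELS table are our own, and the 1/2/3 assignment is assumed to match the irisname dictionary used later in this article.

```python
import csv
import io

# Assumed mapping, chosen to agree with the irisname dictionary in this article.
LABELS = {"Iris-setosa": 1, "Iris-versicolor": 2, "Iris-virginica": 3}


def encode_rows(raw_text):
    """Parse CSV text and replace the species name in each row with its number."""
    encoded = []
    for row in csv.reader(io.StringIO(raw_text)):
        if not row:  # skip blank lines
            continue
        *measurements, name = row
        encoded.append([float(v) for v in measurements] + [LABELS[name.strip()]])
    return encoded


raw = "5.1,3.5,1.4,0.2,Iris-setosa\n7.0,3.2,4.7,1.4,Iris-versicolor\n"
for row in encode_rows(raw):
    # Prints each row in the processed format, e.g. 5.1,3.5,1.4,0.2,1
    print(",".join(str(v) for v in row))
```

An unknown species name raises a KeyError here, which is a convenient way to catch typos in the raw file.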
We use the ready-made KNN algorithm in the sklearn library, namely KNeighborsClassifier from the sklearn.neighbors package.
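To give a feel for what KNN does, here is a simplified sketch written from scratch: find the k training points closest to the query (Euclidean distance) and take a majority vote over their labels. The function knn_predict is a hypothetical helper of ours, not part of sklearn, and it ignores details such as tie-breaking and efficient neighbor search that the library handles. KNeighborsClassifier uses 5 neighbors by default; we use k=3 here because the toy training set has only nine rows.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=5):
    """Label `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Nine rows taken from the processed training set (three per species).
train_x = [
    [5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2], [4.7, 3.2, 1.3, 0.2],
    [7.0, 3.2, 4.7, 1.4], [6.4, 3.2, 4.5, 1.5], [6.9, 3.1, 4.9, 1.5],
    [6.3, 3.3, 6.0, 2.5], [5.8, 2.7, 5.1, 1.9], [7.1, 3.0, 5.9, 2.1],
]
train_y = [1, 1, 1, 2, 2, 2, 3, 3, 3]

# The query is close to the three setosa rows, so the vote returns label 1.
print(knn_predict(train_x, train_y, [5.0, 3.4, 1.5, 0.2], k=3))
```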
pandas and numpy are Python libraries for data processing. They let Python carry out complex operations such as matrix transformations, much as MATLAB does.

When loading the data, feature is an n×4 matrix and label is an n×1 matrix. We use the method pandas provides to read the whole dataset into memory first, which gives an n×5 matrix of type DataFrame.

Next, we use pandas slicing to cut out the first four columns of the matrix as the features; the slice is also a DataFrame.

For the labels, we cut out the last column of the n×5 matrix. This turns out to be a pandas Series. Its values attribute converts it to a numpy ndarray, and the DataFrame constructor then turns the data into an n×1 DataFrame.

Finally, the slices are assigned to the feature and label matrices with numpy's concatenate operation. In our tests, the concatenation works normally when the second argument is a DataFrame.

import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier


def load_datasets(data_path):
    # Start from empty arrays so the slices can be appended with concatenate.
    feature = np.empty(shape=(0, 4))
    label = np.empty(shape=(0, 1))
    # Read the whole comma-separated file into an n x 5 DataFrame.
    df = pd.read_csv(data_path, na_values='?', header=None)
    # Columns 0-3 hold the four measurements.
    fpiece = df.iloc[:, :4]
    feature = np.concatenate((feature, fpiece))
    # Column 4 is a Series; .values gives an ndarray, DataFrame makes it n x 1.
    data = df.iloc[:, 4].values
    lpiece = pd.DataFrame(data)
    label = np.concatenate((label, lpiece))
    return feature, label


if __name__ == '__main__':
    data_path = 'iris.txt'
    x_train, y_train = load_datasets(data_path)
    print('start training knn')
    # fit expects a 1-D label array, so flatten the n x 1 column.
    knn = KNeighborsClassifier().fit(x_train, y_train.ravel())
    print('training done')
    # predict expects a 2-D array: one row per sample.
    x_test = [[7.1, 3.8, 5.4, 2.2]]
    answer_knn = int(knn.predict(x_test)[0])
    print('prediction done')
    irisname = {1: 'Iris-setosa', 2: 'Iris-versicolor', 3: 'Iris-virginica'}
    print(irisname[answer_knn])
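As a quick sanity check of the shapes described above, the following sketch runs the same slicing steps on three rows held in memory instead of a file. The variable names are ours, and it assumes reasonably recent versions of pandas and numpy.

```python
import io

import numpy as np
import pandas as pd

# Three rows in the processed format (four floats plus a numeric label).
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,1\n"
    "7.0,3.2,4.7,1.4,2\n"
    "6.3,3.3,6.0,2.5,3\n"
)

df = pd.read_csv(sample, header=None)   # n x 5 DataFrame
fpiece = df.iloc[:, :4]                 # first four columns: n x 4 DataFrame
column = df.iloc[:, 4]                  # last column: a 1-D Series
lpiece = pd.DataFrame(column.values)    # Series -> ndarray -> n x 1 DataFrame

# Concatenating onto empty arrays yields plain ndarrays with the same shapes.
feature = np.concatenate((np.empty((0, 4)), fpiece))
label = np.concatenate((np.empty((0, 1)), lpiece))

shapes = {
    "df": df.shape,
    "fpiece": fpiece.shape,
    "column": column.shape,
    "lpiece": lpiece.shape,
    "feature": feature.shape,
    "label": label.shape,
}
print(shapes)
```

With n = 3, the Series has shape (3,) while the rebuilt DataFrame has shape (3, 1), which is the difference the text above points out.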
Finally, to display the result, we define an irisname dictionary that converts the predicted label into the species name. Final result: