Text clustering with scikit-learn KMeans

news/2024/9/5 19:31:10

K-means is an unsupervised learning method. The value of K has to be chosen according to the actual business requirements.
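When business requirements alone do not fix K, one common aid is to fit the model for several values of K and look at how the cost (inertia) falls. The sketch below is only an illustration under assumptions: it uses synthetic data from `make_blobs` rather than the article's corpus, and the names `X_demo` and `inertias` are made up for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (an assumption, not the article's dataset)
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6,
                       random_state=0)

# Fit K-means for a range of K and record the final cost of each fit
inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertias[k] = model.inertia_
    print("k={}: inertia={:.1f}".format(k, model.inertia_))
```

The cost keeps dropping as K grows, so the point where the drop flattens out is only a hint; the final choice still depends on what the clusters are used for.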

  • Load the dataset
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

from time import time
from sklearn.datasets import load_files

print("loading documents ...")
t = time()
docs = load_files('datasets/clustering/data')
print("summary: {0} documents in {1} categories.".format(
    len(docs.data), len(docs.target_names)))
print("done in {0} seconds".format(time() - t))
  • Vectorize the documents
from sklearn.feature_extraction.text import TfidfVectorizer

max_features = 20000
print("vectorizing documents ...")
t = time()
vectorizer = TfidfVectorizer(max_df=0.4, 
                             min_df=2, 
                             max_features=max_features, 
                             encoding='latin-1')
X = vectorizer.fit_transform((d for d in docs.data))
print("n_samples: %d, n_features: %d" % X.shape)
print("number of non-zero features in sample [{0}]: {1}".format(
    docs.filenames[0], X[0].getnnz()))
print("done in {0} seconds".format(time() - t))
  • Cluster the documents
from sklearn.cluster import KMeans

print("clustering documents ...")
t = time()
n_clusters = 4
kmean = KMeans(n_clusters=n_clusters, 
               max_iter=100,
               tol=0.01,
               verbose=1,
               n_init=3)
kmean.fit(X);
print("kmean: k={}, cost={}".format(n_clusters, int(kmean.inertia_)))
print("done in {0} seconds".format(time() - t))
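The `cost` printed above is `inertia_`, the sum of squared distances from each sample to its nearest cluster center. The following sketch recomputes it by hand on synthetic data to make that concrete; `X_demo`, `km`, and `manual_inertia` are illustrative names, and the data is an assumption rather than the article's documents.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (an assumption, not the article's dataset)
X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=42)
km = KMeans(n_clusters=3, n_init=3, random_state=42).fit(X_demo)

# transform() returns the distance from every sample to every centroid;
# the row-wise minimum is the distance to the nearest centroid
dist_to_nearest = km.transform(X_demo).min(axis=1)
manual_inertia = float(np.sum(dist_to_nearest ** 2))

print("inertia_ from KMeans: %.3f" % km.inertia_)
print("recomputed by hand:   %.3f" % manual_inertia)
```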
  • The 10 highest-weighted terms in each cluster

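print("Top terms per cluster:")

# Sort each centroid's feature indices by descending weight
order_centroids = kmean.cluster_centers_.argsort()[:, ::-1]

# get_feature_names_out() requires scikit-learn >= 1.0
# (older releases used get_feature_names())
terms = vectorizer.get_feature_names_out()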
for i in range(n_clusters):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()

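How do we judge whether a clustering result is good or bad?

The main metrics are the following:

Adjusted Rand Index: a measure of the similarity between two label assignments. Its advantage is that for two random assignments the value is negative or close to 0, while for two assignments with the same structure the value is close to 1. It is insensitive to the label values themselves, so renaming the clusters does not change the score.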

from sklearn import metrics

label_true = np.random.randint(1, 4, 6)
label_pred = np.random.randint(1, 4, 6)
print("Adjusted Rand-Index for random sample: %.3f"
      % metrics.adjusted_rand_score(label_true, label_pred))
label_true = [1, 1, 3, 3, 2, 2]
label_pred = [3, 3, 2, 2, 1, 1]
print("Adjusted Rand-Index for same structure sample: %.3f"
      % metrics.adjusted_rand_score(label_true, label_pred))

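Homogeneity and completeness

Homogeneity means that each cluster contains only members of a single class.

Completeness means that all members of a given labeled class are assigned to the same cluster.

Homogeneity and completeness complement each other. Combining the two gives the V-measure score, which is their harmonic mean.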
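As a quick check of how the three scores relate, the sketch below computes them together and rebuilds the V-measure as the harmonic mean of the other two, which is what scikit-learn returns with its default `beta=1.0`. The label lists and variable names are made up for illustration.

```python
from sklearn import metrics

# Hypothetical labels: one member of class 1 lands in the wrong cluster
demo_true = [0, 0, 1, 1, 2, 2]
demo_pred = [0, 0, 1, 2, 2, 2]

h, c, v = metrics.homogeneity_completeness_v_measure(demo_true, demo_pred)

# Harmonic mean of homogeneity and completeness
v_manual = 2 * h * c / (h + c)

print("homogeneity=%.3f completeness=%.3f" % (h, c))
print("V-measure=%.3f, harmonic mean=%.3f" % (v, v_manual))
```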


from sklearn import metrics

label_true = [1, 1, 2, 2]
label_pred = [2, 2, 1, 1]
print("Homogeneity score for same structure sample: %.3f"
      % metrics.homogeneity_score(label_true, label_pred))
label_true = [1, 1, 2, 2]
label_pred = [0, 1, 2, 3]
print("Homogeneity score when each cluster comes from only one class: %.3f"
      % metrics.homogeneity_score(label_true, label_pred))
label_true = [1, 1, 2, 2]
label_pred = [1, 2, 1, 2]
print("Homogeneity score when each cluster comes from two classes: %.3f"
      % metrics.homogeneity_score(label_true, label_pred))
label_true = np.random.randint(1, 4, 6)
label_pred = np.random.randint(1, 4, 6)
print("Homogeneity score for random sample: %.3f"
      % metrics.homogeneity_score(label_true, label_pred))


from sklearn import metrics

label_true = [1, 1, 2, 2]
label_pred = [2, 2, 1, 1]
print("Completeness score for same structure sample: %.3f"
      % metrics.completeness_score(label_true, label_pred))
label_true = [0, 1, 2, 3]
label_pred = [1, 1, 2, 2]
print("Completeness score when each class is assigned to only one cluster: %.3f"
      % metrics.completeness_score(label_true, label_pred))
label_true = [1, 1, 2, 2]
label_pred = [1, 2, 1, 2]
print("Completeness score when each class is split across two clusters: %.3f"
      % metrics.completeness_score(label_true, label_pred))
label_true = np.random.randint(1, 4, 6)
label_pred = np.random.randint(1, 4, 6)
print("Completeness score for random sample: %.3f"
      % metrics.completeness_score(label_true, label_pred))

from sklearn import metrics

label_true = [1, 1, 2, 2]
label_pred = [2, 2, 1, 1]
print("V-measure score for same structure sample: %.3f"
      % metrics.v_measure_score(label_true, label_pred))
label_true = [0, 1, 2, 3]
label_pred = [1, 1, 2, 2]
print("V-measure score when each class is assigned to only one cluster: %.3f"
      % metrics.v_measure_score(label_true, label_pred))
# V-measure is symmetric: swapping the arguments gives the same score
print("V-measure score with the arguments swapped: %.3f"
      % metrics.v_measure_score(label_pred, label_true))
label_true = [1, 1, 2, 2]
label_pred = [1, 2, 1, 2]
print("V-measure score when each class is split across two clusters: %.3f"
      % metrics.v_measure_score(label_true, label_pred))

Silhouette coefficient


from sklearn import metrics

labels = docs.target
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, kmean.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels, kmean.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels, kmean.labels_))
print("Adjusted Rand-Index: %.3f"
      % metrics.adjusted_rand_score(labels, kmean.labels_))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, kmean.labels_, sample_size=1000))

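The silhouette coefficient can evaluate the performance of a clustering algorithm without needing a labeled dataset.

a: the mean distance between a sample and the other points in its own cluster.

b: the mean distance between a sample and the points in the nearest neighboring cluster.

The silhouette coefficient is s = (b - a) / max(a, b). Its value lies in [-1, 1]: -1 indicates a completely wrong clustering, 1 indicates a perfect clustering, and 0 indicates overlapping clusters.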
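The formula can be worked through by hand for a single sample and compared with `metrics.silhouette_samples`. This is a minimal sketch on four made-up one-dimensional points; `X_demo`, `labels_demo`, and the other names are illustrative only.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Four hypothetical 1-D points forming two well-separated clusters
X_demo = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_demo = np.array([0, 0, 1, 1])

i = 0  # the sample whose silhouette value is computed by hand

# a: mean distance to the other points in the same cluster
same = labels_demo == labels_demo[i]
same[i] = False
a = np.abs(X_demo[same] - X_demo[i]).mean()

# b: mean distance to the points of the nearest other cluster
# (with only two clusters, that is simply the other cluster)
other = labels_demo != labels_demo[i]
b = np.abs(X_demo[other] - X_demo[i]).mean()

s_manual = (b - a) / max(a, b)
s_sklearn = silhouette_samples(X_demo, labels_demo)[i]
print("by hand: %.3f, scikit-learn: %.3f" % (s_manual, s_sklearn))
```

`metrics.silhouette_score`, used earlier, is the mean of these per-sample values.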


