- Affinity analysis measures the similarity between individual samples (objects) to determine how closely they are related.
- Case study link
- Consider a set of order (transaction) data: how can we work out which product combinations are most strongly associated?
- Approach
- Support: the number of times the rule is found to hold in the dataset.
- Confidence: measures how accurate the rule is, i.e. among all orders that satisfy the rule's premise (its "if" part), the proportion in which the rule's conclusion also holds (a toy illustration follows this item).
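As a quick illustration of the two measures, consider the rule "if bread then milk" on a made-up toy dataset of three orders (assumed data, not taken from the case study): support counts the orders containing both products, and confidence divides that count by the number of orders containing bread.

# Toy illustration (assumed data): columns are [bread, milk]; each row is one order
toy_orders = [
    [1, 1],  # bread and milk bought together -> the rule holds
    [1, 0],  # bread only                     -> the rule does not hold
    [0, 1],  # milk only                      -> premise absent, not counted
]
support = sum(1 for order in toy_orders if order[0] == 1 and order[1] == 1)    # 1
confidence = support / sum(1 for order in toy_orders if order[0] == 1)         # 1 / 2 = 0.5
print(support, confidence)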
- Code snippet
# -*- coding: utf-8 -*-
import numpy as np
dataset_filename = "affinity_dataset.txt"
X = np.loadtxt(dataset_filename)
n_samples, n_features = X.shape
# Names of the dataset's feature columns, referenced when the rules are printed
# below (assumed to match the five columns of affinity_dataset.txt)
features = ["bread", "milk", "cheese", "apples", "bananas"]
# Create the dictionaries with defaultdict; the benefit is that looking up a key
# that does not exist returns a default value instead of raising an error
from collections import defaultdict
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)
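# For example, indexing defaultdict(int) with a key that has never been set simply
# evaluates to 0, whereas a plain dict would raise a KeyError.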
# Iterate over every order (transaction) in the dataset
for sample in X:
    # Iterate over every item (product) as the rule's premise
    for premise in range(n_features):
        # "continue" skips the rest of the current loop body and moves on to the next iteration
        if sample[premise] == 0:
            continue  # a value of 0 means this product was not bought
        num_occurences[premise] += 1  # count how often the premise product is bought
        for conclusion in range(n_features):  # iterate over the items in the order again as conclusions
            if premise == conclusion:  # skip the same product and go on to the next one
                continue
            if sample[conclusion] == 1:  # the conclusion product was also bought (value 1)
                valid_rules[(premise, conclusion)] += 1   # the rule held, so increment valid_rules
            else:
                invalid_rules[(premise, conclusion)] += 1  # otherwise increment invalid_rules
support = valid_rules
# Confidence of the rule "if A then B": the number of orders in which B was bought
# together with A, divided by the total number of orders containing A; computed below
confidence = defaultdict(float)
for premise, conclusion in valid_rules.keys():  # iterate over the keys of the valid rules
    confidence[(premise, conclusion)] = valid_rules[(premise, conclusion)] / num_occurences[premise]  # apply the formula
print(confidence)
for premise, conclusion in confidence:
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")
How to sort the rules by their support values
- After obtaining the support and confidence of every rule, we still need to sort the rules by support and confidence in order to find the best ones (a ranking by confidence is sketched after the sample output below).
- To find the rules with the highest support, first sort the support dictionary. By themselves, a dictionary's elements (key-value pairs) have no particular order; the dictionary's items() method returns all of those elements. We use itemgetter() as the sort key so that these nested tuples can be sorted: itemgetter(1) sorts by each element's value (here, the support), and reverse=True sorts in descending order.
def print_rule(premise, conclusion, support, confidence, features):
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule:If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")

from operator import itemgetter
# support.items() returns all of the dictionary's elements as (key, value) pairs, e.g.:
# {(0, 1): 14, (0, 2): 4} --> [((0, 1), 14), ((0, 2), 4)]
# itemgetter() can be used as the key for sorting nested lists/tuples, e.g.:
# a = [('c',3),('b',2),('a',4)]; sorted(a, key=itemgetter(0)) --> [('a',4),('b',2),('c',3)]
sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)
for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_support[index][0]
    print_rule(premise, conclusion, support, confidence, features)
Output:
Rule #1
Rule:If a person buys cheese they will also buy bananas
- Confidence: 0.659
- Support: 27
Rule #2
Rule:If a person buys bananas they will also buy cheese
- Confidence: 0.458
- Support: 27
Rule #3
Rule:If a person buys cheese they will also buy apples
- Confidence: 0.610
- Support: 25
Rule #4
Rule:If a person buys apples they will also buy cheese
- Confidence: 0.694
- Support: 25
Rule #5
Rule:If a person buys apples they will also buy bananas
- Confidence: 0.583
- Support: 21
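The same ranking can also be produced by confidence instead of support; a minimal sketch that reuses the print_rule helper and the itemgetter key defined above:

sorted_confidence = sorted(confidence.items(), key=itemgetter(1), reverse=True)
for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_confidence[index][0]
    print_rule(premise, conclusion, support, confidence, features)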
- This chapter covers a very simple affinity (association) analysis; for large datasets, the Apriori algorithm or the FP-tree (FP-growth) algorithm is recommended.
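For reference, below is a minimal NumPy-only sketch of the Apriori idea: frequent itemsets are generated level by level and pruned by a support threshold. The threshold value and the count helper are assumptions made for illustration; in practice a library implementation (for example mlxtend's apriori) would normally be used on larger data.

import numpy as np

X = np.loadtxt("affinity_dataset.txt")
n_samples, n_features = X.shape
min_support = 20  # assumed absolute support threshold (minimum number of orders)

def count(itemset):
    """Number of orders that contain every product in the itemset."""
    return int(np.all(X[:, list(itemset)] == 1, axis=1).sum())

frequent_itemsets = {}  # maps frozenset of feature indices -> support count
# Level 1: single products that meet the support threshold
current = [frozenset([i]) for i in range(n_features) if count(frozenset([i])) >= min_support]
k = 1
while current:
    for itemset in current:
        frequent_itemsets[itemset] = count(itemset)
    # Join step: combine frequent k-itemsets into candidate (k+1)-itemsets,
    # then keep only the candidates that are themselves frequent
    candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
    current = [c for c in candidates if count(c) >= min_support]
    k += 1

# The five most frequent itemsets; rules and confidences can then be derived from
# these counts in the same way as in the pairwise code above
for itemset, supp in sorted(frequent_itemsets.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(sorted(itemset), supp)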