- Affinity analysis measures the similarity between individual samples (objects) to determine how closely they are related.
- Case study link
- Consider a set of order (transaction) data: how can we work out which product combinations are most strongly associated?
- Approach
- Support: the number of times the rule is found to hold in the dataset.
- Confidence: measures how accurate the rule is, i.e. among all orders that satisfy the rule's premise (its "if" part), the proportion in which the rule's conclusion also holds (a toy illustration follows this item).
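As a quick illustration of the two measures, consider the rule "if bread then milk" on a made-up toy dataset of three orders (assumed data, not taken from the case study): support counts the orders containing both products, and confidence divides that count by the number of orders containing bread.

# Toy illustration (assumed data): columns are [bread, milk]; each row is one order
toy_orders = [
    [1, 1],  # bread and milk bought together -> the rule holds
    [1, 0],  # bread only                     -> the rule does not hold
    [0, 1],  # milk only                      -> premise absent, not counted
]
support = sum(1 for order in toy_orders if order[0] == 1 and order[1] == 1)    # 1
confidence = support / sum(1 for order in toy_orders if order[0] == 1)         # 1 / 2 = 0.5
print(support, confidence)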
- Code snippet
# -*- coding: utf-8 -*-
import numpy as np
dataset_filename = "affinity_dataset.txt"
X = np.loadtxt(dataset_filename)
n_samples, n_features = X.shape
# Names of the dataset's feature columns, referenced when the rules are printed
# below (assumed to match the five columns of affinity_dataset.txt)
features = ["bread", "milk", "cheese", "apples", "bananas"]
# Create the dictionaries with defaultdict; the benefit is that looking up a key
# that does not exist returns a default value instead of raising an error
from collections import defaultdict
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)
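# For example, indexing defaultdict(int) with a key that has never been set simply
# evaluates to 0, whereas a plain dict would raise a KeyError.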
# Iterate over every order (transaction) in the dataset
for sample in X:
    # Iterate over every item (product) as the rule's premise
    for premise in range(n_features):
        # "continue" skips the rest of the current loop body and moves on to the next iteration
        if sample[premise] == 0:
            continue  # a value of 0 means this product was not bought
        num_occurences[premise] += 1  # count how often the premise product is bought
        for conclusion in range(n_features):  # iterate over the items in the order again as conclusions
            if premise == conclusion:  # skip the same product and go on to the next one
                continue
            if sample[conclusion] == 1:  # the conclusion product was also bought (value 1)
                valid_rules[(premise, conclusion)] += 1   # the rule held, so increment valid_rules
            else:
                invalid_rules[(premise, conclusion)] += 1  # otherwise increment invalid_rules
support = valid_rules
# Confidence of the rule "if A then B": the number of orders in which B was bought
# together with A, divided by the total number of orders containing A; computed below
confidence = defaultdict(float)
for premise, conclusion in valid_rules.keys():  # iterate over the keys of the valid rules
    confidence[(premise, conclusion)] = valid_rules[(premise, conclusion)] / num_occurences[premise]  # apply the formula
print(confidence)
for premise, conclusion in confidence:
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")
How to sort the rules by their support values
- After obtaining the support and confidence of every rule, we still need to sort the rules by support and confidence in order to find the best ones (a ranking by confidence is sketched after the sample output below).
- To find the rules with the highest support, first sort the support dictionary. By themselves, a dictionary's elements (key-value pairs) have no particular order; the dictionary's items() method returns all of those elements. We use itemgetter() as the sort key so that these nested tuples can be sorted: itemgetter(1) sorts by each element's value (here, the support), and reverse=True sorts in descending order.
def print_rule(premise, conclusion, support, confidence, features):
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule:If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")

from operator import itemgetter
# support.items() returns all of the dictionary's elements as (key, value) pairs, e.g.:
# {(0, 1): 14, (0, 2): 4} --> [((0, 1), 14), ((0, 2), 4)]
# itemgetter() can be used as the key for sorting nested lists/tuples, e.g.:
# a = [('c',3),('b',2),('a',4)]; sorted(a, key=itemgetter(0)) --> [('a',4),('b',2),('c',3)]
sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)
for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_support[index][0]
    print_rule(premise, conclusion, support, confidence, features)
Output:
Rule #1
Rule:If a person buys cheese they will also buy bananas
- Confidence: 0.659
- Support: 27
Rule #2
Rule:If a person buys bananas they will also buy cheese
- Confidence: 0.458
- Support: 27
Rule #3
Rule:If a person buys cheese they will also buy apples
- Confidence: 0.610
- Support: 25
Rule #4
Rule:If a person buys apples they will also buy cheese
- Confidence: 0.694
- Support: 25
Rule #5
Rule:If a person buys apples they will also buy bananas
- Confidence: 0.583
- Support: 21
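The same ranking can also be produced by confidence instead of support; a minimal sketch that reuses the print_rule helper and the itemgetter key defined above:

sorted_confidence = sorted(confidence.items(), key=itemgetter(1), reverse=True)
for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_confidence[index][0]
    print_rule(premise, conclusion, support, confidence, features)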
- This chapter covers a very simple affinity (association) analysis; for large datasets, the Apriori algorithm or the FP-tree (FP-growth) algorithm is recommended.
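For reference, below is a minimal NumPy-only sketch of the Apriori idea: frequent itemsets are generated level by level and pruned by a support threshold. The threshold value and the count helper are assumptions made for illustration; in practice a library implementation (for example mlxtend's apriori) would normally be used on larger data.

import numpy as np

X = np.loadtxt("affinity_dataset.txt")
n_samples, n_features = X.shape
min_support = 20  # assumed absolute support threshold (minimum number of orders)

def count(itemset):
    """Number of orders that contain every product in the itemset."""
    return int(np.all(X[:, list(itemset)] == 1, axis=1).sum())

frequent_itemsets = {}  # maps frozenset of feature indices -> support count
# Level 1: single products that meet the support threshold
current = [frozenset([i]) for i in range(n_features) if count(frozenset([i])) >= min_support]
k = 1
while current:
    for itemset in current:
        frequent_itemsets[itemset] = count(itemset)
    # Join step: combine frequent k-itemsets into candidate (k+1)-itemsets,
    # then keep only the candidates that are themselves frequent
    candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
    current = [c for c in candidates if count(c) >= min_support]
    k += 1

# The five most frequent itemsets; rules and confidences can then be derived from
# these counts in the same way as in the pairwise code above
for itemset, supp in sorted(frequent_itemsets.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(sorted(itemset), supp)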