This article walks through how to implement the C4.5 algorithm in Python.
I. Overview of the C4.5 Algorithm
C4.5 is a decision tree learning algorithm that selects the best splitting attribute using the gain ratio. It builds a decision tree model by recursively partitioning the training data set. Its core idea is to use the reduction in information entropy (the information gain) as the criterion for choosing a splitting attribute; because plain information gain is biased toward attributes with many distinct values, C4.5 normalizes it by the attribute's split information: GainRatio(A) = Gain(A) / SplitInfo(A), where SplitInfo(A) = -Σ_v (|D_v|/|D|) · log2(|D_v|/|D|) and D_v is the subset of samples taking value v on attribute A.
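As a quick, self-contained sanity check of these definitions, here is a small hand calculation; the parent node of 10 samples and the two-way split below are made up purely for illustration:

import math

# A made-up parent node with 10 samples: 6 of class "yes", 4 of class "no".
parent_entropy = -(0.6 * math.log2(0.6) + 0.4 * math.log2(0.4))      # ~0.971

# Suppose an attribute splits it into branches of 7 and 3 samples
# with class counts (6 yes, 1 no) and (0 yes, 3 no).
left = -(6/7 * math.log2(6/7) + 1/7 * math.log2(1/7))                # ~0.592
right = 0.0                                                          # pure branch

gain = parent_entropy - (0.7 * left + 0.3 * right)                   # ~0.557
split_info = -(0.7 * math.log2(0.7) + 0.3 * math.log2(0.3))          # ~0.881
gain_ratio = gain / split_info                                       # ~0.632
print(gain, split_info, gain_ratio)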
Here is a Python implementation of the C4.5 algorithm:
import math


def entropy(data):
    """Shannon entropy of the class labels (last column of each row)."""
    n = len(data)
    class_counts = {}
    for row in data:
        label = row[-1]
        class_counts[label] = class_counts.get(label, 0) + 1
    result = 0.0
    for count in class_counts.values():
        p = count / n
        result -= p * math.log2(p)
    return result


def gain_ratio(data, attribute_index):
    """Information gain of the attribute divided by its split information."""
    original_entropy = entropy(data)
    attribute_values = set(row[attribute_index] for row in data)
    gain = original_entropy
    split_info = 0.0
    for value in attribute_values:
        subset = [row for row in data if row[attribute_index] == value]
        p = len(subset) / len(data)
        gain -= p * entropy(subset)
        split_info -= p * math.log2(p)
    if split_info == 0:  # only one distinct value: the split is useless
        return 0.0
    return gain / split_info


def choose_best_attribute(data, attributes):
    """Return the attribute with the highest gain ratio, or None if no attribute helps."""
    best_gain = 0.0
    best_attribute = None
    for i, attribute in enumerate(attributes):
        gain = gain_ratio(data, i)
        if gain > best_gain:
            best_gain = gain
            best_attribute = attribute
    return best_attribute


def majority_label(data):
    """Most frequent class label in the data."""
    class_counts = {}
    for row in data:
        label = row[-1]
        class_counts[label] = class_counts.get(label, 0) + 1
    return max(class_counts, key=class_counts.get)


def create_decision_tree(data, attributes):
    """Recursively build a decision tree as nested dicts; leaves are class labels."""
    class_labels = set(row[-1] for row in data)
    # Termination: all samples share one label.
    if len(class_labels) == 1:
        return class_labels.pop()
    # Termination: no attributes left, so predict the majority label.
    if len(attributes) == 0:
        return majority_label(data)
    best_attribute = choose_best_attribute(data, attributes)
    if best_attribute is None:  # no attribute gives a positive gain ratio
        return majority_label(data)
    best_index = attributes.index(best_attribute)
    decision_tree = {best_attribute: {}}
    attribute_values = set(row[best_index] for row in data)
    for value in attribute_values:
        # Keep only the matching rows and drop the used attribute's column,
        # so column positions stay aligned with the shrinking attribute list.
        subset = [row[:best_index] + row[best_index + 1:]
                  for row in data if row[best_index] == value]
        new_attributes = [attr for attr in attributes if attr != best_attribute]
        decision_tree[best_attribute][value] = create_decision_tree(subset, new_attributes)
    return decision_tree
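To see the implementation in action, the snippet below builds a tree from a tiny, made-up "weather" data set (two attributes plus a class label in the last column) and prints the resulting nested dictionary; it assumes the functions defined above are in scope:

# Each row is [outlook, windy, play]; the last column is the class label.
weather = [
    ['sunny',    'false', 'no'],
    ['sunny',    'true',  'no'],
    ['overcast', 'false', 'yes'],
    ['rainy',    'false', 'yes'],
    ['rainy',    'true',  'no'],
]
tree = create_decision_tree(weather, ['outlook', 'windy'])
print(tree)
# Prints a nested dict, e.g. {'windy': {'true': 'no', 'false': {'outlook': {...}}}}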
II. Steps of the C4.5 Algorithm
1. Compute the entropy of the data set.
2. For each attribute, compute its information gain and gain ratio.
3. Choose the attribute with the highest gain ratio as the splitting attribute.
4. Partition the data set according to the values of the splitting attribute.
5. Recursively partition each subset until a termination condition is met (see the short sketch after this list).
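The two termination conditions in step 5 (all labels identical, or no attributes left to split on) can be observed directly with the create_decision_tree function from the previous section; both tiny data sets below are made up for illustration:

# All samples share one label, so the node becomes a leaf immediately.
pure = [['sunny', 'yes'], ['rainy', 'yes']]
print(create_decision_tree(pure, ['outlook']))   # -> 'yes'

# No attributes left to split on, so fall back to the majority label.
mixed = [['yes'], ['yes'], ['no']]
print(create_decision_tree(mixed, []))           # -> 'yes'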
III. A C4.5 Example
Below is an example that uses the implementation above to classify the iris data set. Because this simple implementation only handles categorical attribute values, the continuous iris features are first discretized into a few bins, and attribute values never seen during training fall back to the majority class of the training set:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
attributes = ['sepal length', 'sepal width', 'petal length', 'petal width']

# The tree above only branches on discrete values, so each continuous feature
# is discretized into quartile-based bins computed on the training set.
bin_edges = [np.quantile(X_train[:, i], [0.25, 0.5, 0.75]) for i in range(X_train.shape[1])]

def discretize(row):
    return [int(np.digitize(v, bin_edges[i])) for i, v in enumerate(row)]

data = [discretize(row) + [target] for row, target in zip(X_train, y_train)]
decision_tree = create_decision_tree(data, attributes)

# Majority class of the training set, used for attribute values never seen in training.
default_label = np.bincount(y_train).argmax()

predictions = []
for sample in X_test:
    node = decision_tree
    values = discretize(sample)
    while isinstance(node, dict):
        attribute = list(node.keys())[0]
        value = values[attributes.index(attribute)]
        if value not in node[attribute]:  # unseen value: fall back to the majority class
            node = default_label
            break
        node = node[attribute][value]
    predictions.append(node)

accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
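Note that the discretization step is only a workaround for this simplified implementation: the full C4.5 algorithm handles continuous attributes natively by searching for a threshold and making a binary split (value ≤ threshold versus value > threshold), and it also prunes the tree after construction, both of which are omitted here.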
IV. Summary
This article presented a Python implementation of the C4.5 algorithm, explained its underlying ideas and steps, and demonstrated it on an iris classification example. C4.5 is a classic decision tree learning algorithm that performs well in many practical applications.