Machine Learning, Lab 3: Logistic Regression (Ad Click-Through Rate Prediction)

Ad Click-Through Rate Prediction

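Click-through rate (CTR) prediction is a typical application in the advertising industry, and CTR is a key metric for evaluating ad effectiveness. A prediction model is trained on historical data and then applied to each day's incremental data to pick out the ads whose predicted CTR meets the standard for delivery.

Dataset overview

The dataset comes from Kaggle and contains 10 days of Avazu ad click data, with 10,000 training samples and 1,000 test samples. Each ad record includes attributes such as the ad id, the time, and the ad position.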

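Task 1: Import libraries and the dataset, and preprocess the data

  • Read the training and test data, then split each into data and label.
  • Convert string-typed features to int. There are two options: 1) one-hot encoding, which yields high-dimensional sparse features and increases memory usage; 2) Python's built-in hash function, which maps each object-typed feature value to an integer in a fixed range (the original string becomes an integer) and greatly reduces memory usage.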
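The memory cost of one-hot encoding mostly applies to dense output, such as the DataFrame that `pd.get_dummies` returns by default. As a hedged alternative that is not part of the original lab, scikit-learn's `OneHotEncoder` returns a SciPy sparse matrix by default, and `LogisticRegression` accepts sparse input. The sketch below uses a small made-up frame (the values are invented for illustration, not taken from the Avazu files):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data invented for illustration; not the Avazu files.
toy = pd.DataFrame({
    "site_category": ["news", "games", "news", "shop", "games", "shop", "news", "games"],
    "device_model": ["m1", "m2", "m1", "m3", "m2", "m3", "m2", "m1"],
})
toy_label = [1, 0, 1, 0, 0, 0, 1, 0]

# One-hot encode into a sparse matrix: one column per distinct category value.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(toy)
print(encoded.shape)

# Chain the encoder and the classifier so that the same encoding is reused at prediction time.
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(max_iter=1000),
)
model.fit(toy, toy_label)

# Categories never seen during fit ("blog", "m9") are encoded as all zeros.
unseen = pd.DataFrame({"site_category": ["news", "blog"], "device_model": ["m9", "m2"]})
proba = model.predict_proba(unseen)[:, 1]
print(proba)
```

Whether this helps on the real data depends on how many distinct values each column has; columns such as device_ip may still need hashing or frequency cut-offs.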
import gzip
import pandas as pd
import random
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model

types_train = {
    'id': np.dtype(int),
    'click': np.dtype(int),             # whether the ad was clicked: 1 = clicked, 0 = not clicked
    'hour': np.dtype(int),              # date and hour at which the ad was shown
    'C1': np.dtype(int),                # anonymized categorical variable
    'banner_pos': np.dtype(int),        # ad position
    'site_id': np.dtype(str),           # site id
    'site_domain': np.dtype(str),       # site domain
    'site_category': np.dtype(str),     # site category
    'app_id': np.dtype(str),            # app id
    'app_domain': np.dtype(str),        # app domain
    'app_category': np.dtype(str),      # app category
    'device_id': np.dtype(str),         # device id
    'device_ip': np.dtype(str),         # device IP
    'device_model': np.dtype(str),      # device model
    'device_type': np.dtype(int),       # device type
    'device_conn_type': np.dtype(int),  # device connection type
    'C14': np.dtype(int),               # anonymized categorical variable
    'C15': np.dtype(int),               # anonymized categorical variable
    'C16': np.dtype(int),               # anonymized categorical variable
    'C17': np.dtype(int),               # anonymized categorical variable
    'C18': np.dtype(int),               # anonymized categorical variable
    'C19': np.dtype(int),               # anonymized categorical variable
    'C20': np.dtype(int),               # anonymized categorical variable
    'C21': np.dtype(int)                # anonymized categorical variable
}

# Column names
header_row = ['id', 'click', 'hour', 'C1', 'banner_pos', 'site_id', 'site_domain', 'site_category',
              'app_id', 'app_domain', 'app_category', 'device_id', 'device_ip', 'device_model',
              'device_type', 'device_conn_type', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19',
              'C20', 'C21']

# Read the training and test data
train = pd.read_csv('train_data.csv', names=header_row, dtype=types_train)
test = pd.read_csv('test_data.csv', names=header_row, dtype=types_train)
# Drop row 0 (it holds the column numbering, not a sample)
train = train.drop(labels=train.index.values[0])
test = test.drop(labels=test.index.values[0])
print(test.shape)

# Split into data and label
train_data = train.drop('click', axis=1)  # remove the click column
print(train_data.shape)
train_label = train['click']              # keep only the click column

# Preprocessing
# Option 1: one-hot encode the non-numeric features with pd.get_dummies (high-dimensional, sparse)
train_data1 = pd.get_dummies(train_data)
print(train_data1.shape)

# Option 2: convert_obj_to_int() converts string-typed features to int
def convert_obj_to_int(df):
    object_list_columns = df.columns
    object_list_dtypes = df.dtypes
    new_col_suffix = '_int'
    for index in range(0, len(object_list_columns)):
        # Positional access via .iloc avoids the pandas FutureWarning raised by dtypes[index]
        if object_list_dtypes.iloc[index] == object:
            # Map each string to an integer in [0, 2**32) with hash().
            # Note: str hashes are salted per Python process unless PYTHONHASHSEED is set,
            # so the codes are only consistent within a single run.
            df[object_list_columns[index] + new_col_suffix] = df[object_list_columns[index]].map(lambda x: hash(x) % (1 << 32))
            df.drop([object_list_columns[index]], inplace=True, axis=1)
    return df

# Convert the string-typed columns of the training data to int
train_data = convert_obj_to_int(train_data)
print(train_data.shape)

Output:

(1000, 24)
(10000, 23)
(10000, 10531)
(10000, 23)
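One caveat about the hash-based conversion: Python salts the built-in `hash()` of strings per interpreter process unless `PYTHONHASHSEED` is fixed, so the integer codes can differ from one run to the next. That is harmless when training and test data are converted in the same session, as they are here, but it matters if a fitted model is saved and reused later. A minimal sketch of a deterministic alternative, using `zlib.crc32` from the standard library (the helper name `stable_hash` and its `n_buckets` parameter are illustrative assumptions, not part of the lab code):

```python
import zlib

def stable_hash(value: str, n_buckets: int = 1 << 32) -> int:
    """Map a string to an integer in [0, n_buckets); the result is the same in every run."""
    return zlib.crc32(value.encode("utf-8")) % n_buckets

# The same input always maps to the same code, within and across processes.
codes = [stable_hash(s) for s in ["site_a", "site_b", "site_a"]]
print(codes[0] == codes[2])
```

With pandas, this could replace the lambda inside `convert_obj_to_int`, for example `df[col].map(stable_hash)`.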



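Task 2: Feature analysis

Taking the position of the ad on the page (banner_pos) as an example, examine the relationship between banner_pos and the final class label (click):

  • Check the distribution of banner_pos values in the dataset.
  • Check how each banner_pos value contributes to the click-through rate.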

# Distribution of banner_pos values in the dataset
print(train.banner_pos.value_counts()/len(train))

# Click-through rate for each banner_pos value
banner_pos_val = train.banner_pos.unique()
banner_pos_val.sort()
ctr_avg_list = []
for i in banner_pos_val:
    selected_data = train.loc[train.banner_pos == i]
    ctr_avg = selected_data.click.mean()
    ctr_avg_list.append(ctr_avg)
    print(" banner position: {}, CTR: {}".format(i, ctr_avg))

Output:

banner_pos
0    0.8041
1    0.1951
2    0.0007
4    0.0001
Name: count, dtype: float64
 banner position: 0, CTR: 0.16975500559631887
 banner position: 1, CTR: 0.19067145053818554
 banner position: 2, CTR: 0.14285714285714285
 banner position: 4, CTR: 0.0
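The rates for positions 2 and 4 rest on very few rows: a share of 0.0007 of 10,000 rows is 7 impressions (one click gives 1/7 ≈ 0.1429), and a share of 0.0001 is a single impression. Reporting the count next to the rate makes that visible. A minimal sketch with `groupby`, run on a toy stand-in for `train` (the values are invented for illustration):

```python
import pandas as pd

# Toy stand-in for the `train` frame; values are invented.
toy_train = pd.DataFrame({
    "banner_pos": [0, 0, 0, 1, 1, 2],
    "click":      [1, 0, 0, 1, 0, 0],
})

# One row per banner position: impressions, clicks, and the resulting CTR.
summary = (
    toy_train.groupby("banner_pos")["click"]
    .agg(impressions="count", clicks="sum", ctr="mean")
)
print(summary)
```

The same call on the real `train` frame would replace the explicit loop and show the sample size behind each rate.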

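Task 3: Model training and evaluation

  • Call sklearn's logistic regression class LogisticRegression() to train the model.
  • Predict on the test set test_data and compute the metrics acc, pre, recall, and auc.
  • Plot the ROC curve (using the predicted probabilities rather than the predicted class labels).
  • Optional: implement a custom logistic regression MyLogisticRegression(), train and predict with it, and compare with the results above.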
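The ROC curve is drawn from probabilities because each point on it corresponds to one decision threshold; hard 0/1 labels represent a single threshold and leave only one interior point. The sketch below sweeps the threshold by hand with NumPy to show the idea (the function name `roc_points` and the four scores are illustrative assumptions; in practice `sklearn.metrics.roc_curve` does this work):

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (fpr, tpr) arrays obtained by sweeping a threshold over the scores."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    # Thresholds: +inf (nothing predicted positive), then each distinct score from high to low.
    thresholds = np.concatenate(([np.inf], np.sort(np.unique(scores))[::-1]))
    n_pos = (y_true == 1).sum()
    n_neg = (y_true == 0).sum()
    tpr = np.array([((scores >= t) & (y_true == 1)).sum() / n_pos for t in thresholds])
    fpr = np.array([((scores >= t) & (y_true == 0)).sum() / n_neg for t in thresholds])
    return fpr, tpr

y = [0, 0, 1, 1]
s = [0.1, 0.4, 0.35, 0.8]

# Curve from probabilities: one point per distinct score, plus the origin.
fpr, tpr = roc_points(y, s)
# Area under the curve by the trapezoidal rule.
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
print(len(fpr), auc)

# Curve from hard labels (threshold 0.5): only the two end points and one interior point.
hard = (np.asarray(s) >= 0.5).astype(int)
fpr_hard, tpr_hard = roc_points(y, hard)
print(len(fpr_hard))
```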
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score, roc_curve

test_data = test.drop('click', axis=1)
test_data = convert_obj_to_int(test_data)
test_label = test['click']

# sklearn logistic regression; max_iter is raised to reduce the risk of non-convergence
clf = linear_model.LogisticRegression(max_iter=1000)

# Train the model
clf.fit(train_data, train_label)
print("Finish Training!")

# Predict class labels and the probability of the positive class (click = 1)
pred = clf.predict(test_data)
pred_proba = clf.predict_proba(test_data)[:, 1]

# Compute and print acc, pre, recall, and auc
acc = accuracy_score(test_label, pred)
pre = precision_score(test_label, pred)
recall = recall_score(test_label, pred)
auc = roc_auc_score(test_label, pred_proba)
print(f"Accuracy: {acc:.4f}, Precision: {pre:.4f}, Recall: {recall:.4f}, AUC: {auc:.4f}")

# Plot the ROC curve from the predicted probabilities
fpr, tpr, _ = roc_curve(test_label, pred_proba)
plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, label=f'AUC = {auc:.4f}')
plt.plot([0,1], [0,1], 'k--')  # diagonal: performance of random guessing
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve (sklearn)')
plt.legend()
plt.show()

# Custom logistic regression MyLogisticRegression(): the implementation is not included in this write-up

Output:

Finish Training!
Accuracy: 0.8120, Precision: 0.0000, Recall: 0.0000, AUC: 0.4983


f:\Anconda\Anconda\envs\general\Lib\site-packages\sklearn\metrics\_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
[Figure: ROC Curve (sklearn), AUC = 0.4983]
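Reading the sklearn numbers together: zero precision and recall, plus the warning about "no predicted samples", indicate that the model predicted class 0 for every test row. In that case accuracy equals the share of non-click rows, so 0.8120 implies 188 clicks among the 1,000 test rows, and an AUC of 0.4983 means the predicted probabilities rank clicks no better than chance. This is an inference from the reported output, not something the post states. The sketch below reproduces the three label-based metrics for such an all-negative predictor on synthetic labels (the helper `binary_metrics` is a hypothetical name):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision and recall for 0/1 labels; undefined ratios are reported as 0.0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # no predicted positives
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # no actual positives
    return accuracy, precision, recall

# Synthetic labels: 188 clicks among 1000 rows, and a predictor that always answers 0.
y_true = [1] * 188 + [0] * 812
y_pred = [0] * 1000

acc, pre, rec = binary_metrics(y_true, y_pred)
print(f"Accuracy: {acc:.4f}, Precision: {pre:.4f}, Recall: {rec:.4f}")
```

Accuracy alone therefore says little on data this imbalanced; recall and AUC are the more informative numbers here.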


Custom Model - Accuracy: 0.8240, Precision: 0.6875, Recall: 0.1170, AUC: 0.6580

[Figure: plot for the custom model; image not preserved]
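The code behind the "Custom Model" line is not shown in the post. For readers who want a starting point for the optional task, here is a minimal sketch of what a `MyLogisticRegression` class might look like: batch gradient descent on the log loss, with the features standardized first, since the raw inputs include hashed integers as large as 2**32. The hyperparameters `lr` and `n_iter`, the standardization step, and the synthetic check are all assumptions, and this sketch is not claimed to reproduce the numbers above.

```python
import numpy as np

class MyLogisticRegression:
    """Minimal binary logistic regression trained with batch gradient descent."""

    def __init__(self, lr=0.1, n_iter=1000):
        self.lr = lr          # learning rate (assumed value)
        self.n_iter = n_iter  # number of full passes over the data (assumed value)

    @staticmethod
    def _sigmoid(z):
        # Clip to avoid overflow in exp for large |z|.
        return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        # Standardize each feature; keep the statistics for prediction time.
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        self.std_[self.std_ == 0] = 1.0  # constant columns: avoid division by zero
        Xs = (X - self.mean_) / self.std_
        n_samples, n_features = Xs.shape
        self.w_ = np.zeros(n_features)
        self.b_ = 0.0
        for _ in range(self.n_iter):
            p = self._sigmoid(Xs @ self.w_ + self.b_)
            error = p - y  # gradient of the log loss with respect to the logits
            self.w_ -= self.lr * (Xs.T @ error) / n_samples
            self.b_ -= self.lr * error.mean()
        return self

    def predict_proba(self, X):
        # Returns a 1-D array with P(y = 1), unlike sklearn's two-column output.
        Xs = (np.asarray(X, dtype=float) - self.mean_) / self.std_
        return self._sigmoid(Xs @ self.w_ + self.b_)

    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)

# Sanity check on synthetic, linearly separable data (not the Avazu data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

model = MyLogisticRegression(lr=0.5, n_iter=500).fit(X_demo, y_demo)
proba_demo = model.predict_proba(X_demo)
train_acc = (model.predict(X_demo) == y_demo).mean()
print(f"training accuracy on synthetic data: {train_acc:.3f}")
```

On the lab data, the class could be used like the sklearn model, for example `MyLogisticRegression().fit(train_data, train_label)`, with `predict_proba(test_data)` passed to `roc_auc_score` directly.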