Machine Learning Lab 3: Logistic Regression (Ad Click-Through Rate Prediction)

Posted on 2025-03-21 | Updated on 2025-03-22 | Category: Sophomore Year (Spring), Machine Learning

Ad Click-Through Rate Prediction

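Click-through rate (CTR) prediction is a typical application in the advertising industry, and CTR is a key metric for evaluating ad performance. A prediction model is trained on historical data and then applied to each day's incremental data, so that the samples whose predicted CTR meets the standard can be selected for delivery.

Dataset overview

The dataset comes from Kaggle and contains 10 days of Avazu ad click data, with 10,000 training samples and 1,000 test samples. Each record includes attributes such as the ad id, the time, and the ad position.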

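Task 1: Import libraries, load the dataset, and preprocess the data

Read the training and test data, then split each into data and label.
Convert string features to integers. There are two options: 1) apply one-hot encoding, which produces high-dimensional sparse features and increases memory use; 2) use Python's built-in hash function to map each feature of type object to an integer in a fixed range (each original string becomes an integer), which greatly reduces memory use.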
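One caveat with option 2: Python salts the built-in hash() of strings per process by default, so the integers can differ from one run to the next unless PYTHONHASHSEED is fixed. The sketch below is not part of the lab; it swaps in zlib.crc32, which returns the same unsigned 32-bit value on every run. The helper name stable_hash and the toy site_id values are assumptions made for illustration.

```python
import zlib

import pandas as pd


def stable_hash(value):
    """Map a string to an integer in [0, 2**32) using CRC32 (hypothetical helper)."""
    return zlib.crc32(str(value).encode("utf-8"))


# Toy column standing in for a feature such as site_id (values are made up).
toy = pd.DataFrame({"site_id": ["a1b2", "c3d4", "a1b2"]})
toy["site_id_int"] = toy["site_id"].map(stable_hash)
print(toy)
```

Identical strings always map to the same integer, within a run and across runs, so train and test features stay consistent even when they are processed in separate sessions.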

import gzip
import pandas as pd
import random
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model

types_train = {
    'id': np.dtype(int),
    'click': np.dtype(int),         # whether the ad was clicked: 1 = clicked, 0 = not clicked
    'hour': np.dtype(int),          # date and hour when the ad was shown
    'C1': np.dtype(int),            # anonymized categorical variable
    'banner_pos': np.dtype(int),    # ad position
    'site_id': np.dtype(str),       # site id
    'site_domain': np.dtype(str),   # site domain
    'site_category': np.dtype(str), # site category
    'app_id': np.dtype(str),        # app id
    'app_domain': np.dtype(str),    # app domain
    'app_category': np.dtype(str),  # app category
    'device_id': np.dtype(str),     # device id
    'device_ip': np.dtype(str),     # device IP
    'device_model': np.dtype(str),  # device model
    'device_type': np.dtype(int),   # device type
    'device_conn_type': np.dtype(int),  # device connection type
    'C14': np.dtype(int),   # anonymized categorical variable
    'C15': np.dtype(int),   # anonymized categorical variable
    'C16': np.dtype(int),   # anonymized categorical variable
    'C17': np.dtype(int),   # anonymized categorical variable
    'C18': np.dtype(int),   # anonymized categorical variable
    'C19': np.dtype(int),   # anonymized categorical variable
    'C20': np.dtype(int),   # anonymized categorical variable
    'C21': np.dtype(int)    # anonymized categorical variable
}

# Column names
header_row = ['id', 'click', 'hour', 'C1', 'banner_pos', 'site_id', 'site_domain', 'site_category',
              'app_id', 'app_domain', 'app_category', 'device_id', 'device_ip', 'device_model',
              'device_type', 'device_conn_type', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19',
              'C20', 'C21']

# Read the training and test data
train = pd.read_csv('train_data.csv', names=header_row, dtype=types_train)
test = pd.read_csv('test_data.csv', names=header_row, dtype=types_train)
# Drop row 0 (it holds the column indices, not a sample)
train = train.drop(labels=train.index.values[0])
test = test.drop(labels=test.index.values[0])
print(test.shape)

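# Split into data and label
train_data = train.drop('click', axis=1)  # drop the click column
print(train_data.shape)
train_label = train['click']  # extract the click column

# Preprocessing
# One-hot encode the non-numeric features with pd.get_dummies; this yields high-dimensional sparse features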
train_data1 = pd.get_dummies(train_data)
print(train_data1.shape)

# convert_obj_to_int() converts string features to integers
def convert_obj_to_int(df):
    new_col_suffix = '_int'
    # Check the dtype column by column; positional indexing into df.dtypes
    # (dtypes[i]) is deprecated and triggers a pandas FutureWarning.
    object_columns = [col for col in df.columns if df[col].dtype == object]
    for col in object_columns:
        # Use hash() and map() to map each string to an integer in [0, 2**32).
        # Note: str hashes differ between Python processes unless PYTHONHASHSEED is set.
        df[col + new_col_suffix] = df[col].map(lambda x: hash(x) % (1 << 32))
        df.drop([col], inplace=True, axis=1)
    return df

# Call convert_obj_to_int() to convert the string columns to int
train_data = convert_obj_to_int(train_data)
print(train_data.shape)

(1000, 24)
(10000, 23)
(10000, 10531)
(10000, 23)


C:\Users\29020\AppData\Local\Temp\ipykernel_71456\1472409378.py:66: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  if object_list_dtypes[index] == object:

Task 2: Feature analysis

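Taking the ad's position on the page (banner_pos) as an example, examine the relationship between banner_pos and the final label (click):
- Check the distribution of banner_pos values in the dataset.
- Check the click-through rate for each banner_pos value.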

# Distribution of banner_pos values in the dataset
print(train.banner_pos.value_counts()/len(train))

# Click-through rate for each banner_pos value
banner_pos_val = train.banner_pos.unique()
banner_pos_val.sort()
ctr_avg_list = []
for i in banner_pos_val:
    selected_data = train.loc[train.banner_pos == i]
    ctr_avg = selected_data.click.mean()
    ctr_avg_list.append(ctr_avg)
    print(" banner position: {},  CTR: {}".format(i, ctr_avg))

banner_pos
0    0.8041
1    0.1951
2    0.0007
4    0.0001
Name: count, dtype: float64
 banner position: 0,  CTR: 0.16975500559631887
 banner position: 1,  CTR: 0.19067145053818554
 banner position: 2,  CTR: 0.14285714285714285
 banner position: 4,  CTR: 0.0
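The same per-position summary can be computed in one step with groupby. This is a minimal sketch on made-up toy data with the same two column names as the lab's train frame; the toy values and the names share and ctr are assumptions for illustration.

```python
import pandas as pd

# Toy data with the same two columns as the lab's train frame (values are made up).
toy = pd.DataFrame({
    "banner_pos": [0, 0, 0, 1, 1, 2],
    "click":      [1, 0, 0, 1, 0, 0],
})

# One row per banner_pos: number of samples and mean of click (the CTR).
summary = toy.groupby("banner_pos")["click"].agg(share="count", ctr="mean")
# Turn the raw counts into each position's share of the dataset.
summary["share"] = summary["share"] / len(toy)
print(summary)
```

Putting share and CTR side by side makes thin groups easy to spot. In the lab's output, positions 2 and 4 cover only 0.07% and 0.01% of the 10,000 training rows (7 samples and 1 sample), so their CTR values rest on very little data.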

Task 3: Model training and evaluation

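Train a model with sklearn's LogisticRegression().
Predict on the test set test_data and compute the metrics acc, pre, recall, and auc for the predictions.
Plot the ROC curve (using the predicted probabilities rather than the predicted labels).
Optional: write a custom logistic regression MyLogisticRegression(), use it for training and prediction, and compare with the results above.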
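Why the ROC curve needs probabilities: with continuous scores, roc_curve can sweep many thresholds, while hard 0/1 labels leave only a single operating point between the two corners. A small sketch on made-up scores (the arrays below are assumptions, not the lab's data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])  # made-up predicted probabilities
labels = (scores >= 0.5).astype(int)                 # hard labels at threshold 0.5

# ROC from probabilities: one point per useful threshold.
fpr_s, tpr_s, _ = roc_curve(y_true, scores)
# ROC from hard labels: only (0, 0), one operating point, and (1, 1).
fpr_l, tpr_l, _ = roc_curve(y_true, labels)

auc_scores = roc_auc_score(y_true, scores)
auc_labels = roc_auc_score(y_true, labels)
print(len(fpr_s), len(fpr_l))
print(auc_scores, auc_labels)
```

The label-based curve discards the ranking information inside each predicted class, so its AUC generally differs from the score-based one.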

from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score, roc_curve

test_data = test.drop('click', axis=1)
test_data = convert_obj_to_int(test_data)
test_label = test['click']
# Use sklearn's LogisticRegression()
clf = linear_model.LogisticRegression(max_iter=1000)  # raise the iteration limit to avoid non-convergence

# Train the model
clf.fit(train_data, train_label)
print("Finish Training!")

# Predict: hard labels, plus the probability of the positive class (click = 1)
pred = clf.predict(test_data)
pred_proba = clf.predict_proba(test_data)[:, 1]

# Compute and print the model's acc, pre, recall, and auc
acc = accuracy_score(test_label, pred)
pre = precision_score(test_label, pred)
recall = recall_score(test_label, pred)
auc = roc_auc_score(test_label, pred_proba)
print(f"Accuracy: {acc:.4f}, Precision: {pre:.4f}, Recall: {recall:.4f}, AUC: {auc:.4f}")

# Plot the ROC curve from the predicted probabilities
fpr, tpr, _ = roc_curve(test_label, pred_proba)
plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, label=f'AUC = {auc:.4f}')
plt.plot([0,1], [0,1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve (sklearn)')
plt.legend()
plt.show()
# Custom logistic regression MyLogisticRegression() (implementation not shown)

Finish Training!
Accuracy: 0.8120, Precision: 0.0000, Recall: 0.0000, AUC: 0.4983


f:\Anconda\Anconda\envs\general\Lib\site-packages\sklearn\metrics\_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
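The UndefinedMetricWarning above says the sklearn model predicted no positive samples, so precision has a zero denominator (TP + FP = 0). The sketch below recomputes the metrics by hand for that situation. It assumes 188 positives among the 1,000 test samples, a count inferred from the reported accuracy of 0.8120 when every prediction is 0; the helper precision_recall is hypothetical.

```python
import numpy as np


def precision_recall(y_true, y_pred):
    """Return (precision, recall), reporting 0.0 when a denominator is zero."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Assumed label split: 188 clicks and 812 non-clicks (inferred, not read from the data).
y_true = np.array([1] * 188 + [0] * 812)
y_pred = np.zeros(1000, dtype=int)  # a model that always predicts "no click"

acc = float(np.mean(y_true == y_pred))
pre, rec = precision_recall(y_true, y_pred)
print(f"Accuracy: {acc:.4f}, Precision: {pre:.4f}, Recall: {rec:.4f}")
```

An always-negative classifier therefore reaches 0.8120 accuracy with zero precision and recall, which is why accuracy alone is a weak metric on imbalanced click data. In sklearn, the zero_division parameter of precision_score controls the value reported in this case.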

Custom Model - Accuracy: 0.8240, Precision: 0.6875, Recall: 0.1170, AUC: 0.6580
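
The post reports the custom model's metrics but does not show its code. The following is a minimal sketch of what a MyLogisticRegression class could look like, using batch gradient descent on the log loss with standardized inputs. It is not the author's implementation and is not expected to reproduce the numbers above; the hyperparameters lr and n_iter and the synthetic data are assumptions.

```python
import numpy as np


class MyLogisticRegression:
    """Binary logistic regression fitted with batch gradient descent (illustrative sketch)."""

    def __init__(self, lr=0.1, n_iter=2000):
        self.lr = lr          # learning rate (assumed value)
        self.n_iter = n_iter  # number of gradient steps (assumed value)

    @staticmethod
    def _sigmoid(z):
        # Clip z so that np.exp cannot overflow.
        return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        # Standardize features; raw hashed values up to 2**32 would make the steps unstable.
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        self.std_[self.std_ == 0] = 1.0
        Xs = (X - self.mean_) / self.std_
        n, d = Xs.shape
        self.w_ = np.zeros(d)
        self.b_ = 0.0
        for _ in range(self.n_iter):
            p = self._sigmoid(Xs @ self.w_ + self.b_)
            # Gradient of the mean log loss with respect to weights and bias.
            self.w_ -= self.lr * (Xs.T @ (p - y) / n)
            self.b_ -= self.lr * np.mean(p - y)
        return self

    def predict_proba(self, X):
        """Return P(y = 1) as a 1-D array (unlike sklearn, which returns two columns)."""
        Xs = (np.asarray(X, dtype=float) - self.mean_) / self.std_
        return self._sigmoid(Xs @ self.w_ + self.b_)

    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)


# Synthetic data with very different feature scales (made up for this sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([1.0, 100.0, 1e6])
z = X[:, 0] + X[:, 1] / 100.0 - X[:, 2] / 1e6
y = (z + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = MyLogisticRegression().fit(X, y)
train_acc = float(np.mean(model.predict(X) == y))
print(f"Training accuracy on synthetic data: {train_acc:.4f}")
```

Standardizing inside fit is a design choice of this sketch: the hashed integer features in this lab span a very wide range, and unscaled inputs of that size tend to stall or destabilize plain gradient descent.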