Machine Learning, Lab 3: Logistic Regression (Ad Click-Through Rate Prediction)
Ad Click-Through Rate Prediction
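Click-through rate (CTR) prediction is a typical application in the advertising industry, and CTR is a key metric for evaluating ad effectiveness. A prediction model is trained on historical data and then applied to each day's incremental data, so that ads whose predicted CTR meets the required standard can be selected for serving.

## Dataset

The dataset comes from Kaggle and contains 10 days of Avazu ad click data. This lab uses 10,000 training samples and 1,000 test samples. Each record includes attributes such as the ad id, the time, and the ad position.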
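Task 1: Import libraries and the dataset, and preprocess the data
- Read the training and test data, and split each into data and label
- Convert string-type features to int. There are two options: 1) one-hot encoding, which produces high-dimensional sparse features and increases memory overhead; 2) Python's built-in hash function, which maps features of type object to integers within a given range (each original string becomes an integer) and greatly reduces memory consumption.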
import gzip
(1000, 24)
(10000, 23)
(10000, 10531)
(10000, 23)
C:\Users\29020\AppData\Local\Temp\ipykernel_71456\1472409378.py:66: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
if object_list_dtypes[index] == object:
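The hashing step described above can be sketched as follows. This is a minimal illustration, not the notebook's original code: the helper name `hash_object_columns`, the bucket count `n_buckets`, and the toy values are assumptions made for the example. It loops over columns by name rather than indexing the dtypes Series by position, which is the pattern the FutureWarning above complains about.

```python
import pandas as pd


def hash_object_columns(df: pd.DataFrame, n_buckets: int = 1_000_000) -> pd.DataFrame:
    """Return a copy of df with every string-valued column mapped to integers.

    `n_buckets` is a hypothetical parameter that bounds the range of the codes.
    """
    out = df.copy()
    # Loop over column names instead of integer positions, so no positional
    # lookup into a label-indexed Series is needed.
    for col in out.columns:
        if pd.api.types.is_string_dtype(out[col]):
            # Built-in hash, folded into [0, n_buckets).
            out[col] = out[col].map(lambda v: hash(v) % n_buckets)
    return out


# Toy frame: column names follow the Avazu schema, the values are made up.
toy = pd.DataFrame({
    "site_id": ["1fbe01fe", "85f751fd", "1fbe01fe"],
    "device_type": [1, 1, 0],
    "app_id": ["ecad2386", "ecad2386", "febd1138"],
})

hashed = hash_object_columns(toy)   # same number of columns as the input
one_hot = pd.get_dummies(toy)       # one extra column per distinct string value

print(hashed.dtypes)
print("hashed:", hashed.shape, "one-hot:", one_hot.shape)
```

One caveat: Python salts `hash()` for strings per interpreter process unless `PYTHONHASHSEED` is fixed, so the training and test sets need to be hashed in the same session (or with a stable hash such as one from `hashlib`) for the integer codes to agree.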
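Task 2: Feature analysis
Taking the ad's position on the page (banner_pos) as an example, examine the relationship between banner_pos and the final class label (click):
- Check the distribution of banner_pos values in the dataset;
- Check how each banner_pos value contributes to the click rate.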
# Distribution of banner_pos values in the dataset
banner_pos
0 0.8041
1 0.1951
2 0.0007
4 0.0001
Name: count, dtype: float64
banner position: 0, click rate: 0.16975500559631887
banner position: 1, click rate: 0.19067145053818554
banner position: 2, click rate: 0.14285714285714285
banner position: 4, click rate: 0.0
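Both statistics can be computed with a couple of pandas calls. The sketch below uses a small made-up frame (the values are not from the Avazu data) and assumes the columns are named `banner_pos` and `click` as in the text.

```python
import pandas as pd

# Made-up sample; only the column names match the dataset.
toy = pd.DataFrame({
    "banner_pos": [0, 0, 0, 0, 1, 1, 1, 1],
    "click":      [0, 0, 0, 1, 0, 1, 0, 1],
})

# Share of rows at each banner position.
pos_share = toy["banner_pos"].value_counts(normalize=True).sort_index()

# Click rate per position, with the number of rows behind each rate.
stats = toy.groupby("banner_pos")["click"].agg(["mean", "count"])

print(pos_share)
for pos, row in stats.iterrows():
    print(f"banner position: {pos}, click rate: {row['mean']}, rows: {int(row['count'])}")
```

Reporting the row count next to each rate matters here. With 10,000 training rows, the shares 0.0007 and 0.0001 correspond to only 7 rows and 1 row, so the click rates for positions 2 and 4 (0.142857 is exactly 1/7) rest on too few samples to be reliable.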
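Task 3: Model training and evaluation
- Train a model with sklearn's logistic regression class, LogisticRegression()
- Predict on the test set test_data and compute the evaluation metrics acc, pre, recall, and auc
- Plot the ROC curve (using the predicted probabilities, not the predicted class labels)
- Optional: implement a custom logistic regression class, MyLogisticRegression(), train and predict with it, and compare against the results above.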
test_data = test.drop('click', axis=1)
Finish Training!
Accuracy: 0.8120, Precision: 0.0000, Recall: 0.0000, AUC: 0.4983
f:\Anconda\Anconda\envs\general\Lib\site-packages\sklearn\metrics\_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
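An accuracy of 0.8120 together with zero precision and recall, plus the "no predicted samples" warning, is what one would see if the model predicted "no click" for every test row (which would put the share of clicked rows at about 18.8%). The sketch below walks through the training and evaluation steps on synthetic data from `make_classification`, since the notebook's data loading code is not shown; the variable names are illustrative. It passes `zero_division=0` to make the zero-precision case explicit, and feeds probabilities rather than labels into the AUC and ROC computations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the ad data (about 20% positives).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)               # hard 0/1 labels
y_prob = model.predict_proba(X_test)[:, 1]   # probability of the positive class

acc = accuracy_score(y_test, y_pred)
# zero_division=0 returns 0.0 without a warning when nothing is predicted positive.
pre = precision_score(y_test, y_pred, zero_division=0)
rec = recall_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob)          # uses probabilities, not labels

# Points of the ROC curve; fpr and tpr can be handed to any plotting library.
fpr, tpr, thresholds = roc_curve(y_test, y_prob)

print(f"Accuracy: {acc:.4f}, Precision: {pre:.4f}, Recall: {rec:.4f}, AUC: {auc:.4f}")
```

One possible contributor to the near-chance AUC in the output above is that hashed features are large, unscaled integers with no ordinal meaning; standardizing the inputs or treating the hashed values as categorical may be worth trying.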
Custom Model - Accuracy: 0.8240, Precision: 0.6875, Recall: 0.1170, AUC: 0.6580
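The notebook's MyLogisticRegression code is not shown, so the following is only one plausible shape for it: a minimal sketch using batch gradient descent on the log loss, with hypothetical hyperparameters `lr` and `n_iter`. Unlike sklearn, its `predict_proba` returns a 1-D array of positive-class probabilities.

```python
import numpy as np


class MyLogisticRegression:
    """Binary logistic regression trained with batch gradient descent (sketch)."""

    def __init__(self, lr=0.1, n_iter=2000):
        self.lr = lr          # learning rate (assumed value)
        self.n_iter = n_iter  # number of full-batch updates (assumed value)
        self.w = None
        self.b = 0.0

    @staticmethod
    def _sigmoid(z):
        # Clip to keep exp() from overflowing on extreme inputs.
        return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)
        self.b = 0.0
        for _ in range(self.n_iter):
            p = self._sigmoid(X @ self.w + self.b)
            error = p - y                              # gradient of log loss w.r.t. the logit
            self.w -= self.lr * (X.T @ error) / n_samples
            self.b -= self.lr * error.mean()
        return self

    def predict_proba(self, X):
        # 1-D array of P(y = 1 | x); sklearn would return two columns instead.
        return self._sigmoid(np.asarray(X, dtype=float) @ self.w + self.b)

    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)


# Linearly separable toy data, unrelated to the ad dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = MyLogisticRegression().fit(X, y)
probs = clf.predict_proba(X)
train_acc = (clf.predict(X) == y).mean()
print(f"training accuracy: {train_acc:.3f}")
```

As a sanity check on the reported custom-model metrics: if the 1,000-row test set holds 188 clicked rows, the numbers are consistent with 22 true positives, 10 false positives, 166 false negatives, and 802 true negatives, since 22/32 = 0.6875, 22/188 ≈ 0.1170, and (22 + 802)/1000 = 0.8240.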