Create Your First Machine Learning Model to Filter Spam

sjb_strat

March 3, 2018

CPOL

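Create a spam filter using machine learning.

Introduction

Machine learning lets us use mathematics and statistical probability to determine the output of our code based on data. This lets us write code that "evolves" over time, because its behavior follows changes in the data rather than specific hard-coded values or values stored somewhere.

For example, a customer's credit card usage changes and evolves with their shopping habits, and the credit card company needs to keep identifying fraudulent transactions. If a "threshold" were set in code or in a database, that value would need to be updated regularly, and determining it for a large number of customers would be costly and difficult. Regularly retraining a machine learning model to recognize fraudulent activity from actual data is far more maintainable.

In this article, we will use "supervised learning" to determine whether a message is "spam" or "ham" (spam or not spam). Supervised learning means we have a set of data containing messages that have already been identified as "spam" or "ham", and we use that data to train a machine learning model so that it can identify new messages as spam or ham. The decision is based on how statistically similar a new message is to the messages the model was trained on.

Background

If you have some familiarity with programming and an interest in machine learning, you should be able to follow this tutorial. The data provided by CodeProject looks like this: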

# Spam training data
Spam,<p>But could then once pomp to nor that glee glorious of deigned. The vexed times 
childe none native. To he vast now in to sore nor flow and most fabled. 
The few tis to loved vexed and all yet yea childe. Fulness consecrate of it before his 
a a a that.</p><p>Mirthful and and pangs wrong. Objects isle with partings ancient 
made was are. Childe and gild of all had to and ofttimes made soon from to long youth 
way condole sore.</p>
Spam,<p>His honeyed and land vile are so and native from ah to ah it like flash in not. 
That gild by in basked they lemans passed way who talethis forgot deigned nor friends 
his before strange. Found long little the. Talethis have soon of hellas had

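Each line starts with a value indicating "Spam" or "Ham", followed by a <p> tag and then the message content. The file is also split into training and testing sections (more on this later).

Importing Libraries

Here, as in many languages, we import the libraries our code needs. We will go into more detail on what we are using below.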

import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

from sklearn.utils import shuffle
from sklearn.metrics import precision_score, classification_report, accuracy_score

import time

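Loading and Parsing the Data

Yes, data scientist may be the sexiest job of the 21st century, but it takes a lot of time to do the far less sexy work of parsing, cleaning, and understanding the data. For this project, about 85% of my time was spent here.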

def get_data():
    file_name = './SpamDetectionData.txt'
    with open(file_name, 'r') as rawdata:  # close the file once it has been read
        lines = rawdata.readlines()
    lines = lines[1:] #get rid of "header"
    spam_train = lines[0:1000]
    ham_train = lines[1002:2002]
    test_mix = lines[2004:]
    return (spam_train, ham_train, test_mix)

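In the get_data() function, we get the training and testing data from the file provided by CodeProject. We read the raw data from the file and store it in a list. In lines like these: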

  spam_train = lines[0:1000]

  ham_train = lines[1002:2002]

  test_mix = lines[2004:]

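we are simply splitting parts of that list into separate lists with meaningful names. See below for more details on how and why we use testing and training data.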
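The hard-coded slice boundaries above depend on the exact line counts in the file. As a rough alternative sketch, assuming each section of the file starts with a header line beginning with '#' (as the "# Spam training data" line in the sample suggests), the lines could be grouped by section instead. The helper name split_sections and the other two header names are made up for illustration and are not taken from the real file.

```python
def split_sections(lines):
    """Group non-blank lines under the most recent '#' header (hypothetical helper)."""
    sections = {}
    current = None
    for line in lines:
        stripped = line.strip()  # also drops the trailing newline
        if not stripped:
            continue  # skip blank separator lines
        if stripped.startswith('#'):
            current = stripped.lstrip('#').strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(stripped)
    return sections

# Tiny in-memory stand-in for the real file; header names are assumptions.
sample_lines = [
    '# Spam training data\n',
    'Spam,<p>first spam message</p>\n',
    'Spam,<p>second spam message</p>\n',
    '\n',
    '# Ham training data\n',
    'Ham,<p>first ham message</p>\n',
    '\n',
    '# Test data\n',
    'Ham,<p>a test message</p>\n',
]

sections = split_sections(sample_lines)
for name, rows in sections.items():
    print(name, len(rows))
```

This trades the fixed indices for a dependency on the header format, so it only helps if the headers really are present and consistent.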

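Creating the Pandas DataFrame

For the vast majority of the data/feature engineering you will do in machine learning, you will use Pandas, because it provides a very powerful (if sometimes confusing) set of tools for working with data. Here we create a DataFrame, which can be thought of as an in-memory "table" with rows and columns. One column holds the content of the spam/ham message, and the other holds a binary flag (or class, in data science lingo) indicating whether the message is "spam" (1) or "ham" (0).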

def create_dataframe(input_array):    
    spam_indicator = 'Spam,<p>'
    message_class = np.array([1 if spam_indicator in item else 0 for item in input_array])
    data = pd.DataFrame()
    data['class'] = message_class
    data['message'] = input_array
    return data

Before we clean the data in the next step, our DataFrame looks like this, including the "class" column we just added.

DataFrame before the data is cleaned
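To get a feel for the shape of this DataFrame without the full data file, here is a minimal sketch on two made-up lines. The names toy_lines and toy_df are illustrative and not part of the original code; the labeling rule is the same one create_dataframe() uses.

```python
import numpy as np
import pandas as pd

# Two made-up lines in the same "Label,<p>text</p>" format as the data file.
toy_lines = [
    'Spam,<p>But could then once pomp to nor that glee</p>\n',
    'Ham,<p>door beguiling cushions did</p>\n',
]

# Same labeling rule as create_dataframe(): 1 for spam, 0 for ham.
toy_classes = np.array([1 if 'Spam,<p>' in line else 0 for line in toy_lines])

toy_df = pd.DataFrame({'class': toy_classes, 'message': toy_lines})
print(toy_df)
```

At this stage the message column still contains the label prefix, the <p> tags, and the trailing newline.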

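Removing Words and Shuffling the Data

Here, based on what I saw in the data, we want to remove any extraneous text that carries no meaning, such as <p>, as well as the obvious markers that were supplied to indicate the message type, such as "Ham,<p>", which will not appear in the real messages we are trying to classify. When we find any of these, we simply replace them with an empty string.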

words_to_remove = ['Ham,<p>', 'Spam,<p>', '<p>', '</p>', '\n']

def remove_words(input_line, key_words=words_to_remove):
    temp = input_line
    for word in key_words:
        temp = temp.replace(word, '')
    return temp
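One detail worth noticing is that the order of the entries in words_to_remove matters: the label markers have to be removed before the bare <p> tag. The sketch below uses strip_markers, an illustrative copy of remove_words(), on a made-up line to show what happens if the order is changed.

```python
words_to_remove = ['Ham,<p>', 'Spam,<p>', '<p>', '</p>', '\n']

def strip_markers(line, markers):
    """Remove each marker in turn, like remove_words() above (illustrative copy)."""
    for marker in markers:
        line = line.replace(marker, '')
    return line

raw_line = 'Ham,<p>door beguiling cushions did.</p>\n'

# Label markers first: the label and the tags are all removed.
print(repr(strip_markers(raw_line, words_to_remove)))
# 'door beguiling cushions did.'

# If '<p>' is removed first, 'Ham,<p>' no longer matches and 'Ham,' is left behind.
wrong_order = ['<p>', 'Ham,<p>', 'Spam,<p>', '</p>', '\n']
print(repr(strip_markers(raw_line, wrong_order)))
# 'Ham,door beguiling cushions did.'
```

A leftover "Ham," or "Spam," would leak the answer into the message text, which is exactly what this cleaning step is meant to prevent.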

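Here we apply the filtering above to our DataFrame and then shuffle the data. Shuffling is not required for the data provided by CodeProject, because it is already split into training and testing data. If you run this process on other datasets, however, you should always shuffle the data before splitting it into training and testing sets, so that each set contains roughly equal numbers of each type of example (in this case, spam and ham). If the sets are unbalanced, they can easily introduce bias into the training/testing process.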

    
def remove_words_and_shuffle(input_dataframe, input_random_state=7):
    input_dataframe['message'] = input_dataframe['message'].apply(remove_words)
    messages, classes = shuffle(input_dataframe['message'], input_dataframe['class'], 
              random_state=input_random_state)
    df_return = pd.DataFrame()
    df_return['class'] = classes
    df_return['message'] = messages
    return df_return

The cleaned-up DataFrame looks like this:

DataFrame after it is cleaned up
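To illustrate why shuffling before a split matters on other datasets, here is a small sketch with made-up messages sorted by class (spam first, then ham). The 80/20 split and the toy data are assumptions for illustration; the article's own data arrives already split.

```python
from sklearn.utils import shuffle

# Toy data sorted by class: 50 spam messages followed by 50 ham messages.
messages = (['spam message %d' % i for i in range(50)]
            + ['ham message %d' % i for i in range(50)])
classes = [1] * 50 + [0] * 50

# Naive 80/20 split without shuffling: the test portion contains only ham.
naive_test_classes = classes[80:]
print('spam in naive test set:', sum(naive_test_classes))

# Shuffle messages and labels together (pairs stay aligned), then split.
shuffled_messages, shuffled_classes = shuffle(messages, classes, random_state=7)
test_classes = shuffled_classes[80:]
print('spam in shuffled test set:', sum(test_classes), 'of', len(test_classes))
```

Without the shuffle, a model could never be tested on spam at all, so its reported accuracy would say nothing about spam detection.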

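Training and Testing Our Models

This is what it is all about: using the training data to train our machine learning models, then using the testing data to determine each model's accuracy and how well it performs.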

def test_models(X_train_input_raw, y_train_input, X_test_input_raw, y_test_input, models_dict):

    return_trained_models = {}
    
    return_vectorizer = FeatureUnion([('tfidf_vect', TfidfVectorizer())])
    
    X_train = return_vectorizer.fit_transform(X_train_input_raw)
    X_test = return_vectorizer.transform(X_test_input_raw)
    
    for key in models_dict:
        model_name = key
        model = models_dict[key]
        t1 = time.time()
        model.fit(X_train, y_train_input)
        t2 = time.time()
        predicted_y = model.predict(X_test)
        t3 = time.time()
        
        output_accuracy(y_test_input, predicted_y, model_name, t2 - t1, t3 - t2)        
        return_trained_models[model_name] = model
        
    return (return_trained_models, return_vectorizer)

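There is a lot going on in this code, so we will go through it step by step. First, let's look at the parameters:

X_train_input_raw - the "raw" spam/ham messages we will use to train the models
y_train_input - a 0 or 1 for each message in X_train_input_raw, indicating whether it is ham or spam
X_test_input_raw - the "raw" spam/ham messages we will use to test the accuracy of the trained models
y_test_input - a 0 or 1 for each message in X_test_input_raw, indicating whether it is ham or spam
models_dict - a dictionary mapping each model's name to the model to train and test

return_trained_models = {} is a dictionary that will hold the models we have trained so they can be used later.

return_vectorizer = FeatureUnion([('tfidf_vect', TfidfVectorizer())]) sets up a TfidfVectorizer to apply to the incoming messages. Essentially, we are converting a string of words (a message) into a vector (an array) of numbers derived from how often each of those words occurs.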
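As a simple illustration of turning text into a vector, the sketch below runs CountVectorizer (imported earlier) on two made-up messages. It stores raw word counts, whereas the TfidfVectorizer used in test_models() stores weighted values, so treat this as a simplified picture rather than what the article's pipeline produces.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_messages = ['free money free', 'meeting at noon']

count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(toy_messages)

# vocabulary_ maps each word to its column index in the matrix.
print(count_vectorizer.vocabulary_)

# One row per message, one column per vocabulary word.
print(counts.toarray())
```

The word "free" occurs twice in the first message, so its column holds a 2 in the first row and a 0 in the second.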

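In addition, TF-IDF (term frequency-inverse document frequency) weights each term according to how frequently it appears in the source documents.

The tf-idf value increases proportionally to the number of times a word appears in the document and is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general. Nowadays, tf-idf is one of the most popular term-weighting schemes; 83% of text-based recommender systems in the domain of digital libraries use tf-idf. (Source)

This means that terms that appear less often overall, but more often in a particular type of document, carry a higher weight. For example, words such as "free" or "viagra" do not appear often across all messages (spam and ham combined) but do appear often in spam, so they are weighted more heavily as an indication that the document is spam.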
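The weighting can be seen on a tiny made-up corpus. In the sketch below, "the" appears in every message and "viagra" in only one, so the rarer word ends up with the larger weight. Note that TF-IDF itself only looks at how rare a word is across the corpus and knows nothing about the spam/ham labels; it is the classifier that later learns which heavily weighted words point to spam.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "the" appears in every message, "viagra" in only one.
toy_corpus = [
    'the free viagra offer',
    'the meeting is today',
    'the report is ready',
    'the lunch is ready',
]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(toy_corpus)

# idf_ holds the learned inverse document frequency of each vocabulary word.
idf_the = tfidf.idf_[tfidf.vocabulary_['the']]
idf_viagra = tfidf.idf_[tfidf.vocabulary_['viagra']]
print('idf(the)    =', round(idf_the, 3))
print('idf(viagra) =', round(idf_viagra, 3))

# Within the first message, the rarer word receives the larger weight.
first_row = weights.toarray()[0]
weight_the = first_row[tfidf.vocabulary_['the']]
weight_viagra = first_row[tfidf.vocabulary_['viagra']]
print('weight(the)    =', round(weight_the, 3))
print('weight(viagra) =', round(weight_viagra, 3))
```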

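There are a large number of parameters that can be set and tuned to improve the accuracy of the model. Details on them can be found in the scikit-learn documentation for TfidfVectorizer.

Next, now that we have created the vectorizer, we "train" it on the training messages and use it to transform our set of test messages into vectors: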

X_train = return_vectorizer.fit_transform(X_train_input_raw)    

X_test = return_vectorizer.transform(X_test_input_raw)
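The difference between fit_transform() and transform() is easy to miss, so here is a minimal sketch on made-up messages. The vocabulary is learned from the training messages only; words that appear only in the test messages are ignored, and both matrices end up with the same columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_messages = ['free money now', 'meeting at noon']
test_messages = ['free lunch at noon', 'zebra xylophone']

vectorizer = TfidfVectorizer()
X_train_toy = vectorizer.fit_transform(train_messages)  # learns the vocabulary
X_test_toy = vectorizer.transform(test_messages)        # reuses it, learns nothing new

# Same number of columns: one per word seen during fitting.
print(X_train_toy.shape, X_test_toy.shape)

# "lunch" never appeared in the training messages, so it has no column.
print('lunch' in vectorizer.vocabulary_)

# A message made only of unseen words becomes an all-zero row.
print(X_test_toy.toarray()[1])
```

Calling fit_transform() on the test messages instead would build a different vocabulary, and the trained models would no longer line up with the columns they were trained on.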

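The final step is to loop through the dictionary of models that was passed in, train each model, use it to predict the test data, and output each model's accuracy.

Outputting the Results of Model Training

When we train the models, we want to see each model's name, the time it took to train, and its accuracy. This function helps with that: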

def output_accuracy(actual_y, predicted_y, model_name, train_time, predict_time):
    print('Model Name: ' + model_name)
    print('Train time: ', round(train_time, 2))
    print('Predict time: ', round(predict_time, 2))
    print('Model Accuracy: {:.4f}'.format(accuracy_score(actual_y, predicted_y)))
    print('Model Precision: {:.4f}'.format(precision_score(actual_y, predicted_y)))
    print('')
    print(classification_report(actual_y, predicted_y, digits=4))
    print("=========================================================================")

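Creating the Dictionary of Models to Test

Here we create the dictionary of models we want to train and test for accuracy. This is where you can add more models to test, remove models that perform poorly, or change a model's parameters to determine which model best fits your needs.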

def create_models():
    models = {}
    models['LinearSVC'] = LinearSVC()
    models['LogisticRegression'] = LogisticRegression()
    models['RandomForestClassifier'] = RandomForestClassifier()
    models['DecisionTreeClassifier'] = DecisionTreeClassifier()
    return models
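As a sketch of what changing a model's parameters could look like, the hypothetical variant below passes explicit settings to three of the models. The function name, the dictionary keys, and the particular values are assumptions chosen for illustration, not tuned recommendations; C, max_iter, n_estimators, and random_state are standard scikit-learn constructor parameters.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

def create_model_variants():
    """Hypothetical variant of create_models() with explicit example settings."""
    models = {}
    models['LinearSVC_C0.5'] = LinearSVC(C=0.5)
    models['LogisticRegression_C10'] = LogisticRegression(C=10.0, max_iter=1000)
    models['RandomForest_200'] = RandomForestClassifier(n_estimators=200, random_state=7)
    return models

model_variants = create_model_variants()
for name, model in model_variants.items():
    print(name, type(model).__name__)
```

Because test_models() only needs a name-to-model dictionary, a dictionary like this could be passed to it in place of the one from create_models().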

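Putting It All Together

We bring all the steps together:

Get the data and create the DataFrames.
Clean and shuffle the data.
Separate the data into inputs (X) and outputs (y) for the training and testing sets.
Create the models.
Pass the models and data to the test_models() function to see how they perform.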

spam_train, ham_train, test_mix = get_data()

words_to_remove = ['Ham,<p>', 'Spam,<p>', '<p>', '</p>', '\n']

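# Build the DataFrames that the cleaning step below expects
df_train = create_dataframe(spam_train + ham_train)
df_test = create_dataframe(test_mix)

df_train_cleaned = remove_words_and_shuffle(df_train)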
df_test_cleaned = remove_words_and_shuffle(df_test)

X_train_raw = df_train_cleaned['message']
y_train = df_train_cleaned['class']

X_test_raw = df_test_cleaned['message']
y_test = df_test_cleaned['class']

models = create_models()

trained_models, fitted_vectorizer = test_models(
    X_train_raw, y_train, X_test_raw, y_test, models)

Output

Running the program produces the following output:

Model Name: LinearSVC
Train time:  0.01
Predict time:  0.0
Model Accuracy: 1.0000
Model Precision: 1.0000

             precision    recall  f1-score   support

          0     1.0000    1.0000    1.0000        57
          1     1.0000    1.0000    1.0000        43

avg / total     1.0000    1.0000    1.0000       100

======================================================
Model Name: LogisticRegression
Train time:  0.01
Predict time:  0.0
Model Accuracy: 0.4300
Model Precision: 0.4300

             precision    recall  f1-score   support

          0     0.0000    0.0000    0.0000        57
          1     0.4300    1.0000    0.6014        43

avg / total     0.1849    0.4300    0.2586       100

======================================================
Model Name: DecisionTreeClassifier
Train time:  0.02
Predict time:  0.0
Model Accuracy: 0.9800
Model Precision: 0.9556

             precision    recall  f1-score   support

          0     1.0000    0.9649    0.9821        57
          1     0.9556    1.0000    0.9773        43

avg / total     0.9809    0.9800    0.9800       100

======================================================
Model Name: RandomForestClassifier
Train time:  0.02
Predict time:  0.0
Model Accuracy: 0.9800
Model Precision: 0.9556

             precision    recall  f1-score   support

          0     1.0000    0.9649    0.9821        57
          1     0.9556    1.0000    0.9773        43

avg / total     0.9809    0.9800    0.9800       100

======================================================

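We can see how long it took to train each model, how long it took to predict the test data, and each model's accuracy, precision, and recall. Some of these terms need further explanation:

Accuracy - the ratio of correctly predicted observations to total observations (for us, what percentage of spam/ham messages were detected correctly)
Precision - the ratio of correctly predicted positive observations to all observations predicted as positive (of the messages we identified as spam, how many really were spam)
Recall - the ratio of correctly predicted positive observations to all observations that are actually positive (of all the messages that really are spam, how many did we identify correctly)
F1 score - the harmonic mean of precision and recall, a single score that balances the two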
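These definitions can be checked by hand against the DecisionTreeClassifier report above. The counts below are derived from its support and recall columns: all 43 spam messages were caught (recall 1.0000), and 55 of the 57 ham messages were recognized (recall 0.9649), leaving 2 ham messages wrongly flagged as spam.

```python
# Confusion counts derived from the DecisionTreeClassifier report above.
true_positives = 43   # spam predicted as spam
false_negatives = 0   # spam predicted as ham
false_positives = 2   # ham predicted as spam
true_negatives = 55   # ham predicted as ham

total = true_positives + false_negatives + false_positives + true_negatives

accuracy = (true_positives + true_negatives) / total
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)

print('Accuracy:  {:.4f}'.format(accuracy))   # 0.9800
print('Precision: {:.4f}'.format(precision))  # 0.9556
print('Recall:    {:.4f}'.format(recall))     # 1.0000
print('F1 score:  {:.4f}'.format(f1))         # 0.9773
```

The same arithmetic explains the LogisticRegression result: it labeled all 100 test messages as spam, so its precision for spam is 43/100 = 0.43 while its recall for spam is still 1.0.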

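Yes, these concepts are a bit tricky to understand and even harder to explain. These explanations were borrowed from http://blog.exsilio.com/ and combined with my notes relating them to our project. Please refer to that page for a more in-depth discussion of these topics.

Trying the Models with Your Own Messages

Finally, let's try the models on some messages of our own to see whether they are correctly identified as spam or ham.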

#from the sample ham and spam
ham = 'door beguiling cushions did. Evermore from raven from is beak shall name'
spam = 'The vexed times childe none native'
test_messages = [spam, ham]
transformed_test_messages = fitted_vectorizer.transform(test_messages)
trained_models['DecisionTreeClassifier'].predict(transformed_test_messages)

The output is:

array([1, 0])

This correctly identifies both the spam and the ham message.
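The two steps above (transform, then predict) can be wrapped in a small helper that returns readable labels instead of 0s and 1s. The name classify_messages is hypothetical, and the tiny model below is trained on made-up text as a stand-in for fitted_vectorizer and trained_models so that the sketch runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

def classify_messages(messages, vectorizer, model):
    """Return 'spam' or 'ham' for each raw message (hypothetical helper)."""
    features = vectorizer.transform(messages)
    predictions = model.predict(features)
    return ['spam' if label == 1 else 'ham' for label in predictions]

# Tiny stand-in for fitted_vectorizer and trained_models; the training text is made up.
toy_train = ['vexed childe native pomp', 'glee glorious deigned vexed',
             'raven beak evermore door', 'cushions beguiling raven name']
toy_labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

toy_vectorizer = TfidfVectorizer()
toy_model = DecisionTreeClassifier(random_state=7)
toy_model.fit(toy_vectorizer.fit_transform(toy_train), toy_labels)

labels = classify_messages(['The vexed times childe none native',
                            'Evermore from raven from is beak shall name'],
                           toy_vectorizer, toy_model)
print(labels)
```

With the real objects returned by test_models(), the call would be classify_messages(test_messages, fitted_vectorizer, trained_models['DecisionTreeClassifier']).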

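Conclusion

Machine learning, deep learning, and artificial intelligence are the future, and we as software engineers need to understand and embrace the power these technologies offer, because we can use them to more effectively solve the problems that companies and customers bring to us for help.

I have a blog dedicated to helping software engineers understand and develop their skills in machine learning, deep learning, and artificial intelligence. If you feel you learned something from this article, feel free to stop by my blog at CognitiveCoder.com.

Thank you for reading all the way to the end.

History

March 3, 2018 - Initial release
March 3, 2018 - Fixed broken image links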