Using Machine Learning to Determine the Programming Language of Text

sjb_strat

March 4, 2018

CPOL

Introduction

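Much of the code used to determine the programming language of a text string was already covered and discussed in my previous article, Create Your First Machine Learning Model to Filter Spam. In that article, we built several functions that are highly reusable across your machine learning pipelines.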

In this article, we will focus mainly on the parts that are unique to this problem.

Import the Necessary Libraries

The standard Python import statements:

import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

from sklearn.naive_bayes import MultinomialNB

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.utils import shuffle
from sklearn.metrics import precision_score, classification_report, accuracy_score

from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import LabelEncoder

import re
import time

Retrieve and Parse the Data

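I spent most of my time working out how to parse the data effectively: extracting the language name from the text, and then removing that information from the text so that it does not contaminate our training and test datasets.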

Here is an example of the text strings/snippets (each spans multiple lines and contains carriage returns):

<pre lang="Swift">
@objc func handleTap(sender: UITapGestureRecognizer) {
    if let tappedSceneView = sender.view as? ARSCNView {
        let tapLocationInView = sender.location(in: tappedSceneView)
        let planeHitTest = tappedSceneView.hitTest(tapLocationInView,
            types: .existingPlaneUsingExtent)
        if !planeHitTest.isEmpty {
            addFurniture(hitTest: planeHitTest)
        }
    }
}</pre>

<pre lang="JavaScript">
var my_dataset = [
   {
       id: "1",
       text: "Chairman & CEO",
       title: "Henry Bennett"
   },
   {
       id: "2",
       text: "Manager",
       title: "Mildred Kim"
   },
   {
       id: "3",
       text: "Technical Director",
       title: "Jerry Wagner"
   },
   { id: "1-2", from: "1", to: "2", type: "line" },
   { id: "1-3", from: "1", to: "3", type: "line" }
];</pre>

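The tricky part was getting a regular expression to return the data inside the "<pre lang...></pre>" tags, and then creating another regular expression that returns only the "lang" part of the "pre" tag.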

It is not pretty, and I am sure it could be optimized, but it works:

def get_data():
    file_name = './LanguageSamples.txt'
    #use a context manager so the file is closed after reading
    with open(file_name, 'r') as rawdata:
        lines = rawdata.readlines()
    return lines

def clean_data(input_lines):
    #find matches for all data within the pre tags
    all_found = re.findall(r'<pre[\s\S]*?<\/pre>', input_lines, re.MULTILINE)
    
    #clean the string of various tags
    clean_string = lambda x: (x.replace('&lt;', '<').replace('&gt;', '>')
                              .replace('</pre>', '').replace('\n', ''))
    all_found = [clean_string(item) for item in all_found]
    
    #get the language for all of the pre tags
    get_language = lambda x: re.findall(r'<pre lang="(.*?)">', x, re.MULTILINE)[0]
    lang_items = [get_language(item) for item in all_found]
    
    #remove all of the pre tags that contain the language
    remove_lang = lambda x: re.sub(r'<pre lang="(.*?)">', "", x)
    all_found = [remove_lang(item) for item in all_found]
    
    #return the text between the pre tags and their corresponding language
    return (all_found, lang_items)
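
As a possible simplification, the two regular expressions can be folded into one pattern with named groups, so the language and the snippet body come out of a single pass. This is only a sketch, not the code used for the results in this article: the names PRE_BLOCK and extract_samples and the sample string are made up for illustration, and html.unescape decodes every HTML entity rather than only &lt; and &gt;.

```python
import html
import re

# Hypothetical pattern: captures the lang attribute and the body in one pass.
PRE_BLOCK = re.compile(r'<pre lang="(?P<lang>[^"]*)">(?P<body>[\s\S]*?)</pre>')

def extract_samples(raw_text):
    """Return (bodies, languages) for every <pre lang="..."> block."""
    bodies, languages = [], []
    for match in PRE_BLOCK.finditer(raw_text):
        languages.append(match.group('lang'))
        # Decode HTML entities, then drop newlines as clean_data does.
        body = html.unescape(match.group('body')).replace('\n', '')
        bodies.append(body)
    return bodies, languages

# Made-up input in the same shape as the snippets shown earlier.
sample = (
    '<pre lang="Swift">\nlet x = 1</pre>\n'
    '<pre lang="JavaScript">\nvar ok = a &lt; b;</pre>\n'
)
bodies, languages = extract_samples(sample)
print(languages)
print(bodies)
```

Unlike clean_data, this version silently skips any pre tag that has no lang attribute instead of raising an IndexError, which may or may not be what you want.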

Create the Pandas DataFrame

Here we fetch the data, create a DataFrame, and populate it with the data.

all_samples = ''.join(get_data())
cleaned_data, languages = clean_data(all_samples)

df = pd.DataFrame()
df['lang_text'] = languages
df['data'] = cleaned_data

Here is what our DataFrame looks like:

Initial DataFrame

Create the Category Column

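The next thing we need to do is convert our "lang_text" column into a numeric column, since that is what many machine learning models expect as the "Y", the output they are trying to predict. To do this, we will use LabelEncoder to transform the "lang_text" column into a category column.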

lb_enc = LabelEncoder()
df['language'] = lb_enc.fit_transform(df['lang_text'])

Now our DataFrame looks like this:

DataFrame with new column

We can see how the column was encoded by running this code:

lb_enc.classes_

It displays something like the following (the position in the array matches the integer value in the new "language" category column):

array(['ASM', 'ASP.NET', 'Angular', 'C#', 'C++', 'CSS', 'Delphi', 'HTML',
       'Java', 'JavaScript', 'Javascript', 'ObjectiveC', 'PERL', 'PHP',
       'Pascal', 'PowerShell', 'Powershell', 'Python', 'Razor', 'React',
       'Ruby', 'SQL', 'Scala', 'Swift', 'TypeScript', 'VB.NET', 'XML'], dtype=object)
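
To make the position-to-integer relationship concrete, here is a small sketch on a made-up list of labels (not the article's dataset). It also shows why both 'JavaScript' and 'Javascript' appear in the array above: LabelEncoder treats labels that differ only in case as separate classes.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels for illustration; note the two spellings of JavaScript.
labels = ['C#', 'JavaScript', 'Javascript', 'Python', 'C#']

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ is sorted; each label's code is its index in classes_.
print(list(encoder.classes_))
print(list(codes))

# inverse_transform maps integer codes back to the original text labels.
decoded = list(encoder.inverse_transform(codes))
print(decoded)
```

If the differently cased spellings are meant to be the same language, normalizing the "lang_text" values before encoding would merge them into a single class.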

Boilerplate Code

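As mentioned in the introduction, much of the code for this project was discussed in my previous article, Create Your First Machine Learning Model to Filter Spam. If you would like more detail on the code below, I suggest reading that article.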

In summary, here are the next steps:

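Declare the function used to output the training results
Declare the function used to train and test the models
Declare the function used to create the models to test
Shuffle the data
Split the training and test data
Pass the data and the models into the training and testing function and look at the results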

def output_accuracy(actual_y, predicted_y, model_name, train_time, predict_time):
    print('Model Name: ' + model_name)
    print('Train time: ', round(train_time, 2))
    print('Predict time: ', round(predict_time, 2))
    print('Model Accuracy: {:.4f}'.format(accuracy_score(actual_y, predicted_y)))
    print('')
    print(classification_report(actual_y, predicted_y, digits=4))
    print("=======================================================")

def test_models(X_train_input_raw, y_train_input, X_test_input_raw, y_test_input, models_dict):

    return_trained_models = {}
    
    return_vectorizer = FeatureUnion([('tfidf_vect', TfidfVectorizer())])
    
    X_train = return_vectorizer.fit_transform(X_train_input_raw)
    X_test = return_vectorizer.transform(X_test_input_raw)
    
    for key in models_dict:
        model_name = key
        model = models_dict[key]
        t1 = time.time()
        model.fit(X_train, y_train_input)
        t2 = time.time()
        predicted_y = model.predict(X_test)
        t3 = time.time()
        
        output_accuracy(y_test_input, predicted_y, model_name, t2 - t1, t3 - t2)        
        return_trained_models[model_name] = model
        
    return (return_trained_models, return_vectorizer)

def create_models():
    models = {}
    models['LinearSVC'] = LinearSVC()
    models['LogisticRegression'] = LogisticRegression()
    models['RandomForestClassifier'] = RandomForestClassifier()
    models['DecisionTreeClassifier'] = DecisionTreeClassifier()
    models['MultinomialNB'] = MultinomialNB()
    return models

X_input, y_input = shuffle(df['data'], df['language'], random_state=7)

X_train_raw, X_test_raw, y_train, y_test = train_test_split(X_input, y_input, test_size=0.7)

models = create_models()
trained_models, fitted_vectorizer = test_models(X_train_raw, y_train, X_test_raw, y_test, models)

Here are the results printed by test_models:

Model Name: LinearSVC
Train time:  0.99
Predict time:  0.0
Model Accuracy: 0.9262

             precision    recall  f1-score   support

          0     1.0000    1.0000    1.0000         6
          1     1.0000    1.0000    1.0000         2
          2     1.0000    1.0000    1.0000         1
          3     0.8968    1.0000    0.9456       339
          4     0.9695    0.8527    0.9074       224
          5     0.9032    1.0000    0.9492        28
          6     0.7000    1.0000    0.8235         7
          7     0.9032    0.7568    0.8235        74
          8     0.7778    0.5833    0.6667        36
          9     0.9613    0.9255    0.9430       161
         10     1.0000    0.5000    0.6667         6
         11     1.0000    1.0000    1.0000        14
         12     1.0000    1.0000    1.0000         5
         13     1.0000    1.0000    1.0000         2
         14     1.0000    0.4545    0.6250        11
         15     1.0000    1.0000    1.0000         6
         16     1.0000    0.4000    0.5714         5
         17     0.9589    0.9589    0.9589        73
         18     1.0000    1.0000    1.0000         8
         19     0.7600    0.9268    0.8352        41
         20     0.1818    1.0000    0.3077         2
         21     1.0000    1.0000    1.0000       137
         22     1.0000    0.8750    0.9333        24
         23     1.0000    1.0000    1.0000         7
         24     1.0000    1.0000    1.0000        25
         25     0.9571    0.9571    0.9571        70
         26     0.9211    0.9722    0.9459       108

avg / total     0.9339    0.9262    0.9255      1422

=========================================================================
Model Name: DecisionTreeClassifier
Train time:  0.13
Predict time:  0.0
Model Accuracy: 0.9388

             precision    recall  f1-score   support

          0     1.0000    1.0000    1.0000         6
          1     1.0000    1.0000    1.0000         2
          2     1.0000    1.0000    1.0000         1
          3     0.9123    0.9204    0.9163       339
          4     0.8408    0.9196    0.8785       224
          5     1.0000    0.8929    0.9434        28
          6     1.0000    1.0000    1.0000         7
          7     1.0000    0.9595    0.9793        74
          8     0.9091    0.8333    0.8696        36
          9     0.9817    1.0000    0.9908       161
         10     1.0000    0.5000    0.6667         6
         11     1.0000    1.0000    1.0000        14
         12     1.0000    1.0000    1.0000         5
         13     1.0000    1.0000    1.0000         2
         14     1.0000    0.4545    0.6250        11
         15     1.0000    0.5000    0.6667         6
         16     1.0000    0.4000    0.5714         5
         17     1.0000    1.0000    1.0000        73
         18     1.0000    1.0000    1.0000         8
         19     0.9268    0.9268    0.9268        41
         20     1.0000    1.0000    1.0000         2
         21     1.0000    1.0000    1.0000       137
         22     1.0000    0.7500    0.8571        24
         23     1.0000    1.0000    1.0000         7
         24     0.6786    0.7600    0.7170        25
         25     1.0000    1.0000    1.0000        70
         26     1.0000    1.0000    1.0000       108

avg / total     0.9419    0.9388    0.9376      1422

=========================================================================
Model Name: LogisticRegression
Train time:  0.71
Predict time:  0.01
Model Accuracy: 0.9304

             precision    recall  f1-score   support

          0     1.0000    1.0000    1.0000         6
          1     1.0000    1.0000    1.0000         2
          2     1.0000    1.0000    1.0000         1
          3     0.9040    1.0000    0.9496       339
          4     0.9569    0.8929    0.9238       224
          5     0.9032    1.0000    0.9492        28
          6     0.7000    1.0000    0.8235         7
          7     0.8929    0.6757    0.7692        74
          8     0.8750    0.5833    0.7000        36
          9     0.9281    0.9627    0.9451       161
         10     1.0000    0.5000    0.6667         6
         11     1.0000    1.0000    1.0000        14
         12     1.0000    1.0000    1.0000         5
         13     1.0000    1.0000    1.0000         2
         14     1.0000    0.4545    0.6250        11
         15     1.0000    1.0000    1.0000         6
         16     1.0000    0.4000    0.5714         5
         17     0.9589    0.9589    0.9589        73
         18     1.0000    1.0000    1.0000         8
         19     0.7600    0.9268    0.8352        41
         20     1.0000    1.0000    1.0000         2
         21     1.0000    0.9781    0.9889       137
         22     1.0000    0.8750    0.9333        24
         23     1.0000    1.0000    1.0000         7
         24     1.0000    1.0000    1.0000        25
         25     0.9571    0.9571    0.9571        70
         26     0.9211    0.9722    0.9459       108

avg / total     0.9329    0.9304    0.9272      1422

=========================================================================
Model Name: RandomForestClassifier
Train time:  0.04
Predict time:  0.01
Model Accuracy: 0.9374

             precision    recall  f1-score   support

          0     1.0000    1.0000    1.0000         6
          1     1.0000    1.0000    1.0000         2
          2     1.0000    1.0000    1.0000         1
          3     0.8760    1.0000    0.9339       339
          4     0.9452    0.9241    0.9345       224
          5     0.9032    1.0000    0.9492        28
          6     0.7000    1.0000    0.8235         7
          7     1.0000    0.8378    0.9118        74
          8     1.0000    0.5278    0.6909        36
          9     0.9527    1.0000    0.9758       161
         10     1.0000    0.1667    0.2857         6
         11     1.0000    1.0000    1.0000        14
         12     1.0000    1.0000    1.0000         5
         13     1.0000    1.0000    1.0000         2
         14     1.0000    0.4545    0.6250        11
         15     1.0000    0.5000    0.6667         6
         16     1.0000    0.4000    0.5714         5
         17     1.0000    1.0000    1.0000        73
         18     1.0000    0.6250    0.7692         8
         19     0.9268    0.9268    0.9268        41
         20     0.0000    0.0000    0.0000         2
         21     1.0000    1.0000    1.0000       137
         22     1.0000    1.0000    1.0000        24
         23     1.0000    0.5714    0.7273         7
         24     1.0000    1.0000    1.0000        25
         25     1.0000    0.9571    0.9781        70
         26     0.8889    0.8889    0.8889       108

avg / total     0.9411    0.9374    0.9324      1422

=========================================================================
Model Name: MultinomialNB
Train time:  0.01
Predict time:  0.0
Model Accuracy: 0.8776

             precision    recall  f1-score   support

          0     1.0000    1.0000    1.0000         6
          1     0.0000    0.0000    0.0000         2
          2     0.0000    0.0000    0.0000         1
          3     0.8380    0.9764    0.9019       339
          4     1.0000    0.8750    0.9333       224
          5     1.0000    1.0000    1.0000        28
          6     1.0000    1.0000    1.0000         7
          7     0.6628    0.7703    0.7125        74
          8     1.0000    0.5833    0.7368        36
          9     0.8952    0.6894    0.7789       161
         10     1.0000    0.3333    0.5000         6
         11     1.0000    1.0000    1.0000        14
         12     1.0000    1.0000    1.0000         5
         13     0.0000    0.0000    0.0000         2
         14     1.0000    0.7273    0.8421        11
         15     1.0000    1.0000    1.0000         6
         16     1.0000    0.4000    0.5714         5
         17     1.0000    0.9178    0.9571        73
         18     0.8000    1.0000    0.8889         8
         19     0.4607    1.0000    0.6308        41
         20     0.0000    0.0000    0.0000         2
         21     1.0000    1.0000    1.0000       137
         22     1.0000    1.0000    1.0000        24
         23     1.0000    1.0000    1.0000         7
         24     0.8462    0.8800    0.8627        25
         25     0.8642    1.0000    0.9272        70
         26     0.9630    0.7222    0.8254       108

avg / total     0.8982    0.8776    0.8770      1422

=========================================================================
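
The trained models and the fitted vectorizer are returned so they can be reused on new text. The sketch below shows one way that might look. It is self-contained and trains on a tiny made-up dataset, so the names snippets, labels, and predict_language are illustrative assumptions, and it uses a plain TfidfVectorizer rather than the FeatureUnion above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import LinearSVC

# Toy training data, invented for illustration only.
snippets = [
    "def add(a, b): return a + b",
    "import os print(os.getcwd())",
    "for item in items: print(item)",
    "SELECT name FROM users WHERE id = 1",
    "INSERT INTO users (name) VALUES ('x')",
    "UPDATE users SET name = 'y' WHERE id = 2",
]
labels = ["Python", "Python", "Python", "SQL", "SQL", "SQL"]

encoder = LabelEncoder()
y = encoder.fit_transform(labels)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(snippets)

model = LinearSVC()
model.fit(X, y)

def predict_language(text):
    """Vectorize one snippet, predict its class, and decode the label."""
    features = vectorizer.transform([text])
    code = model.predict(features)
    return str(encoder.inverse_transform(code)[0])

guess = predict_language("SELECT id FROM orders WHERE total > 10")
print(guess)
```

With the objects from this article, the equivalent call would presumably be trained_models['LinearSVC'].predict(fitted_vectorizer.transform([text])), followed by lb_enc.inverse_transform to get the language name back.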

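Again, for details on the training and testing code and on what "accuracy", "precision", "recall", and "f1-score" mean, see my previous article, Create Your First Machine Learning Model to Filter Spam.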
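
As a quick check on how those columns relate, the sketch below recomputes one row of the LinearSVC report, class 8, which has a support of 36. The counts of 21 correct predictions out of 27 predictions for that class are not printed in the report; they are inferred here from the reported ratios, so treat them as an assumption. The function name precision_recall_f1 is also just illustrative.

```python
def precision_recall_f1(true_positives, predicted_positives, actual_positives):
    """Compute precision, recall, and F1 from raw counts for one class."""
    precision = true_positives / predicted_positives
    recall = true_positives / actual_positives
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Class 8, LinearSVC: support (actual positives) is 36 in the report.
# 21 and 27 are inferred from the reported precision and recall.
p, r, f1 = precision_recall_f1(21, 27, 36)
print('{:.4f} {:.4f} {:.4f}'.format(p, r, f1))  # 0.7778 0.5833 0.6667
```

These values match the 0.7778, 0.5833, and 0.6667 shown for class 8 in the LinearSVC table.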

Conclusion

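Machine learning, deep learning, and artificial intelligence are the future. As software engineers, we need to understand and embrace the power these technologies offer, because we can use them to solve the problems our companies and clients bring to us more effectively.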

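I run a blog dedicated to helping software engineers understand and develop their skills in machine learning, deep learning, and artificial intelligence. If you feel you learned something from this article, feel free to visit my blog at CognitiveCoder.com.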

Thank you for reading all the way to the end.

History

March 3, 2018 - Initial version
March 3, 2018 - Fixed broken image links