使用逻辑回归比较字符串

Jesús Utrera

4.71/5 (19投票s)

2015年1月30日

CPOL

3分钟阅读

27827

381

在本文中，我们将训练机器使用逻辑回归来比较字符串，该逻辑回归应用于使用 Levenshtein 算法（改编）和 Jaro-Winkler 的结果。

引言

链接到下载源码

传统的 Levenshtein 和 Jaro-Winkler 算法由于其局限性，通常无法给出良好的结果，例如，在比较街道时。

如果我们结合这两种算法的强大功能，而不是使用某种程度的算法，我们就可以拥有一个相当可靠的方法来比较这些链。

基于数据集，我们将使用逻辑回归训练机器，将对数据集中的每对字符串应用这些算法的结果作为输入。

经过训练后，我们就有了一种可靠的方法来自动进行未来的比较。

使用代码

设置数据源

首先，我们拥有的是数据集。在我的例子中，数据集是不同城市的街道比较，其中第一列和第二列是要比较的街道名称，第三列表示一个值，如果街道名称存在显着差异则为 0，否则为 1。

图像数据源

我们将 Excel 表格导出到 .csv 文件，使用“¦”字符作为字段分隔符。我选择这个字符是因为数据集中没有包含它的街道。该文件具有以下形式

原始数据源

应用 Levenshtein 距离和 Jaro-Winkler 距离算法

我们将使用一个名为“Elements”的类来存储数据源并存储应用距离算法的结果。

public class Elements
{
   public string OldName { get; set; }
   public string NewName { get; set; }
   public int IsSame { get; set; } // 1: the streets are the same - 0: the strings are different
   public double Levenshtein { get; set; }
   public double JaroWinkler { get; set; }
}

现在，代码将读取所有字符串对，并应用距离算法。结果将写入输出文件，其中前两列显示应用这两种算法的值，第三列显示如果字符串可以被认为是相同的则为 1，否则为 0（就像数据源的第三列一样）。

string ruta = path + "\\" + filename + ".csv";

List<Elements> lista = new List<Elements>();
StreamReader sr = new StreamReader(ruta);

//Read the datasource and store in la List of Elements
while (!sr.EndOfStream)
{
    string[] data = sr.ReadLine().Split('|');

    lista.Add(new Elements()
    {
        NewName = ManejadorCadena.QuitaAcentos(data[0]).Trim().ToUpper(),
        OldName = ManejadorCadena.QuitaAcentos(data[1]).Trim().ToUpper(),
        IsSame = Convert.ToInt32(data[2])
    });
}
sr.Close();

//Appliying the distance algorithms
lista.ForEach(item => DistanceAlgorithms.ApplyAlgorithms(item));
            
//Displaying the results
string salida = path + "\\train\\" + filename + ".out";
            
StreamWriter sw = new StreamWriter(salida, false);

lista.ForEach(item => sw.WriteLine(ManejadorCadena.BuildOutString(item, false).Replace(",", ".").Replace("|", ",")));
            
sw.Flush();
sw.Close();

让我们看看计算距离的算法的代码。

Jaro-Winkler 距离代码不是我的。我从 stack overflow 使用了它。

using System;

namespace es.jutrera.logistic
{
    public static class JaroWinklerDistance
    {
        /* The Winkler modification will not be applied unless the 
         * percent match was at or above the mWeightThreshold percent 
         * without the modification. 
         * Winkler's paper used a default value of 0.7
         */
        private static readonly double mWeightThreshold = 0.7;

        /* Size of the prefix to be concidered by the Winkler modification. 
         * Winkler's paper used a default value of 4
         */
        private static readonly int mNumChars = 4;

        /// <summary>
        /// Returns the Jaro-Winkler distance between the specified  
        /// strings. The distance is symmetric and will fall in the 
        /// range 0 (perfect match) to 1 (no match). 
        /// </summary>
        /// <param name="aString1">First String</param>
        /// <param name="aString2">Second String</param>
        /// <returns></returns>
        public static double distance(string aString1, string aString2)
        {
            return proximity(aString1, aString2);
        }

        /// <summary>
        /// Returns the Jaro-Winkler distance between the specified  
        /// strings. The distance is symmetric and will fall in the 
        /// range 0 (no match) to 1 (perfect match). 
        /// </summary>
        /// <param name="aString1">First String</param>
        /// <param name="aString2">Second String</param>
        /// <returns></returns>
        public static double proximity(string aString1, string aString2)
        {
            int lLen1 = aString1.Length;
            int lLen2 = aString2.Length;
            if (lLen1 == 0)
                return lLen2 == 0 ? 1.0 : 0.0;

            int lSearchRange = Math.Max(0, Math.Max(lLen1, lLen2) / 2 - 1);

            bool[] lMatched1 = new bool[lLen1];
            for (int i = 0; i < lMatched1.Length; i++)
            {
                lMatched1[i] = false;
            }
            bool[] lMatched2 = new bool[lLen2];
            for (int i = 0; i < lMatched2.Length; i++)
            {
                lMatched2[i] = false;
            }

            int lNumCommon = 0;
            for (int i = 0; i < lLen1; ++i)
            {
                int lStart = Math.Max(0, i - lSearchRange);
                int lEnd = Math.Min(i + lSearchRange + 1, lLen2);
                for (int j = lStart; j < lEnd; ++j)
                {
                    if (lMatched2[j]) continue;
                    if (aString1[i] != aString2[j])
                        continue;
                    lMatched1[i] = true;
                    lMatched2[j] = true;
                    ++lNumCommon;
                    break;
                }
            }
            if (lNumCommon == 0) return 0.0;

            int lNumHalfTransposed = 0;
            int k = 0;
            for (int i = 0; i < lLen1; ++i)
            {
                if (!lMatched1[i]) continue;
                while (!lMatched2[k]) ++k;
                if (aString1[i] != aString2[k])
                    ++lNumHalfTransposed;
                ++k;
            }
            
            int lNumTransposed = lNumHalfTransposed / 2;

            double lNumCommonD = lNumCommon;
            double lWeight = (lNumCommonD / lLen1
                             + lNumCommonD / lLen2
                             + (lNumCommon - lNumTransposed) / lNumCommonD) / 3.0;

            if (lWeight <= mWeightThreshold) return lWeight;
            int lMax = Math.Min(mNumChars, Math.Min(aString1.Length, aString2.Length));
            int lPos = 0;
            while (lPos < lMax && aString1[lPos] == aString2[lPos])
                ++lPos;
            if (lPos == 0) return lWeight;
            return lWeight + 0.1 * lPos * (1.0 - lWeight);

        }
    }
}

这是 Levenshtein 距离的代码。

using System;

namespace es.jutrera.logistic
{
    public static class LevenshteinDistance
    {
        /// <summary>
        /// Applies the Levenshtein distance algorithm
        /// </summary>
        /// <param name="a">First String</param>
        /// <param name="b">Second String</param>
        /// <returns></returns>
        public static int Levenshtein(string a, string b)
        {
            int n = a.Length;
            int m = b.Length;

            // minimal change matrix
            int[][] d;
            d = new int[n][];
            for (int x = 0; x < d.Length; x++)
                d[x] = new int[m];

            if (n == 0)
                return m;
            if (m == 0)
                return n;

            // setup the worst case (insert all)
            for (int i = 0; i < n; i++)
                d[i][0] = i;
            for (var j = 0; j < m; j++)
                d[0][j] = j;

            // each matrix element will be the transition at lower cost
            for (int i = 1, I = 0; i < n; i++, I++)
                for (int j = 1, J = 0; j < m; j++, J++)
                    if (b[J] == a[I])
                        d[i][j] = d[I][J];
                    else
                    {
                        int aux = Math.Min(d[I][j], d[i][J]);
                        d[i][j] = Math.Min(aux, d[I][J]) + 1;
                    }
            // the lowest operation number
            return d[n - 1][m - 1];
        }
    }
}

执行后，我们得到这个输出文件。我们将使用这些数据来训练我们的机器。

训练数据源

训练机器

让我们使用Octave 应用程序来执行逻辑回归。我们将使用我在 Coursera 的 MOOC 机器学习课程中学到的知识（感谢 Andrew NG！）。

train.m 归档将完成这项工作。

%% Initialization
clear ; close all; clc

%% Load Data
%  The first two columns contains the X and the thirteen column contains the label.

data = load('source.out');
X = data(:, [1:2]); y = data(:, 3);

%  Setup the data matrix 
[m, n] = size(X);
X = [ones(m, 1) X];

% Initialize fitting parameters
initial_theta = zeros(n + 1, 1);

%  Set options for fminunc
options = optimset('GradObj', 'on', 'MaxIter', 400);

%  Run fminunc to obtain the optimal theta
%  This function will return theta and the cost 
[theta, cost] = fminunc(@(t)(costFunction(t, X, y)), initial_theta, options);

% Print theta to screen
fprintf('Cost at theta found by fminunc: %f\n', cost);
fprintf('theta: \n');
fprintf(' %f \n', theta);

% Plot Boundary
plotDecisionBoundary(theta, X, y);

% Put some labels 
hold on;
% Labels and Legend
xlabel('Levenshtein')
ylabel('Jaro-Winkler')

% Specified in plot order
legend('Same', 'Not same')
pause;
hold off;

fprintf('\nProgram paused. Press enter to continue.\n');
pause;

prob = sigmoid([1, 2, 0.8] * theta);
fprintf(['prediction probability (Lev: %f, J-W: %f): %f\n\n'], 2, 0.8, prob);

% Compute accuracy on our training set
p = predict(theta, X);

fprintf('Train Accuracy: %f\n', mean(double(p == y)) * 100);

fprintf('\nProgram paused. Press enter to continue.\n');
pause;

%train examples
error_data = load('cross.out');
X_cross = error_data(:, [1:2]); 
y_cross = error_data(:, 3);
[error_train, error_val] = learningCurve(X, y, [ones(size(X_cross, 1), 1) X_cross], y_cross, theta);

figure;
hold on;
plot(1:m, error_train);
plot(1:m, error_val, 'color',  'g');
title('Regression Learning Curve');
xlabel('Number of training examples')
ylabel('Error')
%axis([0 13 0 100])
legend('Train', 'Cross Validation')
pause;
hold off;
fprintf('Program paused. Press enter to continue.\n');
pause;

执行此代码的结果显示在下一个图像中

octave 输出显示

theta 矩阵具有所有输入的梯度下降值。让我们在下一个图表中查看。 + 值表示两个字符串相同时，Ø 表示其他情况。

数据图

决策边界似乎是正确的。如果我们使用 Levenshtein 值为 2.0 和 Jaro-Winkler 值为 0.8 的两个字符串进行预测，则两个字符串至少相同的概率（没有显着差异）约为 0.85。如果概率大于或等于 0.5 则采用相同的决策边界，如果概率小于 0.5 则采用不同的决策边界，我们可以说这两个字符串没有显着差异。

让我们看看交叉验证图

我们可以看到误差值很好，并且交叉验证曲线的间隙是正确的。

最终结论

经过这项研究，我们有了一种算法，可以让我们识别两个字符串之间是否存在显着变化。正如我们所看到的，它提供了一种比简单地应用字符串距离算法稍微有效的工具。虽然 Jaro-Winkler 算法稍微有效一些，但一旦与 Levenshtein 结合并应用于数据集的训练，我们可以看到结果最适合。

应用的其他函数

我已经应用了其他假设函数（线性、二次和三次）来测试它是如何工作的。结果是正确的，并且可以正确匹配（有些比我使用的函数更匹配）。这是三个函数的决策边界图。你能猜出它们吗？