排序文件中级开发 .NET C#

排序巨大的文本文件

ideafixxxer

4.83/5 (18投票s)

2011年11月20日

CPOL

4分钟阅读

86363

1801

大小超过可用内存的文本文件中行排序算法。

下载源代码 - 3.99 KB

引言

一位面试官问我，如果我只有 1GB 的内存，如何对一个 5GB 的文本文件进行排序。这个任务对我来说很有趣，第二天早上我忍不住就实现了它。

背景

我希望我对它有所了解，但我真的不怎么了解外部排序算法。我主要是凭直觉编写了这段代码，所以如果你能在评论中分享你对更高级技术的见解，我将不胜感激。

算法

为了排序文件，我首先将其分成更小的文件。在第一次迭代中，我遍历输入文件，找到所有以相同字符开头的行。如果某些结果文件仍然大于可用内存，我会按前两个字符分割它们，依此类推。然后，每个更小的文件都在内存中排序并保存在磁盘上。最后阶段是将排序后的已排序文件列表合并成一个文件。

示例。考虑以下文本文件

apricot
apple
orange
milk
avocado
meat

为简单起见，我们假设我们使用的是一台非常古老的计算机，无法处理长度超过 10 字节的文件排序。我们的输入文件更大，所以我们需要分割它。我们通过按一个字符分割来开始这个过程。所以，在第一步之后，我们将有三个文件：

a 

apricot
apple
avocado 
m 

milk
meat
o

orange

文件 m 和 o 现在可以在内存中排序。但是文件 a 仍然太大。所以我们需要进一步分割。

ap

apricot
apple 
av

avocado

文件 av 的长度小于十个字节，但是文件 ap 仍然很大。所以我们再次分割。

apr

apricot
app

apple

现在我们有了五个小文件而不是一个大文件，我们按照它们的开头顺序排列它们，并将它们合并在一起，将结果保存到输出文件中。

app	apple
apr	apricot
av	avocado
m	meat milk
o	橙色

看起来不错，但这个算法有一个缺陷：假设输入文件包含五千兆字节的相同行重复多次。很容易看出，在这种情况下，算法将陷入无休止的循环，试图一遍又一遍地分割这个文件。类似的问题将在下面说明。考虑我们有以下 strings，而我们的内存不足以对它们进行排序。

a
ab
abc
abcd
abcde
abcdef

由于它们都以 'a' 开头，所以在第一次迭代中它们都会被复制到同一个块中。在第二次迭代中，我们将按前两个字符分割行，但是行 'a' 只包含一个字符！我们在每次迭代中都会遇到同样的情况。

我通过将小于当前子字符串长度的 strings 分离到一个特殊的未排序文件中来处理这两个问题。由于我们不需要对其进行排序，所以该文件可以是任意大小。（如果只支持区分大小写的排序，则无需将短行保存到文件中，只需计算它们的数量即可。）

顺便说一句，该算法是稳定的，即它保持具有相等值的记录的相对顺序。

Using the Code

类 HugeFileSort 包含以下属性，它们指定了排序的执行方式：

MaxFileSize - 可以内存排序的最大文件大小（默认为 100MB）
StringComparer - 用于排序的比较器（默认为区分大小写的 CurrentCulture）
Encoding - 输入文件编码（默认为 UTF8）

主方法很简单，称为 Sort。它接受两个 strings：输入和输出文件名。如果输入文件的大小小于 MaxFileSize，则其内容将被简单地加载到内存中进行排序。否则，将执行上述过程。

/// <summary>
/// Performs sorting
/// </summary>
/// <param name="inputFileName">The name of the file that needs to be sorted
/// </param>
/// <param name="outputFileName">The name of the file in which 
/// sorted lines from the <paramref name="inputFileName"/> should be saved
/// </param>
public void Sort(string inputFileName, string outputFileName)
{
    chunks = new SortedDictionary<string, ChunkInfo>(Comparer);
    var info = new FileInfo(inputFileName);
    //If file size is less than maxFileSize simply sort its content in memory.
    if (info.Length < maxFileSize)
        SortFile(inputFileName, outputFileName);
    //Otherwise create temp directory and split file into chunks, 
    //saving each chunk as a new file into this directory
    else
    {
        var dir = new DirectoryInfo("tmp");
        if (dir.Exists)
            dir.Delete(true);
        dir.Create();
        SplitFile(inputFileName, 1);
        Merge(outputFileName);
    }
}

在执行过程中，会在当前文件夹中创建一个名为 tmp 的临时目录。为了更好的显示，最终的临时文件集不会被删除，而是保留在文件夹中。在生产代码中，请取消注释 Merge 方法中的两行。

/// <summary>
/// Merges all chunks into one file
/// </summary>
private void Merge(string outputFileName)
{
    using (var output = File.Create(outputFileName))
    {
        foreach (var name in chunks)
        {
            name.Value.Close();
            if (name.Value.NoSortFileName != null)
            {
                CopyFile(name.Value.NoSortFileName, output);
                //File.Delete(name.Value.NoSortFileName);
            }
            if (name.Value.FileName != null)
            {
                CopyFile(name.Value.FileName, output);
                //File.Delete(name.Value.FileName);
            }
        }
    }
}

算法的核心是分割方法。

/// <summary>
/// Splits big file into some chunks by matching starting characters in each line
/// </summary>
/// <param name="inputFileName">Big file name</param>
/// <param name="chars">Number of starting characters to split by</param>
private void SplitFile(string inputFileName, int chars)
{
    var files = new Dictionary<string, FileChunk>(Comparer);
    using (var sr = new StreamReader(inputFileName, Encoding))
    {
        while (sr.Peek() >= 0)
        {
            string entry = sr.ReadLine();
            //The length of the line is less than the current 
            //number of characters we split by
            //In this cases we add the line to the non-sorted file
            if (entry.Length < chars)
            {
                ChunkInfo nameInfo;
                if (!chunks.TryGetValue(entry, out nameInfo))
                    chunks.Add(entry, nameInfo = new ChunkInfo());
                nameInfo.AddSmallString(entry, Encoding);
            }
            //Otherwise we add the line to the file corresponding 
            //to the first char characters of the line
            else
            {
                string start = entry.Substring(0, chars);
                FileChunk sfi;
                if (!files.TryGetValue(start, out sfi))
                {
                    sfi = new FileChunk(Encoding);
                    files.Add(start, sfi);
                }
                sfi.Append(entry, Encoding);
            }
        }
    }
    //For each of the chunk we check if size of the chunk is 
    //still greater than the maxFileSize
    foreach (var file in files)
    {
        file.Value.Close();
        //If it is - split to smaller chunks
        if (file.Value.Size > maxFileSize)
        {
            SplitFile(file.Value.FileName, chars + 1);
            File.Delete(file.Value.FileName);
        }
        //Otherwise save it to the dictionary
        else
        {
            SortFile(file.Value.FileName, file.Value.FileName);
            ChunkInfo nameInfo;
            if (!chunks.TryGetValue(file.Key, out nameInfo))
                chunks.Add(file.Key, nameInfo = new ChunkInfo());
            nameInfo.FileName = file.Value.FileName;
        }
    }
}

FileChunk 和 ChunkInfo 是辅助的嵌套类。前者是对应于每次迭代中新文件的助手，用于将行写入文件。后者包含将被合并到结果文件中的数据信息。在算法的递归工作中，程序会填充一个排序的字典，该字典将文本行开头映射到 ChunkInfo 的实例。

SortedDictionary<string, ChunkInfo> chunks;

ChunkInfo 包含以下信息：

FileName - 包含以给定子字符串开头排序的行的文件名
NoSortFileName - 包含不排序的、与给定子字符串相等的行（可能因大小写而异）的文件名

这两个属性都可以是 null。

该类还包含方法 AddSmallString() ，该方法将给定的 string 写入文件 NoSortFileName。

测试应用程序需要三个命令行参数：输入文件名、输出文件名和最大文件大小（以字节为单位）。它执行不区分大小写的 UTF8 文件排序。

限制

如果你想破坏它，传入一个大的、很少或没有行尾的文件。由于算法使用标准的 TextReader 逐行读取文件文本，因此它不能处理此类输入数据。

历史

2011 年 11 月 20 日：初次发布