四种目录遍历算法的计时

gggustafson

4.42/5 (6投票s)

2012年11月16日

CPOL

4分钟阅读

23486

231

本文介绍了四种目录遍历算法的时间性能测试结果。

下载源码 - 18.9 KB

引言

我目前正在开发一个工具（Transfer），该工具可以将指定源目录（及其子目录）下的所有文件复制到指定的目标目录，并保持源目录结构。该工具完成后将在 CodeProject 上发布。

该工具与 XCOPY 的主要区别在于

是否覆盖目标目录中已存在的文件是基于
- 源文件和目标文件的长度，如果长度相同，则基于
- 文件内容。
可以排除子目录，可以批量排除（如 /bin），也可以按特定选择排除。
文件复制可以限制为特定的扩展名。

在研究如何最有效地遍历源目录的过程中，我测试了四种算法。这些算法及其测试方法是：

Algorithm       Test Method

GetFiles        retrieve_GetFiles_files
FileSystemInfo  retrieve_FileSystemInfo_files
Stacks          retrieve_Stack_files
Recursion       retrieve_Recursive_files

项目

尽管可能对我的算法选择提出一些合理的异议，但我坚持使用了广为人知的实现。为了进行计时，我尽量将每种方法限制在最小的必要代码。

在深入之前，读者应该知道，这个项目仅仅是为了确定在 Transfer 工具项目中使用哪种算法，以获得最佳的执行时间。该工具需要收集目录和子目录的名称，以及指定的最顶层目录下的所有文件的名称。

计时模板

四种算法中的每一种都执行了用户指定的次数。我用来计时每种算法的模板是：

using System.Diagnostics; // need for Stopwatch

TimeSpan        elapsed;
Stopwatch       stopwatch = new Stopwatch ( );
int             time_ms;

stopwatch.Start();
for ( int i = 0; ( i < iterations ); i++ )
    {
    if ( [ target file ].Count > 0 )
        {
        [ target file ].Clear ( );
        }
    [ execute an algorithm from the topmost_directory ]
    }
stopwatch.Stop();
elapsed = stopwatch.Elapsed;
time_ms = elapsed.Hours * 3600000 +
          elapsed.Minutes * 60000 +
          elapsed.Seconds * 1000 +
          elapsed.Milliseconds;

由于两种算法包含递归组件，因此所有算法都需要

使用一个全局声明的 `List` 结构，
并且预处理它们的输出文件。

所以，即使某种特定算法本身不需要，所有目标文件在每次迭代开始时都会被清除。

GetFiles 测试方法

每种算法的输出都需要保存在 `List` 结构中。这个要求增加了一些算法的长度，因为它们的返回值是 `string[]` 格式。例如，`GetFiles` 算法

// *********************************** retrieve_GetFiles_files

void retrieve_GetFiles_files ( string root )
    {
    string [ ]      entries;
    
    entries = Directory.GetFiles (
                         root,
                         "*.*",
                         SearchOption.AllDirectories );
                                // convert string [ ] to 
                                // List < string >
    foreach ( string entry in entries )
        {
        FileAttributes attributes = File.GetAttributes ( 
                                                     entry );
                                                     
        if ( ( attributes & FileAttributes.Directory ) ==
             FileAttributes.Directory )
            {
                                // directory, not a file
            }
        else
            {
            GetFiles_files.Add ( entry );
            }
        }
    }

所有工作都在 `GetFiles` 调用中完成，但由于要求输出是文件的 `List`，因此 `GetFiles` 测试方法需要进行修改以

排除目录，
将输出从 `string[]` 转换为 `List`。

这个转换增加了 `GetFiles` 算法的时间。

FileSystemInfo 测试方法

// ***************************** retrieve_FileSystemInfo_files

void retrieve_FileSystemInfo_files ( string root )
    {
    DirectoryInfo       directory_info = null;
    FileSystemInfo [ ]  directory_entries = null;

    directory_info = new DirectoryInfo ( root );

    directory_entries = directory_info.
                                GetFileSystemInfos ( );

    foreach ( FileSystemInfo entry in directory_entries )
        {
        if ( entry is DirectoryInfo )
            {
            retrieve_FileSystemInfo_files ( entry.FullName );
            }
        else if ( entry is FileInfo )
            {
            FileSystemInfo_files.Add ( entry.FullName );
            }
        else
            {

            }
        }
    }

与 `GetFiles` 调用一样，`FileSystemInfo` 调用的输出包含目录和文件。此外，还需要递归遍历目录。因此，无法在 `retrieve_FileSystemInfo_files` 方法中记录文件名。这是迫使在调用方法之前清除输出 `List` 的方法之一。

堆栈和递归

在搜索算法时，我遇到了 Microsoft Developers Network (MSDN) 的文章如何：遍历目录树 (C# 编程指南)。

我发现特别有趣的是，文章中隐含地不推荐 `GetFiles` 和 `FileSystemInfo` 方法。因为文章中出现了以下（转述的）警告：

这种方法的缺点（`root.GetDirectories("*.*", System.IO.SearchOption.AllDirectories);`）是，如果指定根目录下的任何一个子目录导致 DirectoryNotFoundException 或 UnauthorizedAccessException，整个方法就会失败，并且不返回任何目录。使用 GetFiles 方法也是如此。如果您必须处理特定子文件夹的这些异常，您必须手动遍历目录树……

考虑到这个警告，我实现了 MSDN 文章中提供的堆栈和递归方法。

堆栈测试方法

我认为堆栈是被遗忘的数据结构。我在 20 世纪 70 年代用 C 实现它们；在 20 世纪 80 年代用 Pascal 使用它们；然后，出于某种原因，停止使用它们。但在这里，堆栈证明了它的价值。

// ************************************** retrieve_Stack_files

void retrieve_Stack_files ( string root )
    {
                                // names of directories to be 
                                // examined for files
    Stack < string > directories = new Stack < string > ( 20 );

    directories.Push ( root );

    while ( directories.Count > 0 )
        {
        string      current_directory = directories.Pop ( );
        string [ ]  subdirectories;
        
        try
            {
            subdirectories = Directory.GetDirectories ( 
                                        current_directory );
            }
                                // UnauthorizedAccessException 
                                // will be thrown if the pro-
                                // cess does not have dis-
                                // covery permission on a 
                                // directory or file; this 
                                // exception will be ignored
        catch ( UnauthorizedAccessException e )
            {
            continue;
            }
                                // DirectoryNotFound may be 
                                // raised, although unlikely;
                                // this occurs if current_dir-
                                // ectory has been deleted by 
                                // another application; this 
                                // exception will be ignored
        catch ( DirectoryNotFoundException e )
            {
            continue;
            }
            
        try
            {
            foreach ( string file in Directory.GetFiles ( 
                                        current_directory ) )
                {
                Stack_files.Add ( file );
                }
            }
                                // earlier comments apply
        catch ( UnauthorizedAccessException e )
            {
            continue;
            }
                                // earlier comments apply
        catch ( DirectoryNotFoundException e )
            {
            continue;
            }
                                // push the subdirectories 
                                // onto the stack for further
                                // traversal
        try
            {
            foreach ( string subdirectory in subdirectories )
                {
                directories.Push ( subdirectory );
                }
            }
                                // earlier comments apply
        catch ( UnauthorizedAccessException e )
            {
            continue;
            }
                                // earlier comments apply
        catch ( DirectoryNotFoundException e )
            {
            continue;
            }
        }
    }

是的，源代码比前两种算法要长一些，但我认为它可能更具可读性。但更重要的是它的健壮性。如果我们只能检索到第一批子目录，那么我们至少可以获得每个子目录下的所有文件——我们不会在没有任何返回的情况下失败。但是，如果读者更喜欢抛出异常，那么这段代码可以很容易地添加到每个 `catch` 块中。

尽管折叠 try-catch 块很诱人，但实验表明执行时间会增加。我不知道为什么会这样。

递归测试方法

// ********************************** retrieve_Recursive_files
        
void retrieve_Recursive_files ( DirectoryInfo root  )
{
    FileInfo [ ]        files = null;
    DirectoryInfo [ ]   subdirectories = null;
                                // process all the files 
                                // directly under this folder 
    try
    {
        files = root.GetFiles ( "*.*" );
    }
                                // UnauthorizedAccessException 
                                // will be thrown if the pro-
                                // cess does not have dis-
                                // covery permission on a 
                                // directory or file; this 
                                // exception will be ignored
    catch  ( UnauthorizedAccessException e )
    {

    }
                            // DirectoryNotFound may be 
                                // raised, although unlikely;
                                // this occurs if current_dir-
                                // ectory has been deleted by 
                                // another application; this 
                                // exception will be ignored
    catch  ( DirectoryNotFoundException e )
    {
    
    }

    if  ( files != null )
    {
        foreach  ( FileInfo fi in files )
        {
                                // only access the existing 
                                // FileInfo object; if it is 
                                // required to open, delete, 
                                // or modify the file, then a 
                                // try-catch block is required 
                                // to handle the case where 
                                // the file has been deleted 
                                // since the invocation of 
                                // this method
            Recursive_files.Add ( fi.FullName );
        }

                                // Now find all the subdirect-
                                // ories under this directory
        subdirectories = root.GetDirectories ( );

        foreach  ( DirectoryInfo dirInfo in subdirectories )
        {
                                // Recursive call for each 
                                // subdirectory.
            retrieve_Recursive_files ( dirInfo );
        }
    }
}

计时结果

我运行了 `ComparingDirectoryTraversalMethods` 项目工具，分别执行了一次和十次迭代，使用四种算法遍历了一个选定的目录。结果如下：

Iterations   GetFiles   FileSystemInfo   Stack   Recursive
    1           3062        4158          5139      6164
   10          31098       41895         51453     61823

结论

本文展示了四种常用目录遍历算法的时间性能。

考虑到早期引用的 MSDN 文章中的警告，以及执行时间没有数量级差异的事实，我认为应该认真考虑使用堆栈和递归算法来遍历目录结构。

我为 Transfer 工具选择了堆栈实现。它将在一个独立的线程中执行，与 UI 线程分开。

历史

2012/11/9 - 原始文章。