
Line Breaks: From Windows to Unix and Back Again


June 20, 2019

CPOL


This is the first in a series of articles about string-conversion problems that recently came up in my work.

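Introduction

This article presents a solution that I developed to compensate for unexpected behavior of the Git clone command, which downloads files that contain Unix line endings and converts them to Windows line endings, even though the application that consumes them expects Unix line endings.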

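Background

A Git repository may contain an optional .gitattributes file that tells Git clients how to treat line endings in text files. This works well when files are uploaded to a remote repository on GitHub, Bitbucket, or TFS. However, when files are downloaded into a local repository as part of a clone, pull, or merge operation, the Git client appears either to ignore the directives or to misinterpret them; I am not certain which.

Either way, the result is the same: after a round trip through Git, a text file cannot be assumed to contain Unix line endings.

Three line-ending conventions are in wide use today; Table 1 summarizes them. Some readers may find the third convention, legacy Mac, confusing: because the operating system on modern Macintosh computers is built on a Unix foundation (Darwin, which is derived from BSD), they follow the Unix line-ending convention.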

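Table 1 summarizes the line-ending conventions in common use.

Convention    Character Codes    Notes
Windows       0x0D 0x0A          Carriage return followed by line feed
Unix          0x0A               Line feed
Legacy Mac    0x0D               Carriage return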

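By default, when a Git client running on Windows transfers a text file to a remote repository, Windows line endings are replaced with Unix line endings. Conversely, when a text file that contains Unix line endings is transferred from a remote Git repository to a local repository on a Windows host, the Unix line endings are replaced with Windows line endings.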

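In most cases, this conversion is seamless and desirable. The case that prompted this article is an exception, however, because some of the substrings that must be replaced in a JSON string contain embedded line endings, and those line endings are essential to identifying the substrings accurately. Moreover, the replacement strings also contain line endings. Although the JSON parser that ultimately processes the string may be immune to these issues, my preprocessor is not.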
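The effect on substring matching can be sketched in a few lines. Everything in this sketch is hypothetical (the class, the method names, and the sample JSON are not taken from the demonstration repository), and it assumes that the checkout conversion simply replaces each LF with CR LF in a file that started out with pure Unix line endings.

```csharp
using System;

public static class EmbeddedLineBreakDemo
{
    // Hypothetical stand-in for a checkout that converts LF to CR LF.
    // Assumes the input contains only Unix line endings.
    public static string SimulateWindowsCheckout ( string unixText )
    {
        return unixText.Replace ( "\n" , "\r\n" );
    }

    // Ordinal search: every character, including CR and LF, must match exactly.
    public static bool ContainsToken ( string text , string token )
    {
        return text.IndexOf ( token , StringComparison.Ordinal ) >= 0;
    }

    // A made-up multi-line substring of the kind a preprocessor might look for.
    public const string SampleToken = "\"items\": [\n  1\n]";
}
```

Given `string committed = "{ " + EmbeddedLineBreakDemo.SampleToken + " }";`, `ContainsToken ( committed , SampleToken )` is true, while `ContainsToken ( SimulateWindowsCheckout ( committed ) , SampleToken )` is false, because each LF in the text is now preceded by a CR that the token does not contain.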

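This explains why the repository includes Test_Data.zip: Git treats a ZIP archive as binary data, so the line endings of the files inside it survive the round trip unchanged.

A second archive, Binaries.zip, contains the compiled code and intermediate files which, by convention, are excluded from source control.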

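  1. The bin directory contains the final product, including copies of the required DLLs.
  2. The obj directory contains the intermediate files generated by the build; without them, the build engine insists on regenerating them the first time you press F5 to run the code in Visual Studio.

Both archives are structured so that extracting each of them "here" reproduces the expected directory structure, as explained in README.md.

Although there are doubtless many libraries that I could have used to solve this problem, I had some time on my hands, so I decided to write my own. It wasn't as easy as it looks, but the task lends itself to a simple state machine, similar to the one that drives a robust CSV reader. (The latest source code is on GitHub at https://github.com/txwizard/AnyCSV, and the corresponding NuGet package is at https://nuget.net.cn/packages/WizardWrx.AnyCSV/.)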

Using the Code

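Because the behavior of Git clients is at the heart of the problem that prompted this article, its demonstration program is available only as a Git repository, https://github.com/txwizard/LineBreakFixupsDemo. As with many nontrivial open source repositories, getting it ready to use takes more than cloning the repository or downloading and extracting a ZIP file of its contents. Step-by-step instructions are in the repository's README.md file, which is displayed prominently on the repository home page and can be viewed directly at https://github.com/txwizard/LineBreakFixupsDemo/blob/master/README.md, so this article omits them. I expect that the background above is sufficient to explain why the extra steps are needed.

When LineBreakFixupsDemo.exe is executed in a command prompt window, or by double-clicking its icon in File Explorer, the processes listed in Table 2 (which I usually call exercises) run in the order listed, generating output similar to LineBreakFixupDemo_Complete_Report_20190616_183828.TXT, which is in the repository root. That file was created as follows.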

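  1. Double-click the program file in File Explorer.
  2. Using the context menu in the upper left corner of its output window, select everything in the output window and copy it to the Windows Clipboard.
  3. Open a new file in a text editor. Because I wanted control over the type of line endings saved into the file, I used my usual text editor, UltraEdit Studio.
  4. Paste the contents of the Windows Clipboard into it.
  5. Save the file, taking care to specify that it be saved with Unix line endings.

Choosing Unix line endings was deliberate, because it ensures that the file has Unix line endings when the GitHub server packs it into the repository ZIP. If you build your repository by cloning, your local copy of the file will have Windows line endings. This is easy to demonstrate with any programmer's text editor or any hex file viewer or editor. I leave the demonstration as an exercise.

You can also run each of its four exercises independently by appending one of the four arguments listed in the Name column of Table 2.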
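For readers who would rather check with code than with a hex viewer, here is a minimal sketch that tallies the line endings found in a string or a file. The class and member names are hypothetical and are not part of the demonstration repository. It treats CR LF as one Windows line ending and counts every other CR or LF individually, so the unusual LF/CR pair is reported as one Unix and one legacy Mac line ending.

```csharp
using System.IO;

public sealed class LineEndingCounts
{
    public int Windows { get; set; }    // CR LF pairs
    public int Unix { get; set; }       // LF not preceded by CR
    public int OldMac { get; set; }     // CR not followed by LF
}

public static class LineEndingProbe
{
    public static LineEndingCounts Count ( string text )
    {
        LineEndingCounts counts = new LineEndingCounts ( );

        for ( int i = 0 ; i < text.Length ; i++ )
        {
            if ( text [ i ] == '\r' )
            {
                if ( i + 1 < text.Length && text [ i + 1 ] == '\n' )
                {
                    counts.Windows++;
                    i++;                // Skip the LF that completes the pair.
                }
                else
                {
                    counts.OldMac++;
                }
            }
            else if ( text [ i ] == '\n' )
            {
                counts.Unix++;
            }
        }

        return counts;
    }

    // File.ReadAllText returns the characters as stored; it does not
    // translate line endings.
    public static LineEndingCounts CountInFile ( string path )
    {
        return Count ( File.ReadAllText ( path ) );
    }
}
```

Calling `LineEndingProbe.CountInFile` on the report file in a cloned working copy and again on the copy extracted from the repository ZIP should show the difference that the paragraph above describes.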

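Table 2 summarizes the command line options that execute individual test sets.

Name                   Description
LineBreaks             Exercises class StringExtensions (line-ending conversion methods)
AppSettingsList        Exercises class Program (method ListAppSettings, which sorts and lists application settings)
StringResourceList     Exercises class Program (method ListEmbeddedResources, which sorts and lists embedded string resources)
TransformJSONString    Exercises class JSONFixups, processing files that contain converted and original Windows line endings, and Unix line endings (3 tests)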

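The subject of this article is the first exercise, LineBreaks. The other three will be covered in subsequent articles.

The string conversions are performed by three string extension methods.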

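  1. OldMacLineEndings takes a string that contains any combination of line endings and returns a string in which every legal line ending is replaced with the legacy Mac line ending described in Table 1.
  2. UnixLineEndings takes a string that contains any combination of line endings and returns a string in which every legal line ending is replaced with the Unix line ending described in Table 1.
  3. WindowsLineEndings takes a string that contains any combination of line endings and returns a string in which every legal line ending is replaced with the Windows line ending described in Table 1.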
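To make the contract of these three methods concrete without installing the library, the following sketch reproduces it with a regular expression from the .NET base class library. This is a swapped-in technique, not the state machine that the article goes on to describe, and the class and method names are hypothetical. The pattern lists the two-character pairs first so that they take precedence over the single characters.

```csharp
using System.Text.RegularExpressions;

public static class RegexLineEndings
{
    // CR LF and LF CR are tried before a lone CR or a lone LF.
    private static readonly Regex s_anyLineBreak = new Regex ( "\r\n|\n\r|\r|\n" );

    public static string ToOldMac ( string source )
    {
        return s_anyLineBreak.Replace ( source , "\r" );
    }

    public static string ToUnix ( string source )
    {
        return s_anyLineBreak.Replace ( source , "\n" );
    }

    public static string ToWindows ( string source )
    {
        return s_anyLineBreak.Replace ( source , "\r\n" );
    }
}
```

For example, `RegexLineEndings.ToWindows ( "a\nb\rc\r\nd\n\re" )` returns `"a\r\nb\r\nc\r\nd\r\ne"`. Ambiguous runs such as CR CR LF may be paired differently by this pattern than by the library's state machine, so treat it as an illustration of the intended result rather than a drop-in replacement.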

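These string extension methods are exported by WizardWrx.Core.dll, a component of the WizardWrx .NET API, which is also available as a NuGet package.

Although the list of NuGet packages shown in Table 3 contains twelve items, only three need to be installed explicitly; the other nine are dependencies that the NuGet package manager downloads when those three are selected for installation. They are dependencies of WizardWrx.ConsoleAppAids3, WizardWrx.EmbeddedTextFile, or both. Newtonsoft.Json has no package dependencies. Everything listed here is included in Binaries.zip and is covered in the preparation instructions in README.md.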

Table 3 lists all of the NuGet packages and indicates whether each must be installed.

Package Name                         Must Install
Newtonsoft.Json
WizardWrx.AnyCSV
WizardWrx.ASCIIInfo
WizardWrx.AssemblyUtils
WizardWrx.BitMath
WizardWrx.Common
WizardWrx.ConsoleAppAids3
WizardWrx.ConsoleStreams
WizardWrx.Core
WizardWrx.DLLConfigurationManager
WizardWrx.EmbeddedTextFile

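Points of Interest

Apart from the code that determines whether to skip or run it, the LineBreaks exercise is implemented by the static Exercise method on LineEndingFixupTests, which is itself a static class. Since everything it needs comes from an array of TestCase structs that is baked into the program, I defined no instance constructor, which would only add needless bulk (like empty calories in a diet).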

        struct TestCase
        {
            public string InputString;
            public int LineCount;

            public TestCase ( string pstrInputString , int pintLineCount )
            {
                InputString = pstrInputString;
                LineCount = pintLineCount;
            }
        };  // struct TestCases
        static readonly TestCase [ ] s_astrTestStrings = new TestCase [ ]

        {   //             InputString                                                                                                                                                                                                                                                                                                                                                                                      LineCount
            //             ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------   ---------
            new TestCase ( "Test line 1 is followed by a Unix newline.\nTest line 2 is also followed by a Unix newline.\nTest line 3 is followed by a Windows line break.\r\nTest line 4 is followed by the unusual line LF/CR line break.\n\rTest line 5 is followed by the old Macintosh line break, CR.\rTest line 6 is followed by a Unix newline.\nTest line 7 is followed by one last Unix newline.\n"              , 7         ) ,
            new TestCase ( "Test line 1 is followed by Windows newline.\r\nTest line 2 is also followed by a Windows newline.\r\nTest line 3 is followed by a Windows line break.\r\nTest line 4 is followed by the unusual line LF/CR line break.\n\rTest line 5 is followed by the old Macintosh line break, CR.\rTest line 6 is followed by a Unix newline.\nTest line 7 is followed by one last Windows newline.\r\n" , 7         ) ,
            new TestCase ( "Test line 1 is followed by a Windows newline.\r\nTest line 2 is also followed by a Windows newline.\r\nTest line 3 is followed by yet another Windows neline.\r\n"                                                                                                                                                                                                                            , 3         ) ,
            new TestCase ( "Test line 1 is followed by a Unix newline.\nTest line 2 is also followed by a Unix newline.\nTest line 3 is followed by yet another Unix neline.\n"                                                                                                                                                                                                                                           , 3         ) ,
            new TestCase ( "Test line 1 is followed by a Unix newline.\nTest line 2 is followed by 2 Unix newlines.\n\nTest line 4 is followed by one Unix newline.\nTest line 5 is unterminated."                                                                                                                                                                                                                        , 4         ) ,
            new TestCase ( "Test line 1 is followed by a Old Macintosh newline.\rTest line 2 is followed by 2 Old Macintosh newlines.\r\rTest line 4 is followed by one Old Macintosh newline.\rTest line 5 is unterminated."                                                                                                                                                                                             , 4         ) ,
            new TestCase ( "Test line 1 is followed by a Windows newline.\r\nTest line 2 is followed by 2 Windows newlines.\r\n\r\nTest line 4 is followed by one Windows newline.\r\nTest line 5 is unterminated."                                                                                                                                                                                                       , 4         ) ,
        };  // static readonly TestCase [ ] s_astrTestStrings

The flow of the Exercise method, which is a showcase for extension methods on the System.String class, is as follows.

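  • The static method Utl.BeginTest increments the test number, which was explicitly initialized to zero, and displays the incremented test number in a one-line message that describes the test. The descriptions are the ones shown in Table 2 above.
  • Three integer counter variables are initialized. Although its value never changes at run time, intTotalTestCases cannot be declared const because of the way it is initialized (a const must be a compile-time constant).
  • Two WizardWrx.ConsoleStreams.MessageInColor objects are constructed and initialized, although only micSuccessfulOutcomeMessage is expected to see any real use. A MessageInColor object resembles the part of the static Console object that implements its WriteLine method. The difference is that the WriteLine method on a MessageInColor object renders its text in the colors (foreground text and background) specified in its constructor.
    • The colors assigned to the micSuccessfulOutcomeMessage object are read from string settings stored in the application settings file. This is accomplished by EnumFromString<ConsoleColor>, one of several custom string extension methods defined in the WizardWrx namespace. Using this conversion method eliminates the need for a custom converter class.
    • The colors assigned to the micUnsuccessfulOutcomeMessage object are the FatalExceptionTextColor and FatalExceptionBackgroundColor properties of the WizardWrx.ConsoleStreams.ErrorMessagesInColor class. The values assigned to these colors come from WizardWrx.DLLConfigurationManager.dll.config, which is distributed with WizardWrx.DLLConfigurationManager.dll. This means that wherever WizardWrx.DLLConfigurationManager.dll goes, it looks for a configuration file of the same name. If one exists, its settings are applied as if they had been specified in a regular App.config file. Everything has a default value, which happens to be the value specified in the configuration file that ships with the library. The goal is that nothing is hard coded.
    • WizardWrx.DLLConfigurationManager.dll is another component of WizardWrx_NET_API, and is also a NuGet package (https://nuget.net.cn/packages/WizardWrx.DLLConfigurationManager/). An explanation of how this library works is beyond the scope of this article, although it may become the subject of another.
  • The heart of the Exercise method is the for loop that iterates the s_astrTestStrings array. The statement is formatted deliberately to call attention to the fact that, like every for statement, it is composed of three tightly coupled statements. This is common to C and all of the languages derived from it. If you watch closely in a debugger, you will notice that the first of the three statements executes only on the first iteration, the second executes on every iteration, and the third executes on every iteration except the first.
  • Each iteration converts the input string into new strings, each of which implements one of the three line-ending conventions listed in Table 1.
  • Next, the ReportTestOutcome method is called once for each output string. Since two of the three line-ending conventions require a single character, while the third requires a two-character string, ReportTestOutcome has two overloads. ReportTestOutcome is also responsible for incrementing the intOverallCase counter that it receives as input, which it returns as its return value.
  • SpecialCharacters.CARRIAGE_RETURN, SpecialCharacters.LINEFEED, and SpecialStrings.STRING_SPLIT_NEWLINE are three of the many general-purpose constants that WizardWrx.Common.dll defines and exports into the root WizardWrx namespace. That library is another component of the WizardWrx .NET API, and is also the WizardWrx.Common NuGet package, available from https://nuget.net.cn/packages/WizardWrx.Common/.
  • Finally, s_intBadOutcomes and s_intGoodOutcomes are two static integer counters, the first of which, s_intBadOutcomes, is tested to determine which of two final messages is written through one of the two MessageInColor objects. Unless the program has a bug or the test data table is corrupted, the first message, "A-OK!", is the expected outcome.
  • The last point of interest in the test program is ActualLineCount, which is called by both ReportTestOutcome overloads. Like its callers, and for the same reason, ActualLineCount is overloaded, because its second argument is passed straight through from ReportTestOutcome.
    • The first overload takes the output string and the Windows line ending (another string), and returns the integer returned by CountSubstrings, another System.String extension method that is exported into the WizardWrx namespace.
    • The second overload takes the output string and the Unix or Mac line ending (each of which is a single character), and returns the integer returned by CountCharacterOccurrences, yet another extension method.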
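The overall shape of such an exercise can be sketched as a table-driven loop: convert each input to each convention, count the line endings in the result, and compare the count with the expected value. The sketch below is a simplified stand-in under stated assumptions; it is not the Exercise method from the repository. All names are hypothetical, the conversion uses a regular expression instead of the library's extension methods, the counting uses an ordinal IndexOf loop, and the test strings are shortened inventions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExerciseSketch
{
    private struct Case
    {
        public readonly string Input;
        public readonly int ExpectedLineBreaks;

        public Case ( string input , int expectedLineBreaks )
        {
            Input = input;
            ExpectedLineBreaks = expectedLineBreaks;
        }
    }

    private static readonly Case [ ] s_cases =
    {
        new Case ( "one\ntwo\r\nthree\rfour\n\rfive" , 4 ) ,
        new Case ( "one\r\n\r\nthree\r\n"            , 3 ) ,
        new Case ( "unterminated"                    , 0 ) ,
    };

    // Windows, Unix, and legacy Mac line endings, in that order.
    private static readonly string [ ] s_targets = { "\r\n" , "\n" , "\r" };

    // Returns the number of conversions whose line break count was wrong.
    public static int Run ( )
    {
        int badOutcomes = 0;

        foreach ( Case testCase in s_cases )
        {
            foreach ( string target in s_targets )
            {
                string output = Regex.Replace ( testCase.Input , "\r\n|\n\r|\r|\n" , target );

                if ( CountOccurrences ( output , target ) != testCase.ExpectedLineBreaks )
                {
                    badOutcomes++;
                }
            }
        }

        return badOutcomes;
    }

    private static int CountOccurrences ( string source , string toCount )
    {
        int count = 0;
        int pos = source.IndexOf ( toCount , StringComparison.Ordinal );

        while ( pos >= 0 )
        {
            count++;
            pos = source.IndexOf ( toCount , pos + toCount.Length , StringComparison.Ordinal );
        }

        return count;
    }
}
```

With the three cases shown, `ExerciseSketch.Run ( )` returns 0, which corresponds to the "A-OK!" outcome described above.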

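Simple as these string extension methods are, they merit a quick look. Although the StringExtensions class belongs to another repository, I put a copy of StringExtensions.cs in the root directory of this one. The class is updated frequently as I discover and implement new string extension methods; to take advantage of them, import the WizardWrx.Core NuGet package into your project.

Following the pattern of the standard string comparison methods, CountSubstrings has two overloads, the first of which assumes CurrentCulture, the default StringComparison. The first overload calls the second, which allows the comparison algorithm to be overridden. Since its argument list is complete, the second method provides the implementation for both.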

public static int CountSubstrings (
    this string pstrSource ,
    string pstrToCount )
{
    return pstrSource.CountSubstrings (
        pstrToCount ,
        StringComparison.CurrentCulture );
}   // CountSubstrings (1 of 2)
public static int CountSubstrings (
        this string pstrSource ,
        string pstrToCount ,
        StringComparison penmComparisonType )
{
    if ( string.IsNullOrEmpty ( pstrSource ) )
    {   // Treat null strings as empty, and treat both as a valid, but degenerate, case.
        return MagicNumbers.ZERO;
    }   // if ( string.IsNullOrEmpty ( pstrSource ) )

    if ( string.IsNullOrEmpty ( pstrToCount ) )
    {   // This is an error. String pstrToCount should never be null or empty.
        return MagicNumbers.STRING_INDEXOF_NOT_FOUND;
    }   // if ( string.IsNullOrEmpty ( pstrToCount ) )

    int rintCount = MagicNumbers.ZERO;

    // ----------------------------------------------------------------
    // Unless pstrSource contains at least one instance of pstrToCount,
    // this first IndexOf is the only one that executes.
    //
    // If there are no matches, intPos is STRING_INDEXOF_NOT_FOUND (-1)
    // and the WHILE loop is skipped. Hence, if control falls into the
    // loop, at least one item was found, and must be counted, and the
    // loop continues until intPos becomes STRING_INDEXOF_NOT_FOUND.
    // ----------------------------------------------------------------

    int intPos = pstrSource.IndexOf (
        pstrToCount ,
        penmComparisonType );                                               

    // Look for first instance.

    while ( intPos != MagicNumbers.STRING_INDEXOF_NOT_FOUND )
    {   // Found at least one.
        rintCount++;    // Count it.
        intPos = pstrSource.IndexOf (
            pstrToCount ,
            ( intPos + ArrayInfo.NEXT_INDEX ) ,
            penmComparisonType );    // Search for more.
    }  // while ( intPos != MagicNumbers.STRING_INDEXOF_NOT_FOUND )

    return rintCount;    // Report.
}   // CountSubstrings (2 of 2)
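When the substring being counted is a line ending, the choice of comparison can matter. On newer .NET runtimes whose culture-sensitive comparisons are backed by ICU, a culture-sensitive search for "\n" may fail to find the LF inside a "\r\n" pair, so passing StringComparison.Ordinal is the safer assumption for this kind of work. The sketch below is a compact ordinal-only variant with hypothetical names. Unlike the method above, it advances past the whole match, so overlapping occurrences are not counted twice, and it throws on an empty search string instead of returning -1.

```csharp
using System;

public static class SubstringCounter
{
    public static int CountOrdinal ( string source , string toCount )
    {
        if ( string.IsNullOrEmpty ( toCount ) )
        {
            throw new ArgumentException ( "The string to count must not be null or empty." , nameof ( toCount ) );
        }

        if ( string.IsNullOrEmpty ( source ) )
        {
            return 0;   // Nothing to search is a valid, degenerate case.
        }

        int count = 0;
        int pos = source.IndexOf ( toCount , StringComparison.Ordinal );

        while ( pos >= 0 )
        {
            count++;
            // Resume just past the match; a start index equal to the
            // string length is legal and yields -1.
            pos = source.IndexOf ( toCount , pos + toCount.Length , StringComparison.Ordinal );
        }

        return count;
    }
}
```

For example, `SubstringCounter.CountOrdinal ( "a\r\nb\r\n" , "\r\n" )` returns 2, and `SubstringCounter.CountOrdinal ( "aaa" , "aa" )` returns 1 because the second, overlapping occurrence is skipped.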

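Both methods are defined in StringExtensions.cs, which is part of the source code of WizardWrx.Core.dll. Finally, we come to the reason for this article: UnixLineEndings, WindowsLineEndings, and OldMacLineEndings, all three of which are also string extension methods defined in the same source file.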

public static string OldMacLineEndings ( this string pstrSource )
{
    return LineEndingFixup (
        pstrSource ,                        // string                  pstrSource
        RequiredLineEndings.OldMacintosh ); // RequiredLineEndings     penmRequiredLineEndings
}   // OldMacLineEndings method

public static string UnixLineEndings ( this string pstrSource )
{
    return LineEndingFixup (
        pstrSource ,                        // string                  pstrSource
        RequiredLineEndings.Unix );         // RequiredLineEndings     penmRequiredLineEndings
}   // UnixLineEndings method

public static string WindowsLineEndings ( this string pstrSource )
{
    return LineEndingFixup (
        pstrSource ,                        // string                  pstrSource
        RequiredLineEndings.Windows );      // RequiredLineEndings     penmRequiredLineEndings
}   // WindowsLineEndings method

All three call LineEndingFixup.

private static string LineEndingFixup (
    string pstrSource ,
    RequiredLineEndings penmRequiredLineEndings )
{
    //  ----------------------------------------------------------------
    //  Construct a StringBuilder with sufficient memory allocated to
    //  support a final string twice as long as the input string, which
    //  covers the worst-case scenario of an input string composed
    //  entirely of single-character newlines, expecting the returned
    //  string to have Windows line endings.
    //
    //  Copy the input string into an array of characters and initialize
    //  the state machine. Since both are easier to maintain as part of
    //  the state machine, LineEndingFixupState, the input string is fed
    //  into its constructor, since it can construct both from it.
    //  ----------------------------------------------------------------

    LineEndingFixupState state = new LineEndingFixupState (
         penmRequiredLineEndings ,
         pstrSource );

    //  ----------------------------------------------------------------
    //  Using the state machine, a single pass over the character array
    //  is sufficient.
    //  ----------------------------------------------------------------

    int intCharsInLine = state.InputCharacterCount;

    for ( int intCurrCharPos = ArrayInfo.ARRAY_FIRST_ELEMENT ;
              intCurrCharPos < intCharsInLine ;
              intCurrCharPos++ )
    {
        state.UpdateState ( intCurrCharPos );
    }   // for ( int intCurrCharPos = ArrayInfo.ARRAY_FIRST_ELEMENT ; intCurrCharPos < intCharsInLine ; intCurrCharPos++ )

    return state.GetTransformedString ( );
}   // private static string LineEndingFixup

The real work is delegated to a private nested class, LineEndingFixupState, which is shown in full below.

#region Private nested class LineEndingFixupState
/// <summary>
/// On behalf of public static methods OldMacLineEndings,
/// UnixLineEndings, and WindowsLineEndings, private static method
/// LineEndingFixup uses an instance of this class to manage the
/// resources required to perform its work. Public method UpdateState
/// does most of the work required by LineEndingFixup.
/// </summary>
private class LineEndingFixupState
{
    #region Public Interface of nested LineEndingFixupState class
    /// <summary>
    /// This enumeration is used internally to indicate the state of the
    /// LineEndingFixup state machine.
    /// </summary>
    public enum CharacterType
    {
        /// <summary>
        /// The initial state of the machine is that the last character seen
        /// is unknown.
        /// </summary>
        Indeterminate ,

        /// <summary>
        /// The last character seen isn't a newline character.
        /// </summary>
        Other ,

        /// <summary>
        /// The last character seen was a bare CR character, which is either
        /// the old Macintosh line break character, or belongs to a
        /// two-character line break.
        /// </summary>
        OldMacintosh ,

        /// <summary>
        /// The last character seen was a bare LF character, which is either
        /// a Unix line break character, or belongs to a two-character line
        /// break.
        /// </summary>
        Unix
    };  // public enum CharacterType


    /// <summary>
    /// The constructor is kept private to guarantee that all instances
    /// are fully initialized.
    /// </summary>
    private LineEndingFixupState ( )
    {
    }   // private LineEndingFixupState constructor prohibits uninitialized instances.


    /// <summary>
    /// The only public constructor initializes an instance for use by
    /// LineEndingFixup.
    /// </summary>
    /// <param name="penmRequiredLineEndings">
    /// LineEndingFixup uses a member of the RequiredLineEndings
    /// enumeration to specify the type of line endings to be included
    /// in the new string that it generates from input string
    /// <paramref name="pstrInput"/>.
    /// </param>
    /// <param name="pstrInput">
    /// Existing line endings in this string are replaced as needed by
    /// the type of line endings specified by <paramref name="penmRequiredLineEndings"/>.
    /// </param>
    public LineEndingFixupState (
        RequiredLineEndings penmRequiredLineEndings ,
        string pstrInput )
    {
        NewLineEndings = penmRequiredLineEndings;
        DesiredLineEnding = SetDesiredLineEnding ( );
        _sbWork = new StringBuilder ( pstrInput.Length * MagicNumbers.PLUS_TWO );
        _achrInputCharacters = pstrInput.ToCharArray ( );
        InputCharacterCount = _achrInputCharacters.Length;
    }   // public LineEndingFixupState constructor guarantees initialized instances.


    /// <summary>
    /// Since the StringBuilder is a reference type, exposing it makes
    /// it vulnerable to attack. Therefore, it is kept private, and this
    /// instance method must be explicitly called to get its value as an
    /// immutable entity, a new string.
    /// </summary>
    /// <returns>
    /// The contents of the StringBuilder in which the transformed
    /// string is assembled are returned as a new string.
    /// </returns>
    public string GetTransformedString ( )
    {
        return _sbWork.ToString ( );
    }   // public string GetTransformedString


    /// <summary>
    /// LineEndingFixup calls this method once for each character in the
    /// string that was fed into the instance constructor, and once more
    /// to handle the final character. The algorithm that it implements
    /// completes all conversions in one pass.
    /// </summary>
    /// <param name="pintCurrCharPos">
    /// The index of the FOR loop within which this routine is called
    /// identifies the zero-based position within the internal array of
    /// characters that is constructed from the input string to process.
    /// </param>
    public void UpdateState ( int pintCurrCharPos )
    {
        //  ------------------------------------------------------------
        //  Processing a scalar is slightly more efficient than
        //  processing an array element.
        //  ------------------------------------------------------------

        char chrCurrent = GetCharacterAtOffset ( pintCurrCharPos );

        //  ------------------------------------------------------------
        //  Defer updating the instance property, which would otherwise
        //  break the test performed by IsRunOfNelines.
        //  ------------------------------------------------------------

        CharacterType enmCharacterType = ClassifyThisCharacter ( chrCurrent );

        if ( IsThisCharANewline ( chrCurrent ) )
        {
            if ( IsRunOfNelines ( ) )
            {   // Some newlines are pairs, of which the second is ignored.
                if ( AppendNewline ( pintCurrCharPos , enmCharacterType ) )
                {
                    _sbWork.Append ( DesiredLineEnding );
                }   // if ( AppendNewline ( pintCurrCharPos , enmCharacterType ) )
            }   // TRUE block, if ( IsRunOfNelines ( ) )
            else
            {   // Regardless, the first character elicits a newline, and set the run counter.
                _intPosNewlineRunStart = _intPosNewlineRunStart == ArrayInfo.ARRAY_INVALID_INDEX
                    ? pintCurrCharPos
                    : _intPosNewlineRunStart;
                _sbWork.Append ( DesiredLineEnding );
            }   // FALSE block, if ( IsRunOfNelines ( ) )
        }   // if ( IsThisCharANewline ( chrCurrent ) )
        else
        {   // It isn't a newline; append it, and reset the run counter.
            _sbWork.Append ( chrCurrent );
            _intPosNewlineRunStart = ArrayInfo.ARRAY_INVALID_INDEX;
        }   // if ( IsThisCharANewline ( chrCurrent ) )

        LastCharacter = enmCharacterType;
    }   // public void UpdateState


    /// <summary>
    /// Strictly speaking this string could be left private. Making it
    /// public as a read-only member is a debugging aid that preserves
    /// the integrity of the instance.
    /// </summary>
    public string DesiredLineEnding { get; private set; } = null;


    /// <summary>
    /// The FOR loop at the heart of LineEndingFixup initializes its
    /// limit value from this read-only property.
    /// </summary>
    public int InputCharacterCount { get; }


    /// <summary>
    /// Like DesiredLineEnding, this could be kept private, but is made
    /// public as a debugging aid.
    /// </summary>
    public CharacterType LastCharacter { get; private set; } = CharacterType.Indeterminate;


    /// <summary>
    /// Like DesiredLineEnding, this could be kept private, but is made
    /// public as a debugging aid.
    /// </summary>
    public RequiredLineEndings NewLineEndings { get; private set; }
    #endregion  // Public Interface of nested LineEndingFixupState class


    #region Private nested class LineEndingFixupState code and data
    /// <summary>
    /// Use the current character position relative to the beginning of
    /// the run of newline characters and the type of the current and
    /// immediately previous character in the run to determine whether a
    /// newline should be emitted.
    /// </summary>
    /// <param name="pintCurrCharPos">
    /// The position (offset) of the current character is compared with
    /// the position of the first character in the current run of
    /// newline characters to determine whether to append a newline.
    /// </param>
    /// <param name="penmCurrentCharacterType"></param>
    /// <returns></returns>
    private bool AppendNewline (
        int pintCurrCharPos ,
        CharacterType penmCurrentCharacterType )
    {
        const int LONGEST_VALID_NEWLINE_SEQUENCE = MagicNumbers.PLUS_TWO;

        switch ( ( pintCurrCharPos - _intPosNewlineRunStart ) % LONGEST_VALID_NEWLINE_SEQUENCE )
        {
            case MagicNumbers.EVENLY_DIVISIBLE:
                return true;
            default:
                return penmCurrentCharacterType == LastCharacter;
        }   // switch ( ( pintCurrCharPos - _intPosNewlineRunStart ) % LONGEST_VALID_NEWLINE_SEQUENCE )
    }   // private bool AppendNewline

   
    /// <summary>
    /// Classify the current character as a member of the CharacterType enum.
    /// </summary>
    /// <param name="pchrCurrent">
    /// Pass in the current character, which is about to become the last
    /// character processed.
    /// </param>
    private CharacterType ClassifyThisCharacter ( char pchrCurrent )
    {
        switch ( pchrCurrent )
        {
            case CHAR_SPLIT_OLD_MACINTOSH:
                return CharacterType.OldMacintosh;
            case CHAR_SPLIT_UNIX:
                return CharacterType.Unix;
            default:
                return CharacterType.Other;
        }   // switch ( pchrCurrent )
    }   // private CharacterType ClassifyThisCharacter


    /// <summary>
    /// Evaluate the character at a specified position in the input
    /// string, returning TRUE if it is a newline character (CR or LF).
    /// </summary>
    /// <param name="pchrThis">
    /// Specify the character to evaluate.
    /// </param>
    /// <returns>
    /// Return TRUE if the character at the position specified by
    /// <paramref name="pchrThis"/> is a newline (CR or LF)
    /// character. Otherwise, return FALSE.
    /// </returns>
    private bool IsThisCharANewline ( char pchrThis )
    {
        switch ( pchrThis )
        {
            case CHAR_SPLIT_OLD_MACINTOSH:
            case CHAR_SPLIT_UNIX:
                return true;
            default:
                return false;
        }   // switch ( pchrThis )
    }   // private bool IsThisCharANewline


    /// <summary>
    /// Determine whether the current newline character belongs to a run of them.
    /// </summary>
    /// <returns>
    /// Return TRUE unless _intPosNewlineRunStart is equal to
    /// ArrayInfo.ARRAY_INVALID_INDEX; otherwise, return FALSE. Though
    /// this method could go ahead and update _intPosNewlineRunStart, it
    /// leaves it unchanged, so that its execution is devoid of side
    /// effects.
    /// </returns>
    private bool IsRunOfNelines ( )
    {
        return ( _intPosNewlineRunStart != ArrayInfo.ARRAY_INVALID_INDEX );
    }   // private bool IsRunOfNelines


    /// <summary>
    /// Switch blocks in public instance method UpdateState use this
    /// routine to return the character at the position (offset)
    /// specified by <paramref name="pintCurrCharPos"/>.
    /// </summary>
    /// <param name="pintCurrCharPos">
    /// The zero-based offset that was fed into instance method
    /// UpdateState by its controller, LineEndingFixup
    /// </param>
    /// <returns>
    /// The return value is the character at the specified position
    /// (offset) in the input string, a copy of which is maintained in
    /// private character array _achrInputCharacters. Returning this in
    /// a method exposes the actual character that determines which
    /// branch of the switch block executes. Otherwise, the debugger
    /// reports only the return value returned by the property getter.
    /// It is anticipated that this routine will be optimized away in a
    /// release build.
    /// </returns>
    private char GetCharacterAtOffset ( int pintCurrCharPos )
    {
        return _achrInputCharacters [ pintCurrCharPos ];
    }   // private char GetCharacterAtOffset


    /// <summary>
    /// The public constructor invokes this method once, during the
    /// initialization phase, to establish the value of the desired line
    /// ending, which may be a single character or a pair of them.
    /// </summary>
    /// <returns>
    /// The return value is always a string that contains one character or a
    /// pair of them.
    /// </returns>
    private string SetDesiredLineEnding ( )
    {
        const string WINDOWS_LINE_BREAK = SpecialStrings.STRING_SPLIT_NEWLINE;

        switch ( NewLineEndings )
        {
            case RequiredLineEndings.OldMacintosh:
                return CHAR_SPLIT_OLD_MACINTOSH.ToString ( );
            case RequiredLineEndings.Unix:
                return CHAR_SPLIT_UNIX.ToString ( );
            case RequiredLineEndings.Windows:
                return WINDOWS_LINE_BREAK;
            default:
                throw new InvalidEnumArgumentException (
                    nameof ( NewLineEndings ) ,
                    ( int ) NewLineEndings ,
                    NewLineEndings.GetType ( ) );
        }   // switch ( NewLineEndings )
    }   // private string SetDesiredLineEnding


    /// <summary>
    /// The constructor initializes this character array from the input
    /// string. Thereafter, public method UpdateState processes it one
    /// character at a time.
    /// </summary>
    private readonly char [ ] _achrInputCharacters = null;


    /// <summary>
    /// When two or more newline characters appear in a sequence, it is
    /// essential to determine whether they are a pair and, if so, treat
    /// them as such.
    /// </summary>
    private int _intPosNewlineRunStart = ArrayInfo.ARRAY_INVALID_INDEX;


    /// <summary>
    /// Since the StringBuilder is a reference type, exposing it makes
    /// it vulnerable to attack. Therefore, it is kept private, and a
    /// public instance method, GetTransformedString, must be explicitly
    /// called to get its value as a new string, an immutable entity.
    /// </summary>
    private StringBuilder _sbWork { get; }
    #endregion  // Private nested class LineEndingFixupState code and data
}   // private class LineEndingFixupState
#endregion  // Private nested class LineEndingFixupState

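All three methods call LineEndingFixup, passing in the input string and a member of the RequiredLineEndings enumeration that identifies the type of line ending wanted in the output string. Its first task is to construct a new LineEndingFixupState instance from the two. Next, the length of the input string is retrieved from the LineEndingFixupState object (the state machine), and then a loop calls UpdateState, passing in an index, so that it evaluates every character in the input string, updating the state as it goes and emitting a line ending of the required type whenever it encounters a new line ending. Any character that isn't part of a line ending is appended to a StringBuilder that the LineEndingFixupState object maintains. After the last character is processed, GetTransformedString is called, and the string that it returns is passed back up the call stack.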

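  • The UpdateState method begins by copying the character at the offset indicated by its integer argument into a local character variable. Next, the private method ClassifyThisCharacter is called to identify the character as one of the two valid line-ending characters or as something else. Since all three conventions use the same two characters, singly or in combination, only those two need special treatment. Anything else is appended to the output string.
  • If the current character is a line-ending character, indicated by IsThisCharANewline returning true, the instance method IsRunOfNelines determines whether the current character belongs to a run of two or more consecutive line-ending characters, which it does by checking whether the start position of a run has already been recorded.
  • If the offset of the current character from the start of the run is evenly divisible by 2 (the length of the longest valid line-ending sequence), or if the current character is the same as the previous character, a new line ending has been found, and a line ending of the required type is appended to the output string.
  • In every case, the first character of a run of line-ending characters causes a line ending to be appended and records the position at which the run starts.
  • Finally, any character other than a line-ending character resets the start position of the run to -1, which is an invalid character offset.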
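The same single-pass idea can be condensed into a short, self-contained sketch. This is a simplification under stated assumptions, not the LineEndingFixupState class shown above: the names are hypothetical, the state is reduced to a single character that remembers an unpaired line-ending character, and pairs are formed greedily from left to right, so its output for ambiguous runs (for example, CR CR LF) may differ from the library's.

```csharp
using System.Text;

public static class OnePassLineEndings
{
    private const char NONE = '\0';

    public static string Convert ( string source , string lineEnding )
    {
        // Worst case: every input character is a line ending that becomes CR LF.
        StringBuilder sb = new StringBuilder ( source.Length * 2 );

        // The line-ending character that most recently caused a line ending
        // to be emitted and that may still be completed by its partner.
        char unpaired = NONE;

        foreach ( char c in source )
        {
            bool isLineEndingChar = c == '\r' || c == '\n';

            if ( !isLineEndingChar )
            {
                sb.Append ( c );
                unpaired = NONE;
            }
            else if ( unpaired != NONE && c != unpaired )
            {
                // Second half of a CR LF or LF CR pair; its line ending was
                // emitted when the first half was seen.
                unpaired = NONE;
            }
            else
            {
                // First character of a line ending, or a repeat of the same
                // character, which starts another line ending.
                sb.Append ( lineEnding );
                unpaired = c;
            }
        }

        return sb.ToString ( );
    }
}
```

For example, `OnePassLineEndings.Convert ( "a\r\nb\n\rc\rd\ne" , "\n" )` returns `"a\nb\nc\nd\ne"`, and `OnePassLineEndings.Convert ( "x\n\ny" , "\r\n" )` returns `"x\r\n\r\ny"`.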

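There is a lot more cool stuff in this assembly and in the libraries on which it depends. Stay tuned for further articles about some of it; the next one will be Adapting JSON Strings for Deserialization into C# Objects.

History

Wednesday, June 19, 2019: Initial publication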
