列表三部曲，第3部分

Qwertie

5.00/5 (7投票s)

2014年8月12日

LGPL3

10分钟阅读

12405

SparseAList 和一些 AList 基准测试。

SparseAList<T> 简介

在写完前两篇文章后，我又创建了一种 AList<T> 数据结构。这种数据结构的动机是_语法高亮_。稍后会详细介绍。

SparseAList<T> 是一个列表，其中并非所有索引都已“设置”。未设置（也称为“清除”）的索引是“虚拟的”，根本不占用内存，当 [i] 清除时，sparseAList[i] 返回 default(T)。同时，已设置的索引会额外占用 4 字节。Count 属性返回“虚拟”项目的总数，包括已设置和未设置的。SparseAList<T> 的内部节点实际上与普通 AList<T> 相同，但叶节点结构不同。

    public class SparseAListLeaf<T> : AListNode<int, T>
    {
        [DebuggerDisplay("Offset = {Offset}, Item = {Item}")]
        protected struct Entry
        {
            public Entry(uint offset, T item) { Offset = offset; Item = item; }
            public T Item;
            public uint Offset;
        }
        protected InternalDList<Entry> _list;
        protected uint _totalCount;
        ...

例如，运行此代码后

var list = SparseAList<string>();
list.InsertSpace(0, 1000);
list[321] = "three two one";
list[32] = "three two";
list[3] = "three";

列表由一个包含三个 Entry 结构的叶节点组成

_list[0]: Offset = 3, Item = "three"
_list[1]: Offset = 32, Item = "three two"
_list[2]: Offset = 321, Item = "three two one"
_totalCount = 1000

从外部看，它似乎是一个包含 1000 个项目的列表，但实际上只有三个。

这种列表可能类似于 SortedDictionary<int,T>，但有一个很大的区别：你可以_插入和移除_索引范围，这有效地“移动”了受影响索引之上所有项目的索引。例如，如果我向 SparseAList<T> 添加一百万个真实项目，我可以执行 list.InsertSpace(0, 1000)，这将使所有现有项目的索引增加 1000（在 O(log N) 时间内）。

SparseAList<T> 实现了我的 ISparseListSource<T> 和 ISparseList<T> 接口。与普通列表相比，稀疏列表提供以下额外方法

/// <summary>Represents a read-only indexed list in which parts of the index 
/// space may be unused or "clear".</summary>
/// ... long remarks section removed ...
public interface ISparseListSource<T> : IListSource<T>
{
	/// <summary>Increases <c>index</c> by at least one to reach the next index
	/// that is not classified as empty space, and returns the item at that 
	/// index.</summary>
	/// <param name="index">This parameter is increased by at least one, and
	/// perhaps more than one so that it refers to an index where there is a
	/// value. If <c>index</c> is null upon entering this method, the first 
	/// non-empty space in the list is found. If there are no values at higher 
	/// indexes, if the list is empty or if <c>index + 1 >= Count</c>, 
	/// <c>index</c> is <c>null</c> when the method returns.</param>
	/// <remarks>This method must skip over all indexes i for which 
	/// <c>IsSet(i)</c> returns false, and return the next index for which
	/// <c>IsSet(i)</c> returns true. This method must accept any integer as 
	/// input, including invalid indexes.
	/// </remarks>
	T NextHigherItem(ref int? index);

	/// <summary>Decreases <c>index</c> by at least one to reach the next index
	/// that is not classified as empty space, and returns the item at that 
	/// index.</summary>
	/// <param name="index">This parameter is increased by at least one, and
	/// perhaps more than one so that it refers to an index where there is a
	/// value. If <c>index</c> is null upon entering this method, the last
	/// non-empty space in the list is found. If there are no values at lower
	/// indexes, if the list is empty or if <c>index</c> is 0 or less, 
	/// <c>index</c> is <c>null</c> when the method returns.</param>
	/// <remarks>This method must skip over all indexes i for which 
	/// <c>IsSet(i)</c> returns false, and return the next index for which
	/// <c>IsSet(i)</c> returns true. This method must accept any integer as 
	/// input, including invalid indexes.
	/// </remarks>
	T NextLowerItem(ref int? index);

	/// <summary>Determines whether a value exists at the specified index.</summary>
	/// <param name="index"></param>
	/// <returns>true if a value is assigned at the specified index, or false
	/// if index is part of an empty space, or is outside the range of indexes
	/// that exist.</returns>
	bool IsSet(int index);
}

public partial class LCInterfaces
{
	/// <summary>Gets the next higher index that is not classified as an
	/// empty space, or null if there are no non-blank higher indexes.</summary>
	/// <remarks>This extension method works by calling <c>NextHigherItem()</c>.</remarks>
	public static int? NextHigherIndex<T>(this ISparseListSource<T> list, int? index)
	{
		list.NextHigherItem(ref index);
		return index;
	}
	
	/// <summary>Gets the next lower index that is not classified as an
	/// empty space, or null if there are no non-blank lower indexes.</summary>
	/// <remarks>This extension method works by calling <c>NextHigherItem()</c>.</remarks>
	public static int? NextLowerIndex<T>(this ISparseListSource<T> list, int? index)
	{
		list.NextLowerItem(ref index);
		return index;
	}

	/// <summary>Returns the non-cleared items in the sparse list, along with 
	/// their indexes, sorted by index.</summary>
	/// <remarks>
	/// The returned sequence should exactly match the set of indexes for which 
	/// <c>list.IsSet(Key)</c> returns true.</remarks>
	public static IEnumerable<KeyValuePair<int, T>> Items<T>(this ISparseListSource<T> list)
	{
		int? i = null;
		for (;;) {
			T value = list.NextHigherItem(ref i);
			if (i == null) break;
			yield return new KeyValuePair<int, T>(i.Value, value);
		}
	}
}

/// <summary>Represents a sparse list that allows insertion and removal of items
/// and empty spaces. In a sparse list, some spaces can be "clear" meaning that 
/// they have no value.</summary>
/// ... long remarks section removed ...
public interface ISparseList<T> : ISparseListSource<T>, IListAndListSource<T>
{
	/// <summary>Unsets the range of indices <c>index</c> to <c>index+count-1</c> inclusive.
	/// If <c>index + count > Count</c>, the sparse list shall enlarge <c>Count</c>
	/// to be equal to <c>index + count</c>.</summary>
	/// <exception cref="ArgumentOutOfRangeException"><c>index</c> or <c>count</c> was negative.</exception>
	/// <exception cref="OverflowException"><c>index + count</c> overflowed.</exception>
	void ClearSpace(int index, int count = 1);
	
	/// <summary>Inserts empty space starting at the specified index.</summary>
	/// <exception cref="OverflowException"><c>index + count</c> overflowed.</exception>
	/// <exception cref="ArgumentOutOfRangeException"><c>index</c> or <c>count</c
	/// > was negative. If <c>index > Count</c>, this method may throw: if, for 
	/// this kind of list, setting this[i] for some invalid i>=0 throws 
	/// <c>ArgumentOutOfRangeException</c>, then so too does this method throw.
	/// If you want the list to be enlarged instead, call <c>Clear(index, 0)</c> 
	/// first, since the contract of Clear() requires it not to throw in the 
	/// same situation.</exception>
	void InsertSpace(int index, int count = 1);
}

用于语法高亮的 `SparseAList<T>`

我的语法高亮器第一个版本只是记录每行开头的词法分析器状态。这基本上奏效了，但存在一两个挑战 [我忘记了细节，因为我已经好几个月没做这个了]，而且它不允许我高效地添加“更高级别”的语法高亮。我希望不仅能对标记进行语法高亮，还能对更高级别的元素进行语法高亮（以 C# 为例，想象一下高亮数据类型和方法名称）。

最简单的方法是解析整个文件，这很慢。我不知道（现在也不知道）如何实现增量解析，但我认为我至少可以实现增量_词法分析_，这将加快解析过程，因为词法分析通常占总解析时间的一半左右。我的想法是，我会构建一个文件中所有非空白标记及其起始位置的列表。然后，当用户输入内容时，我会在编辑位置之前开始实时重新词法分析一小块区域，这将是一个非常快速的操作（大多数时候都是如此）。解析器仍然需要重新处理整个文件，但它会在计时器上运行，每次解析之间至少等待一秒左右。这个解析器会在后台线程上运行，并且可以在令牌列表的冻结版本上操作，以防它在另一个线程上发生变化（请记住，AList 支持快速克隆）。

假设每个标记都是一个四元组（标记类型、标记值、起始位置、长度）。最大的问题是文件很容易包含 100,000 个标记（例如 10,000 多行代码）。如果用户编辑文件开头，我们不希望用户每按一个键就更新与 100,000 个标记相关的“起始位置”。这效率低下。第二个问题是这种列表所需的存储空间相对较大。

一个常用的替代方法是按行存储信息，而不是将整个文件的所有信息存储在一个集合中。这使得大多数更新只影响单行，并且我们不必更新行号或进行大规模数据移动操作，除非添加或删除了换行符（即使那样，“大规模”操作的大小也与行数成比例，而不是字符数）。非常长的行会导致一些效率低下，但这种情况很少见。

但是我已经编写的解析器和词法分析器是为处理索引而不是（行，列）对而设计的，而且我不清楚我是否能足够快地将一种表示形式映射到另一种。此外，这可能是大量_不可重用_的工作：我花在这上面的工作除了语法高亮的特定任务外，将毫无用处。我更喜欢创建具有_多种应用_的软件。

第二个替代方案是基于 AList（或为这种场景设计的其他数据结构，如 gap buffer）。在这种设计中，令牌不是一个四元组，而是一个三元组（令牌类型、令牌值、长度），每个令牌的起始位置由其在 AList 中的位置隐含，并且有一个特殊值表示“此位置没有令牌开始”。所需的存储空间可以使用享元模式减少，但可能仍然需要比磁盘文件大小多 4 倍以上的内存。

为了节省内存，一个变体是去掉标记长度和值，只在每个字符处存储标记类型。标记长度可以在需要时动态获取，而标记值对于语法高亮可能根本不需要（可以给解析器提供虚拟值，假设解析结果最终会被丢弃）。标记类型可能可以存储在一个字节中，这样标记信息占用的空间与文本相同或更少。

但我没有想到那个解决方案。相反，我意识到我可以创建一个 AList 的稀疏版本，这样列表的大小就与标记的数量成比例，而不是与字符的数量成比例（这可能会节省内存，但不一定）。所以我就是这样做的。在我当前的语法高亮器中，我使用 SparseAList<EditorToken>，其中 EditorToken 是普通 16 字节 Loyc Token 的紧凑 8 字节版本，没有 StartIndex。

/// <summary>A compact token representation used by <see cref="SyntaxClassifierForVS"/>.</summary>
[DebuggerDisplay("Type = {Type}, Length = {Length}, Value = {Value}")]
public struct EditorToken
{
    public EditorToken(int type, int length, object value)
    { TypeAndLength = (type & 0x3FFF) | (Math.Min(length, 0x3FFFF) << 14); 
      Value = value; }
    public int Type { get { return TypeAndLength & 0x3FFF; } }
    public int Length { get { return (int)((uint)TypeAndLength >> 14); } }
    public Token ToToken(int start) { 
        return new Token(Type, start, Length, NodeStyle.Default, Value); }

    public object Value;
    // 14 bits for token type (enough to handle TokenKind), 18 for length
    int TypeAndLength;
}

但这真的是另一个故事了。提醒我写一篇……

基准测试

我终于在新工作站 PC 上进行了一些简单的基准测试。公平警告，我只测试了 AList 和 BDictionary，而没有测试其他，如 SparseAList 或 BMultiMap。

第一个测试场景是这样的

创建一个特定大小的列表（例如 10,000 或 30,000 个项目）
启动计时器
循环 100,000 次：在随机位置插入或删除 10 个项目 (3b)。每 10 次迭代，使用 AddRange 或 RemoveRange 在列表末尾删除或插入 10 个项目，以使列表大小大致保持不变。

首先是好消息：当列表大小较大时，AList<T> 在随机插入和删除方面比 List<T> 具有显著更好的性能。

Results Results

什么是 DList<T>？它是标准 List<T> 的直接竞争对手，也是 AList<T> 的一个更简单的替代方案。我写了一篇关于 DList 的独立文章。

百万项列表的原始计时如下

1000000: Insert at random indexes: AList               |1|    82.0|
1000000: Insert at random indexes: DList               |1| 18174.0|
1000000: Insert at random indexes: List                |1| 36190.0|
1000000: Remove at random indexes: AList               |1|    60.0|
1000000: Remove at random indexes: DList               |1| 18251.0|
1000000: Remove at random indexes: List                |1| 35909.0|

所以 AList 碾压所有。然而，当列表很小时，AList<T> 并不是明显的赢家。

Results Results

事实上，对于小列表（小于 100 项），List<T> 大约快两倍。

AList 在简单的“填充”方面也很糟糕，即项目总是添加到列表的末尾。

Add-at-end benchmark

这个场景是从一个特定大小的 IList<long> 开始，然后只添加 100,000 个项目。AList<T> 在这方面表现不佳，至少与 List<T> 和 DList<T> 相比，这两者通常都在一毫秒或更短的时间内完成任务（为什么 DList<T> 在 300 万个项目时突然“变慢”？可能在 100,000 次迭代的某个时刻，DList 必须扩大其内部的 300 万个项目的数组）。

我还测试了以下速度

一个通过 [索引器] 反复读取集合所有元素的循环，即

for (int c = 0; c < Cycles; c++) {
    sum = 0;
    for (int i = 0; i < list.Count; i++)
        sum += list[i];
}

一个通过 IEnumerator 扫描所有元素的循环

for (int c = 0; c < Cycles; c++)
    avg = list.Average(); // LINQ scans the collection with `IEnumerator`

此处，Cycles 的选择使得 Cycles * list.Count = 10_000_000（例外：在 300 万个项目时，Cycles = 3）。

AList<T> 是一种更复杂的数据结构，因此它的遍历速度比 List<T> 和 DList<T> 慢。

Scan results

请注意，AList 的 IEnumerator 具有恒定速度，而 AList 的 [indexer] 对于大型列表会变慢。这是因为索引器需要 O(log N) 时间，而 IEnumerator<T>.MoveNext 是 O(1) 操作。然而，AList 枚举器对于短列表往往较慢，可能是因为枚举器需要更多时间初始化（除了包含单个叶节点的非常短的列表，枚举器在枚举开始时会分配一个小型数组——一个遍历栈）。

我注意到 List<T> 有一个有趣的地方：通过 IEnumerator<T> 访问比通过索引器直接访问慢了 7 倍！例如，在包含 100,000 个项目的列表中，List<T> 索引器基准测试需要 14 毫秒，但 List<T> IEnumerator 基准测试需要 92 毫秒。这是为什么呢？我怀疑这是因为当你使用 LINQ（通过 IEnumerator<T> 访问）时，遍历无法被 JIT 内联。如果发生内联，它自然会快得多，因为对计算机来说，加一大堆数字是一项非常非常简单的工作。此外，如果发生内联，List 所需的两个范围检查之一可能被消除。可以消除（是的，有两个。一个是针对 List<T> 的 Count 的范围检查，另一个是针对 List<T> 内部数组的 Length 的检查。）

我注意到 DList<T> 没有通过索引器获得那么多的性能提升，所以我尝试了 InternalDList<T>，它获得了更好的结果（请参阅我的 DList 文章。）

字典基准测试

我对五种 <long, int> 字典（Dictionary、MMap、BDictionary、SortedDictionary 和 SortedList）做了一个简单的基准测试。结果很清楚

Results

显然，SortedList 是邪恶的！具体来说，对于任何包含大约 300 个或更多项目的集合，SortedList 是最慢的字典类型，如果项目数达到 10000 个或更多，它就开始变得慢得离谱。如果 SortedList 包含 300 万个项目，那么在这台高端 PC 上，100,000 次随机插入需要 401 秒（将近 7 分钟）；这意味着每次迭代超过 4 毫秒。

现在我们来看一个没有 SortedList 搞砸 Y 轴的图表。

Results

需要注意的是，有序字典通常比无序字典慢。标准的 Dictionary<K,V> 和 Loyc.Collections 的 MMap<K,V> 类都是无序的，因此它们比 BDictionary<K,V> 和标准 SortedDictionary<K,V>（它们是有序的）要快。

令我失望的是，BDictionary<K,V> 通常比 SortedDictionary<K,V> 慢。但另一方面，BDictionary<K,V> 比 SortedDictionary<K,V> 使用更少的内存，因为它将其项目打包成数组。相比之下，SortedDictionary 是一个红黑树，它为字典中的每个键值对（节点）分配一个单独的堆对象。这会带来每个项目 5 个字的开销，在 64 位进程中通常是 40 字节（其中两个字用于对象头，两个字用于指向每个节点的左子节点和右子节点的引用，一个字用于红黑树的“红色”标志）。还应注意的是，SortedDictionary<K,V> 可能比 BDictionary<K,V> 对垃圾收集器造成更大的压力，因为它包含大量引用；我预计这只会在堆较大的应用程序中显著影响性能。

尽管其速度表现平平，但 BDictionary<K,V> 通常是比 SortedDictionary<K,V> 更好的选择，因为它支持许多 SortedDictionary<K,V> 不支持的操作，例如“查找下一个更高的键”、“将列表分成两部分”和“按索引获取项目”。

下载！

我很高兴地宣布一个新的 NuGet 包，名为“LoycCore”，它包含了所有 AList 的变体以及其他数据结构（包括 DList）和各种其他便捷工具，包括我过去发表文章中的大部分内容。我还为 Loyc Core 创建了一个新网站，你可以在 GitHub 上查看或下载源代码。

历史

2013年3月27日：第1部分发布
2013年9月8日：第2部分发布
2014年8月12日：第3部分发布

列表三部曲，第3部分

SparseAList<T> 简介

用于语法高亮的 SparseAList<T>

基准测试

字典基准测试

下载！

历史

用于语法高亮的 `SparseAList<T>`