使用哈希算法进行分布式缓存

Ivan L

3.40/5 (5投票s)

2007 年 12 月 1 日

CPOL

3分钟阅读

36251

可用于 Web 场分布式数据缓存或实现分布式哈希表 (DHT) 的哈希算法。

引言

本文介绍了一种哈希算法的思想，该算法可用于 Web 服务器场的分布式数据缓存或实现分布式哈希表 (DHT)。

算法

在高流量网站中，使用状态或在内存中缓存数据通常是一个问题，因为很难扩展。开箱即用的 ASP.NET 提供了三种存储会话信息的方式——进程内、进程外和使用数据库，但它们都不能轻松扩展。此外，使用数据库存储会话状态的成本相当高。

理想情况下，对于大规模 Web 应用程序中的会话管理，我们应该能够“插入”相对低规格的“缓存机器”，并拥有一个基础设施，可以将要缓存的数据均匀地分布到整个池中。下图说明了我所指的内容。

Screenshot - Distributed_Caching_Topology.png

我们有一个处理请求的 Web 服务器层和一个用于缓存的机器池。每个 Web 服务器都应该能够访问任何缓存机器。这里的棘手之处在于，当我们缓存一些数据作为对特定 Web 服务器的请求的结果时，我们应该能够从任何其他 Web 服务器检索该数据。

经过一些搜索，我遇到了 Memchached，它使用哈希机制将缓存的数据分布在缓存机器池中，并提供了一种可靠的检索数据的方法。这促使我尝试用 C# 开发一种类似的算法，该算法可以在这种场景中使用。

如果我们将哈希用于这种类型的解决方案，我们可以从中受益两个特性。首先，当我们散列一段数据时，得到的位数组长度始终相同。其次，我不确定这是否适用于所有哈希算法，但生成的哈希在统计上是随机的，因此我们可以假设它们形成某种均匀分布。

这是我们可以做的。我们想要缓存键值对，然后在稍后阶段通过提供相应的键来获取缓存的值。下面的代码获取一个键，并使用 SHA1 生成一个哈希位数组。然后经过一些转换，我们导出一个整数。给定的键总是产生相同的数字。此外，所有数字都是均匀分布的，这允许我们“分页”它们并将它们分配给给定数量的缓存机器或“桶”。

使用此算法，我们确保每个键始终指向同一个桶，并且所有键均匀地分布在缓存机器池中。

KeyDistributor 类

public class KeyDistributor<TKey>
{
    private int _numBuckets;
    private int _pageSize;
    private SHA1 _sha;

    public int CalculateBucketIndex(TKey key)
    {
        if (_numBuckets == 1)
        {
            return 0;
        }

        // Create a SHA1 hash of the provided key
        byte[] result;
        byte[] data = BitConverter.GetBytes(key.GetHashCode());
        result = _sha.ComputeHash(data);

        // Split the hash into 4 integer numbers
        uint n1 = BitConverter.ToUInt32(result, 0);
        uint n2 = BitConverter.ToUInt32(result, 4);
        uint n3 = BitConverter.ToUInt32(result, 8);
        uint n4 = BitConverter.ToUInt32(result, 12);
        uint n5 = BitConverter.ToUInt32(result, 16);

        // Apply XOR to derive a combined number
        uint index = n1 ^ n2 ^ n3 ^ n4 ^ n5;

        // Calculate the bucket index
        int bucketIndex = Convert.ToInt32(Math.Floor((double)(index / _pageSize)));

        return bucketIndex;
    }

    public KeyDistributor(int numBuckets)
    {
        _numBuckets = numBuckets;

        if (_numBuckets > 1)
        {
            // Based on the number of buckets calculate the page size.
            // It will be to link a given key to a specific bucket
            _pageSize = Convert.ToInt32
			(Math.Ceiling((double)(uint.MaxValue / numBuckets)));
            _sha = new SHA1Managed();
        }
    }
}

测试代码

现在让我们测试一下这个算法。下面的static 方法创建一些桶并在其中缓存项目。每个项目都是一个键值对，其中键是 GUID。

static void Main(string[] args)
{
    // The number of items to cache
    int k = 10000;
    // The number of buckets
    int bucketsNum = 8;
    List<Dictionary<Guid, string>> buckets = new List<Dictionary<Guid, string>>();
    List<Guid> values = new List<Guid>();

    // Initialize the buckets
    for (int i = 0; i < bucketsNum; i++)
    {
        buckets.Add(new Dictionary<Guid, string>());
    }

    // Create an instance of the KeyDistributor class
    KeyDistributor<Guid> kd = new KeyDistributor<Guid>(bucketsNum);

    for (int i = 0; i < k; i++)
    {
        // Create a Guid type of key
        Guid guid = Guid.NewGuid();
        // Calculate the bucket index
        int bucketIndex = kd.CalculateBucketIndex(guid);
        // Cache the item in the bucket
        buckets<bucketindex>.Add(guid, guid.ToString());
        // Add the key to a list so that we can later retrieve corresponding value
        values.Add(guid);
    }

    // Display the number of items in each bucket
    for (int i = 0; i < buckets.Count; i++)
    {
                Console.WriteLine("Bucket " + i + ": " + buckets[i].Count);
    }

    // Retrieve each cached item. This is to check that we can get hold of
    // each cached item
    for (int i = 0; i < values.Count; i++)
    {
        Guid guid = values[i];
        int bucketIndex = kd.CalculateBucketIndex(guid);
        string val = buckets<bucketindex>[guid];

        // If the item is not found raise an error
        if (String.IsNullOrEmpty(val))
        {
            throw new Exception("Cached item cannot be found");
        }
    }

    Console.ReadLine();
}

这些是结果

Number of buckets 3, items to cache 10,000.
Bucket 0: 3297
Bucket 1: 3342
Bucket 2: 3361

Number of buckets 8, items to cache 1,000,000.
Bucket 0: 125195
Bucket 1: 125419
Bucket 2: 124767
Bucket 3: 125105
Bucket 4: 124933
Bucket 5: 125086
Bucket 6: 125045
Bucket 7: 124450

可以看出，项目的分布在各个桶中几乎是均匀的。代码没有引发任何错误，这意味着我们可以成功检索每个缓存的项目。

后续步骤

在另一篇文章中，我将演示使用命名管道的完整分布式缓存原型。

历史

2007 年 12 月 1 日：初始帖子