
A Multi-Level Cache in C++ That Can Do a Billion Lookups per Second


11 Oct 2021

GPL3

31 min read


A direct-mapped cache at the front, an LRU approximation behind it, and any LLC at the bottom.

Caching

Everyone in hardware design and software development uses caching in one form or another. A cache is a useful structure that combines the cost efficiency of dense memory with the low latency of fast memory.

Since there is no free lunch in information technology, a cache has to trade capacity against performance to be really useful. That is why desktop processors have no 8GB L1 cache: the bigger the cache, the higher its latency. Likewise, the smaller the cache, the lower the cache-hit ratio, which can mean lower performance. A sweet spot has to be found for the best performance at the lowest cost.

Just like hardware, software caches come with trade-offs. For example, to build a perfect LRU that evicts only the least recently used item, a developer needs at least a std::map or std::unordered_map. That alone carries a non-negligible amortized O(1) lookup cost, and the eviction order has to be maintained in a linked list (sometimes doubly linked), which means frequent memory allocations or multiple pointer assignments. These solutions add extra latency to every cache hit and every cache miss.
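
To make that overhead concrete, here is a minimal sketch of an exact LRU built from std::list and std::unordered_map. It is not the cache implemented in this article, only an illustration of the per-access bookkeeping (hash lookup, list splicing, node allocation) that the CLOCK approximation below avoids:

// Minimal sketch of an exact LRU (NOT the cache implemented in this article),
// shown only to illustrate the per-access bookkeeping that CLOCK avoids:
// every hit splices a list node, every miss allocates and erases one.
#include <list>
#include <unordered_map>
#include <utility>

template<typename K, typename V>
class ExactLru
{
public:
	explicit ExactLru(size_t capacity):cap(capacity) {}

	// on a hit: hash lookup + list splice (pointer updates), then copy the value out
	bool get(const K & key, V & out)
	{
		auto it = index.find(key);
		if(it == index.end()) return false;
		order.splice(order.begin(), order, it->second); // move to most-recently-used
		out = it->second->second;
		return true;
	}

	// on a miss: possible eviction of the true LRU + node allocation for the new entry
	void set(const K & key, const V & value)
	{
		auto it = index.find(key);
		if(it != index.end())
		{
			it->second->second = value;
			order.splice(order.begin(), order, it->second);
			return;
		}
		if(order.size() == cap)
		{
			index.erase(order.back().first); // evict the least recently used
			order.pop_back();                // node deallocation
		}
		order.emplace_front(key, value);     // node allocation
		index[key] = order.begin();
	}

private:
	size_t cap;
	std::list<std::pair<K,V>> order; // MRU at front, LRU at back
	std::unordered_map<K, typename std::list<std::pair<K,V>>::iterator> index;
};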

LRU Approximation

An LRU approximation trades cache-hit ratio for access latency. A slightly lower hit ratio is usually tolerable when the performance gained from lower access latency far outweighs the loss. In this article, the CLOCK second-chance version of an LRU approximation is implemented and benchmarked, then further optimized by putting a direct-mapped cache in front of it, which gives single-step victim-slot selection for small enough datasets.

The CLOCK Second-Chance Algorithm

The CLOCK second-chance algorithm optimizes away the node allocations of a linked list by using a fixed-size (circular) buffer. With no memory allocation on any get/set call, both cache-hit and cache-miss times are minimized, but a map is still required, which costs some hashing and comparisons before and after the data access.

CLOCK has two "hands": one evicts victims, the other gives would-be victims a second chance. One hand leads the other by a 50% phase difference, and both can be implemented with two integer counters (of type size_t if the cache holds more than 4 billion elements).

Whenever the cache's get or set method is called, the first thing done is a lookup of the requested key: one call to the map's find method and a comparison against end(). If the key is found, the key/value pair is in RAM. If not, the CLOCK victim search starts iterating over the circular buffer, looking for an available slot. One CLOCK "hand" simply picks any slot that has not been accessed recently and uses it as the victim to evict. The other hand follows the same buffer at a 50% phase offset and gives the other "unlucky" slots a second chance to prove they are "not the least recently used". It is an approximation and does not behave exactly like a full LRU, even for a cache of just 2 slots.

CLOCK Second-Chance Implementation (LruClockCache.h)

#ifndef LRUCLOCKCACHE_H_
#define LRUCLOCKCACHE_H_

#include<vector>
#include<algorithm>
#include<unordered_map>
#include<functional>
#include<mutex>

/* LRU-CLOCK-second-chance implementation
 *
 * LruKey: type of key (std::string, int, char, size_t, objects)
 * LruValue: type of value that is bound to key (same as above)
 * ClockHandInteger: just an optional optimization to reduce memory consumption
 * when cache size is equal to or less than 255,65535,4B-1,...
 */
template<	typename LruKey, typename LruValue,typename ClockHandInteger=size_t>
class LruClockCache
{
public:
	// allocates circular buffers for numElements number of cache slots
	// readMiss: 	cache-miss for read operations. User needs to give this function
	// 				to let the cache automatically get data from backing-store
	//				example: [&](MyClass key){ return redis.get(key); }
	//				takes a LruKey as key, returns LruValue as value
	// writeMiss: 	cache-miss for write operations. User needs to give this function
	// 				to let the cache automatically set data to backing-store
	//				example: [&](MyClass key, MyAnotherClass value){ redis.set(key,value); }
	//				takes a LruKey as key and LruValue as value

	LruClockCache(ClockHandInteger numElements,
				const std::function<LruValue(LruKey)> & readMiss,
				const std::function<void(LruKey,LruValue)> &
                writeMiss):size(numElements),loadData(readMiss),saveData(writeMiss)
	{
		ctr = 0;
		// 50% phase difference between eviction and second-chance hands of the
        // "second-chance" CLOCK algorithm
		ctrEvict = numElements/2;

		//loadData=readMiss;
		//saveData=writeMiss;

		// initialize circular buffers
		for(ClockHandInteger i=0;i<numElements;i++)
		{
			valueBuffer.push_back(LruValue());
			chanceToSurviveBuffer.push_back(0);
			isEditedBuffer.push_back(0);
			keyBuffer.push_back(LruKey());
		}
		mapping.reserve(numElements);
	}

	// get element from cache
	// if cache doesn't find it in circular buffers,
	// then cache gets data from backing-store
	// then returns the result to user
	// then cache is available from RAM on next get/set access with same key
	inline
	const LruValue get(const LruKey & key)  noexcept
	{
		return accessClock2Hand(key,nullptr);
	}

	// only syntactic difference
	inline
	const std::vector<LruValue> getMultiple(const std::vector<LruKey> & key)  noexcept
	{
		const int n = key.size();
		std::vector<LruValue> result(n);

		for(int i=0;i<n;i++)
		{
			result[i]=accessClock2Hand(key[i],nullptr);
		}
		return result;
	}

	// thread-safe but slower version of get()
	inline
	const LruValue getThreadSafe(const LruKey & key)  noexcept
	{
		std::lock_guard<std::mutex> lg(mut);
		return accessClock2Hand(key,nullptr);
	}

	// set element to cache
	// if cache doesn't find it in circular buffers,
	// then cache sets data on just cache
	// writing to backing-store only happens when
	// 					another access evicts the cache slot containing this key/value
	//					or when cache is flushed by flush() method
	// then returns the given value back
	// then cache is available from RAM on next get/set access with same key
	inline
	void set(const LruKey & key, const LruValue & val) noexcept
	{
		accessClock2Hand(key,&val,1);
	}

	// thread-safe but slower version of set()
	inline
	void setThreadSafe(const LruKey & key, const LruValue & val)  noexcept
	{
		std::lock_guard<std::mutex> lg(mut);
		accessClock2Hand(key,&val,1);
	}

	// use this before closing the backing-store to store the latest bits of data
	void flush()
	{
		for (auto mp = mapping.cbegin();
             mp != mapping.cend() /* not hoisted */; /* no increment */)
		{
		  if (isEditedBuffer[mp->second] == 1)
		  {
				isEditedBuffer[mp->second]=0;
				auto oldKey = keyBuffer[mp->second];
				auto oldValue = valueBuffer[mp->second];
				saveData(oldKey,oldValue);
				mapping.erase(mp++);    // or "it = m.erase(it)" since C++11
		  }
		  else
		  {
			  	++mp;
		  }
		}
	}

	// CLOCK algorithm with 2 hand counters
    // (1 for second chance for a cache slot to survive, 1 for eviction of cache slot)
	// opType=0: get
	// opType=1: set
	LruValue const accessClock2Hand(const LruKey & key,const LruValue * value,
                                    const bool opType = 0)
	{
		// check if it is a cache-hit (in-cache)
		typename std::unordered_map<LruKey,ClockHandInteger>::iterator it = mapping.find(key);
		if(it!=mapping.end())
		{
			chanceToSurviveBuffer[it->second]=1;
			if(opType == 1)
			{
				isEditedBuffer[it->second]=1;
				valueBuffer[it->second]=*value;
			}
			return valueBuffer[it->second];
		}
		else // could not find key in cache, so searching in circular-buffer starts
		{
			long long ctrFound = -1;
			LruValue oldValue;
			LruKey oldKey;
			while(ctrFound==-1)
			{
				// eviction hand lowers the "chance" status down if it's 1
                // but slot is saved from eviction
				if(chanceToSurviveBuffer[ctr]>0)
				{
					chanceToSurviveBuffer[ctr]=0;
				}

				// circular buffer has no bounds
				ctr++;
				if(ctr>=size)
				{
					ctr=0;
				}

				// unlucky slot is selected for eviction
				if(chanceToSurviveBuffer[ctrEvict]==0)
				{
					ctrFound=ctrEvict;
					oldValue = valueBuffer[ctrFound];
					oldKey = keyBuffer[ctrFound];
				}

				// circular buffer has no bounds
				ctrEvict++;
				if(ctrEvict>=size)
				{
					ctrEvict=0;
				}
			}

			// eviction algorithm start
			if(isEditedBuffer[ctrFound] == 1)
			{
				// if it is "get"
				if(opType==0)
				{
					isEditedBuffer[ctrFound]=0;
				}

				saveData(oldKey,oldValue);

				// "get"
				if(opType==0)
				{
					const LruValue && loadedData = loadData(key);
					mapping.erase(keyBuffer[ctrFound]);
					valueBuffer[ctrFound]=loadedData;
					chanceToSurviveBuffer[ctrFound]=0;

					mapping.emplace(key,ctrFound);
					keyBuffer[ctrFound]=key;

					return loadedData;
				}
				else /* "set" */
				{
					mapping.erase(keyBuffer[ctrFound]);

					valueBuffer[ctrFound]=*value;
					chanceToSurviveBuffer[ctrFound]=0;

					mapping.emplace(key,ctrFound);
					keyBuffer[ctrFound]=key;
					return *value;
				}
			}
			else // not edited
			{
				// "set"
				if(opType == 1)
				{
					isEditedBuffer[ctrFound]=1;
				}

				// "get"
				if(opType == 0)
				{
					const LruValue && loadedData = loadData(key);
					mapping.erase(keyBuffer[ctrFound]);
					valueBuffer[ctrFound]=loadedData;
					chanceToSurviveBuffer[ctrFound]=0;

					mapping.emplace(key,ctrFound);
					keyBuffer[ctrFound]=key;

					return loadedData;
				}
				else // "set"
				{
					mapping.erase(keyBuffer[ctrFound]);

					valueBuffer[ctrFound]=*value;
					chanceToSurviveBuffer[ctrFound]=0;

					mapping.emplace(key,ctrFound);
					keyBuffer[ctrFound]=key;
					return *value;
				}
			}
		}
	}

private:
	ClockHandInteger size;
	std::mutex mut;
	std::unordered_map<LruKey,ClockHandInteger> mapping;
	std::vector<LruValue> valueBuffer;
	std::vector<unsigned char> chanceToSurviveBuffer;
	std::vector<unsigned char> isEditedBuffer;
	std::vector<LruKey> keyBuffer;
	const std::function<LruValue(LruKey)>  loadData;
	const std::function<void(LruKey,LruValue)>  saveData;
	ClockHandInteger ctr;
	ClockHandInteger ctrEvict;
};

#endif /* LRUCLOCKCACHE_H_ */

The core of the algorithm, the eviction part, consists of a while loop and a conditional eviction data flow. The while loop only checks slots that qualify for eviction or a second chance. Because the iteration is circular, it very likely finds a victim within a few iterations, and the same holds for the second-chance hand. A bigger cache further reduces the probability of needing extra iterations to find a victim.

Once a victim slot is found, the algorithm checks whether the slot was edited. The edited state means "changed earlier by the set() method", or in cache terminology, the "dirty bit". A single byte per slot is enough to hold this information. If the isEdited flag is set, the backing-store (or whatever source the cache sits on) is updated. That could be a database write, a time-consuming computation, or another, slower cache.

After the conditional backing-store update, the flow branches into the get/set paths. For a set call, the victim slot is updated with the new key/value, and the mapping object is updated so the new key points at the victim slot. For a get call, there is one extra read of the new value from the backing-store, and that value is then returned from the get method.

The cache implementation requires the user to supply the cache-miss functions for read misses and write misses as constructor parameters. The read-miss function can be given as a lambda that takes a parameter of the template's first type (the key) and returns a value of the template's second type:

LruClockCache<Key,Value> cache(numberOfCacheItems,[&](Key key){

  // cache miss (read)
  // access data-store
  return receiveDataFromVoyager1(key);
  },[&](Key key,Value value){

  // cache miss (write)
  // access data-store
  transmitDataToVoyager1(key,value);
});

Of course, this project is not meant for something running in space. Such a system should prefer the highest possible cache-hit ratio over access latency, ideally with a dynamically evolving (adaptive) eviction algorithm. This project targets workloads that issue millions or even billions of cache misses and cache hits per second, such as loading/storing world chunks in a Minecraft clone.

Once constructed, it touches the backing-store automatically whenever needed, like this:

// cache handles all cache-miss functions automatically
WorldChunk chunk = cache.get("terrain data at 1500 35 2000");

chunk.type = WorldChunk::Lava();

// cache handles all cache-miss functions automatically
cache.set("terrain data at 1500 35 2000",chunk);

// should be called only before cache or backing-store is freed/disconnected
cache.flush();

Since set calls are not immediately written to the backing-store, the flush() method must be called before the application is closed if the user needs the current session's latest data when the next game session loads.

cache.flush(); // all pending writes are flushed to backing-store

Besides the plain get/set methods, there are getThreadSafe and setThreadSafe methods that allow the cache to serve as a last-level cache (LLC). They use a simple lock-guard to keep data transfers between multiple threads synchronized.

Single-level cache with data coherence for multiple threads

// somewhere in thread-1
cache.setThreadSafe("terrain data at 1500 35 2000",lava);
...
// somewhere in thread-2
WorldChunk chunk = cache.getThreadSafe("terrain data at 1500 35 2000");

When many threads contend for the same resource, synchronization adds a lot of latency. The best case is no thread contention at all. That is much faster, but it requires caching the resource (here the resource is already a cache, so each thread should also have its own private cache to reduce its dependence on the shared last-level cache; more on this later).

To test the cache's performance in a simple, readable, RAII way, there is another header file (source code):

#ifndef CPUBENCHMARKER_H_
#define CPUBENCHMARKER_H_

#include <chrono>
#include <string>
#include <iostream>
#include <iomanip>

// RAII type benchmarker
class CpuBenchmarker
{
public:
	CpuBenchmarker():CpuBenchmarker(0,"",0)
	{
		measurementTarget=nullptr;
	}

	CpuBenchmarker(size_t bytesToBench):CpuBenchmarker(bytesToBench,"",0)
	{
		measurementTarget=nullptr;
	}

	CpuBenchmarker(size_t bytesToBench, std::string infoExtra):CpuBenchmarker
                  (bytesToBench,infoExtra,0)
	{
		measurementTarget=nullptr;
	}

	CpuBenchmarker(size_t bytesToBench, std::string infoExtra,
                   size_t countForThroughput):t1(std::chrono::duration_cast<
                   std::chrono::nanoseconds >
                  (std::chrono::high_resolution_clock::now().time_since_epoch()))
	{
		bytes=bytesToBench;
		info=infoExtra;
		count = countForThroughput;
		measurementTarget=nullptr;
	}

	// writes elapsed time (in seconds) to this variable upon destruction
	void addTimeWriteTarget(double * measurement)
	{
		measurementTarget=measurement;
	}

	~CpuBenchmarker()
	{
		std::chrono::nanoseconds t2 =  std::chrono::duration_cast<
                                       std::chrono::nanoseconds >
                           (std::chrono::high_resolution_clock::now().time_since_epoch());
		size_t t = t2.count() - t1.count();
		if(measurementTarget!=nullptr)
		{
			*measurementTarget=t/1000000000.0; // seconds
		}
		if(info!=std::string(""))
			std::cout<<info<<": ";
		std::cout<<t<<" nanoseconds    ";
		if(bytes>0)
		{
			std::cout <<" (bandwidth = ";
		    std::cout << std::fixed;
		    std::cout << std::setprecision(2);
			std::cout <<   (bytes/(((double)t)/1000000000.0))/1000000.0 <<" MB/s)     ";
		}
		if(count>0)
		{
			std::cout<<" (throughput = ";
		    std::cout << std::fixed;
		    std::cout << std::setprecision(2);
			std::cout <<   (((double)t)/count) <<" nanoseconds per iteration) ";
		}
		std::cout<<std::endl;
	}

private:
	std::chrono::nanoseconds t1;
	size_t bytes;
	size_t count;
	std::string info;
	double * measurementTarget;
};

#endif /* CPUBENCHMARKER_H_ */

An instance of this class starts a time measurement in its constructor and finishes it in its destructor. Usage is scope-based: it measures the time span between its construction and destruction:

for(int i=0;i<10;i++)
{
    CpuBenchmarker bench;
    for(int j=0;j<100000;j++)
        summation += cache.get(j);
}

and writes the result to cout:

14292519 nanoseconds
1613470 nanoseconds
1484368 nanoseconds
1475362 nanoseconds
1476067 nanoseconds
1445494 nanoseconds
1476075 nanoseconds
1443211 nanoseconds
1444241 nanoseconds
1542156 nanoseconds

Judging from the benchmarker usage and the performance of the LRU approximation, each cache.get(key) call costs about 15 nanoseconds on average. This matches the latency of the test system processor's L2 cache: an FX8150 at 2GHz has a 21-cycle L2, i.e., 10.5 nanoseconds. The extra latency comes from the eviction logic. The dataset has only 100,000 elements (of type int), which is 400kB of data, too big for the L1 cache but clearly benefiting from the 2MB L2.

dataset 增加到超出 L3 缓存时

for(int i=0;i<10;i++)
{
    CpuBenchmarker bench;
    for(int j=0;j<10000000;j++)    // 10M iterations = 40MB dataset
        summation += cache.get(j); // cache size = 20M items
}

the output is:

2429736836 nanoseconds
184091798 nanoseconds
183418885 nanoseconds
182178757 nanoseconds
182320176 nanoseconds
180748781 nanoseconds
181500728 nanoseconds
182763052 nanoseconds
184306739 nanoseconds
182924882 nanoseconds   

This is an average of 18 nanoseconds per cache.get(). It is not the real latency but the inverse throughput: modern CPUs are superscalar, pipelined, and out-of-order, which hides a large part of a code block's true latency. To get an idea of the real latency, the number of iterations can be reduced to just a few:

for(int i=0;i<10;i++) {
    CpuBenchmarker bench;
    summation += cache.get(5);
}

The results are:

3700 nanoseconds
208 nanoseconds
165 nanoseconds
167 nanoseconds
167 nanoseconds
174 nanoseconds
167 nanoseconds
167 nanoseconds
166 nanoseconds
166 nanoseconds

What the first iteration did:

  • Tried to find the key in the unordered_map and failed
  • Searched the cache for a victim slot to evict
  • Called the user-supplied cache-miss function and performed the eviction
  • Loaded the required data from RAM (a std::vector served as the backing-store in this test) into the CPU's L3/L2/L1 caches
  • Served the data from the cache
  • 3700 nanoseconds

What the second iteration did:

  • Tried to find the key in the unordered_map and succeeded
  • Returned the data directly from the cache slot
  • CPU branch prediction was not yet warmed up
  • 208 nanoseconds

The remaining iterations:

  • Tried to find the key in the unordered_map and succeeded
  • Returned the data directly from the cache slot
  • CPU branch prediction recognized yet another access to the same item
  • 165 nanoseconds

All of the tests above used a cache big enough for a 100% hit ratio. To test the API's cache-miss performance, the dataset should be larger than the cache, and its access pattern should target the algorithm's weakness. The weakness of the LRU approximation is sequential access over a large dataset:

std::vector<int> test(1000000);

for(int i=0;i<1000000;i++)
    test[i]=sin(i*3.14f);

LruClockCache<int,int> cache(100,[&](int key){
        return test[key];
    },[&](int key, int value){
        test[key]=value;
});

size_t s1=0;
for(int i=0;i<10;i++)
{
    CpuBenchmarker bench;
    for(int j=0;j<1000000;j++)
        s1+=cache.get(j);
}
std::cout<<"======"<<s1<<std::endl;

The output is as follows:

118993415 nanoseconds
120295111 nanoseconds
120736697 nanoseconds
119908045 nanoseconds
118299823 nanoseconds
118886337 nanoseconds
119130691 nanoseconds
118587024 nanoseconds
119342589 nanoseconds
119526892 nanoseconds
======120

With a 100-element cache and a 1M-element dataset, the cache-hit ratio is practically zero. ~119 milliseconds for 1M cache.get() calls is 119 nanoseconds per call. This is mostly bounded by RAM latency, the lambda calls, the failing map.find() calls, and the extra indirection through the std::vector backing-store. 1M elements occupy only 4MB while the processor's L3 cache is 8MB, so RAM latency would only show up with a bigger dataset. Growing the dataset 10x raises the average latency from 119 to 120 nanoseconds; this tiny increase is probably caused by the cost of the failing map.find() calls. Another 10x on the dataset gives 125 nanoseconds per call, compared with the 119 nanoseconds of the 100x smaller dataset.

Direct-Mapped Cache

To filter the input of the LRU approximation against certain corner cases and to get faster access, a direct-mapped cache can be placed in front of it.

A direct-mapped cache is a comparison-free cache in which the slot to evict is derived directly from the key through a simple operation, the simplest being the modulo operator. It maps every k-th integer key to the same slot, every (k+1)-th key to the same neighboring slot, and the whole slot array onto the whole key space, using nothing more than the

int tag = key % size;

computation. The rest of the algorithm is the same as in the LRU (an eviction still takes place), but with less bookkeeping (no mapping at all) and therefore fewer bytes per cached element. A direct-mapped cache of a given size (number of elements) thus takes less memory than an LRU cache of the same size, which further increases the chance that the cache itself fits inside the CPU's L1 (or L2) cache. Having a plain (circular or not) buffer is also good for locality of reference and works well with CPU caches.
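
The implementation below restricts the slot count to a power of two so the modulo can be computed with a bitwise AND against size-1, which avoids an integer division:

// equivalent mapping when size is a power of two (as DirectMappedCache below does);
// for non-negative keys this gives the same result as key % size, without a division
int tag = key & (size - 1);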

Direct-mapped caches have drawbacks. One is key collisions on the same cache slot, which lower the hit ratio; another is being limited to integer keys. The LRU cache works with any key type, from strings to objects to integers.
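
For example, with 10,000 slots, the keys 3 and 10003 share the same slot, so alternating between them misses on every access:

// conflict-miss illustration for a 10,000-slot direct-mapped cache:
// both keys map to slot 3, so each access evicts the other one
int size = 10000;
int tagA = 3     % size; // slot 3
int tagB = 10003 % size; // slot 3 as well -> the two keys keep evicting each other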

Implementing a direct-mapped cache on the software side is straightforward (source code):

#ifndef DIRECTMAPPEDCACHE_H_
#define DIRECTMAPPEDCACHE_H_

#include<vector>
#include<functional>
#include<mutex>
#include<iostream>

/* Direct-mapped cache implementation
 * Only usable for integer type keys
 *
 * CacheKey: type of key (only integers: int, char, size_t)
 * CacheValue: type of value that is bound to key (same as above)
 */
template<	typename CacheKey, typename CacheValue>
class DirectMappedCache
{
public:
	// allocates buffers for numElements number of cache slots/lanes
	// readMiss: 	cache-miss for read operations. User needs to give this function
	// 				to let the cache automatically get data from backing-store
	//				example: [&](MyClass key){ return redis.get(key); }
	//				takes a CacheKey as key, returns CacheValue as value
	// writeMiss: 	cache-miss for write operations. User needs to give this function
	// 				to let the cache automatically set data to backing-store
	//				example: [&](MyClass key, MyAnotherClass value){ redis.set(key,value); }
	//				takes a CacheKey as key and CacheValue as value
	// numElements: has to be integer-power of 2 (e.g. 2,4,8,16,...)
	DirectMappedCache(CacheKey numElements,
				const std::function<CacheValue(CacheKey)> & readMiss,
				const std::function<void(CacheKey,CacheValue)> &
                writeMiss):size(numElements),sizeM1(numElements-1),
                loadData(readMiss),saveData(writeMiss)
	{
		// initialize buffers
		for(CacheKey i=0;i<numElements;i++)
		{
			valueBuffer.push_back(CacheValue());
			isEditedBuffer.push_back(0);
			keyBuffer.push_back(CacheKey());
		}
	}

	// get element from cache
	// if cache doesn't find it in buffers,
	// then cache gets data from backing-store
	// then returns the result to user
	// then cache is available from RAM on next get/set access with same key
	inline
	const CacheValue get(const CacheKey & key)  noexcept
	{
		return accessDirect(key,nullptr);
	}

	// only syntactic difference
	inline
	const std::vector<CacheValue> getMultiple(const std::vector<CacheKey> & key)  noexcept
	{
		const int n = key.size();
		std::vector<CacheValue> result(n);

		for(int i=0;i<n;i++)
		{
			result[i]=accessDirect(key[i],nullptr);
		}
		return result;
	}

	// thread-safe but slower version of get()
	inline
	const CacheValue getThreadSafe(const CacheKey & key)  noexcept
	{
		std::lock_guard<std::mutex> lg(mut);
		return accessDirect(key,nullptr);
	}

	// set element to cache
	// if cache doesn't find it in buffers,
	// then cache sets data on just cache
	// writing to backing-store only happens when
	// 					another access evicts the cache slot containing this key/value
	//					or when cache is flushed by flush() method
	// then returns the given value back
	// then cache is available from RAM on next get/set access with same key
	inline
	void set(const CacheKey & key, const CacheValue & val) noexcept
	{
		accessDirect(key,&val,1);
	}

	// thread-safe but slower version of set()
	inline
	void setThreadSafe(const CacheKey & key, const CacheValue & val)  noexcept
	{
		std::lock_guard<std::mutex> lg(mut);
		accessDirect(key,&val,1);
	}

	// use this before closing the backing-store to store the latest bits of data
	void flush()
	{
		try
		{
		for (CacheKey i=0;i<size;i++)
		{
		  if (isEditedBuffer[i] == 1)
		  {
				isEditedBuffer[i]=0;
				auto oldKey = keyBuffer[i];
				auto oldValue = valueBuffer[i];
				saveData(oldKey,oldValue);
		  }
		}
		}catch(std::exception &ex){ std::cout<<ex.what()<<std::endl; }
	}

	// Direct-mapped access: the slot is computed directly from the key (tag = key & (size-1))
	// opType=0: get
	// opType=1: set
	CacheValue const accessDirect(const CacheKey & key,const CacheValue * value,
               const bool opType = 0)
	{
		// find tag mapped to the key
		CacheKey tag = key & sizeM1;

		// compare keys
		if(keyBuffer[tag] == key)
		{
			// cache-hit

			// "set"
			if(opType == 1)
			{
				isEditedBuffer[tag]=1;
				valueBuffer[tag]=*value;
			}

			// cache hit value
			return valueBuffer[tag];
		}
		else // cache-miss
		{
			CacheValue oldValue = valueBuffer[tag];
			CacheKey oldKey = keyBuffer[tag];

			// eviction algorithm start
			if(isEditedBuffer[tag] == 1)
			{
				// if it is "get"
				if(opType==0)
				{
					isEditedBuffer[tag]=0;
				}

				saveData(oldKey,oldValue);

				// "get"
				if(opType==0)
				{
					const CacheValue && loadedData = loadData(key);
					valueBuffer[tag]=loadedData;
					keyBuffer[tag]=key;
					return loadedData;
				}
				else /* "set" */
				{
					valueBuffer[tag]=*value;
					keyBuffer[tag]=key;
					return *value;
				}
			}
			else // not edited
			{
				// "set"
				if(opType == 1)
				{
					isEditedBuffer[tag]=1;
				}

				// "get"
				if(opType == 0)
				{
					const CacheValue && loadedData = loadData(key);
					valueBuffer[tag]=loadedData;
					keyBuffer[tag]=key;
					return loadedData;
				}
				else // "set"
				{
					valueBuffer[tag]=*value;
					keyBuffer[tag]=key;
					return *value;
				}
			}
		}
	}

private:
	const CacheKey size;
	const CacheKey sizeM1;
	std::mutex mut;

	std::vector<CacheValue> valueBuffer;
	std::vector<unsigned char> isEditedBuffer;
	std::vector<CacheKey> keyBuffer;
	const std::function<CacheValue(CacheKey)>  loadData;
	const std::function<void(CacheKey,CacheValue)>  saveData;
};

#endif /* DIRECTMAPPEDCACHE_H_ */

From the outside, it works almost the same as the LRU cache: there are get and set methods, the constructor takes similar parameters, and there is a flush() method as well.

Running the same micro-benchmarks against this cache gives better timings.

Cache-miss test:

std::vector<int> test(10000);

for(int i=0;i<10000;i++)
    test[i]=sin(i*3.14f);

DirectMappedCache<int,int> cache(100,[&](int key)
    { return test[key]; },[&](int key, int value){ test[key]=value; });

size_t s1=0;
for(int i=0;i<10;i++)
{
    CpuBenchmarker bench;
    for(int j=0;j<10000;j++)
        s1+=cache.get(j);
}
std::cout<<"======"<<s1<<std::endl;

The output is as follows:

101048 nanoseconds
99746 nanoseconds
100138 nanoseconds
100637 nanoseconds
101137 nanoseconds
104758 nanoseconds
101301 nanoseconds
99834 nanoseconds
100759 nanoseconds
329917 nanoseconds

100k nanoseconds per 10k iterations is 10 nanoseconds per cache miss. Part of the reason is that it fits into the CPU caches closer to the core and that the backing-store access costs much less (and may be vectorizable). There is no victim-slot search and no map lookup; without them, the compiler may vectorize the operations, and most cache misses benefit from that. 10 nanoseconds per miss is equivalent to 100 million cache-miss lookups per second against the backing-store (assuming the backing-store can deliver data that fast). In real-world use, the performance is nearly that of the backing-store itself: a local web service may take microseconds, reading a file 100 microseconds, a contended lock-guard a millisecond, copying 100MB across a PCIe bridge 10 milliseconds, and so on.

Cache-hit test (same code, with cache size = dataset size):

80869 nanoseconds
80969 nanoseconds
80617 nanoseconds
80226 nanoseconds
80709 nanoseconds
80264 nanoseconds
80806 nanoseconds
80550 nanoseconds
79972 nanoseconds
80372 nanoseconds

Only 2 nanoseconds per call separate this from the cache-miss latency; it behaves like a vectorized multiplexer over the backing-store input. Part of the slowdown comes from the benchmark's loop-counter increments and the loop-exit condition checks.

The 8 nanoseconds per cache.get(key) is for sequential iteration over the array at a 100% hit ratio; performance changes with the access pattern. With the CPU at 3.6GHz instead of 2.1 GHz:

51439 nanoseconds
51087 nanoseconds
51026 nanoseconds
50527 nanoseconds
50970 nanoseconds
50819 nanoseconds
50623 nanoseconds
50655 nanoseconds
50666 nanoseconds
50767 nanoseconds 

A ~50% increase in frequency gives about 60% more performance. On https://www.codechef.com/ide, the same code outputs this:

36424 nanoseconds
50099 nanoseconds
35613 nanoseconds
35614 nanoseconds
35615 nanoseconds
35729 nanoseconds
35604 nanoseconds
35637 nanoseconds
35671 nanoseconds
35605 nanoseconds 

3.5 nanoseconds per lookup is 285 million lookups per second, on a server that was probably compiling other clients' code and running it at the same time. A modern CPU at 5GHz with at least dual-channel memory (instead of the test machine's single channel) should add another 50% to 100%, likely exceeding 500 million lookups per second.

Multi-Level Cache

When two different types of cache are chained together (one cache becoming the backing-store of the other), the access patterns that are pathological for one cache are optimized away by the other.

For example, a direct-mapped cache fails on keys that are congruent modulo the cache size (3, 10003, 20003, ... for a 10k-slot cache), turning every access after the first into a cache miss. With an LRU as the second level, those failures are intercepted and cached efficiently.

Another example: an LRU cache fails on the strided access pattern key = k*5 when the cache is smaller than the dataset. This is efficiently (though partially) absorbed by the direct-mapped cache in front of it. Because the direct-mapped mapping is tag = key % size, a stride of 5 touches only 1/stride_length of the direct-mapped slots; the remaining slots are left untouched and keep their values for future cache hits. Keeping the first-level cache contents relatively stable can perform better than a single cache.

The low latency and vectorizability of the direct-mapped cache combined with the associativity (and hit ratio) of the LRU cache give the best of both worlds: smoother access latencies across a wide range of data sizes and access patterns.

For single-threaded applications, chaining the two caches into a read/write caching system is simple:

std::vector<int> test(40000); // simple backing-store simulation

LruClockCache<int,int> cacheL2(10000,[&](const int key){
        return test[key];
    },[&](const int key, const int value){
        test[key]=value;
    });

DirectMappedCache<int,int> cacheL1(1000,[&](const int key){
        return cacheL2.get(key);
    },[&](const int key, const int value){
        cacheL2.set(key,value);
    });

This setup is inherently coherent (for both reads and writes) for single-threaded use, without any synchronization. The only extra bookkeeping required is calling cacheL1.flush() followed by cacheL2.flush() before closing the application or disconnecting the backing-store: the L1 flush sends dirty data to L2, and the L2 flush sends dirty data to the backing-store.
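
For the setup above, that is:

// flush front-to-back so dirty entries cascade toward the backing-store
cacheL1.flush(); // dirty L1 entries are written into cacheL2
cacheL2.flush(); // dirty L2 entries are written into the backing-store (the test vector)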

After choosing suitable cache size ratios, different access patterns can be benchmarked. In general, random key access is bad for most cache types, except a random-eviction cache.

Random-key pattern

  • Random key generation breaks the pipelining of CPU cache accesses: average latency goes up
  • Random keys may contain duplicates: the hit ratio can go up by chance
  • The compiler cannot vectorize an unknown pattern: average latency goes up
  • Non-sequential accesses do not reuse the CPU's cache lines: average latency goes up
  • Random key generation has latency of its own: average latency goes up
  • Luckily, RAM is random access memory

Test code:

std::vector<int> test(40000);
for(int i=0;i<40000;i++)
    test[i]=sin(i*3.14f);

LruClockCache<int,int> cacheL2(10000,[&](const int key){ return test[key]; },
             [&](const int key, const int value){ test[key]=value; });
DirectMappedCache<int,int> cacheL1(1000,[&](const int key){ return cacheL2.get(key); },
            [&](const int key, const int value){ cacheL2.set(key,value); });

// heat cache
for(int i=0;i<10000;i++)
    cacheL1.get(i);
std::cout<<"-----"<<std::endl;
size_t s1=0;
float smooth = 1.2f;
for(float scale=1.0001f;scale<180.0f;scale*=smooth)
{
    if(scale>90.0f){    smooth=1.01f;}
    if(scale>105.0f){    smooth=1.2f;}
    const int n = scale * 100;
    const int repeats = 1000000/n;

    // prepare randomness
    std::random_device rd;
    std::mt19937 rng(rd());
    std::uniform_real_distribution<float> rnd(0,n);

    // benchmark offset timing
    if(scale<1.003f)
    {
        CpuBenchmarker bench(n*sizeof(int)*repeats,"rnd(rng)",n*repeats);
        for(int repeat=0;repeat<repeats;repeat++)
        for(int j=0;j<n;j++)
        {
            s1+=rnd(rng);
        }
    }

    // benchmark timing
    {
        CpuBenchmarker bench(n*sizeof(int)*repeats,"dataset N="+std::to_string(n),n*repeats);
        for(int repeat=0;repeat<repeats;repeat++)
        for(int j=0;j<n;j++)
        {
            s1+=cacheL1.get(rnd(rng));
        }
    }
}
std::cout<<"======"<<s1<<std::endl;

Output:

rnd(rng): 18200457 nanoseconds     (bandwidth = 219.77 MB/s)      
(throughput = 18.20 nanoseconds per iteration)
dataset N=100: 52300989 nanoseconds     (bandwidth = 76.48 MB/s)      
(throughput = 52.30 nanoseconds per iteration)
dataset N=120: 52082830 nanoseconds     (bandwidth = 76.80 MB/s)      
(throughput = 52.08 nanoseconds per iteration)
dataset N=144: 52181890 nanoseconds     (bandwidth = 76.65 MB/s)      
(throughput = 52.19 nanoseconds per iteration)
dataset N=172: 52853591 nanoseconds     (bandwidth = 75.67 MB/s)      
(throughput = 52.86 nanoseconds per iteration)
dataset N=207: 53201082 nanoseconds     (bandwidth = 75.17 MB/s)      
(throughput = 53.21 nanoseconds per iteration)
dataset N=248: 53568062 nanoseconds     (bandwidth = 74.67 MB/s)      
(throughput = 53.57 nanoseconds per iteration)
dataset N=298: 55443692 nanoseconds     (bandwidth = 72.13 MB/s)      
(throughput = 55.46 nanoseconds per iteration)
dataset N=358: 55607077 nanoseconds     (bandwidth = 71.93 MB/s)      
(throughput = 55.61 nanoseconds per iteration)
dataset N=430: 57725808 nanoseconds     (bandwidth = 69.28 MB/s)      
(throughput = 57.74 nanoseconds per iteration)
dataset N=516: 58905768 nanoseconds     (bandwidth = 67.87 MB/s)      
(throughput = 58.94 nanoseconds per iteration)
dataset N=619: 60888459 nanoseconds     (bandwidth = 65.67 MB/s)      
(throughput = 60.91 nanoseconds per iteration)
dataset N=743: 62053290 nanoseconds     (bandwidth = 64.42 MB/s)      
(throughput = 62.09 nanoseconds per iteration)
dataset N=891: 63346593 nanoseconds     (bandwidth = 63.13 MB/s)      
(throughput = 63.37 nanoseconds per iteration)
dataset N=1070: 64221898 nanoseconds     (bandwidth = 62.25 MB/s)      
(throughput = 64.26 nanoseconds per iteration)
dataset N=1284: 66012331 nanoseconds     (bandwidth = 60.53 MB/s)      
(throughput = 66.08 nanoseconds per iteration)
dataset N=1540: 67372349 nanoseconds     (bandwidth = 59.34 MB/s)      
(throughput = 67.41 nanoseconds per iteration)
dataset N=1849: 67558150 nanoseconds     (bandwidth = 59.12 MB/s)      
(throughput = 67.66 nanoseconds per iteration)
dataset N=2218: 71619432 nanoseconds     (bandwidth = 55.74 MB/s)      
(throughput = 71.76 nanoseconds per iteration)
dataset N=2662: 73942784 nanoseconds     (bandwidth = 54.00 MB/s)      
(throughput = 74.07 nanoseconds per iteration)
dataset N=3195: 76050286 nanoseconds     (bandwidth = 52.43 MB/s)      
(throughput = 76.29 nanoseconds per iteration)
dataset N=3834: 78384883 nanoseconds     (bandwidth = 50.87 MB/s)      
(throughput = 78.63 nanoseconds per iteration)
dataset N=4600: 81342232 nanoseconds     (bandwidth = 49.09 MB/s)      
(throughput = 81.49 nanoseconds per iteration)
dataset N=5521: 83421726 nanoseconds     (bandwidth = 47.92 MB/s)      
(throughput = 83.48 nanoseconds per iteration)
dataset N=6625: 85615888 nanoseconds     (bandwidth = 46.43 MB/s)      
(throughput = 86.15 nanoseconds per iteration)
dataset N=7950: 86101006 nanoseconds     (bandwidth = 46.17 MB/s)      
(throughput = 86.64 nanoseconds per iteration)
dataset N=9540: 88113967 nanoseconds     (bandwidth = 45.04 MB/s)      
(throughput = 88.81 nanoseconds per iteration)
dataset N=9635: 88369810 nanoseconds     (bandwidth = 44.92 MB/s)      
(throughput = 89.05 nanoseconds per iteration)
dataset N=9732: 88246496 nanoseconds     (bandwidth = 45.00 MB/s)      
(throughput = 88.90 nanoseconds per iteration)
dataset N=9829: 88797945 nanoseconds     (bandwidth = 44.72 MB/s)      
(throughput = 89.45 nanoseconds per iteration)
dataset N=9927: 88484586 nanoseconds     (bandwidth = 44.88 MB/s)      
(throughput = 89.14 nanoseconds per iteration)
dataset N=10027: 91343552 nanoseconds     (bandwidth = 43.47 MB/s)      
(throughput = 92.02 nanoseconds per iteration)
dataset N=10127: 96146837 nanoseconds     (bandwidth = 41.29 MB/s)      
(throughput = 96.88 nanoseconds per iteration)
dataset N=10228: 99899579 nanoseconds     (bandwidth = 39.72 MB/s)      
(throughput = 100.69 nanoseconds per iteration)
dataset N=10331: 102550351 nanoseconds     (bandwidth = 38.68 MB/s)      
(throughput = 103.40 nanoseconds per iteration)
dataset N=10434: 104217680 nanoseconds     (bandwidth = 38.04 MB/s)      
(throughput = 105.14 nanoseconds per iteration)
dataset N=10538: 106442779 nanoseconds     (bandwidth = 37.22 MB/s)      
(throughput = 107.46 nanoseconds per iteration)
dataset N=12646: 151418563 nanoseconds     (bandwidth = 26.39 MB/s)      
(throughput = 151.56 nanoseconds per iteration)
dataset N=15175: 184381479 nanoseconds     (bandwidth = 21.40 MB/s)      
(throughput = 186.93 nanoseconds per iteration)
======49931600

Against all those odds, it still manages about 34 milliseconds (52 ms benchmark minus 18 ms random-number offset), or 29M lookups per second, on a 2.1 GHz FX8150 (expect 100+ M/s on a newer CPU).

Quicksort access pattern

std::vector<int> test(40000);
for(int i=0;i<40000;i++)
    test[i]=sin(i*3.14f);

LruClockCache<int,int> cacheL2(100000,[&](const int key){ return test[key]; },
[&](const int key, const int value){ test[key]=value; });
DirectMappedCache<int,int> cacheL1(10000,[&](const int key)
{ return cacheL2.get(key); },[&](const int key, const int value){ cacheL2.set(key,value); });

// heat cache
for(int i=0;i<10000;i++)
    cacheL1.get(i);
std::cout<<"-----"<<std::endl;
size_t s1=0;
float smooth = 1.2f;
for(float scale=1.0001f;scale<180.0f;scale*=smooth)
{
    if(scale>90.0f){    smooth=1.01f;}
    if(scale>105.0f){    smooth=1.2f;}
    const int n = scale * 100;

    // benchmark timing
    {
        std::cout<<"--------------------------------------------------"<<std::endl;
        CpuBenchmarker bench(n*sizeof(int),"quicksort N="+std::to_string(n),1);
        ctrx=0;
        quickSort(cacheL1,0,n-1);
        std::cout<<"total get/set calls = "<<ctrx<<std::endl;
    }
}
std::cout<<"======"<<s1<<std::endl;

The quicksort code sorts in place, with the array accesses replaced by cache accesses, and a get/set counter variable was added to know the actual number of operations. The bandwidth figure is sorted elements per second, not accessed data, and the latency figure is the timing of the whole quickSort call, not per-item access. L1 size is 10k and L2 size is 100k elements.
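
The quickSort helper and the ctrx counter are not listed in this article. A plausible sketch that matches the call site above (every element read/write goes through cacheL1 and increments the counter) could look like this; the partitioning scheme is an assumption:

// Hypothetical sketch of the quickSort helper and ctrx counter used above
// (not listed in the article): a plain Lomuto-partition quicksort in which
// every element read/write goes through the cache and increments the counter.
size_t ctrx = 0; // reset to 0 and printed by the benchmark loop above

template<typename CacheType>
int partition(CacheType & cache, int low, int high)
{
	const int pivot = cache.get(high); ctrx++;
	int i = low - 1;
	for(int j = low; j < high; j++)
	{
		const int valueJ = cache.get(j); ctrx++;
		if(valueJ < pivot)
		{
			i++;
			const int valueI = cache.get(i); ctrx++;
			cache.set(i, valueJ);            ctrx++; // swap through the cache
			cache.set(j, valueI);            ctrx++;
		}
	}
	const int valueNext = cache.get(i + 1); ctrx++;
	cache.set(i + 1, pivot);                ctrx++;
	cache.set(high, valueNext);             ctrx++;
	return i + 1;
}

template<typename CacheType>
void quickSort(CacheType & cache, int low, int high)
{
	if(low < high)
	{
		const int p = partition(cache, low, high);
		quickSort(cache, low, p - 1);
		quickSort(cache, p + 1, high);
	}
}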

The output is as follows:

--------------------------------------------------
total get/set calls = 25245
quicksort N=100: 283371 nanoseconds     (bandwidth = 1.41 MB/s)      
(throughput = 283371.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 36295
quicksort N=120: 400108 nanoseconds     (bandwidth = 1.20 MB/s)      
(throughput = 400108.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 52195
quicksort N=144: 596758 nanoseconds     (bandwidth = 0.97 MB/s)      
(throughput = 596758.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 74385
quicksort N=172: 857578 nanoseconds     (bandwidth = 0.80 MB/s)      
(throughput = 857578.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 107635
quicksort N=207: 1289555 nanoseconds     (bandwidth = 0.64 MB/s)      
(throughput = 1289555.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 154375
quicksort N=248: 1851512 nanoseconds     (bandwidth = 0.54 MB/s)      
(throughput = 1851512.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 222750
quicksort N=298: 2732207 nanoseconds     (bandwidth = 0.44 MB/s)      
(throughput = 2732207.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 321300
quicksort N=358: 4087281 nanoseconds     (bandwidth = 0.35 MB/s)      
(throughput = 4087281.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 463320
quicksort N=430: 5672015 nanoseconds     (bandwidth = 0.30 MB/s)      
(throughput = 5672015.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 666925
quicksort N=516: 8550306 nanoseconds     (bandwidth = 0.24 MB/s)      
(throughput = 8550306.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 959445
quicksort N=619: 12136510 nanoseconds     (bandwidth = 0.20 MB/s)      
(throughput = 12136510.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 1381975
quicksort N=743: 17599695 nanoseconds     (bandwidth = 0.17 MB/s)      
(throughput = 17599695.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 1986925
quicksort N=891: 25069224 nanoseconds     (bandwidth = 0.14 MB/s)      
(throughput = 25069224.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 2864920
quicksort N=1070: 36701579 nanoseconds     (bandwidth = 0.12 MB/s)      
(throughput = 36701579.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 4124845
quicksort N=1284: 52731251 nanoseconds     (bandwidth = 0.10 MB/s)      
(throughput = 52731251.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 5932845
quicksort N=1540: 76673623 nanoseconds     (bandwidth = 0.08 MB/s)      
(throughput = 76673623.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 8551620
quicksort N=1849: 109418022 nanoseconds     (bandwidth = 0.07 MB/s)      
(throughput = 109418022.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 12304350
quicksort N=2218: 156312967 nanoseconds     (bandwidth = 0.06 MB/s)      
(throughput = 156312967.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 17722260
quicksort N=2662: 225647562 nanoseconds     (bandwidth = 0.05 MB/s)      
(throughput = 225647562.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 25528045
quicksort N=3195: 326446975 nanoseconds     (bandwidth = 0.04 MB/s)      
(throughput = 326446975.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 36758470
quicksort N=3834: 469934209 nanoseconds     (bandwidth = 0.03 MB/s)      
(throughput = 469934209.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 52911495
quicksort N=4600: 678869539 nanoseconds     (bandwidth = 0.03 MB/s)      
(throughput = 678869539.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 76217400
quicksort N=5521: 980819568 nanoseconds     (bandwidth = 0.02 MB/s)      
(throughput = 980819568.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 109743120
quicksort N=6625: 1416697916 nanoseconds     (bandwidth = 0.02 MB/s)      
(throughput = 1416697916.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 158026120
quicksort N=7950: 2010299830 nanoseconds     (bandwidth = 0.02 MB/s)      
(throughput = 2010299830.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 227552845
quicksort N=9540: 2897904620 nanoseconds     (bandwidth = 0.01 MB/s)      
(throughput = 2897904620.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 232107145
quicksort N=9635: 2969687754 nanoseconds     (bandwidth = 0.01 MB/s)      
(throughput = 2969687754.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 236803885
quicksort N=9732: 3062550046 nanoseconds     (bandwidth = 0.01 MB/s)      
(throughput = 3062550046.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 241547670
quicksort N=9829: 3092673884 nanoseconds     (bandwidth = 0.01 MB/s)      
(throughput = 3092673884.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 246388135
quicksort N=9927: 3158007191 nanoseconds     (bandwidth = 0.01 MB/s)      
(throughput = 3158007191.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 251376885
quicksort N=10027: 3260430385 nanoseconds     (bandwidth = 0.01 MB/s)      
(throughput = 3260430385.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 256415635
quicksort N=10127: 3319869940 nanoseconds     (bandwidth = 0.01 MB/s)      
(throughput = 3319869940.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 261555525
quicksort N=10228: 3344205937 nanoseconds     (bandwidth = 0.01 MB/s)      
(throughput = 3344205937.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 266849725
quicksort N=10331: 3420943265 nanoseconds     (bandwidth = 0.01 MB/s)      
(throughput = 3420943265.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 272196970
quicksort N=10434: 3493719216 nanoseconds     (bandwidth = 0.01 MB/s)      
(throughput = 3493719216.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 277649950
quicksort N=10538: 3574425004 nanoseconds     (bandwidth = 0.01 MB/s)      
(throughput = 3574425004.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 399834900
quicksort N=12646: 5157410859 nanoseconds     (bandwidth = 0.01 MB/s)      
(throughput = 5157410859.00 nanoseconds per iteration)
--------------------------------------------------
total get/set calls = 575739495
quicksort N=15175: 7395595194 nanoseconds     (bandwidth = 0.01 MB/s)      
(throughput = 7395595194.00 nanoseconds per iteration)
======0

For N=100, there were 25,245 get/set calls with a total latency of 283 microseconds: 11 nanoseconds per element access.

For N=15175, there were 575,739,495 get/set calls with a total latency of 7.4 seconds: 12.8 nanoseconds per element access. At 3.6GHz instead of 2.1GHz, this becomes 7.5 nanoseconds per element access, or 133 million get/set calls per second. Even with a 15k-element dataset it sorts about 10 kB of data per second (for 32-bit integer values). For larger value types such as longer strings, the memory efficiency is better and access latency matters less.

Strided sequential access pattern

In theory, this keeps thrashing the L1 but is absorbed by the L2, depending heavily on the stride length.

std::vector<int> test(4000000);
for(int i=0;i<4000000;i++)
    test[i]=sin(i*3.14f);

LruClockCache<int,int> cacheL2(10000,[&](const int key){ return test[key]; },
[&](const int key, const int value){ test[key]=value; });
DirectMappedCache<int,int> cacheL1(1000,[&](const int key){ return cacheL2.get(key);
},[&](const int key, const int value){ cacheL2.set(key,value); });

// heat cache
for(int i=0;i<4000000;i++)
    cacheL1.get(i);
std::cout<<"-----"<<std::endl;
size_t s1=0;
float smooth = 1.2f;
for(float scale=1.0001f;scale<180.0f;scale*=smooth)
{
    if(scale>90.0f){    smooth=1.01f;}
    if(scale>105.0f){    smooth=1.2f;}
    const int n = scale * 100;
    const int repeats = 1000;
    // benchmark timing
    {
        CpuBenchmarker bench(n*sizeof(int)*repeats,"N="+std::to_string(n),n*repeats);
        for(int repeat = 0; repeat<repeats; repeat++)
        {
            for(int i=0;i<n;i++)
            {
                s1+=cacheL1.get(i*100);
            }
        }
    }
}
std::cout<<"======"<<s1<<std::endl;

The output is as follows:

N=100: 3694579 nanoseconds     (bandwidth = 108.27 MB/s)      
(throughput = 36.95 nanoseconds per iteration)
N=120: 5410847 nanoseconds     (bandwidth = 88.71 MB/s)      
(throughput = 45.09 nanoseconds per iteration)
N=144: 7500038 nanoseconds     (bandwidth = 76.80 MB/s)      
(throughput = 52.08 nanoseconds per iteration)
N=172: 9356409 nanoseconds     (bandwidth = 73.53 MB/s)      
(throughput = 54.40 nanoseconds per iteration)
N=207: 11756553 nanoseconds     (bandwidth = 70.43 MB/s)      
(throughput = 56.79 nanoseconds per iteration)
N=248: 14402026 nanoseconds     (bandwidth = 68.88 MB/s)      
(throughput = 58.07 nanoseconds per iteration)
N=298: 16514732 nanoseconds     (bandwidth = 72.18 MB/s)      
(throughput = 55.42 nanoseconds per iteration)
N=358: 20141895 nanoseconds     (bandwidth = 71.10 MB/s)      
(throughput = 56.26 nanoseconds per iteration)
N=430: 24245251 nanoseconds     (bandwidth = 70.94 MB/s)      
(throughput = 56.38 nanoseconds per iteration)
N=516: 28535297 nanoseconds     (bandwidth = 72.33 MB/s)      
(throughput = 55.30 nanoseconds per iteration)
N=619: 34730097 nanoseconds     (bandwidth = 71.29 MB/s)      
(throughput = 56.11 nanoseconds per iteration)
N=743: 41695556 nanoseconds     (bandwidth = 71.28 MB/s)      
(throughput = 56.12 nanoseconds per iteration)
N=891: 49929799 nanoseconds     (bandwidth = 71.38 MB/s)      
(throughput = 56.04 nanoseconds per iteration)
N=1070: 59782292 nanoseconds     (bandwidth = 71.59 MB/s)      
(throughput = 55.87 nanoseconds per iteration)
N=1284: 70488677 nanoseconds     (bandwidth = 72.86 MB/s)      
(throughput = 54.90 nanoseconds per iteration)
N=1540: 83974809 nanoseconds     (bandwidth = 73.36 MB/s)      
(throughput = 54.53 nanoseconds per iteration)
N=1849: 101579922 nanoseconds     (bandwidth = 72.81 MB/s)      
(throughput = 54.94 nanoseconds per iteration)
N=2218: 119347699 nanoseconds     (bandwidth = 74.34 MB/s)      
(throughput = 53.81 nanoseconds per iteration)
N=2662: 141495512 nanoseconds     (bandwidth = 75.25 MB/s)      
(throughput = 53.15 nanoseconds per iteration)
N=3195: 166500722 nanoseconds     (bandwidth = 76.76 MB/s)      
(throughput = 52.11 nanoseconds per iteration)
N=3834: 195072477 nanoseconds     (bandwidth = 78.62 MB/s)      
(throughput = 50.88 nanoseconds per iteration)
N=4600: 229180314 nanoseconds     (bandwidth = 80.29 MB/s)      
(throughput = 49.82 nanoseconds per iteration)
N=5521: 270186192 nanoseconds     (bandwidth = 81.74 MB/s)      
(throughput = 48.94 nanoseconds per iteration)
N=6625: 315055195 nanoseconds     (bandwidth = 84.11 MB/s)      
(throughput = 47.56 nanoseconds per iteration)
N=7950: 361494623 nanoseconds     (bandwidth = 87.97 MB/s)      
(throughput = 45.47 nanoseconds per iteration)
N=9540: 413941004 nanoseconds     (bandwidth = 92.19 MB/s)      
(throughput = 43.39 nanoseconds per iteration)
N=9635: 414901046 nanoseconds     (bandwidth = 92.89 MB/s)      
(throughput = 43.06 nanoseconds per iteration)
N=9732: 414661443 nanoseconds     (bandwidth = 93.88 MB/s)      
(throughput = 42.61 nanoseconds per iteration)
N=9829: 432962408 nanoseconds     (bandwidth = 90.81 MB/s)      
(throughput = 44.05 nanoseconds per iteration)
N=9927: 414821653 nanoseconds     (bandwidth = 95.72 MB/s)      
(throughput = 41.79 nanoseconds per iteration)
N=10027: 1515980500 nanoseconds     (bandwidth = 26.46 MB/s)      
(throughput = 151.19 nanoseconds per iteration)
N=10127: 1551122918 nanoseconds     (bandwidth = 26.12 MB/s)      
(throughput = 153.17 nanoseconds per iteration)
N=10228: 1568851575 nanoseconds     (bandwidth = 26.08 MB/s)      
(throughput = 153.39 nanoseconds per iteration)
N=10331: 1602725754 nanoseconds     (bandwidth = 25.78 MB/s)      
(throughput = 155.14 nanoseconds per iteration)
N=10434: 1623017665 nanoseconds     (bandwidth = 25.72 MB/s)      
(throughput = 155.55 nanoseconds per iteration)
N=10538: 1626828752 nanoseconds     (bandwidth = 25.91 MB/s)      
(throughput = 154.38 nanoseconds per iteration)
N=12646: 2059143192 nanoseconds     (bandwidth = 24.57 MB/s)      
(throughput = 162.83 nanoseconds per iteration)
N=15175: 2672005898 nanoseconds     (bandwidth = 22.72 MB/s)      
(throughput = 176.08 nanoseconds per iteration)
======29000

As expected, latency starts high and stays relatively flat up to N=10000 (the size of the L2 cache). The slight drop in average latency before N=10000 is caused by the growing iteration count, which measures better than N=100 (1000 repeats are not enough to fully saturate the hardware).

Image processing pattern (Gaussian blur)

  • Tiled computation increases the L1/L2 hit ratio: latency goes down
  • The pattern is known at compile time: latency is reduced/hidden

int imageWidth, imageHeight, maxN;
double minR, maxR, minI, maxI;

imageWidth = 1024;
imageHeight = 1024;
maxN = 512;
minR = -1.5;
maxR = 0.7;
minI = -1.0;
maxI = 1.0;

size_t cacheScaling = 1;
for (int cacheScalingIter = 0; cacheScalingIter < 8; cacheScalingIter++)
{
cacheScaling *= 2;

const int L2size = 1024 * 128 * cacheScaling;
const int L1size = L2size / 4; // L1 size has to be integer power of 2 !!!

std::vector < int > backingStore(5000000);
auto L2 = LruClockCache< int, int >(L2size,
  [ & ](int key) {

    return backingStore[key];
  },
  [ & ](int key, int value) {

    backingStore[key] = value;
  });

auto L1 = DirectMappedCache< int, int >(L1size,
  [ & ](int key) {

    return L2.get(key);
  },
  [ & ](int key, int value) {

    L2.set(key,value);
  });

std::ofstream output("output_image_scaling" + std::to_string(cacheScaling) + ".ppm");
output << "P6" << std::endl;
output << imageWidth << " " << imageHeight << std::endl;
output << "255" << std::endl;

for (int i = 0; i < imageHeight; i++) {
  for (int j = 0; j < imageWidth; j++) {
    double cr = mapToReal(j, imageWidth, minR, maxR);
    double ci = mapToImaginary(i, imageHeight, minI, maxI);

    L1.set(i + j * imageWidth, findMandelbrot(cr, ci, maxN));
  }
}

// benchmark
{
  const int nGauss = 9;
  const int GaussianBlurKernel[nGauss] = {
    1,        2,        1,
    2,        4,        2,
    1,        2,        1
  };
  size_t nRepeats = 5;
  size_t nReadsPerIter = nGauss;
  size_t nWritesPerIter = 1;
  size_t totalLookups = nRepeats * (imageHeight - 2) *
  (imageWidth - 2) * (nReadsPerIter + nWritesPerIter);
  CpuBenchmarker bench(totalLookups * sizeof(int),
  "L1 tags = " + std::to_string(L1size) + " L2 tags=" +
   std::to_string(L2size)  + " performance", totalLookups);

  // image softening (may not be accurate)
  // column-major iteration better for the L1 but doesn't matter for L2
  // "nRepeats" iterations better for benchmarking
  // tiled-computing to make even better cache-hit ratio
  for (size_t k = 0; k < nRepeats; k++) {
    for (int tiley = 0; tiley < imageHeight/32; tiley++) {
      for (int tilex = 0; tilex < imageWidth/32; tilex++) {
        for (int jj = 0; jj < 32; jj++) {
          for (int ii = 0; ii < 32; ii++) {
            const int i = tilex * 32 + ii;
            const int j = tiley * 32 + jj;
            if (j >= 1 && j <= imageHeight - 2 && i >= 1 && i <= imageWidth - 2) {
              unsigned int pixelAccumulator1 = 0;
              unsigned int pixelAccumulator2 = 0;
              unsigned int pixelAccumulator3 = 0;

              pixelAccumulator1 += L1.get(i - 1 + (j - 1) * imageWidth) *
                                   GaussianBlurKernel[-1 + 1 + (-1 + 1) * 3];
              pixelAccumulator2 += L1.get(i + 0 + (j - 1) * imageWidth) *
                                   GaussianBlurKernel[+0 + 1 + (-1 + 1) * 3];
              pixelAccumulator3 += L1.get(i + 1 + (j - 1) * imageWidth) *
                                   GaussianBlurKernel[+1 + 1 + (-1 + 1) * 3];

              pixelAccumulator1 += L1.get(i - 1 + (j - 0) * imageWidth) *
                                   GaussianBlurKernel[-1 + 1 + (-0 + 1) * 3];
              pixelAccumulator2 += L1.get(i + 0 + (j - 0) * imageWidth) *
                                   GaussianBlurKernel[+0 + 1 + (-0 + 1) * 3];
              pixelAccumulator3 += L1.get(i + 1 + (j - 0) * imageWidth) *
                                   GaussianBlurKernel[+1 + 1 + (-0 + 1) * 3];

              pixelAccumulator1 += L1.get(i - 1 + (j + 1) * imageWidth) *
                                   GaussianBlurKernel[-1 + 1 + (+1 + 1) * 3];
              pixelAccumulator2 += L1.get(i + 0 + (j + 1) * imageWidth) *
                                   GaussianBlurKernel[+0 + 1 + (+1 + 1) * 3];
              pixelAccumulator3 += L1.get(i + 1 + (j + 1) * imageWidth) *
                                   GaussianBlurKernel[+1 + 1 + (+1 + 1) * 3];

              const int n = (pixelAccumulator1 + pixelAccumulator2 + pixelAccumulator3) >> 4;
              L1.set(i + j * imageWidth, n);
            }
          }
        }
      }
    }
  }
}
for (int i = 0; i < imageHeight; i++) {
  for (int j = 0; j < imageWidth; j++) {

    int n = L1.get(i + j * imageWidth);

    int r = ((int) sqrt(n) % 256);
    int gr = (2 * n % 256);
    int b = (n % 256);

    output << (char) r << (char) gr << (char) b;
  }
}
std::cout << "Finished!" << std::endl;

L1.flush();
output.flush();
}

The output is as follows.

For each cache-scaling step of the outer for loop, it produces the same image, but with a bigger cache than in the previous step.

Console output on the 3.6GHz FX8150:

L1 tags = 65536 L2 tags=262144 performance: 1051230942 nanoseconds
(bandwidth = 198.72 MB/s)      (throughput = 20.13 nanoseconds per iteration)
Finished!
L1 tags = 131072 L2 tags=524288 performance: 983512878 nanoseconds
(bandwidth = 212.40 MB/s)      (throughput = 18.83 nanoseconds per iteration)
Finished!
L1 tags = 262144 L2 tags=1048576 performance: 743200249 nanoseconds
(bandwidth = 281.08 MB/s)      (throughput = 14.23 nanoseconds per iteration)
Finished!
L1 tags = 524288 L2 tags=2097152 performance: 608610176 nanoseconds
(bandwidth = 343.24 MB/s)      (throughput = 11.65 nanoseconds per iteration)
Finished!
L1 tags = 1048576 L2 tags=4194304 performance: 97744550 nanoseconds
(bandwidth = 2137.17 MB/s)      (throughput = 1.87 nanoseconds per iteration)
Finished!
L1 tags = 2097152 L2 tags=8388608 performance: 148706457 nanoseconds
(bandwidth = 1404.76 MB/s)      (throughput = 2.85 nanoseconds per iteration)
Finished!
L1 tags = 4194304 L2 tags=16777216 performance: 97734293 nanoseconds
(bandwidth = 2137.40 MB/s)      (throughput = 1.87 nanoseconds per iteration)
Finished!
L1 tags = 8388608 L2 tags=33554432 performance: 148821672 nanoseconds
(bandwidth = 1403.67 MB/s)      (throughput = 2.85 nanoseconds per iteration)
Finished!

Effective reuse of pixels also makes effective use of the CPU's own L1 cache. The FX8150's L1 latency is 4 cycles; 4 cycles at 3.6GHz is roughly 1 nanosecond. With a 1M-slot direct-mapped cache, the measurement is 1.87 nanoseconds.

1.87 nanoseconds = 534 million lookups per second.

Readers may ask where the billion lookups per second from the title are. A fully functional multi-threaded read/write cache requires cache coherence across all L1 and L2 caches. For now, the implementation supports only single-level read/write multi-threading, or multi-level read-only multi-threading, through the CacheThreader.h class:

#ifndef CACHETHREADER_H_
#define CACHETHREADER_H_

#include<vector>
#include<memory>
#include<thread>
#include<atomic>
#include"DirectMappedCache.h"
#include"LruClockCache.h"
/* L1: direct mapped cache, for each thread
 * L2: LRU clock cache, for each thread (size must be integer-power of 2)
 * LLC: user-defined cache with thread-safe get/set methods that is slower but global
 * currently only 1 thread is supported
*/
template<template<typename,typename, typename> class Cache,
typename CacheKey, typename CacheValue, typename CacheInternalCounterTypeInteger=size_t>
class CacheThreader
{
private:
	// last level cache, slow because of lock-guard
	std::shared_ptr<Cache<CacheKey,CacheValue,CacheInternalCounterTypeInteger>> LLC;
	int nThreads;
	std::vector<std::shared_ptr<LruClockCache<CacheKey,CacheValue,
         CacheInternalCounterTypeInteger>>> L2;
	std::vector<std::shared_ptr<DirectMappedCache<CacheKey,CacheValue>>> L1;

public:
	CacheThreader(std::shared_ptr<Cache<CacheKey,CacheValue,
    CacheInternalCounterTypeInteger>> cacheLLC, int sizeCacheL1,
    int sizeCacheL2, int nThread=1)
	{
		LLC=cacheLLC;
		nThreads = nThread;
		// backing-store of L1 is LLC
		for(int i=0;i<nThread;i++)
		{
			L2.push_back( std::make_shared<LruClockCache<CacheKey,
            CacheValue,CacheInternalCounterTypeInteger>>(sizeCacheL2,[i,this](CacheKey key){

				return this->LLC->getThreadSafe(key);
			},[i,this](CacheKey key, CacheValue value){

				this->LLC->setThreadSafe(key,value);
			}));
			L1.push_back( std::make_shared<DirectMappedCache<CacheKey,CacheValue>>
                        (sizeCacheL1,[i,this](CacheKey key){

				return this->L2[i]->get(key);
			},[i,this](CacheKey key, CacheValue value){

				this->L2[i]->set(key,value);
			}));
		}
	}

	// get data from closest cache
	// currently only 1 thread supported
	const CacheValue get(CacheKey key) const
	{
		return L1[0]->get(key);
	}

	// set data to closest cache
	// currently only 1 thread supported
	void set(CacheKey key, CacheValue value) const
	{
		L1[0]->set(key,value);
	}

	// currently only 1 thread supported for read+write
	// only read-only usage for multi-threaded apps
	// must be called from all threads
	// does not flush LLC
	// LLC needs to be flushed manually by main-thread
	void flush()
	{
		L1[0]->flush();
		L2[0]->flush();
	}

	~CacheThreader(){  }
};

#endif /* CACHETHREADER_H_ */

How it works:

  • Every thread in the application can instantiate its own CacheThreader object.
  • The constructor takes a cache instance wrapped in a std::shared_ptr to be used as the LLC (last-level cache) shared by multiple threads. Since the LLC is shared, only the main thread should create it and hand it to the other threads. The LLC implementation must provide getThreadSafe and setThreadSafe methods.
  • The constructor puts an L2 LRU approximation (LruClockCache) in front of the LLC, private to its own thread.
  • The constructor puts another cache (DirectMappedCache) in front of L2 as the L1 cache.
  • Each thread works directly on its own L1 cache, and all evictions propagate down to the LLC.
  • Read-only access: only L1.get(key) calls are guaranteed to stay coherent.
  • Good for read-only databases, static world objects in games, and anything that never changes, as in the snippet below.
auto LLC = std::make_shared<LruClockCache< int, int >>(L2size*4,
              [ & ](int key) {

                return backingStore[key];
              },
              [ & ](int key, int value) {

                backingStore[key] = value;
});

auto L1L2LLC = CacheThreader<LruClockCache, int, int >(LLC,L1size,L2size);

The CacheThreader constructor does all the necessary binding between L1, L2, and the LLC and calls the getThreadSafe/setThreadSafe methods when needed (when an eviction reaches the LLC). Only the LLC needs to touch the backing-store, since it is the last level of the cache hierarchy. LruClockCache was chosen for the sample below to show the cost of locking (lock_guard) in C++: every thread locks the whole LruClockCache instance on each getThreadSafe call. This is not the case for DirectMappedMultiThreadCache, which locks at the cache-slot level and greatly reduces lock contention.
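
DirectMappedMultiThreadCache itself is not listed in this article; the idea it relies on, one lock per cache slot rather than one lock for the whole cache, can be sketched like this (an illustration of the read path only, not the actual class):

// Illustration of slot-level locking (NOT the actual DirectMappedMultiThreadCache):
// each tag owns its own mutex, so two threads contend only when their keys map to
// the same slot, instead of serializing on a single cache-wide lock_guard.
#include<vector>
#include<mutex>
#include<functional>

template<typename Key, typename Value>
class SlotLockedDirectMapped
{
public:
	SlotLockedDirectMapped(Key numElements /* power of 2 */,
				const std::function<Value(Key)> & readMiss)
		:sizeM1(numElements-1),loadData(readMiss),
		 keyBuffer(numElements),valueBuffer(numElements),locks(numElements) {}

	const Value getThreadSafe(const Key & key)
	{
		const Key tag = key & sizeM1;
		std::lock_guard<std::mutex> lg(locks[tag]); // lock only this slot
		if(keyBuffer[tag] != key)                   // miss: refill just this slot
		{
			valueBuffer[tag] = loadData(key);
			keyBuffer[tag] = key;
		}
		return valueBuffer[tag];
	}

private:
	const Key sizeM1;
	const std::function<Value(Key)> loadData;
	std::vector<Key> keyBuffer;
	std::vector<Value> valueBuffer;
	std::vector<std::mutex> locks;                  // one mutex per cache slot
};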

A Gaussian-blur read-only test (it cannot produce a correct output image, since pixel writes are not kept coherent) does the same work, but with 8 threads, CacheThreader, and a slightly different hot loop:

PARALLEL_FOR(0,8,loop,
{
          auto L1 = CacheThreader<LruClockCache, int, int >(LLC,L1size,L2size);
          for (size_t k = 0; k < nRepeats; k++) {
            for (int tiley = 0; tiley < imageHeight/32; tiley++) {
              for (int tilex = 0; tilex < imageWidth/32; tilex++) {
                for (int jj = 0; jj < 32; jj++) {
                  for (int ii = 0; ii < 32; ii++) {
                    const int i = tilex * 32 + ii;
                    const int j = tiley * 32 + jj;
                    if (j >= 1 && j <= imageHeight - 2 && i >= 1 && i <= imageWidth - 2) {
                      unsigned int pixelAccumulator1 = 0;
                      unsigned int pixelAccumulator2 = 0;
                      unsigned int pixelAccumulator3 = 0;

                      pixelAccumulator1 += L1.get(i - 1 + (j - 1) * imageWidth) *
                                           GaussianBlurKernel[-1 + 1 + (-1 + 1) * 3];
                      pixelAccumulator2 += L1.get(i + 0 + (j - 1) * imageWidth) *
                                           GaussianBlurKernel[+0 + 1 + (-1 + 1) * 3];
                      pixelAccumulator3 += L1.get(i + 1 + (j - 1) * imageWidth) *
                                           GaussianBlurKernel[+1 + 1 + (-1 + 1) * 3];

                      pixelAccumulator1 += L1.get(i - 1 + (j - 0) * imageWidth) *
                                           GaussianBlurKernel[-1 + 1 + (-0 + 1) * 3];
                      pixelAccumulator2 += L1.get(i + 0 + (j - 0) * imageWidth) *
                                           GaussianBlurKernel[+0 + 1 + (-0 + 1) * 3];
                      pixelAccumulator3 += L1.get(i + 1 + (j - 0) * imageWidth) *
                                           GaussianBlurKernel[+1 + 1 + (-0 + 1) * 3];

                      pixelAccumulator1 += L1.get(i - 1 + (j + 1) * imageWidth) *
                                           GaussianBlurKernel[-1 + 1 + (+1 + 1) * 3];
                      pixelAccumulator2 += L1.get(i + 0 + (j + 1) * imageWidth) *
                                           GaussianBlurKernel[+0 + 1 + (+1 + 1) * 3];
                      pixelAccumulator3 += L1.get(i + 1 + (j + 1) * imageWidth) *
                                           GaussianBlurKernel[+1 + 1 + (+1 + 1) * 3];

                      const int n = (pixelAccumulator1 +
                                     pixelAccumulator2 + pixelAccumulator3) >> 4;
                      L1.set(i + j * imageWidth, n);

                    }
                  }
                }
              }
            }
          }
});

Since there is no multi-level, multi-threaded cache coherence yet, only the read performance is measured rather than the output quality. The source code is below:

  int imageWidth, imageHeight, maxN;
  double minR, maxR, minI, maxI;

  imageWidth = 256;
  imageHeight = 256;
  maxN = 512;
  minR = -1.5;
  maxR = 0.7;
  minI = -1.0;
  maxI = 1.0;

  size_t cacheScaling = 1;
  for (int cacheScalingIter = 0; cacheScalingIter < 8; cacheScalingIter++)
  {
    cacheScaling *= 2;

    const int L2size = 1024 * 16 * cacheScaling;
    const int L1size = L2size / 4; // L1 size has to be integer power of 2 !!!

    std::vector < int > backingStore(5000000);

    auto LLC = std::make_shared<LruClockCache< int, int >>(L2size*4,
      [ & ](int key) {

        return backingStore[key];
      },
      [ & ](int key, int value) {

        backingStore[key] = value;
      });

    std::ofstream output("output_image_scaling" + std::to_string(cacheScaling) + ".ppm");
    output << "P6" << std::endl;
    output << imageWidth << " " << imageHeight << std::endl;
    output << "255" << std::endl;

    for (int i = 0; i < imageHeight; i++) {
      for (int j = 0; j < imageWidth; j++) {
        double cr = mapToReal(j, imageWidth, minR, maxR);
        double ci = mapToImaginary(i, imageHeight, minI, maxI);

        LLC->set(i + j * imageWidth, findMandelbrot(cr, ci, maxN));
      }
    }

    // benchmark
    {
      const int nGauss = 9;
      const int GaussianBlurKernel[nGauss] = {
        1,        2,        1,
        2,        4,        2,
        1,        2,        1
      };
      size_t nRepeats = 5 + (cacheScalingIter * 400);
      size_t nReadsPerIter = nGauss;
      size_t nWritesPerIter = 1;
      size_t totalLookups = nRepeats * (imageHeight - 2) *
                            (imageWidth - 2) * (nReadsPerIter + nWritesPerIter);
      CpuBenchmarker bench(totalLookups*8 * sizeof(int),
                          "L1 tags = " + std::to_string(L1size) + " L2 tags=" +
                           std::to_string(L2size)  + " performance", totalLookups*8);

      // image softening (may not be accurate)
      // column-major iteration better for the L1 but doesn't matter for L2
      // "nRepeats" iterations better for benchmarking
      // tiled-computing to make even better cache-hit ratio
      PARALLEL_FOR(0,8,loop,
      {
                // each thread creating&using only its own instance of CacheThreader
                auto L1 = CacheThreader<LruClockCache, int, int >(LLC,L1size,L2size);

                  for (size_t k = 0; k < nRepeats; k++) {
                    for (int tiley = 0; tiley < imageHeight/32; tiley++) {
                      for (int tilex = 0; tilex < imageWidth/32; tilex++) {
                        for (int jj = 0; jj < 32; jj++) {
                          for (int ii = 0; ii < 32; ii++) {
                            const int i = tilex * 32 + ii;
                            const int j = tiley * 32 + jj;
                            if (j >= 1 && j <= imageHeight - 2 && i >= 1 &&
                                i <= imageWidth - 2) {
                              unsigned int pixelAccumulator1 = 0;
                              unsigned int pixelAccumulator2 = 0;
                              unsigned int pixelAccumulator3 = 0;

                              pixelAccumulator1 += L1.get(i - 1 + (j - 1) * imageWidth) *
                              GaussianBlurKernel[-1 + 1 + (-1 + 1) * 3];
                              pixelAccumulator2 += L1.get(i + 0 + (j - 1) * imageWidth) *
                              GaussianBlurKernel[+0 + 1 + (-1 + 1) * 3];
                              pixelAccumulator3 += L1.get(i + 1 + (j - 1) * imageWidth) *
                              GaussianBlurKernel[+1 + 1 + (-1 + 1) * 3];

                              pixelAccumulator1 += L1.get(i - 1 + (j - 0) * imageWidth) *
                              GaussianBlurKernel[-1 + 1 + (-0 + 1) * 3];
                              pixelAccumulator2 += L1.get(i + 0 + (j - 0) * imageWidth) *
                              GaussianBlurKernel[+0 + 1 + (-0 + 1) * 3];
                              pixelAccumulator3 += L1.get(i + 1 + (j - 0) * imageWidth) *
                              GaussianBlurKernel[+1 + 1 + (-0 + 1) * 3];

                              pixelAccumulator1 += L1.get(i - 1 + (j + 1) * imageWidth) *
                              GaussianBlurKernel[-1 + 1 + (+1 + 1) * 3];
                              pixelAccumulator2 += L1.get(i + 0 + (j + 1) * imageWidth) *
                              GaussianBlurKernel[+0 + 1 + (+1 + 1) * 3];
                              pixelAccumulator3 += L1.get(i + 1 + (j + 1) * imageWidth) *
                              GaussianBlurKernel[+1 + 1 + (+1 + 1) * 3];

                              const int n = (pixelAccumulator1 + pixelAccumulator2 + 
                                             pixelAccumulator3) >> 4;
                              L1.set(i + j * imageWidth, n);

                            }
                          }
                        }
                      }
                    }
                  }
                  L1.flush();
      });
    }
    for (int i = 0; i < imageHeight; i++) {
      for (int j = 0; j < imageWidth; j++) {

        int n = LLC->get(i + j * imageWidth);

        int r = ((int) sqrt(n) % 256);
        int gr = (2 * n % 256);
        int b = (n % 256);

        output << (char) r << (char) gr << (char) b;
      }
    }
    std::cout << "Finished!" << std::endl;

    LLC->flush();
    output.flush();
  }

The console output is as follows:

L1 tags = 8192 L2 tags=32768 performance: 4519806245 nanoseconds
(bandwidth = 22.84 MB/s)      (throughput = 175.14 nanoseconds per iteration)
Finished!
L1 tags = 16384 L2 tags=65536 performance: 3086661771 nanoseconds
(bandwidth = 2708.84 MB/s)      (throughput = 1.48 nanoseconds per iteration)
Finished!
L1 tags = 32768 L2 tags=131072 performance: 6959029488 nanoseconds
(bandwidth = 2388.17 MB/s)      (throughput = 1.67 nanoseconds per iteration)
Finished!
L1 tags = 65536 L2 tags=262144 performance: 2885031311 nanoseconds
(bandwidth = 8622.91 MB/s)      (throughput = 0.46 nanoseconds per iteration)
Finished!
L1 tags = 131072 L2 tags=524288 performance: 3514979863 nanoseconds
(bandwidth = 9426.92 MB/s)      (throughput = 0.42 nanoseconds per iteration)
Finished!
L1 tags = 262144 L2 tags=1048576 performance: 4235796489 nanoseconds
(bandwidth = 9772.30 MB/s)      (throughput = 0.41 nanoseconds per iteration)
Finished!
L1 tags = 524288 L2 tags=2097152 performance: 4981730617 nanoseconds
(bandwidth = 9966.72 MB/s)      (throughput = 0.40 nanoseconds per iteration)
Finished!
L1 tags = 1048576 L2 tags=4194304 performance: 5783022399 nanoseconds
(bandwidth = 10013.72 MB/s)      (throughput = 0.40 nanoseconds per iteration)
Finished!

0.4 nanoseconds per iteration (2.5 billion lookups per second) with 8 threads means an average access latency of 3.2 nanoseconds per thread. When the L1/L2 caches are smaller than the data set (a 256x256 image), lock contention on the single entry point of the LLC pushes the 8-thread result to 175 nanoseconds per iteration, or 1400 nanoseconds per thread. This is the cost of synchronizing on just a single piece of data.

Running the same code, but with DirectMappedMultiThreadCache.h as the LLC (simply replacing the object type with DirectMappedMultiThreadCache) and with the cache sizes reduced by a factor of 4:
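A minimal sketch of that swap (an assumption of how it looks, not the exact benchmark source; only the LLC type changes while the miss lambdas stay the same, and the element count is just a placeholder for the reduced size):

auto LLC = std::make_shared<DirectMappedMultiThreadCache< int, int >>(L2size /* must be an integer power of 2 */,
              [ & ](int key) {
                return backingStore[key];
              },
              [ & ](int key, int value) {
                backingStore[key] = value;
});

// the CacheThreader template argument changes to match the LLC type
auto L1L2LLC = CacheThreader<DirectMappedMultiThreadCache, int, int >(LLC,L1size,L2size);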

L1 tags = 512 L2 tags=2048 performance: 254857624 nanoseconds
(bandwidth = 405.03 MB/s)      (throughput = 9.88 nanoseconds per iteration)
Finished!
L1 tags = 1024 L2 tags=4096 performance: 18364903435 nanoseconds
(bandwidth = 455.29 MB/s)      (throughput = 8.79 nanoseconds per iteration)
Finished!
L1 tags = 2048 L2 tags=8192 performance: 32524791168 nanoseconds
(bandwidth = 510.97 MB/s)      (throughput = 7.83 nanoseconds per iteration)
Finished!
L1 tags = 4096 L2 tags=16384 performance: 39102175517 nanoseconds
(bandwidth = 636.21 MB/s)      (throughput = 6.29 nanoseconds per iteration)
Finished!
L1 tags = 8192 L2 tags=32768 performance: 53978929784 nanoseconds
(bandwidth = 613.86 MB/s)      (throughput = 6.52 nanoseconds per iteration)
Finished!
L1 tags = 16384 L2 tags=65536 performance: 17846742945 nanoseconds
(bandwidth = 2319.38 MB/s)      (throughput = 1.72 nanoseconds per iteration)
Finished!
L1 tags = 32768 L2 tags=131072 performance: 28040567180 nanoseconds
(bandwidth = 1770.70 MB/s)      (throughput = 2.26 nanoseconds per iteration)
Finished!
L1 tags = 65536 L2 tags=262144 performance: 5105202102 nanoseconds
(bandwidth = 11343.25 MB/s)      (throughput = 0.35 nanoseconds per iteration)
Finished!

Spreading the locking over the whole array of cache tags (one lock per tag) makes cache access much faster even when L1 and L2 are very small relative to the data set (256x256 = 65536 pixels); the last iteration reaches **2.8 billion lookups per second**. But is a direct-mapped cache good enough as an LLC in terms of cache-hit ratio? To test this, it is enough to add two atomic counters to the cache-miss functions and compare them against the total number of reads.
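A minimal sketch of that measurement (the counter names, LLCsize and totalReads are placeholders, not the article's exact benchmark source):

#include <atomic>

std::atomic<size_t> readMissCount{0};
std::atomic<size_t> writeMissCount{0};

auto LLC = std::make_shared<DirectMappedMultiThreadCache<int,int>>(LLCsize,
    [&](int key) {
        readMissCount++;                  // counts only reads that fall through to the backing store
        return backingStore[key];
    },
    [&](int key, int value) {
        writeMissCount++;                 // counts only writes/evictions that reach the backing store
        backingStore[key] = value;
    });

// after the benchmark, totalReads is known analytically from the loop bounds
// (nRepeats * (imageHeight-2) * (imageWidth-2) * nReadsPerIter, times 8 threads)
std::cout << "read hit ratio="
          << 100.0 * (1.0 - (double)readMissCount.load() / (double)totalReads) << "%" << std::endl;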

read hit ratio=98.5571%
L1 tags = 512 L2 tags=2048 performance: 802570239 nanoseconds
(bandwidth = 128.62 MB/s)      (throughput = 31.10 nanoseconds per iteration)
Finished!
read hit ratio=98.59%
L1 tags = 1024 L2 tags=4096 performance: 66764400442 nanoseconds
(bandwidth = 125.24 MB/s)      (throughput = 31.94 nanoseconds per iteration)
Finished!
...

With the same Gaussian blur algorithm, the LRU approximation improves the hit ratio by only 0.03% while adding roughly 130 nanoseconds of latency per access. Whether this is a net performance gain depends on the latency of the backing store. With a very slow backing store, the LRU can tolerate its higher access latency, while the direct-mapped cache effectively waits on the backing store longer than the LRU because of the extra misses. It also depends on the access pattern: for some patterns, a 1% improvement in cache-hit ratio translates to a 100% performance increase. But some of the anti-patterns of the direct-mapped cache are filtered out by the LRU sitting in front of it, so the LLC role becomes easier for it.

The multi-threaded direct-mapped cache is compatible with LruClockCache, can be used as the LLC of a CacheThreader, and can also be used on its own in multi-threaded applications whose backing store is fast enough to tolerate the lower hit ratio (source code):

#ifndef DIRECTMAPPEDMULTITHREADCACHE_H_
#define DIRECTMAPPEDMULTITHREADCACHE_H_

#include<vector>
#include<functional>
#include<mutex>
#include<iostream> // used by flush() error reporting

/* Direct-mapped cache implementation with granular locking (per-tag)
 * Only usable for integer type keys and intended to be used as
 *      LLC(last level cache) for CacheThreader instances
 * 		to optimize contentions out in multithreaded read-only scenarios
 *	also can be used alone, as a read+write multi-threaded cache
 *  using getThreadSafe setThreadSafe methods but cache-hit ratio will not be good
 * CacheKey: type of key (only integers: int, char, size_t)
 * CacheValue: type of value that is bound to key (same as above)
 * InternalKeyTypeInteger: type of tag found after the modulo operation
 * (is important for maximum cache size. unsigned char = 255, unsigned int=1024*1024*1024*4)
 */
template<	typename CacheKey, typename CacheValue, typename InternalKeyTypeInteger=size_t>
class DirectMappedMultiThreadCache
{
public:
	// allocates buffers for numElements number of cache slots/lanes
	// readMiss: 	cache-miss for read operations. User needs to give this function
	// 				to let the cache automatically get data from backing-store
	//				example: [&](MyClass key){ return redis.get(key); }
	//				takes a CacheKey as key, returns CacheValue as value
	// writeMiss: 	cache-miss for write operations. User needs to give this function
	// 				to let the cache automatically set data to backing-store
	//				example: [&](MyClass key, MyAnotherClass value){ redis.set(key,value); }
	//				takes a CacheKey as key and CacheValue as value
	// numElements: has to be integer-power of 2 (e.g. 2,4,8,16,...)
	DirectMappedMultiThreadCache(InternalKeyTypeInteger numElements,
				const std::function<CacheValue(CacheKey)> & readMiss,
				const std::function<void(CacheKey,CacheValue)> &
                writeMiss):size(numElements),sizeM1(numElements-1),
                loadData(readMiss),saveData(writeMiss)
	{
		mut = std::vector<std::mutex>(numElements);
		// initialize buffers
		for(InternalKeyTypeInteger i=0;i<numElements;i++)
		{
			valueBuffer.push_back(CacheValue());
			isEditedBuffer.push_back(0);
			keyBuffer.push_back(CacheKey());
		}
	}

	// get element from cache
	// if cache doesn't find it in buffers,
	// then cache gets data from backing-store
	// then returns the result to user
	// then cache is available from RAM on next get/set access with same key
	inline
	const CacheValue get(const CacheKey & key)  noexcept
	{
		return accessDirect(key,nullptr);
	}

	// only syntactic difference
	inline
	const std::vector<CacheValue> getMultiple(const std::vector<CacheKey> & key)  noexcept
	{
		const int n = key.size();
		std::vector<CacheValue> result(n);

		for(int i=0;i<n;i++)
		{
			result[i]=accessDirect(key[i],nullptr);
		}
		return result;
	}

	// thread-safe but slower version of get()
	inline
	const CacheValue getThreadSafe(const CacheKey & key)  noexcept
	{
		return accessDirectLocked(key,nullptr);
	}

	// set element to cache
	// if cache doesn't find it in buffers,
	// then cache sets data on just cache
	// writing to backing-store only happens when
	// 					another access evicts the cache slot containing this key/value
	//					or when cache is flushed by flush() method
	// then returns the given value back
	// then cache is available from RAM on next get/set access with same key
	inline
	void set(const CacheKey & key, const CacheValue & val) noexcept
	{
		accessDirect(key,&val,1);
	}

	// thread-safe but slower version of set()
	inline
	void setThreadSafe(const CacheKey & key, const CacheValue & val)  noexcept
	{
		accessDirectLocked(key,&val,1);
	}

	// use this before closing the backing-store to store the latest bits of data
	void flush()
	{
		try
		{
		for (CacheKey i=0;i<size;i++)
		{
		  if (isEditedBuffer[i] == 1)
		  {
				isEditedBuffer[i]=0;
				auto oldKey = keyBuffer[i];
				auto oldValue = valueBuffer[i];
				saveData(oldKey,oldValue);
		  }
		}
		}catch(std::exception &ex){ std::cout<<ex.what()<<std::endl; }
	}

	// thread-safe direct-mapped access (locks only the single tag mapped to the key,
	// so different tags can be served by different threads in parallel)
	// opType=0: get
	// opType=1: set
	CacheValue const accessDirectLocked(const CacheKey & key,
               const CacheValue * value, const bool opType = 0)
	{
		// find tag mapped to the key
		InternalKeyTypeInteger tag = key & sizeM1;
		std::lock_guard<std::mutex> lg(mut[tag]); // N parallel locks in-flight =
                                                  // less contention in multi-threading

		// compare keys
		if(keyBuffer[tag] == key)
		{
			// cache-hit

			// "set"
			if(opType == 1)
			{
				isEditedBuffer[tag]=1;
				valueBuffer[tag]=*value;
			}

			// cache hit value
			return valueBuffer[tag];
		}
		else // cache-miss
		{
			CacheValue oldValue = valueBuffer[tag];
			CacheKey oldKey = keyBuffer[tag];

			// eviction algorithm start
			if(isEditedBuffer[tag] == 1)
			{
				// if it is "get"
				if(opType==0)
				{
					isEditedBuffer[tag]=0;
				}

				saveData(oldKey,oldValue);

				// "get"
				if(opType==0)
				{
					const CacheValue && loadedData = loadData(key);
					valueBuffer[tag]=loadedData;
					keyBuffer[tag]=key;
					return loadedData;
				}
				else /* "set" */
				{
					valueBuffer[tag]=*value;
					keyBuffer[tag]=key;
					return *value;
				}
			}
			else // not edited
			{
				// "set"
				if(opType == 1)
				{
					isEditedBuffer[tag]=1;
				}

				// "get"
				if(opType == 0)
				{
					const CacheValue && loadedData = loadData(key);
					valueBuffer[tag]=loadedData;
					keyBuffer[tag]=key;
					return loadedData;
				}
				else // "set"
				{
					valueBuffer[tag]=*value;
					keyBuffer[tag]=key;
					return *value;
				}
			}
		}
	}

	CacheValue const accessDirect(const CacheKey & key,const CacheValue * value,
                                  const bool opType = 0)
	{
		// find tag mapped to the key
		InternalKeyTypeInteger tag = key & sizeM1;

		// compare keys
		if(keyBuffer[tag] == key)
		{
			// cache-hit

			// "set"
			if(opType == 1)
			{
				isEditedBuffer[tag]=1;
				valueBuffer[tag]=*value;
			}

			// cache hit value
			return valueBuffer[tag];
		}
		else // cache-miss
		{
			CacheValue oldValue = valueBuffer[tag];
			CacheKey oldKey = keyBuffer[tag];

			// eviction algorithm start
			if(isEditedBuffer[tag] == 1)
			{
				// if it is "get"
				if(opType==0)
				{
					isEditedBuffer[tag]=0;
				}

				saveData(oldKey,oldValue);

				// "get"
				if(opType==0)
				{
					const CacheValue && loadedData = loadData(key);
					valueBuffer[tag]=loadedData;
					keyBuffer[tag]=key;
					return loadedData;
				}
				else /* "set" */
				{
					valueBuffer[tag]=*value;
					keyBuffer[tag]=key;
					return *value;
				}
			}
			else // not edited
			{
				// "set"
				if(opType == 1)
				{
					isEditedBuffer[tag]=1;
				}

				// "get"
				if(opType == 0)
				{
					const CacheValue && loadedData = loadData(key);
					valueBuffer[tag]=loadedData;
					keyBuffer[tag]=key;
					return loadedData;
				}
				else // "set"
				{
					valueBuffer[tag]=*value;
					keyBuffer[tag]=key;
					return *value;
				}
			}
		}
	}

private:
	const CacheKey size;
	const CacheKey sizeM1;
	std::vector<std::mutex> mut;

	std::vector<CacheValue> valueBuffer;
	std::vector<unsigned char> isEditedBuffer;
	std::vector<CacheKey> keyBuffer;
	const std::function<CacheValue(CacheKey)>  loadData;
	const std::function<void(CacheKey,CacheValue)>  saveData;
};

#endif /* DIRECTMAPPEDMULTITHREADCACHE_H_ */

N-Way Set-Associative Multi-Threaded Cache

An N-way set-associative cache is a compromise between a direct-mapped cache and a fully associative (LRU) cache. It has the multi-threading scalability of a direct-mapped cache and comes close to the cache-hit-ratio characteristics of an LRU cache (source). When used on its own, it is **cache-coherent**: reads and writes are visible across all accesses made through the getThreadSafe/setThreadSafe methods (for a single thread, plain get/set is enough).

#ifndef NWAYSETASSOCIATIVEMULTITHREADCACHE_H_
#define NWAYSETASSOCIATIVEMULTITHREADCACHE_H_

#include<vector>
#include<memory>
#include<functional>
#include"LruClockCache.h"

/* N parallel LRU approximations (Clock Second Chance)
* Each with own mutex
* cache-coherent writes+reads as long as user-given cache-miss functions 
* handle the synchronization on the backing store
* 				each key is guarded by its own mutex guard so it shouldn't be a problem 
*               if backing-store can do parallel operations on different keys
* 				if you need also the backing-store be thread-safe, then put a 
*               lock guard in the miss-functions
* numberOfSets = number of LRUs in parallel (has to be power of 2: 2,4,8,...16k,
* 32k,64k,....1M,2M,....)
* numberOfTagsPerLRU = number of cache items per set (LRU Clock cache)
* 			total size of cache is (numberOfSets * numberOfTagsPerLRU) elements
* CacheHandInteger: just an optional optimization to reduce memory consumption 
* when cache size is equal to or less than 255,65535,4B-1,...
*/

template<typename CacheKey, typename CacheValue, typename CacheHandInteger=size_t>
class NWaySetAssociativeMultiThreadCache
{
public:
	NWaySetAssociativeMultiThreadCache(size_t numberOfSets, size_t numberOfTagsPerLRU,
			const std::function<CacheValue(CacheKey)> & readMiss,
			const std::function<void(CacheKey,CacheValue)> & 
            writeMiss):numSet(numberOfSets),numSetM1(numberOfSets-1),
            numTag(numberOfTagsPerLRU)
	{
		for(CacheHandInteger i=0;i<numSet;i++)
		{
			sets.push_back(std::make_shared<LruClockCache<CacheKey,
                 CacheValue,CacheHandInteger>>(numTag,readMiss,writeMiss));
		}
	}

	const CacheValue get(CacheKey key) const noexcept
	{
		// select set
		CacheHandInteger set = key & numSetM1;
		return sets[set]->get(key);
	}

	void set(CacheKey key, CacheValue value) const noexcept
	{
		// select set
		CacheHandInteger set = key & numSetM1;
		sets[set]->set(key,value);
	}

	const CacheValue getThreadSafe(CacheKey key) const noexcept
	{
		// select set
		CacheHandInteger set = key & numSetM1;
		return sets[set]->getThreadSafe(key);
	}

	void setThreadSafe(CacheKey key, CacheValue value) const noexcept
	{
		// select set
		CacheHandInteger set = key & numSetM1;
		sets[set]->setThreadSafe(key,value);
	}

	void flush()
	{
		for(CacheHandInteger i=0;i<numSet;i++)
		{
			sets[i]->flush();
		}
	}

private:
	const CacheHandInteger numSet;
	const CacheHandInteger numSetM1;
	const CacheHandInteger numTag;
	std::vector<std::shared_ptr<LruClockCache<CacheKey,CacheValue,CacheHandInteger>>> sets;
};

#endif /* NWAYSETASSOCIATIVEMULTITHREADCACHE_H_ */

Because it consists of multiple LRU-approximation instances, its random-access and sequential-access performance are the same (unlike the 5x sequential-access advantage of DirectMappedMultiThreadCache), and on a 3.6GHz FX8150 it delivers more than 70 million cache-hit lookups per second while keeping a better cache-hit ratio than the direct-mapped version. For a Minecraft-like world where players create and destroy blocks in 3D space, this would be enough to sustain 60 frames per second with a visible world size of 200x20x200 blocks.
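A minimal usage sketch (the sizes and backing store here are illustrative, not taken from the article's benchmarks):

std::vector<int> backingStore(1000000);

// 4 sets (power of 2), 256 tags per set => 1024 cached elements in total
NWaySetAssociativeMultiThreadCache<int,int> cache(4, 256,
    [&](int key)            { return backingStore[key]; },    // read-miss
    [&](int key, int value) { backingStore[key] = value; });  // write-miss

#pragma omp parallel for
for (int i = 0; i < 10000; i++)
{
    cache.setThreadSafe(i, i * 2);
    int v = cache.getThreadSafe(i);  // visible to all threads going through the *ThreadSafe methods
}

cache.flush();  // write edited values back to the backing store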

Achieving Multi-Threaded Cache Coherence on a Two-Level Cache

To get correct data ordering and visibility between threads, either caches need to be invalidated or there must be a single entry point per key / per cache level. The hardware world may have efficient cache-to-cache pipelines and parallel single-cycle operations to maintain coherence, but in software there is no shortcut between threads. Data has to be made explicit to one thread somehow, then computed, then made available to another thread. This is usually a bottleneck when multiple threads try to enter the same critical section of code: in the worst case the accesses are effectively serialized and may even be slower than running the same code on a single thread. To overcome this, the accesses can be serialized at a different level of communication, where each thread's exclusive access is spread over a region wider than a single item. The producer-consumer pattern partially solves this.

In the **producer-consumer** problem, one or more threads produce data and push it into a queue, and one or more consumers pop data from the queue and process it. To get inherently active cache coherence, only a single consumer thread is given access to the cache (or multi-level cache) and receives serialized get/set streams from multiple producer threads. The serializing synchronization happens per thread rather than per item, which removes per-key locking contention on the client side.

Compared to hardware cache-invalidation techniques, implementing single-consumer/multi-producer cache coherence is relatively easy. The diagram below illustrates how the source code of AsyncCache works.

When a thread calls the getAsync/setAsync methods of AsyncCache, it appends the get/set command to the dedicated queue of some slot while locking the selected slot. Then, again under the lock of the selected slot, the dedicated queue is swapped with the internal queue of the consumer thread. The consumer thread then reads directly from its own queue without any locking. This effectively reduces lock contention in proportion to the number of commands each thread pushes before the consumer visits that producer thread's private queue. If a thread pushes 100 command items into its own queue, the producer locks/unlocks at most 100 times, possibly without ever contending with the consumer thread. The consumer then locks/unlocks the same slot at most twice: once to take the queue and once to respond to a barrier call (so that the waiting producer thread can continue). The more keys each thread requests, the closer the performance gets to that of a lock-free, wait-free single-threaded cache, at roughly 75 million lookups per second.
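Below is a minimal, generic sketch of this queue-swapping idea (it is not the article's AsyncCache source; the Command layout and names are illustrative):

#include <vector>
#include <mutex>

struct Command { int key; int value; bool isSet; int* out; };

struct Slot
{
    std::mutex m;
    std::vector<Command> queue;   // producer-side private queue
};

// producer side: push a command under a short per-slot lock
void pushCommand(Slot& s, const Command& c)
{
    std::lock_guard<std::mutex> g(s.m);
    s.queue.push_back(c);
}

// consumer side: swap the whole queue out under the same lock,
// then process the commands without holding any lock
template<typename Cache>
void consumeSlot(Slot& s, Cache& cache)
{
    std::vector<Command> local;
    {
        std::lock_guard<std::mutex> g(s.m);
        local.swap(s.queue);
    }
    for (const Command& c : local)
    {
        if (c.isSet)     cache.set(c.key, c.value);
        else if (c.out)  *c.out = cache.get(c.key);
    }
}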

In a coherent cache built on this efficient producer-consumer model, **fairness** between producer threads is maintained at the granularity of **chunks** of commands. Ideally all threads would have perfectly interleaved queue entries, but for performance the commands are received in chunks. This makes direct inter-thread communication inefficient for popular patterns such as:

  • setAsync(100,..) from thread-1
  • getAsync(100,..) from thread-2, inside a spin-wait loop

Here, thread-1's setAsync command and thread-2's getAsync command may land in the same chunk cycle of the consumer thread, but if their chunks happen to be processed in the opposite order, thread-2 is always handled before thread-1. This makes the repeated use case inefficient, because thread-2 is forced to check the value at least twice. This is one of many latency contributions that come with a "throughput" optimization (multi-threading):

  • The data enters thread-1's private queue with a **lock** on thread-1's slot.
  • The data is taken by the consumer thread with an extra **lock** on the same slot.
  • The processed data travels through the cache with the **latency of the cache itself**.
  • The result is stored on the variable pointer given by thread-1, or in the cache/backing store.
  • The consumer thread signals, with another **lock** on the slot, that the slot is now barrier-free (read/write completed).
  • (Optionally, all the other commands issued into the queues by the N-2 other threads.)
  • Asynchronously to thread-1: thread-2 **locks** its own slot and passes in its own get command.
  • The data is taken by the consumer thread again, after **locking** that slot once.
  • The processed data travels through the cache with the **latency of the cache itself**.
  • The consumer thread signals the barrier with another **lock**.
  • Finally, thread-2 gets the result after a lock on its own slot.

That is at least 7 mutex locks (lock_guard, specifically) and 2 cache accesses (with latency depending on locality of reference). Accessing only a single element at a time this way would cost at least a few microseconds. But in a massively parallel workload, with many items in flight at once, this happens tens or hundreds of times concurrently, and the resulting average latency (inverse throughput, not real latency) is 12-13 nanoseconds per get/set call.

If the access pattern does not require chained set->get operations, the real latency per get or per set is half of the list above. For every 1000 get/set requests and one barrier call issued by a producer thread, there are 1000 locks with at most 2 contentions per thread, and only 2 lock operations by the consumer thread, again with at most 2 contentions. A CPU with 8 threads can then hold up to 8 simultaneous locks. This keeps per-thread lock-contention performance similar to the single-threaded case, while the consumer thread becomes the bottleneck because it has to do the bookkeeping: swapping queues, command-selection logic, and 1-2 locks per chunk.

Another way to maintain coherence is sharding. With **sharding**, locking is done per key or per set rather than per slot/thread.

(image: key-based lock contention between get(4) calls coming from two threads)

Per-key locking causes very high lock contention when all threads access the same key. On the other hand, access patterns spread out across threads parallelize fully, and the real latency per key access is much better than in the producer-consumer version. A 128-core CPU can reach 128-way parallelism in the best case, possibly far beyond the producer-consumer model, but it is also terrible when all 128 threads hit the same key. The advantage of sharding is its composability: a sharded cache still works correctly as part of a multi-level cache even with another sharded cache on top of it. The source code of MultiLevelCache implements sharding in a multi-level cache: the direct-mapped cache has sharded key access, the N-way set-associative cache has sharded set access, and the two are connected in a simple, thread-safe way. Any two threads accessing the same key (or the same set in L1, on a cache miss) cause lock contention, but the scheme is easy to read, easy to implement, and scales across more CPU cores. This style of coherence only requires avoiding concurrent accesses to the same key from multiple threads.

Usage of the MultiLevelCache Class

// simulating a backing-store
std::vector<int> data(1024*1024);

// construction similar to other cache types
MultiLevelCache<int,int> cache(L1tags,L2sets,L2tagsPerSet,
    [&](int key){ return data[key]; },
    [&](int key, int value){ data[key]=value;}
);

#pragma omp parallel for
for(int i=0;i<1000;i++)
    cache.setThreadSafe(i,0);

cache.flush();

The for loop in this example has up to 1000-way parallelism, given enough CPU cores and L1 tags, or 8-way parallelism on an FX8150 CPU.

Usage of the AsyncCache Class

std::vector<int> data(1000000);

// has a consumer thread constantly swapping vectors & checking commands
AsyncCache<int,int> cache(L1tags,L2sets,L2tagsPerSet,
    [&](int key){ return data[key]; },
    [&](int key, int value){ data[key]=value; }
);

int val;

// setAsync/getAsync returns selected slot
// method call returns immediately
int slot = cache.setAsync(5,100);

// slot selection can be made manually with integer id
// such as omp_get_thread_num()
cache.getAsync(5,&val,slot);

// ... some work can be done here while cache thread is working ...
computeStuff();

// data is not serialized yet, outputs garbage garbage
std::cout<<data[5]<<" "<<val<<std::endl;

// data is synchronized for selected slot
cache.barrier(slot);

// outputs garbage 100
std::cout<<data[5]<<" "<<val<<std::endl;

// data is saved on backing-store
cache.flush();

// outputs 100 100
std::cout<<data[5]<<" "<<val<<std::endl;

// cache object is out of scope, destructed and its internal resources are freed

2D/3D Direct-Mapped Cache

Tag collisions make it hard for a plain direct-mapped cache to achieve a good cache-hit ratio under many access patterns, but it can be optimized for specific scenarios. In image processing and 2D matrix-matrix multiplication, iterating over one dimension causes tag collisions (and therefore cache misses) much more easily than iterating over the other. Having a 2D/3D tag array and mapping each dimension independently preserves part of the cache contents while processing tiles of an image; for matrices, this corresponds to multiplying sub-matrices.

In the source code of the multi-dimensional direct-mapped cache, only the mapping computation differs:

CacheKey tagX = keyX & sizeXM1;
CacheKey tagY = keyY & sizeYM1;

It is the same computation, but done in two dimensions for a 2D tag array. Its hit-ratio characteristics were tested visually in the video of a Minecraft-like game engine, which translates into the higher performance shown by the benchmarks of the terrain-generation engine.

When the access pattern is one-dimensional, a multi-dimensional direct-mapped cache performs worse than a plain direct-mapped cache, but under tiled access patterns it performs orders of magnitude better when the indexed backing store is much larger than the cache. A 100000x100000 matrix processed in 16x16 tiles causes 100% cache misses on a direct-mapped cache of size 100000; a 16x16 2D direct-mapped cache (only 256 slots) raises the cache-hit ratio to at least 50%, or more depending on how often elements are reused.
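A minimal sketch of the 2D mapping for a read-only use case (class name and layout are assumed for illustration; the article's actual class also handles writes):

#include <vector>

template<typename V>
struct DirectMapped2DSketch
{
    int sizeX, sizeY;              // both must be integer powers of 2
    int sizeXM1, sizeYM1;          // sizeX-1 and sizeY-1, used as bit masks
    std::vector<int> keyX, keyY;   // per-slot keys, one per dimension
    std::vector<V>   value;

    DirectMapped2DSketch(int sx, int sy)
        : sizeX(sx), sizeY(sy), sizeXM1(sx - 1), sizeYM1(sy - 1),
          keyX(sx * sy, -1), keyY(sx * sy, -1), value(sx * sy) {}

    // loadMiss: backing-store read, e.g. [&](int x, int y){ return matrix[y][x]; }
    template<typename F>
    V get(int x, int y, F loadMiss)
    {
        const int tagX = x & sizeXM1;          // independent mapping per dimension
        const int tagY = y & sizeYM1;
        const int idx  = tagX + tagY * sizeX;
        if (keyX[idx] == x && keyY[idx] == y)
            return value[idx];                 // 2D cache-hit
        const V v = loadMiss(x, y);            // 2D cache-miss: fetch and replace the slot
        keyX[idx] = x; keyY[idx] = y; value[idx] = v;
        return v;
    }
};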

When Not to Use a Custom Cache

In some cases, system functions (such as accessing a big file) easily outperform a custom algorithm. One example is accessing a big file through an mmap region without caring whether the whole RAM is used for it. The mmap function uses system-specific caching and compiler features (CPU caches, vectorization, paged access, and so on) and is almost as fast as using plain variables in C++; no custom cache can beat it. Here, the N-way set-associative cache is compared against direct access to the mmap region.
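The snippet below is an excerpt from that comparison; it additionally needs <sys/mman.h>, <sys/stat.h>, <fcntl.h>, <unistd.h>, <random> and <string>, plus the CpuBenchmarker and CacheThreader headers used throughout the article: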

int fd = open("output_image_scaling256.ppm", O_RDONLY);
struct stat file_info;
fstat (fd, &file_info);

char * mmap_addr = (char *) mmap(NULL, file_info.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
auto cacheLLC = std::make_shared<NWaySetAssociativeMultiThreadCache<size_t,char>>(128,
        [&](const size_t & key){
            return mmap_addr[key];
        },
        [&](size_t key, char value){  }
);

const int datasetSize = file_info.st_size/10;
const int numThread = 8;
const int numAccess = 1000000;
const int numRepeat = 200;
std::random_device rd;
std::mt19937 rng(rd());
std::uniform_real_distribution<float> rnd(0,datasetSize);
std::vector<int> r(20000000);
for(int i=0;i<datasetSize;i++)
    r[i]=rnd(rng);

std::string result;

for(int j=0;j<5;j++)
{
    {
        CpuBenchmarker bench(numAccess*numThread*numRepeat,
                             "mmap",numAccess*numThread*numRepeat);
        #pragma omp parallel for
        for(int ii=0;ii<numThread;ii++)
        {
            size_t ct=0;
            for(int j=0;j<numRepeat;j++)
            for(int i=0;i<numAccess;i++)
            {
                ct += mmap_addr[r[i]];

            }
            result += std::to_string(ct)+" ";
        }
    }

    {
        CpuBenchmarker bench(numAccess*numThread*numRepeat,"cached mmap",
                             numAccess*numThread*numRepeat);
        #pragma omp parallel for
        for(int ii=0;ii<numThread;ii++)
        {
            size_t ct2=0;
            CacheThreader<NWaySetAssociativeMultiThreadCache,size_t,char> 
                          cache(cacheLLC,1024*1024,1024*1024*2);

            for(int j=0;j<numRepeat;j++)
            for(int i=0;i<numAccess;i++)
            {
                ct2 += cache.get(r[i]);

            }
            result += std::to_string(ct2)+" ";
        }
    }
}

std::cout<<result<<std::endl;

munmap (mmap_addr, file_info.st_size);
close (fd);

This code snippet compares direct access to the mmapped region against cached mmapped access, and mmap wins by a large margin, as expected:

mmap: 452761856 nanoseconds     (bandwidth = 3533.87 MB/s)      
                                (throughput = 0.28 nanoseconds per iteration)
cached mmap: 3127557924 nanoseconds     (bandwidth = 511.58 MB/s)      
                                (throughput = 1.95 nanoseconds per iteration)
mmap: 478753028 nanoseconds     (bandwidth = 3342.02 MB/s)      
                                (throughput = 0.30 nanoseconds per iteration)
cached mmap: 3022541644 nanoseconds     (bandwidth = 529.36 MB/s)      
                                (throughput = 1.89 nanoseconds per iteration)
mmap: 441924232 nanoseconds     (bandwidth = 3620.53 MB/s)      
                                (throughput = 0.28 nanoseconds per iteration)
cached mmap: 2937497135 nanoseconds     (bandwidth = 544.68 MB/s)      
                                (throughput = 1.84 nanoseconds per iteration)
mmap: 423980321 nanoseconds     (bandwidth = 3773.76 MB/s)      
                                (throughput = 0.26 nanoseconds per iteration)
cached mmap: 2905986572 nanoseconds     (bandwidth = 550.59 MB/s)      
                                (throughput = 1.82 nanoseconds per iteration)
mmap: 493596405 nanoseconds     (bandwidth = 3241.51 MB/s)      
                                (throughput = 0.31 nanoseconds per iteration)
cached mmap: 2911731956 nanoseconds     (bandwidth = 549.50 MB/s)      
                                (throughput = 1.82 nanoseconds per iteration)

Despite the custom cache's 550 million random lookups per second, mmap is at least 6x faster. A hardware cache almost always beats a software cache, because the software version depends on the hardware anyway. But there is a catch: the RAM usage of mmap is controlled by the operating system. If the whole RAM should not be used for file caching, it can be limited by a custom cache, such as the N-way set-associative multi-threaded cache instance in the snippet.

Another case where a cache is unnecessary is when accessed elements are not reused. Parsing a file only once, iterating over an array only once, or anything cheaper than allocating the custom cache does not need one. Matrix-matrix multiplication is an example of high reuse: every element of a matrix is accessed N times, so bigger matrices have better reuse ratios, and with tiled multiplication they work even better with a cache.

Sometimes even a plain temporary variable works well enough; then there is no point adding latency through an unnecessary cache layer (unless the number of temporaries may grow in the future beyond RAM).

Sometimes the backing store is so slow (web resources) that a C++-optimized cache is overkill; a JavaScript-based cache implementation is fast enough to cache video chunks, HTML pages, and CSS files.

Sometimes the data is so sensitive that it should be protected against **timing attacks**. For such problems, **all sources of deterministic timing variation should be eliminated**. LRU and all other time-saving algorithms are vulnerable to such attacks; even random eviction is not a real solution, it only makes them harder to break.

When to Use a Custom Cache

  • Enough data reuse on a backing store that is slow enough to be slower than RAM access
  • A requirement to limit RAM usage for a large data set
  • Learning purposes
  • A requirement for access-time stability (when accessing region X of the data store is slower than region Y)
  • Memoizing medium-latency (~20 nanoseconds to 1 microsecond) work on a limited memory budget, such as zooming into the previous frame in a Mandelbrot generator, moving through a Minecraft world, or accelerating access to an in-memory database

A video demonstrates a 160x performance boost through caching in a procedural terrain-generation application, despite problems with the usage pattern:

  • The cache was not used everywhere the same data was needed: only in the rendering part, and with a reuse ratio of just 1 per terrain generation, which makes the CPU cache less effective.
  • The data type was double-precision floating point, which has some bandwidth impact on a single-channel-memory computer when data is not reused.
  • The index (key) computation adds extra latency on top of the cache.get() call.

Effects of the CPU Cache

Since a software cache's tags/slots/sets are stored in memory, they are themselves cached by the CPU. It is important to use the cache as much as possible to avoid thrashing the CPU caches. For extra performance, the cached outputs of a few elements can be kept in temporary variables, but storing 50%-100% of the whole cache contents in an array causes the CPU cache to evict the software cache in order to serve the temporary array. The next time the software cache is used, it comes from the next CPU cache level or even RAM.

Likewise, a very large software cache is bad for the CPU cache. Any traversal of the whole cache contents (100MB, 500MB, and similar sizes) also evicts everything from the CPU caches and becomes much slower, unless enough memory accesses are in flight.

If the software cache is larger than RAM, swap-file usage increases, and if an element accessed through the software cache happens to live in the swap file (not in RAM; the OS decides), a cache hit becomes worse than a cache miss: fetching the data has to evict something else to the swap file, and the evicted data may be the next element of the cache that would otherwise have been a cache hit. This sets off a chain reaction of cache misses in the swap file, which is very unfortunate for applications that need performance stability.

For better optimization, the L1 software cache can be sized to fit the CPU's L1 cache, the L2 software cache to fit the CPU's L2 cache, and the LLC software cache to fit the CPU's LLC. This can reduce cache conflicts on the CPU side, because different software caches are not mixed into the L1 CPU cache.
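A rough sizing helper as an illustration (an assumption, not from the article): pick the number of tags so that keys, values and bookkeeping fit within a given CPU-cache budget, rounded down to a power of two as the direct-mapped caches here require.

#include <cstddef>

template<typename Key, typename Value>
std::size_t tagsForBudget(std::size_t budgetBytes)
{
    // key buffer + value buffer + roughly 2 bytes of flags/chance counters per slot
    const std::size_t bytesPerSlot = sizeof(Key) + sizeof(Value) + 2;
    std::size_t n = budgetBytes / bytesPerSlot;

    // round down to an integer power of 2
    std::size_t p = 1;
    while (p * 2 <= n) p *= 2;
    return p;
}

// e.g., on a CPU with a 32kB L1 data cache per core:
// const auto L1tags = tagsForBudget<int, int>(32 * 1024);   // 2048 tags for an int->int cache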

To get the most performance out of a cache, the user should take CPU capabilities such as prefetching and SIMD computation into account. If each element is only accessed to be grouped into vectors of 8 elements, the cache should be used to gather them and better serve the CPU's SIMD architecture.
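As a small illustration (the cache object and starting index i below are placeholders), the getMultiple() helper of DirectMappedMultiThreadCache shown earlier can gather such a group in a single call:

// gather 8 consecutive keys through the cache, ready for vectorized processing
std::vector<int> keys = { i, i + 1, i + 2, i + 3, i + 4, i + 5, i + 6, i + 7 };
std::vector<int> lane = cache.getMultiple(keys);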

Points of Interest

With low-latency access, a multi-threaded read-only direct-mapped cache can be used to read static world data from the game folder and distribute it to the game logic's threads to update graphics or other game state, without depending 100% on **HDD/SSD/procedural-generation computations**. Even a single DirectMappedMultiThreadCache instance shared by all threads reduces the number of HDD accesses.

With an LRU cache, even a fast database like Redis can be cached on the client side without any extra setup on the Redis side.
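A minimal sketch, assuming a hypothetical redisClient object with get/set methods (any C++ Redis wrapper would do); LruClockCache absorbs repeated reads on the client side:

// redisClient is a hypothetical client object; only the two miss lambdas touch it
LruClockCache<std::string, std::string> redisCache(100000,
    [&](std::string key)                    { return redisClient.get(key); },     // read-miss
    [&](std::string key, std::string value) { redisClient.set(key, value); });    // write-miss

std::string v = redisCache.get("user:42");   // first access goes to Redis, later ones come from RAM
redisCache.flush();                          // push edited values back to Redis before shutdown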

When the value type is much bigger than an int, memory efficiency reaches its maximum with relatively few iterations. For example, a virtual-memory paging simulation can cache pages (4kB, 64kB, or 1MB each) in front of a file that is read/written directly on disk (or on video memory, if it is a video-memory-backed RAMDISK).
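A minimal sketch of such a page cache (file name and page size are assumptions; the file must already exist and be large enough):

#include <array>
#include <fstream>
#include "LruClockCache.h"

using Page = std::array<char, 64 * 1024>;

std::fstream file("bigdata.bin", std::ios::in | std::ios::out | std::ios::binary);

// 256 cached pages = 16MB of RAM in front of the file
LruClockCache<size_t, Page> pageCache(256,
    [&](size_t pageIndex) {                  // read-miss: load a whole page from the file
        Page p{};
        file.seekg(pageIndex * p.size());
        file.read(p.data(), p.size());
        return p;
    },
    [&](size_t pageIndex, Page p) {          // eviction/flush: write the page back to the file
        file.seekp(pageIndex * p.size());
        file.write(p.data(), p.size());
    });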

For any unknown access pattern (for example, hitting the database immediately on every client request), per-thread performance drops. This can be optimized further by pooling the requests for a while and then computing them all at once.

In the future, read-write capability for the multi-level cache could be achieved with atomic access to the "isEdited" field of every private cache tag, similar to what CPU caches do, although that would probably reduce overall per-thread performance and scalability. Other topologies could also be tested for scaling, such as a DirectMappedMultiThreadedCache in front of an LruClockCache, using only the setThreadSafe and getThreadSafe methods.

History

  • October 11, 2021
    • Started the article with basic micro-benchmarking and tests under various access patterns.
  • October 15, 2021
    • Added the N-way set-associative cache for multi-threaded cache coherence.
    • Added an example case for when not to use a custom cache, while showing the performance of the N-way set-associative cache.
  • October 17, 2021
    • Added a video to the "when to use a cache" section.
    • Fixed video errors, added video descriptions.
    • Added the section on the effects of the CPU cache.
  • October 29, 2021
    • Implemented multi-level cache coherence through a producer-consumer model.
    • Implemented multi-level cache coherence by binding two coherent sharded caches.
    • Implemented the 2D/3D direct-mapped cache.
    • Added images (to the "clock second chance algorithm", "achieving cache coherence on two level cache", "direct mapped cache", "multi level cache", "image processing (gaussian blur)", and "n-way set-associative multi-thread cache" sections).

 
