A Simple Fast Adaptive Grid to Accelerate Collision Detection Between Particle AABBs
A walkthrough of a grid implementation of the "particle-in-cell" problem to improve the performance of axis-aligned bounding box (AABB) collision detection in various scenarios
Collision Detection
Collision detection is the act of finding intersections of multiple objects in a space of one or more dimensions. As simple as it sounds, it is highly useful in many areas of science, video games, and websites. One common use case is keeping rigid bodies from passing through each other in the physics simulation of a video game.
Background
When writing code for a simple physics simulation or a tutorial for a graphics-related problem, a developer may need a collision-detection algorithm, yet many people pull in a huge, fully-featured game engine (hundreds of megabytes in size) just to use it as a collision detector, on top of a long installation and dependency-compilation process. It would be useful to have a header-only solution of fewer than 1000 lines that needs only a simple "#include", does nothing but AABB collision detection for a large number of particles (preferably fast enough for real-time work), and nothing else.
Baseline
Before starting any optimization, there is a given challenge (with simplified edge cases):
- detect all pair-wise collisions between 10k point-cloud object instances,
- each point-cloud object made of 125 points (each point holding 3 floating-point variables)
- with random sizes
- randomly distributed in 3D space
- do the same, but at 60 FPS (a rough definition of real-time performance)
- detect collisions between 1 dynamic point-cloud AABB and 10k static point-cloud AABBs, faster than 1 microsecond
- using a low-end system (FX8150 CPU at 2.1 GHz, single-channel 1333MHz DDR3 RAM)
- using C++ (g++-10 on Ubuntu)
As the simplest reference solution (to serve as the base of performance comparisons), the program includes a highly simplified brute-force intersection test:
// pseudo-code
CollisionList collisionPairList;
for(each object a in scene)
    for(each object b in scene)
        for(each point c in a)
            for(each point d in b)
            {
                if(c is close enough to d)
                {
                    // a intersects b
                    add id-values of a and b into collisionPairList
                    end the loops of c and d
                }
            }
Source code of the reference test program (using the FastCollisionDetectionLib.h header from this GitHub repository):
#include"FastCollisionDetectionLib.h"
#include<iostream>
template<typename CoordType>
struct Vector3D
{
CoordType x,y,z;
Vector3D<CoordType> crossProduct(Vector3D<CoordType> vec)
{
Vector3D<CoordType> res;
res.x = y*vec.z - z*vec.y;
res.y = z*vec.x - x*vec.z;
res.z = x*vec.y - y*vec.x;
return res;
}
Vector3D<CoordType> operator - (Vector3D<CoordType> vec)
{
Vector3D<CoordType> result;
result.x = x-vec.x;
result.y = y-vec.y;
result.z = z-vec.z;
return result;
}
Vector3D<CoordType> operator + (Vector3D<CoordType> vec)
{
Vector3D<CoordType> result;
result.x = x+vec.x;
result.y = y+vec.y;
result.z = z+vec.z;
return result;
}
Vector3D<CoordType> operator * (CoordType v)
{
Vector3D<CoordType> result;
result.x = x*v;
result.y = y*v;
result.z = z*v;
return result;
}
// warning: sqrt is not necessary in production code like:
// if ( v1.abs()<v2.abs() ) { .. }
// the sqrt below is intentionally left to simulate "heavy" task in "challenge 1"
CoordType abs()
{
return std::sqrt(x*x+y*y+z*z);
}
};
template<typename CoordType>
struct PointCloud
{
Vector3D<CoordType> point[125];
PointCloud(CoordType x, CoordType y, CoordType z)
{
for(int i=0;i<125;i++)
{
point[i].x=x+i%5-2.5f;
point[i].y=y+(i/5)%5-2.5f;
point[i].z=z+i/25-2.5f;
}
}
};
template<typename CoordType>
bool pointCloudIntersection(PointCloud<CoordType>& cl1, PointCloud<CoordType>& cl2)
{
for(Vector3D<CoordType>& p:cl1.point)
{
for(Vector3D<CoordType>& p2:cl2.point)
{
if((p-p2).abs()<1.0f)
{
return true;
}
}
}
return false;
}
int main()
{
using cotype = float;
bool results[41*41];
PointCloud<cotype> ico1(0,0,0);
// heating the CPU for benchmarking
for(int i=0;i<10000;i++)
{
PointCloud<cotype> ico2(0,0.1f,i*0.1f);
results[0]=pointCloudIntersection(ico1,ico2);
}
// benchmark begin
size_t nano;
{
FastColDetLib::Bench bench(&nano);
for(int j=-20;j<=20;j++)
for(int i=-20;i<=20;i++)
{
PointCloud<cotype> ico2(0,i*0.5f,j*0.5f);
results[i+20+(j+20)*41 ]=pointCloudIntersection(ico1,ico2);
}
}
std::cout<<41*41<<"x collision checks between 2 clouds = "<<nano
         <<" nanoseconds ("<<(nano/(41.0*41.0))<<" ns per collision check)"<<std::endl;
for(int i=0;i<41*41;i++)
{
if(i%41==0)
std::cout<<std::endl;
std::cout<<results[i];
}
std::cout<<std::endl;
return 0;
}
As expected, the program gives output like the following:
1681x collision checks between 2 clouds = 253073988 nanoseconds
(150550 ns per collision check)
00000000000000000000000000000000000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000000
00000000000111111111111111111100000000000
00000000000111111111111111111100000000000
00000000000111111111111111111100000000000
00000000000111111111111111111100000000000
00000000000111111111111111111100000000000
00000000000111111111111111111100000000000
00000000000111111111111111111100000000000
00000000000111111111111111111100000000000
00000000000111111111111111111100000000000
00000000000111111111111111111100000000000
00000000000111111111111111111100000000000
00000000000111111111111111111100000000000
00000000000111111111111111111100000000000
00000000000111111111111111111100000000000
00000000000111111111111111111100000000000
00000000000111111111111111111100000000000
00000000000111111111111111111100000000000
00000000000111111111111111111100000000000
00000000000111111111111111111100000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000000
Each cloud-vs-cloud collision check takes 150 microseconds. In the next test, collision detection is computed for 100 point clouds:
int main()
{
using cotype = float;
PointCloud<cotype> ico1(0,0,0);
// heating the CPU for benchmarking
for(int i=0;i<10000;i++)
{
PointCloud<cotype> ico2(0,0.1f,i*0.1f);
pointCloudIntersection(ico1,ico2);
}
const int N = 100;
std::vector<PointCloud<cotype>> objects;
for(int i=0;i<N;i++)
{
objects.push_back(PointCloud<cotype>(i*1.5,i*1.5,i*1.5));
}
// benchmark begin
size_t nano;
std::map<int,std::map<int,bool>> collisionMatrix;
{
FastColDetLib::Bench bench(&nano);
for(int i=0;i<N;i++)
for(int j=0;j<N;j++)
{
collisionMatrix[i][j]=pointCloudIntersection(objects[i],objects[j]);
}
}
std::cout<<N*N<<"x collision checks between 2 clouds = "<<nano
         <<" nanoseconds ("<<(nano/((double)N*N))<<" ns per collision check)"<<std::endl;
std::cout<<collisionMatrix.size()<<" unique objects are in a collision"<<std::endl;
return 0;
}
The result is as follows:
10000x collision checks between 2 clouds = 1002381391 nanoseconds
(100238 ns per collision check)
100 unique objects are in a collision
The same test with N=500:
250000x collision checks between 2 clouds = 25975908607 nanoseconds
(103904 ns per collision check)
500 unique objects are in a collision
The runtime grows from 1 second to 26 seconds (although a single collision computation drops to 103 microseconds). At this scaling, O(N^2), 10k point clouds would need no less than 3 hours of computation. This is nowhere near real-time performance.
First Optimization: AABB
To avoid unnecessary collision checks (such as object 1 versus object 100 in the last test), an axis-aligned bounding box (AABB) can be computed per object and used for an "approximate" collision check, so that a negative AABB test result safely means "these two objects definitely do not overlap". Only when the AABB collision returns true is it useful to do the finer-grained cloud-vs-cloud collision check.
An AABB needs only simple min-max points per object in 3D space, for the X, Y, and Z axes:
- min X: the minimum X coordinate of the point of the object that would touch first along the X dimension
- max X: the maximum X coordinate of the point of the object that would touch last along the X dimension
- the same for the Y and Z axes
The AABB collision check itself is simple:
- check if the two objects' AABBs collide along the X dimension (their projections onto the X axis)
- if they do not collide, there is no collision
- do the same check for the Y and Z dimensions
- if all axes collide, there is a collision between the AABBs
The following function can be reused on the X, Y, and Z dimensions of two AABBs to find collisions:
bool intersectDim(
const CoordType minx, const CoordType maxx,
const CoordType minx2, const CoordType maxx2)
{
return !((maxx < minx2) || (maxx2 < minx));
}
The new point-cloud struct with added bounds computation:
template<typename CoordType>
struct PointCloud
{
CoordType xmin,ymin,zmin;
CoordType xmax,ymax,zmax;
Vector3D<CoordType> point[125];
PointCloud(CoordType x, CoordType y, CoordType z)
{
xmin=x-2.5f;
ymin=y-2.5f;
zmin=z-2.5f;
xmax=x-2.5f;
ymax=y-2.5f;
zmax=z-2.5f;
for(int i=0;i<125;i++)
{
point[i].x=x+i%5-2.5f;
point[i].y=y+(i/5)%5-2.5f;
point[i].z=z+i/25-2.5f;
if(xmin>point[i].x)
xmin=point[i].x;
if(ymin>point[i].y)
ymin=point[i].y;
if(zmin>point[i].z)
zmin=point[i].z;
if(xmax<point[i].x)
xmax=point[i].x;
if(ymax<point[i].y)
ymax=point[i].y;
if(zmax<point[i].z)
zmax=point[i].z;
}
}
};
The new 3D collision test: still brute force, but with a broad-phase AABB check to avoid the unnecessary fine-grained work:
std::map<int,std::map<int,bool>> collisionMatrix;
{
FastColDetLib::Bench bench(&nano);
for(int i=0;i<N;i++)
for(int j=0;j<N;j++)
{
// check AABB collision on X axis
if(intersectDim(objects[i].xmin,objects[i].xmax,objects[j].xmin,objects[j].xmax))
// check AABB collision on Y axis
if(intersectDim(objects[i].ymin,objects[i].ymax,objects[j].ymin,objects[j].ymax))
// check AABB collision on Z axis
if(intersectDim(objects[i].zmin,objects[i].zmax,
objects[j].zmin,objects[j].zmax))
// only then, real fine-grained collision-checking is made
collisionMatrix[i][j]=pointCloudIntersection(objects[i],objects[j]);
}
}
The output (again for 500 point clouds, with the same initial settings) is as follows:
250000x collision checks between 2 clouds = 55842744 nanoseconds
(223.371 ns per collision check)
500 unique objects are in a collision
Reducing the runtime from 26 seconds to 55 milliseconds helps a lot toward real-time performance, but it is still not enough.
At this point, the benchmark can be scaled up again; the following output is from the same test run on 10,000 particles:
100000000x collision checks between 2 clouds = 2015666583 nanoseconds
(20.1567 ns per collision check)
10000 unique objects are in a collision
A 2-second runtime for 10k objects is 59.5 FPS short of real-time performance. The AABB check helps by exiting the collision test early: if the X-dimension check fails, all the other dimension computations are skipped. Since processing sorted arrays is faster thanks to CPU branch prediction, initializing the point clouds in increasing X order helps here too. For a pseudo-random distribution, it is not unreasonable to expect lower performance. A minimal sorting sketch is shown below.
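As a minimal illustration of that ordering effect (sortByMinX is a hypothetical helper, not part of the library; it assumes the AABB-augmented PointCloud above), the objects can be sorted by their minimum X once before the quadratic pass:

#include <algorithm>
#include <vector>

// Sort objects by xmin so that the X-axis early-exit branch in the
// brute-force loop sees mostly-ordered data and predicts better.
template<typename CoordType>
void sortByMinX(std::vector<PointCloud<CoordType>>& objects)
{
    std::sort(objects.begin(), objects.end(),
              [](const PointCloud<CoordType>& a, const PointCloud<CoordType>& b)
              {
                  return a.xmin < b.xmin;
              });
}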
Performance with a random distribution (source code):
100000000x collision checks between 2 clouds = 20054203334 nanoseconds
(200.542 ns per collision check)
10000 unique objects are in a collision
That is only 1/10 the speed of the sorted array of object AABBs. Even an old CPU like the FX8150 sustains a 10x performance difference from speculative execution, especially when the runtime is dominated by branching.
This means that for a real-world scenario (with randomness), the current performance is only 0.05 FPS, 59.95 FPS away from real-time, and it points to the need for a spatial acceleration structure.
Second Optimization: Grid
There are several types of acceleration structures that excel in their own application areas but come with specific drawbacks:
- Uniform grid: divides space into a single layer of 3D cells that store the objects touching them.
- Pro: good performance when object density is distributed uniformly over the computed region.
- Con: the "teapot in stadium" problem makes a uniform grid waste too much memory and compute collisions inefficiently.
- Octree: divides space into a fixed hierarchy of volumes with 8 child nodes per parent.
- Pro: handles irregular object density better than a uniform grid, with a smaller memory footprint.
- Con: tree-traversal latency.
- K-d tree: similar to an octree, but the split is maintained at two child nodes per parent.
- Pro: faster nearest-neighbor search.
- Con: extra rebalancing time when adding/removing objects.
- BVH (bounding volume hierarchy): groups objects instead of dividing space.
- Pro: nodes have no borders, so a practically infinite space can be mapped.
- Con: nodes can overlap, even child with parent, which makes computing nearest neighbors and other "spatial hashing" properties harder.
This article implements the simplest spatial acceleration structure, the uniform grid, benchmarks it against the brute-force approach, and then optimizes it to be adaptive.
Implementing a grid is easy, in either of two flavors:
- Collision-mask grid:
- Create an array of integers and reinterpret it as the cells of an imaginary grid over space.
- For each object, put its pointer or id value into all overlapping cells.
- For every other object, check id values/pointers only in its overlapping cells.
- Compute collisions only inside cells that contain at least one object id.
- Center-point grid:
- Again, create an array of integers and reinterpret it as cells.
- For each object, put the id value only into the cell that contains the center point of its AABB.
- For every other object, scan all neighboring cells that could contain an object id.
These implementations have different performance characteristics. Inserting id values only at the center point saves memory, and insertion takes less time. Using a collision mask makes collision queries faster when objects are smaller than the cells. The center-point variant is not efficient when objects come in various shapes with different AABB sizes, because it forces the grid to scan farther cells to find collisions of bigger objects. The collision mask, on the other hand, marks all cells overlapping an object's AABB regardless of its size, and retains its speed as long as the majority of objects are small.
In short, the collision mask gives higher collision-scanning performance on the grid, while the center-point version has better object-insertion performance; a minimal sketch of the center-point rule follows.
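A minimal sketch of the center-point insertion rule (centerCellIndex is a hypothetical helper, assuming a cubic grid of Size*Size*Size cells spanning mincorner..maxcorner on every axis); insertion computes exactly one cell index per object:

#include <cmath>

// The "center point" variant: only the cell containing the AABB center
// receives the object's id; queries must then scan neighbor cells.
template<typename CoordType>
int centerCellIndex(const PointCloud<CoordType>& obj,
                    const CoordType mincorner, const CoordType maxcorner, const int Size)
{
    const CoordType step = (maxcorner - mincorner) / Size;
    const int cx = (int)std::floor((((obj.xmin + obj.xmax) / 2) - mincorner) / step);
    const int cy = (int)std::floor((((obj.ymin + obj.ymax) / 2) - mincorner) / step);
    const int cz = (int)std::floor((((obj.zmin + obj.zmax) / 2) - mincorner) / step);
    return cx + cy*Size + cz*Size*Size; // caller must clamp or bounds-check cx/cy/cz
}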
The test implementation uses the collision-mask flavor of the uniform grid:
- finding the list of overlapping cells is an O(1) operation
- iterating the overlapping cells (to insert id values) is O(k), where k is the number of overlapping cells
- computing collisions by brute force is O(m*m), where m is the maximum number of particles per cell
- overall complexity becomes O(N*k*m*m)
- which becomes O(N*m*m) if objects are smaller than cells
- which becomes O(N) (constant time per object) if cells contain only a few objects each
Grid Implementation
Moving the AABB-related computations from PointCloud into an AABBofPointCloud struct (source code) makes it convenient to work with different kinds of objects, but for simplicity the interface declarations are postponed to a later heading and only a simple struct wrapper is used here.
The new point-cloud struct:
template<typename CoordType>
struct PointCloud
{
Vector3D<CoordType> point[125];
PointCloud(CoordType x, CoordType y, CoordType z)
{
for(int i=0;i<125;i++)
{
point[i].x=x+i%5-2.5f;
point[i].y=y+(i/5)%5-2.5f;
point[i].z=z+i/25-2.5f;
}
}
};
The AABB struct:
template<typename CoordType>
struct AABBofPointCloud
{
AABBofPointCloud(int idPrm, PointCloud<CoordType> * pCloudPrm)
{
id=idPrm;
pCloud = pCloudPrm;
xmin=pCloud->point[0].x;
ymin=pCloud->point[0].y;
zmin=pCloud->point[0].z;
xmax=pCloud->point[0].x;
ymax=pCloud->point[0].y;
zmax=pCloud->point[0].z;
for(int i=0;i<125;i++)
{
if(xmin>pCloud->point[i].x)
xmin=pCloud->point[i].x;
if(ymin>pCloud->point[i].y)
ymin=pCloud->point[i].y;
if(zmin>pCloud->point[i].z)
zmin=pCloud->point[i].z;
if(xmax<pCloud->point[i].x)
xmax=pCloud->point[i].x;
if(ymax<pCloud->point[i].y)
ymax=pCloud->point[i].y;
if(zmax<pCloud->point[i].z)
zmax=pCloud->point[i].z;
}
}
int id;
PointCloud<CoordType>* pCloud;
CoordType xmin;
CoordType ymin;
CoordType zmin;
CoordType xmax;
CoordType ymax;
CoordType zmax;
};
The Grid class:
template<typename CoordType, int Size, int ObjectsPerCell>
class Grid
{
public:
Grid(CoordType minCor, CoordType maxCor)
{
id=0;
mincorner=minCor;
maxcorner=maxCor;
cellData.resize(Size*Size*Size*(ObjectsPerCell+1));
for(int i=0;i<cellData.size();i++)
cellData[i]=0;
}
template<typename Func>
void forEachCellColliding(AABBofPointCloud<CoordType>* aabb, const Func& func)
{
// calculate cell size (equal for all dimensions for now)
const CoordType step = (maxcorner - mincorner)/Size;
// calculate overlapping region's cell indices
const int mincornerstartx = std::floor((aabb->xmin - mincorner) / step);
const int maxcornerendx = std::floor((aabb->xmax - mincorner) / step);
const int mincornerstarty = std::floor((aabb->ymin - mincorner) / step);
const int maxcornerendy = std::floor((aabb->ymax - mincorner) / step);
const int mincornerstartz = std::floor((aabb->zmin - mincorner) / step);
const int maxcornerendz = std::floor((aabb->zmax - mincorner) / step);
for(int i=mincornerstartz;i<=maxcornerendz;i++)
for(int j=mincornerstarty;j<=maxcornerendy;j++)
for(int k=mincornerstartx;k<=maxcornerendx;k++)
{
if(i<0 || i>=Size || j<0 || j>=Size || k<0 || k>=Size)
continue;
func(k,j,i,aabb);
}
}
void addObject(AABBofPointCloud<CoordType>* aabb)
{
forEachCellColliding
(aabb, [&](int k, int j, int i, AABBofPointCloud<CoordType>* aabb){
const int collidingCellIndex = (k+j*Size+i*Size*Size)*(ObjectsPerCell+1);
const int lastUsedIndex = cellData[collidingCellIndex]++;
cellData[collidingCellIndex+lastUsedIndex+1]=id;
idMapping[id++]=aabb;
});
}
std::vector<AABBofPointCloud<CoordType>*>
checkCollisionsWithSingleAABB(AABBofPointCloud<CoordType>* aabb)
{
std::vector<AABBofPointCloud<CoordType>*> result;
forEachCellColliding
(aabb, [&](int k, int j, int i, AABBofPointCloud<CoordType>* aabb){
const int collidingCellIndex = (k+j*Size+i*Size*Size)*(ObjectsPerCell+1);
const int numObjectsInCell = cellData[collidingCellIndex];
for(int p=0;p<numObjectsInCell;p++)
{
const int idObj = cellData[collidingCellIndex+1+p];
AABBofPointCloud<CoordType>* aabbPtr = idMapping[idObj];
// evade self-collision and duplicated collisions
if( aabb->id < aabbPtr->id)
if(intersectDim(aabb->xmin, aabb->xmax, aabbPtr->xmin, aabbPtr->xmax))
if(intersectDim(aabb->ymin, aabb->ymax, aabbPtr->ymin, aabbPtr->ymax))
if(intersectDim
(aabb->zmin, aabb->zmax, aabbPtr->zmin, aabbPtr->zmax))
{
result.push_back(aabbPtr);
}
}
});
return result;
}
private:
int id;
CoordType mincorner,maxcorner;
std::map<int,AABBofPointCloud<CoordType>*> idMapping;
std::vector<int> cellData;
};
Both object insertion and collision checking use the same base function that scans all overlapping cells; only the innermost part changes between insertion and checking, via a lambda function.
Note that all duplicated computations and self-collisions are avoided with a simple id comparison, which effectively gives a 2x performance gain even without any other optimization.
Since a uniform grid asks for a uniform distribution, the object placement region is enlarged to a cube of width 450, while each point cloud is 5 units wide. With 10,000 point clouds:
const int N = 10000;
std::vector<PointCloud<cotype>> objects;
oofrng::Generator<64> gen;
for(int i=0;i<N;i++)
{
objects.push_back(PointCloud<cotype>(gen.generate1Float()*450,
gen.generate1Float()*450,gen.generate1Float()*450));
}
std::vector<AABBofPointCloud<cotype>> AABBs;
for(int i=0;i<N;i++)
{
AABBs.push_back(AABBofPointCloud<cotype>(i,&objects[i]));
}
// benchmark begin
size_t nano;
std::map<int,std::map<int,bool>> collisionMatrix;
{
FastColDetLib::Bench bench(&nano);
// uniform grid for 32x32x32 cells each with 30 objects max
// mapped to (0,0,0) - (450,450,450) cube
Grid<cotype,32,30> grid(0,450);
// add AABBs to grid
for(int i=0;i<N;i++)
{
grid.addObject(&AABBs[i]);
}
for(int i=0;i<N;i++)
{
std::vector<AABBofPointCloud<cotype>*> collisions =
grid.checkCollisionsWithSingleAABB(&AABBs[i]);
for(AABBofPointCloud<cotype>* aabb:collisions)
{
if(pointCloudIntersection(*aabb->pCloud, *AABBs[i].pCloud))
{
collisionMatrix[AABBs[i].id][aabb->id]=true;
collisionMatrix[aabb->id][AABBs[i].id]=true;
}
}
}
}
std::cout<<N<<" vs "<<N<<" point-clouds collision checking by uniform grid= "
         <<nano<<" nanoseconds"<<std::endl;
int total = 0;
for(auto c:collisionMatrix)
{
for(auto c2:c.second)
{
if(c2.second)
total++;
}
}
std::cout<<total<<" total collisions (half as many for pairs)"<<std::endl;
The output is as follows:
10000 vs 10000 point-clouds collision checking by uniform grid= 32221053 nanoseconds
564 total collisions (half as many for pairs)
32 milliseconds for a 10,000 vs 10,000 collision detection is quite close to the 60 FPS target (16.6 ms), and not very far from 1 microsecond per dynamic collision check (currently 3.2 microseconds per point cloud).
For a simple grid implementation, tuning is needed to get the best performance for a specific scene. The test above used a grid of 32x32x32 cells, each holding at most 30 objects. To see the headroom, the latency of the pointCloudIntersection(p1,p2) function has to be subtracted from the total. With that function call commented out, the runtime is 18 milliseconds, which means there is 14 milliseconds of optimization room on the grid side. It can be exploited by adjusting grid parameters, memory access patterns, and the type of data stored in the cells.
Filling the grid and then testing AABBs one by one against the whole grid most likely destroys the CPU cache contents: by the time the second AABB is tested, the cell data traversed for the first AABB has already been evicted. Tiled computing solves this; the tile size should be at most the CPU's L1 cache size to benefit from the fastest cell-data access.
Since every cell stores at most 30 AABBs, brute-force computation per cell uses the L1 cache efficiently without raising the time complexity above an acceptable level, as long as the cell contents do not have too many duplicates.
With a new method of the Grid class (source code), the collision-test algorithm iterates over cells instead of AABBs and does basic duplicate removal by maintaining an internal collision matrix:
std::map<int,std::map<int,bool>> checkCollisionAllPairs()
{
std::map<int,std::map<int,bool>> collisionMatrix;
for(int k=0;k<Size;k++)
for(int j=0;j<Size;j++)
for(int i=0;i<Size;i++)
{
const int cellIndex = (i+j*Size+k*Size*Size)*(ObjectsPerCell+1);
const int nAABB = cellData[cellIndex];
// no check if only 1 or less AABB found
if(nAABB<2)
continue;
// evading duplicates
for(int o1 = 0; o1<nAABB-1; o1++)
{
for(int o2 = o1+1; o2<nAABB; o2++)
{
AABBofPointCloud<CoordType>* aabbPtr1 =
idMapping[cellData[cellIndex+1+o1]];
AABBofPointCloud<CoordType>* aabbPtr2 =
idMapping[cellData[cellIndex+1+o2]];
if( aabbPtr1->id < aabbPtr2->id)
if(intersectDim(aabbPtr1->xmin, aabbPtr1->xmax,
aabbPtr2->xmin, aabbPtr2->xmax))
if(intersectDim(aabbPtr1->ymin, aabbPtr1->ymax,
aabbPtr2->ymin, aabbPtr2->ymax))
if(intersectDim(aabbPtr1->zmin, aabbPtr1->zmax,
aabbPtr2->zmin, aabbPtr2->zmax))
{
collisionMatrix[aabbPtr1->id][aabbPtr2->id]=true;
}
}
}
}
return collisionMatrix;
}
This lets the benchmark algorithm do the same work faster, with only a simple change:
size_t nano;
std::map<int,std::map<int,bool>> collisionMatrixTmp;
std::map<int,std::map<int,bool>> collisionMatrix;
{
FastColDetLib::Bench bench(&nano);
// uniform grid for 16x16x16 cells each with 30 objects max
// mapped to (0,0,0) - (450,450,450) cube
Grid<cotype,16,30> grid(0,450);
// add AABBs to grid
for(int i=0;i<N;i++)
{
grid.addObject(&AABBs[i]);
}
collisionMatrixTmp = grid.checkCollisionAllPairs();
for(auto c:collisionMatrixTmp)
{
for(auto c2:c.second)
{
if(c2.second)
if(pointCloudIntersection(*AABBs[c.first].pCloud,*AABBs[c2.first].pCloud))
{
collisionMatrix[c.first][c2.first]=true;
collisionMatrix[c2.first][c.first]=true;
}
}
}
}
The output is as follows:
10000 vs 10000 point-clouds collision checking by uniform grid= 22510565 nanoseconds
564 total collisions (half as many for pairs)
Finding the same collisions in 22.5 milliseconds indicates that most of the optimization headroom has been used this time, despite the duplicated AABB ids in every cell that holds multiple overlapping AABBs.
Grid Implementation with Multithreaded Single-AABB Collision Checking
Without changing the point-cloud collision-test algorithm, the only optimization left to accelerate the overall collision-detection process is multithreading. In the current state of the sample grid implementation (outside of the library header), single-object AABB collision testing is inherently thread-safe because it neither creates nor edits any internal state. OpenMP accelerates this part easily, while the all-pairs computation needs a custom thread pool (as found in the library header).
To test multithreaded collision computation (not an all-pairs computation at once, but N times a one-vs-all computation), the following OpenMP parallelization is used (source code):
// a simple global locking point
std::mutex mut;
// fx8150 has 8 integer cores, 4 shared FPUs
#pragma omp parallel for
for(int i=0;i<N;i++)
{
std::vector<AABBofPointCloud<cotype>*> result =
grid.checkCollisionsWithSingleAABB(&AABBs[i]);
for(auto c:result)
{
if(pointCloudIntersection(*c->pCloud,*AABBs[i].pCloud))
{
// modern C++ scope-based locking
std::lock_guard<std::mutex> lg(mut);
collisionMatrix[c->id][AABBs[i].id]=true;
collisionMatrix[AABBs[i].id][c->id]=true;
}
}
}
The output is:
10000 vs 10000 point-clouds collision checking by uniform grid= 13742454 nanoseconds
564 total collisions (half as many for pairs)
13.7 milliseconds, including the fine-grained collision checks, and without any special multithreading optimization (all threads use a global lock instead of false-sharing-free per-element locks).
Since this performance satisfies the first performance constraint, testing continues with the harder part: the "teapot in stadium" problem.
Third Optimization: Adaptive Grid
So far, the uniform grid has had the following memory layout, reconstructed below as a text sketch:
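(The original figure is reconstructed here as a text sketch; the 5-id capacity shown is an assumption inferred from the overflow example described next.)

// cellData layout: one (ObjectsPerCell+1)-integer slot per cell,
// here shown with ObjectsPerCell = 5:
//
//   cell 0                       cell 1
//   [count][id][id][id][id][id]  [count][id][id][id][id][id]  ...
//    ^ header: number of ids currently stored in this cell
//
// addObject increments the header and writes the new id right after the
// last used slot; nothing checks the header against ObjectsPerCell.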
In this layout each header is either zero or a positive integer. With 5 id slots per cell as in the sketch, adding 6 or more AABB instances to the same cell volume either causes a segmentation fault (if it is the last cell of the integer array) or overwrites another cell's header value, causing undefined behavior.
Even without cell overflow, object clustering patterns leave most of the grid unused, with many empty cells between clusters of objects. This forces uniform-grid settings with a much higher AABB capacity per cell, and causes higher complexity plus the latency of allocating a bigger integer buffer.
To reproduce the issue, the coordinate initialization of the objects (point clouds) gets 3 extra iterations with X, Y, Z center coordinates very close to each other but far away from the center of mass of the rest of the point clouds (source code):
const int N = 10003;
std::vector<PointCloud<cotype>> objects;
oofrng::Generator<64> gen;
for(int i=0;i<N-3;i++)
{
objects.push_back(PointCloud<cotype>(gen.generate1Float()*450,
gen.generate1Float()*450,gen.generate1Float()*450));
}
// the teapot in stadium problem
objects.push_back(PointCloud<cotype>(10000,10000,10000));
objects.push_back(PointCloud<cotype>(10001,10001,10001));
objects.push_back(PointCloud<cotype>(10002,10002,10002));
and
// uniform grid for 16x16x16 cells each with 30 objects max
// mapped to (0,0,0) - (10005,10005,10005) cube
// due to the teapot-in-stadium problem, has to have a bigger mapping
Grid<cotype,16,30> grid(0,10005);
The output is:
10003 vs 10003 point-clouds collision checking by uniform grid= 2238980417 nanoseconds
570 total collisions (half as many for pairs)
With only 6 extra collisions (from 3 objects), performance drops to 0.45 FPS (2.2 seconds per computation) with 8 threads in use, and 14 seconds single-threaded. This is even slower than the simple brute-force algorithm.
To optimize the memory access pattern, allocation usage, and time complexity, each cell can adapt to the number of incoming objects, simply by using different cell-header logic.
During insertion, every time a cell overflows, the algorithm adds a new low-resolution, low-capacity grid to the parent grid's (dynamic) array of child grids, and inserts all the object AABB ids of the overflowing cell (together with the id of the AABB currently being inserted) into the new grid, recursively, until all overflowing AABBs are successfully inserted. To avoid infinite recursion, every deeper grid layer gets 2 extra AABBs of per-cell capacity. A simplified sketch of this overflow rule is shown below.
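A simplified, hypothetical sketch of the overflow rule (the library's real adaptive grid does more bookkeeping, such as recomputing each id's cell index inside the child grid):

#include <memory>
#include <vector>

struct GridSketch
{
    int capacity;                                   // max ids per cell at this depth
    std::vector<std::vector<int>> cellIds;          // flattened cells of this grid
    std::vector<std::unique_ptr<GridSketch>> child; // lazily created child grids

    GridSketch(int nCells, int cap) : capacity(cap), cellIds(nCells), child(nCells) {}

    void insert(int cell, int id)
    {
        if((int)cellIds[cell].size() < capacity)
        {
            cellIds[cell].push_back(id);            // normal path: cell has room
            return;
        }
        // overflow: create a low-resolution (2x2x2 = 8 cells) child grid
        // with +2 per-cell capacity, so the recursion always terminates
        if(!child[cell])
            child[cell] = std::make_unique<GridSketch>(8, capacity + 2);
        // push the old ids and the incoming id one level deeper
        // (real code recomputes each id's cell index inside the child grid)
        for(const int old : cellIds[cell])
            child[cell]->insert(0, old);
        child[cell]->insert(0, id);
        cellIds[cell].clear();
    }
};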
The adaptive-grid implementation found in the header file of the FastCollisionDetectionLib repository solves the same "teapot in stadium" problem (source code):
std::map<int,std::map<int,bool>> collisionMatrix;
{
FastColDetLib::Bench bench(&nano);
// adaptive grid
FastColDetLib::ThreadPool<cotype> thr;
FastColDetLib::AdaptiveGrid<cotype> grid(thr,0,0,0,10005,10005,10005);
// add AABBs to grid
for(int i=0;i<N;i++)
{
grid.add(&AABBs[i],1);
}
std::mutex mut;
#pragma omp parallel for
for(int i=0;i<N;i++)
{
std::vector<FastColDetLib::IParticle<cotype>*> result =
grid.getDynamicCollisionListFor((FastColDetLib::IParticle<cotype>*)&AABBs[i]);
for(auto c:result)
{
if(c->getId() != AABBs[i].getId())
if(pointCloudIntersection(*AABBs[c->getId()].pCloud,*AABBs[i].pCloud))
{
std::lock_guard<std::mutex> lg(mut);
collisionMatrix[c->getId()][AABBs[i].id]=true;
collisionMatrix[AABBs[i].id][c->getId()]=true;
}
}
}
}
The output is as follows:
10003 vs 10003 point-clouds collision checking by uniform grid= 16226199 nanoseconds
570 total collisions (half as many for pairs)
Just below 16.7 milliseconds (60 FPS), using 8 threads and simple locking. Adaptiveness keeps the maximum number of objects per cell low while allocating a larger integer-buffer area with a smaller unused remainder. Together, these give better-cached (child-grid) traversal and better density per volume (assuming the maximum number of overlapping objects does not reach hundreds in many cells).
The multithreaded all-pairs computation method shows similar performance (source code):
10003 vs 10003 point-clouds collision checking by uniform grid= 16811872 nanoseconds
570 total collisions (half as many for pairs)
The adaptive-grid implementation of FastCollisionDetectionLib differs from this article's version only in minor details and is being optimized incrementally (the next optimization is multithreading for the AABB insertion behind grid.add(..)).
The second part of the challenge is querying the collision list of a single AABB faster than 1 microsecond. With a modification of the benchmark:
for(int i=0;i<10003;i++)
{
std::vector<FastColDetLib::IParticle<cotype>*> result;
{
FastColDetLib::Bench bench(&nano);
result = grid.getDynamicCollisionListFor((FastColDetLib::IParticle<cotype>*)&AABBs[i]);
}
std::cout<<1<<" vs "<<N<<" AABB collision checking by adaptive grid= "
         <<nano<<" nanoseconds "<<std::endl;
std::cout<<result.size()<<" collisions detected for this AABB"<<std::endl;
}
The output is:
...
1 vs 10003 AABB collision checking by adaptive grid= 1767 nanoseconds
1 collisions detected for this AABB
1 vs 10003 AABB collision checking by adaptive grid= 5340 nanoseconds
1 collisions detected for this AABB
1 vs 10003 AABB collision checking by adaptive grid= 1757 nanoseconds
1 collisions detected for this AABB
1 vs 10003 AABB collision checking by adaptive grid= 1743 nanoseconds
3 collisions detected for this AABB
1 vs 10003 AABB collision checking by adaptive grid= 1276 nanoseconds
3 collisions detected for this AABB
1 vs 10003 AABB collision checking by adaptive grid= 1114 nanoseconds
3 collisions detected for this AABB
With non-branching grid cells and only a few collisions, the latency is at least 10% above the 1 microsecond required by the challenge. Adaptiveness comes with extra latency from the grid-to-grid traversal logic.
The latency can be hidden behind multithreading or pipelining. With multithreading and no locking (fully independent collision queries), 10003 queries complete at less than 1 microsecond per query, and in less total time than the 16 milliseconds of the earlier benchmark that used locking:
std::atomic<int> ctr;
ctr.store(0);
{
{
FastColDetLib::Bench bench(&nano);
#pragma omp parallel for
for(int i=0;i<N;i++)
{
std::vector<FastColDetLib::IParticle<cotype>*> result;
result = grid.getDynamicCollisionListFor
((FastColDetLib::IParticle<cotype>*)&AABBs[i]);
ctr+= result.size();
}
}
std::cout<<N<<" vs "<<N<<" AABB collision checking by adaptive grid= "
         <<nano<<" nanoseconds "<<std::endl;
std::cout<<"total = "<<ctr.load()<<std::endl;
}
The output is:
10003 vs 10003 AABB collision checking by adaptive grid= 3760629 nanoseconds
total = 10573
3.7 milliseconds for 10k objects equals 370 nanoseconds per object.
The FastColDetLib library works only with pointers to the IParticle interface (decoupling user implementations from the library implementation) to support different types of objects in the same scene. The interface requires the user to implement the following methods:
struct AABBofPointCloud: public FastColDetLib::IParticle<CoordType>
{
...
const CoordType getMaxX()const {return xmax;}
const CoordType getMaxY()const {return ymax;}
const CoordType getMaxZ()const {return zmax;}
const CoordType getMinX()const {return xmin;}
const CoordType getMinY()const {return ymin;}
const CoordType getMinZ()const {return zmin;}
const int getId()const {return id;}
...
};
The getMaxX/Y/Z and getMinX/Y/Z methods are used to compute collisions, while the getId method is used to avoid duplicates in the all-pairs computation (grid.getCollisions()). Working only with pointers lets the algorithm use less temporary memory, at the cost of extra indirection, which can be partially hidden by threads.
All-Pairs Collision Detection for 20k AABBs at 60 FPS
Reaching the second challenge, all-pairs collision detection for 20k AABBs at 60 FPS, required a complete rewrite of the adaptive grid code, for these reasons:
- Tree-like structures (such as the adaptive grid) do pointer chasing between nodes. Chasing a pointer is slower than iterating a linear array because it is less cache-friendly.
- Every node of the tree is an actual object, which means an allocation per instance. Allocations are slow, and the OS may serialize them for bookkeeping during memory fragmentation.
- Each node contains a lot of particle data, with no data reuse between nodes. This is bad for caching (a high cache-miss ratio).
- The AABB-AABB collision-pair computation does not use data locality and works on non-contiguous data, which caches poorly and cannot be vectorized for the SIMD units of the CPU cores.
- Node-AABB collision tests have no early exit (the collision mask is currently an array of integers).
- The tree structure is dense and creates all child nodes for every parent, which leads to a worse cache-miss ratio.
- The computation pattern is array-of-structures (AOS), which is less efficient per cache line.
The solutions:
- Pointer chasing with low cache efficiency --> implement a linear tree inside a buffer.
- Too many allocations (one per node) --> use a memory pool.
- Too many object fields per node --> separate the fields from the nodes.
- Non-contiguous, non-local data access --> tiled computing on plain arrays, letting the compiler vectorize.
- No early exit in node <--> AABB collision checks --> use bit masks as collision masks (64 bits per node, i.e., 4x4x4).
- Unused nodes --> implement a sparse tree that does not allocate empty child nodes.
All the solutions above converge on a single data structure: a memory-pool-backed sparse linear adaptive grid with tiled collision computation.
Sparse Linear Adaptive Grid Implementation
To support a linear tree implementation with high-throughput pseudo-allocations, a memory pool is needed first (found in the same implementation header file).
Memory Pool
For the current project, the memory-pool implementation was developed with repeated reset--allocate--reset--allocate chained usage in mind. It is meant to do real allocations only on internal capacity overflow and to return quickly from every allocation method call, giving the collision-detection algorithm stability in both latency and memory consumption.
For stable memory consumption, it does real allocations only at integer powers of 2, growing in steps of 1024 elements, 2k elements, 4k elements, 8k elements, and so on, until the requested allocation size is reached. Once 1M elements have actually been allocated, it returns quickly from any allocation below 1M elements after a single reset. The exponential growth of real allocations minimizes their number, and stability is reached after only a few iterations of the collision-detection loop.
The implementation has basic get and set data-access methods, a reset method (O(1) complexity) to reclaim all allocations, and an allocate method (completing at O(1) complexity most of the time):
template<typename DataType>
class Memory
{
public:
Memory()
{
memory=std::make_shared<std::vector<DataType>>();
allocPtr=std::make_shared<int>();
*allocPtr = 0;
allocPtrPtr=allocPtr.get();
memory->resize(1024);
ptr=memory->data();
}
inline
DataType * getPtr(const int index) const noexcept
{
return ptr+index;
}
inline
DataType& getRef(const int index) const noexcept
{
return ((DataType* __restrict__ const)ptr)[index];
}
inline
const DataType get(const int index) const noexcept
{
return ((DataType* __restrict__ const)ptr)[index];
}
inline
void set(const int index, const DataType data) const noexcept
{
((DataType* __restrict__ const)ptr)[index]=data;
}
inline
void readFrom(Memory<DataType>& mem, const int index, const int indexThis, const int n)
{
std::copy(mem.ptr+index,mem.ptr+index+n,ptr+indexThis);
}
inline
void writeTo(std::vector<DataType>& vec)
{
std::copy(ptr,ptr+*allocPtrPtr,vec.data());
}
inline
const int allocate(const int size)
{
const int result = *allocPtrPtr;
while(size + *allocPtrPtr >= memory->size())
{
memory->resize(memory->size()*2);
}
*allocPtrPtr += size;
ptr=memory->data();
return result;
}
inline
const int capacity()
{
return memory->size();
}
inline
const int size()
{
return *allocPtrPtr;
}
inline
void reset()
{
*allocPtrPtr = 0;
}
private:
DataType* ptr;
std::shared_ptr<int> allocPtr;
int* allocPtrPtr;
std::shared_ptr<std::vector<DataType>> memory;
};
struct MemoryPool
{
void clear()
{
nodeCollisionMask.reset();
childNodeCount.reset();
index.reset();
indexParticle.reset();
orderParticle.reset();
minX.reset();
maxX.reset();
minY.reset();
maxY.reset();
minZ.reset();
maxZ.reset();
nodeMinX.reset();
nodeMinY.reset();
nodeMinZ.reset();
nodeInvWidth.reset();
nodeInvHeight.reset();
nodeInvDepth.reset();
leafOffset.reset();
}
// node-particle collision
Memory<uint64_t> nodeCollisionMask;
Memory<char> childNodeCount;
Memory<int> index;
Memory<int> indexParticle;
Memory<int> orderParticle;
Memory<float> nodeMinX;
Memory<float> nodeMinY;
Memory<float> nodeMinZ;
Memory<float> nodeInvWidth;
Memory<float> nodeInvHeight;
Memory<float> nodeInvDepth;
Memory<float> minX;
Memory<float> maxX;
Memory<float> minY;
Memory<float> maxY;
Memory<float> minZ;
Memory<float> maxZ;
// offsets of leaf nodes, filled by buildTree and read by findCollisionsAll
// (this member is used below but was missing from the original listing)
Memory<int> leafOffset;
Memory<int> idTmp[64];
Memory<int> orderTmp[64];
Memory<std::pair<int,int>> allPairsColl;
Memory<FastUnique<int32_t, testParticleLimit>> allPairsCollmapping;
};
Compared to std::vector, this class optimizes memory consumption by keeping a single internal buffer shared by all clones of the same instance, with fewer allocations and relatively cheap allocate and clear (reset) methods. A small usage sketch follows.
For a nonlinear data structure with an unknown construction pattern, a memory pool would need sophisticated allocation-control logic rather than this simple implementation. For the linear adaptive grid, this is enough to serve all allocations quickly.
Performance test of the Memory component against std::vector, on plain integers:
FastColDetLib::Memory<int> mem;
size_t measure;
for(int j=0;j<15;j++)
{
FastColDetLib::Bench bench(&measure);
mem.reset();
for(int i=0;i<100000+j*200000;i++)
{
int index = mem.allocate(3);
mem.set(index,0);
mem.set(index+1,1);
mem.set(index+2,2);
}
std::cout<<"Memory (size="<<100000+j*200000<<"): "<<measure<<" ns"<<std::endl;
}
std::vector<int> mem2;
for(int j=0;j<15;j++)
{
FastColDetLib::Bench bench(&measure);
mem2.clear();
for(int i=0;i<100000+j*200000;i++)
{
mem2.push_back(0);
mem2.push_back(1);
mem2.push_back(2);
}
std::cout<<"Vector (size="<<100000+j*200000<<"): "<<measure<<" ns"<<std::endl;
}
Both the std::vector and the Memory instance are tested for on-demand allocations (unknown memory requirement) and one-shot allocation (known memory requirement):
Memory (size=100000): 94674792711856 ns // cold benchmark start = low performance
Memory (size=300000): 6486661 ns
Memory (size=500000): 11162508 ns
Memory (size=700000): 22083055 ns
Memory (size=900000): 42743681 ns
Memory (size=1100000): 6999557 ns
Memory (size=1300000): 8683617 ns
Memory (size=1500000): 10757706 ns
Memory (size=1700000): 83456551 ns
Memory (size=1900000): 14201669 ns
Memory (size=2100000): 16036462 ns
Memory (size=2300000): 17733826 ns
Memory (size=2500000): 18858536 ns
Memory (size=2700000): 20413091 ns
Memory (size=2900000): 23847971 ns
Vector (size=100000): 159671661 ns
Vector (size=300000): 4493545 ns
Vector (size=500000): 10269529 ns
Vector (size=700000): 19388296 ns
Vector (size=900000): 34455068 ns
Vector (size=1100000): 11864543 ns
Vector (size=1300000): 14701698 ns
Vector (size=1500000): 16956325 ns
Vector (size=1700000): 71052797 ns
Vector (size=1900000): 21863594 ns
Vector (size=2100000): 23666274 ns
Vector (size=2300000): 26714740 ns
Vector (size=2500000): 28240470 ns
Vector (size=2700000): 31571745 ns
Vector (size=2900000): 32961081 ns
After the Memory instance completes its real allocations in the first few benchmark iterations, the subsequent tests almost always run on the non-allocating code path, and performance is 10%-50% higher than std::vector in the unknown-memory-requirement scenario with a gradually increasing total allocation size.
When the allocation size is known, both perform the same (resize + operator[] for the vector, a single allocate for Memory). A sketch of this known-size pattern is shown below.
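A minimal sketch of the known-size pattern (both snippets fill a buffer of n integers with one up-front allocation):

const int n = 1000000;

std::vector<int> v;
v.resize(n);                        // one real allocation up front
for(int i=0;i<n;i++)
    v[i] = i;

FastColDetLib::Memory<int> m;
const int base = m.allocate(n);     // one pool allocation (backing store grows by powers of 2)
for(int i=0;i<n;i++)
    m.set(base + i, i);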
Internal Fields of the Linear Sparse Adaptive Grid
Since the MemoryPool serves the data of all nodes, the grid object itself only needs to maintain the spatial properties of the root node:
- bounding-box corners of the root node: minCornerX/Y/Z and maxCornerX/Y/Z
- cell sizes of the root node: cellWidth/Height/Depth
- inverse cell sizes, to avoid repeated division operations: cellWidthInv/HeightInv/DepthInv
struct AdaptiveGridV2Fields
{
AdaptiveGridV2Fields(MemoryPool mPool, const float minx, const float miny, const float minz,
const float maxx, const float maxy, const float maxz):mem(mPool),
minCornerX(minx),minCornerY(miny),minCornerZ(minz),maxCornerX(maxx),maxCornerY(maxy),maxCornerZ(maxz),
cellWidth ((maxx-minx)*0.25f),
cellHeight ((maxy-miny)*0.25f),
cellDepth ((maxz-minz)*0.25f),
cellWidthInv (1.0f/((maxx-minx)*0.25f)),
cellHeightInv(1.0f/((maxy-miny)*0.25f)),
cellDepthInv (1.0f/((maxz-minz)*0.25f))
{
}
MemoryPool mem;
const float minCornerX;
const float minCornerY;
const float minCornerZ;
const float maxCornerX;
const float maxCornerY;
const float maxCornerZ;
const float cellWidth;
const float cellHeight;
const float cellDepth;
const float cellWidthInv;
const float cellHeightInv;
const float cellDepthInv;
};
This struct is used as the main internal data of the AdaptiveGridV2 class, whose name indicates that it was optimized for challenge 2 (20k AABBs, all-pairs computation, 16.7 milliseconds):
class AdaptiveGridV2
{
public:
// constructs adaptive grid for a given AABB volume
AdaptiveGridV2(MemoryPool mem,
const float minx, const float miny, const float minz,
const float maxx, const float maxy, const float maxz);
// clears memory pool, to be reused later (for rebuilding the tree)
void clear();
template<typename Derived>
inline void addParticles(const int numParticlesToAdd,
Derived * const __restrict__ particles)
{ .. adds AABBs to grid using IParticle<float> interface of Derived class .. }
// test collisions with an AABB, returns vector of id values of colliding AABBs
std::vector<int> findCollisions(const float minx, const float miny,
const float minz, const float maxx,
const float maxy, const float maxz);
// test all collisions between all AABBs inside grid
std::vector<std::pair<int,int>> findCollisionsAll();
void buildTree();
private:
std::shared_ptr<AdaptiveGridV2Fields> fields; // single shared data
};
This lets any instance be copied or moved without accidental deep copies, while keeping the code readable. For example:
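(A usage sketch, not library documentation: copies are shallow handles onto the same tree.)

FastColDetLib::MemoryPool memPool;
FastColDetLib::AdaptiveGridV2 gridA(memPool, 0,0,0, 100,100,100);
FastColDetLib::AdaptiveGridV2 gridB = gridA; // shares the same fields pointer
gridB.clear();                               // gridA observes the cleared pool too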
Tree-Node Representation on a Buffer
With the new data representation, there is no actual object for any node except the root. A node is made of three main integer fields: a particle-list pointer, a particle-list size, and a child-node pointer. These three fields are stored contiguously in a single integer buffer, while the side fields (min-corner x/y/z and inverse cell size x/y/z) are stored in 6 different float-typed buffers, as sketched below.
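(The slot names below only restate the description above; the exact linkage semantics are visible in the buildTree code later.)

// Node i lives at nodeOffset = 3*i inside the index buffer:
//   index[nodeOffset+0] : start of the node's particle list (offset into orderParticle)
//   index[nodeOffset+1] : number of particles in that list
//   index[nodeOffset+2] : child-node linkage (also used for backtracking while building)
// Side fields of node i live at slot i of the float buffers:
//   nodeMinX[i], nodeMinY[i], nodeMinZ[i],
//   nodeInvWidth[i], nodeInvHeight[i], nodeInvDepth[i]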
Adding Particles to the Root Node
Raw particle data is inserted only into the root node; when child nodes are created, the tree build takes only references (order values) into the root node's particle buffer.
Adding particles:
template<typename Derived>
inline void addParticles(const int numParticlesToAdd, Derived * const __restrict__ particles)
{
const int pId = fields->mem.indexParticle.allocate(numParticlesToAdd);
const int oId = fields->mem.orderParticle.allocate(numParticlesToAdd);
const int maxXId = fields->mem.maxX.allocate(numParticlesToAdd);
const int maxYId = fields->mem.maxY.allocate(numParticlesToAdd);
const int maxZId = fields->mem.maxZ.allocate(numParticlesToAdd);
const int minXId = fields->mem.minX.allocate(numParticlesToAdd);
const int minYId = fields->mem.minY.allocate(numParticlesToAdd);
const int minZId = fields->mem.minZ.allocate(numParticlesToAdd);
fields->mem.index.set(1,fields->mem.index.get(1)+numParticlesToAdd);
for(int i=0;i<numParticlesToAdd;i++)
{
const IParticle<float> * const curPtr = static_cast<const IParticle<float>* const>(particles+i);
fields->mem.indexParticle.set(pId+i,curPtr->getId());
fields->mem.orderParticle.set(oId+i,oId+i);
fields->mem.maxX.set(maxXId+i,curPtr->getMaxX());
fields->mem.maxY.set(maxYId+i,curPtr->getMaxY());
fields->mem.maxZ.set(maxZId+i,curPtr->getMaxZ());
fields->mem.minX.set(minXId+i,curPtr->getMinX());
fields->mem.minY.set(minYId+i,curPtr->getMinY());
fields->mem.minZ.set(minZId+i,curPtr->getMinZ());
}
}
When particles are added in big batches (n>1), inserting 20,000 randomly placed particles takes about one millisecond on the 2.1GHz FX8150 system with single-channel 1333MHz DDR3 memory.
Tree-Building Process
Building the linear (serialized) representation of the adaptive grid tree is straightforward. The basic workflow in pseudocode:
current = root
last item = root + 3 [3 = node size in number of integers]
while current offset < last item offset
    if current has more particles than 32
        this is now an internal node
        map particles to 1-64 child nodes (create child nodes sparsely)
        for each created child node
            compute bitmask (uint64_t) for early quitting from collision computation
                (4x4x4 = uint64_t, 2x2x2 = uint8_t, 1 bit per child node position)
            copy the mapped particle-id values to child node
            last item = child node
    else
        this is now a leaf node
        number of child nodes = 0
        save offset to leaf-list (for fast-traversing all leaves in all-pairs detection)
The implementation follows:
void buildTree()
{
int particleStart = fields->mem.index.get(0);
int numParticle = fields->mem.index.get(1);
int nodeOffset = 0;
float minCornerX = fields->mem.nodeMinX.get(0);
float minCornerY = fields->mem.nodeMinY.get(0);
float minCornerZ = fields->mem.nodeMinZ.get(0);
float cellWidthInv = fields->mem.nodeInvWidth.get(0);
float cellHeightInv = fields->mem.nodeInvHeight.get(0);
float cellDepthInv = fields->mem.nodeInvDepth.get(0);
float cellWidth = 1.0f/cellWidthInv;
float cellHeight = 1.0f/cellHeightInv;
float cellDepth = 1.0f/cellDepthInv;
int ctr=0;
int maxNodeOffset = 3;
while(nodeOffset <= maxNodeOffset)
{
ctr++;
int ctrTmp[64]={0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0};
// if child node pointer not set up
if(fields->mem.index.get(nodeOffset+2)<nodeOffset && fields->mem.index.get(fields->mem.index.get(nodeOffset+2)+2)>=0)
{
fields->mem.index.set(fields->mem.index.get(nodeOffset+2)+2,-(nodeOffset+1));
}
int childNodeCount = 0;
if(numParticle > testParticleLimit)
{
{
for(int zz = 0; zz<4; zz++)
for(int yy = 0; yy<4; yy++)
for(int xx = 0; xx<4; xx++)
{
// allocate node
const int index0 = xx+yy*4+zz*16;
fields->mem.orderTmp[index0].reset();
fields->mem.orderTmp[index0].allocate(numParticle);
}
}
{
for(int ii=0;ii<numParticle;ii++)
{
const int orderParticle = fields->mem.orderParticle.get(particleStart+ii);
const float& minx = fields->mem.minX.getRef(orderParticle);
const float& miny = fields->mem.minY.getRef(orderParticle);
const float& minz = fields->mem.minZ.getRef(orderParticle);
const float& maxx = fields->mem.maxX.getRef(orderParticle);
const float& maxy = fields->mem.maxY.getRef(orderParticle);
const float& maxz = fields->mem.maxZ.getRef(orderParticle);
const int indexStartX = std::floor((minx - minCornerX)*cellWidthInv);
const int indexEndX = std::floor((maxx - minCornerX)*cellWidthInv);
const int indexStartY = std::floor((miny - minCornerY)*cellHeightInv);
const int indexEndY = std::floor((maxy - minCornerY)*cellHeightInv);
const int indexStartZ = std::floor((minz -minCornerZ)*cellDepthInv);
const int indexEndZ = std::floor((maxz - minCornerZ)*cellDepthInv);
// prepare cell indicator mask (1 bit = has object, 0 bit = empty))
for(int k=indexStartZ; k<=indexEndZ; k++)
{
if(k<0 || k>=4)
continue;
for(int j=indexStartY; j<=indexEndY; j++)
{
if(j<0 || j>=4)
continue;
for(int i=indexStartX; i<=indexEndX; i++)
{
if(i<0 || i>=4)
continue;
const int index0 = i+j*4+k*16;
fields->mem.orderTmp[index0].set(ctrTmp[index0],orderParticle);
ctrTmp[index0]++;
}
}
}
}
}
// add all particles in order (from first child node to last child node)
childNodeCount=0;
for(int zz = 0; zz<4; zz++)
for(int yy = 0; yy<4; yy++)
for(int xx = 0; xx<4; xx++)
{
const int index0 = xx+yy*4+zz*16;
const int sz = ctrTmp[index0];
if(sz>0)
{
childNodeCount++;
const int nodeIndexOfs = fields->mem.index.allocate(3);
const int particleStartCur = nodeIndexOfs;
const int numParticleCur = nodeIndexOfs+1;
const int childNodeStartCur = nodeIndexOfs+2;
const int tmpIndex = fields->mem.childNodeCount.allocate(1);
const int nodeBoundMinXFloat = fields->mem.nodeMinX.allocate(1);
const int nodeBoundMinYFloat = fields->mem.nodeMinY.allocate(1);
const int nodeBoundMinZFloat = fields->mem.nodeMinZ.allocate(1);
const int nodeInvWidthFloat = fields->mem.nodeInvWidth.allocate(1);
const int nodeInvHeightFloat = fields->mem.nodeInvHeight.allocate(1);
const int nodeInvDepthFloat = fields->mem.nodeInvDepth.allocate(1);
fields->mem.nodeMinX.set(nodeBoundMinXFloat,minCornerX+xx*cellWidth);
fields->mem.nodeMinY.set(nodeBoundMinYFloat,minCornerY+yy*cellHeight);
fields->mem.nodeMinZ.set(nodeBoundMinZFloat,minCornerZ+zz*cellDepth);
fields->mem.nodeInvWidth.set(nodeInvWidthFloat,cellWidthInv*4.0f);
fields->mem.nodeInvHeight.set(nodeInvHeightFloat,cellHeightInv*4.0f);
fields->mem.nodeInvDepth.set(nodeInvDepthFloat,cellDepthInv*4.0f);
const int nodeMaskIndex = fields->mem.nodeCollisionMask.allocate(1);
uint64_t nodeMask = 0;
storeBit(nodeMask,1,index0);
fields->mem.nodeCollisionMask.set(nodeMaskIndex,nodeMask);
const int allocOffset = fields->mem.orderParticle.allocate(sz);
fields->mem.orderParticle.readFrom(fields->mem.orderTmp[index0],0,allocOffset,sz);
fields->mem.index.set(particleStartCur,allocOffset);
fields->mem.index.set(numParticleCur,sz);
fields->mem.index.set(childNodeStartCur,nodeOffset);
maxNodeOffset=particleStartCur;
}
}
fields->mem.childNodeCount.set(nodeOffset/3,childNodeCount);
}
else
{
fields->mem.childNodeCount.set(nodeOffset/3,0);
const int idx = fields->mem.leafOffset.allocate(1);
fields->mem.leafOffset.set(idx,nodeOffset);
}
nodeOffset += 3;
numParticle=0;
if(nodeOffset <= maxNodeOffset)
{
particleStart = fields->mem.index.get(nodeOffset);
numParticle = fields->mem.index.get(nodeOffset+1);
minCornerX = fields->mem.nodeMinX.get(nodeOffset/3);
minCornerY = fields->mem.nodeMinY.get(nodeOffset/3);
minCornerZ = fields->mem.nodeMinZ.get(nodeOffset/3);
cellWidthInv = fields->mem.nodeInvWidth.get(nodeOffset/3);
cellHeightInv = fields->mem.nodeInvHeight.get(nodeOffset/3);
cellDepthInv = fields->mem.nodeInvDepth.get(nodeOffset/3);
cellWidth = 1.0f/cellWidthInv;
cellHeight = 1.0f/cellHeightInv;
cellDepth = 1.0f/cellDepthInv;
}
}
}
It looks more complicated than it is, because of the serialized form and the extra bookkeeping for computing child-node offset values by backtracking.
Benchmark Timings for Tree Building + Particle Insertion
[Chart] Random start positions and random movement, 0 collisions, 20000 particles, 650 time steps. Y axis: seconds (initial allocations cause spikes of up to 6.75 milliseconds). X axis: time step (up to 650).
[Chart] Crystal-like start positions and no movement, 0 collisions, 20000 particles. Y axis: seconds. X axis: time step.
[Chart] Random start, random movement, 20000 particles, 10000+ collisions per time step. Y axis: milliseconds. X axis: time step. Collisions start at 15000, quickly approach 9000, then gradually increase to 14000.
In the random collision sample with more than half of the particles overlapping, tree-build time increases with the number of internal tree nodes. Every few hundred time steps, build time reaches 9 milliseconds due to a sub-optimal balance between particle count and tree depth. This leaves roughly 7.7 milliseconds to compute the all-pairs collisions.
To keep the all-pairs runtime short enough, the implementation limits the number of particles per leaf node to 32 and brute-forces each leaf node with an implicitly vectorizable code pattern (to be compatible with future processors such as Intel's Cascadelake; see the benchmark results below for AVX512 efficiency).
All-Pairs Collision Detection
Since particle data is not stored linearly per leaf node as in the first challenge, but references the root node's original particle buffer, temporary linear storage is needed to let the compiler vectorize the computation. First, the computation region (up to 32 particles per leaf node) is copied into stack arrays in structure-of-arrays (SOA) form (id values get their own buffer, and so does each coordinate). Second, the region is split into smaller chunks, 4 elements each. Finally, every 4-wide chunk is compared against all other 4-wide chunks (excluding duplicated work): since the loop includes each chunk against itself, 1 chunk-vs-chunk comparison covers 4 particles, 3 comparisons cover 8 particles, 6 comparisons cover 12 particles, and up to 36 chunk comparisons cover 32 particles.
Every chunk-vs-chunk comparison is made with simple ordered operations inside their own loops, to help the compiler find an easy vectorization path. This works like CUDA/OpenCL kernels running in lock-step on a warp/wavefront, the only difference being the 16-wide SIMD here.
The collision check first compares ids, to mask out duplicated work in each vectorized chunk (and reduce the work left after the vectorized computation). Then it ANDs the per-coordinate AABB overlap tests to produce each pair's result. Since all operations work on 16-element vectors, the computation maps well to any instruction set from MMX to AVX512.
The 4x4 chunk-comparison function:
inline
const int intersectDim(const float minx, const float maxx, const float minx2, const float maxx2) noexcept
{
return !((maxx < minx2) || (maxx2 < minx));
}
void comp4vs4( const int * const __restrict__ partId1, const int * const __restrict__ partId2,
const float * const __restrict__ minx1, const float * const __restrict__ minx2,
const float * const __restrict__ miny1, const float * const __restrict__ miny2,
const float * const __restrict__ minz1, const float * const __restrict__ minz2,
const float * const __restrict__ maxx1, const float * const __restrict__ maxx2,
const float * const __restrict__ maxy1, const float * const __restrict__ maxy2,
const float * const __restrict__ maxz1, const float * const __restrict__ maxz2,
int * const __restrict__ out
)
{
alignas(32)
int result[16]={
// 0v0 0v1 0v2 0v3
// 1v0 1v1 1v2 1v3
// 2v0 2v1 2v2 2v3
// 3v0 3v1 3v2 3v3
0, 0, 0, 0,
0, 0, 0, 0,
0, 0, 0, 0,
0, 0, 0, 0
};
alignas(32)
int tileId1[16]={
// 0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3
partId1[0],partId1[1],partId1[2],partId1[3],
partId1[0],partId1[1],partId1[2],partId1[3],
partId1[0],partId1[1],partId1[2],partId1[3],
partId1[0],partId1[1],partId1[2],partId1[3]
};
alignas(32)
int tileId2[16]={
// 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3
partId2[0],partId2[0],partId2[0],partId2[0],
partId2[1],partId2[1],partId2[1],partId2[1],
partId2[2],partId2[2],partId2[2],partId2[2],
partId2[3],partId2[3],partId2[3],partId2[3]
};
alignas(32)
float tileMinX1[16]={
// 0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3
minx1[0],minx1[1],minx1[2],minx1[3],
minx1[0],minx1[1],minx1[2],minx1[3],
minx1[0],minx1[1],minx1[2],minx1[3],
minx1[0],minx1[1],minx1[2],minx1[3]
};
alignas(32)
float tileMinX2[16]={
// 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3
minx2[0],minx2[0],minx2[0],minx2[0],
minx2[1],minx2[1],minx2[1],minx2[1],
minx2[2],minx2[2],minx2[2],minx2[2],
minx2[3],minx2[3],minx2[3],minx2[3]
};
alignas(32)
float tileMinY1[16]={
// 0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3
miny1[0],miny1[1],miny1[2],miny1[3],
miny1[0],miny1[1],miny1[2],miny1[3],
miny1[0],miny1[1],miny1[2],miny1[3],
miny1[0],miny1[1],miny1[2],miny1[3]
};
alignas(32)
float tileMinY2[16]={
// 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3
miny2[0],miny2[0],miny2[0],miny2[0],
miny2[1],miny2[1],miny2[1],miny2[1],
miny2[2],miny2[2],miny2[2],miny2[2],
miny2[3],miny2[3],miny2[3],miny2[3]
};
alignas(32)
float tileMinZ1[16]={
// 0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3
minz1[0],minz1[1],minz1[2],minz1[3],
minz1[0],minz1[1],minz1[2],minz1[3],
minz1[0],minz1[1],minz1[2],minz1[3],
minz1[0],minz1[1],minz1[2],minz1[3]
};
alignas(32)
float tileMinZ2[16]={
// 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3
minz2[0],minz2[0],minz2[0],minz2[0],
minz2[1],minz2[1],minz2[1],minz2[1],
minz2[2],minz2[2],minz2[2],minz2[2],
minz2[3],minz2[3],minz2[3],minz2[3]
};
alignas(32)
float tileMaxX1[16]={
// 0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3
maxx1[0],maxx1[1],maxx1[2],maxx1[3],
maxx1[0],maxx1[1],maxx1[2],maxx1[3],
maxx1[0],maxx1[1],maxx1[2],maxx1[3],
maxx1[0],maxx1[1],maxx1[2],maxx1[3]
};
alignas(32)
float tileMaxX2[16]={
// 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3
maxx2[0],maxx2[0],maxx2[0],maxx2[0],
maxx2[1],maxx2[1],maxx2[1],maxx2[1],
maxx2[2],maxx2[2],maxx2[2],maxx2[2],
maxx2[3],maxx2[3],maxx2[3],maxx2[3]
};
alignas(32)
float tileMaxY1[16]={
// 0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3
maxy1[0],maxy1[1],maxy1[2],maxy1[3],
maxy1[0],maxy1[1],maxy1[2],maxy1[3],
maxy1[0],maxy1[1],maxy1[2],maxy1[3],
maxy1[0],maxy1[1],maxy1[2],maxy1[3]
};
alignas(32)
float tileMaxY2[16]={
// 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3
maxy2[0],maxy2[0],maxy2[0],maxy2[0],
maxy2[1],maxy2[1],maxy2[1],maxy2[1],
maxy2[2],maxy2[2],maxy2[2],maxy2[2],
maxy2[3],maxy2[3],maxy2[3],maxy2[3]
};
alignas(32)
float tileMaxZ1[16]={
// 0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3
maxz1[0],maxz1[1],maxz1[2],maxz1[3],
maxz1[0],maxz1[1],maxz1[2],maxz1[3],
maxz1[0],maxz1[1],maxz1[2],maxz1[3],
maxz1[0],maxz1[1],maxz1[2],maxz1[3]
};
alignas(32)
float tileMaxZ2[16]={
// 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3
maxz2[0],maxz2[0],maxz2[0],maxz2[0],
maxz2[1],maxz2[1],maxz2[1],maxz2[1],
maxz2[2],maxz2[2],maxz2[2],maxz2[2],
maxz2[3],maxz2[3],maxz2[3],maxz2[3]
};
for(int i=0;i<16;i++)
result[i] = (tileId1[i] < tileId2[i]);
for(int i=0;i<16;i++)
result[i] = result[i] &&
intersectDim(tileMinX1[i], tileMaxX1[i], tileMinX2[i], tileMaxX2[i]) &&
intersectDim(tileMinY1[i], tileMaxY1[i], tileMinY2[i], tileMaxY2[i]) &&
intersectDim(tileMinZ1[i], tileMaxZ1[i], tileMinZ2[i], tileMaxZ2[i]);
for(int i=0;i<16;i++)
out[i]=result[i];
};
After initializing the input arrays of the two tiles, just two loops complete 16 different comparisons, possibly with only a few CPU instructions when AVX512 is used. According to godbolt.org, the initialization is vectorized as well. A sketch of a call site is shown below.
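A sketch of a call site, mirroring how findCollisionsAll below uses it (the SOA arrays partId/minx/.../maxz and the chunk offsets i and j are assumed to be prepared exactly as in that function):

alignas(32) int out[16];
comp4vs4(partId+i, partId+j,
         minx+i, minx+j, miny+i, miny+j, minz+i, minz+j,
         maxx+i, maxx+j, maxy+i, maxy+j, maxz+i, maxz+j,
         out);
for(int k=0;k<16;k++)
{
    if(out[k])
    {
        // element (k&3) of chunk i collides with element (k>>2) of chunk j
        const int first  = partId[i + (k&3)];
        const int second = partId[j + (k>>2)];
        // record the (first, second) pair ...
    }
}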
Since the number of particles per leaf node is limited to 32, this implementation is only good for scenes where fewer than 33 objects overlap. Within that limit, the duplicate removal needed after the SIMD computation is reduced by a fast, vectorized unique-value accumulator (10-20 cycles per insertion):
// keeps record of unique values inserted
// works for positive integers (-1 reserved for first comparisons)
template<typename SignedIntegralType, int n>
struct FastUnique
{
public:
FastUnique()
{
it=0;
for(int i=0;i<n;i++)
dict[i]=-1;
}
inline
void reset()
{
it=0;
for(int i=0;i<n;i++)
dict[i]=-1;
}
inline
void insert(const SignedIntegralType val)
{
const bool result = testImpl(val);
dict[it]=(result?val:dict[it]);
it+=(result?1:0);
}
inline
const SignedIntegralType get(const int index) const noexcept
{
return dict[index];
}
inline
const bool test(const SignedIntegralType val) noexcept
{
return testImpl(val);
}
inline
const void iterateSet(const SignedIntegralType val) noexcept
{
dict[it++]=val;
}
const int size()
{
return it;
}
SignedIntegralType * begin()
{
return dict;
}
SignedIntegralType * end()
{
return dict + it;
}
private:
SignedIntegralType dict[n];
SignedIntegralType c[n];
int it;
inline
bool testImpl(const int val) noexcept
{
for(int i=0;i<n;i++)
c[i]=(dict[i]==val);
SignedIntegralType s = 0;
for(int i=0;i<n;i++)
s+=c[i];
return s==0;
}
};
Despite the O(N) time complexity, the cache-aware and SIMD-aware code path (successfully auto-vectorized by the compiler into 4/8/16 duplicate checks per cycle) makes this much faster than std::unordered_set and std::unordered_map. A usage sketch follows.
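A minimal usage sketch of the FastUnique template defined above (32 matches the per-leaf particle cap mentioned earlier):

#include <iostream>

int main()
{
    FastUnique<int32_t, 32> unique;
    unique.insert(5);
    unique.insert(7);
    unique.insert(5);               // duplicate: ignored, size stays 2
    for(const int32_t v : unique)   // begin()/end() expose the stored values
        std::cout << v << " ";      // prints: 5 7
    std::cout << std::endl;
    return 0;
}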
With the vectorized 4x4 chunk collision testing and the fast unique-value container, the all-pairs adaptive-grid collision-detection algorithm can be implemented.
Pseudocode:
iterate leaf-node array
    gather all particle data into contiguous memory
    divide the work into 4-sized chunks
    for each 4-sized chunk
        for each 4-sized chunk
            compute collision for 16 different combinations at once
            gather results into a deduplication container (multiple cells have same part)
write deduplicated id values into memory pool
copy memory pool content to result vector
The actual code:
// visits all leaf nodes (their offsets already found) and computes nxn collision pairs
std::vector<std::pair<int,int>> findCollisionsAll()
{
    const int resetN = fields->mem.indexParticle.size();
    fields->mem.allPairsCollmapping.allocate(resetN);
    for(int i=0;i<resetN;i++)
    {
        fields->mem.allPairsCollmapping.getRef(i).reset();
    }
    fields->mem.allPairsCollmapping.reset();
    fields->mem.allPairsColl.reset();
    std::vector<std::pair<int,int>> result;
    // using the leaf-list from tree-building phase
    const int numLeaf = fields->mem.leafOffset.size();
    for(int leaf=0;leaf<numLeaf;leaf++)
    {
        // prepare linear memory for all particles
        {
            const int leafOfs = fields->mem.leafOffset.get(leaf);
            const int ptr = fields->mem.index.get(leafOfs);
            const int n = fields->mem.index.get(leafOfs+1);
            alignas(32) int index[testParticleLimit];
            alignas(32) int orderId[testParticleLimit];
            alignas(32) int partId[testParticleLimit];
            alignas(32) float minx[testParticleLimit];
            alignas(32) float miny[testParticleLimit];
            alignas(32) float minz[testParticleLimit];
            alignas(32) float maxx[testParticleLimit];
            alignas(32) float maxy[testParticleLimit];
            alignas(32) float maxz[testParticleLimit];
            constexpr int simd = 4;
            constexpr int simd1 = simd-1;
            const int n8 = n-(n&simd1);
            for(int i=0;i<n8;i+=simd)
            {
                for(int j=0;j<simd;j++)
                    index[i+j] = ptr + i + j;
                for(int j=0;j<simd;j++)
                    orderId[i+j] = fields->mem.orderParticle.get(index[i+j]);
                for(int j=0;j<simd;j++)
                    partId[i+j] = fields->mem.indexParticle.get(orderId[i+j]);
                for(int j=0;j<simd;j++)
                    minx[i+j] = fields->mem.minX.get(orderId[i+j]);
                for(int j=0;j<simd;j++)
                    miny[i+j] = fields->mem.minY.get(orderId[i+j]);
                for(int j=0;j<simd;j++)
                    minz[i+j] = fields->mem.minZ.get(orderId[i+j]);
                for(int j=0;j<simd;j++)
                    maxx[i+j] = fields->mem.maxX.get(orderId[i+j]);
                for(int j=0;j<simd;j++)
                    maxy[i+j] = fields->mem.maxY.get(orderId[i+j]);
                for(int j=0;j<simd;j++)
                    maxz[i+j] = fields->mem.maxZ.get(orderId[i+j]);
            }
            for(int i=n8;i<n;i++)
            {
                index[i] = ptr + i;
                orderId[i] = fields->mem.orderParticle.get(index[i]);
                partId[i] = fields->mem.indexParticle.get(orderId[i]);
                minx[i] = fields->mem.minX.get(orderId[i]);
                miny[i] = fields->mem.minY.get(orderId[i]);
                minz[i] = fields->mem.minZ.get(orderId[i]);
                maxx[i] = fields->mem.maxX.get(orderId[i]);
                maxy[i] = fields->mem.maxY.get(orderId[i]);
                maxz[i] = fields->mem.maxZ.get(orderId[i]);
            }
            for(int i=n;i<testParticleLimit;i++)
            {
                index[i] = -1;
                orderId[i] = -1;
                partId[i] = -1;
                minx[i] = 1000000000000000000.0f;
                miny[i] = 1000000000000000000.0f;
                minz[i] = 1000000000000000000.0f;
                maxx[i] = 1000000000000000000.0f;
                maxy[i] = 1000000000000000000.0f;
                maxz[i] = 1000000000000000000.0f;
            }
            // SIMD computation (tiled computing)
            {
                alignas(32) int out[16]={
                    0,0,0,0,
                    0,0,0,0,
                    0,0,0,0,
                    0,0,0,0
                };
                for(int i=0;i<testParticleLimit-simd;i+=simd)
                {
                    if(i>=n)
                        break;
                    FastUnique<int32_t, testParticleLimit> * map[simd] = {
                        partId[i]>=0?fields->mem.allPairsCollmapping.getPtr(partId[i]):nullptr,
                        partId[i+1]>=0?fields->mem.allPairsCollmapping.getPtr(partId[i+1]):nullptr,
                        partId[i+2]>=0?fields->mem.allPairsCollmapping.getPtr(partId[i+2]):nullptr,
                        partId[i+3]>=0?fields->mem.allPairsCollmapping.getPtr(partId[i+3]):nullptr
                    };
                    for(int j=i;j<testParticleLimit;j+=simd)
                    {
                        if(j>=n)
                            break;
                        // 0v0, 0v1, 0v2, 0v3,
                        // 1v0, 1v1, 1v2, 1v3,
                        // 2v0, 2v1, 2v2, 2v3,
                        // 3v0, 3v1, 3v2, 3v3,
                        comp4vs4(   partId+i, partId+j,
                                    minx+i, minx+j,
                                    miny+i, miny+j,
                                    minz+i, minz+j,
                                    maxx+i, maxx+j,
                                    maxy+i, maxy+j,
                                    maxz+i, maxz+j,
                                    out );
                        for(int k=0;k<16;k++)
                        {
                            const int k3 = k&3;
                            const int id2 = j+(k>>2);
                            if(out[k])
                            {
                                if(map[k3])
                                    map[k3]->insert(partId[id2]);
                            }
                        }
                    }
                }
            }
        }
    }
    // scan all unique values and insert into the memory pool
    for(int i=0;i<resetN;i++)
    {
        FastUnique<int32_t, testParticleLimit>& map = fields->mem.allPairsCollmapping.getRef(i);
        const int ms = map.size();
        const int allocIdx = fields->mem.allPairsColl.allocate(ms);
        for(int j=0;j<ms;j++)
        {
            fields->mem.allPairsColl.set(allocIdx+j,
                std::pair<int,int>(fields->mem.indexParticle.get(i),
                                   fields->mem.indexParticle.get(map.get(j))));
        }
    }
    // copy content to result vector
    result.resize(fields->mem.allPairsColl.size());
    fields->mem.allPairsColl.writeTo(result);
    return result;
}
[Chart] Random start, random movement, 20000 particles, 10000+ collisions per time step. Y axis: milliseconds. X axis: time step.
Over 1500 time steps, only two were slower than 60 FPS; the single-threaded sparse linear adaptive grid solution shows better scaling against the growing number of collisions (from 9000 to 14000) than the multithreaded simple version. The API interaction is similar:
// instantiate memory pool (not to be shared between multiple grids)
FastColDetLib::MemoryPool memPool;
// map cube volume of (-1, -1, -1) - (1, 1, 1) for collision computations
FastColDetLib::AdaptiveGridV2 grid(memPool,-1,-1,-1,1,1,1);
// physics simulation loop
while(true)
{
    // O(1) complexity
    grid.clear();
    // par is vector of objects that implement IParticle<float> interface
    grid.addParticles(n,par->data());
    // O(num_particle * num_average_duplicates) complexity
    grid.buildTree();
    // O(num_particle * num_average_duplicates * 32 * 32) complexity
    std::vector<std::pair<int,int>> collisions = grid.findCollisionsAll();
    // make collided particles bounce back
    physics_calc(collisions);
}
Although per-thread performance is much better than the first solution's, adding multithreading to the tree-building phase is harder, which makes it the limiting factor for scaling. Alternatively, each thread could own its own adaptive grid mapped to a different portion of the target volume; this can cause workload imbalance between cores when particles are not distributed homogeneously across all quadrants.
The SOA node representation (every field in its own array) could be modified to use SIMD during tree traversal, computing the collisions of a particle against all child nodes of the current node at once. This may or may not add performance to the non-all-pairs collision-detection method:
std::vector<int> findCollisions( const float minx, const float miny, const float minz, const float maxx, const float maxy, const float maxz );
Currently it is not optimized to use any SIMD or multithreading, and it is slower than the all-pairs computation for 20,000 particles (more than 2 microseconds per particle).
The all-pairs method does not traverse the tree; it simply iterates over the leaf array, so tree-traversal optimization is not needed for it yet.
Online Testing on Newer CPUs
The development computer's processor (FX8150, 2.1GHz) simulates a very old CPU with a native SSE-wide SIMD unit. To get a better picture of newer CPU generations, an online C++ compiler service was used with the following code:
#include"FastCollisionDetectionLib.h"
#include<atomic>
#include<iostream>
template<typename CoordType>
struct Vector3D
{
CoordType x,y,z;
Vector3D<CoordType> crossProduct(Vector3D<CoordType> vec)
{
Vector3D<CoordType> res;
res.x = y*vec.z - z*vec.y;
res.y = z*vec.x - x*vec.z;
res.z = x*vec.y - y*vec.x;
return res;
}
Vector3D<CoordType> operator - (Vector3D<CoordType> vec)
{
Vector3D<CoordType> result;
result.x = x-vec.x;
result.y = y-vec.y;
result.z = z-vec.z;
return result;
}
Vector3D<CoordType> operator + (Vector3D<CoordType> vec)
{
Vector3D<CoordType> result;
result.x = x+vec.x;
result.y = y+vec.y;
result.z = z+vec.z;
return result;
}
Vector3D<CoordType> operator * (CoordType v)
{
Vector3D<CoordType> result;
result.x = x*v;
result.y = y*v;
result.z = z*v;
return result;
}
// warning: sqrt is not necessary in production code like:
// if ( v1.abs()<v2.abs() ) { .. }
// the sqrt below is intentionally left to simulate "heavy" task in "challenge 1"
CoordType abs()
{
return std::sqrt(x*x+y*y+z*z);
}
};
template<typename CoordType>
struct PointCloud
{
Vector3D<CoordType> point[125];
PointCloud(CoordType x, CoordType y, CoordType z)
{
for(int i=0;i<125;i++)
{
point[i].x=x+i%5-2.5f;
point[i].y=y+(i/5)%5-2.5f;
point[i].z=z+i/25-2.5f;
}
}
};
template<typename CoordType>
bool pointCloudIntersection(PointCloud<CoordType>& cl1, PointCloud<CoordType>& cl2)
{
for(Vector3D<CoordType>& p:cl1.point)
{
for(Vector3D<CoordType>& p2:cl2.point)
{
if((p-p2).abs()<1.0f)
{
return true;
}
}
}
return false;
}
template<typename CoordType>
bool intersectDim(const CoordType minx, const CoordType maxx, const CoordType minx2, const CoordType maxx2)
{
return !((maxx < minx2) || (maxx2 < minx));
}
#include"Generator.h"
template<typename CoordType>
struct AABBofPointCloud: public FastColDetLib::IParticle<CoordType>
{
AABBofPointCloud(int idPrm, PointCloud<CoordType> * pCloudPrm)
{
id=idPrm;
pCloud = pCloudPrm;
xmin=pCloud->point[0].x;
ymin=pCloud->point[0].y;
zmin=pCloud->point[0].z;
xmax=pCloud->point[0].x;
ymax=pCloud->point[0].y;
zmax=pCloud->point[0].z;
for(int i=0;i<125;i++)
{
if(xmin>pCloud->point[i].x)
xmin=pCloud->point[i].x;
if(ymin>pCloud->point[i].y)
ymin=pCloud->point[i].y;
if(zmin>pCloud->point[i].z)
zmin=pCloud->point[i].z;
if(xmax<pCloud->point[i].x)
xmax=pCloud->point[i].x;
if(ymax<pCloud->point[i].y)
ymax=pCloud->point[i].y;
if(zmax<pCloud->point[i].z)
zmax=pCloud->point[i].z;
}
}
int id;
PointCloud<CoordType>* pCloud;
CoordType xmin;
CoordType ymin;
CoordType zmin;
CoordType xmax;
CoordType ymax;
CoordType zmax;
const CoordType getMaxX()const {return xmax;}
const CoordType getMaxY()const {return ymax;}
const CoordType getMaxZ()const {return zmax;}
const CoordType getMinX()const {return xmin;}
const CoordType getMinY()const {return ymin;}
const CoordType getMinZ()const {return zmin;}
const int getId()const {return id;}
};
int main()
{
using cotype = float;
PointCloud<cotype> ico1(0,0,0);
// heating the CPU for benchmarking
for(int i=0;i<10000;i++)
{
PointCloud<cotype> ico2(0,0.1f,i*0.1f);
pointCloudIntersection(ico1,ico2);
}
const int N = 20004;
std::vector<PointCloud<cotype>> objects;
oofrng::Generator<64> gen;
for(int i=0;i<N-3;i++)
{
objects.push_back(PointCloud<cotype>(gen.generate1Float()*900,gen.generate1Float()*900,gen.generate1Float()*900));
}
// the teapot in stadium problem
objects.push_back(PointCloud<cotype>(9000,9000,9000));
objects.push_back(PointCloud<cotype>(9001,9001,9001));
objects.push_back(PointCloud<cotype>(9002,9002,9002));
std::vector<AABBofPointCloud<cotype>> AABBs;
for(int i=0;i<N;i++)
{
AABBs.push_back(AABBofPointCloud<cotype>(i,&objects[i]));
}
FastColDetLib::MemoryPool memPool;
FastColDetLib::AdaptiveGridV2 grid2_0(memPool,0,0,0,10005,10005,10005);
// benchmark begin
for(int j=0;j<15;j++)
{
size_t nano;
std::map<int,std::map<int,bool>> collisionMatrix;
{
std::atomic<int> ctr;
ctr.store(0);
{
{
FastColDetLib::Bench bench(&nano);
{
size_t t1,t2,t3;
{
FastColDetLib::Bench b(&t1);
grid2_0.clear();
}
{
FastColDetLib::Bench b(&t2);
grid2_0.addParticles(N,AABBs.data());
}
{
FastColDetLib::Bench b(&t3);
grid2_0.buildTree();
}
std::cout<<t1<<" "<<t2<<" "<<t3<<std::endl;
auto vec = grid2_0.findCollisionsAll();
ctr += vec.size();
}
}
std::cout<<N<<" vs "<<N<<" AABB collision checking by adaptive grid= "<<nano<<" nanoseconds "<<std::endl;
std::cout<<"total = "<<ctr.load()<<std::endl;
}
}
}
return 0;
}
FX8150, 2.1GHz, single thread:
1287 1715939 17096316
20004 vs 20004 AABB collision checking by adaptive grid= 42595898 nanoseconds
total = 137
1808 494565 6036179
20004 vs 20004 AABB collision checking by adaptive grid= 11770599 nanoseconds
total = 137
1388 519662 5873310
20004 vs 20004 AABB collision checking by adaptive grid= 11552612 nanoseconds
total = 137
1423 552296 5892499
20004 vs 20004 AABB collision checking by adaptive grid= 11380779 nanoseconds
total = 137
1163 450733 6186093
20004 vs 20004 AABB collision checking by adaptive grid= 11733104 nanoseconds
total = 137
1251 491513 5952507
20004 vs 20004 AABB collision checking by adaptive grid= 11333729 nanoseconds
total = 137
1121 448833 5891781
20004 vs 20004 AABB collision checking by adaptive grid= 11225727 nanoseconds
total = 137
1291 476826 6144553
20004 vs 20004 AABB collision checking by adaptive grid= 11609930 nanoseconds
total = 137
1236 480449 5921570
20004 vs 20004 AABB collision checking by adaptive grid= 11278574 nanoseconds
total = 137
1167 450557 5862713
20004 vs 20004 AABB collision checking by adaptive grid= 11278160 nanoseconds
total = 137
1234 472445 6146572
20004 vs 20004 AABB collision checking by adaptive grid= 11617215 nanoseconds
total = 137
1304 479957 5932065
20004 vs 20004 AABB collision checking by adaptive grid= 11248743 nanoseconds
total = 137
1271 463188 5910676
20004 vs 20004 AABB collision checking by adaptive grid= 11265749 nanoseconds
total = 137
5271 508218 6104409
20004 vs 20004 AABB collision checking by adaptive grid= 11581773 nanoseconds
total = 137
1243 478395 5910560
20004 vs 20004 AABB collision checking by adaptive grid= 11396155 nanoseconds
total = 137
With only 137 collisions, the total runtime of clearing, particle insertion, tree building and the all-pairs computation averages under 11.5 milliseconds, half of which is spent building the tree.
AVX2 (Xeon E-2286G 4GHz), single thread
331 269690 4106238 20004 vs 20004 AABB collision checking by adaptive grid= 11095752 nanoseconds total = 137
610 133017 1431199 20004 vs 20004 AABB collision checking by adaptive grid= 3702480 nanoseconds total = 137
203 137703 1327516 20004 vs 20004 AABB collision checking by adaptive grid= 3615986 nanoseconds total = 137
231 126511 1332285 20004 vs 20004 AABB collision checking by adaptive grid= 3599439 nanoseconds total = 137
231 122653 1334227 20004 vs 20004 AABB collision checking by adaptive grid= 3601913 nanoseconds total = 137
316 112752 1300873 20004 vs 20004 AABB collision checking by adaptive grid= 3559040 nanoseconds total = 137
256 125327 1298195 20004 vs 20004 AABB collision checking by adaptive grid= 3544539 nanoseconds total = 137
265 121138 1241168 20004 vs 20004 AABB collision checking by adaptive grid= 3505718 nanoseconds total = 137
205 125888 1292359 20004 vs 20004 AABB collision checking by adaptive grid= 3505236 nanoseconds total = 137
277 129064 1310662 20004 vs 20004 AABB collision checking by adaptive grid= 3548348 nanoseconds total = 137
333 124409 1309687 20004 vs 20004 AABB collision checking by adaptive grid= 3546196 nanoseconds total = 137
249 125756 1313174 20004 vs 20004 AABB collision checking by adaptive grid= 3544577 nanoseconds total = 137
227 123104 1335686 20004 vs 20004 AABB collision checking by adaptive grid= 3565008 nanoseconds total = 137
330 97679 1296398 20004 vs 20004 AABB collision checking by adaptive grid= 3490028 nanoseconds total = 137
282 97633 1294944 20004 vs 20004 AABB collision checking by adaptive grid= 3518168 nanoseconds total = 137
All tasks complete in 3.5 milliseconds, a 3.3x speedup; nearly all of it comes from the newer processor's higher instructions-per-cycle throughput and higher clock frequency. Only 137 collisions are not enough to exploit the SIMD capability.
AVX512 (Xeon Gold 5115 and Xeon Gold 5215), single thread (-Wall -std=c++14 -O3 -march=native -mavx512f -ftree-vectorize -ffast-math -o a.out source_file.cpp)
712 923026 13887800 20004 vs 20004 AABB collision checking by adaptive grid= 39407342 nanoseconds total = 137
1818 404930 5271792 20004 vs 20004 AABB collision checking by adaptive grid= 20791697 nanoseconds total = 137
1976 374333 6295718 20004 vs 20004 AABB collision checking by adaptive grid= 12459388 nanoseconds total = 137
1820 433518 5018476 20004 vs 20004 AABB collision checking by adaptive grid= 13907910 nanoseconds total = 137
1836 394659 5168525 20004 vs 20004 AABB collision checking by adaptive grid= 11257770 nanoseconds total = 137
1360 721575 5573255 20004 vs 20004 AABB collision checking by adaptive grid= 12274256 nanoseconds total = 137
1283 872188 5019439 20004 vs 20004 AABB collision checking by adaptive grid= 11718171 nanoseconds total = 137
1065 370593 3192312 20004 vs 20004 AABB collision checking by adaptive grid= 7165426 nanoseconds total = 137
2243 222700 3017543 20004 vs 20004 AABB collision checking by adaptive grid= 7580070 nanoseconds total = 137
12110 252045 3169418 20004 vs 20004 AABB collision checking by adaptive grid= 7609170 nanoseconds total = 137
1902 483356 3613861 20004 vs 20004 AABB collision checking by adaptive grid= 8001453 nanoseconds total = 137
1161 270468 2927408 20004 vs 20004 AABB collision checking by adaptive grid= 6765911 nanoseconds total = 137
1534 721745 3550246 20004 vs 20004 AABB collision checking by adaptive grid= 9669111 nanoseconds total = 137
1465 374488 4659102 20004 vs 20004 AABB collision checking by adaptive grid= 10340249 nanoseconds total = 137
1061 226212 3145650 20004 vs 20004 AABB collision checking by adaptive grid= 7007133 nanoseconds total = 137
The last two tests show that CPU frequency and IPC matter more when there are few collisions.
To give the same test dense collisions, a few lines of code are changed from:
objects.push_back(
PointCloud<cotype>(
gen.generate1Float()*900,gen.generate1Float()*900,gen.generate1Float()*900)
);
to:
objects.push_back(
PointCloud<cotype>(
gen.generate1Float()*150,gen.generate1Float()*150,gen.generate1Float()*150)
);
FX8150, 2.1GHz, single thread
809 1137072 16250322 20004 vs 20004 AABB collision checking by adaptive grid= 45634949 nanoseconds total = 29326
11628 774879 6375167 20004 vs 20004 AABB collision checking by adaptive grid= 17535811 nanoseconds total = 29326
1323 489172 6242101 20004 vs 20004 AABB collision checking by adaptive grid= 16964760 nanoseconds total = 29326
1439 503646 6378985 20004 vs 20004 AABB collision checking by adaptive grid= 17205537 nanoseconds total = 29326
1430 507256 6200894 20004 vs 20004 AABB collision checking by adaptive grid= 16871454 nanoseconds total = 29326
1292 480034 6326381 20004 vs 20004 AABB collision checking by adaptive grid= 17126909 nanoseconds total = 29326
1471 494320 6234695 20004 vs 20004 AABB collision checking by adaptive grid= 16951342 nanoseconds total = 29326
1438 507194 6376321 20004 vs 20004 AABB collision checking by adaptive grid= 17215743 nanoseconds total = 29326
1439 510082 6213300 20004 vs 20004 AABB collision checking by adaptive grid= 16906657 nanoseconds total = 29326
1294 588172 6218907 20004 vs 20004 AABB collision checking by adaptive grid= 17530016 nanoseconds total = 29326
1512 479942 6271097 20004 vs 20004 AABB collision checking by adaptive grid= 17305008 nanoseconds total = 29326
1364 491258 6211765 20004 vs 20004 AABB collision checking by adaptive grid= 17117965 nanoseconds total = 29326
1400 484784 6234349 20004 vs 20004 AABB collision checking by adaptive grid= 18445534 nanoseconds total = 29326
2177 915668 7516056 20004 vs 20004 AABB collision checking by adaptive grid= 18903198 nanoseconds total = 29326
1610 658380 6986867 20004 vs 20004 AABB collision checking by adaptive grid= 20395348 nanoseconds total = 29326
With 29326 collisions (1.5 particles touched per particle on average), it runs 10% slower than 60 FPS, and tree building takes only about 1/3 of the runtime.
AVX2 (Xeon E-2286G 4GHz), single thread (https://wandbox.org/permlink/tdGggzzAkHN3VkUz)
702 270682 4482552 20004 vs 20004 AABB collision checking by adaptive grid= 18340629 nanoseconds total = 29326
15704 134363 1598335 20004 vs 20004 AABB collision checking by adaptive grid= 10920371 nanoseconds total = 29326
377 140370 1546184 20004 vs 20004 AABB collision checking by adaptive grid= 10877674 nanoseconds total = 29326
271 118858 1518309 20004 vs 20004 AABB collision checking by adaptive grid= 10822341 nanoseconds total = 29326
582 128086 1539374 20004 vs 20004 AABB collision checking by adaptive grid= 10880007 nanoseconds total = 29326
293 128848 1534561 20004 vs 20004 AABB collision checking by adaptive grid= 10841986 nanoseconds total = 29326
606 129491 1546189 20004 vs 20004 AABB collision checking by adaptive grid= 10877790 nanoseconds total = 29326
215 127919 1545995 20004 vs 20004 AABB collision checking by adaptive grid= 10781822 nanoseconds total = 29326
214 126475 1528446 20004 vs 20004 AABB collision checking by adaptive grid= 10830309 nanoseconds total = 29326
296 115932 1547976 20004 vs 20004 AABB collision checking by adaptive grid= 10863433 nanoseconds total = 29326
273 127565 1529545 20004 vs 20004 AABB collision checking by adaptive grid= 10829295 nanoseconds total = 29326
247 127836 1554261 20004 vs 20004 AABB collision checking by adaptive grid= 10938447 nanoseconds total = 29326
241 100625 1541077 20004 vs 20004 AABB collision checking by adaptive grid= 10936425 nanoseconds total = 29326
251 100587 1545350 20004 vs 20004 AABB collision checking by adaptive grid= 10903145 nanoseconds total = 29326
526 101546 1542023 20004 vs 20004 AABB collision checking by adaptive grid= 10916791 nanoseconds total = 29326
The AVX2 CPU, at twice the frequency, reaches 91 FPS: 1.5 milliseconds of tree building and 9.3 milliseconds of all-pairs computation.
AVX512 (Xeon Gold 5115 and Xeon Gold 5215), single thread (https://rextester.com/YKDN52107)
463 638674 8377763 20004 vs 20004 AABB collision checking by adaptive grid= 28530573 nanoseconds total = 29326
1995 222204 2649192 20004 vs 20004 AABB collision checking by adaptive grid= 12960779 nanoseconds total = 29326
1538 222590 2658591 20004 vs 20004 AABB collision checking by adaptive grid= 12791377 nanoseconds total = 29326
1378 218950 2662630 20004 vs 20004 AABB collision checking by adaptive grid= 13005473 nanoseconds total = 29326
1256 217816 2639028 20004 vs 20004 AABB collision checking by adaptive grid= 12879503 nanoseconds total = 29326
1272 232140 2625833 20004 vs 20004 AABB collision checking by adaptive grid= 12799225 nanoseconds total = 29326
956 225068 2634176 20004 vs 20004 AABB collision checking by adaptive grid= 12939098 nanoseconds total = 29326
1292 217455 2665934 20004 vs 20004 AABB collision checking by adaptive grid= 12769266 nanoseconds total = 29326
1386 220565 2686391 20004 vs 20004 AABB collision checking by adaptive grid= 12779806 nanoseconds total = 29326
1735 219911 2656253 20004 vs 20004 AABB collision checking by adaptive grid= 12820786 nanoseconds total = 29326
1153 216455 2638674 20004 vs 20004 AABB collision checking by adaptive grid= 12742196 nanoseconds total = 29326
1123 236914 2633897 20004 vs 20004 AABB collision checking by adaptive grid= 12872167 nanoseconds total = 29326
1971 219628 2642318 20004 vs 20004 AABB collision checking by adaptive grid= 12864358 nanoseconds total = 29326
1181 215090 2676025 20004 vs 20004 AABB collision checking by adaptive grid= 12990054 nanoseconds total = 29326
1291 215507 2653345 20004 vs 20004 AABB collision checking by adaptive grid= 12841870 nanoseconds total = 29326
The Xeon Gold 5215 at 2.9GHz turbo reaches 78 FPS (27.5% lower CPU frequency than the AVX2 test), 15% slower than the 4GHz AVX2 test, with 2.6 milliseconds of tree building and 10.2 milliseconds of all-pairs computation.
The tests indicate that serious bottlenecks remain outside the vectorized all-pairs computation. One of the biggest slowdowns is RAM latency: with too many duplicates, the cache miss ratio stays high. Even though duplicate pairs are removed from the results, accessing the duplicated data still evicts too much useful data from the caches to RAM, causing poor scaling between different CPUs.
The implementation does not (currently) benefit from SIMD in tree building, which makes tree building in the AVX512 test take 1 millisecond longer than in the 4GHz AVX2 test. The all-pairs collision computation, on the other hand, shows nearly 25% better per-cycle performance in the AVX512 test.
At 12000 and 6400 collisions, the Xeon Gold (AVX512) reaches 97 FPS and 100 FPS respectively, indicating a performance ceiling (around 100 FPS) caused by bottlenecks in tree building and particle insertion (and possibly the all-pairs computation).
All AVX2 and AVX512 tests were run on online C++ compiler servers that are shared by multiple clients, so actual performance may vary with server load.
All-pairs collision detection of 40k AABBs at 60 FPS
To increase performance without significantly changing the original adaptive-grid implementation, an external header was written that uses a simple octree-like structure to find multiple particle clusters and bind them to different dedicated computation threads.
Pseudocode of the new algorithm:
put particles into root node
recursive_compute(this):
    if number of particles in this node > 10000
        create 64 children nodes (4x4x4 grid)
        put particles into all possibly-overlapping child nodes
        for all child nodes
            recursive_compute(node)
    else
        create new task for a dedicated thread:
            create adaptive grid for particles of this node
            compute all-pairs collision
    if this node is root
        wait for all dedicated threads to finish tasks
        blend all results from all adaptive grids (deduplication/insertion)
        return unified results

collisions = recursive_compute(root)
The tree structure in the source code below is half-linearized and half-traditional, to keep complexity lower than the adaptive grid implementation while keeping performance higher than a traditional tree.
Since the load-balancing tree (AdaptiveGridTree<numThreads>) is built on a single thread, load balancing is effective only when the particle tree is not deep. When too many particles form a "teapot in stadium" problem, too much tree traversal and construction happens on a single thread, making multithreaded performance scaling inefficient.
Currently, the implementation uses 10000 particles per node as the branching threshold, to avoid losing too much performance to the limited cache size of a CPU core.
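As a rough estimate (assuming the seven per-particle fields used below — an integer id plus six float bounds — are 4 bytes each), a 10000-particle leaf touches about 10000 × 7 × 4 B ≈ 280 KB of particle data, which largely fits within the L2/L3 cache available to a single core on typical CPUs.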
#ifndef ADAPTIVEGRIDTREE_H_
#define ADAPTIVEGRIDTREE_H_
#include<atomic>
#include<thread>
#include<mutex>
#include<functional>
#include"FastCollisionDetectionLib.h"
// simple dense octree
constexpr int numNodesPerNode = 64;
struct GridTreeNode
{
GridTreeNode()
{
for(int i=0;i<numNodesPerNode;i++)
childNodes[i]=nullptr;
}
std::shared_ptr<GridTreeNode> childNodes[numNodesPerNode];
std::vector<int> parOrder;
};
template<int concurrency>
class AdaptiveGridTree
{
public:
AdaptiveGridTree(const float minx, const float miny, const float minz, const float maxx, const float maxy, const float maxz):
minX(minx),minY(miny),minZ(minz),maxX(maxx),maxY(maxy),maxZ(maxz), workCtr(0)
{
for(int i=0;i<concurrency;i++)
{
thr.emplace_back([&](){
FastColDetLib::MemoryPool mem;
bool work=true;
while(work)
{
auto f = taskQueue.pop();
if(f)
{
f(mem);
taskCompleteQueue.push2(true);
}
else
break;
}
taskCompleteQueue.push(true);
});
}
}
~AdaptiveGridTree()
{
for(int i=0;i<concurrency*2;i++)
taskQueue.push(nullptr);
for(int i=0;i<concurrency;i++)
{
thr[i].join();
}
}
void clear()
{
std::lock_guard<std::mutex> lg(mut);
parId.reset();
parMinX.reset();
parMinY.reset();
parMinZ.reset();
parMaxX.reset();
parMaxY.reset();
parMaxZ.reset();
}
template<typename Derived>
void addParticles(const int N, Derived * ptr)
{
std::lock_guard<std::mutex> lg(mut);
const int ofsId = parId.allocate(N);
const int ofsMinx = parMinX.allocate(N);
const int ofsMiny = parMinY.allocate(N);
const int ofsMinz = parMinZ.allocate(N);
const int ofsMaxx = parMaxX.allocate(N);
const int ofsMaxy = parMaxY.allocate(N);
const int ofsMaxz = parMaxZ.allocate(N);
for(int i=0;i<N;i++)
{
parId.set(ofsId+i,static_cast<FastColDetLib::IParticle<float>*>(ptr+i)->getId());
parMinX.set(ofsMinx+i,static_cast<FastColDetLib::IParticle<float>*>(ptr+i)->getMinX());
parMinY.set(ofsMiny+i,static_cast<FastColDetLib::IParticle<float>*>(ptr+i)->getMinY());
parMinZ.set(ofsMinz+i,static_cast<FastColDetLib::IParticle<float>*>(ptr+i)->getMinZ());
parMaxX.set(ofsMaxx+i,static_cast<FastColDetLib::IParticle<float>*>(ptr+i)->getMaxX());
parMaxY.set(ofsMaxy+i,static_cast<FastColDetLib::IParticle<float>*>(ptr+i)->getMaxY());
parMaxZ.set(ofsMaxz+i,static_cast<FastColDetLib::IParticle<float>*>(ptr+i)->getMaxZ());
}
}
std::vector<std::pair<int,int>> computeAllPairs(GridTreeNode * parent = nullptr, float minx=0.0f, float miny=0.0f, float minz=0.0f, float maxx=0.0f, float maxy=0.0f, float maxz=0.0f,
std::vector<std::pair<int,int>> * result=nullptr, FastColDetLib::Memory<FastColDetLib::FastUnique<int32_t,FastColDetLib::testUniqueLimit>> * mapping = nullptr)
{
GridTreeNode root;
std::vector<std::pair<int,int>> resultRoot;
if(!parent)
{
workCtr=0;
const int N = parId.size();
mappingRoot.reset();
mappingRoot.allocate(N);
minx = minX;
miny = minY;
minz = minZ;
maxx = maxX;
maxy = maxY;
maxz = maxZ;
parent = &root;
for(int i=0;i<N;i++)
{
parent->parOrder.push_back(i);
mappingRoot.getRef(i).reset();
}
result = &resultRoot;
mapping = &mappingRoot;
}
// build tree until 10k or less particles left for each leaf
if(parent->parOrder.size()>10000)
{
const int N = parent->parOrder.size();
for(int i=0;i<numNodesPerNode;i++)
{
parent->childNodes[i]=std::make_shared<GridTreeNode>();
}
const float stepx = 4.0f/(maxx - minx); // cells per unit of length on x (4 cells per axis)
const float stepy = 4.0f/(maxy - miny); // cells per unit of length on y
const float stepz = 4.0f/(maxz - minz); // cells per unit of length on z
for(int i=0;i<numNodesPerNode;i++)
{
accumulator[i].reset();
}
for(int i=0;i<N;i++)
{
// calculate overlapping region's cell indices
const int parOrdI = parent->parOrder[i];
const int mincornerstartx = std::floor((parMinX.get(parOrdI) - minx) * stepx);
const int maxcornerendx = std::floor((parMaxX.get(parOrdI) - minx) * stepx);
const int mincornerstarty = std::floor((parMinY.get(parOrdI) - miny) * stepy);
const int maxcornerendy = std::floor((parMaxY.get(parOrdI) - miny) * stepy);
const int mincornerstartz = std::floor((parMinZ.get(parOrdI) - minz) * stepz);
const int maxcornerendz = std::floor((parMaxZ.get(parOrdI) - minz) * stepz);
for(int ii=mincornerstartz;ii<=maxcornerendz;ii++)
for(int j=mincornerstarty;j<=maxcornerendy;j++)
for(int k=mincornerstartx;k<=maxcornerendx;k++)
{
if(ii<0 || ii>=4 || j<0 || j>=4 || k<0 || k>=4)
continue;
auto & acc = accumulator[k+j*4+ii*16];
acc.set(acc.allocate(1), parOrdI);
}
}
{
for(int i=0;i<numNodesPerNode;i++)
{
const int n = accumulator[i].size();
for(int j=0;j<n;j++)
{
parent->childNodes[i]->parOrder.push_back(accumulator[i].get(j));
}
}
}
for(int i=0;i<numNodesPerNode;i++)
{
if(parent->childNodes[i]->parOrder.size()>1)
{
computeAllPairs(parent->childNodes[i].get(),
minx+(i&3)/stepx, miny+((i/4)&3)/stepy, minz+(i/16)/stepz,
minx+(i&3)/stepx + 1.0f/stepx, miny+((i/4)&3)/stepy + 1.0f/stepy, minz+(i/16)/stepz + 1.0f/stepz, result,mapping);
}
}
}
else
{
if(parent->parOrder.size()>1)
{
// offload to another thread as a sparse-linear-adaptive-grid
workCtr++;
taskQueue.push2([&,parent,minx,miny,minz,maxx,maxy,maxz,result](FastColDetLib::MemoryPool & mem) // take the worker thread's memory pool by reference, matching the task queue's function signature
{
FastColDetLib::AdaptiveGridV2 subGrid(mem,minx,miny,minz,maxx,maxy,maxz);
subGrid.clear();
{
subGrid.addParticlesWithoutInterface(parent->parOrder.size(), 0, parent->parOrder,
parId,
parMinX, parMinY, parMinZ,
parMaxX, parMaxY, parMaxZ
);
subGrid.buildTree();
}
const std::vector<std::pair<int,int>> coll = subGrid.findCollisionsAll();
if(coll.size()>0)
{
std::lock_guard<std::mutex> lg(mut);
result->insert(result->end(),coll.begin(),coll.end());
}
});
}
}
if(parent == &root) // root call: all leaf tasks queued; wait for workers, then deduplicate and flatten results
{
constexpr int mutN = 1024;
constexpr int mutN1 = mutN-1;
FastColDetLib::MutexWithoutFalseSharing mutArr[mutN]; // striped mutexes: a pair's id selects a stripe, reducing contention during deduplication
int endQueue = 0;
while(endQueue<workCtr)
{
if(taskCompleteQueue.pop())
{
endQueue++;
}
}
workCtr = 0;
std::lock_guard<std::mutex> lg(mut);
resultRootMem.reset();
{
{
const int nr = resultRoot.size();
for(int i=0;i<nr;i+=1000)
{
workCtr++;
taskQueue.push2([&,i](FastColDetLib::MemoryPool & mem)
{
unsigned int lastId = 2000000000; // so you don't have 2 billion particles in a game right?
mutArr[lastId&mutN1].mut.lock();
for(int j=0;j<1000;j++)
{
if(i+j>=nr)
break;
const unsigned int rfirst = resultRoot[i+j].first;
const int rsecond = resultRoot[i+j].second;
if(lastId != rfirst)
{
//std::lock_guard<std::mutex> lg(mutArr[rfirst&255].mut);
mutArr[lastId&mutN1].mut.unlock();
mutArr[rfirst&mutN1].mut.lock();
mapping->getRef(rfirst).insert(rsecond);
lastId = rfirst;
}
else
{
//std::lock_guard<std::mutex> lg(mutArr[rfirst&255].mut);
mapping->getRef(rfirst).insert(rsecond);
}
}
mutArr[lastId&mutN1].mut.unlock();
});
}
endQueue = 0;
while(endQueue<workCtr)
{
if(taskCompleteQueue.pop())
{
endQueue++;
}
}
workCtr = 0;
}
resultRoot.clear();
const int N = mapping->size();
int allocSize = 0;
std::vector<std::pair<int,int>> allocOfs;
for(int i=0;i<N;i++)
{
//std::lock_guard<std::mutex> lg(mutArr[i&255].mut);
const int isz = mapping->getRef(i).size();
if(isz > 0)
{
allocOfs.push_back(std::pair<int,int>(i,allocSize));
allocSize += isz;
}
}
resultRootMem.allocate(allocSize);
const int N2 = allocOfs.size();
for(int i0=0;i0<N2;i0+=10000)
{
workCtr++;
taskQueue.push2([&,i0](FastColDetLib::MemoryPool & mem)
{
for(int j=0;j<10000;j++)
{
const int i = i0+j;
if(i>=N2)
break;
auto & ref = mapping->getRef(allocOfs[i].first);
const int n = ref.size();
const int ofs0 = allocOfs[i].second;
for(int j=0;j<n;j++)
{
resultRootMem.set(ofs0+j,std::pair<int,int>(allocOfs[i].first,ref.get(j)));
}
}
});
}
endQueue = 0;
while(endQueue<workCtr)
{
if(taskCompleteQueue.pop())
{
endQueue++;
}
}
workCtr = 0;
resultRoot.resize(resultRootMem.size());
resultRootMem.writeTo(resultRoot);
}
}
return resultRoot;
}
private:
std::mutex mut;
std::vector<std::thread> thr;
FastColDetLib::SyncQueue<std::function<void(FastColDetLib::MemoryPool &)>> taskQueue;
FastColDetLib::SyncQueue<bool> taskCompleteQueue;
FastColDetLib::Memory<std::pair<int,int>> resultRootMem;
FastColDetLib::Memory<FastColDetLib::FastUnique<int32_t,FastColDetLib::testUniqueLimit>> mappingRoot;
const float minX,minY,minZ,maxX,maxY,maxZ;
int workCtr;
FastColDetLib::Memory<int> parId;
FastColDetLib::Memory<float> parMinX;
FastColDetLib::Memory<float> parMinY;
FastColDetLib::Memory<float> parMinZ;
FastColDetLib::Memory<float> parMaxX;
FastColDetLib::Memory<float> parMaxY;
FastColDetLib::Memory<float> parMaxZ;
FastColDetLib::Memory<int> accumulator[numNodesPerNode];
};
#endif /* ADAPTIVEGRIDTREE_H_ */
On the FX8150 2.1GHz + 4GB DDR3 1333MHz RAM system, particle insertion + tree building and the all-pairs computation were benchmarked with 40000 randomly moving, randomly placed particles producing 1100 total collisions.
Randomly distributed particles do not push the load-balancing tree deeper than 1: the root node has 64 child nodes, each owning fewer than 10000 particles (40000/64 on average), whose tasks are pushed into a common task queue consumed concurrently by 7 threads (leaving 1 thread for the benchmark's control logic).
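As a side note, the 64 child nodes are treated as a flattened 4x4x4 grid; the (i&3), ((i/4)&3) and (i/16) expressions in computeAllPairs decompose a child index into its x, y and z cell coordinates. A minimal standalone sketch of this mapping (illustration only):
#include <iostream>
int main()
{
    // decompose the flattened child index i (0..63) into 4x4x4 cell coordinates,
    // exactly as computeAllPairs does with (i&3), ((i/4)&3) and (i/16)
    for (int i = 0; i < 64; i++)
    {
        const int cx = i & 3;       // x cell: lowest 2 bits
        const int cy = (i / 4) & 3; // y cell: middle 2 bits
        const int cz = i / 16;      // z cell: highest 2 bits
        std::cout << i << " -> (" << cx << "," << cy << "," << cz << ")\n";
    }
    return 0;
}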
The shallow tree is built within 2 milliseconds and the computation finishes within 18 milliseconds, 14.5 milliseconds on average. This is 60 FPS on average, but there are performance spikes due to the chaotic work distribution caused by the chaotic movement of the particles.
When the number of collisions increases from 1100 to 29000, performance drops below 50 FPS.
This is mainly caused by the deduplication part of the load-balancing tree, where the differing results of all adaptive grids are merged without duplicated collision pairs. 29000 collisions among 40000 particles means roughly 3/4 of the particles have a single collision pair; a rendered image is shown below.
(Red dots are colliding particles, white dots are non-colliding particles.)
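For reference, the effect of this deduplication step can be reproduced with a much simpler but fully serial approach; a minimal sketch (not the library's method), assuming the collision pairs are gathered as std::pair<int,int> values like findCollisionsAll returns:
#include <algorithm>
#include <iostream>
#include <vector>
// serial alternative to the mutex-striped merge: sort all pairs gathered
// from the sub-grids, then drop the duplicates in one pass
std::vector<std::pair<int,int>> deduplicate(std::vector<std::pair<int,int>> pairs)
{
    std::sort(pairs.begin(), pairs.end());
    pairs.erase(std::unique(pairs.begin(), pairs.end()), pairs.end());
    return pairs;
}
int main()
{
    // pairs (1,2) and (3,4) reported twice by two overlapping sub-grids
    std::vector<std::pair<int,int>> pairs{{1,2},{3,4},{1,2},{3,4},{5,6}};
    std::cout << deduplicate(pairs).size() << " unique pairs\n"; // prints "3 unique pairs"
    return 0;
}
The sort-based version is O(n log n) and single-threaded, which is why the implementation above uses striped mutexes and per-id insertion instead.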
The same performance holds up to 49000 collisions. At 118000 collisions, the FPS drops to 35.
In the "teapot in stadium" problem here, the tree depth stays shallow (the particles fit in a region that is wider in one dimension, compared to the latest single-threaded benchmark).
#include"FastCollisionDetectionLib.h"
#include<atomic>
#include<iostream>
template<typename CoordType>
struct Vector3D
{
CoordType x,y,z;
Vector3D<CoordType> crossProduct(Vector3D<CoordType> vec)
{
Vector3D<CoordType> res;
res.x = y*vec.z - z*vec.y;
res.y = z*vec.x - x*vec.z;
res.z = x*vec.y - y*vec.x;
return res;
}
Vector3D<CoordType> operator - (Vector3D<CoordType> vec)
{
Vector3D<CoordType> result;
result.x = x-vec.x;
result.y = y-vec.y;
result.z = z-vec.z;
return result;
}
Vector3D<CoordType> operator + (Vector3D<CoordType> vec)
{
Vector3D<CoordType> result;
result.x = x+vec.x;
result.y = y+vec.y;
result.z = z+vec.z;
return result;
}
Vector3D<CoordType> operator * (CoordType v)
{
Vector3D<CoordType> result;
result.x = x*v;
result.y = y*v;
result.z = z*v;
return result;
}
// trying to simulate an expensive collision-test here
// otherwise it would not require sqrt at all
// if( v1.abs() < v2.abs() ){ } is not efficient. Just simulating heavy work.
CoordType abs()
{
return std::sqrt(x*x+y*y+z*z);
}
};
template<typename CoordType>
struct PointCloud
{
Vector3D<CoordType> point[125];
PointCloud(CoordType x, CoordType y, CoordType z)
{
for(int i=0;i<125;i++)
{
point[i].x=x+i%5-2.5f;
point[i].y=y+(i/5)%5-2.5f;
point[i].z=z+i/25-2.5f;
}
}
};
template<typename CoordType>
bool pointCloudIntersection(PointCloud<CoordType>& cl1, PointCloud<CoordType>& cl2)
{
for(Vector3D<CoordType>& p:cl1.point)
{
for(Vector3D<CoordType>& p2:cl2.point)
{
if((p-p2).abs()<1.0f)
{
return true;
}
}
}
return false;
}
template<typename CoordType>
bool intersectDim(const CoordType minx, const CoordType maxx, const CoordType minx2, const CoordType maxx2)
{
return !((maxx < minx2) || (maxx2 < minx));
}
#include"Generator.h"
template<typename CoordType>
struct AABBofPointCloud: public FastColDetLib::IParticle<CoordType>
{
AABBofPointCloud(int idPrm, PointCloud<CoordType> * pCloudPrm)
{
id=idPrm;
pCloud = pCloudPrm;
xmin=pCloud->point[0].x;
ymin=pCloud->point[0].y;
zmin=pCloud->point[0].z;
xmax=pCloud->point[0].x;
ymax=pCloud->point[0].y;
zmax=pCloud->point[0].z;
for(int i=0;i<125;i++)
{
if(xmin>pCloud->point[i].x)
xmin=pCloud->point[i].x;
if(ymin>pCloud->point[i].y)
ymin=pCloud->point[i].y;
if(zmin>pCloud->point[i].z)
zmin=pCloud->point[i].z;
if(xmax<pCloud->point[i].x)
xmax=pCloud->point[i].x;
if(ymax<pCloud->point[i].y)
ymax=pCloud->point[i].y;
if(zmax<pCloud->point[i].z)
zmax=pCloud->point[i].z;
}
}
int id;
PointCloud<CoordType>* pCloud;
CoordType xmin;
CoordType ymin;
CoordType zmin;
CoordType xmax;
CoordType ymax;
CoordType zmax;
const CoordType getMaxX()const {return xmax;}
const CoordType getMaxY()const {return ymax;}
const CoordType getMaxZ()const {return zmax;}
const CoordType getMinX()const {return xmin;}
const CoordType getMinY()const {return ymin;}
const CoordType getMinZ()const {return zmin;}
const int getId()const {return id;}
};
#include "AdaptiveGridTree.h"
int main()
{
using cotype = float;
PointCloud<cotype> ico1(0,0,0);
// heating the CPU for benchmarking
for(int i=0;i<10000;i++)
{
PointCloud<cotype> ico2(0,0.1f,i*0.1f);
pointCloudIntersection(ico1,ico2);
}
const int N = 40004;
std::vector<PointCloud<cotype>> objects;
oofrng::Generator<64> gen;
for(int i=0;i<N-3;i++)
{
// this time X range is 0-1500 instead of 0-150
objects.push_back(
PointCloud<cotype>(
gen.generate1Float()*1500,gen.generate1Float()*150,gen.generate1Float()*150)
);
}
// the teapot in stadium problem
objects.push_back(PointCloud<cotype>(9000,9000,9000));
objects.push_back(PointCloud<cotype>(9001,9001,9001));
objects.push_back(PointCloud<cotype>(9002,9002,9002));
std::vector<AABBofPointCloud<cotype>> AABBs;
for(int i=0;i<N;i++)
{
AABBs.push_back(AABBofPointCloud<cotype>(i,&objects[i]));
}
AdaptiveGridTree<7> test(0,0,0,10005,10005,10005);
for(int i=0;i<100;i++)
{
size_t nano;
{
FastColDetLib::Bench bench(&nano);
test.clear();
test.addParticles(N,AABBs.data());
const auto coll = test.computeAllPairs();
std::cout<<"c="<<coll.size()<<std::endl;
}
std::cout<<"t="<<nano<<std::endl;
}
return 0;
}
Output
c=11990 t=18833176
c=11990 t=18373216
c=11990 t=18464937
c=11990 t=18453505
For nearly 12000 collisions, performance is 10% below 60 FPS.
The performance scaling problem is mainly caused by deduplication (bottlenecked by RAM bandwidth and mutex throughput) and partly by the single-threaded construction of the load-balancing tree. A CPU with turbo can boost these serially-running parts for better multithreaded scaling, and faster RAM gives better overall performance.
When the CPU frequency is raised to 3.6 GHz,
c=11990 t=16230516
c=11990 t=16768193
c=11990
performance improves by only 10%-15%. RAM access and mutex throughput in deduplication limit the scaling.
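One way to relieve the mutex pressure (an illustrative sketch, not what AdaptiveGridTree does) is to let each worker accumulate pairs in a thread-local buffer and lock the shared result exactly once per worker instead of once per pair; note that this would not remove the RAM-bandwidth cost of touching the duplicated data:
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>
// each worker fills its own buffer lock-free, then takes the shared lock
// exactly once to splice its results into the common result vector
void collectPairs(std::vector<std::pair<int,int>>& shared, std::mutex& mut, int workerId)
{
    std::vector<std::pair<int,int>> local;
    for (int i = 0; i < 1000; i++)
        local.emplace_back(workerId, i); // stand-in for real collision pairs
    std::lock_guard<std::mutex> lg(mut); // one lock acquisition per worker
    shared.insert(shared.end(), local.begin(), local.end());
}
int main()
{
    std::vector<std::pair<int,int>> shared;
    std::mutex mut;
    std::vector<std::thread> workers;
    for (int w = 0; w < 4; w++)
        workers.emplace_back(collectPairs, std::ref(shared), std::ref(mut), w);
    for (auto& t : workers)
        t.join();
    return 0;
}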
When particles are clustered in multiple regions, performance scaling improves, because fewer adaptive grids are built and less deduplication work is done.
4 clusters (10k particles each)
Only the point-cloud initialization part needs to change to create multiple clusters:
// cluster 1
for(int i=0;i<10000;i++)
{
objects.push_back(
PointCloud<cotype>(
gen.generate1Float()*150,5000+gen.generate1Float()*1500,gen.generate1Float()*150)
);
}
// cluster 2
for(int i=0;i<10000;i++)
{
objects.push_back(
PointCloud<cotype>(
gen.generate1Float()*150,gen.generate1Float()*150, 5000+gen.generate1Float()*1500)
);
}
// cluster 3
for(int i=0;i<10000;i++)
{
objects.push_back(
PointCloud<cotype>(
5000 + gen.generate1Float()*150,gen.generate1Float()*150, 5000+gen.generate1Float()*1500)
);
}
// cluster 4
for(int i=30000;i<N-3;i++)
{
objects.push_back(
PointCloud<cotype>(
gen.generate1Float()*1500,gen.generate1Float()*150,gen.generate1Float()*150)
);
}
// the teapot in stadium problem
objects.push_back(PointCloud<cotype>(9000,9000,9000));
objects.push_back(PointCloud<cotype>(9001,9001,9001));
objects.push_back(PointCloud<cotype>(9002,9002,9002));
Results
c=2998 t=16395008
c=2998 t=16848125
c=2998 t=16174032
c=2998 t=16507923
c=2998 t=15759892
c=2998 t=16646543
c=2998 t=15984087
c=2998 t=15312543
The computation averages about 16 milliseconds, just above 60 FPS.
16 clusters (2500 particles each)
for(int j=0;j<16;j++)
for(int i=0;i<2500;i++)
{
objects.push_back(
PointCloud<cotype>(
(j%2)*5000 + gen.generate1Float()*150,
((j/2)%2)*5000 + gen.generate1Float()*150,
(j/4)*2500 + gen.generate1Float()*150)
);
}
Results
c=7190 t=13297025
c=7190 t=13186931
c=7190 t=14142431
c=7190 t=13259546
c=7190 t=14085992
c=7190 t=13298677
c=7190 t=12839448
The computation averages about 13 milliseconds (77 FPS), at least 15% faster. Equally-sized clusters also help load balancing, although the benefit is not visible when the load-balancing tree is very deep.
The slowness of the 64-cluster version (the uniform random distribution with tree depth 1) stems from the increased deduplication work. More clusters mean worse deduplication performance but better load balancing. On the test system, 16 clusters gave the best performance and the uniform distribution the worst, but the uniform case sustains high collision scaling (performance degrades more slowly than the number of collision pairs grows).
Benchmark on godbolt.org (Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz, compiler options = -std=c++2a -O3 -march=native -mavx512f -mavx512bw -ftree-vectorize -ffast-math -fno-math-errno -lpthread):
16 clusters + 2 threads: 14.4 milliseconds
16 clusters + 4 threads: 8.3 milliseconds (https://godbolt.org/z/8TE4znzhE)
Benchmark on paiza.io (Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz):
16 clusters + 2 threads: 10.8 milliseconds
16 clusters + 4 threads: 10.7 milliseconds
Multithreaded performance scaling on shared servers is complicated: using one more thread increases the chance of sharing the same core with other clients.
wandbox.org (Intel(R) Xeon(R) E-2286G CPU @ 4.00GHz, 4 threads, 16 clusters): 9.3 milliseconds
rextester.com (Intel(R) Xeon(R) Gold 5215 CPU @ 2.50GHz, 4 threads, 16 clusters): 10.0 milliseconds
When the number of collisions increases from 7000 to 181000 (by increasing the particle density per cluster), the FX8150 with 7 threads takes 23 milliseconds, while the Xeon Gold 5215 with 4 threads completes the computation in 19 milliseconds.
For algorithms that do not need to merge results from all nodes, the tree structure alone is enough to load-balance across multiple threads against unknown particle densities.
Points of Interest
Using a uniform grid as a spatial acceleration structure is interesting, and it can be optimized to fit many more scenarios. Currently, its insertion/query performance ratio is not well balanced. Since insertion is simple, multithreading support can easily be added in a future update, as sketched below.
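Because the per-particle data is already stored as independent arrays, insertion parallelizes naturally: split the input across threads and let each thread write a disjoint slice of pre-allocated arrays, so no locking is needed inside the loop. A minimal hypothetical sketch (the struct and function names below are illustrative, not part of FastCollisionDetectionLib):
#include <thread>
#include <vector>
struct Aabb { float minx, maxx; }; // reduced to one axis for brevity
// parallel copy of AABB bounds into pre-sized arrays; each thread writes a
// disjoint [begin,end) slice, so the writes are race-free without locks
void addParticlesParallel(const std::vector<Aabb>& in,
                          std::vector<float>& minX, std::vector<float>& maxX,
                          int numThreads)
{
    const int n = static_cast<int>(in.size());
    minX.resize(n);
    maxX.resize(n); // allocate once, before any thread starts
    std::vector<std::thread> pool;
    for (int t = 0; t < numThreads; t++)
    {
        pool.emplace_back([&, t]() {
            const int begin = (n * t) / numThreads;
            const int end = (n * (t + 1)) / numThreads;
            for (int i = begin; i < end; i++)
            {
                minX[i] = in[i].minx;
                maxX[i] = in[i].maxx;
            }
        });
    }
    for (auto& th : pool)
        th.join();
}
int main()
{
    std::vector<Aabb> aabbs(40000, Aabb{0.0f, 1.0f});
    std::vector<float> minX, maxX;
    addParticlesParallel(aabbs, minX, maxX, 4);
    return 0;
}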
Next challenge (completed): computing all-pairs collisions of 20k AABBs at 60 FPS for a simple physics simulation of spherical objects.
Third challenge (completed): computing all-pairs collision detection of 40k AABBs at 60 FPS, preferably on a single thread of the same CPU (2.1GHz FX8150 + single-channel 1333MHz DDR3 RAM).
Fourth challenge: computing all-pairs collision detection of 80k AABBs at 60 FPS on a GT1030 GPU.
All diagrams were created on this site: https://app.diagrams.net/.
All charts were visualized on this site: https://app.datawrapper.de
History
- Friday, March 18, 2022: Initial submission, with performance barely meeting the requirements set at the beginning of the article.
- Monday, April 4, 2022: Implemented and tested the sparse-linear adaptive grid with the 4x4x4 child-node setup, completing the second challenge (20k-particle all-pairs collision detection at 60 FPS). Added AVX2 and AVX512 tests.
- Saturday, April 16, 2022: Implemented and benchmarked the multithreaded load-balanced version as an "adaptive grid tree", completing the third challenge (40k-particle all-pairs collision detection at 60 FPS).