MASM - Stack Memory Alignment






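SIMD instruction sets may require special memory alignment, but MASM provides no alignment mechanism for memory that lives on the stack.
Introduction
In MASM, the ALIGN directive does not align local (stack) variables, that is, the variables you declare inside a procedure with the LOCAL directive. The only guarantee you get for local variables is that 32-bit Windows aligns them on 4-byte boundaries and 64-bit Windows aligns them on 8-byte boundaries.
MASM does, of course, align variables declared in the .DATA section of the source code, but those variables are static and may not be what you need, particularly when the code must be thread-safe.
The lack of a stack data alignment mechanism did not become truly critical until the SSE instruction set appeared. Many SSE instructions that read from memory require the data to be aligned on a 16-byte boundary and raise an exception otherwise.
Most recent C/C++ compilers have directives for aligning stack data, but we are dealing with MASM. If you link C/C++ with assembly language, or write whole applications in assembly language, you need to be aware of the potential problems.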
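For comparison, here is a minimal C++11 sketch of what such a compiler directive looks like. It assumes a C++11 compiler; the helper names `is_aligned` and `local_vector_is_aligned` are illustrative and are not part of the article's download.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when p is a multiple of `alignment`,
// which must be a power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// A local (stack) array that the compiler is asked to place
// on a 16-byte boundary via alignas.
inline bool local_vector_is_aligned() {
    alignas(16) float v[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    return is_aligned(v, 16);
}
```

MASM's LOCAL directive has no equivalent of `alignas`, which is why the alignment has to be done by hand in the rest of this article.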
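SSE provides instructions that load possibly unaligned data into a register, or store an SSE register to possibly unaligned memory: the "movups" and "movdqu" instructions. On modern CPUs the performance penalty is far less noticeable than it was on the old Pentium 3 and 4, so these instructions are usually the best choice.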
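Seen from the C++ side, the unaligned route might look like the sketch below. It assumes an x86 or x64 target with SSE and the standard `<xmmintrin.h>` intrinsics; the function name `add4_unaligned` is made up for illustration.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_loadu_ps, _mm_storeu_ps, _mm_add_ps

// Adds two 4-float vectors that may live at any address.
// _mm_loadu_ps and _mm_storeu_ps are the unaligned load/store intrinsics
// (compilers typically emit movups for them), so no 16-byte alignment is needed.
inline void add4_unaligned(const float* a, const float* b, float* out) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```

Replacing the calls with `_mm_load_ps` and `_mm_store_ps` gives the aligned variants, which require every pointer to be 16-byte aligned.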
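Still, it is useful to know how to align stack memory in MASM. For example, you may need to call external functions from assembly language that expect the data they receive to be 16-byte aligned.
Using the code
Aligning stack memory in assembly language has been discussed on various forums over the years, but I never found a solution that really worked, so I decided to propose my own.
This solution allows an unlimited number of 16-byte aligned memory variables (it can easily be modified for 32 bytes or more if needed, which is left as an exercise).
It works in the following stages:
1) Save the current stack position so that it can be restored later.
This is done with a macro (this is the 32-bit version; both the 32-bit and 64-bit versions can be downloaded from the link above).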
SAVE_STACK_POSITION macro
    mov TopOfAllocatedStackMem, esp ;; Save the current top of stack
endm
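2) Reserve a 16-byte aligned chunk of memory on the stack for a variable. A variable that holds a pointer to it has already been declared with the LOCAL directive, so that we can access the chunk later.
Another macro does this job (this is the 32-bit version):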
CREATE_16BYTE_ALIGNED_STACK_MEM_SLOT macro memsize, PtrToStackMem
    and esp, -10h ;; Align to 16 byte boundary
    sub esp, memsize ;; Make room for the new variable (memsize must be a multiple of 16 to stay aligned)
    mov PtrToStackMem, esp
endm
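The arithmetic behind the `and`/`sub` pair can be checked outside the assembler. The sketch below models it in C++ on plain 32-bit values; the function names are hypothetical, and the multiple-of-16 requirement on `memsize` is an assumption of this sketch that follows from subtracting after aligning.

```cpp
#include <cassert>
#include <cstdint>

// Models `and <stack pointer>, -10h`: clearing the low four bits rounds
// the value DOWN to a multiple of 16 (-10h is 0FFFFFFF0h in 32 bits).
inline std::uint32_t align_down_16(std::uint32_t sp) {
    return sp & ~std::uint32_t{15};
}

// Models the whole macro: align first, then subtract the slot size.
inline std::uint32_t reserve_aligned_slot(std::uint32_t sp, std::uint32_t memsize) {
    // Assumption: memsize is a multiple of 16, otherwise the
    // subtraction would leave the result unaligned again.
    assert(memsize % 16 == 0);
    return align_down_16(sp) - memsize;
}
```

Because the stack grows downward, rounding down never overlaps data already on the stack; it only skips up to 15 bytes of padding.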
3) Now that we have memory for the variable, we can save some data there.
A third macro (32-bit version here) does this job:
SAVE_XMM_IN_ALIGNED_STACK_MEMORY macro PtrToStackMem, reg
    push eax ;; Now we are safe to "push" and "pop" registers.
    mov eax, PtrToStackMem
    movaps [eax], reg ;; If the memory were not aligned, an exception would occur here.
    pop eax
endm
4) When we need to retrieve the variable, we use a fourth macro (32-bit version here):
RETRIEVE_XMM_FROM_ALIGNED_STACK_MEMORY macro PtrToStackMem, reg
    push eax
    mov eax, PtrToStackMem
    movaps reg, [eax] ;; If the memory were not aligned, an exception would occur here.
    pop eax
endm
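5) When the procedure returns to its caller, we need to release all the memory we allocated on the stack.
To do that, insert the following macro right before the "ret" instruction (32-bit version here). The MASM assembler will emit a "leave" instruction before the "ret".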
RESTORE_STACK_POSITION macro
mov esp, TopOfAllocatedStackMem
endm
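To see how the stages fit together, the following C++ sketch tracks only the value of the stack pointer through save, reserve and restore. It is a hypothetical model, not the article's code: `StackModel` and its members are invented names, and no real stack memory is touched.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the stages above, tracking only the stack pointer value.
struct StackModel {
    std::uint32_t esp;    // simulated stack pointer
    std::uint32_t saved;  // plays the role of TopOfAllocatedStackMem

    explicit StackModel(std::uint32_t initial) : esp(initial), saved(initial) {}

    // Stage 1: remember where the stack pointer was.
    void save_position() { saved = esp; }

    // Stage 2: align down to 16 bytes, then make room for the slot.
    // Returns the value that would be stored in PtrToStackMem.
    std::uint32_t create_slot(std::uint32_t memsize) {
        assert(memsize % 16 == 0);  // assumption of this sketch
        esp &= ~std::uint32_t{15};
        esp -= memsize;
        return esp;
    }

    // Stage 5: one assignment releases every slot at once.
    void restore_position() { esp = saved; }
};
```

Stages 3 and 4 are omitted because their push/pop pairs are balanced and leave the stack pointer unchanged. The model shows why any number of slots can be created: each one simply moves the pointer further down, and a single restore undoes them all.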
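Our demo
Our demo consists of a callable ASM function (AsmMemAlignDemo) and a mini C++ project that contains its caller. AsmMemAlignDemo receives 2 parameters: an __m128, which corresponds to XMMWORD in ASM, and a float, which corresponds to REAL4 in ASM. It returns an __m128.
Its C++ declaration is
__m128 AsmMemAlignDemo(__m128 param1, float param2);
AsmMemAlignDemo is called with param1 containing a vector of 4 floats (1.0, 2.0, 3.0, 4.0) and param2 containing the float 10.0.
Inside the ASM function, three operations are performed to obtain the final result.
1) The vector is multiplied by the float, giving the intermediate result
(10.0, 20.0, 30.0, 40.0)
2) The value 17.0 is added to the vector, giving the intermediate result
(27.0, 37.0, 47.0, 57.0)
3) The vector is divided by 3.0, giving the final result
(9.000000, 12.333333, 15.666667, 19.000000)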
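The expected numbers can be cross-checked with a plain scalar reference. This is only a sketch for verifying the arithmetic; `reference_result` is a hypothetical name and is not part of the demo project.

```cpp
#include <array>
#include <cstddef>

// Scalar reference for the demo arithmetic: (v * scale + 17) / 3 per element.
inline std::array<float, 4> reference_result(const std::array<float, 4>& v, float scale) {
    std::array<float, 4> out{};
    for (std::size_t i = 0; i < v.size(); ++i) {
        out[i] = (v[i] * scale + 17.0f) / 3.0f;
    }
    return out;
}
```

Calling it with the vector (1.0, 2.0, 3.0, 4.0) and the scale 10.0 should reproduce the final result listed above, up to float rounding.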
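Along the way we also demonstrate our method for 16-byte stack memory alignment, which is, after all, the main purpose of this article (32-bit version here).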
; __m128 AsmMemAlignDemo(__m128 param1, float param2);
; *See important comments in the "C++" project
; This function will:
; 1) multiply the param1 vector by the float param2
; 2) add 17.0 then divide the result by 3.0
AsmMemAlignDemo proc public par2:REAL4
    ; These are the stack variables. On 64-bit Windows they are aligned on 8-byte boundaries,
    ; on 32-bit Windows they are aligned on 4-byte boundaries.
    LOCAL valueToAdd : DWORD
    LOCAL valueToDivideFor : DWORD
    LOCAL TopOfAllocatedStackMem : DWORD
    LOCAL PointerTo16ByteAlignedvalueToAdd : ptr XMMWORD
    LOCAL PointerTo16ByteAlignedvalueToDivideFor : ptr XMMWORD

    movss xmm5, par2 ; Move the passed float to the first 32 bits of xmm5
    shufps xmm5, xmm5, 0 ; Replicate it across the register to obtain 4 identical floats
    ; cdecl: the __m128 param1 came in xmm0
    mulps xmm0, xmm5 ; Part 1) is completed. The partial result is in xmm0

    ; Set some data to compose our example. First the 17.0 to add to the partial result
    mov dword ptr valueToAdd, 17
    movss xmm3, valueToAdd
    shufps xmm3, xmm3, 0 ; replicate across
    cvtdq2ps xmm3, xmm3 ; convert to a vector of 4 floats

    ; Set the value to divide by, which is 3.0
    mov dword ptr valueToDivideFor, 3
    movss xmm2, valueToDivideFor
    shufps xmm2, xmm2, 0 ; replicate across
    cvtdq2ps xmm2, xmm2 ; convert to a vector of 4 floats

    ; Now, we will begin the part that demonstrates how to align stack memory.
    ; This is the real purpose of the article; until now, everything was just a "mise-en-scene".
    ; First, save the current stack position.
    SAVE_STACK_POSITION

    ; Reserve a chunk of 16-byte aligned memory on the stack for the addition vector
    CREATE_16BYTE_ALIGNED_STACK_MEM_SLOT sizeof XMMWORD, PointerTo16ByteAlignedvalueToAdd
    ; Save the xmm3 contents in there.
    SAVE_XMM_IN_ALIGNED_STACK_MEMORY PointerTo16ByteAlignedvalueToAdd, xmm3

    ; Reserve a chunk of 16-byte aligned memory on the stack for the division vector
    CREATE_16BYTE_ALIGNED_STACK_MEM_SLOT sizeof XMMWORD, PointerTo16ByteAlignedvalueToDivideFor
    ; Save the xmm2 contents in there.
    SAVE_XMM_IN_ALIGNED_STACK_MEMORY PointerTo16ByteAlignedvalueToDivideFor, xmm2

    xorps xmm3, xmm3 ; zero out the xmm registers to prove we are not cheating :)
    xorps xmm2, xmm2

    ; Check with a debugger that the stored vectors are loaded back successfully using the "movaps" instructions.
    ; But we will not be using xmm3 and xmm2 for the final calculation.
    RETRIEVE_XMM_FROM_ALIGNED_STACK_MEMORY PointerTo16ByteAlignedvalueToAdd, xmm3
    RETRIEVE_XMM_FROM_ALIGNED_STACK_MEMORY PointerTo16ByteAlignedvalueToDivideFor, xmm2

    ; Instead we will be doing the calculations directly with the aligned memory
    mov eax, PointerTo16ByteAlignedvalueToAdd
    addps xmm0, [eax] ; Add Packed Single-Precision Floating-Point Values
    mov eax, PointerTo16ByteAlignedvalueToDivideFor
    divps xmm0, [eax] ; Divide Packed Single-Precision Floating-Point Values

    ; That's it, the result will be returned in xmm0
    ; Finally, deallocate our stack memory; all you need to do is restore the stack pointer.
    RESTORE_STACK_POSITION
    ret ; The MASM assembler will emit a leave instruction before the ret
AsmMemAlignDemo endp
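In the "C++" project we have added some comments on how __m128 parameters are passed and how the __m128 result is received under Visual Studio with the cdecl and Microsoft x64 calling conventions. You will notice that they deviate from what other compiler vendors understand as the norm.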
#include <stdio.h>
#include <tchar.h>
#include <xmmintrin.h>

extern "C"
{
    __m128 AsmMemAlignDemo(__m128 param1, float param2);
}

int _tmain(int argc, _TCHAR* argv[])
{
    // Through some compiler magic the __m128 data type is automatically aligned on 16-byte boundaries.
    __m128 mappedXMMRegister = { 1.0, 2.0, 3.0, 4.0 };

    // On x86, we are using the cdecl calling convention, yet what happens in this particular case
    // contradicts most of the common wisdom. Be aware that not every compiler does it this way:
    // 1) The __m128 variable is sent in xmm0 (yes, it is!)
    // 2) The float is sent on the stack (no surprises here)
    // 3) The return value comes in xmm0 (yes, it does!)

    // On x64, the Microsoft x64 calling convention is standard. The __m128 return value comes in xmm0,
    // which surprises some people because this information is somewhat hidden.
    // Be aware that not every compiler does it this way. So:
    // 1) The return value comes in xmm0 (structures are usually returned in memory pointed to by RCX, but not __m128).
    // 2) The float parameter is passed in xmm1: it is the second argument, and registers are assigned by argument position.
    // 3) The __m128 parameter is not passed in a register: it is stored in memory on the stack, and RCX points to it.
    // In conclusion, it appears that cdecl deals better with __m128 parameters than the x64 calling convention.
    // With the new "Vector Calling Convention", things do get better.
    __m128 result = AsmMemAlignDemo(mappedXMMRegister, 10.0);

    printf("The test was a success!\n");
    printf("Results: %f, %f, %f, %f", result.m128_f32[0], result.m128_f32[1], result.m128_f32[2], result.m128_f32[3]);
    getchar();
    return 0;
}
Finally, compiling the ASM
To compile the 32-bit ASM, you run MASM with:
"Path to Visual Studio"\VC\bin\ml /c asmtest32.asm
Note: You can also compile with JWasm without any change:
"Path to JWasm"\jwasm -coff asmtest32.asm
To compile the 64-bit ASM, you run MASM with:
"Path to Visual Studio"\VC\bin\amd64\ml64 -c asmtest64.asm
Note: You can also compile with JWasm without any change:
"Path to JWasm"\jwasm -c -win64 asmtest64.asm
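After compiling, you need to link asmtest32.obj into the 32-bit build of your Visual Studio project and asmtest64.obj into the 64-bit build.
Important
The reliability of this method rests on the assumption that all unwinding takes place after RESTORE_STACK_POSITION is invoked.
This is what happens in our demo: the MASM assembler emits 'leave' and then 'ret' after RESTORE_STACK_POSITION.
Extra care is needed if the ASM module has to deal with SEH (Structured Exception Handling) or preserve some registers across the procedure (for example, by pushing registers onto the stack before SAVE_STACK_POSITION). The same applies if the ASM module is not a leaf (that is, it calls other procedures). JWasm handles these situations easily, but MASM requires you to know exactly what you are doing.
History
September 6, 2016 - CREATE_16BYTE_ALIGNED_STACK_MEM_SLOT macro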