align stack properly for calling global ctors/dtors on x86[_64]
failure to do so was causing crashes on x86_64 when ctors used SSE,
which was first observed when ctors called variadic functions due to
the SSE prologue code inserted into every variadic function.