Using process.h and running in-place is 19% faster on this machine. From there, using intrinsics yields another 94%, for a total speedup of roughly 130% (the gains multiply: 1.19 × 1.94 ≈ 2.31×).