Use std::ostringstream instead. Eventually I'd like to see the output stream passed into the function of interest. It will avoid problems on some mobile OSes that don't have standard inputs and outputs.
These changes made it in by accident at Commit b74a6f4445. We were going to try to let them ride but they broke versioning. They may be added later but we should avoid the change at this time.
We also switched to "IsAligned<HashWordType>(input)". Using word64 was due to debug testing on Solaris (the alignment check is needed). Hard coding word64 should not have been checked in.
It looks like GetAlignmentOf was returning the "UnsignedMin(4U, sizeof(T))" for SunCC. It was causing SIGBUSes on Sparc when T=word64. OpenCSW provided access to their build farm and we were able to test "__alignof__(T)" back to an early SunCC on Solaris 9.
Man, Sparc does not mess around with unaligned buffers. Without -xmemalign=4i the hardware wants 8-byte aligned word64's so it can use the high performance 64-bit move or add.
Since we do not use -xmemalign we get the default behavior of either -xmemalgin=8i or -xmemalgin=8s. It shoul dnot matter to us since we removed unaligned data access at GH #682.
IteratedHashBase::Update aligns the buffer, but IteratedHashBase::HashBlock does not. It was causing a fair number of asserts to fire when the code was instrumented with alignment checks. Linux benchmarks shows the code does not run materially slower on i686 or x86_64.
I believe Andrew Marlow first reported it. At the time we could not get our hands on hardware to fully test things. Instead we were using -xmemalign=4i option as a band-aide to avoid running afoul of the Sparc instruction that moves 64-bits of data in one shot.
Local scopes and loading the constants with _mm_set_epi32 saves about 0.03 cpb. It does not sound like much but it improves GMAC by about 500 MB/s. GMAC is just shy of 8 GB/s.