Performance increased by about 115% on a 980 MHz BananaPi dev-board. Throughput went from about 46.2 cpb to about 21.5 cpb.
Cleaned up whitespace
Performance increased by about 100% on a 3.1 GHz Core i5 Skylake. Throughput went from about 7.3 cpb to about 3.5 cpb. Not bad for a software-based implementation of a block cipher