Recent SIMD architectures have trended toward longer vector lengths, and newer vector instruction sets have introduced enhanced SIMD features. However, legacy or proprietary applications compiled for a short-SIMD ISA cannot benefit from a long-SIMD architecture, which offers greater parallelism and richer vector primitives, and thus achieve only a small fraction of the potential peak performance. This paper presents a dynamic binary translation technique that enables short-SIMD binaries to exploit the new SIMD architecture by rewriting short-SIMD loop code. We propose a general approach that translates loops consisting of short-SIMD instructions into a machine-independent IR, performs SIMD loop transformation and optimization at the IR level, and finally emits long-SIMD instructions. Two solutions are presented to enforce SIMD load/store alignment: one addresses the problem caused by the binary translator's internal translation condition, and the other is a general approach based on loop peeling. Benchmark results show an average speedup of 1.45X for NEON-to-AVX2 loop transformation.
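To make the loop peeling idea concrete, the following is a minimal sketch in portable C, not the translator's actual implementation. It assumes a 32-byte alignment target (the width of one 256-bit AVX2 register), an element-wise float addition as the loop body, and a destination pointer that is naturally aligned for float. The names `peel_count` and `add_peeled` are hypothetical. The inner fixed-width loop stands in for a single 8-lane vector operation, and only the store address is aligned here; the source operands may still need unaligned loads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES 32u                          /* assumed: one 256-bit register */
#define VEC_LANES (VEC_BYTES / sizeof(float))  /* 8 float lanes per register */

/* Number of leading scalar iterations needed before p reaches a 32-byte
 * boundary, clamped to n. Assumes p is naturally aligned for float. */
static size_t peel_count(const float *p, size_t n)
{
    size_t mis = (size_t)((uintptr_t)p % VEC_BYTES);
    size_t peel = mis ? (VEC_BYTES - mis) / sizeof(float) : 0;
    return peel < n ? peel : n;
}

/* Computes dst[i] = a[i] + b[i] for i in [0, n) and returns the number of
 * peeled iterations (hypothetical helper, for illustration only). */
static size_t add_peeled(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    size_t peel = peel_count(dst, n);

    /* Prologue: scalar iterations peeled off until dst + i is aligned. */
    for (; i < peel; ++i)
        dst[i] = a[i] + b[i];

    /* Main body: each outer iteration models one 8-lane vector operation
     * whose store address is now 32-byte aligned. */
    for (; i + VEC_LANES <= n; i += VEC_LANES) {
        assert((uintptr_t)(dst + i) % VEC_BYTES == 0);
        for (size_t lane = 0; lane < VEC_LANES; ++lane)
            dst[i + lane] = a[i + lane] + b[i + lane];
    }

    /* Epilogue: scalar remainder that does not fill a whole vector. */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];

    return peel;
}
```

For a destination that starts 12 bytes past a 32-byte boundary, the sketch peels five scalar iterations before entering the aligned main body. A real translator would apply the same split at the IR level and emit aligned long-SIMD stores in the main body.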