Deep neural networks have been widely applied in many areas, such as computer vision, natural language processing, and information retrieval. However, due to their high computation and memory demands, deep learning applications have not been widely adopted on edge devices. In this paper, we exploit the sparsity in tensors to reduce computation overheads and memory demands. Unlike other approaches, which rely on hardware accelerator designs or sacrifice model accuracy for performance by pruning parameters, we adaptively partition and deploy the workload across heterogeneous devices to reduce computation and memory requirements and increase computing efficiency. We implemented our partitioning algorithms in Google's TensorFlow and evaluated them on an AMD Kaveri system, an HSA-based heterogeneous computing platform. Our method effectively reduces computation time, cache accesses, and cache miss rates without impacting the accuracy of the learning models. Our approach achieves 66% and 88% speedup for the lenet-5 and lenet-1024-1024 models, respectively. For memory traffic, our approach reduces instruction cache references by 71% and data cache references by 32%. Our system also reduces the cache miss rate from 1.6% to 0.5% during training of the lenet-1024-1024 model.
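The core idea of exploiting tensor sparsity — performing arithmetic only on nonzero entries — can be illustrated with a minimal sketch. This is not the paper's TensorFlow implementation; the function name and structure are illustrative, assuming a dense NumPy weight matrix whose zero entries are skipped:

```python
import numpy as np

def sparse_matvec(weights, x):
    """Matrix-vector product that skips zero weight entries.

    For a sparse weight matrix, iterating only over nonzero
    entries avoids the multiply-accumulate work (and the memory
    traffic) that the zeros would otherwise incur.
    """
    # Gather (row, col) indices of nonzero weights once, up front.
    rows, cols = np.nonzero(weights)
    y = np.zeros(weights.shape[0])
    for r, c in zip(rows, cols):
        y[r] += weights[r, c] * x[c]
    return y

# Example: a 2x2 matrix that is 50% sparse.
W = np.array([[0.0, 2.0],
              [3.0, 0.0]])
x = np.array([1.0, 1.0])
y = sparse_matvec(W, x)
```

The work performed is proportional to the number of nonzeros rather than the full tensor size, which is why sparsity reduces both computation time and cache pressure; the paper's contribution is deciding how to partition such sparse workloads across heterogeneous CPU/GPU devices.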