Departing from traditional quantization at a fixed resolution, we describe a novel architectural approach that supports inference at multiple resolution deployment points. A single meta multi-resolution model with a small footprint can select from multiple resolutions at runtime to satisfy given resource constraints. The proposed scheme relies on term quantization to enable flexible bit annihilation at any position for a value in the context of a group of values. This is in contrast to conventional uniform quantization, which always truncates the lowest-order bits. We present multi-resolution training of the meta model and a field-configurable multi-resolution Multiplier-Accumulator (mMAC) design. We compare our design against a traditional MAC design and evaluate inference performance on a variety of datasets, including ImageNet, COCO, and WikiText-2.
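As a rough illustration of the contrast, the sketch below quantizes a group of integer values two ways: uniform truncation of the lowest-order bits versus a term-quantization-style budget on the number of power-of-two terms kept across the whole group. The function names and the group-budget encoding are illustrative, not the paper's exact formulation.

```python
# Illustrative sketch only; `term_quantize` and `group_budget` are assumed names.

def to_terms(x):
    """Decompose a non-negative integer into its power-of-two term exponents."""
    return [i for i in range(x.bit_length()) if (x >> i) & 1]

def uniform_quantize(group, keep_bits):
    """Conventional: truncate the lowest-order bits of every value,
    keeping only the `keep_bits` highest bit positions of the group."""
    drop = max(max(v.bit_length() for v in group) - keep_bits, 0)
    return [(v >> drop) << drop for v in group]

def term_quantize(group, group_budget):
    """Keep only the `group_budget` largest power-of-two terms across the
    whole group; bits may be annihilated at any position in any value."""
    terms = [(1 << e, idx) for idx, v in enumerate(group) for e in to_terms(v)]
    terms.sort(reverse=True)                  # largest-magnitude terms first
    out = [0] * len(group)
    for mag, idx in terms[:group_budget]:
        out[idx] += mag
    return out

group = [13, 2, 1, 8]                 # binary: 1101, 0010, 0001, 1000
print(uniform_quantize(group, 2))     # -> [12, 0, 0, 8]: low-order bits gone
print(term_quantize(group, 4))        # -> [12, 2, 0, 8]: the lone bit of 2 survives
```

With the same rough budget, term quantization preserves the low-position bit of the value 2 because the budget is spent on the largest terms in the group rather than on fixed bit positions.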
We introduce exBERT, a training method that extends a BERT model pre-trained on a general domain into a new pre-trained model for a specific domain with a new additive vocabulary, under constrained training resources (i.e., constrained computation and data). exBERT uses a small extension module to learn to adapt an augmenting embedding for the new domain in the context of the original BERT's embedding of the general vocabulary. The exBERT training method is novel in learning the new vocabulary and the extension module while keeping the weights of the original BERT model fixed, resulting in a substantial reduction in required training resources. We pre-train exBERT with biomedical articles from ClinicalKey and PubMed Central, and study its performance on biomedical downstream benchmark tasks using the MTL-Bioinformatics-2016 dataset. We demonstrate that exBERT consistently outperforms prior approaches when using a limited corpus and limited pre-training computation.
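A minimal sketch of the general idea of blending a frozen original module's output with a small trainable extension module's output. The sigmoid gate and all names here are assumptions for illustration, not exBERT's exact architecture.

```python
# Assumed gating form for illustration; not the paper's exact module design.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def combine(original_out, extension_out, gate_logit):
    """Blend the frozen original BERT output with the small trainable
    extension module's output; only the extension side would be updated."""
    g = sigmoid(gate_logit)
    return [g * o + (1.0 - g) * e for o, e in zip(original_out, extension_out)]

orig = [0.5, -1.0, 2.0]   # frozen original-module output for one token
ext  = [1.0,  0.0, 0.0]   # trainable extension-module output
print(combine(orig, ext, 0.0))  # gate 0.5 -> elementwise average
```

Because the original weights stay frozen, gradients only flow into the extension module and the gate, which is what keeps the training cost low.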
Network quantization has rapidly become one of the most widely used methods to compress and accelerate deep neural networks. Recent efforts propose to quantize weights and activations from different layers with different precision to improve the overall performance. However, it is challenging to find the optimal bitwidth (i.e., precision) for the weights and activations of each layer efficiently. Meanwhile, it is yet unclear how to perform convolution efficiently on generic hardware platforms when weights and activations have different precision. To resolve these two issues, in this paper we first propose an Efficient Bitwidth Search (EBS) algorithm, which reuses the meta weights across different quantization bitwidths so that the strength of each candidate precision can be optimized directly w.r.t. the objective without superfluous copies, significantly reducing both memory and computational cost. Second, we propose a binary decomposition algorithm that converts weights and activations of different precision into binary matrices, making mixed-precision convolution efficient and practical. Experimental results on the CIFAR10 and ImageNet datasets demonstrate that our mixed-precision QNN outperforms its handcrafted uniform-bitwidth counterparts and other mixed-precision techniques.
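The binary decomposition can be illustrated with a small integer dot product: a product of k-bit weights and j-bit activations expands into k×j binary dot products (AND plus popcount, in hardware terms) weighted by powers of two. This is an illustrative sketch, not the paper's exact kernel.

```python
# Illustrative sketch of binary decomposition; not EBS's exact implementation.

def bit_planes(values, bits):
    """Split unsigned integers into `bits` binary vectors (LSB first)."""
    return [[(v >> i) & 1 for v in values] for i in range(bits)]

def mixed_precision_dot(weights, acts, w_bits, a_bits):
    """Compute dot(weights, acts) as a weighted sum of 1-bit dot products."""
    w_planes = bit_planes(weights, w_bits)
    a_planes = bit_planes(acts, a_bits)
    total = 0
    for i, wp in enumerate(w_planes):          # weight bit plane, scale 2^i
        for j, ap in enumerate(a_planes):      # activation bit plane, scale 2^j
            binary_dot = sum(w & a for w, a in zip(wp, ap))  # popcount of AND
            total += (1 << (i + j)) * binary_dot
    return total

w, a = [3, 1, 2], [2, 3, 1]
print(mixed_precision_dot(w, a, 2, 2))  # -> 11, equal to 3*2 + 1*3 + 2*1
```

Each binary dot product maps to cheap bitwise operations on generic hardware, which is why the decomposition makes mixed precision practical.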
We propose Additive Powers-of-Two (APoT) quantization, an efficient nonuniform quantization scheme that attends to the bell-shaped and long-tailed distribution of weights in neural networks. By constraining all quantization levels to be sums of several powers-of-two terms, APoT quantization enjoys high computational efficiency and a good match with the weights' distribution. A simple reparameterization of the clipping function is applied to generate a better-defined gradient for updating the optimal clipping threshold. Moreover, weight normalization is presented to make the input distribution of weights more stable and consistent. Experimental results show that our proposed method outperforms state-of-the-art methods, and is even competitive with full-precision models, demonstrating the effectiveness of the proposed APoT quantization. For example, our 3-bit quantized ResNet-34 on ImageNet drops only 0.3% Top-1 and 0.2% Top-5 accuracy without bells and whistles, while the computation of our model is approximately 2× lower than that of uniformly quantized neural networks.
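To make the level construction concrete, the sketch below builds a small APoT-style level set as sums of power-of-two terms and rounds a value to the nearest level. The particular term sets are an illustrative choice, not the paper's configuration.

```python
# Illustrative APoT-style level set; the term sets below are assumed, not
# the paper's exact configuration.
import itertools

def apot_levels(term_sets):
    """All sums of one term from each set, deduplicated and sorted."""
    sums = {sum(combo) for combo in itertools.product(*term_sets)}
    return sorted(sums)

def quantize(x, levels):
    """Round x to the nearest quantization level."""
    return min(levels, key=lambda lv: abs(lv - x))

# two additive terms, each either 0 or a power of two
levels = apot_levels([[0, 2**-1, 2**-2], [0, 2**-3, 2**-4]])
print(levels)                     # 9 nonuniform levels in [0, 0.625]
print(quantize(0.3, levels))      # -> 0.3125 (nearest level)
```

Because every level is a sum of two power-of-two terms, multiplying by a quantized weight reduces to two shifts and an add in hardware.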
To deploy deep neural networks on resource-limited devices, quantization has been widely explored. In this work, we study extremely low-bit networks, which offer tremendous speed-up and memory savings with quantized activations and weights. We first identify three overlooked issues in extremely low-bit networks: the squashed range of quantized values, gradient vanishing during backpropagation, and the unexploited hardware acceleration of ternary networks. By reparameterizing the quantized activation and weight vectors with a full-precision scale and offset applied to a fixed ternary vector, we decouple range and magnitude from direction to alleviate these three issues. The learnable scale and offset can automatically adjust the range of quantized values and the sparsity without gradient vanishing. A novel encoding and computation pattern is designed to support efficient computing for our reparameterized ternary network (RTN). Experiments on ResNet-18 for ImageNet demonstrate that the proposed RTN finds a much better trade-off between bitwidth and accuracy, and achieves up to 26.76% relative accuracy improvement compared with state-of-the-art methods. Moreover, we validate the proposed computation pattern on Field Programmable Gate Arrays (FPGA), where it brings 46.46x and 89.17x savings in power and area, respectively, compared with full-precision convolution.
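The reparameterization can be sketched as follows: a fixed ternary direction rescaled by a learnable full-precision scale and offset, with the dot product splitting into a cheap add/subtract part plus an offset correction. Names are illustrative, not the paper's exact notation.

```python
# Illustrative sketch of the RTN reparameterization; names are assumed.

def rtn_dequantize(ternary, scale, offset):
    """Map ternary codes {-1, 0, +1} back to real values scale*t + offset."""
    assert all(t in (-1, 0, 1) for t in ternary)
    return [scale * t + offset for t in ternary]

def rtn_dot(ternary_w, acts, scale, offset):
    """Dot product splits into a cheap ternary part plus an offset correction:
    sum((s*t + o) * a) = s * sum(t*a) + o * sum(a)."""
    ternary_part = sum(t * a for t, a in zip(ternary_w, acts))  # adds/subtracts only
    return scale * ternary_part + offset * sum(acts)

w_t  = [1, -1, 0, 1]
acts = [2.0, 1.0, 3.0, 0.5]
print(rtn_dot(w_t, acts, scale=0.4, offset=0.1))
# matches sum(w * a) for the dequantized weights [0.5, -0.3, 0.1, 0.5]
```

The ternary part uses only additions and subtractions (and skips zeros entirely), while the scale and offset absorb the range and magnitude, which is the decoupling the abstract describes.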
To reduce memory footprint and run-time latency, techniques such as neural network pruning and binarization have been explored separately. However, it is unclear how to combine the best of both worlds to obtain extremely small and efficient models. In this paper, we, for the first time, define the filter-level pruning problem for binary neural networks, which cannot be solved by simply migrating existing structural pruning methods for full-precision models. A novel learning-based approach is proposed to prune filters in our main/subsidiary network framework, where the main network is responsible for learning representative features to optimize the prediction performance, and the subsidiary component works as a filter selector on the main network. To avoid gradient mismatch when training the subsidiary component, we propose a layer-wise and bottom-up scheme. We also provide a theoretical and experimental comparison between our learning-based method and greedy rule-based methods. Finally, we empirically demonstrate the effectiveness of our approach applied to several binary models, including binarized NIN, VGG-11, and ResNet-18, on various image classification datasets.
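A toy sketch of the filter-selector role of the subsidiary component, assuming a simple hard-threshold gate per filter; the gate form and all names are illustrative, not the paper's exact training scheme.

```python
# Assumed hard-threshold gate for illustration; not the paper's exact method.

def select_filters(filters, gate_logits, threshold=0.0):
    """Keep filter k iff its subsidiary gate logit exceeds the threshold."""
    mask = [1 if g > threshold else 0 for g in gate_logits]
    kept = [f for f, m in zip(filters, mask) if m]
    return kept, mask

filters = ["f0", "f1", "f2", "f3"]   # stand-ins for conv filters
gates   = [0.7, -0.2, 1.3, -1.1]     # hypothetically learned by the subsidiary net
kept, mask = select_filters(filters, gates)
print(kept, mask)                    # -> ['f0', 'f2'] [1, 0, 1, 0]
```

In the paper's framework the gate values are learned rather than hand-set, and the layer-wise bottom-up schedule exists precisely because pushing gradients through such hard selections is ill-behaved.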
Binary neural networks (BNNs) have been studied extensively since they run dramatically faster, with lower memory and power consumption, than floating-point networks, thanks to the efficiency of bit operations. However, contemporary BNNs whose weights and activations are both single bits suffer from severe accuracy degradation. To understand why, we investigate the representation ability, speed, and bias/variance of BNNs through extensive experiments. We conclude that the error of BNNs is predominantly caused by intrinsic instability (at training time) and non-robustness (at train and test time). Inspired by this investigation, we propose the Binary Ensemble Neural Network (BENN), which leverages ensemble methods to improve the performance of BNNs at limited efficiency cost. While ensemble techniques have been broadly believed to be only marginally helpful for strong classifiers such as deep neural networks, our analyses and experiments show that they are naturally a perfect fit for boosting BNNs. We find that our BENN, which is faster and much more robust than state-of-the-art binary networks, can even surpass the accuracy of the full-precision floating-point network with the same architecture.
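The ensemble idea can be sketched with stand-in classifiers: averaging member outputs lets the ensemble recover from one unstable member's misprediction, which is exactly the variance reduction that helps high-variance BNNs. The toy "networks" below are placeholder callables, not real BNNs.

```python
# Toy illustration of ensemble averaging; member "networks" are stand-ins.

def ensemble_predict(bnns, x):
    """Average member class scores, then take the argmax class."""
    n_classes = len(bnns[0](x))
    avg = [sum(net(x)[c] for net in bnns) / len(bnns) for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: avg[c])

# three stand-in "binary networks": two are right, one unstable member is wrong
net_a = lambda x: [0.2, 0.8]
net_b = lambda x: [0.4, 0.6]
net_c = lambda x: [0.8, 0.2]   # mispredicts this input on its own

print(ensemble_predict([net_c], None))                # -> 0 (lone member is wrong)
print(ensemble_predict([net_a, net_b, net_c], None))  # -> 1 (ensemble recovers)
```

Because each binary member is very cheap, several of them still cost far less than one full-precision network, which is the efficiency argument behind BENN.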
We present a full-stack optimization framework for accelerating inference of CNNs (Convolutional Neural Networks) and validate the approach with field-programmable gate array (FPGA) implementations. By jointly optimizing CNN models, computing architectures, and hardware implementations, our full-stack approach achieves unprecedented performance in the trade-off space characterized by inference latency, energy efficiency, hardware utilization, and inference accuracy. As a validation vehicle, we have implemented a 170MHz FPGA inference chip achieving 2.28ms latency for the ImageNet benchmark. The achieved latency is among the lowest reported in the literature while achieving comparable accuracy. However, our chip shines in that it has 9x higher energy efficiency compared to other implementations achieving comparable latency. A highlight of our full-stack approach that contributes to the achieved high energy efficiency is an efficient Selector-Accumulator (SAC) architecture for implementing the multiplier-accumulator (MAC) operation present in any digital CNN hardware. For instance, compared to an FPGA implementation of a traditional 8-bit MAC, SAC substantially reduces required hardware resources (4.85x fewer Look-up Tables) and power consumption (2.48x).
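The SAC idea can be sketched in software: when each weight is a signed power of two, multiply-accumulate reduces to selecting a shifted copy of the activation and adding it, so no multiplier is needed. The (sign, shift) encoding below is an assumption for illustration, not the chip's exact datapath.

```python
# Software sketch of a Selector-Accumulator; the weight encoding is assumed.

def sac_accumulate(acc, activation, sign, shift):
    """acc += sign * (activation << shift) -- a select-and-add, no multiplier."""
    return acc + sign * (activation << shift)

def sac_dot(acts, coded_weights):
    """coded_weights: list of (sign, shift) pairs, i.e. weight = sign * 2^shift."""
    acc = 0
    for a, (sign, shift) in zip(acts, coded_weights):
        acc = sac_accumulate(acc, a, sign, shift)
    return acc

acts    = [3, 5, 2]
weights = [(1, 2), (-1, 0), (1, 1)]   # encodes weights +4, -1, +2
print(sac_dot(acts, weights))          # -> 11, i.e. 3*4 - 5*1 + 2*2
```

In hardware, the shift is just wiring and the sign is an add/subtract select, which is where the LUT and power savings over a full 8-bit multiplier come from.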
We present the Maestro memory-on-logic 3D-IC architecture for coordinated parallel use of a plurality of systolic arrays (SAs) in performing deep neural network (DNN) inference. Maestro reduces the under-utilization common for a single large SA by allowing parallel use of many smaller SAs on DNN weight matrices of varying shapes and sizes. In order to buffer intermediate results in memory blocks (MBs) and provide coordinated high-bandwidth communication between SAs and MBs in transferring weights and results, Maestro employs three innovations. (1) An SA on the logic die can access its corresponding MB on the memory die over a short distance using 3D-IC interconnects; (2) through an efficient switch based on H-trees, an SA can access any MB with low latency; and (3) the switch can combine partial results from SAs in an elementwise fashion before writing back to a destination MB. We describe the Maestro architecture, including a circuit and layout design, detail the scheduling of the switch, analyze system performance for real-time inference applications using input with batch size equal to one, and showcase applications in deep learning inference, with ShiftNet for computer vision and recent Transformer models for natural language processing. For the same total number of systolic cells, Maestro, with multiple smaller SAs, achieves 16x and 12x latency improvements over a single large SA on ShiftNet and Transformer, respectively. Compared to a floating-point GPU implementation of ShiftNet and Transformer, a baseline Maestro system with 4,096 SAs (each with 8x8 systolic cells) provides significant latency improvements of 30x and 47x, respectively.
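The parallel-SA idea can be sketched as tiled matrix multiplication: each 8x8 tile of the weight matrix maps to one small SA, and partial results along the reduction dimension are combined elementwise, as the switch does before write-back. This is a pure-Python stand-in, not a model of the hardware.

```python
# Pure-Python sketch of tiling a matmul across many small systolic arrays.

def matmul_tiled(x, W, tile=8):
    """y = x @ W computed as an elementwise sum of per-tile partial products;
    each (i0, j0) tile stands in for the work of one small 8x8 SA."""
    n, m = len(W), len(W[0])
    y = [0.0] * m
    for i0 in range(0, n, tile):            # tiles along the reduction dim
        for j0 in range(0, m, tile):        # tiles along the output dim
            for j in range(j0, min(j0 + tile, m)):
                partial = sum(x[i] * W[i][j] for i in range(i0, min(i0 + tile, n)))
                y[j] += partial             # elementwise combine of partials
    return y

x = [1.0] * 10
W = [[float(i + j) for j in range(10)] for i in range(10)]
ref = [sum(x[i] * W[i][j] for i in range(10)) for j in range(10)]
print(matmul_tiled(x, W) == ref)   # tiled result matches the direct product
```

A small, oddly shaped weight matrix occupies only the tiles it needs, while a single large SA would leave most of its cells idle on the same workload; that is the under-utilization argument.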
How to develop slim and accurate deep neural networks has become crucial for real-world applications, especially for those employed in embedded systems. Though previous work along this research line has shown some promising results, most existing methods either fail to significantly compress a well-trained deep network or require a heavy retraining process for the pruned network to recover its prediction performance. In this paper, we propose a new layer-wise pruning method for deep neural networks. In our proposed method, the parameters of each individual layer are pruned independently based on second-order derivatives of a layer-wise error function with respect to the corresponding parameters. We prove that the final prediction performance drop after pruning is bounded by a linear combination of the reconstruction errors incurred at each layer. By controlling the layer-wise errors properly, one only needs to perform a light retraining process on the pruned network to restore its original prediction performance. We conduct extensive experiments on benchmark datasets to demonstrate the effectiveness of our pruning method compared with several state-of-the-art baseline methods. Code for our work is released at: https://github.com/csyhhu/L-OBS
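The second-order criterion can be sketched with the classic Optimal Brain Surgeon formulas this line of work builds on: for a layer-wise quadratic error with Hessian H, the cost of zeroing weight q is w_q^2 / (2 [H^-1]_qq), and the remaining weights receive a compensating update. The tiny dense solver below is for illustration only, not the paper's full pipeline.

```python
# Illustrative OBS-style pruning step; the small Gauss-Jordan inverse is a
# stand-in for the paper's efficient layer-wise computation.

def invert(M):
    """Gauss-Jordan inverse of a small dense matrix (illustration only)."""
    n = len(M)
    A = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(M)]
    for col in range(n):
        p = max(range(col, n), key=lambda r: abs(A[r][col]))  # partial pivot
        A[col], A[p] = A[p], A[col]
        piv = A[col][col]
        A[col] = [v / piv for v in A[col]]
        for r in range(n):
            if r != col:
                f = A[r][col]
                A[r] = [v - f * w for v, w in zip(A[r], A[col])]
    return [row[n:] for row in A]

def obs_prune_one(w, H):
    """Zero the weight with the smallest second-order saliency
    w_q^2 / (2 * Hinv[q][q]) and compensate the remaining weights."""
    Hinv = invert(H)
    q = min(range(len(w)), key=lambda i: w[i] ** 2 / (2 * Hinv[i][i]))
    delta = [-w[q] / Hinv[q][q] * Hinv[i][q] for i in range(len(w))]
    return [wi + d for wi, d in zip(w, delta)], q

w = [0.1, 1.5, -0.9]
H = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
new_w, pruned = obs_prune_one(w, H)
print(pruned, new_w)   # prunes index 0 (the small weight) and zeroes it
```

The compensating update is what distinguishes this family of methods from magnitude pruning: the surviving weights absorb the removed weight's contribution, so only light retraining is needed afterwards.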