Towards High Performance Video Object Detection for Mobiles
Xizhou Zhu*, Jifeng Dai, Xingchi Zhu*, Yichen Wei, and Lu Yuan. arXiv Tech Report, April 2018.

Abstract. There has been significant progress in image object detection in recent years. Nevertheless, video object detection has received little attention, although it is more challenging and more important in practical scenarios. This paper describes a lightweight network architecture for mobile video object detection. For example, we achieve a 60.2% mAP score on ImageNet VID validation at a speed of 25.6 frames per second on mobiles (e.g., Huawei Mate 8).

By contrast, previous works [44, 45, 46, 43] based on either convolutional LSTM or convolutional GRU do not consider such a design, since they operate on consecutive frames, where object displacement is small and can be neglected. Obviously, they only model short-term dependencies. The network architecture of Light Flow is illustrated in Table 1. In the decoder, the feature maps are fed to multiple deconvolution layers to produce the high-resolution flow prediction. The middle panel of Table 2 compares the proposed Light Flow with existing flow estimation networks on the Flying Chairs test set (384×512 input resolution). 30 object categories are involved, which are a subset of the ImageNet DET annotated categories. All images are of 1920 (width) by 1080 (height) pixels.
Figure 1 presents the speed-accuracy curves of different systems on ImageNet VID validation. The curves are drawn by adjusting the key frame duration l; the curve with flow guidance surpasses that without flow guidance. Existing methods all seek to improve the speed-accuracy trade-off by optimizing the image object detection network.

For the feature network, we adopt the state-of-the-art lightweight MobileNet [13] as the backbone, which is designed for mobile recognition tasks. However, we need to carefully redesign both the feature network and the detection network for mobiles, by considering speed, model size, and accuracy together. Our approach is primarily built on two principles: propagating features on the majority of non-key frames, while computing and aggregating features on sparse key frames. Current best practice [19, 20, 21] exploits temporal information via sparse feature propagation and multi-frame feature aggregation to address the speed and accuracy issues, respectively. Until very recently, however, only two works sought to exploit temporal information for addressing this problem, and it is unclear whether these key principles apply under very limited computational resources, or whether such a lightweight flow network would effectively guide feature propagation.

Following the protocol in [32], flow accuracy is evaluated by the average end-point error (EPE). In each mini-batch of SGD, either n+1 nearby video frames from ImageNet VID, or a single image from ImageNet DET, are sampled at a 1:1 ratio.
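As a concrete reference for the metric, the average end-point error over a set of predicted and ground-truth flow vectors can be computed as below (a minimal pure-Python sketch; flows are lists of (u, v) tuples here, not the network's actual tensor format):

```python
import math

def average_epe(pred_flow, gt_flow):
    """Average end-point error: the mean Euclidean distance between
    predicted and ground-truth 2-D flow vectors."""
    assert len(pred_flow) == len(gt_flow) and pred_flow
    total = 0.0
    for (pu, pv), (gu, gv) in zip(pred_flow, gt_flow):
        total += math.hypot(pu - gu, pv - gv)
    return total / len(pred_flow)

# a prediction off by a (3, 4) vector at one pixel contributes an error of 5
print(average_epe([(3.0, 4.0), (0.0, 0.0)], [(0.0, 0.0), (0.0, 0.0)]))  # 2.5
```

A "15% relative increase in EPE", as reported later for Light Flow, means this number grows by a factor of 1.15 relative to the heavier baseline network.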
In its improvements, like SSDLite [50] and Tiny SSD [17], more efficient feature extraction networks are also utilized. Comparing the different curves, we observe that under adequate computational power, networks of higher complexity (α=1.0) lead to a better speed-accuracy tradeoff. When l=1, the image recognition network is densely applied on each frame, as in the single-frame baseline. The ϕ function with ReLU nonlinearity leads to a 3.9% higher mAP score than with tanh nonlinearity.

On one hand, sparse feature propagation is used in [19, 21] to save expensive feature computation on most frames; extending it to exploit sparse key frame features would be non-trivial. The key-frame object detector is MobileNet+Light-Head R-CNN. Specifically, given two succeeding key frames k and k′, the aggregated feature at frame k′ is computed by ^Fk′ = G(^Fk→k′, Fk′), where ^Fk→k′ is the aggregated feature of frame k propagated to frame k′, and G is the flow-guided aggregation function.

Fast YOLO reports accuracy on a subset of ImageNet VID where the split is not publicly known, so we cannot compare with it. Here we choose to integrate Light-Head R-CNN into our system, thanks to its outstanding performance. Our system surpasses all the existing systems by a clear margin.
Meanwhile, YOLOv2, SSDLite and Tiny YOLO obtain accuracies of 58.7%, 57.1%, and 44.1% at frame rates of 0.3, 3.8 and 2.2 fps, respectively. Feature extraction and aggregation operate only on sparse key frames, while lightweight feature propagation is performed on the majority of non-key frames.

When applying Light Flow in our method, two modifications are made to get further speedup. Second, since Light Flow is very small and has computation comparable to the detection network Ndet, sparse feature propagation is also applied on the intermediate feature maps of the detection network (see Section 3.3: the 256-d feature maps in RPN [5], and the 490-d feature maps in Light-Head R-CNN [23]) to further reduce computation on non-key frames. Here Fk = Nfeat(Ik) is the feature of key frame k, and W represents the differentiable bilinear warping function.

YOLO [15] and SSD [10] are one-stage object detectors, where the detection result is directly produced by the network in a sliding-window fashion. Our method shows better speed-accuracy performance than the single-stage detectors. Built on the two principles, the latest work [21] provides a good speed-accuracy tradeoff on desktop GPUs.
By default, the key-frame object detector is MobileNet+Light-Head R-CNN, and flow is estimated by Light Flow. Our system is one order of magnitude faster than the best previous effort on fast object detection, with on-par accuracy (see Figure 1). The rightmost panel of Table 2 presents the results.

These structures are general, but not specifically designed for object detection tasks. Each convolution operation is followed by batch normalization. Actually, the original FlowNet is so heavy that a detection system based on it is even 2.7× slower than simply applying the MobileNet+Light-Head R-CNN detector on each frame. In our model, to reduce the computation of RPN, 256-d intermediate feature maps are utilized, half the dimension originally used in [5]. Since the contents of consecutive frames are highly correlated, exhaustive feature extraction on most frames is unnecessary. As for accuracy, detection suffers from deteriorated appearances in videos that are seldom observed in still images, such as motion blur, video defocus, and rare poses.

The detection network Ndet is applied on ^Fk′ to get detection predictions for the key frame k′. In this paper, we present a lightweight network architecture for video object detection on mobiles.
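The sparse key-frame scheduling just described can be sketched as a driver loop. Nfeat, Nflow, Ndet, the warping W, and the aggregation G are passed in as stand-in callables; their real counterparts are deep networks, so everything below is illustrative only:

```python
def detect_video(frames, l, n_feat, n_flow, n_det, warp, aggregate):
    """Run detection over a frame sequence: every l-th frame is a key
    frame (full feature extraction, then aggregation with the previous
    aggregated key feature); other frames reuse flow-warped key features."""
    results = []
    agg_feat, key_frame = None, None
    for i, frame in enumerate(frames):
        if i % l == 0:                        # key frame
            feat = n_feat(frame)
            if agg_feat is None:
                agg_feat = feat               # first key frame: nothing to fuse
            else:
                flow = n_flow(key_frame, frame)
                agg_feat = aggregate(warp(agg_feat, flow), feat)
            key_frame = frame
            results.append(n_det(agg_feat))
        else:                                 # non-key frame: cheap propagation
            flow = n_flow(key_frame, frame)
            results.append(n_det(warp(agg_feat, flow)))
    return results
```

With scalar stand-ins (features = pixel values, flow = frame difference, warping = addition), the loop reproduces each frame's value exactly, which makes the control flow easy to verify.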
Following the practice in [48, 49], model training and evaluation are performed on the 3,862 training video snippets and the 555 validation video snippets, respectively. Our method achieves an accuracy of 60.2% at 25.6 fps; recognition on the key frame alone is still not fast enough. The flow estimation accuracy drop is small (15% relative increase in EPE). Also, during training, only a single loss function is applied on the averaged optical flow prediction, instead of multiple loss functions after each prediction. Table 5 summarizes the results. Experiments are performed on ImageNet VID [47], a large-scale benchmark for video object detection. First, applying deep networks on all video frames introduces unaffordable computational cost.
A flow-guided GRU module is designed to effectively aggregate features on key frames. A much cheaper Nflow is thus necessary. Following the practice in MobileNet [13], two width multipliers, α and β, are introduced for controlling the computational complexity by adjusting the network width. Two input RGB frames are concatenated to form a 6-channel input. In SGD, the n+1 nearby video frames Ii, Ik, Ik−l, Ik−2l, …, Ik−(n−1)l (0 ≤ i−k < l) are sampled. By varying the input image frame size (shorter side in {448, 416, 384, 352, 320, 288, 256, 224} for SSDLite and Tiny YOLO, and {320, 288, 256, 224, 192, 160, 128} for YOLOv2), we can draw their speed-accuracy trade-off curves.

However, [20] aggregates feature maps from nearby frames in a linear and memoryless way. Then, two sibling fully connected layers are applied on the warped feature to predict RoI classification and regression. During inference, feature maps on any non-key frame i are propagated from its preceding key frame k by ^Fk→i = W(^Fk, Mi→k), where Mi→k = Nflow(Ik, Ii) is the estimated flow field. In both training and inference, the images are resized to a shorter side of 224 pixels for the image recognition network and 112 pixels for the flow network.

Recently, there has been rising interest in building very small, low-latency models that can easily match the design requirements of mobile and embedded vision applications, for example SqueezeNet [12], MobileNet [13], and ShuffleNet [14]. The inference pipeline is illustrated in Figure 2.
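To see how a width multiplier changes cost, note that scaling every layer's input and output channels by the multiplier shrinks each convolution's parameters (and FLOPs) roughly quadratically. A toy parameter count makes this concrete (the layer sizes below are hypothetical, not taken from the paper):

```python
def scaled_conv_params(c_in, c_out, k, multiplier):
    # parameters of a k x k convolution after scaling both channel
    # counts by the width multiplier (biases omitted for brevity)
    return int(c_in * multiplier) * int(c_out * multiplier) * k * k

full = scaled_conv_params(128, 128, 3, 1.0)   # 147456 parameters
half = scaled_conv_params(128, 128, 3, 0.5)   # 36864 parameters, ~4x fewer
```

This quadratic behavior is why moving α or β from 1.0 to 0.5 buys a large speedup at a modest accuracy cost.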
If computation allows, it is more efficient to increase accuracy by making the flow-guided GRU module wider (a 1.2% mAP score increase when enlarging the channel width from 128-d to 256-d) rather than by stacking multiple layers (accuracy drops when stacking 2 or 3 layers). Feature aggregation should be performed on aligned feature maps; otherwise, displacements caused by large object motion would cause severe errors in aggregation. To improve detection accuracy, flow-guided feature aggregation (FGFA) [20] aggregates feature maps from nearby frames, which are aligned well through the estimated flow. All the modules in the entire architecture, including Nfeat, Ndet and Nflow, can be jointly trained for the video object detection task. Long-term dependency in aggregation is also favoured, because more temporal information can be fused together for better feature quality.

In the decoder part, each deconvolution operation is replaced by nearest-neighbor upsampling followed by a depthwise separable convolution. However, directly applying these detectors to videos faces new challenges. First, a 3×3 convolution is applied on top to reduce the feature dimension to 128, and then nearest-neighbor upsampling is utilized to reduce the feature stride from 32 to 16. The ReLU nonlinearity seems to converge faster than tanh in our network. For key frames, we need a lightweight single-image object detector, which consists of a feature network and a detection network. In the forward pass, Ik−(n−1)l is assumed to be a key frame, and the inference pipeline is performed exactly as described above.
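The modifications to the GRU can be made concrete with a toy, element-wise version of the state update. Scalar weights stand in for the 3×3 convolutions of the real module, the candidate activation is ReLU rather than tanh, and the previous state is assumed to arrive already flow-warped; this is a sketch, not the paper's exact module:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def flow_guided_gru_step(h_warped, x, w):
    """One element-wise GRU update. h_warped is the previous key-frame
    state after flow-guided warping; x is the current key-frame feature;
    w holds hypothetical scalar weights (wz, uz, wr, ur, wh, uh)."""
    z = sigmoid(w['wz'] * x + w['uz'] * h_warped)              # update gate
    r = sigmoid(w['wr'] * x + w['ur'] * h_warped)              # reset gate
    h_cand = max(0.0, w['wh'] * x + w['uh'] * (r * h_warped))  # ReLU candidate
    return (1.0 - z) * h_warped + z * h_cand                   # gated blend
```

The gated blend is what lets the module keep long-term memory across key frames: with z near 0, the warped history passes through nearly unchanged.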
Here ^Fk is the aggregated feature map of key frame k, and W represents the differentiable bilinear warping function also used in [19]. The detection network Ndet is then applied on the propagated feature ^Fk→i to get detection predictions for the non-key frame i. The width multipliers α and β are set as 1.0 and 0.5 respectively, and the key frame duration length l is set as 10. Such design choices are vital towards high performance video object detection. In the aggregation, ⊙ denotes element-wise multiplication, and the weight Wk→i is adaptively computed as the similarity between the propagated feature maps Fk→i and the feature maps Fi at frame i. Flow estimation is the key to feature propagation and aggregation. Table 4 further compares the proposed flow-guided GRU method with the feature aggregation approach in [21]. Video object detection is more challenging than image object detection. In YOLO and its improvements, like YOLOv2 [11] and Tiny YOLO [16], specifically designed feature extraction networks are utilized for computational efficiency. Our approach extends prior works with three new techniques and steadily pushes forward the performance envelope (speed-accuracy tradeoff) towards high performance video object detection.

On sparse key frames, we present flow-guided Gated Recurrent Unit (GRU) based feature aggregation, an effective aggregation on a memory-limited platform. The second step is the detection network, which generates the detection result y from the feature maps F by performing region classification and bounding box regression over either sparse object proposals [2, 3, 4, 5, 6, 7, 8, 9] or dense sliding windows [10, 15, 11, 31], via a multi-branched sub-network, namely Ndet(F) = y.
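A minimal version of the warping function W can be written for a single-channel map, with the flow stored as per-pixel (dy, dx) offsets; the real operator is a batched, multi-channel, differentiable tensor op, so this is only a sketch of the sampling rule:

```python
import math

def bilinear_warp(feat, flow):
    """Warp a single-channel 2-D map by a per-pixel flow field
    (flow[y][x] = (dy, dx)); each output pixel samples feat at
    (y + dy, x + dx) with bilinear interpolation, zero outside."""
    h, w = len(feat), len(feat[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dy, dx = flow[y][x]
            sy, sx = y + dy, x + dx
            y0, x0 = int(math.floor(sy)), int(math.floor(sx))
            wy, wx = sy - y0, sx - x0
            acc = 0.0
            for yy, fy in ((y0, 1.0 - wy), (y0 + 1, wy)):
                for xx, fx in ((x0, 1.0 - wx), (x0 + 1, wx)):
                    if 0 <= yy < h and 0 <= xx < w and fy * fx:
                        acc += fy * fx * feat[yy][xx]
            out[y][x] = acc
    return out
```

Because the interpolation weights are piecewise-linear in the flow values, gradients can flow back into the flow network, which is what makes end-to-end training of Nflow possible.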
To give more detailed information, a 1×1 convolution with 128 filters is applied to the last feature maps with feature stride 16, and the result is added to the upsampled 128-d feature maps. Despite the recent success of video object detection on Desktop GPUs, its architecture is still far too heavy for mobiles. On top of existing image detectors, our system can further significantly improve the speed-accuracy trade-off curve. Light Flow only causes a minor drop in accuracy (15% increase in end-point error) but theoretically speeds up flow estimation by nearly 65× (see Table 2). Specifically, FlowNet [32] takes 11.8× the FLOPs of MobileNet [13] under the same input resolution. Without sparse key frames, the approach cannot speed up upon the single-frame baseline. It would be interesting to study this problem in the future. Recently, [39] showed that the Gated Recurrent Unit (GRU) [40] is more powerful in modeling long-term dependencies than LSTM [41] and RNN [42], because nonlinearities are incorporated into the network state updates. Theoretical computation is counted in FLOPs (floating point operations; note that a multiply-add is counted as 2 operations). The final result yi for frame Ii incurs a loss against the ground truth annotation. Thus, flow estimation would not be a bottleneck in our mobile video object detection system.
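The fusion step just described (upsample the coarse stride-32 map 2× and add it to the stride-16 map) can be sketched in miniature. The convolutions that first bring both maps to 128 channels are left out, so this only shows the resolution bookkeeping on single-channel toy maps:

```python
def nn_upsample2x(feat):
    """Nearest-neighbor 2x upsampling of a single-channel 2-D map."""
    out = []
    for row in feat:
        wide = [v for v in row for _ in (0, 1)]   # repeat each column
        out.append(wide)
        out.append(list(wide))                    # repeat each row
    return out

def fuse(coarse, fine):
    """Element-wise sum of the upsampled coarse (stride-32) map with
    the fine (stride-16) map, as in the 128-d fusion described above."""
    up = nn_upsample2x(coarse)
    return [[a + b for a, b in zip(ru, rf)] for ru, rf in zip(up, fine)]
```

The same nearest-neighbor upsampling is what replaces deconvolution in the Light Flow decoder; it is parameter-free, so the only learned cost is the depthwise separable convolution that follows it.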
To the best of our knowledge, for the first time, we achieve realtime video object detection on a mobile device with reasonably good accuracy. Inference on untrimmed video sequences leads to accuracy on par with that on trimmed sequences, and is easier to implement. There are also some other endeavors trying to make object detection efficient enough for devices with limited computational power. Given a key frame k′ and its preceding key frame k, feature maps are first extracted by Fk′ = Nfeat(Ik′), and then aggregated with the preceding key frame's aggregated feature maps ^Fk. Directly applying these detectors to video object detection faces challenges from two aspects. A possible issue with the current approach is that there would be a short latency in processing online streaming videos. A lightweight image object detector is an indispensable component of our video object detection system. The MobileNet module is pre-trained on the ImageNet classification task [47]. Compared with the original GRU [40], there are three key differences. A very small network, Light Flow, is designed for establishing correspondence across frames.
We do not dive into the details of the various technical designs. Even the smallest FlowNet Inception used in [19] takes 1.6× more FLOPs. Third, we apply GRU only on sparse key frames (e.g., every 10th frame) instead of on consecutive frames. Figure 4 presents the speed-accuracy trade-off curve of our method, drawn by varying the key frame duration length l from 1 to 20. In Light-Head R-CNN, position-sensitive feature maps [6] are exploited to relieve the burden. We experiment with α ∈ {1.0, 0.75, 0.5} and β ∈ {1.0, 0.75, 0.5}. First, 3×3 convolution is used instead of fully connected matrix multiplication, since fully connected matrix multiplication is too costly when GRU is applied on image feature maps. Additionally, we exploit a light image object detector for computing features on key frames, which leverages advanced and efficient techniques such as depthwise separable convolution [22] and Light-Head R-CNN [23]. It would involve feature alignment, which is also lacking in [44]. We tried training on sequences of 2, 4, 8, 16, and 32 frames. Its performance also cannot be easily compared with ours: direct comparison is difficult, because the paper neither reports accuracy numbers on any dataset for their method nor has public code.
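For intuition on why depthwise separable convolution [22] is cheap, a small cost model under the same FLOPs convention (multiply-add counted as 2 operations) compares a standard k×k convolution with its depthwise-plus-pointwise factorization; the cost ratio works out to exactly 1/c_out + 1/k², so roughly an 8-9× saving for 3×3 kernels. The layer sizes below are hypothetical:

```python
def conv_flops(h, w, c_in, c_out, k=3):
    # standard k x k convolution over an h x w feature map
    return 2 * h * w * c_in * c_out * k * k

def separable_flops(h, w, c_in, c_out, k=3):
    # k x k depthwise convolution followed by a 1 x 1 pointwise convolution
    return 2 * h * w * c_in * k * k + 2 * h * w * c_in * c_out

ratio = separable_flops(56, 56, 128, 128) / conv_flops(56, 56, 128, 128)
# ratio is approximately 1/128 + 1/9, i.e. about 0.12
```

This factorization is the main source of Light Flow's and MobileNet's efficiency over their standard-convolution counterparts.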
With the increasing interest in computer vision use cases like self-driving cars, face recognition, and intelligent transportation systems, efficient recognition on mobile devices matters ever more in practice. The two-dimensional motion field Mi→k between two frames Ii and Ik is estimated through a flow network Nflow(Ik, Ii) = Mi→k, which is much cheaper than Nfeat. The accuracy is 51.2% at a frame rate of 50 Hz (α=0.5, β=0.5, l=10). Light Flow is designed in an encoder-decoder mode followed by multi-resolution optical flow predictors.
Object detection has achieved significant progress in recent years using deep neural networks. Second, recognition accuracy suffers from deteriorated appearances in videos that are seldom observed in still images, such as motion blur, video defocus, and rare poses. With increased key frame duration length, the accuracy drops gracefully as the computation overhead relieves. On all frames, we present Light Flow, a very small deep neural network to estimate feature flow, which offers instant availability on mobiles. In Fast YOLO [51], a modified YOLOv2 [11] detector is applied on sparse key frames, and the detected bounding boxes are directly copied to the non-key frames as their detection results. The snippets are at frame rates of 25 or 30 fps in general. It is worth noting that Light Flow achieves higher accuracy than the FlowNet Half and FlowNet Inception variants utilized in [19], with at least one order less computation overhead. RPN [5] and Light-Head R-CNN [23] are applied on the shared 128-d feature maps. To avoid dense aggregation on all frames, [21] suggested sparsely recursive feature aggregation, which operates only on sparse key frames. In [51], no end-to-end training for video object detection is performed. Figure 3 shows the speed-accuracy curves of our method with and without flow guidance.
Two recent works [20, 21] exploit flow to align features across frames. The accuracy of aggregation with flow guidance is noticeably higher than that of aggregation without flow guidance. Flow estimation, as the key and common component in feature propagation and aggregation, would otherwise become the bottleneck of computation: FlowNet [32] was originally proposed for accurate pixel-level optical flow estimation and is far too heavy here. Therefore, what are the proper principles for video object detection on mobiles?

For the feature network, the ending average pooling and the fully-connected layer of MobileNet [13] are removed. The estimated flow field is downsampled to match the resolution of the feature maps. Although Light Flow produces multiple multi-resolution flow predictions during training, only the finest prediction is used during inference. Utilizing Light Flow achieves accuracy very close to that of utilizing the heavy-weight FlowNet (61.2% vs. 61.5%), and is one order of magnitude faster.

The runtime speed is measured on a single 2.3GHz Cortex-A72 processor of a Huawei Mate 8. Speed-accuracy curves are presented for α×β ∈ {1.0, 0.75, 0.5} × {1.0, 0.75, 0.5}, which correspond to networks of different complexity; the network parameter number and theoretical computation change quadratically with the width multiplier. For mobile video object detection under limited computational power, networks of lower complexity (α=0.5) would perform better. GRU can model both short-term and long-term temporal dynamics for a wide variety of sequence learning and prediction tasks.