
Temporal Fusion in Object Detection/ Video Object Detection

Summaries of a few papers on the above topic.


Last updated 5 years ago

Feature Propagation

Looking Fast and Slow: Memory-Guided Mobile Video Object Detection

Summary:

  • They use a Fast and Slow backbone architecture for detection at each frame. They show evaluations for different numbers of times the small network is run after the bigger network (denoted by τ in the paper).

  • They use SSD with MobileNet, and on top of that a memory (LSTM) to propagate the previous state of the feature map.

  • They use an RL policy, trained with fairly sophisticated rewards, for adaptive keyframe selection (choosing when the bigger network needs to be run).
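Before the learned RL policy, the paper's fixed-interleaving baseline simply runs the large extractor every τ + 1 frames and the small one in between. A minimal sketch of that schedule (the function name and the "large"/"small" labels are placeholders, not from the paper):

```python
def interleave_schedule(num_frames, tau):
    """Return which feature extractor handles each frame: 'large' or 'small'.

    Frame 0 and every (tau + 1)-th frame after it use the large network;
    the tau frames in between reuse the memory with the small network.
    """
    schedule = []
    for t in range(num_frames):
        schedule.append("large" if t % (tau + 1) == 0 else "small")
    return schedule

print(interleave_schedule(8, 3))
# ['large', 'small', 'small', 'small', 'large', 'small', 'small', 'small']
```

The adaptive policy replaces this fixed modulo rule with a learned decision of when the large network is worth running.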

TSM: Temporal Shift Module for Efficient Video Understanding

Summary:

  • This module shifts part of the channels along the temporal dimension, which helps propagate feature information across frames. The shift itself is computation-free.
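A minimal sketch of the shift, here on a [T][C] list of per-frame channel values in plain Python (the real TSM shifts a fraction of channels of an [N, C, T, H, W] tensor in place):

```python
def temporal_shift(feats, fold_div=8):
    """feats: list over time of channel lists, shape [T][C].

    Shifts the first C // fold_div channels forward in time (frame t
    receives them from frame t-1) and the next C // fold_div backward
    (from frame t+1); vacated positions are zero-padded. There are no
    multiplications, so the shift costs nothing beyond data movement.
    """
    T, C = len(feats), len(feats[0])
    fold = C // fold_div
    out = [list(f) for f in feats]  # copy; remaining channels stay in place
    for t in range(T):
        for c in range(fold):  # shift forward: take from the previous frame
            out[t][c] = feats[t - 1][c] if t > 0 else 0
        for c in range(fold, 2 * fold):  # shift backward: take from the next frame
            out[t][c] = feats[t + 1][c] if t < T - 1 else 0
    return out

x = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]  # T=3 frames, C=4 channels
print(temporal_shift(x, fold_div=4))
# [[0, 6, 3, 4], [1, 10, 7, 8], [5, 0, 11, 12]]
```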

Mobile Video Object Detection with Temporally-Aware Feature Maps

Basically, they add an LSTM over the feature maps, so the feature map at frame t can use information from the feature map at frame t−1.
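As a toy stand-in for that recurrent propagation, one can picture the propagated state as a blend of the previous state and the current frame's features; the actual model is a convolutional LSTM whose gates are learned and input-dependent, not the fixed scalar `gate` used here:

```python
def gated_update(prev_state, feat, gate=0.5):
    """Blend the previous propagated state with the current frame's
    features. A fixed scalar `gate` stands in for the learned gates
    of the ConvLSTM in the paper."""
    return [gate * p + (1 - gate) * f for p, f in zip(prev_state, feat)]

state = [0.0, 0.0]
for frame_feats in ([2.0, 4.0], [2.0, 4.0]):
    state = gated_update(state, frame_feats)
# state now mixes information from both frames
```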

Feature Aggregation over Frames

Sequence Level Semantics Aggregation for Video Object Detection

  • They devise something called a Sequence Level Semantics Aggregation (SELSA) module.

  • Instead of considering a few neighbouring frames for combining features, they use the complete video in a global sense.

  • Existing works generally take video as sequential frames, and thus mainly utilize the temporal information to enhance the performance of a detector. For example, Flow Guided Feature Aggregation (FGFA) [36] uses at most 21 frames during training and testing, which is less than 5% of average video length. Here instead of taking a consecutive viewpoint, we propose to treat video as a bag of unordered frames and try to learn an invariant representation of each class on the full sequence level. This reinterprets video object detection from a sequential detection task to a multi-shot detection task.

  • Their method finds the semantic similarity between two region proposals output by the RPN across the spatio-temporal dimension using cosine similarity, then uses this similarity to aggregate proposal features. Every reference proposal is aggregated with its nearest proposals and passed ahead through the network as a normal proposal.

  • This method doesn't seem deployable in an online setting. Or maybe it could be, if instead of global aggregation we only use proposals seen up to frame t, i.e. find semantic similarity with proposals from past frames only.
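A rough sketch of similarity-weighted aggregation, assuming softmax weighting over cosine similarities (the actual SELSA module also applies learned transformations around this step):

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def aggregate(reference, proposals):
    """Softmax-weighted aggregation of proposal features by their
    cosine similarity to the reference proposal's features."""
    sims = [cosine(reference, p) for p in proposals]
    exps = [math.exp(s) for s in sims]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(reference)
    return [sum(w * p[i] for w, p in zip(weights, proposals)) for i in range(dim)]
```

Proposals that look semantically similar to the reference dominate the aggregate, regardless of where in the video they come from, which is what makes the "bag of unordered frames" view work.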

Towards High Performance Video Object Detection for Mobiles

  • They propagate features on the majority of non-key frames while computing and aggregating features on sparse key frames.

  • On all frames, we present Light Flow, a very small deep neural network to estimate feature flow, which offers instant availability on mobiles. On sparse key frame, we present flow-guided Gated Recurrent Unit (GRU) based feature aggregation, an effective aggregation on a memory-limited platform. Additionally, we also exploit a light image object detector for computing features on key frame, which leverage advanced and efficient techniques, such as depthwise separable convolution [22] and Light-Head R-CNN [23].

Object Detection in Video with Spatiotemporal Sampling Networks

  • They propose a Spatiotemporal Sampling Network (STSN) that uses deformable convolutions across time for object detection in videos.

  • STSN performs object detection in a video frame by learning to spatially sample features from the adjacent frames.

  • They train the STSN end-to-end on a large set of video frames labeled with bounding boxes.

Motion Propagation / Tracking based

T-CNN: Tubelets with Convolutional Neural Networks for Object Detection from Videos

  • A box sequence of the same object across different frames is called a tubelet.

  • A tubelet can be treated as a unit to apply the long-term constraint. Low detection confidence on some positive bounding boxes may result from moving blur, bad poses, or lack of enough training samples under particular poses. Therefore, if most bounding boxes of a tubelet have high confidence detection scores, the low confidence scores at certain frames should be increased to enforce its long-term consistency.

  • Temporal information is effectively incorporated into the proposed detection framework by locally propagating detection results across adjacent frames as well as globally revising detection confidences along tubelets generated from tracking algorithms.

  • This is a global detector but can be adjusted for an online working condition. They first run a normal still-image detector on all frames and then improve the detections using post-processing techniques:

    • Multi-context suppression: This step removes false positives that might be present in certain frames. Read further in the paper itself.

    • Motion-guided propagation: This removes false negatives, i.e. cases where objects present in the scene are not detected in some frames. Detections from one frame are propagated to others using the optical flow vector of each bounding box.

  • They do Tubelet Re-scoring, which includes high-confidence scoring, spatial max pooling, and tubelet classification. It is mainly for the global consistency of short tubelets. See the paper for what these actually mean.
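One simple way to picture the long-term constraint described above: if most boxes of a tubelet are confident, lift the scores of the few low-confidence ones. The thresholds and the lifting rule here are illustrative, not the paper's exact re-scoring scheme:

```python
import statistics

def enforce_tubelet_consistency(scores, high_thresh=0.7, min_fraction=0.8):
    """If at least `min_fraction` of the boxes in a tubelet score above
    `high_thresh`, lift the remaining low scores up to the tubelet median
    to enforce long-term consistency; otherwise leave the tubelet alone.
    Thresholds are hypothetical placeholders."""
    high = [s for s in scores if s >= high_thresh]
    if len(high) / len(scores) < min_fraction:
        return list(scores)  # tubelet not confident enough; leave unchanged
    median = statistics.median(scores)
    return [max(s, median) for s in scores]
```

For example, a tubelet scoring [0.9, 0.85, 0.1, 0.95, 0.9] has its one blur-induced 0.1 lifted to the median 0.9, while a uniformly low-scoring tubelet stays unchanged.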

Object Detection from Video Tubelets with Convolutional Neural Networks

  • This work shares some of its authors with the paper above and has a broadly similar idea.

Using Optical Flow

Flow-Guided Feature Aggregation for Video Object Detection

  • This is basically feature aggregation over frames using optical flow.
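A toy version of the flow-guided warping step, using an integer-valued flow field for simplicity (FGFA uses bilinear sampling for sub-pixel flow and then averages the warped neighbour features with adaptive, embedding-based weights):

```python
def warp_features(feat, flow):
    """Warp a neighbour frame's feature map toward the reference frame:
    out[y][x] = feat[y + dy][x + dx], where (dy, dx) = flow[y][x].
    Out-of-bounds samples are zero-filled. Real implementations use
    bilinear sampling since flow is sub-pixel."""
    H, W = len(feat), len(feat[0])
    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            dy, dx = flow[y][x]
            sy, sx = y + dy, x + dx
            if 0 <= sy < H and 0 <= sx < W:
                out[y][x] = feat[sy][sx]
    return out
```

After warping, each neighbour frame's features are aligned with the reference frame, so averaging them aggregates information about the same object locations across time.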
