Christoph Feichtenhofer
Graz University of Technology
feichtenhofer _at_

Research Statement

My research interests are in the fields of computer vision and machine learning, with a focus on learning effective video representations for dynamic scene understanding. In particular, I plan to explore computational theories that represent spatiotemporal visual information, within a confluence of machine vision and learning. I aim to find efficient solutions for problems that are grounded in applications such as recognition and detection from video.


2018 -
Research Scientist at Facebook
Facebook AI Research (FAIR), Menlo Park, CA, USA
2013 - 2018
University Assistant at Graz University of Technology
Institute of Electrical Measurement and Measurement Signal Processing (EMT), Graz, Austria
2015 - 2017
Visiting Researcher at University of Oxford
Worked with Prof. Andrew Zisserman
Visual Geometry Group (VGG), Oxford, UK
2014 - 2017
Visiting Researcher at York University
Worked with Prof. Richard P. Wildes
YorkU Vision Lab, Toronto, Canada
2014 - 2017
Graz University of Technology: PhD
Thesis: Deep Learning for Video Recognition
2012 - 2013
Graz University of Technology: MSc
Visiting Researcher at York University
Worked with Prof. Richard P. Wildes
YorkU Vision Lab, Toronto, Canada
2008 - 2011
Graz University of Technology: BSc

News & Highlights

I will start working as a Research Scientist at Facebook AI Research (FAIR) in Menlo Park, CA
Our work on video object detection made it into the BEST OF ICCV of Computer Vision News
"Detect to Track and Track to Detect" - article, website and talk
Award of Excellence granted by the Federal Ministry for Science and Research for an outstanding doctoral thesis
"Deep Learning for Video Recognition"
"What have we learned from deep representations for action recognition?" - slides: pdf or mp4 video
"Camera-based Vehicle Velocity Estimation using Spatiotemporal Depth and Motion Features" - slides: pdf or mp4 video
Moritz Kampelmuehler, Michael Mueller, and Christoph Feichtenhofer
"Convolutional Two-Stream Network Fusion for Video Action Recognition"
Doctoral Fellowship of the Austrian Academy of Sciences granted by the Federal Ministry for Science and Research
"Space-Time Representations for Dynamic Scene Understanding"
Received two Nvidia Academic Hardware Donations for research in
"Deep Convolutional Representations for Spatiotemporal Image Understanding"
"Bags of Spacetime Energies for Dynamic Scene Recognition"


What have we learned from deep representations for action recognition?
Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes, Andrew Zisserman
Tech report, arXiv, Jan. 2018
As the success of deep models has led to their deployment in all areas of computer vision, it is increasingly important to understand how these representations work and what they are capturing. In this paper, we shed light on deep spatiotemporal representations by visualizing what two-stream models have learned in order to recognize actions in video. We show that local detectors for appearance and motion objects arise to form distributed representations for recognizing human actions. Key observations include the following. First, cross-stream fusion enables the learning of true spatiotemporal features rather than simply separate appearance and motion features. Second, the networks can learn local representations that are highly class specific, but also generic representations that can serve a range of classes. Third, throughout the hierarchy of the network, features become more abstract and show increasing invariance to aspects of the data that are unimportant to desired distinctions (e.g. motion patterns across various speeds). Fourth, visualizations can be used not only to shed light on learned representations, but also to reveal idiosyncrasies of training data and to explain failure cases of the system.

Detect to Track and Track to Detect
Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman
IEEE International Conference on Computer Vision (ICCV) 2017 (spotlight)
Recent approaches for high accuracy detection and tracking of object categories in video consist of complex multistage solutions that become more cumbersome each year. In this paper we propose a ConvNet architecture that jointly performs detection and tracking, solving the task in a simple and effective way. Our contributions are threefold: (i) we set up a ConvNet architecture for simultaneous detection and tracking, using a multi-task objective for frame-based object detection and across-frame track regression; (ii) we introduce novel correlation features that represent object co-occurrences across time to aid the ConvNet during tracking; (iii) we link the frame level detections based on our across-frame tracklets to produce high accuracy detections at the video level. Our ConvNet architecture for spatiotemporal object detection is evaluated on the large-scale ImageNet VID dataset where it achieves state-of-the-art results. Our approach provides better single model performance than the winning method of the last ImageNet challenge while being conceptually much simpler. Finally, we show that by increasing the temporal stride we can dramatically increase the tracker speed.
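
The correlation features in contribution (ii) can be illustrated with a small sketch. The version below is one common form of a local correlation layer, written in plain Python under stated assumptions: feature maps are nested lists of shape H x W x C, and the names `correlation_features`, `make_map` and `max_disp` are hypothetical and do not come from the paper. For each position in frame t, it computes the dot product with the features of frame t+τ inside a small displacement window, so a strong response at displacement (p, q) suggests that the content moved by (p, q).

```python
def correlation_features(feat_a, feat_b, max_disp):
    """Local correlation between two feature maps (nested lists, H x W x C).

    Returns out[i][j] as a dict that maps each displacement (p, q), with
    -max_disp <= p, q <= max_disp, to the dot product of feat_a[i][j] and
    feat_b[i + p][j + q]. Out-of-bounds neighbours contribute 0.0.
    """
    height, width = len(feat_a), len(feat_a[0])
    out = []
    for i in range(height):
        row = []
        for j in range(width):
            responses = {}
            for p in range(-max_disp, max_disp + 1):
                for q in range(-max_disp, max_disp + 1):
                    ii, jj = i + p, j + q
                    if 0 <= ii < height and 0 <= jj < width:
                        responses[(p, q)] = sum(
                            a * b for a, b in zip(feat_a[i][j], feat_b[ii][jj])
                        )
                    else:
                        responses[(p, q)] = 0.0
            row.append(responses)
        out.append(row)
    return out


def make_map(height, width, hot, channels=2):
    """Toy feature map (assumption): zeros except a unit feature at `hot`."""
    fmap = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    fmap[hot[0]][hot[1]] = [1.0] * channels
    return fmap


frame_t = make_map(3, 4, hot=(1, 1))
frame_t_tau = make_map(3, 4, hot=(1, 2))  # the "object" moved one cell to the right
corr = correlation_features(frame_t, frame_t_tau, max_disp=1)

# Displacement with the strongest response at the object's original position.
best_shift = max(corr[1][1], key=corr[1][1].get)
print(best_shift)
```

In this toy example the strongest response at position (1, 1) is at displacement (0, 1), which matches the one-cell shift to the right. A real implementation would run on ConvNet feature maps inside a deep learning framework rather than on nested lists.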

Spatiotemporal Multiplier Networks for Video Action Recognition
Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017
This paper presents a general ConvNet architecture for video action recognition based on multiplicative interactions of spacetime features. Our model combines the appearance and motion pathways of a two-stream architecture by motion gating and is trained end-to-end. We theoretically motivate multiplicative gating functions for residual networks and empirically study their effect on classification accuracy. To capture long-term dependencies we inject identity mapping kernels for learning temporal relationships. Our architecture is fully convolutional in spacetime and able to evaluate a video in a single forward pass. Empirical investigation reveals that our model produces state-of-the-art results on two standard action recognition datasets.
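
The idea of motion gating can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes that the input to the residual branch of an appearance unit is multiplied elementwise by the motion features at the same layer, and the names `gated_residual_unit` and `toy_residual_fn` are hypothetical.

```python
def relu(values):
    """Elementwise rectified linear unit."""
    return [max(0.0, v) for v in values]


def gated_residual_unit(x_app, x_mot, residual_fn):
    """Appearance residual unit whose residual branch is gated by motion.

    Computes x_app + residual_fn(x_app * x_mot), where * is elementwise.
    The identity path carries the appearance features through unchanged.
    """
    gated = [a * m for a, m in zip(x_app, x_mot)]
    return [a + r for a, r in zip(x_app, residual_fn(gated))]


def toy_residual_fn(values):
    """Stand-in for the learned residual branch (assumption: scale, then ReLU)."""
    return relu([0.5 * v for v in values])


appearance = [1.0, 2.0, 3.0]
motion = [0.0, 1.0, 2.0]  # zero motion closes the gate for the first feature
output = gated_residual_unit(appearance, motion, toy_residual_fn)
print(output)
```

With this toy residual function, the first feature has zero motion and passes through the identity path unchanged, while the other two features receive a residual that grows with the motion signal.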

Temporal Residual Networks for Dynamic Scene Recognition
Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017
This paper combines three contributions to establish a new state-of-the-art in dynamic scene recognition. First, we present a novel ConvNet architecture based on temporal residual units that is fully convolutional in spacetime. Our model augments spatial ResNets with convolutions across time to hierarchically add temporal residuals as the depth of the network increases. Second, existing approaches to video-based recognition are categorized and a baseline of seven previously top performing algorithms is selected for comparative evaluation on dynamic scenes. Third, we introduce a new and challenging video database of dynamic scenes that more than doubles the size of those previously available. This dataset is explicitly split into two subsets of equal size that contain videos with and without camera motion to allow for systematic study of how this variable interacts with the defining dynamics of the scene per se. Our evaluations verify the particular strengths and weaknesses of the baseline algorithms with respect to various scene classes and camera motion parameters. Finally, our temporal ResNet boosts recognition performance and establishes a new state-of-the-art on dynamic scene recognition, as well as on the complementary task of action recognition.

Spatiotemporal Residual Networks for Video Action Recognition
Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes
Advances in Neural Information Processing Systems (NIPS) 2016
Two-stream Convolutional Networks (ConvNets) have shown strong performance for human action recognition in videos. Recently, Residual Networks (ResNets) have arisen as a new technique to train extremely deep architectures. In this paper, we introduce spatiotemporal ResNets as a combination of these two approaches. Our novel architecture generalizes ResNets for the spatiotemporal domain by introducing residual connections in two ways. First, we inject residual connections between the appearance and motion pathways of a two-stream architecture to allow spatiotemporal interaction between the two streams. Second, we transform pretrained image ConvNets into spatiotemporal networks by equipping these with learnable convolutional filters that are initialized as temporal residual connections and operate on adjacent feature maps in time. This approach slowly increases the spatiotemporal receptive field as the depth of the model increases and naturally integrates image ConvNet design principles. The whole model is trained end-to-end to allow hierarchical learning of complex spatiotemporal features. We evaluate our novel spatiotemporal ResNet using two widely used action recognition benchmarks where it exceeds the previous state-of-the-art.
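
The second step, turning a pretrained image ConvNet into a spatiotemporal one, can be illustrated with a toy temporal filter. The sketch below is an assumption for illustration, not the initialization used in the paper: it uses a centre-tap identity kernel, so that at initialization the temporal filter reproduces the per-frame features of the image network, and training can then move weight onto the adjacent frames. The names `temporal_conv`, `identity_kernel` and `mixing_kernel` are hypothetical.

```python
def temporal_conv(frames, kernel):
    """Filter a sequence of per-frame feature vectors along time.

    frames: list of T feature vectors (lists of C floats).
    kernel: odd-length list of scalar taps, shared across channels.
    Frames outside the sequence are treated as zeros. The kernel is applied
    without flipping (cross-correlation), as in most deep learning libraries.
    """
    half = len(kernel) // 2
    num_frames, channels = len(frames), len(frames[0])
    out = []
    for t in range(num_frames):
        acc = [0.0] * channels
        for k, weight in enumerate(kernel):
            src = t + k - half
            if 0 <= src < num_frames:
                for c in range(channels):
                    acc[c] += weight * frames[src][c]
        out.append(acc)
    return out


# Three frames with two feature channels each (toy values).
frames = [[1.0, 0.0], [2.0, 4.0], [3.0, 8.0]]

# Assumed initialization: only the centre tap is active, so the output
# equals the input and the pretrained per-frame behaviour is preserved.
identity_kernel = [0.0, 1.0, 0.0]
at_init = temporal_conv(frames, identity_kernel)

# After training, the taps may spread over adjacent frames (toy values).
mixing_kernel = [0.25, 0.5, 0.25]
after_training = temporal_conv(frames, mixing_kernel)
print(at_init)
print(after_training)
```

Stacking such filters over several layers widens the temporal receptive field gradually with depth, which is the behaviour the abstract describes.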

Convolutional Two-Stream Network Fusion for Video Action Recognition
Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2016
Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatiotemporal information. We make the following findings: (i) that rather than fusing at the softmax layer, a spatial and temporal network can be fused at a convolution layer without loss of performance, but with a substantial saving in parameters; (ii) that it is better to fuse such networks spatially at the last convolutional layer than earlier, and that additionally fusing at the class prediction layer can boost accuracy; finally (iii) that pooling of abstract convolutional features over spatiotemporal neighbourhoods further boosts performance. Based on these studies we propose a new ConvNet architecture for spatiotemporal fusion of video snippets, and evaluate its performance on standard benchmarks where this architecture achieves state-of-the-art results.

Dynamic Scene Recognition with Complementary Spatiotemporal Features
Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 2016
This paper presents Dynamically Pooled Complementary Features, a unified approach to dynamic scene recognition that analyzes a short video clip in terms of its spatial, temporal and color properties. The complementarity of these properties is preserved through all main steps of processing, including primitive feature extraction, coding and pooling. In the feature extraction step, spatial orientations capture static appearance, spatiotemporal oriented energies capture image dynamics and color statistics capture chromatic information. Subsequently, primitive features are encoded into a mid-level representation that has been learned for the task of dynamic scene recognition. Finally, a novel dynamic spacetime pyramid is introduced. This dynamic pooling approach can handle both global and local motion by adapting to the temporal structure, as guided by pooling energies. The resulting system provides online recognition of dynamic scenes that is thoroughly evaluated on the two current benchmark datasets and yields the best results to date on both. In-depth analysis reveals the benefits of explicitly modeling feature complementarity in combination with the dynamic spacetime pyramid, indicating that this unified approach should be well-suited to many areas of video analysis.

Dynamically Encoded Actions based on Spacetime Saliency
Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2015
Human actions typically occur over a well-localized extent in both space and time. Similarly, as typically captured in video, human actions have small spatiotemporal support in image space. This paper capitalizes on these observations by weighting feature pooling for action recognition over those areas within a video where actions are most likely to occur. To enable this operation, we define a novel measure of spacetime saliency. The measure relies on two observations regarding foreground motion of human actors: they typically exhibit motion that contrasts with that of their surrounding region and they are spatially compact. By using the resulting definition of saliency during feature pooling we show that action recognition performance achieves state-of-the-art levels on three widely considered action recognition datasets. Our saliency-weighted pooling can be applied to essentially any locally defined features and encodings thereof. Additionally, we demonstrate that inclusion of locally aggregated spatiotemporal energy features, which efficiently result as a by-product of the saliency computation, further boosts performance over reliance on standard action recognition features alone.

Bags of Spacetime Energies for Dynamic Scene Recognition
Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2014
This paper presents a unified bag of visual words (BoW) framework for dynamic scene recognition. The approach builds on primitive features that uniformly capture spatial and temporal orientation structure of the imagery (e.g., video), as extracted via application of a bank of spatiotemporally oriented filters. Various feature encoding techniques are investigated to abstract the primitives to an intermediate representation that is best suited to dynamic scene representation. Further, a novel approach to adaptive pooling of the encoded features is presented that captures spatial layout of the scene even while being robust to situations where camera motion and scene dynamics are confounded. The resulting overall approach has been evaluated on two standard, publicly available dynamic scene datasets. The results show that in comparison to a representative set of alternatives, the proposed approach outperforms the previous state-of-the-art in classification accuracy by 10%.

Fusing RFID and Computer Vision for Probabilistic Tag Localization
Michael Goller, Christoph Feichtenhofer, Axel Pinz
IEEE International Conference on RFID (IEEE RFID) 2014
The combination of RFID and computer vision systems is an effective approach to mitigate the limited tag localization capabilities of current RFID deployments. In this paper, we present a hybrid RFID and computer vision system for localization and tracking of RFID tags. The proposed system combines the information from the two complementary sensor modalities in a probabilistic manner and provides a high degree of flexibility. In addition, we introduce a robust data association method which is crucial for the application in practical scenarios. To demonstrate the performance of the proposed system, we conduct a series of experiments in an article surveillance setup. This is a frequent application for RFID systems in retail where previous approaches solely based on RFID localization have difficulties due to false alarms triggered by stationary tags. Our evaluation shows that the fusion of RFID and computer vision provides robustness to false positive observations and allows for reliable system operation.

Spacetime Forests with Complementary Features for Dynamic Scene Recognition
Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes
British Machine Vision Conference (BMVC) 2013 (oral)
This paper presents spacetime forests defined over complementary spatial and temporal features for recognition of naturally occurring dynamic scenes. The approach improves on the previous state-of-the-art in both classification and execution rates. A particular improvement is with increased robustness to camera motion, where previous approaches have experienced difficulty. There are three key novelties in the approach. First, a novel spacetime descriptor is employed that exploits the complementary nature of spatial and temporal information, as inspired by previous research on the role of orientation features in scene classification. Second, a forest-based classifier is used to learn a multi-class representation of the feature distributions. Third, the video is processed in temporal slices with scale matched preferentially to scene dynamics over camera motion. Slicing allows for temporal alignment to be handled as latent information in the classifier and for efficient, incremental processing. The integrated approach is evaluated empirically on two publicly available datasets to document its outstanding performance.

Spatio-Temporal Good Features to Track
Christoph Feichtenhofer, Axel Pinz
Workshop on Computer Vision for Autonomous Driving, IEEE International Conference on Computer Vision (ICCV) 2013
This paper presents two fundamental contributions that can be very useful for any autonomous system that requires point correspondences for visual odometry. First, the Spatio-Temporal Monitor (STM) is an efficient method to identify good features to track by monitoring their spatiotemporal (x-y-t) appearance without any assumptions about motion or geometry. The STM may be used with any spatial (x-y) descriptor, but it performs best when combined with our second contribution, the Histogram of Oriented Magnitudes (HOM) descriptor, which is based on spatially oriented multiscale filter magnitudes. To fulfil the real-time requirements of autonomous applications, the same descriptor can be used for both track generation and monitoring, to identify low-quality feature tracks at virtually no additional computational cost. Our extensive experimental validation on a challenging public dataset demonstrates the excellent performance of STM and HOM, where we significantly outperform the well-known “Good Features to Track” method and show that our proposed feature quality measure correlates highly with the accuracy of structure and motion estimation.

A Perceptual Image Sharpness Metric Based on Local Edge Gradient Analysis
Christoph Feichtenhofer, Hannes Fassold, Peter Schallauer
IEEE Signal Processing Letters 2013
In this letter, a no-reference perceptual sharpness metric based on a statistical analysis of local edge gradients is presented. The method takes properties of the human visual system into account. Based on perceptual properties, a relationship between the extracted statistical features and the metric score is established to form a Perceptual Sharpness Index (PSI). A comparison with state-of-the-art metrics shows that the proposed method correlates highly with human perception and exhibits low computational complexity. In contrast to existing metrics, the PSI performs well for a wide range of blurriness and shows a high degree of invariance for different image contents.


Image and Video Understanding (2014-2017, together with Prof. Axel Pinz)

My Lectures cover

Image Based Measurement, Laboratory (2016-2017)
Optical Measurement Principles, Laboratory (2016-2017)
Optical Measurement Principles (2017, together with Prof. Axel Pinz)

Student Supervision

Moritz Kampelmuehler, 2017 (MSc level): Camera-based Vehicle Velocity Estimation

Horst Fuchs, 2017 (MSc level): Visualizing and Understanding Deep Driving Models

Roland Mulczet, 2017 (BSc level): Measuring the Invisible with Laser Speckles

Oliver Papst, 2017 (BSc level): 3D Object Detection for Road Scene Understanding

Gerhard Neuhold, 2015 (MSc; now at Mapillary): Pedestrian Detection with Convolutional Neural Networks