Selected Publications

Recent success of semantic segmentation approaches on demanding road driving datasets has spurred interest in many related application fields. Many of these applications involve real-time prediction on mobile platforms such as cars, drones and various kinds of robots. Real-time setup is challenging due to extraordinary computational complexity involved. Many previous works address the challenge with custom lightweight architectures which decrease computational complexity by reducing depth, width and layer capacity with respect to general purpose architectures. We propose an alternative approach which achieves a significantly better performance across a wide range of computing budgets. First, we rely on a light-weight general purpose architecture as the main recognition engine. Then, we leverage light-weight upsampling with lateral connections as the most cost-effective solution to restore the prediction resolution. Finally, we propose to enlarge the receptive field by fusing shared features at multiple resolutions in a novel fashion. Experiments on several road driving datasets show a substantial advantage of the proposed approach, either with ImageNet pre-trained parameters or when we learn from scratch. Our Cityscapes test submission entitled SwiftNetRN-18 delivers 75.5% MIoU and achieves 39.9Hz on 1024x2048 images on GTX1080Ti.
In CVPR, 2019.

We present semantic segmentation experiments with a model capable to perform predictions on four benchmark datasets: Cityscapes, ScanNet, WildDash and KITTI. We employ a ladder-style convolutional architecture featuring a modified DenseNet-169 model in the downsampling datapath, and only one convolution in each stage of the upsampling datapath. Due to limited computing resources, we perform the training only on Cityscapes Fine train+val, ScanNet train, WildDash val and KITTI train. We evaluate the trained model on the test subsets of the four benchmarks in concordance with the guidelines of the Robust Vision Challenge ROB 2018. The performed experiments reveal several interesting findings which we describe and discuss.
In CVPR ROB, 2018.

Recent progress of deep image classification models provides a large potential to improve state-of-the-art performance in related computer vision tasks. However, the transition to semantic segmentation is hampered by strict memory limitations of contemporary GPUs. The extent of feature map caching required by convolutional backprop poses significant challenges even for moderately sized PASCAL images, while requiring careful architectural considerations when the source resolution is in the megapixel range. To address these concerns, we propose a DenseNet-based ladder-style architecture which is able to deliver high modelling power with very lean representations at the original resolution. The resulting fully convolutional models have few parameters, allow training at megapixel resolution on commodity hardware and display fair semantic segmentation performance even without ImageNet pre-training. We present experiments on Cityscapes and Pascal VOC 2012 datasets and report competitive results.
In ICCV, 2017.

We propose an effective technique to address large scale variation in images taken from a moving car by cross-breeding deep learning with stereo reconstruction. Our main contribution is a novel scale selection layer which extracts convolutional features at the scale which matches the corresponding reconstructed depth. The recovered scale-invariant representation disentangles appearance from scale and frees the pixel-level classifier from the need to learn the laws of the perspective. This results in improved segmentation results due to more efficient exploitation of representation capacity and training data. We perform experiments on two challenging stereoscopic datasets (KITTI and Cityscapes) and report competitive class-level IoU performance.
In GCPR, 2016.

Recent Publications

. In Defense of Pre-trained ImageNet Architectures for Real-time Semantic Segmentation of Road-driving Images. In CVPR, 2019.

PDF Code

. Robust Semantic Segmentation with Ladder-DenseNet Models. In CVPR ROB, 2018.

PDF Code

. Convolutional Models for Segmentation and Localization. In HATZ, 2018.


. Ladder-style DenseNets for Semantic Segmentation of Large Natural Images. In ICCV, 2017.

PDF Code

. Convolutional Scale Invariance for Semantic Segmentation. In GCPR, 2016.

PDF Code

. A novel georeferenced dataset for stereo visual odometry. In CCVW, 2013.

PDF Code



The project MULTICLOD adresses recognition in stereoscopic video of various traffic environments. We study local and global image representations based on cues extracted by combining classification and reconstruction approaches. Our goal is to devise novel multi-class and weakly supervised recognition models capable of contributing to various applications in the fields of smart vehicles and intelligent transportation systems.


I am a teaching instructor for the following courses at FER, University of Zagreb: