Beyond Standard Assumptions in Autonomous Driving Perception

Time: Fri 2026-04-17 09.00

Location: Kollegiesalen, Brinellvägen 8, Stockholm

Language: English

Subject area: Computer Science

Doctoral student: Ajinkya Khoche, Robotics, Perception and Learning, Traton AB

Opponent: Assistant Professor Holger Caesar, Intelligent Vehicles Lab, TU Delft; Professor Abhinav Valada, Albert-Ludwigs-Universität Freiburg; Adjunct Associate Professor Christoffer Petersson, Chalmers University; Research Fellow Stephany Berrio Perez, Australian Centre for Field Robotics, University of Sydney

Supervisor: Professor Patric Jensfelt, Robotics, Perception and Learning; Dr Sina Sharif Mansouri, Traton AB

Zoom link: https://kth-se.zoom.us/s/68091974260

Abstract

Autonomous driving perception is commonly developed and evaluated under a set of enabling assumptions: that multi-sensor evidence is physically consistent at the frame level, that geometry is sufficiently dense to support reliable inference about other traffic participants and the surrounding environment, and that learning can rely on either abundant human labels or self-supervised objectives derived from the sensor stream. This thesis examines what remains feasible when these assumptions no longer hold, and develops methods and design principles for perception under asynchronous sensing, long-range sparsity, and weak or unreliable supervision.

We first study physical inconsistency in multi-sensor data. We show that rolling and asynchronous acquisition, motion during aggregation, and annotation practices that implicitly assume temporal coherence can render the perception problem ill-posed before any representation choice is made. We therefore treat data preparation, motion compensation, and annotation consistency as integral parts of the perception pipeline, since errors at this stage can propagate directly into annotation, training, and evaluation.

We then examine representation under long-range sparsity. We show that long-range performance is limited not only by model capacity but also by the representations used to encode and expose ambiguous evidence. In particular, object-centric outputs and dense internal representations can force premature commitment when available evidence collapses at distance. To study this, we present results on long-range 3D object detection and sparse long-range scene flow, showing both the limits of object-centric perception under weak observability and the value of motion-centric estimation as range increases.

Finally, we study learning signals when labels and geometry-derived self-supervision become unreliable. We show that motion supervision can be recovered by importing physically grounded constraints from complementary modalities, using radar Doppler to guide LiDAR scene flow learning. We further show that scalable semantic supervision can be obtained from foundation-model priors through curriculum-based synthetic-to-real adaptation, which anchors language-aligned representations to real LiDAR characteristics.
