Towards safe, aligned, and efficient reinforcement learning from human feedback
Time: Thu 2025-06-05, 15:00
Location: Q2, Malvinas väg 10, Stockholm
Language: English
Subject area: Computer Science
Doctoral student: Daniel Marta, Robotics, Perception and Learning (RPL)
Opponent: Full Professor Mohamed Chetouani, Sorbonne Université, Paris, France
Supervisor: Associate Professor Iolanda Leite, Robotics, Perception and Learning (RPL)
Abstract
Reinforcement learning policies are becoming increasingly prevalent in robotics and human-AI interaction due to their effectiveness in tackling complex and challenging domains. Many of these policies, also referred to as AI agents, are trained using human feedback through techniques collectively known as Reinforcement Learning from Human Feedback (RLHF). This thesis addresses three key challenges that arise when deploying such policies in real-world applications involving actual human users: safety, alignment, and efficiency. To this end, it proposes several novel methods.

Ensuring safe human-robot interaction is a fundamental requirement for deploying these policies. While most prior research has explored safety within discrete state and action spaces, we investigate novel approaches for synthesizing safety shields from human feedback, enabling safer policy execution in challenging settings with continuous state and action spaces, such as social navigation.

To better align policies with human feedback, contemporary works predominantly rely on single-reward settings. However, we argue for the necessity of a multi-objective paradigm, as most human goals cannot be captured by a single scalar reward function. Moreover, most robotic tasks have predefined baseline goals related to task success, such as reaching a navigation waypoint. Accordingly, we first introduce a method to align policies with multiple objectives using pairwise preferences. Additionally, we propose a novel multi-modal approach that leverages zero-shot reasoning with large language models alongside pairwise preferences to adapt the multi-objective goals of these policies.

The final challenge addressed in this thesis is improving the sample efficiency and reusability of these policies, which is crucial when adapting them based on real human feedback. Since requesting human feedback is both costly and burdensome, and can degrade the quality of human-agent interactions, we propose two distinct methods to mitigate these issues. First, to enhance the efficiency of RLHF, we introduce an active learning method that combines unsupervised learning techniques with uncertainty estimation to prioritize the most informative queries for human feedback. Second, to improve the reusability of reward functions derived from human feedback and reduce the need for redundant queries in similar tasks, we investigate low-rank adaptation techniques for adapting pre-trained reward functions to new tasks.
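
To make the preference-based, multi-objective alignment idea concrete, the following is a minimal Python sketch, not the thesis's actual implementation: a vector-valued reward model is trained from pairwise trajectory preferences with a Bradley-Terry style loss, and the per-objective returns are scalarized with fixed weights. The network shape, objective weights, and data format are illustrative assumptions.

    # Illustrative sketch only: learning a multi-objective reward model from
    # pairwise trajectory preferences (Bradley-Terry style loss). The network,
    # weights, and data format are assumptions, not the thesis's method.
    import torch
    import torch.nn as nn

    class MultiObjectiveReward(nn.Module):
        def __init__(self, obs_dim: int, act_dim: int, n_objectives: int):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                nn.Linear(64, n_objectives),          # one reward head per objective
            )

        def forward(self, obs, act):
            return self.net(torch.cat([obs, act], dim=-1))   # (T, n_objectives)

    def preference_loss(model, seg_a, seg_b, pref, weights):
        """Bradley-Terry loss: pref = 1.0 if segment A is preferred, else 0.0.
        Per-objective returns are scalarized with fixed weights for illustration."""
        ret_a = (model(*seg_a).sum(dim=0) * weights).sum()
        ret_b = (model(*seg_b).sum(dim=0) * weights).sum()
        logits = ret_a - ret_b
        return nn.functional.binary_cross_entropy_with_logits(
            logits.unsqueeze(0), torch.tensor([pref]))

    # Example usage with random stand-in data (two 10-step trajectory segments).
    model = MultiObjectiveReward(obs_dim=4, act_dim=2, n_objectives=3)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    seg_a = (torch.randn(10, 4), torch.randn(10, 2))
    seg_b = (torch.randn(10, 4), torch.randn(10, 2))
    loss = preference_loss(model, seg_a, seg_b, pref=1.0, weights=torch.ones(3) / 3)
    opt.zero_grad(); loss.backward(); opt.step()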
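The low-rank adaptation of pre-trained reward functions can likewise be sketched generically: the pre-trained weights are frozen and only small low-rank update matrices are fine-tuned on the new task's preferences. The wrapper below is a standard LoRA-style construction with arbitrarily chosen rank, scaling, and architecture, assumed here for illustration rather than taken from the thesis.

    # Illustrative sketch only: adapting a frozen, pre-trained reward network to
    # a new task by adding trainable low-rank (LoRA-style) updates to its linear
    # layers. Rank, scaling, and architecture are arbitrary assumptions.
    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():      # keep pre-trained weights frozen
                p.requires_grad = False
            self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, rank))
            self.scale = alpha / rank

        def forward(self, x):
            # frozen base output plus the scaled low-rank update applied to x
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    # Wrap the linear layers of a (hypothetical) pre-trained reward network,
    # then fine-tune only the small A/B matrices on the new task's preferences.
    pretrained_reward = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 1))
    adapted = nn.Sequential(LoRALinear(pretrained_reward[0]), nn.ReLU(),
                            LoRALinear(pretrained_reward[2]))
    trainable = [p for p in adapted.parameters() if p.requires_grad]  # only A and B
    opt = torch.optim.Adam(trainable, lr=1e-3)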