Title
"Intelligent Systems that Perceive, Imagine, and Act Like Humans by Aligning Representations from Multimodal Data"
Abstract
The machine learning community has embraced specialized models tailored to specific data domains. However, relying on a single data type can constrain flexibility and generality, require additional labeled data, and hinder user interaction. To address these challenges, my research objective is to build efficient intelligent systems that learn from perception of the physical world and from interactions with humans in order to execute diverse and complex tasks that assist people. These systems should support seamless interaction with humans and computers, both in digital software environments and in tangible real-world contexts, by aligning representations from multimodal data. In this talk, I will elaborate on my approaches across three dimensions: perception, imagination, and action, encompassing methods in recognition, generation, and robotics. These findings mitigate the constraints of existing model setups, opening avenues for multimodal representations to unify a myriad of signals within a single, comprehensive model.
Bio
Boyi Li is a postdoctoral scholar at UC Berkeley, advised by Prof. Jitendra Malik and Prof. Trevor Darrell. She is also a researcher at NVIDIA Research. She received her Ph.D. from Cornell University, advised by Prof. Serge Belongie and Prof. Kilian Q. Weinberger. Her research interests are in computer vision and machine learning, with a primary focus on developing generalizable algorithms and interactive intelligent systems by aligning representations from multimodal data, such as 2D pixels, 3D geometry, language, audio, touch, and smell, among others.