Talk Title: Multimodal Learning from the Bottom Up

Abstract: Today's machine perception systems rely extensively on human-provided supervision, such as language. I will talk about our efforts to develop systems that instead learn directly about the world from unlabeled multimodal signals, bypassing the need for this supervision. First, I will discuss our work on creating models that learn by analyzing unlabeled videos, particularly self-supervised approaches for learning space-time correspondence. Next, I will present models that learn from the paired audio and visual signals that naturally occur in video, including methods for generating soundtracks for silent videos. I will also discuss methods for capturing and learning from paired visual and tactile signals, such as models that augment visual 3D reconstructions with touch. Finally, I will describe work that explores the limits of pretrained text-to-image generation models by using them to create visual illusions.

Bio: Andrew Owens is an assistant professor in the Department of Electrical Engineering and Computer Science at the University of Michigan. Prior to that, he was a postdoctoral scholar at UC Berkeley, and he obtained a Ph.D. in computer science from MIT in 2016. He is a recipient of an NSF CAREER Award, a Computer Vision and Pattern Recognition (CVPR) Best Paper Honorable Mention Award, and a Microsoft Research Ph.D. Fellowship.