Marrying Vision and Language: A Mutually Beneficial Relationship? (via Zoom)
Abstract: Foundation models that connect vision and language have recently shown great promise for a wide array of tasks such as text-to-image generation. Significant attention has been devoted to utilizing the visual representations learned from these powerful vision and language models. In this talk, I will present an ongoing line of research that focuses on the other direction, aiming to understand what knowledge language models acquire through exposure to images during pretraining. We first consider in-distribution text and demonstrate how multimodally trained text encoders, such as that of CLIP, outperform models trained in a unimodal vacuum, such as BERT, on tasks that require implicit visual reasoning. Expanding to out-of-distribution text, we address sound symbolism, the phenomenon of non-trivial correlations between particular sounds and meanings across languages, and demonstrate its presence in vision and language models such as CLIP and Stable Diffusion. Our work provides new angles for understanding what these vision and language foundation models learn, offering principled guidelines for designing models for tasks involving visual reasoning.
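As a rough illustration of the kind of comparison the abstract describes (not the speaker's actual setup), the sketch below extracts sentence embeddings from CLIP's multimodally trained text encoder and from a text-only BERT model; either set of embeddings could then feed a downstream probe for implicit visual reasoning. The model checkpoints and the mean-pooling choice for BERT are assumptions made for this example.

```python
# Minimal sketch: sentence embeddings from a multimodally trained text encoder
# (CLIP) vs. a unimodal one (BERT). Checkpoints and pooling are illustrative
# assumptions, not the method presented in the talk.
import torch
from transformers import CLIPModel, CLIPTokenizer, BertModel, BertTokenizer

sentences = ["The sky at noon is bright blue.", "A ripe banana is yellow."]

# CLIP text encoder: trained jointly with images.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
with torch.no_grad():
    clip_inputs = clip_tok(sentences, padding=True, return_tensors="pt")
    clip_emb = clip.get_text_features(**clip_inputs)

# BERT: trained on text alone (a "unimodal vacuum").
bert = BertModel.from_pretrained("bert-base-uncased")
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
with torch.no_grad():
    bert_out = bert(**bert_tok(sentences, padding=True, return_tensors="pt"))
    bert_emb = bert_out.last_hidden_state.mean(dim=1)  # simple mean pooling

print(clip_emb.shape, bert_emb.shape)  # e.g. (2, 512) and (2, 768)
```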
Bio: Hadar Averbuch-Elor is an Assistant Professor at the School of Electrical Engineering at Tel Aviv University. Before that, she was a postdoctoral researcher at Cornell Tech, working with Noah Snavely. She completed her PhD in Electrical Engineering at Tel Aviv University, advised by Daniel Cohen-Or. Hadar is a recipient of several awards, including the Zuckerman Postdoctoral Scholar Fellowship, the Schmidt Postdoctoral Award for Women in Mathematical and Computing Sciences, and the Alon Fellowship for the Integration of Outstanding Faculty. She was also selected as a Rising Star in EECS in 2020. Hadar's research interests lie at the intersection of computer graphics and computer vision, particularly in combining pixels with more structured modalities, such as natural language and 3D geometry.