Marrying Vision and Language: A Mutually Beneficial Relationship? (via Zoom)
Abstract: Foundation models that connect vision and language have recently shown great promise for a wide array of tasks such as text-to-image generation. Significant attention has been devoted to utilizing the visual representations learned from these powerful vision and language models. In this talk, I will present an ongoing line of research that focuses on the other direction, aiming to understand what knowledge language models acquire through exposure to images during pretraining. We first consider in-distribution text and demonstrate how multimodally trained text encoders, such as that of CLIP, outperform models trained in a unimodal vacuum, such as BERT, on tasks that require implicit visual reasoning. Expanding to out-of-distribution text, we address sound symbolism, the phenomenon of non-trivial correlations between particular sounds and meanings across languages, and demonstrate its presence in vision and language models such as CLIP and Stable Diffusion. Our work provides new angles for understanding what these vision and language foundation models learn, offering principled guidelines for designing models for tasks involving visual reasoning.
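As a rough illustration of the kind of comparison the abstract describes (not the speaker's actual setup), the sketch below extracts sentence embeddings from CLIP's multimodally trained text encoder and from a text-only BERT model; either set of embeddings could then feed a downstream probe for implicit visual reasoning. The model checkpoints and the mean-pooling choice for BERT are assumptions made for this example.

```python
# Minimal sketch: sentence embeddings from a multimodally trained text encoder
# (CLIP) vs. a unimodal one (BERT). Checkpoints and pooling are illustrative
# assumptions, not the method presented in the talk.
import torch
from transformers import CLIPModel, CLIPTokenizer, BertModel, BertTokenizer

sentences = ["The sky at noon is bright blue.", "A ripe banana is yellow."]

# CLIP text encoder: trained jointly with images.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
with torch.no_grad():
    clip_inputs = clip_tok(sentences, padding=True, return_tensors="pt")
    clip_emb = clip.get_text_features(**clip_inputs)

# BERT: trained on text alone (a "unimodal vacuum").
bert = BertModel.from_pretrained("bert-base-uncased")
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
with torch.no_grad():
    bert_out = bert(**bert_tok(sentences, padding=True, return_tensors="pt"))
    bert_emb = bert_out.last_hidden_state.mean(dim=1)  # simple mean pooling

print(clip_emb.shape, bert_emb.shape)  # e.g. (2, 512) and (2, 768)
```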
Bio: Hadar Averbuch-Elor is an Assistant Professor at the School of Electrical Engineering at Tel Aviv University. Before that, she was a postdoctoral researcher at Cornell Tech, working with Noah Snavely. She completed her PhD in Electrical Engineering at Tel Aviv University, advised by Daniel Cohen-Or. Hadar is a recipient of several awards, including the Zuckerman Postdoctoral Scholar Fellowship, the Schmidt Postdoctoral Award for Women in Mathematical and Computing Sciences, and the Alon Fellowship for the Integration of Outstanding Faculty. She was also selected as a Rising Star in EECS in 2020. Hadar's research interests lie at the intersection of computer graphics and computer vision, particularly in combining pixels with more structured modalities, such as natural language and 3D geometry.