Multimodal AI is booming this year, with models capable of seeing, reading, and hearing. Advances in this field unlock many production use cases in robotics, document AI, computer/web automation, and more!
In this talk we will go through everything multimodal and open-source: a bit of background, libraries, very basic APIs to get you started with open-source models, popular open-source models, and use cases (multimodal agents, multimodal RAG, automated browser use, and more!).
Python and a basic understanding of deep learning
Merve works on the open-source team at Hugging Face, focusing on computer vision and multimodal AI, and at times agents, contributing to and developing Hugging Face libraries. Prior to this, she worked as a machine learning engineer on NLP, chatbots, and information retrieval.