CLIP models on Hugging Face embed images and text in a shared vector space, which supports multi-modal tasks such as image-text matching and zero-shot image classification.
https://huggingface.co/models?pipeline_tag=clip
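
As a rough illustration of how zero-shot image classification can work once embeddings are available, the sketch below scores one image embedding against several text-prompt embeddings using cosine similarity followed by a softmax. The vectors, the labels, the helper names (`l2_normalize`, `softmax`, `zero_shot_classify`), and the `logit_scale` value are all assumptions made for this example; in practice the embeddings would come from a CLIP checkpoint, and the temperature is a parameter the model learns during training.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, text_embeddings, labels, logit_scale=100.0):
    """Return a {label: probability} mapping for one image embedding.

    logit_scale is an assumed temperature for this sketch, not a value
    read from any particular checkpoint.
    """
    if len(text_embeddings) != len(labels):
        raise ValueError("need exactly one text embedding per label")
    image_unit = l2_normalize(image_embedding)
    logits = []
    for text_embedding in text_embeddings:
        text_unit = l2_normalize(text_embedding)
        cosine = sum(a * b for a, b in zip(image_unit, text_unit))
        logits.append(logit_scale * cosine)
    return dict(zip(labels, softmax(logits)))


# Toy 3-dimensional embeddings (hypothetical; real CLIP embeddings are much larger).
image_embedding = [0.9, 0.1, 0.0]
text_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

probs = zero_shot_classify(image_embedding, text_embeddings, labels)
best_label = max(probs, key=probs.get)
print(best_label)
```

Because the labels are ordinary text prompts, the candidate classes can be changed without retraining, which is what makes the classification "zero-shot".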