ClipClap

Image Captioning with a CLIP Encoder and GPT-2
Product Information
Release date: 22 June, 2023
Platform: Desktop

ClipClap Features

Image captioning is a complicated task, in which a pretrained detection network is usually used, requiring additional supervision in the form of object annotations. We present a new approach that does not require additional information (i.e., it requires only images and captions) and thus can be applied to any data. In addition, our model's training time is much faster than that of similar methods while achieving results comparable to state-of-the-art, even on the Conceptual Captions dataset, which contains over 3M images. In our work, we use the CLIP model, which was already trained over an extremely large number of images and is therefore capable of generating semantic encodings for arbitrary images without additional supervision. To produce meaningful sentences we fine-tune a pretrained language model, which has proven successful for other natural language tasks. The key idea is to use the CLIP encoding as a prefix to the textual captions by employing a simple mapping network over the raw encoding, and then fine-tune our language model to generate a valid caption. In addition, we present another variant, where we utilize a transformer architecture for the mapping network and avoid fine-tuning GPT-2. Even so, our lighter model achieves results comparable to state-of-the-art on the nocaps dataset. Source: https://github.com/rmokady/CLIP_prefix_caption
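To make the prefix idea above concrete, here is a minimal PyTorch sketch of how a CLIP image embedding could be mapped to a sequence of prefix embeddings and prepended to the caption tokens fed into GPT-2. The class name MLPMapper, the dimensions (512 for CLIP, 768 for GPT-2), and the prefix length of 10 are illustrative assumptions, not the exact code or hyperparameters of the linked repository.

```python
# Illustrative sketch of prefix captioning (assumed shapes and names,
# not the repository's actual implementation).
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class MLPMapper(nn.Module):
    """Maps a single CLIP image embedding to a sequence of GPT-2 prefix embeddings."""
    def __init__(self, clip_dim=512, gpt_dim=768, prefix_length=10):
        super().__init__()
        self.prefix_length = prefix_length
        self.gpt_dim = gpt_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, gpt_dim * prefix_length // 2),
            nn.Tanh(),
            nn.Linear(gpt_dim * prefix_length // 2, gpt_dim * prefix_length),
        )

    def forward(self, clip_embedding):            # (batch, clip_dim)
        prefix = self.mlp(clip_embedding)          # (batch, prefix_length * gpt_dim)
        return prefix.view(-1, self.prefix_length, self.gpt_dim)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
mapper = MLPMapper()

# Placeholder for a real CLIP encoding of an image.
clip_embedding = torch.randn(1, 512)
caption_ids = tokenizer("A dog playing in the park.", return_tensors="pt").input_ids

# Prepend the mapped prefix embeddings to the caption token embeddings.
prefix_embeds = mapper(clip_embedding)                 # (1, 10, 768)
caption_embeds = gpt2.transformer.wte(caption_ids)     # (1, seq_len, 768)
inputs_embeds = torch.cat([prefix_embeds, caption_embeds], dim=1)

# Compute the language-modeling loss only over the caption tokens
# (-100 masks the prefix positions from the loss).
labels = torch.cat(
    [torch.full((1, mapper.prefix_length), -100), caption_ids], dim=1
)
loss = gpt2(inputs_embeds=inputs_embeds, labels=labels).loss
loss.backward()
```

In the first variant described above, both the mapping network and GPT-2 would be trained with this loss; in the second variant, GPT-2 stays frozen and only a transformer-based mapping network is optimized.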