Images and vision — OpenAI

Summary

A guide to using the OpenAI API to generate images and to analyze image inputs with vision-capable models, covering supported models, input methods, and token cost calculations.

Key quotes

Vision is the ability for a model to “see” and understand images.

The state-of-the-art image generation model, gpt-image-2, can understand text and images and use broad world knowledge to generate images.

The documentation details the technical requirements for image inputs, including supported file types (PNG, JPEG, WEBP, GIF) and size limits. It also explains how image inputs are converted to tokens for billing, using methods such as patch-based and tile-based calculation that differ by model family.
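
To make the tile-based calculation concrete, here is a minimal sketch of how such an estimate might be computed. The constants (85 base tokens, 170 tokens per 512px tile, a 2048px bounding square, and a 768px short-side target) are assumptions taken from the rule commonly documented for GPT-4o-class models, not values stated in this summary. Other model families use different numbers or the patch-based method instead. The function name `estimate_tile_tokens` is hypothetical.

```python
import math

# Assumed constants for a GPT-4o-style tile rule; verify against the
# current documentation for the model you actually use.
BASE_TOKENS = 85         # flat cost charged for every image
TOKENS_PER_TILE = 170    # cost of each 512px tile in high-detail mode
TILE_SIZE = 512
MAX_SIDE = 2048          # image is first fit inside a 2048 x 2048 square
SHORT_SIDE_TARGET = 768  # then the shortest side is reduced to 768px


def estimate_tile_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate the token cost of one image under the assumed tile rule."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if detail == "low":
        # Low detail is assumed to cost only the flat base amount.
        return BASE_TOKENS

    # Step 1: shrink to fit inside the bounding square, keeping aspect ratio.
    scale = min(1.0, MAX_SIDE / max(width, height))
    w, h = width * scale, height * scale

    # Step 2: shrink so the shortest side is at most 768px.
    # Assumption: smaller images are left as they are, not upscaled.
    short = min(w, h)
    if short > SHORT_SIDE_TARGET:
        ratio = SHORT_SIDE_TARGET / short
        w, h = w * ratio, h * ratio

    # Step 3: count the 512px tiles needed to cover the resized image.
    tiles = math.ceil(w / TILE_SIZE) * math.ceil(h / TILE_SIZE)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles


if __name__ == "__main__":
    print(estimate_tile_tokens(1024, 1024))         # 765
    print(estimate_tile_tokens(2048, 4096))         # 1105
    print(estimate_tile_tokens(4096, 8192, "low"))  # 85
```

Under these assumptions, a 1024 x 1024 image is resized to 768 x 768 and covered by four tiles, giving 4 x 170 + 85 = 765 tokens. Treat the result as an estimate only; the figure billed by the API is authoritative.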