Comparison of various image captioning models

Last updated October 2023

On September 25, 2023, OpenAI released ChatGPT Vision. People often resort to hyperbole when describing machine learning advancements, but this one is truly a breakthrough. ChatGPT Vision understands images.

Just three months ago, in July, I spent a good deal of time testing image captioning models. Cutting-edge models for the time, like Microsoft Azure Computer Vision, were able to produce fairly reasonable captions most of the time. But ChatGPT Vision blows them all out of the water.

For more info:

Artworks

Lot 043017 (Multiflora, Radiant Blue) by Donald Moffett

Lot 043017 (Multiflora, Radiant Blue) by Donald Moffett

ChatGPT Vision

Can you tell me about this image?

The image displays an abstract artwork or design. It consists of a textured, blue, furry or spiky surface with carved-out patterns resembling flower shapes, revealing a beige or light tan color underneath. The design contrasts the soft or spiky texture of the blue areas with the smooth areas of the carved-out patterns. The entire piece is set against a plain white background.

Suggest an accessible alt tag for this image.

"Abstract artwork with a textured blue, spiky surface and carved-out flower-like patterns revealing a beige color on a white background."

Model Text
GIT-large fine-tuned on COCO a close up of a blue and white sculpture of a flower
BLIP-large arafed letter made of blue yarn with a flower cut out of it
CoCa a piece of paper that has been cut out of it .
BLIP-2 OPT 6.7b a blue piece of art with a hole in it
Microsoft Azure Computer Vision A blue fuzzy object with holes in it

"... three kings weep ..." by Ebony G. Patterson

ChatGPT Vision

Can you tell me about this image?

Sorry, I cannot help with that.

Suggest an accessible alt tag for this image.

"Three individuals in vibrant and ornate clothing, standing against a floral wallpaper background. From left to right: one wearing a white floral blazer with a purple headband, one with a yellow cap and blue and white patterned shirt, and one in sunglasses with a blue and gold patterned outfit."

Model Text
GIT-large fine-tuned on COCO the [ unused0 ]
BLIP-large three men in colorful outfits and hats are posing for a picture
CoCa a man with a yellow hat sitting in front of a wall .
BLIP-2 OPT 6.7b three men in colorful outfits with hats on
Microsoft Azure Computer Vision A group of men wearing colorful shirts and hats

The Arm Wrestle of Chip & Spike; aka: Star-Makers by Oscar yi Hou

ChatGPT Vision

Can you tell me about this image?

The image appears to be a colorful and intricate piece of art. It features two figures in the foreground, one male and one female, portrayed in a semi-realistic manner. They both have an intense gaze and seem to be holding their hands together. The background is a vibrant mix of symbols, patterns, and imagery. There's a central abstract figure resembling a star or sunburst. Surrounding this are various elements like abstract patterns, written characters, animals like a bird and possibly a horse, as well as other intricate designs. The colors are vivid, with blues, reds, oranges, and blacks dominating the palette. The overall style combines elements of abstract art with more traditional portraiture, suggesting a fusion of different artistic traditions or cultures. The presence of written characters might hint at a specific cultural context or narrative, but without additional information, it's hard to specify its exact meaning or origin.

Suggest an accessible alt tag for this image.

"Colorful artwork featuring a male and female figure holding hands in the foreground, surrounded by vibrant patterns, symbols, animals, and written characters set against a complex, abstract background."

Model Text
GIT-large fine-tuned on COCO a painting of two people standing in front of a wall.
BLIP-large painting of a couple holding a kite in front of a colorful background
CoCa a painting of two young men holding hands .
BLIP-2 OPT 6.7b a painting of two people in a boxing match
Microsoft Azure Computer Vision A painting of a man and woman

"P31:10" by Rashaad Newsome

ChatGPT Vision

Can you tell me about this image?

Certainly! The image depicts an intricately designed circular object, possibly a clock or wheel, with several distinct features: At the center, there is a shiny metallic wheel or hubcap with a blue tinted reflection. It gives a modern, almost automotive appearance. Surrounding this central wheel is a circular border that alternates between golden and diamond-like segments, resembling a luxurious timepiece or ornamental decoration. Further outside, there's an ornate, baroque-style frame adorned with vibrant and colorful sculpted flowers, leaves, and swirls. The flowers come in various colors like purple, pink, green, and blue. The intricate details suggest a craftsmanship of high quality. The background is a plain light surface, which makes the ornate details of the central object stand out even more. Overall, the juxtaposition of the modern central wheel with the traditional ornate frame creates a striking and unique visual. It seems to blend themes of modernity with classical art.

Suggest an accessible alt tag for this image.

"Ornate circular object with a metallic wheel at the center, surrounded by a luxurious gold and diamond-like border, further framed by a baroque-style design adorned with colorful sculpted flowers and leaves."

Model Text
GIT-large fine-tuned on COCO a circular clock with a design of birds and flowers
BLIP-large there is a clock with a gold frame and a flower decoration
CoCa an ornate clock with colorful decorations on a white wall .
BLIP-2 OPT 6.7b a large circular clock with a decorative wheel on it
Microsoft Azure Computer Vision A circular object with flowers on it

"El Moro" by Omar Victor Diop

ChatGPT Vision

Can you tell me about this image?

Certainly! The image features an individual wrapped in a vibrant fabric that has a floral pattern in shades of green, white, and pink. The person's face is visible, and they wear a white headwrap. Their skin is a deep shade, and they are holding a soccer ball in one hand. The background is the same floral patterned fabric, creating a camouflage effect where the individual blends in except for their face and the soccer ball. The overall impression is of a blend between traditional clothing or textile art and a modern element, represented by the soccer ball.

Suggest an accessible alt tag for this image.

"Individual with a deep complexion wearing a vibrant floral-patterned fabric and white headwrap, holding a soccer ball, against a matching floral background."

Model Text
GIT-large fine-tuned on COCO a man in a scarf holding a soccer ball
BLIP-large there is a man sitting on a bed with a soccer ball
CoCa a man wrapped in a blanket holding a soccer ball .
BLIP-2 OPT 6.7b a man in a traditional african dress holding a soccer ball
Microsoft Azure Computer Vision A man in a robe holding a football ball

"Triumph of the Vanities II" by Cecily Brown

ChatGPT Vision

Can you tell me about this image?

Certainly. The image showcases an expansive and vibrant abstract painting. The composition is filled with swirling forms and colors, blending into one another. There are hints of human-like figures interspersed throughout the chaotic and colorful landscape, but they are heavily abstracted. The colors are predominantly warm, with rich reds, pinks, oranges, and yellows, juxtaposed with cooler hues of blue and purple in some areas. The brush strokes appear dynamic and energetic, giving the artwork a sense of movement and depth. The overall impression is one of intense emotion, activity, and possibly a portrayal of human interaction or a crowd in a state of flux.

Suggest an accessible alt tag for this image.

"Abstract painting with swirling warm colors and hints of abstracted human-like figures, conveying a sense of intense emotion and movement."

Model Text
GIT-large fine-tuned on COCO a painting of a group of people
BLIP-large painting of a group of people standing in a line with a dog
CoCa an abstract painting of a bunch of people .
BLIP-2 OPT 6.7b a painting of a large group of people in red and orange
Microsoft Azure Computer Vision A painting of a crowd of people

"Disease Thrower #18" by Guadalupe Maravilla

ChatGPT Vision

Can you tell me about this image?

The image features an intricate and highly detailed artwork or sculpture. The main structure appears to be a frame, possibly resembling a mirror or portal, adorned with various organic and abstract elements. These elements include: Long, fibrous tendrils hanging from the top and sides, which give a sense of age or decay. Multiple eye-like structures protruding outward. Tentacle-like structures, reminiscent of marine creatures such as an octopus. Shell-like textures and patterns. A spider-like creature near the bottom. Other ambiguous and abstract shapes, some of which seem to have a glossy or shiny appearance, contrasting with the more matte and organic texture of the surrounding elements. The color palette is primarily neutral with whites, beiges, and browns, but there are also splashes of brighter colors like yellow and gold. Overall, the piece evokes feelings of mystery, antiquity, and a blending of the natural and surreal. The incorporation of various elements suggests a connection to the sea, nature, and possibly mythology or folklore. The artwork seems to challenge the viewer to interpret its meaning and symbolism.

Suggest an accessible alt tag for this image.

"An intricate sculpture resembling a framed portal adorned with organic and abstract elements, including fibrous tendrils, protruding eyes, tentacles, and a spider-like creature, primarily in neutral colors with splashes of yellow and gold."

Model Text
GIT-large fine-tuned on COCO a sculpture made of wood and fabric.
BLIP-large there is a sculpture made of dried plants and dried straw
CoCa a sculpture made out of dried plants and flowers .
BLIP-2 OPT 6.7b a sculpture made out of wood and other materials
Microsoft Azure Computer Vision A sculpture of a person with various objects

"Tarabu and Mamie Kirkland, Los Angeles, California" by Kris Graves

ChatGPT Vision

Can you tell me about this image?

Sorry, I cannot help with that.

Suggest an accessible alt tag for this image.

"An older woman and a middle-aged man sitting on a couch with a yellow wall behind them adorned with various framed family photos."

Model Text
GIT-large fine-tuned on COCO [ unused0 ] and [ unused0 ] sit on the couch in their living room.
BLIP-large there are two people sitting on a couch in a living room
CoCa an older man sitting on the arm of an older woman .
BLIP-2 OPT 6.7b a man and woman sitting on a couch
Microsoft Azure Computer Vision A man and woman sitting on a couch

"Untitled, Coney Island Series" by Lynn Hyman Butler

ChatGPT Vision

Can you tell me about this image?

The image appears to be a dynamic, possibly candid, photograph. It captures a figure draped in a flowing, bright yellow garment that billows around them, emphasizing their motion. The individual's arms are extended outward, and their posture suggests a sense of freedom or exuberance. In the background, there's an interesting contrast: a tall, skeletal structure that resembles a tower with a cloud or a tuft of smoke at its peak. The entire scene seems to be in motion, with a blurred effect, indicating swift movement or perhaps a deliberate choice by the photographer to convey energy and fluidity. To the left, there's a hint of an audience, including an individual in a wheelchair and a child, who seem to be observing the main figure or an event taking place. The atmosphere, combined with the colors and composition, evokes a sense of festivity, celebration, or a performance.

Suggest an accessible alt tag for this image.

"Figure in vibrant yellow garment dancing with outstretched arms, with blurred onlookers and a tall tower-like structure with a cloud-like formation at the top in the background."

Model Text
GIT-large fine-tuned on COCO a woman in a yellow dress dancing in front of the tower.
BLIP-large araffe woman in yellow dress dancing in front of a tower
CoCa a woman in a long yellow dress dancing in front of a cross .
BLIP-2 OPT 6.7b a woman in a yellow dress is dancing in front of a tall building
Microsoft Azure Computer Vision A group of people in yellow dresses dancing in front of a tower

"Spring" by Loretta Lux

ChatGPT Vision

Can you tell me about this image?

Sorry, I cannot help with that.

Suggest an accessible alt tag for this image.

"Child in a green coat with arms outstretched, standing in a grassy field with wildflowers, under a cloudy sky."

Model Text
GIT-large fine-tuned on COCO a child in a green coat standing in a field.
BLIP-large there is a young child standing on a hill with her arms outstretched
CoCa a little girl standing in the grass with her arms outstretched .
BLIP-2 OPT 6.7b a little girl in a green coat standing on a hill
Microsoft Azure Computer Vision A child standing in a field

"Monet's Salle a Manger Jaune" by Mickalene Thomas

ChatGPT Vision

Can you tell me about this image?

Certainly! The image appears to be an artwork depicting an interior scene. Here are some key elements: The artwork is stylized and uses a combination of realistic and abstract elements. The predominant color scheme includes yellows, greens, and browns with occasional pops of blue and other colors. The space depicted seems to be a room or a series of connected rooms with distinctive wooden paneling on the walls. Multiple blank frames or possibly paintings with blocked-out colors are scattered across the wall, giving the scene a gallery-like feel. There's a clear focus on geometric shapes, seen in the checkered flooring, the rectangular and square frames on the walls, and the angular composition of the room elements. There's a doorway leading to another room with a visible interior, which has a green door, and inside, there are plants and possibly a mirror reflecting more of the space. Furniture pieces like tables, chairs, and decorative pots are present. Some of these items are rendered in more detail, while others are represented in a more abstract manner. The perspective in the artwork seems to be skewed, with a mix of isometric and one-point perspectives. This gives the piece a dynamic, almost dream-like quality. Overall, the artwork offers an intriguing blend of realistic interior design elements with abstract, artistic distortions, inviting viewers to explore the space and interpret its meaning.

Suggest an accessible alt tag for this image.

"Artwork depicting a stylized interior scene with a mix of realistic and abstract elements. Features a room with wooden paneling, blank frames on the wall, checkered flooring, and assorted furniture items. The perspective combines isometric and one-point views, creating a dynamic, dream-like ambiance."

Model Text
GIT-large fine-tuned on COCO a room with a view of a room with a table and chairs.
BLIP-large there is a painting of a room with a checkered floor and a yellow wall
CoCa a painting of a yellow room with a checkerboard floor .
BLIP-2 OPT 6.7b a painting of a room with a table and chairs
Microsoft Azure Computer Vision A painting of a room with a door and a table

Cecily Brown

I found it interesting that models were finding people, fish and flowers in Cecily Brown's paintings.

Triumph of the Vanities I, 2018

Triumph of the Vanities I, 2018

Model Text
GIT-large fine-tuned on COCO a painting of a group of people
BLIP-large painting of a group of people standing in a field of grass
CoCa an abstract painting of a crowd of people .
BLIP-2 OPT 6.7b a painting with many colors and shapes
Microsoft Azure Computer Vision A painting of people in different colors
Triumph of the Vanities II, 2018

Triumph of the Vanities II, 2018

Model Text
GIT-large fine-tuned on COCO a painting of a group of people
BLIP-large painting of a group of people standing in a line with a sky background
CoCa an abstract painting of a bunch of people
BLIP-2 OPT 6.7b a painting of many people in a large group
Microsoft Azure Computer Vision A painting of many colors
Torment of the Tablecloth, 2010.

Torment of the Tablecloth, 2010.

Model Text
GIT-large fine-tuned on COCO a painting of a fish in a pond
BLIP-large a painting of a group of people with many faces on them
CoCa an abstract painting with a lot of different colors .
BLIP-2 OPT 6.7b an abstract painting with many different colors
Microsoft Azure Computer Vision A painting of a group of people
All of Your Troubles Come from Yourself, 2006–2009.

All of Your Troubles Come from Yourself, 2006–2009.

Model Text
GIT-large fine-tuned on COCO a painting of a group of people
BLIP-large a painting of a bunch of different colors on a painting
CoCa an abstract painting with many different colors .
BLIP-2 OPT 6.7b an abstract painting with many different colors
Microsoft Azure Computer Vision A painting of various colors
If Teardrops Could Be Bottled, 2021

If Teardrops Could Be Bottled, 2021

Model Text
GIT-large fine-tuned on COCO a painting of a blue and purple abstract painting
BLIP-large painting of a blue and purple abstract painting with a lot of colors
CoCa an abstract painting with blue , purple , orange and white colors .
BLIP-2 OPT 6.7b a painting with blue, purple and orange colors
Microsoft Azure Computer Vision A painting of a blue and purple color
Drunken Morning, 2015–16

Drunken Morning, 2015–16

Model Text
GIT-large fine-tuned on COCO a painting of a green field with a pink flower in the middle
BLIP-large a painting of a painting of a person walking down a path
CoCa an abstract painting of trees and bushes .
BLIP-2 OPT 6.7b an abstract painting with many colors and shapes
Microsoft Azure Computer Vision A painting of various colors

Events

Outdoor Yoga on the Stoop

Outdoor Yoga on the Stoop

Model Text
GIT-large fine-tuned on COCO a man in a white t - shirt and red shorts doing yoga in front of a building.
BLIP-large people doing yoga outside of a building with a large crowd of people
CoCa a man is reaching up to catch a yoga pose .
BLIP-2 OPT 6.7b a man doing yoga on the steps of a building
Microsoft Azure Computer Vision A man doing yoga in front of a building
Outdoor Market

Outdoor Market

Model Text
GIT-large fine-tuned on COCO a crowd of people at a market.
BLIP-large people are shopping at a market with tents and tables
CoCa a group of people standing next to each other at an event .
BLIP-2 OPT 6.7b people are shopping at a market with clothes on display
Microsoft Azure Computer Vision A group of people at a market
Mbeuk Idourrou collection, Imane Ayissi, Paris, France, Autumn/Winter 2019

Mbeuk Idourrou collection, Imane Ayissi, Paris, France, Autumn/Winter 2019

Model Text
GIT-large fine-tuned on COCO a model walks the runway at the [ unused0 ] fashion show
BLIP-large araffe wearing a pink dress with a pink cape and a pink skirt
CoCa a woman in a purple and pink dress with a pink cape .
BLIP-2 OPT 6.7b a model walks down the runway in a pink cape
Microsoft Azure Computer Vision A woman in a pink dress
Drag & Draw

Drag & Draw

Model Text
GIT-large fine-tuned on COCO a model in a pink dress on a runway
BLIP-large araffe dressed in pink standing on a table with a pink cloth
CoCa a woman in a long pink dress holding onto a white cloth .
BLIP-2 OPT 6.7b a woman in a pink dress is standing on a table
Microsoft Azure Computer Vision A woman in a pink dress
Honoring Juneteenth

Honoring Juneteenth

Model Text
GIT-large fine-tuned on COCO a group of young girls painting at a table.
BLIP-large there are two girls sitting at a table painting with paint
CoCa a group of people sitting at a table painting .
BLIP-2 OPT 6.7b a woman painting a picture
Microsoft Azure Computer Vision A group of children painting at a table
Stroller Tour

Stroller Tour

Model Text
GIT-large fine-tuned on COCO a baby wearing a hat is being held by a woman.
BLIP-large someone is putting a piece of felt onto a baby's head
CoCa a baby in a car seat being held by a woman .
BLIP-2 OPT 6.7b a baby in a car seat is being held by a woman
Microsoft Azure Computer Vision A baby in a stroller with a hand holding a piece of fabric