New machine learning models make AI artists even better
Video game designer Jason Allen made headlines this year with Théâtre D’opéra Spatial, his submission to the Colorado State Fair’s digital arts competition. Judges awarded him first place and a $300 prize, but the artwork also received a sudden flurry of global attention when it was discovered that Allen had used the AI-powered image generator Midjourney to create it.
Midjourney, DALL-E and DALL-E 2 have brought a wealth of weird and wonderful images to the world as users type in natural language descriptions and share the dream-like results.
DALL-E 2 uses a “diffusion model”, which attempts to take the input text in its entirety and generate an image from that. But the output becomes less accurate as that text becomes more complex; the existing model appears to struggle to understand the composition of concepts, and confuses the attributes of different objects and the relations between them.
Scientists from MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) say they looked at the problem from a different angle: they added models together so that they could cooperate, and found that this produced more creative combinations in the final images.
“DALL-E 2 is good at generating natural images but has difficulty understanding object relations sometimes,” says MIT CSAIL PhD student and co-lead author Shuang Li. “Beyond art and creativity, perhaps we could use our model for teaching. If you want to tell a child to put a cube on top of a sphere, and if we say this in language, it might be hard for them to understand. But our model can generate the image and show them.”
Machine learning models help to learn about language
The team’s model, Composable Diffusion, uses diffusion models together with compositional operators to combine separate text descriptions without further training, which lets it capture the details of the text more accurately. In one example, a prompt calling for “a pink sky”, “a blue mountain on the horizon” and “cherry blossoms in front of the mountain” produced an accurate image with this model, while the original model returned a blue sky and gave everything in front of the mountains a pink colour.
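For readers who want a feel for what “combining text descriptions” could look like in practice, here is a minimal Python sketch. It is an illustration under stated assumptions, not the team’s code: it assumes each description yields its own noise prediction at a denoising step, and that the predictions are merged by adding each one’s weighted offset from an unconditional prediction, in the style of classifier-free guidance. The function name `compose_noise`, the weights and the toy numbers are all hypothetical.

```python
from typing import List, Sequence


def compose_noise(
    eps_uncond: Sequence[float],
    eps_conds: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Merge per-description noise predictions into one prediction.

    Assumed rule (not taken from the article):
        combined = eps_uncond + sum_i w_i * (eps_cond_i - eps_uncond)
    """
    if len(eps_conds) != len(weights):
        raise ValueError("need exactly one weight per description")
    combined = list(eps_uncond)
    for eps_c, w in zip(eps_conds, weights):
        if len(eps_c) != len(eps_uncond):
            raise ValueError("all predictions must have the same length")
        # Add this description's weighted offset from the unconditional guess.
        for i, (c, u) in enumerate(zip(eps_c, eps_uncond)):
            combined[i] += w * (c - u)
    return combined


# Toy stand-ins for the noise a real model would predict for each description:
# "a pink sky", "a blue mountain on the horizon",
# "cherry blossoms in front of the mountain".
eps_uncond = [0.5, 0.5, 0.5]
eps_conds = [
    [1.5, 0.5, 0.5],
    [0.5, 2.5, 0.5],
    [0.5, 0.5, 3.5],
]
composed = compose_noise(eps_uncond, eps_conds, [1.0, 1.0, 1.0])
print(composed)  # [1.5, 2.5, 3.5]
```

In this toy run each description nudges a different part of the prediction, and the merged result carries all three nudges at once. In a real system the lists would be image-sized arrays produced by a trained diffusion model at every denoising step.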
“The fact that our model is composable means that you can learn different portions of the model, one at a time,” says co-lead author and MIT CSAIL PhD student Yilun Du. “You can first learn an object on top of another, then learn an object to the right of another, and then learn something left of another. Since we can compose these together, you can imagine that our system enables us to incrementally learn language, relations, or knowledge, which we think is a pretty interesting direction for future work.”
The research, supported by Raytheon BBN Technologies Corp., Mitsubishi Electric Research Laboratory, and DEVCOM Army Research Laboratory, has received the approval of DALL-E 2’s co-creator Mark Chen.
“This is a nice idea that leverages the energy-based interpretation of diffusion models so that old ideas around compositionality using energy-based models can be applied,” says Chen, who is a research scientist at OpenAI, the company behind DALL-E.