The text-to-image generator revolution is in full swing with tools such as OpenAI’s DALL-E 2 and GLIDE, as well as Google’s Imagen, gaining massive popularity – even in beta – since each was introduced over the past year.
These three tools are all examples of a trend in intelligent systems: text-to-image synthesis, in which a generative model conditioned on image captions produces novel visual scenes.
Intelligent systems that can create images and videos have a wide range of applications, from entertainment to education, with the potential to be used as accessible solutions for those with physical disabilities. Digital graphic design tools are widely used in the creation and editing of many modern cultural and artistic works. Yet, their complexity can make them inaccessible to anyone without the necessary technical knowledge or infrastructure.
That’s why systems that can follow text-based instructions and then perform a corresponding image-editing task are game-changing when it comes to accessibility. These benefits can also be easily extended to other domains of image generation, such as gaming, animation and creating visual teaching material.
The rise of text-to-image AI generators
AI has advanced over the past decade because of three significant factors – the rise of big data, the emergence of powerful GPUs and the re-emergence of deep learning. Generative AI systems are helping the tech sector realize its vision of the future of ambient computing — the idea that people will one day be able to use computers intuitively without needing to be knowledgeable about particular systems or coding.
AI text-to-image generators are now slowly transforming from generating dreamlike images to producing realistic portraits. Some even speculate that AI art will overtake human creations. Many of today’s text-to-image generation systems focus on learning to iteratively generate images based on continual linguistic input, just as a human artist can.
This iterative process, inspired by the way a human artist gradually transforms a blank canvas into a scene, is at the core of modern transformer- and diffusion-based generators. Systems trained to perform this task can leverage advances in text-conditioned single-image generation.
How 3 text-to-image AI tools stand out
AI tools that mimic human-like communication and creativity have always been buzzworthy. For the past four years, tech giants have prioritized building tools that generate images automatically.
There have been several noteworthy releases in the past few months – a few became immediate phenomena upon release, even though each was only available to a relatively small group of testers.
Let’s examine the technology of three of the most talked-about text-to-image generators released recently – and what makes each of them stand out.
OpenAI’s DALL-E 2: Diffusion creates state-of-the-art images
Released in April, DALL-E 2 is OpenAI’s newest text-to-image generator and successor to DALL-E, a generative language model that takes sentences and creates original images.
A diffusion model is at the heart of DALL-E 2, which can instantly add and remove elements while accounting for shadows, reflections and textures. Current research shows that diffusion models have emerged as a promising generative modeling framework, pushing the state of the art in image and video generation tasks. To achieve the best results, the diffusion model in DALL-E 2 uses a guidance method that optimizes sample fidelity (for photorealism) at the price of sample diversity.
DALL-E 2 learns the relationship between images and text through “diffusion,” which begins with a pattern of random dots and gradually alters that pattern toward an image as the model recognizes specific aspects of the picture. At 3.5 billion parameters, DALL-E 2 is a large model but, interestingly, isn’t nearly as large as GPT-3 and is smaller than its DALL-E predecessor (which had 12 billion). Despite its smaller size, DALL-E 2 generates images at four times the resolution of DALL-E, and human judges prefer its outputs more than 70% of the time for both caption matching and photorealism.
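The reverse-diffusion idea described above can be sketched in a few lines of code. This is a toy illustration only, not DALL-E 2's actual model: `denoise_step` is a hypothetical stand-in for a learned neural denoiser, and the "image" is just a small array.

```python
import numpy as np

def toy_reverse_diffusion(denoise_step, steps=50, shape=(8, 8), seed=0):
    """Sketch of the reverse diffusion loop: start from a pattern of
    random dots and repeatedly apply a denoising step until an image
    emerges from the noise."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)      # pure noise: the "random dots"
    for t in reversed(range(steps)):
        x = denoise_step(x, t)          # each step yields a slightly cleaner image
    return x

# Hypothetical stand-in "model": nudges the sample toward a fixed target
# image (a real denoiser would be a trained neural network conditioned
# on the text prompt and the timestep t).
target = np.ones((8, 8))

def denoise_step(x, t):
    return x + 0.1 * (target - x)

img = toy_reverse_diffusion(denoise_step)
```

After 50 such steps the residual noise shrinks by a factor of 0.9 per step, so the output lands very close to the target, mirroring how diffusion samplers converge from noise to a coherent picture.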
The versatile model can go beyond sentence-to-image generation. Using robust embeddings from CLIP, a computer vision system by OpenAI for relating text and images, it can create several variations of outputs for a given input while preserving semantic information and stylistic elements. Furthermore, unlike many other image representation models, CLIP embeds images and text in the same latent space, which allows language-guided image manipulations.
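Because text and images live in the same latent space, matching a caption to an image reduces to comparing embedding vectors. The sketch below shows the idea with hypothetical, hand-written 3-dimensional embeddings; real CLIP embeddings are high-dimensional vectors produced by trained encoders.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: the standard score for comparing embeddings
    that share a latent space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical pre-computed embeddings (toy values for illustration).
text_emb = np.array([0.9, 0.1, 0.3])          # e.g. "a photo of a corgi"
image_embs = {
    "corgi_photo":  np.array([0.8, 0.2, 0.25]),
    "city_skyline": np.array([-0.1, 0.9, 0.4]),
}

scores = {name: cosine_sim(text_emb, e) for name, e in image_embs.items()}
best = max(scores, key=scores.get)  # the image closest to the caption
```

The same geometry runs in reverse for language-guided editing: moving an image embedding along a direction in the shared space corresponds to a semantic change described by text.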
Conditioning image generation on CLIP embeddings improves diversity, but the approach has limitations. For example, unCLIP, which generates images by inverting the CLIP image encoder, is worse at binding attributes to objects than a corresponding GLIDE model. This is because the CLIP embedding itself does not explicitly bind characteristics to objects, so reconstructions from the decoder often mix up attributes and objects. Still, at the higher guidance scales used to generate photorealistic images, unCLIP yields greater diversity for comparable photorealism and caption similarity.
GLIDE by OpenAI: Realistic edits to existing images
OpenAI’s Guided Language-to-Image Diffusion for Generation and Editing, also known as GLIDE, was released in December 2021. GLIDE can automatically create photorealistic pictures from natural language prompts, allowing users to create visual material through simpler iterative refinement and fine-grained management of the created images.
This diffusion model achieves performance comparable to DALL-E despite using less than one-third of the parameters (3.5 billion versus DALL-E’s 12 billion). GLIDE can also convert basic line drawings into photorealistic images through its powerful zero-shot generation and inpainting capabilities for complex scenarios. In addition, GLIDE has low sampling latency and does not require CLIP reranking.
Most notably, the model can also perform image inpainting, making realistic edits to existing images through natural language prompts. This makes it comparable in function to editors such as Adobe Photoshop, but far easier to use.
Modifications produced by the model match the style and lighting of the surrounding context, including convincing shadows and reflections. These models can potentially aid humans in creating compelling custom images with unprecedented speed and ease, but they could also significantly reduce the effort required to produce convincing disinformation or deepfakes. To safeguard against these use cases while aiding future research, OpenAI’s team also released a smaller diffusion model and a noised CLIP model trained on filtered datasets.
Imagen by Google: Increased understanding of text-based inputs
Google’s Brain Team aimed to generate images with greater accuracy and fidelity by working from short, descriptive sentences. The model analyzes each part of a sentence as a digestible chunk of information and attempts to produce an image that matches that sentence as closely as possible.
Imagen builds on the prowess of large transformer language models for syntactic understanding, while drawing on the strength of diffusion models for high-fidelity image generation. In contrast to prior work that used only image-text data for model training, Google’s fundamental discovery was that text embeddings from large language models, when pretrained on text-only corpora (large, structured sets of texts), are remarkably effective for text-to-image synthesis. Furthermore, increasing the size of the language model boosts both sample fidelity and image-text alignment far more than increasing the size of the image diffusion model.
Instead of training a text encoder on image-text data, the Google team simply used an “off-the-shelf” text encoder, T5, to convert input text into embeddings. The frozen T5-XXL encoder maps input text into a sequence of embeddings that condition a 64×64 base image diffusion model, which is followed by two super-resolution diffusion models that upsample to 256×256 and then 1024×1024 images. The diffusion models are conditioned on the text embedding sequence and use classifier-free guidance, relying on new sampling techniques that allow large guidance weights without degrading sample quality.
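Classifier-free guidance, mentioned above, has a simple core: at each sampling step the model makes two noise predictions, one conditioned on the text and one unconditional, and extrapolates between them. A minimal sketch of that combination rule follows; the toy vectors stand in for real noise-prediction tensors.

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, w):
    """Combine the unconditional and text-conditioned noise predictions.
    w = 1 recovers the plain conditional prediction; larger w pushes
    samples harder toward the text prompt at the cost of diversity."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy stand-ins for the two noise predictions at one sampling step.
eps_u = np.array([0.0, 0.0])   # unconditional prediction
eps_c = np.array([1.0, -1.0])  # text-conditioned prediction

guided = classifier_free_guidance(eps_u, eps_c, w=7.5)
# guided == [7.5, -7.5]: the conditional signal is amplified 7.5x
```

The "new sampling techniques" in Imagen (such as thresholding the predicted image values) exist precisely because large guidance weights like `w=7.5` push predictions outside the valid pixel range if left unchecked.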
Imagen achieved a state-of-the-art FID score of 7.27 on the COCO dataset without ever being trained on COCO. When assessed on DrawBench against current methods including VQ-GAN+CLIP, Latent Diffusion Models, GLIDE and DALL-E 2, Imagen was preferred by human raters in terms of both sample quality and image-text alignment.
Future text-to-image opportunities and challenges
There is no doubt that quickly advancing text-to-image AI generator technology is paving the way for unprecedented opportunities for instant editing and generated creative output.
There are also many challenges ahead, ranging from questions about ethics and bias (though the creators have implemented safeguards within the models designed to restrict potentially destructive applications) to issues around copyright and ownership. The sheer amount of computational power required to train text-to-image models through massive amounts of data also restricts work to only significant and well-resourced players.
But there is also no question that each of these three text-to-image AI models stands on its own as a way for creative professionals to let their imaginations run wild.
VentureBeat’s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative enterprise technology and transact. Learn more about membership.