Since the release of DALL-E in 2021, the first AI image-generating model to popularize the technology, the text-to-image space has made major strides in quality, speed, and prompt adherence. Even so, the fastest image generators typically take a couple of seconds to create an image -- except this one.
HART, short for Hybrid Autoregressive Transformer, is an AI text-to-image generator developed by MIT, Nvidia, and Tsinghua University. Its standout feature is speed: it generates images with 3.1 to 5.9 times lower latency than state-of-the-art diffusion models. The key difference? How HART was trained.
Without getting too technical: instead of being a diffusion model, the architecture behind most popular AI image generators, including OpenAI's DALL-E and Google's Imagen 3, HART is an autoregressive (AR) visual generation model, the same approach as OpenAI's recently released GPT-4o image generator.
AR models offer more control over the final image by generating it step by step, one token at a time. However, training these models is costly, and image quality can suffer at higher resolutions. To address this, the researchers introduced a hybrid tokenizer that processes different parts of the image more efficiently. The result: HART is faster and has higher throughput than diffusion models.
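To make the contrast concrete, here is a minimal toy sketch (not HART's actual code; all names and numbers are hypothetical) of the two ideas above: autoregressive generation predicts one discrete image token at a time, each conditioned on everything generated so far, and a hybrid scheme then adds a lightweight continuous "residual" pass to restore fine detail lost when the image was quantized into tokens.

```python
# Toy illustration of autoregressive image-token generation plus a
# hybrid residual refinement. Purely conceptual: a real model replaces
# predict_next_token with a trained transformer.

VOCAB_SIZE = 16   # hypothetical size of the image-token codebook
NUM_TOKENS = 8    # hypothetical number of tokens in a (tiny) image

def predict_next_token(prefix):
    """Stand-in for a trained transformer: deterministically maps the
    already-generated prefix to the next discrete token."""
    return (sum(prefix) + len(prefix)) % VOCAB_SIZE

def generate_discrete_tokens():
    """Autoregressive loop: each token depends on all previous ones."""
    tokens = []
    for _ in range(NUM_TOKENS):
        tokens.append(predict_next_token(tokens))
    return tokens

def refine_with_residuals(tokens):
    """Sketch of the hybrid idea: discrete tokens capture the coarse
    image; a small continuous correction restores quantization detail."""
    return [t + 0.1 for t in tokens]  # placeholder residual correction

coarse = generate_discrete_tokens()
final = refine_with_residuals(coarse)
print(coarse)
print(final)
```

Diffusion models, by contrast, would start from pure noise and refine the entire image over many denoising steps, which is part of why their latency per image tends to be higher.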
Most AI models take at least a few seconds to generate images, which is already impressively quick, so I didn't expect HART's speed to leave me that impressed. I was wrong. The demo includes a stopwatch that times each generation, and after a few tries I noticed it consistently took about 1.8 seconds to produce an image. For context, that's roughly how long it takes to say 'Mississippi.'
The same prompt I used to render the images at the top of the article took OpenAI's GPT-4o image generator one minute and 45 seconds, and Google's Imagen 3 about 10 seconds. The quality of all three generators was comparable, with Google's image taking the lead as the best combination of speed and quality.
Prompt: A dog wearing a clown hat on a colorful background. (Left to right: ChatGPT's 4o image model, Gemini's Imagen 3, HART.)
Yet even though Google's model is fast, Imagen 3 still took more than five times as long as HART to generate the picture, which illustrates just how quick HART is. I have tested most of the text-to-image models on the market, and HART is the quickest.
If you want to try HART, you can access it for free here. The inference code is also open-sourced and available in a public GitHub repository, which developers, academics, and AI aficionados can use for further research on image generation.