How We Productionalized Generative AI Models at WOMBO and Served Over 100 Million Users

 
WOMBO Me: AI Avatar Generator

Intro: AI is Mainstream

Today, almost everyone has heard of Midjourney, DALL-E and ChatGPT. Suffice it to say that AI - more specifically, generative AI - has become mainstream.
At WOMBO, we were ahead of the curve. Over the past 3 years, with 3 app launches and over 150 million downloads, we have been one of the most successful consumer generative AI companies in the world. First, in 2021 we launched WOMBO and amassed over 74 million users (link) in the first 10 months. Next, Dream, our text-to-image art generation app, won App of the Year on the Google Play Store in 2022 (link) and has amassed over 70 million installs to date.
We are now on to our third app, WOMBO Me (android, ios), which is also on its way to viral success, with almost 1 million users in the past 30 days. WOMBO Me lets users generate images of themselves in different scenarios - for example, professional-looking headshots that are good enough to use as LinkedIn profile pictures.
WOMBO Me beginning to take off
 

The Winning Formula

The formula for WOMBO’s success comes down to:
 
  1. Identifying AI models that can give users interesting results,
  2. Creating the infrastructure necessary to scale these models to millions of users in a cost-effective way, and
  3. Creating really simple and beautiful user interfaces.
 
This blog post will focus on the first two items - namely, how we were able to effectively scale these AI models, which are often very complex and must run on expensive GPUs.
 

The Golden Age of Programming

We live in a tech culture where open source code is so widely available that any developer with a laptop and an internet connection can build an incredible array of apps, tools, websites and products. Now, especially with open source AI models, we are living in the golden age of programming.
Ten years ago, when I was taking classes in college, I remember a professor telling us that if we could solve the image recognition task, we would be millionaires. Only a few years later, Yann LeCun, along with Yoshua Bengio and Geoffrey Hinton, shared the $1,000,000 prize of the 2018 ACM A.M. Turing Award for their work on deep learning, which led to advancements in image recognition, among other tasks (link).
 
Those developments mean that today, a developer can write a program of about 20 lines of code that detects whether a picture contains the Eiffel Tower or not, using open source frameworks such as Hugging Face's Transformers (link).
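To make that concrete, here is a minimal sketch of roughly what those ~20 lines could look like, using the zero-shot image classification pipeline from Transformers (the checkpoint and image path are illustrative, not the exact code we run):

```python
# Minimal sketch: detect whether a photo contains the Eiffel Tower using a
# CLIP-based zero-shot image classification pipeline from Transformers.
# The checkpoint and file path below are illustrative assumptions.
from transformers import pipeline
from PIL import Image

classifier = pipeline(
    "zero-shot-image-classification",
    model="openai/clip-vit-base-patch32",
)

image = Image.open("photo.jpg")  # hypothetical input image
results = classifier(image, candidate_labels=["the Eiffel Tower", "something else"])

# The pipeline returns a score per candidate label; pick the highest one
best = max(results, key=lambda r: r["score"])
print(best["label"] == "the Eiffel Tower", best)
```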
Image generated with SDXL
 

Finding New State-of-the-Art Models

By following AI researchers on Twitter, updates on Hugging Face, and other sources, we are able to keep up with the latest advancements. We crawl GitHub to find new code and model weights released by researchers at universities and at large companies like OpenAI, Google, or Meta.
Oftentimes, the new codebases we find are difficult to understand and test, so we have a dedicated AI team whose members have become experts at exploring these models.
We have gotten really good at finding these models, getting results, setting up demos for the team to try, and either adopting them or moving on quickly to the next one. Sometimes we combine multiple models from different sources to create interesting results, such as a text-to-image generator paired with an image upscaler (see the sketch below).
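As a rough illustration of that kind of chaining (the model IDs below are public checkpoints from the diffusers library, not necessarily the exact models we shipped), the pattern looks like this:

```python
# Sketch of chaining two open source models: text-to-image, then upscaling.
# Model IDs are public example checkpoints, not necessarily our production ones.
import torch
from diffusers import AutoPipelineForText2Image, StableDiffusionUpscalePipeline

prompt = "a cozy cabin in a snowy forest, digital art"

# Step 1: generate a small base image with a text-to-image model
txt2img = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
low_res = txt2img(prompt, height=512, width=512).images[0]

# Step 2: pass the result through a separate 4x upscaler model
upscaler = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")
high_res = upscaler(prompt=prompt, image=low_res).images[0]
high_res.save("result.png")
```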
Once the AI team finds a good model, the backend team gets to work.

The Serving Infrastructure for our AI Models

Bringing AI models to production and giving millions of users access to them is a hard task, for various reasons. Some of the challenges we faced when serving these models include long inference times, large GPU memory (VRAM) requirements, expensive GPUs, and variable user traffic.
Image generated with Stable Diffusion XL (SDXL)

Serving the models in a cost-effective way - why we chose to self-host

Now, you might be asking: why not just use a service like Google's Vertex AI, OpenAI's DALL-E API, or Stability's API? There are lots of service providers offering easy access to AI models these days.
Although we've explored these services, we decided to host our own models on our own GPU clusters (hosted in the cloud), for reasons that we will explore below.

Cost

This is probably the most important reason why we run the models on our own servers. If we look at WOMBO Dream, we've generated more than 5 billion images. At $0.04 per image, using an external service like DALL-E would have cost us $200 million! We would not have been able to offer such a fast and low cost (or free) experience to our users.
When we launched Dream, we were actually using the Google Cloud Translation API. However, during cost saving efforts, we realized that we were spending about $15,000 a month on the service. We were able to reduce that cost by 10x, to about $1,500 a month, by finding an open source translation library called Argos, with which we quickly built an internal translation API.
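For reference, this is roughly the kind of call our internal translation API wraps, following the library's documented usage (the language pair here is just an example):

```python
# Rough sketch of using Argos Translate, based on the library's documented
# usage; our internal translation API wraps calls along these lines.
import argostranslate.package
import argostranslate.translate

from_code, to_code = "fr", "en"

# One-time setup: download and install the French -> English model package
argostranslate.package.update_package_index()
available = argostranslate.package.get_available_packages()
package = next(
    p for p in available if p.from_code == from_code and p.to_code == to_code
)
argostranslate.package.install_from_path(package.download())

# Translate offline, on our own hardware
print(argostranslate.translate.translate("Bonjour tout le monde", from_code, to_code))
```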

Availability

At WOMBO we like to be early. Because of this, there often simply aren’t any APIs available to use. When we launched Dream, there were no image generation APIs available. VQGAN, the image generation model that we used, wasn’t even that widely known yet!
With Stable Diffusion, the open source community frequently ships updates that bring new styles, new optimizations, and new quality improvements. Most recently, for example, LCM-LoRAs were released for Stable Diffusion, lowering generation time by almost 10x (see the sketch below)!
Images generated using VQGAN in WOMBO Dream
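For those curious, here is a sketch of the LCM-LoRA speedup, following the usage documented by the diffusers library (the checkpoints shown are the public ones, not necessarily our production configuration):

```python
# Sketch of applying an LCM-LoRA to SDXL, following the diffusers-documented
# pattern; checkpoints are the public ones, not necessarily what we deploy.
import torch
from diffusers import DiffusionPipeline, LCMScheduler

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Swap in the LCM scheduler and load the LCM-LoRA weights
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")

# 4 steps instead of the usual 25-50 is where the ~10x speedup comes from
image = pipe(
    "a professional headshot, studio lighting",
    num_inference_steps=4,
    guidance_scale=1.0,
).images[0]
image.save("headshot.png")
```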

Control

Finally, even if we were to switch to APIs for simplicity and ease of development, they often don't expose all the parameters we need to play with. For instance, many image generation APIs only offer fixed image dimensions, like 1024x1024. Part of WOMBO's success was our ability to play around with these parameters and fine-tune them for different use cases. For example, we generated images at the same aspect ratio as people's phones, rather than as standard square images (see the sketch below).
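As a small illustration (this helper is hypothetical, not our production code), matching a phone's aspect ratio comes down to computing dimensions that keep each side a multiple of 8, as diffusion models typically require:

```python
# Hypothetical helper: compute generation dimensions matching a phone's
# portrait aspect ratio instead of a square, rounded to multiples of 8.
def phone_dims(base_width: int = 768, aspect_w: int = 9, aspect_h: int = 16) -> tuple[int, int]:
    height = round(base_width * aspect_h / aspect_w / 8) * 8
    return base_width, height

print(phone_dims())  # (768, 1368) for a 9:16 portrait image
```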

GPUs and Queues - How to Handle Long Inference Times

One of the biggest challenges we face when dealing with these models is long generation times. Users today are accustomed to getting results fast - they don't wait around. That means all of our infrastructure, as well as the models themselves, needs to be efficient.
Not only do we need to be efficient, but we also need to handle load! We have users around the world, and we can hit peaks of 1 million concurrent users.

Enter Queues

Simple queue based architecture
In order to handle load, we make good use of a queue infrastructure. The idea is the following:
  1. The user sends a request to the backend.
  2. The backend creates a new task (an image generation, for example) and sends it to a queue. We use Amazon SQS.
  3. We have a pool of GPU workers which constantly poll the queue. Once a task is available, a worker picks it up and completes it (generates an image, for example).
  4. Once the task is finished, the worker calls the Dream backend with the result, and the result is then sent back to the user.
 
With regard to long inference times, this architecture frees our main backend service to handle other business logic while the GPUs do the work.
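A simplified worker loop looks something like this (the queue URL, payload fields, and run_model function are placeholders, not our actual service):

```python
# Simplified sketch of a GPU worker polling SQS; the queue URL, payload
# fields, and run_model() are placeholders, not our actual service.
import json
import boto3
import requests

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/generation-tasks"
sqs = boto3.client("sqs", region_name="us-east-1")

def run_model(params: dict) -> str:
    """Placeholder for the actual GPU inference (e.g. generating an image)."""
    return "https://cdn.example.com/result.png"

while True:
    # Long-poll so idle workers don't hammer the queue
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        task = json.loads(msg["Body"])
        result_url = run_model(task["params"])
        # Report the result back to the backend, which forwards it to the user
        requests.post(task["callback_url"], json={"task_id": task["id"], "result": result_url})
        # Delete the message only after the task has been handled
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```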

User Traffic and Virality Cycles - The Importance of Autoscaling

In May 2022, this TikTok trend went viral, sending millions of users to our app. We had to scale from hundreds of thousands to millions of daily users. If you’ve ever dealt with scaling GPUs, you know that this is difficult!
 
By setting up autoscaling, we are able to automatically increase or decrease the number of GPUs that are in our worker pool waiting for tasks to come from the queue. This not only helped us serve demand, but it also helped us reduce costs by not over-provisioning when the viral load disappeared.
Additionally, thanks to our queue based architecture, we were able to source GPUs from different cloud providers such as Azure, GCP and AWS, and increase our worker pool even further when needed.
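The core scaling signal is simply queue depth. Here is an illustrative sketch of that logic (the target backlog per worker and the scale_workers_to hook are hypothetical; in practice a managed autoscaling policy evaluates a similar signal):

```python
# Illustrative sketch: derive the desired GPU worker count from SQS backlog.
# TASKS_PER_WORKER and scale_workers_to() are hypothetical; in practice the
# cloud provider's managed autoscaling evaluates a similar signal.
import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/generation-tasks"
TASKS_PER_WORKER = 20  # assumed target backlog per GPU worker

sqs = boto3.client("sqs", region_name="us-east-1")

def desired_worker_count(min_workers: int = 2, max_workers: int = 200) -> int:
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL, AttributeNames=["ApproximateNumberOfMessages"]
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])
    desired = -(-backlog // TASKS_PER_WORKER)  # ceiling division
    return max(min_workers, min(max_workers, desired))

# A scheduler would periodically call this and resize the worker pool:
# scale_workers_to(desired_worker_count())
```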
 

Containerization and Flexibility

Finally, it’s important to mention our use of Docker containers, which allows us to bring our code to different services without any friction.
 
Docker makes our life easier by packaging a whole environment into a single runnable unit that can be easily deployed and run on any machine or cloud service. It also ensures consistency across development and production environments, so that when we have something working locally, we know it will also work in the cloud.
 
One issue with Docker, however, is that we suffer from cold starts and slow build times. Because AI models can be very large (multiple GBs per model), a new GPU node needs to download these files and load them into its video memory (VRAM), which can take quite some time. This GCP blog gives a good overview of the problem and possible solutions. Furthermore, services like modal.com look promising for containerization improvements.
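One common mitigation, sketched here with huggingface_hub (the model ID and cache path are illustrative, and this is a general technique rather than a description of our exact setup), is to bake the model weights into the Docker image at build time so a fresh GPU node doesn't have to download multiple GBs on its first request:

```python
# Sketch of one common mitigation: download model weights at `docker build`
# time with huggingface_hub so they ship inside the image (or a mounted
# cache) instead of being fetched on the node's first request.
# The model ID and cache path are illustrative.
from huggingface_hub import snapshot_download

snapshot_download(
    "stabilityai/stable-diffusion-xl-base-1.0",
    cache_dir="/opt/models",
)
```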

Conclusion

In summary, WOMBO has become a successful consumer generative AI company thanks to the engineering effort that went into scaling and productionalizing open source models.
Despite the challenges of long inference times, large GPU memory requirements, expensive GPUs, and variable user traffic, at WOMBO we’ve managed to effectively scale these AI models by hosting our own models on our own GPU clusters, utilizing open-source code, and staying updated with the latest advancements in AI.
We look forward to adding new features and bringing more exciting AI tools to mainstream audiences. In the future, we see big advancements in multi-modal models: models that can take text, audio, images or video as input and generate more text, audio, images or video as output.
The future is exciting, and we are happy to take part in shaping it.
 

Written by

Maxime Peabody

Software Engineer on both the Backend and AI teams at WOMBO