Kybernetika 60 no. 6, 819-833, 2024

DDIMCache: An enhanced text-to-image diffusion model on mobile devices

Wu Qifeng

DOI: 10.14736/kyb-2024-6-0819

Abstract:

On June 11, 2024, OpenAI announced a collaboration with Apple to deeply integrate the ChatGPT generative language model into Apple's product lineup. With support from various generative AI models, devices such as smartphones will become more intelligent. The text-to-image diffusion model, known for its stable and superior generative capabilities, has gained wide recognition in image generation and is likely to play a crucial role on mobile devices. However, the large size and complex architecture of diffusion models result in high computational cost and slow execution. As a result, diffusion models require high-end GPUs or cloud-based inference, and the latter often raises personal privacy and data security concerns. This paper presents a multiplicative effect joint optimization method for complex models such as diffusion models, enabling efficient execution on mobile devices. The method integrates multiple optimization strategies and leverages their interactions to create synergies and enhance overall performance. Building on this multiplicative effect joint optimization approach, we introduce DDIMCache, an enhanced text-to-image diffusion model. DDIMCache maintains image generation quality while achieving high speed, generating 512×512 images in approximately 6 seconds. This provides powerful image generation capabilities and an enhanced user experience for mobile users. In addition, as a foundation model, Stable Diffusion supports further applications such as image editing, inpainting, style transfer, and super-resolution, all of which can have a significant impact. The ability to run the model entirely on mobile devices, without an internet connection, opens up many new possibilities.

Keywords:

diffusion model, text-to-image, mobile devices

Classification:

68T01

References:

  1. R. Rombach, A. Blattmann and D. Lorenz et al.: High-resolution image synthesis with latent diffusion models. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022, pp. 10684-10695.   DOI:10.1109/CVPR52688.2022.01042
  2. J. Hou and Z. Asghar: World's first on-device demonstration of stable diffusion on an android phone. Qualcomm 24 (2023).   https://www.qualcomm.com/news/onq/2023/02/worlds-first-on-device-demonstration-of-stable-diffusion-on-android
  3. Y. H. Chen, R. Sarokin and J. Lee et al.: Speed is all you need: On-device acceleration of large diffusion models via GPU-aware optimizations. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops 2023, pp. 4651-4655.   DOI:10.1109/CVPRW59228.2023.00490
  4. Y. Shang and Z. Yuan et al.: Post-training quantization on diffusion models. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023, pp. 1972-1981.   DOI:10.1109/CVPR52729.2023.00196
  5. X. Li, Y. Liu and L. Lian et al.: Q-diffusion: Quantizing diffusion models. In: Proc. IEEE/CVF International Conference on Computer Vision 2023, pp. 17535-17545.   DOI:10.1109/ICCV51070.2023.01608
  6. X. Ma, G. Fang and X. Wang: Llm-pruner: On the structural pruning of large language models. Adv. Neural Inform. Process. Systems 36 (2023), 21702-21720.   CrossRef
  7. Y. Li, G. Yuan and Y. Wen et al.: Efficientformer: Vision transformers at mobilenet speed. Adv. Neural Inform. Process. Systems 35 (2022), 12934-12949.   CrossRef
  8. J. Sohl-Dickstein, E. Weiss and N. Maheswaranathan et al.: Deep unsupervised learning using nonequilibrium thermodynamics. In: International Conference on Machine Learning PMLR, 2015, pp. 2256-2265.   CrossRef
  9. J. Song, C. Meng and S. Ermon: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  10. S. M. Jain: Hugging face. Introduction to transformers for NLP: With the hugging face library and models to solve problems. Apress, Berkeley 2022, 51-67.   DOI:10.1007/978-1-4842-8844-3_4
  11. O. Ronneberger, P. Fischer and T. Brox: U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015, Proc. 18th International Conference, Munich 2015, Part III 18. Springer International Publishing, pp. 234-241.   CrossRef
  12. T. Y. Lin, M. Maire and S. Belongie et al.: Microsoft coco: Common objects in context. In: Computer Vision - ECCV 2014, Proc. 13th European Conference, Zurich 2014, Part V 13. Springer International Publishing 2014, pp. 740-755.   CrossRef
  13. A. Q. Nichol and P. Dhariwal: Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR 2021, pp. 8162-8171.   CrossRef