Apple Unveils MM1.5: A Multimodal AI Model Family Scaling to 30 Billion Parameters for Advanced Image Recognition and Language Reasoning

October 13, 2024 – Apple has unveiled its latest multimodal AI model, MM1.5, which scales up to 30 billion parameters. The model is an evolution of its predecessor, MM1, and is built on the same fundamental architecture.

Continuing a data-centric training approach, MM1.5 studies how mixing different data types at each training stage affects model performance. Detailed model documentation has been made available on Hugging Face.

Offered at parameter scales ranging from 1 billion to 30 billion, MM1.5 combines image recognition with natural language reasoning. Apple’s research team has refined the data-mixing strategies in this updated version, significantly enhancing the model’s abilities in text-rich image understanding, visual grounding and localization, and multi-image reasoning.

According to the accompanying research paper, introducing high-quality OCR data and synthetic image captions during MM1.5’s continual pre-training phase notably improved the model’s understanding of text-rich images.
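The data-mixing idea described above can be sketched as weighted sampling over data sources during pre-training. The category names below mirror the article; the ratios are illustrative placeholders, not the values Apple actually used.

```python
import random

# Illustrative continual pre-training mixture. Category names follow the
# article (OCR data, synthetic captions); the ratios are made-up placeholders.
MIXTURE = {
    "image_caption": 0.45,      # natural image-caption pairs
    "ocr_text_rich": 0.35,      # high-quality OCR / text-rich image data
    "synthetic_caption": 0.20,  # synthetic image captions
}

def sample_batch_sources(batch_size, mixture, rng=random):
    """Draw a data-source label for each example in a batch,
    proportionally to the mixture weights."""
    names = list(mixture)
    weights = [mixture[n] for n in names]
    return rng.choices(names, weights=weights, k=batch_size)

rng = random.Random(0)  # seeded for reproducibility
batch = sample_batch_sources(8, MIXTURE, rng)
```

In a real training loop, each sampled label would select which dataset the next example is drawn from, so the effective mixture matches the configured ratios in expectation.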

Furthermore, by analyzing how different data types affect performance during the supervised fine-tuning stage, the researchers optimized the mixture of visual instruction-tuning data. This optimization allows even the smaller-scale models, such as the 1-billion and 3-billion parameter versions, to perform remarkably well, achieving greater efficiency.
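Searching for a good fine-tuning data mixture can be pictured as a small grid search over ratios that sum to one, scored against some evaluation metric. Everything here is a toy stand-in: the category names and the proxy score (which simply rewards balanced mixtures) are hypothetical, not the paper's method.

```python
from itertools import product

def proxy_score(weights):
    """Hypothetical stand-in for a real benchmark evaluation:
    rewards balanced mixtures by penalizing variance."""
    mean = sum(weights) / len(weights)
    return -sum((w - mean) ** 2 for w in weights)

def best_mixture(categories, grid):
    """Grid-search mixture ratios that sum to 1 and maximize proxy_score."""
    best_score, best_weights = float("-inf"), None
    for combo in product(grid, repeat=len(categories)):
        if abs(sum(combo) - 1.0) > 1e-6:  # keep only valid mixtures
            continue
        score = proxy_score(combo)
        if score > best_score:
            best_score, best_weights = score, dict(zip(categories, combo))
    return best_weights

cats = ["text_rich", "grounding", "general_vqa"]  # illustrative categories
mix = best_mixture(cats, [0.2, 0.3, 0.4, 0.5])
```

In practice the scoring function would be an actual benchmark run per candidate mixture, which is why such sweeps are typically done on the smaller model variants first.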

Apple has also introduced two specialized variants: MM1.5-Video, tailored for video comprehension, and MM1.5-UI, designed for understanding mobile user interfaces. MM1.5-UI in particular shows promise as a future “Apple-branded” AI: it can handle a range of visual grounding and localization tasks, summarize on-screen functions, and interact with users through dialogue.
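Visual grounding for UI tasks typically means the model emits coordinates for on-screen elements, which the client then maps onto the actual screen. The textual `<box>` format and screen size below are hypothetical illustrations, not MM1.5-UI's actual output format.

```python
import re

# Hypothetical grounding format: "<box>x1,y1,x2,y2</box>" with coordinates
# normalized to [0, 1]. This illustrates the general idea of parsing a UI
# model's grounded reply; it is not MM1.5-UI's documented format.
BOX_RE = re.compile(r"<box>([\d.]+),([\d.]+),([\d.]+),([\d.]+)</box>")

def boxes_to_pixels(text, width, height):
    """Convert normalized grounding boxes in model text to pixel rectangles."""
    rects = []
    for m in BOX_RE.finditer(text):
        x1, y1, x2, y2 = map(float, m.groups())
        rects.append((round(x1 * width), round(y1 * height),
                      round(x2 * width), round(y2 * height)))
    return rects

# Example reply from a hypothetical UI-grounding model, mapped to a
# 1170x2532 screen.
reply = 'The "Send" button is at <box>0.70,0.90,0.95,0.97</box>.'
print(boxes_to_pixels(reply, 1170, 2532))
```

Normalized coordinates keep the model's output independent of device resolution; the client applies the scaling for its own screen.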

Although MM1.5 already performs well across multiple benchmarks, Apple’s team plans to further improve its understanding of mobile device UIs by integrating text, image, and user-interaction data more tightly and by designing more sophisticated architectures. These advancements aim to make the “Apple-branded” AI even more capable.
