Inside Phi-4: How Microsoft is Embracing SLM Development

Microsoft has launched the latest models in its Phi family of small language models, giving developers speech, vision and text capabilities

When it comes to the AI race, larger isn’t always better. Indeed, the practicalities of developing and deploying powerful but resource-intensive large language models (LLMs) can bring as many challenges as opportunities. 

Complexities such as energy consumption, higher operational costs and the difficulty of embedding computationally demanding large models on devices with limited resources are giving rise to an alternative: small language models (SLMs). 

These smaller AI tools are gaining increasing traction by offering companies, developers and individuals key advantages related to efficiency, cost and the ability to be customised to perform well on specific tasks. 

Capitalising on this demand, Microsoft has launched Phi-4, the latest iteration of its series of SLMs specifically designed for efficiency and effectiveness. 

The Phi family was created to empower developers with advanced AI capabilities, offering performance equal to that of many larger counterparts but at low cost and low latency.

Phi-4 builds on these capabilities with two new models: Phi-4-multimodal, which can process speech, vision and text simultaneously to support innovative, context-aware applications, and Phi-4-mini, which excels in text-based tasks and provides high accuracy and scalability in a compact form. 

Both are available in Azure AI Foundry, Hugging Face and the NVIDIA API Catalog.

Phi-4: empowering innovation

“Phi-4-multimodal marks a new milestone in Microsoft’s AI development as our first multimodal language model,” says Weizhu Chen, Vice President, Generative AI at Microsoft, adding that the model is built based on direct customer feedback to integrate speech, vision and text processing into a single, unified architecture. 

“By leveraging advanced cross-modal learning techniques, this model enables more natural and context-aware interactions, allowing devices to understand and reason across multiple input modalities simultaneously,” he says. 

“Whether interpreting spoken language, analysing images, or processing textual information, it delivers highly efficient, low-latency inference – all while optimising for on-device execution and reduced computational overhead.”

Phi-4-multimodal’s new architecture enhances efficiency and scalability, incorporating a larger vocabulary for improved processing and supporting multilingual capabilities. 

This makes it highly suited for deployment on devices and edge computing platforms, says Chen. 

The model has already demonstrated competitive performance against larger counterparts in several areas. It outperforms other, specialised models in both automatic speech recognition and speech translation, and has topped the Hugging Face OpenASR leaderboard with a word error rate of 6.14%, beating the previous best of 6.5% as of February 2025.
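For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions and insertions needed to turn a transcript into the reference text, divided by the number of words in the reference. The sketch below illustrates that standard definition only. The function name and the simple lower-casing and whitespace splitting are assumptions made for illustration; the article does not describe how the leaderboard normalises text before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length.

    Illustrative only. Lower-casing and whitespace splitting are assumed
    here; real benchmarks may normalise text differently.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from transcript
            insertion = curr[j - 1] + 1  # extra word in transcript
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One wrong word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

On this definition, a word error rate of 6.14% means roughly six word-level errors for every 100 words of reference speech.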

In addition, it is among the few open models to successfully implement speech summarisation and achieve performance levels comparable to the GPT-4o model.

Phi-4-mini is a 3.8B parameter model designed for speed and efficiency. It too has outperformed larger models in text-based tasks like reasoning, maths, coding and instruction following. It delivers high accuracy and scalability, making it a powerful solution for advanced AI applications. 

Customisation and use

Phi-4’s smaller size means both models can be used in compute-constrained inference environments, and their lower computational needs make them a lower cost option with better latency, particularly for analytical tasks. 
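To give a rough sense of why parameter count matters in constrained environments, the memory needed just to hold a model's weights is approximately the number of parameters multiplied by the bytes used per parameter. The sketch below applies that arithmetic to the 3.8B and 5.6B figures quoted for the two models. The 16-bit and 4-bit precisions are assumptions chosen for illustration, since the article does not state which numeric formats the models ship in, and the estimate ignores activations and other runtime overhead.

```python
def weight_memory_gb(num_parameters: float, bits_per_parameter: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes).

    Ignores activations, caches and runtime overhead, so real usage is higher.
    """
    return num_parameters * bits_per_parameter / 8 / 1e9


if __name__ == "__main__":
    # Parameter counts are taken from the article; precisions are assumed.
    for name, params in [("Phi-4-mini", 3.8e9), ("Phi-4-multimodal", 5.6e9)]:
        for bits in (16, 4):
            gb = weight_memory_gb(params, bits)
            print(f"{name} at {bits}-bit: about {gb:.1f} GB of weights")
```

Under these assumptions, the weights of a 3.8B parameter model occupy about 7.6 GB at 16-bit precision and about 1.9 GB at 4-bit precision.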

According to Microsoft, the SLMs are well suited to applications such as being embedded directly into smartphones by manufacturers, being integrated into vehicles as in-car assistants, and automating complex calculations and reporting in the financial services industry.

SLMs on the rise

The launch of Phi-4 demonstrates the growth of SLMs as a viable alternative for simpler tasks and as a solution for organisations with limited resources, which can more easily fine-tune them to meet specific needs. 

Speaking on a Microsoft blog in 2024, Principal Product Manager for Generative AI at the company, Sonali Yadav, referenced this growth, saying: “What we’re going to start to see is not a shift from large to small, but a shift from a singular category of models to a portfolio of models where customers get the ability to make a decision on what is the best model for their scenario.”

Phi-4 By The Numbers
  • 5.6B - Parameters in the Phi-4-multimodal model, fewer than most competing multimodal systems
  • 6.14% - Word error rate on the Hugging Face OpenASR leaderboard, representing a new benchmark record
  • 128,000 - Maximum token sequence length supported by the Phi-4-mini model, enabling processing of extensive text

Microsoft isn’t the only tech company pursuing SLM development. IBM recently unveiled the next generation of its Granite large language model family, focusing on compact and efficient systems designed for real-world business applications.

The company’s Granite 3.2 models continue its strategy of focusing on smaller models designed to deliver specific capabilities without high computational needs. 

The release includes a new vision language model able to handle document understanding tasks that, according to IBM, performs at a level equivalent to larger competitors. 

The company has also integrated ‘chain of thought’ reasoning into the 2B and 8B parameter versions of Granite 3.2, allowing them to break down complex tasks into smaller steps in a way similar to human reasoning. 

“The next era of AI is about efficiency, integration and real-world impact – where enterprises can achieve powerful outcomes without excessive spend on compute,” says Sriram Raghavan, Vice President of IBM AI Research.

As well as the Granite updates, IBM released the next generation of its TinyTimeMixers models. Despite containing fewer than 10 million parameters, these are capable of forecasting time series data up to two years into the future. 

This makes them applicable for tasks such as supply chain planning, retail inventory management and financial trend analysis. 

“IBM's latest Granite developments [and] focus on open solutions demonstrate another step forward in making AI more accessible, cost-effective and valuable for modern enterprises,” notes Raghavan.



AI Magazine is a BizClik brand
