OpenAI has launched three new models that represent significant advancements in AI capabilities: GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano. According to the company’s announcement, these models outperform their predecessors across the board, with major gains in coding, instruction following, and the ability to process up to 1 million tokens of context.
The models come with a refreshed knowledge cutoff of June 2024, providing more up-to-date information than previous versions.
Breaking Down the Performance Improvements
The flagship GPT-4.1 model demonstrates remarkable progress in several key areas, as measured by industry-standard benchmarks:
On SWE-bench Verified, a measure of real-world software engineering skills, GPT-4.1 completes 54.6% of tasks, compared to 33.2% for GPT-4o. This represents a 21.4 percentage point improvement over GPT-4o and a 26.6 percentage point gain over GPT-4.5.
For instruction following ability, GPT-4.1 scores 38.3% on Scale’s MultiChallenge benchmark, marking a 10.5 percentage point increase over GPT-4o.
The model also sets a new state-of-the-art result on Video-MME, a benchmark for multimodal long context understanding, scoring 72.0% on the long, no subtitles category—a 6.7 percentage point improvement over GPT-4o.
Real-World Applications and Testing
OpenAI’s announcement highlights feedback from companies that participated in alpha testing of the new models:
Windsurf found that GPT-4.1 scores 60% higher than GPT-4o on its internal coding benchmark, a metric that correlates strongly with how often code changes are accepted on first review. According to OpenAI, Windsurf’s users also found the model 30% more efficient in tool calling and about 50% less likely to repeat unnecessary edits or to read code in overly narrow steps.
Blue J tested GPT-4.1 against their most challenging real-world tax scenarios and found it to be 53% more accurate than GPT-4o, according to the OpenAI announcement.
Thomson Reuters, working with their legal AI assistant CoCounsel, was able to improve multi-document review accuracy by 17% when using GPT-4.1 across their internal long-context benchmarks.
Carlyle used GPT-4.1 to extract financial data across multiple documents and found it performed 50% better on retrieval from very large documents with dense data, based on information provided in OpenAI’s announcement.
Strategic Model Differentiation
OpenAI has positioned each of the three new models for different use cases:
GPT-4.1 serves as the flagship model, offering the best overall performance for complex tasks.
GPT-4.1 mini represents what OpenAI describes as “a significant leap in small model performance,” even outperforming GPT-4o on many benchmarks. According to the announcement, it matches or exceeds GPT-4o in intelligence evaluations while reducing latency by nearly half and cost by 83%.
GPT-4.1 nano is described as OpenAI’s fastest and cheapest model to date. Despite its size, it retains the full 1 million token context window and beats GPT-4o mini on several benchmarks, including an 80.1% score on MMLU.
Making Advanced AI More Accessible
OpenAI has reduced pricing across all models, with GPT-4.1 being 26% less expensive than GPT-4o for median queries. The company has also increased the prompt caching discount to 75% (up from 50%) and eliminated additional charges for long-context requests.
The official pricing structure as announced by OpenAI (all prices per 1 million tokens):

| Model | Input | Cached input | Output | Blended price* |
|---|---|---|---|---|
| GPT-4.1 | $2.00 | $0.50 | $8.00 | $1.84 |
| GPT-4.1 mini | $0.40 | $0.10 | $1.60 | $0.42 |
| GPT-4.1 nano | $0.10 | $0.025 | $0.40 | $0.12 |

*Based on typical input/output and cache ratios, according to OpenAI.
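To see how a blended figure of this kind works, the sketch below computes a weighted average of the per-token rates under a made-up traffic mix. The mix ratios are an assumption for illustration; OpenAI has not published the exact ratios behind its figures, so the output will not necessarily reproduce the table above.

```python
# Back-of-the-envelope check of the "blended price" column: a weighted average
# of the per-1M-token rates. The traffic mix below is a made-up assumption;
# OpenAI has not published the exact ratios behind its blended figures.
PRICES = {
    # model: (input, cached input, output), USD per 1M tokens
    "gpt-4.1":      (2.00, 0.50, 8.00),
    "gpt-4.1-mini": (0.40, 0.10, 1.60),
    "gpt-4.1-nano": (0.10, 0.025, 0.40),
}

def blended_price(model: str, mix=(0.7, 0.2, 0.1)) -> float:
    """Weighted average price for a hypothetical input/cached/output token mix."""
    return sum(weight * rate for weight, rate in zip(mix, PRICES[model]))

for model in PRICES:
    print(f"{model}: ~${blended_price(model):.2f} per 1M tokens (assumed mix)")
```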
Phasing Out GPT-4.5 Preview
The announcement also states that OpenAI will begin deprecating GPT-4.5 Preview in the API, as GPT-4.1 offers improved or similar performance on many key capabilities at much lower cost and latency. GPT-4.5 Preview will be turned off on July 14, 2025, giving developers three months to transition.
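For most applications, the transition should amount to changing the model name in the API call. Below is a minimal sketch, assuming the official openai Python SDK (v1.x) and an OPENAI_API_KEY environment variable; the prompt text is illustrative.

```python
# Minimal migration sketch, assuming the official openai Python SDK (v1.x)
# and an OPENAI_API_KEY environment variable; the prompt is illustrative.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4.1",  # previously: model="gpt-4.5-preview"
    messages=[{"role": "user", "content": "Summarize the key risks in this clause: ..."}],
)
print(response.choices[0].message.content)
```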
Coding Capabilities
According to OpenAI, GPT-4.1 is significantly better than GPT-4o across a range of coding work, including agentically solving coding tasks, frontend coding, making fewer extraneous edits, and following diff formats reliably.
In head-to-head comparisons conducted by OpenAI, paid human graders preferred GPT-4.1’s websites over GPT-4o’s 80% of the time. The company also reports that in their internal evaluations, extraneous edits on code dropped from 9% with GPT-4o to 2% with GPT-4.1.
For API developers editing large files, OpenAI states that GPT-4.1 is much more reliable at code diffs across various formats, more than doubling GPT-4o’s score on Aider’s polyglot diff benchmark, and even beating GPT-4.5 by 8 percentage points.
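To illustrate what “reliable at code diffs” means in practice, the sketch below asks the model to return its edits as a unified diff rather than rewriting the whole file, so only the changed hunks are generated. The system prompt wording and the file name are assumptions for illustration, not OpenAI’s recommended template; their prompting guide covers the formats they actually evaluated.

```python
# Illustrative request for diff-formatted edits, assuming the openai Python SDK.
# The system prompt wording and the file "app.py" are hypothetical.
from openai import OpenAI

client = OpenAI()

with open("app.py") as f:
    source = f.read()

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {
            "role": "system",
            "content": (
                "Return code changes only as a unified diff (---/+++/@@ hunks). "
                "Do not restate unchanged code."
            ),
        },
        {
            "role": "user",
            "content": f"Rename the function `load` to `load_config` in this file:\n\n{source}",
        },
    ],
)
print(response.choices[0].message.content)  # expected: a diff you could apply with `patch`
```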
Long Context Processing
All three new models can process up to 1 million tokens of context—equivalent to more than 8 copies of the entire React codebase, according to OpenAI. The company states they trained GPT-4.1 to reliably attend to information across the full 1 million context length and to be more reliable than GPT-4o at noticing relevant text while ignoring distractors.
OpenAI’s internal testing shows that GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano are all able to retrieve specific information (a “needle”) at all positions in contexts up to 1 million tokens.
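A needle-in-a-haystack probe of this kind is straightforward to reproduce in miniature. The sketch below buries a hypothetical “needle” sentence in a long run of filler text and asks the model to retrieve it; the filler, the needle, and the context size are all made up, and a real evaluation would sweep the needle across many positions and context lengths.

```python
# Toy needle-in-a-haystack probe, assuming the openai Python SDK. The filler
# text, needle sentence, and context size are all made up for illustration.
from openai import OpenAI

client = OpenAI()

filler = "The quick brown fox jumps over the lazy dog. " * 20000  # long distractor text
needle = "The secret launch code is PINEAPPLE-42."
midpoint = len(filler) // 2
haystack = filler[:midpoint] + needle + filler[midpoint:]  # bury the needle mid-context

response = client.chat.completions.create(
    model="gpt-4.1-nano",  # all three models accept up to 1M tokens
    messages=[{"role": "user", "content": haystack + "\n\nWhat is the secret launch code?"}],
)
print(response.choices[0].message.content)  # expected to recover "PINEAPPLE-42"
```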
Vision Capabilities
The announcement notes that the GPT-4.1 family demonstrates strong image understanding capabilities, with GPT-4.1 mini in particular representing a significant leap forward, often outperforming GPT-4o on image benchmarks.
Available Now with Documentation
All three models are available now to developers through OpenAI’s API. The company has also published a prompting guide to help developers maximize the capabilities of these new models.
For more detailed information about model performance across various benchmarks, including academic knowledge, coding, instruction following, long context processing, vision, and function calling evaluations, developers can visit OpenAI’s announcement page.