Introducing GPT-4.1

Presenting a breakthrough generation of GPT models that redefine performance in coding, instruction adherence, and extended context processing—featuring our inaugural nano model.

April 14, 2025 | Product

Today, we are excited to announce the launch of three models: GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano. These new models have been meticulously engineered to outperform the previous GPT-4o and GPT-4o mini versions across all fronts. They excel in complex coding tasks and in closely following detailed instructions. Moreover, they support significantly larger context windows—capable of handling up to 1 million tokens—enabling them to extract deep insights from very long documents. They also feature an updated knowledge cutoff as of June 2024.

Performance Highlights

GPT-4.1 demonstrates outstanding capabilities across several industry-standard benchmarks:

  • Coding Excellence: On SWE‑bench Verified, GPT‑4.1 achieves a 54.6% score—an absolute improvement of 21.4 percentage points over GPT‑4o and 26.6 points over GPT‑4.5. This performance cements its status as an industry leader in programming tasks.

  • Instruction Adherence: Evaluated on Scale's MultiChallenge benchmark, GPT‑4.1 scores 38.3%, a 10.5 percentage-point absolute increase over GPT‑4o. This shows its robust capability in following structured and detailed instructions.

  • Long-Context Mastery: In the Video-MME benchmark for multimodal long-context understanding (specifically the long, no-subtitles category), GPT‑4.1 sets a new industry record by scoring 72.0%—an absolute improvement of 6.7 percentage points over GPT‑4o.

While benchmark figures offer useful snapshots, our training process was equally focused on real-world applicability. In close partnership with the developer community, we refined these models to perform superbly on the tasks that truly matter.

Performance Comparison (Illustrative Data)

Feature        GPT‑4o    GPT‑4.1    Improvement
Coding (SWE)   33.2%     54.6%      +21.4 pts
Instruction    27.8%     38.3%      +10.5 pts
Long Context   65.3%     72.0%      +6.7 pts

Note: The GPT‑4o values are hypothetical and provided for illustration only.

(Figure: GPT-4.1 family intelligence by latency, showing model performance relative to latency.)

Detailed Evaluations

Coding

GPT‑4.1 marks a significant leap in our coding performance. It is proficient at generating production-level code, handling frontend development challenges, and making only essential edits. In real-world software engineering evaluations such as SWE‑bench Verified, it shows marked improvements over GPT‑4o in tasks that involve exploring code repositories, completing assignments, and writing code that runs reliably.

Real World Coding Examples

  • Windsurf: Internal benchmarks indicate that GPT‑4.1 boosts performance by 60% on Windsurf’s coding tests. Testers have noted a 30% improvement in tool utilization and 50% fewer superfluous edits compared to earlier models.

  • Qodo: A head-to-head evaluation involving 200 consistent pull request reviews revealed that GPT‑4.1 produced the superior code suggestion in 55% of cases, emphasizing its balanced proficiency in precision and comprehensive feedback.

Instruction Following

GPT‑4.1 has been finely tuned to follow instructions with exceptional reliability. It consistently adheres to custom formatting, respects negative instructions, processes ordered commands, and accurately includes all required content.

Key Instruction Following Features:

  • Custom Format Compliance: Whether it’s XML, YAML, Markdown, or other formats, GPT‑4.1 follows the provided structure meticulously.

  • Negative Constraints: When asked to avoid specific actions, the model obeys these restrictions reliably.

  • Sequenced Responses: For multi-step instructions (e.g., asking for a name and then an email), the model maintains the required order.

  • Content-specific Directives: GPT‑4.1 ensures that its outputs include all critical details as requested.

  • Measured Confidence: It responds appropriately when uncertain, indicating “I don't know” where required to avoid misleading information.
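Checks like these can be scripted when evaluating a model's replies. Below is a minimal sketch of such a validator, assuming the reply was requested as XML with fields in a fixed order and with certain terms forbidden; the function, field names, and sample reply are illustrative, not part of any API:

```python
import xml.etree.ElementTree as ET

def check_response(text, required=("name", "email"), forbidden=()):
    """Validate a model reply against simple instruction-following rules:
    well-formed XML, required fields present in the requested order,
    and no forbidden terms anywhere in the output."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False, "not well-formed XML"
    tags = [child.tag for child in root]
    positions = []
    for tag in required:
        if tag not in tags:
            return False, f"missing <{tag}>"
        positions.append(tags.index(tag))
    # Sequenced responses: fields must appear in the order they were asked for.
    if positions != sorted(positions):
        return False, "fields out of order"
    # Negative constraint: forbidden substrings must not appear at all.
    lowered = text.lower()
    for term in forbidden:
        if term.lower() in lowered:
            return False, f"contains forbidden term '{term}'"
    return True, "ok"

reply = "<contact><name>Ada</name><email>ada@example.com</email></contact>"
ok, reason = check_response(reply, forbidden=("guarantee",))
print(ok, reason)  # True ok
```

The same pattern extends to YAML or Markdown by swapping the parser; the point is that each instruction-following claim above corresponds to a mechanical check.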

Multi-turn and Context Retention

Sustaining context over extended conversations is vital. GPT‑4.1 has been optimized for multi-turn interactions—it accurately references earlier points in a conversation, reflected in a 10.5 percentage-point absolute gain on the MultiChallenge benchmark over GPT‑4o.

Long Context Understanding

A standout feature of the GPT‑4.1 family is its ability to process up to 1 million tokens—substantially more than the 128,000 token limit in previous versions. This capability means it can handle extensive documents or multiple large codebases in a single pass.

In our needle-in-a-haystack evaluations, GPT‑4.1 (along with its mini and nano variants) consistently retrieves small, hidden pieces of information regardless of where they appear in the context, ensuring its utility in tasks ranging from legal analysis to in-depth code reviews.
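A needle-in-a-haystack evaluation of this kind is straightforward to construct: plant one distinctive fact at a chosen depth in otherwise unremarkable filler text, then grade whether the model's answer recovers it. A minimal sketch, where the needle text, filler, and grading rule are all illustrative:

```python
def build_haystack(needle, n_filler=10_000, depth=0.5):
    """Return a long synthetic context with one 'needle' sentence inserted
    at a relative depth (0.0 = start of context, 1.0 = end)."""
    filler = [f"Background sentence {i} about routine, unrelated topics."
              for i in range(n_filler)]
    pos = int(depth * len(filler))
    filler.insert(pos, needle)
    return " ".join(filler), pos

def graded(answer, key="4-1-6-2"):
    # Grading rule: the answer must contain the planted fact verbatim.
    return key in answer

needle = "The secret launch code is 4-1-6-2."
context, pos = build_haystack(needle, depth=0.75)
```

Sweeping `depth` from 0.0 to 1.0 (and `n_filler` up to the full context window) is what produces the "retrieval at any position" claims above.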

Vision and Image Understanding

The GPT‑4.1 series also excels in image processing. Particularly, the GPT‑4.1 mini model has achieved notable improvements on image benchmarks, often surpassing the earlier GPT‑4o on tasks involving image interpretation and analysis.

Industry Use Cases

  • Thomson Reuters: Testing with CoCounsel—a professional AI assistant for legal workflows—showed that GPT‑4.1 improved multi-document review accuracy by 17% over GPT‑4o. This improvement is essential for managing complex legal tasks involving multiple lengthy documents.

  • Carlyle: In financial applications, GPT‑4.1 demonstrated a 50% better performance in extracting and understanding detailed data from extensive documents (including PDFs and Excel files), overcoming challenges such as hidden data retrieval and multi-hop reasoning.

Inference and Latency Enhancements

Beyond accuracy, speed is critical. Our inference stack has been re-engineered to reduce time to first token. For example, when processing 128K tokens, the p95 latency is approximately 15 seconds, extending to about 30 seconds for a full 1-million-token context. The GPT‑4.1 mini and nano variants provide even lower latency, with nano often returning the first token in under five seconds for a 128K-token query.
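Time to first token is simple to measure when responses are streamed: start a timer at the request and stop it when the first chunk arrives. A minimal sketch; `fake_stream` stands in for a real streaming API client and is purely illustrative:

```python
import time

def time_to_first_token(stream):
    """Measure latency from request start until the first streamed token
    arrives; `stream` is any iterator yielding tokens (e.g. an SSE client)."""
    start = time.perf_counter()
    first = next(stream)  # blocks until the first token is ready
    ttft = time.perf_counter() - start
    return first, ttft

def fake_stream(delay=0.05, tokens=("Hello", " world")):
    # Stand-in for a real streaming response: simulated server delay,
    # then tokens yielded one at a time.
    time.sleep(delay)
    yield from tokens

token, ttft = time_to_first_token(fake_stream())
```

Collecting this measurement across many requests and taking the 95th percentile yields p95 figures like those quoted above.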

Vision for the Future

The advancements introduced in GPT‑4.1 not only set new performance benchmarks in coding, instruction adherence, long-context management, and image processing but also open up new possibilities for real-world applications. From enhancing legal research and financial analysis to powering sophisticated web applications, these models represent a significant step forward in both efficiency and scalability.

In summary, the GPT‑4.1 model family provides superior performance at lower costs, advancing the state-of-the-art along both the latency and accuracy curves. This positions it as a critical enabler for next-generation applications across industries.
