AI Papers Reader

Personalized digests of latest AI research


OpenAI's o1-Preview Model: A New Paradigm for Large Language Models

Large language models (LLMs) are increasingly used to solve complex problems, especially in specialized domains like medicine. To guide LLMs toward top performance, researchers have developed techniques such as Medprompt, which uses prompting to elicit run-time strategies such as chain-of-thought reasoning and ensembling. However, a new paradigm has emerged with OpenAI’s “o1-preview” model, which is inherently designed to perform sophisticated step-by-step problem solving during inference.
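To make the ensembling idea concrete, here is a minimal sketch of the general self-consistency pattern that Medprompt-style ensembling builds on: sample several chain-of-thought completions and take a majority vote over the final answers. The `ask_model` callable and the prompt wording are hypothetical placeholders, not the paper's actual implementation.

```python
from collections import Counter

def ensemble_answer(question, ask_model, n_samples=5):
    """Majority-vote ensembling over several chain-of-thought samples.

    `ask_model` is a hypothetical callable that sends a prompt to an LLM
    (sampling with nonzero temperature) and returns the model's final
    answer string, e.g. a multiple-choice letter.
    """
    cot_prompt = (
        f"{question}\n\n"
        "Let's think step by step, then give the final answer "
        "on its own line as 'Answer: <choice>'."
    )
    # Sample several independent reasoning paths.
    answers = [ask_model(cot_prompt) for _ in range(n_samples)]
    # The most frequent final answer wins the vote.
    winner, _ = Counter(answers).most_common(1)[0]
    return winner
```

The intuition is that independent reasoning paths tend to agree on correct answers more often than they agree on any single wrong one, so the vote filters out stray mistakes.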

In a new paper titled “From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond”, researchers from Microsoft and OpenAI systematically evaluated the o1-preview model on a suite of medical benchmarks. They found that o1-preview largely outperforms the GPT-4 series with Medprompt, even without additional prompting techniques. Furthermore, they found that few-shot prompting hinders o1-preview’s performance, suggesting that in-context learning may no longer be an effective steering approach for reasoning-native models. Overall, the researchers found that o1-preview represents a more affordable option for achieving state-of-the-art performance, though GPT-4o with Medprompt retains value in specific contexts.

The research team found that o1-preview has reached near-saturation on many existing medical benchmarks. This underscores the need for new, more challenging benchmarks that push the limits of LLMs.

One of the key takeaways from the paper is that o1-preview’s impressive performance stems from reasoning abilities built in during training. The researchers suggest that o1-preview may be less reliant on the elaborate prompt-engineering techniques that were essential for earlier generations of LLMs.

This research provides valuable insights into the evolving landscape of LLMs. o1-preview’s capabilities suggest that we may be entering a new era in which models are increasingly trained with internal reasoning capabilities, reducing the need for extensive prompt engineering.

The paper also walks through concrete examples of how o1-preview demonstrates its advanced reasoning abilities on medical challenge problems.

The research also explored the role of “reasoning tokens” in o1-preview’s performance. Reasoning tokens are internal tokens that the model generates during its inference process. The authors found that increasing the number of reasoning tokens, by providing a prompt that encourages extended reasoning, led to improvements in accuracy.
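The prompt-based nudge described above can be sketched as a simple wrapper that optionally prepends an instruction encouraging longer deliberation. The wording here is illustrative, not the paper's actual prompt; the commented-out usage-inspection step assumes the OpenAI API's reported token-usage fields.

```python
def build_prompt(question, extended_reasoning=False):
    """Build a prompt, optionally nudging the model to reason at length.

    A rough sketch of the strategy of eliciting more reasoning tokens
    via the prompt itself; the instruction text is a hypothetical
    stand-in for the paper's wording.
    """
    if extended_reasoning:
        instruction = (
            "Take your time and reason through this problem very "
            "carefully, weighing each option in depth before answering."
        )
    else:
        instruction = "Answer the following question."
    return f"{instruction}\n\n{question}"

# With an o1-class model, one could then compare accuracy and the
# reasoning-token counts the API reports for each prompt variant
# (e.g. via the response's completion-token usage details).
```

Comparing the two variants on the same benchmark lets you measure whether the extra reasoning tokens actually buy accuracy, which is the trade-off the paper quantifies.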

This research opens up exciting avenues for future work on LLMs. The researchers highlight several key directions, such as developing new, harder benchmarks that keep pace with rapidly improving models.

This research makes a compelling case for the growing importance of run-time strategies in LLMs. As models like o1-preview become increasingly capable of complex reasoning, choosing the right run-time strategy will be essential for maximizing their performance and ensuring their safe and responsible use.