OpenAI's ol-Preview Model: A New Paradigm for Large Language Models
💬 Ask
Large language models (LLMs) are increasingly used to solve complex problems, especially in specialized domains like medicine. To guide LLMs towards top performance, researchers have developed techniques such as Medprompt, which leverages a prompt to elicit run-time strategies such as chain-of-thought reasoning and ensembling. However, a new paradigm has emerged with OpenAI’s “ol-preview” model, which is inherently designed to perform sophisticated step-by-step problem solving during inference.
In a new paper titled “From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond”, researchers from Microsoft and OpenAI systematically evaluated the ol-preview model on a suite of medical benchmarks. They found that ol-preview largely outperforms the GPT-4 series with Medprompt, even without the need for additional prompting techniques. Furthermore, they found that few-shot prompting hinders ol-preview’s performance, suggesting that in-context learning may no longer be an effective steering approach for reasoning-native models. Overall, the researchers found that ol-preview represents a more affordable option for achieving state-of-the-art performance, though GPT-40 with Medprompt retains value in specific contexts.
The research team found that ol-preview has reached near-saturation on many existing medical benchmarks. This underscores the need for new, challenging benchmarks that push the limits of LLMs.
One of the key takeaways from the paper is that ol-preview’s impressive performance might be due to its inherent reasoning abilities, which were trained as part of its model training process. The researchers suggest that ol-preview may be less reliant on elaborate prompt-engineering techniques that were previously essential for earlier generations of LLMs.
This research provides valuable insights into the evolving landscape of LLMs. Ol-preview’s capabilities suggest that we may be entering a new era in which models are increasingly trained with internal reasoning capabilities, reducing the need for extensive prompt engineering.
Here are some concrete examples of how ol-preview demonstrates its advanced reasoning abilities:
- Medical Licensing Exams: The researchers found that ol-preview achieved high scores on the USMLE Sample Exam and the USMLE Self Assessment, even when given only a simple 0-shot prompt.
- Japanese Medical Licensing Exam: Ol-preview demonstrated remarkable proficiency in answering non-English medical questions on the JMLE-2024 benchmark, a newly curated dataset derived from the national medical licensing exam held in Japan in February 2024.
- MedQA Benchmark: The researchers evaluated the efficacy of Medprompt’s core components - such as tailored prompting, few-shot prompting, and ensembling - on the MedQA benchmark. They found that ensembling consistently yielded strong boosts in performance across benchmark datasets, while few-shot prompting had a marginal effect. Tailored prompting yielded small but consistent gains.
The research also explored the role of “reasoning tokens” in ol-preview’s performance. Reasoning tokens are internal tokens that the model generates during its inference process. They found that increasing the number of reasoning tokens by providing a prompt that encourages extended reasoning led to improvements in accuracy.
This research opens up exciting avenues for future research into LLMs. The researchers highlight several key directions for future work:
- Metareasoning: The paper discusses the potential of metareasoning techniques, which involve controlling and optimizing real-time deliberation over multiple generative processes in large language models.
- In-Context Learning: The researchers explore the use of in-context learning to optimize the input for LLMs, noting that the optimal way to provide relevant examples and additional context to improve performance remains a promising area of research.
- Integrating External Resources at Runtime: The paper highlights the potential of LLMs to actively acquire information at run-time from external sources, such as the web and knowledge bases.
- Guiding LLM Inference and Sampling: The authors discuss the emergence of new tools that enable finer-grained control and scaling of language model inference, such as entropy-based sampling techniques and software tooling like Guidance.
- Reasoning: The paper delves into the importance of reasoning in LLMs and explores different approaches to enhancing LLMs’ reasoning capabilities.
- Ensembling: The researchers underscore the power of ensembling, noting its potential for enhancing model performance and exploring emerging approaches such as Ensemble Refinement and LLM-Blender.
- Model Federation and Multi-Agent Architectures: The paper highlights the promising potential of multi-agent frameworks and multi-agent orchestration in LLMs.
This research provides a compelling case for the growing importance of run-time strategies in LLMs. As LLMs become increasingly sophisticated and capable of complex reasoning, techniques like ol-preview will be essential for maximizing their performance and ensuring their safe and responsible use.
Chat about this paper
To chat about this paper, you'll need a free Gemini API key from Google AI Studio.
Your API key will be stored securely in your browser's local storage.