Sharing Conversations with Large Language Models: A New Resource for Researchers
As large language models (LLMs) like GPT-4 and LLAMA become increasingly sophisticated, they are used by a wider range of people, from experts to the general public, for various tasks. These interactions generate valuable data for training and improving LLMs. While for-profit companies collect user data through their model APIs, the open-source and research community lags behind in accessing and using this data.
To bridge this gap, researchers from The Hebrew University of Jerusalem and MIT have developed the ShareLM collection, a unified set of human conversations with LLMs, and a Chrome plugin that allows users to contribute their own conversations.
The Need for Open Data
The ShareLM authors observe that existing open datasets of human-model conversations are treated as static artifacts, lacking the dynamic nature of real-world interactions. These datasets also struggle with diversity and representativeness, as they often rely on specific demographics and platforms.
ShareLM Plugin: Empowering Users
The ShareLM plugin addresses these limitations by giving users control over their data and offering a simple way to contribute to the open-source community. Here's how it works:
- Easy Usage: The plugin seamlessly integrates with popular chat platforms like Gradio and ChatUI, requiring minimal effort from users.
- User Ownership: Users retain ownership of their data and can choose to delete conversations they prefer to keep private before sharing them.
- Delayed Upload: The plugin offers a 24-hour delay for users to review and delete conversations before they are uploaded to the ShareLM collection.
- Rating and Feedback: Users can rate their conversations and provide feedback at both the conversation and response levels, enriching the data with valuable insights.
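The review-then-upload behavior described in the list above can be illustrated with a small sketch. This is not the plugin's actual code: the `Conversation` and `PendingQueue` names, their fields, and the `release_due` method are hypothetical, chosen only to show how a 24-hour holding period with user deletion might be modeled.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List, Optional

# The 24-hour review delay mentioned in the text.
REVIEW_WINDOW = timedelta(hours=24)


@dataclass
class Conversation:
    """Hypothetical record of one conversation held locally before upload."""
    conversation_id: str
    messages: List[str]
    recorded_at: datetime
    rating: Optional[int] = None  # optional conversation-level rating


@dataclass
class PendingQueue:
    """Hypothetical local queue: conversations wait here until the delay passes."""
    pending: Dict[str, Conversation] = field(default_factory=dict)

    def record(self, conv: Conversation) -> None:
        # Store the conversation locally; nothing is shared yet.
        self.pending[conv.conversation_id] = conv

    def delete(self, conversation_id: str) -> bool:
        # The user removes a conversation they prefer to keep private.
        return self.pending.pop(conversation_id, None) is not None

    def release_due(self, now: datetime) -> List[Conversation]:
        # Return (and remove) conversations whose review window has elapsed.
        due = [c for c in self.pending.values()
               if now - c.recorded_at >= REVIEW_WINDOW]
        for conv in due:
            del self.pending[conv.conversation_id]
        return due
```

In this sketch, a conversation deleted during the review window never appears in the result of `release_due`, so it would never reach an upload step. How the real plugin stores and transmits data is not specified here.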
The ShareLM Collection: A Growing Resource
The ShareLM collection currently contains over 2.3 million conversations from more than 40 models, and the plugin continuously adds new data. This rich dataset provides researchers with a valuable resource for:
- Training and Aligning Models: By understanding how users interact with LLMs in real-world scenarios, researchers can better align models to human preferences and needs.
- Cognitive and Linguistic Research: Analyzing conversations can reveal gaps in our understanding of human-model interaction and offer insights into how LLMs are used in different contexts.
A Call for Community Effort
The ShareLM collection and plugin are a testament to the power of open data and community collaboration in advancing LLM research. By making this data publicly available, researchers hope to encourage others to contribute to the collection and build upon the existing work.
The ShareLM initiative represents a significant step towards a more open and collaborative approach to LLM development. By providing a framework for sharing human-model conversations, it gives researchers access to real-world interaction data and lets users contribute directly to the advancement of AI.