SocialGPT: Using LLMs for Social Relationship Recognition
π Full Paper
π¬ Ask
In a new paper, researchers present SocialGPT, a modular framework that leverages the capabilities of Large Language Models (LLMs) to identify social relationships from images. SocialGPT first employs a Vision Foundation Model (VFM) to transform the image into a textual social story. This story is then used to prompt an LLM, which reasons over the text to produce an answer about the social relationship.
Previous approaches to social relationship recognition typically rely on end-to-end deep learning, which trains a dedicated neural network on image data. These methods are limited in terms of generalizability and interpretability, as they lack the ability to explain their decisions.
SocialGPT offers a different approach. It uses a two-step process to first perceive and then reason. The VFM is responsible for perceiving visual information and translating it into text. The LLM is responsible for reasoning over the text and generating an answer. This modular approach allows for greater flexibility and transparency, as the two steps can be designed and evaluated independently.
Understanding SocialGPT in Action
To understand how SocialGPT works, consider an image of a family playing in a park. The VFM would analyze the image and generate a story like this: βA young boy, P1, is wearing a red shirt and is playing with a red ball. His mother, P2, is wearing a blue dress and is sitting on a bench nearby. The family is enjoying a beautiful day in the park.β This story is then fed to the LLM along with a prompt asking for the social relationship between the boy and his mother. The LLM can reason over the text and output a response like βThe social relationship between P1 and P2 is mother-child.β
SocialGPT demonstrates competitive zero-shot performance on standard benchmarks for social relationship recognition. It also provides explainable answers, as the LLM can generate language-based explanations for its decisions. This makes SocialGPT a powerful tool for social relationship recognition, as it combines the accuracy of deep learning with the transparency of natural language processing.
Automatic Prompt Optimization
The manual design of prompts for LLMs can be tedious. The researchers propose a method called Greedy Segment Prompt Optimization (GSPO) to automate the prompt design process. GSPO searches for the optimal prompt by utilizing gradient information at the segment level, effectively allowing for a greedy search through a large space of possible prompts.
Experimental results demonstrate that GSPO significantly improves the performance of SocialGPT. It also shows that the framework can be generalized to different image styles, further highlighting its potential for real-world applications.
The Future of Social Relationship Recognition
SocialGPT opens a promising new direction for future research in social relationship recognition. By leveraging the power of LLMs, it can achieve high accuracy and interpretability, paving the way for more robust and intelligent systems for understanding social relationships.
Chat about this paper
To chat about this paper, you'll need a free Gemini API key from Google AI Studio.
Your API key will be stored securely in your browser's local storage.