We present a comprehensive benchmark comprising 180+ simulated user-LLM interaction histories, each with up to 60 multi-turn sessions (~1M tokens), spanning 15 diverse personalized scenarios and 7 types of in-situ user query tasks.
🔍 Can LLMs leverage past interaction history with a user to deliver personalized responses in real time? 🤔
Evaluation results across different models on 7 in-situ query types.
Query Type \ Model | Gemini-1.5-Flash | GPT-4.5 | GPT-4.1 | o1 | Gemini-2.0-Flash | o4-mini | Gemini-2.0-Flash-Lite | GPT-4o | DeepSeek-R1-671B | Llama-4-Maverick | o3-mini | GPT-4o-mini | Llama-3.1-405B | Claude-3.5-Haiku | Claude-3.7-Sonnet | Average | Random Guess |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Revisit Reasons Behind Preference Updates | 0.77 | 0.76 | 0.84 | 0.75 | 0.79 | 0.75 | 0.77 | 0.77 | 0.83 | 0.76 | 0.72 | 0.70 | 0.41 | 0.64 | 0.57 | 0.72 | 0.25 |
Tracking Full Preference Evolution | 0.65 | 0.68 | 0.67 | 0.67 | 0.70 | 0.73 | 0.68 | 0.68 | 0.68 | 0.66 | 0.54 | 0.60 | 0.38 | 0.55 | 0.45 | 0.62 | 0.25 |
Recall User Shared Facts | 0.54 | 0.61 | 0.65 | 0.50 | 0.50 | 0.42 | 0.49 | 0.41 | 0.43 | 0.37 | 0.47 | 0.55 | 0.38 | 0.29 | 0.25 | 0.46 | 0.25 |
Acknowledge Latest User Preference | 0.59 | 0.55 | 0.50 | 0.54 | 0.52 | 0.55 | 0.51 | 0.46 | 0.42 | 0.43 | 0.39 | 0.34 | 0.31 | 0.27 | 0.09 | 0.43 | 0.25 |
Provide Preference-Aligned Recommendations | 0.55 | 0.44 | 0.57 | 0.42 | 0.51 | 0.41 | 0.52 | 0.37 | 0.49 | 0.42 | 0.41 | 0.41 | 0.37 | 0.32 | 0.20 | 0.43 | 0.25 |
Generalize Reasons to New Scenarios | 0.54 | 0.46 | 0.53 | 0.39 | 0.46 | 0.38 | 0.33 | 0.32 | 0.38 | 0.32 | 0.30 | 0.33 | 0.21 | 0.20 | 0.29 | 0.36 | 0.25 |
Suggest New Ideas | 0.15 | 0.27 | 0.19 | 0.25 | 0.15 | 0.17 | 0.16 | 0.24 | 0.16 | 0.20 | 0.11 | 0.10 | 0.20 | 0.06 | 0.28 | 0.18 | 0.25 |
Overall Accuracy | 0.52 | 0.52 | 0.52 | 0.50 | 0.49 | 0.48 | 0.48 | 0.45 | 0.45 | 0.43 | 0.39 | 0.39 | 0.31 | 0.30 | 0.26 | 0.43 | 0.25 |
Model performance by the number of sessions elapsed since the most recent preference was mentioned, in long contexts of up to 20 sessions (~128k tokens).
Model \ Num of Sessions | Overall | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Gemini-1.5-Flash | 0.52 | 0.74 | 0.56 | 0.53 | 0.54 | 0.50 | 0.47 | 0.49 | 0.51 | 0.53 | 0.50 | 0.49 | 0.42 | 0.54 | 0.48 | 0.44 | 0.53 | 0.65 | 0.57 | 0.48 | 0.48 |
GPT-4.5 | 0.52 | 0.74 | 0.53 | 0.57 | 0.56 | 0.54 | 0.52 | 0.50 | 0.44 | 0.52 | 0.46 | 0.46 | 0.41 | 0.52 | 0.53 | 0.36 | 0.48 | 0.68 | 0.65 | 0.48 | 0.42 |
GPT-4.1 | 0.52 | 0.87 | 0.56 | 0.60 | 0.53 | 0.52 | 0.56 | 0.49 | 0.53 | 0.43 | 0.44 | 0.47 | 0.46 | 0.43 | 0.44 | 0.44 | 0.54 | 0.64 | 0.57 | 0.49 | 0.44 |
o1 | 0.50 | 0.68 | 0.56 | 0.54 | 0.49 | 0.54 | 0.45 | 0.48 | 0.46 | 0.48 | 0.45 | 0.46 | 0.41 | 0.53 | 0.39 | 0.36 | 0.44 | 0.66 | 0.58 | 0.47 | 0.44 |
Gemini-2.0-Flash | 0.49 | 0.73 | 0.52 | 0.55 | 0.48 | 0.45 | 0.48 | 0.48 | 0.51 | 0.50 | 0.44 | 0.42 | 0.41 | 0.52 | 0.46 | 0.42 | 0.46 | 0.61 | 0.51 | 0.52 | 0.42 |
o4-mini | 0.48 | 0.82 | 0.49 | 0.46 | 0.51 | 0.46 | 0.43 | 0.43 | 0.42 | 0.39 | 0.39 | 0.39 | 0.45 | 0.45 | 0.48 | 0.44 | 0.46 | 0.65 | 0.52 | 0.50 | 0.45 |
Gemini-2.0-Flash-Lite | 0.48 | 0.76 | 0.45 | 0.52 | 0.50 | 0.42 | 0.48 | 0.44 | 0.40 | 0.46 | 0.35 | 0.38 | 0.40 | 0.44 | 0.47 | 0.44 | 0.56 | 0.63 | 0.53 | 0.50 | 0.40 |
GPT-4o | 0.45 | 0.83 | 0.51 | 0.55 | 0.44 | 0.43 | 0.47 | 0.38 | 0.42 | 0.43 | 0.40 | 0.36 | 0.38 | 0.42 | 0.32 | 0.29 | 0.38 | 0.66 | 0.54 | 0.48 | 0.36 |
DeepSeek-R1-671B | 0.45 | 0.84 | 0.56 | 0.51 | 0.49 | 0.50 | 0.47 | 0.50 | 0.45 | 0.41 | 0.28 | 0.35 | 0.28 | 0.43 | 0.30 | 0.38 | 0.46 | 0.61 | 0.50 | 0.44 | 0.37 |
Llama-4-Maverick | 0.43 | 0.76 | 0.31 | 0.45 | 0.48 | 0.38 | 0.33 | 0.37 | 0.45 | 0.36 | 0.39 | 0.30 | 0.41 | 0.37 | 0.39 | 0.39 | 0.54 | 0.62 | 0.50 | 0.50 | 0.36 |
o3-mini | 0.39 | 0.80 | 0.48 | 0.44 | 0.45 | 0.36 | 0.39 | 0.39 | 0.36 | 0.37 | 0.27 | 0.31 | 0.38 | 0.35 | 0.32 | 0.26 | 0.41 | 0.56 | 0.39 | 0.35 | 0.33 |
GPT-4o-mini | 0.39 | 0.73 | 0.45 | 0.46 | 0.36 | 0.34 | 0.37 | 0.36 | 0.35 | 0.25 | 0.30 | 0.29 | 0.32 | 0.34 | 0.33 | 0.36 | 0.42 | 0.60 | 0.44 | 0.37 | 0.32 |
Llama-3.1-405B | 0.31 | 0.40 | 0.30 | 0.32 | 0.27 | 0.25 | 0.24 | 0.32 | 0.25 | 0.34 | 0.30 | 0.30 | 0.37 | 0.29 | 0.28 | 0.34 | 0.33 | 0.42 | 0.36 | 0.27 | 0.31 |
Claude-3.5-Haiku | 0.30 | 0.60 | 0.27 | 0.38 | 0.27 | 0.28 | 0.22 | 0.24 | 0.26 | 0.25 | 0.18 | 0.22 | 0.26 | 0.36 | 0.25 | 0.24 | 0.35 | 0.52 | 0.34 | 0.33 | 0.22 |
Claude-3.7-Sonnet | 0.26 | 0.76 | 0.27 | 0.31 | 0.26 | 0.20 | 0.28 | 0.21 | 0.20 | 0.10 | 0.15 | 0.17 | 0.12 | 0.22 | 0.20 | 0.19 | 0.29 | 0.47 | 0.28 | 0.27 | 0.19 |
Average | 0.43 | 0.74 | 0.46 | 0.48 | 0.44 | 0.41 | 0.41 | 0.41 | 0.40 | 0.39 | 0.35 | 0.36 | 0.36 | 0.41 | 0.38 | 0.36 | 0.44 | 0.60 | 0.49 | 0.43 | 0.37 |
Random Guess | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 |
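For readers who want to reproduce this kind of breakdown from raw per-item results, the sketch below groups accuracy by the number of sessions elapsed since the relevant preference was last mentioned. The record schema (`model`, `sessions_elapsed`, `correct`) is a hypothetical one for illustration, not the benchmark's actual data format.

```python
import pandas as pd

# Hypothetical per-item evaluation records (field names are assumptions).
results = [
    {"model": "GPT-4.1", "sessions_elapsed": 1, "correct": 1},
    {"model": "GPT-4.1", "sessions_elapsed": 1, "correct": 0},
    {"model": "GPT-4.1", "sessions_elapsed": 2, "correct": 1},
]

df = pd.DataFrame(results)
breakdown = (
    df.groupby(["model", "sessions_elapsed"])["correct"]
      .mean()                        # accuracy per (model, distance) cell
      .unstack("sessions_elapsed")   # one column per elapsed-session count
)
print(breakdown)
```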
Large Language Models (LLMs) have emerged as personalized assistants for users across a wide range of tasks -- from offering writing support to delivering tailored recommendations or consultations. Over time, the interaction history between a user and an LLM provides extensive information about an individual's traits and preferences. However, open questions remain about how well today's LLMs can effectively leverage such history to (1) internalize the user's inherent traits and preferences, (2) track how the user's profile and preferences evolve over time, and (3) generate personalized responses accordingly in new scenarios.
In this work, we introduce the PERSONAMEM benchmark. PERSONAMEM features curated user profiles with over 180 simulated user-LLM interaction histories, each containing up to 60 sessions of multi-turn conversations across 15 real-world tasks that require personalization. Given an in-situ user query, i.e., a query issued by the user in the first person, we evaluate LLM chatbots' ability to identify the most suitable response according to the current state of the user's profile. We observe that, with direct prompting, current LLMs still struggle to recognize the dynamic evolution of users' profiles over time. As a consequence, LLMs often fail to deliver responses aligned with users' current situations and preferences: frontier models such as GPT-4.5, o1, Gemini-2.0, and Llama-4-Maverick achieve only around or below 50% overall accuracy, leaving substantial room for improvement. We hope that PERSONAMEM, along with its user-profile and conversation simulation pipeline, can facilitate future research on the development of truly user-aware chatbots.
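To make the evaluation setup concrete, here is a minimal sketch of the multiple-choice scoring loop implied above. It assumes each benchmark item pairs an interaction history and an in-situ query with four candidate responses (consistent with the 0.25 random-guess baseline in the tables); all field and function names here are hypothetical, not the paper's actual harness.

```python
import random

def evaluate_in_situ_queries(model_fn, benchmark):
    """Score a chatbot on multiple-choice in-situ queries.

    `benchmark` is assumed to be a list of dicts with the fields used
    below; `model_fn` wraps an LLM and returns the index of the
    candidate response it selects.
    """
    correct = 0
    for item in benchmark:
        pred = model_fn(
            item["interaction_history"],   # full multi-session conversation
            item["in_situ_query"],         # the user's current first-person query
            item["candidate_responses"],   # 4 options, matching the 0.25 baseline
        )
        correct += int(pred == item["correct_index"])
    return correct / len(benchmark)

def random_guesser(history, query, candidates):
    """Baseline that should converge to ~0.25 accuracy with 4 candidates."""
    return random.randrange(len(candidates))
```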
To evaluate LLMs' ability to (1) internalize users' inherent traits and preferences, (2) track how those preferences evolve over time, and (3) generate personalized responses in new scenarios, we design the following 7 types of in-situ user queries in the PERSONAMEM benchmark.
Query type | Examples of in-situ user queries and chatbot responses |
---|---|
[1] Recall user-shared facts | "User: I shared my playlist with my friends and they loved it. ...... (later) User: What are some creative ways to share music? --- Chatbot: Curating personalized playlists can be fun." |
[2] Suggest new ideas | "User: ...... (later) User: Suggest a new restaurant I haven't ordered from before." |
[3] Acknowledge latest user preferences | "User: I am a big fan of Italian food. ...... (later) User: I prefer Mediterranean cuisine now. ... (later) User: I went to an Italian restaurant yesterday. --- Chatbot: Nice to hear! Though I thought you preferred Mediterranean food now." |
[4] Track full preference evolution | "User: ...... (later) User: I am thinking about attending a cooking class. --- Chatbot: I understand your perspective on cooking classes has evolved. Initially, you expressed a dislike, but later you mentioned your interest in them." |
[5] Revisit reasons behind preference updates | "User: I am a big fan of Italian food. ...... (later) User: I recently found I'm allergic to gluten, so I've shifted to other dishes. ... (later) User: I went to an Italian restaurant yesterday --- Chatbot: How's your allergy been? I guess they must have a gluten-free pizza." |
[6] Provide preference-aligned recommendations | "User: I prefer Mediterranean seafood now. ...... (later) User: How can I make impactful lifestyle changes? --- Chatbot: How about diving into the Mediterranean diet plan?" |
[7] Generalize to new scenarios | "User: I prefer my yoga class because of its fixed weekly schedule. ...... (later) User: I recently haven't had time to cook. --- Chatbot: Since you seem to love routine, would you be interested in weekly scheduled bento-box deliveries?" |
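Under the direct-prompting setup discussed above, the entire interaction history is placed in the model's context window ahead of the in-situ query. The sketch below shows one plausible way to serialize sessions into a single prompt; the template and field names are assumptions, not the paper's exact format.

```python
def build_direct_prompt(sessions, in_situ_query):
    """Serialize a multi-session interaction history into one long prompt.

    `sessions` is assumed to be a list of sessions, each a list of turns
    like {"role": "user" | "chatbot", "text": "..."} (hypothetical schema).
    """
    lines = []
    for i, session in enumerate(sessions, start=1):
        lines.append(f"### Session {i}")
        for turn in session:
            lines.append(f"{turn['role'].capitalize()}: {turn['text']}")
    lines.append("### Current query")
    lines.append(f"User: {in_situ_query}")
    return "\n".join(lines)

# Example: a two-session history followed by an in-situ query.
history = [
    [{"role": "user", "text": "I am a big fan of Italian food."},
     {"role": "chatbot", "text": "Noted! Any favorite dishes?"}],
    [{"role": "user", "text": "I prefer Mediterranean cuisine now."},
     {"role": "chatbot", "text": "Thanks for the update."}],
]
print(build_direct_prompt(history, "What should I cook tonight?"))
```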
Benchmark statistics: 20 unique personas, 180+ interaction histories, and ~6,000 query-response pairs.

Setting | Sessions | Context Length |
---|---|---|
Short | 10 | ~32k tokens |
Medium | 20 | ~128k tokens |
Long | 60 | ~1M tokens |
@misc{jiang2025knowmerespondme,
title={Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale},
author={Bowen Jiang and Zhuoqun Hao and Young-Min Cho and Bryan Li and Yuan Yuan and Sihao Chen and Lyle Ungar and Camillo J. Taylor and Dan Roth},
year={2025},
eprint={2504.14225},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.14225},
}