We present a comprehensive benchmark comprising 180+ simulated user-LLM interaction histories, each with up to 60 multi-turn sessions (~1M tokens), spanning 15 diverse personalized scenarios and 7 types of in-situ user query tasks.
🔍 Can LLMs leverage past interaction history with a user to deliver personalized responses in real time? 🤔
Evaluation results across different models on 7 in-situ query types.
Query Type \ Model | Gemini-1.5-Flash | GPT-4.5 | GPT-4.1 | o1 | Gemini-2.0-Flash | o4-mini | Gemini-2.0-Flash-Lite | GPT-4o | DeepSeek-R1-671B | Llama-4-Maverick | o3-mini | GPT-4o-mini | Llama-3.1-405B | Claude-3.5-Haiku | Claude-3.7-Sonnet | Average | Random Guess |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Revisit Reasons Behind Preference Updates | 0.77 | 0.76 | 0.84 | 0.75 | 0.79 | 0.75 | 0.77 | 0.77 | 0.83 | 0.76 | 0.72 | 0.70 | 0.41 | 0.64 | 0.57 | 0.72 | 0.25 |
Tracking Full Preference Evolution | 0.65 | 0.68 | 0.67 | 0.67 | 0.70 | 0.73 | 0.68 | 0.68 | 0.68 | 0.66 | 0.54 | 0.60 | 0.38 | 0.55 | 0.45 | 0.62 | 0.25 |
Recall User Shared Facts | 0.54 | 0.61 | 0.65 | 0.50 | 0.50 | 0.42 | 0.49 | 0.41 | 0.43 | 0.37 | 0.47 | 0.55 | 0.38 | 0.29 | 0.25 | 0.46 | 0.25 |
Acknowledge Latest User Preference | 0.59 | 0.55 | 0.50 | 0.54 | 0.52 | 0.55 | 0.51 | 0.46 | 0.42 | 0.43 | 0.39 | 0.34 | 0.31 | 0.27 | 0.09 | 0.43 | 0.25 |
Provide Preference-Aligned Recommendations | 0.55 | 0.44 | 0.57 | 0.42 | 0.51 | 0.41 | 0.52 | 0.37 | 0.49 | 0.42 | 0.41 | 0.41 | 0.37 | 0.32 | 0.20 | 0.43 | 0.25 |
Generalize Reasons to New Scenarios | 0.54 | 0.46 | 0.53 | 0.39 | 0.46 | 0.38 | 0.33 | 0.32 | 0.38 | 0.32 | 0.30 | 0.33 | 0.21 | 0.20 | 0.29 | 0.36 | 0.25 |
Suggest New Ideas | 0.15 | 0.27 | 0.19 | 0.25 | 0.15 | 0.17 | 0.16 | 0.24 | 0.16 | 0.20 | 0.11 | 0.10 | 0.20 | 0.06 | 0.28 | 0.18 | 0.25 |
Overall Accuracy | 0.52 | 0.52 | 0.52 | 0.50 | 0.49 | 0.48 | 0.48 | 0.45 | 0.45 | 0.43 | 0.39 | 0.39 | 0.31 | 0.30 | 0.26 | 0.43 | 0.25 |
Model performance by the number of sessions elapsed since the most recent preference was mentioned, in long contexts of up to 20 sessions (~128k tokens).
Model \ Num of Sessions | Overall | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Gemini-1.5-Flash | 0.52 | 0.74 | 0.56 | 0.53 | 0.54 | 0.50 | 0.47 | 0.49 | 0.51 | 0.53 | 0.50 | 0.49 | 0.42 | 0.54 | 0.48 | 0.44 | 0.53 | 0.65 | 0.57 | 0.48 | 0.48 |
GPT-4.5 | 0.52 | 0.74 | 0.53 | 0.57 | 0.56 | 0.54 | 0.52 | 0.50 | 0.44 | 0.52 | 0.46 | 0.46 | 0.41 | 0.52 | 0.53 | 0.36 | 0.48 | 0.68 | 0.65 | 0.48 | 0.42 |
GPT-4.1 | 0.52 | 0.87 | 0.56 | 0.60 | 0.53 | 0.52 | 0.56 | 0.49 | 0.53 | 0.43 | 0.44 | 0.47 | 0.46 | 0.43 | 0.44 | 0.44 | 0.54 | 0.64 | 0.57 | 0.49 | 0.44 |
o1 | 0.50 | 0.68 | 0.56 | 0.54 | 0.49 | 0.54 | 0.45 | 0.48 | 0.46 | 0.48 | 0.45 | 0.46 | 0.41 | 0.53 | 0.39 | 0.36 | 0.44 | 0.66 | 0.58 | 0.47 | 0.44 |
Gemini-2.0-Flash | 0.49 | 0.73 | 0.52 | 0.55 | 0.48 | 0.45 | 0.48 | 0.48 | 0.51 | 0.50 | 0.44 | 0.42 | 0.41 | 0.52 | 0.46 | 0.42 | 0.46 | 0.61 | 0.51 | 0.52 | 0.42 |
o4-mini | 0.48 | 0.82 | 0.49 | 0.46 | 0.51 | 0.46 | 0.43 | 0.43 | 0.42 | 0.39 | 0.39 | 0.39 | 0.45 | 0.45 | 0.48 | 0.44 | 0.46 | 0.65 | 0.52 | 0.50 | 0.45 |
Gemini-2.0-Flash-Lite | 0.48 | 0.76 | 0.45 | 0.52 | 0.50 | 0.42 | 0.48 | 0.44 | 0.40 | 0.46 | 0.35 | 0.38 | 0.40 | 0.44 | 0.47 | 0.44 | 0.56 | 0.63 | 0.53 | 0.50 | 0.40 |
GPT-4o | 0.45 | 0.83 | 0.51 | 0.55 | 0.44 | 0.43 | 0.47 | 0.38 | 0.42 | 0.43 | 0.40 | 0.36 | 0.38 | 0.42 | 0.32 | 0.29 | 0.38 | 0.66 | 0.54 | 0.48 | 0.36 |
DeepSeek-R1-671B | 0.45 | 0.84 | 0.56 | 0.51 | 0.49 | 0.50 | 0.47 | 0.50 | 0.45 | 0.41 | 0.28 | 0.35 | 0.28 | 0.43 | 0.30 | 0.38 | 0.46 | 0.61 | 0.50 | 0.44 | 0.37 |
Llama-4-Maverick | 0.43 | 0.76 | 0.31 | 0.45 | 0.48 | 0.38 | 0.33 | 0.37 | 0.45 | 0.36 | 0.39 | 0.30 | 0.41 | 0.37 | 0.39 | 0.39 | 0.54 | 0.62 | 0.50 | 0.50 | 0.36 |
o3-mini | 0.39 | 0.80 | 0.48 | 0.44 | 0.45 | 0.36 | 0.39 | 0.39 | 0.36 | 0.37 | 0.27 | 0.31 | 0.38 | 0.35 | 0.32 | 0.26 | 0.41 | 0.56 | 0.39 | 0.35 | 0.33 |
GPT-4o-mini | 0.39 | 0.73 | 0.45 | 0.46 | 0.36 | 0.34 | 0.37 | 0.36 | 0.35 | 0.25 | 0.30 | 0.29 | 0.32 | 0.34 | 0.33 | 0.36 | 0.42 | 0.60 | 0.44 | 0.37 | 0.32 |
Llama-3.1-405B | 0.31 | 0.40 | 0.30 | 0.32 | 0.27 | 0.25 | 0.24 | 0.32 | 0.25 | 0.34 | 0.30 | 0.30 | 0.37 | 0.29 | 0.28 | 0.34 | 0.33 | 0.42 | 0.36 | 0.27 | 0.31 |
Claude-3.5-Haiku | 0.30 | 0.60 | 0.27 | 0.38 | 0.27 | 0.28 | 0.22 | 0.24 | 0.26 | 0.25 | 0.18 | 0.22 | 0.26 | 0.36 | 0.25 | 0.24 | 0.35 | 0.52 | 0.34 | 0.33 | 0.22 |
Claude-3.7-Sonnet | 0.26 | 0.76 | 0.27 | 0.31 | 0.26 | 0.20 | 0.28 | 0.21 | 0.20 | 0.10 | 0.15 | 0.17 | 0.12 | 0.22 | 0.20 | 0.19 | 0.29 | 0.47 | 0.28 | 0.27 | 0.19 |
Average | 0.43 | 0.74 | 0.46 | 0.48 | 0.44 | 0.41 | 0.41 | 0.41 | 0.40 | 0.39 | 0.35 | 0.36 | 0.36 | 0.41 | 0.38 | 0.36 | 0.44 | 0.60 | 0.49 | 0.43 | 0.37 |
Random Guess | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 |
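For readers who want to reproduce this kind of breakdown from raw per-item results, the sketch below groups accuracy by the number of sessions elapsed since the relevant preference was last mentioned. The record schema (`model`, `sessions_elapsed`, `correct`) is a hypothetical one for illustration, not the benchmark's actual data format.

```python
import pandas as pd

# Hypothetical per-item evaluation records (field names are assumptions).
results = [
    {"model": "GPT-4.1", "sessions_elapsed": 1, "correct": 1},
    {"model": "GPT-4.1", "sessions_elapsed": 1, "correct": 0},
    {"model": "GPT-4.1", "sessions_elapsed": 2, "correct": 1},
]

df = pd.DataFrame(results)
breakdown = (
    df.groupby(["model", "sessions_elapsed"])["correct"]
      .mean()                        # accuracy per (model, distance) cell
      .unstack("sessions_elapsed")   # one column per elapsed-session count
)
print(breakdown)
```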
Large Language Models (LLMs) have emerged as personalized assistants for users across a wide range of tasks -- from offering writing support to delivering tailored recommendations or consultations. Over time, the interaction history between a user and an LLM provides extensive information about an individual's traits and preferences. However, open questions remain about how well today's LLMs can effectively leverage such history to (1) internalize the user's inherent traits and preferences, (2) track how the user's profile and preferences evolve over time, and (3) generate personalized responses accordingly in new scenarios.
In this work, we introduce the PERSONAMEM benchmark. PERSONAMEM features curated user profiles with over 180 simulated user-LLM interaction histories, each containing up to 60 sessions of multi-turn conversations across 15 real-world tasks that require personalization. Given an in-situ user query, i.e., a query issued by the user in the first person, we evaluate LLM chatbots' ability to identify the most suitable response according to the current state of the user's profile. We observe that, with direct prompting, current LLMs still struggle to recognize the dynamic evolution of users' profiles over time. As a consequence, LLMs often fail to deliver responses aligned with users' current situations and preferences: frontier models such as GPT-4.5, o1, Gemini-2.0, and Llama-4-Maverick achieve only around or below 50% overall accuracy, leaving substantial room for improvement. We hope that PERSONAMEM, along with its user-profile and conversation simulation pipeline, can facilitate future research on the development of truly user-aware chatbots.
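To make the evaluation setup concrete, here is a minimal sketch of the multiple-choice scoring loop implied above. It assumes each benchmark item pairs an interaction history and an in-situ query with four candidate responses (consistent with the 0.25 random-guess baseline in the tables); all field and function names here are hypothetical, not the paper's actual harness.

```python
import random

def evaluate_in_situ_queries(model_fn, benchmark):
    """Score a chatbot on multiple-choice in-situ queries.

    `benchmark` is assumed to be a list of dicts with the fields used
    below; `model_fn` wraps an LLM and returns the index of the
    candidate response it selects.
    """
    correct = 0
    for item in benchmark:
        pred = model_fn(
            item["interaction_history"],   # full multi-session conversation
            item["in_situ_query"],         # the user's current first-person query
            item["candidate_responses"],   # 4 options, matching the 0.25 baseline
        )
        correct += int(pred == item["correct_index"])
    return correct / len(benchmark)

def random_guesser(history, query, candidates):
    """Baseline that should converge to ~0.25 accuracy with 4 candidates."""
    return random.randrange(len(candidates))
```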
To evaluate LLMs' ability to (1) internalize users' inherent traits and preferences, (2) track how those preferences evolve over time, and (3) generate personalized responses in new scenarios, we design the following 7 types of in-situ user queries in the PERSONAMEM benchmark.
Query type | Examples of in-situ user queries and chatbot responses |
---|---|
[1] Recall user-shared facts | "User: I shared my playlist with my friends and they loved it. ...... (later) User: What are some creative ways to share music? --- Chatbot: Curating personalized playlists can be fun." |
[2] Suggest new ideas | "User: ...... (later) User: Suggest a new restaurant I haven't ordered from before." |
[3] Acknowledge latest user preferences | "User: I am a big fan of Italian food. ...... (later) User: I prefer Mediterranean cuisine now. ... (later) User: I went to an Italian restaurant yesterday. --- Chatbot: Nice to hear! Though I thought you preferred Mediterranean food now." |
[4] Track full preference evolution | "User: ...... (later) User: I am thinking about attending a cooking class. --- Chatbot: I understand your perspective on cooking classes has evolved. Initially, you expressed a dislike, but later you mentioned your interest in them." |
[5] Revisit reasons behind preference updates | "User: I am a big fan of Italian food. ...... (later) User: I recently found I'm allergic to gluten, so I've shifted to other dishes. ... (later) User: I went to an Italian restaurant yesterday --- Chatbot: How's your allergy been? I guess they must have a gluten-free pizza." |
[6] Provide preference-aligned recommendations | "User: I prefer Mediterranean seafood now. ...... (later) User: How can I make impactful lifestyle changes? --- Chatbot: How about diving into the Mediterranean diet plan?" |
[7] Generalize to new scenarios | "User: I prefer my yoga class because of its fixed weekly schedule. ...... (later) User: I recently haven't had time to cook. --- Chatbot: Since you seem to love routine, would you be interested in weekly scheduled bento-box deliveries?" |
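Under the direct-prompting setup discussed above, the entire interaction history is placed in the model's context window ahead of the in-situ query. The sketch below shows one plausible way to serialize sessions into a single prompt; the template and field names are assumptions, not the paper's exact format.

```python
def build_direct_prompt(sessions, in_situ_query):
    """Serialize a multi-session interaction history into one long prompt.

    `sessions` is assumed to be a list of sessions, each a list of turns
    like {"role": "user" | "chatbot", "text": "..."} (hypothetical schema).
    """
    lines = []
    for i, session in enumerate(sessions, start=1):
        lines.append(f"### Session {i}")
        for turn in session:
            lines.append(f"{turn['role'].capitalize()}: {turn['text']}")
    lines.append("### Current query")
    lines.append(f"User: {in_situ_query}")
    return "\n".join(lines)

# Example: a two-session history followed by an in-situ query.
history = [
    [{"role": "user", "text": "I am a big fan of Italian food."},
     {"role": "chatbot", "text": "Noted! Any favorite dishes?"}],
    [{"role": "user", "text": "I prefer Mediterranean cuisine now."},
     {"role": "chatbot", "text": "Thanks for the update."}],
]
print(build_direct_prompt(history, "What should I cook tonight?"))
```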
Benchmark statistics: 20 unique personas, 180+ interaction histories, and ~6,000 query-response pairs.

Setting | Sessions | Context Length |
---|---|---|
Short | 10 | ~32k tokens |
Medium | 20 | ~128k tokens |
Long | 60 | ~1M tokens |
@misc{jiang2025knowmerespondme,
title={Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale},
author={Bowen Jiang and Zhuoqun Hao and Young-Min Cho and Bryan Li and Yuan Yuan and Sihao Chen and Lyle Ungar and Camillo J. Taylor and Dan Roth},
year={2025},
eprint={2504.14225},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.14225},
}