A new analysis from the Future of Privacy Forum questions assumptions about how Large Language Models handle personal data. Yeong Zee Kin, CEO of the Singapore Academy of Law and FPF Senior Fellow, states that LLMs are fundamentally different from traditional information storage systems because of their tokenization and embedding processes.
The technical breakdown may be important for legal compliance: during training, personal data is segmented into subwords and converted into numerical vectors, losing the "association between data points" needed to identify individuals. While LLMs can still reproduce personal information through "memorization" when data appears frequently in training sets, Yeong argues this is different from actual storage and retrieval.
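To make the tokenization-and-embedding point concrete, here is a minimal, self-contained Python sketch. It is not a real tokenizer and not an example from the FPF analysis; the `toy_subword_tokenize` and `toy_embed` helpers and the sample record are illustrative assumptions. It shows how a sentence containing an identifier ends up as subword fragments mapped to bare numeric vectors, with no record structure linking them.

```python
# Toy illustration (not any production tokenizer): how text containing
# personal data is broken into subword tokens and mapped to dense vectors,
# so model parameters never hold the original record as a single unit.
import hashlib
import random

def toy_subword_tokenize(text: str) -> list[str]:
    """Crude subword split: short words stay whole, longer words are chopped
    into 3-character pieces. Real BPE/WordPiece tokenizers are learned, but
    the effect is similar: names and IDs rarely survive as single tokens."""
    tokens = []
    for word in text.split():
        if len(word) <= 4:
            tokens.append(word)
        else:
            tokens.extend(word[i:i + 3] for i in range(0, len(word), 3))
    return tokens

def toy_embed(token: str, dim: int = 8) -> list[float]:
    """Map a token to a small deterministic vector (a stand-in for a learned
    embedding). The vector carries no field labels or record structure."""
    seed = int(hashlib.sha256(token.encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    return [round(rng.uniform(-1, 1), 3) for _ in range(dim)]

record = "Jane Tan, NRIC S1234567D, lives at 21 Example Road"
tokens = toy_subword_tokenize(record)
vectors = {tok: toy_embed(tok) for tok in tokens}

print(tokens)           # e.g. ['Jane', 'Tan,', 'NRIC', 'S12', '345', '67D', ...]
print(vectors['Jane'])  # a bare list of floats, with no link back to the NRIC or address
```

Production models use learned tokenizers and trained embedding matrices rather than anything this crude, but the structural point is the same: parameters encode statistical associations between token vectors, not retrievable records.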
The piece offers practical guidance for AI developers and deployers, recommending techniques such as pseudonymization during training, machine unlearning for trained models, and output filtering for deployed systems. For grounding responses in personal data, the author suggests Retrieval Augmented Generation (RAG) over trusted sources rather than relying on model training.
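As one illustration of the pseudonymization and output-filtering ideas, the sketch below applies regex-based masking to text before training (the same function could sit in front of a deployed model's responses). The `pseudonymize` function, the patterns, and the placeholder labels are assumptions for illustration, not techniques specified in the FPF piece; a real pipeline would add NER-based detection for names and other identifiers that regexes miss.

```python
# Minimal sketch of regex-based pseudonymization for training text
# (or as an output filter on a deployed model's responses).
# Patterns and placeholders are illustrative, not exhaustive PII detection.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
    "NRIC": re.compile(r"\b[STFG]\d{7}[A-Z]\b"),  # Singapore NRIC/FIN format
}

def pseudonymize(text: str) -> str:
    """Replace matched identifiers with placeholders so the text remains
    usable for training or display without direct identifiers."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

sample = "Contact Jane at jane.tan@example.com or +65 9123 4567 (NRIC S1234567D)."
print(pseudonymize(sample))
# -> "Contact Jane at [EMAIL] or [PHONE] (NRIC [NRIC])."
```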
This technical perspective could reshape how product counsel assesses data protection obligations for AI systems. Rather than assuming LLMs "store" personal data like databases do, teams need nuanced approaches that account for how these models actually process and reproduce information.
Published by Future of Privacy Forum, authored by Yeong Zee Kin.
