Katherine Lee

I’m currently a senior research scientist at Google DeepMind and run the GenLaw Center. I study security and privacy in generative AI models and their legal implications. Specifically, I evaluate data extraction (memorization) from generative AI models and attacks (misalignment) on generative AI models.

Broadly, I’m interested in building machine learning systems we can trust. This means figuring out when models are untrustworthy and creating or discovering knobs to change their behavior.

I’m grateful to work with researchers and practitioners in law, policy, and the social sciences on developing shared understandings of generative models and the roles we’d like them to have in our society. For example, I ran the Generative AI + Law workshop at ICML, bringing together the legal and technical communities around generative AI, and have written explainers on generative AI and a law review piece on how the generative AI supply chain changes copyright concerns.

You can find me on the internet: Twitter, Google Scholar, Goodreads, or email me at [my github handle] @ gmail.com

If you need a bio for a talk, please use this one: Katherine is a senior research scientist at Google DeepMind. Her work has provided essential empirical evidence and measurement for grounding discussions around concerns that language models infringe copyright, and about how language models can respect an individual’s right to privacy and control of their data. Additionally, she has proposed methods of reducing memorization. Her work has received recognition at ACL, USENIX, and ICLR.

Selected Publications

Full list on Google Scholar

T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [JMLR]: Colin Raffel*, Noam Shazeer*, Adam Roberts*, Katherine Lee*, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. June, 2020
Extracting Training Data from Large Language Models [arxiv] [USENIX Oral][blog] [video]: Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, Colin Raffel. Dec, 2020; runner up Caspar Bowden award at PETS 2023
Extracting Training Data from ChatGPT [arxiv][blog]: Milad Nasr*, Nicholas Carlini*, Jon Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, Katherine Lee. Nov 2023
Deduplicating Training Data Makes Language Models Better [arxiv] [ACL Oral]: Katherine Lee*, Daphne Ippolito*, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, Nicholas Carlini. July 2021
Quantifying Memorization Across Neural Language Models [arxiv][ICLR Spotlight]: Nicholas Carlini*, Daphne Ippolito*, Matthew Jagielski*, Katherine Lee*, Florian Tramèr*, Chiyuan Zhang*. Feb 2022 (*authors alphabetical)
What Does it Mean for a Language Model to Preserve Privacy? [arxiv][FAccT]: Hannah Brown, Katherine Lee, Fatemehsadat Mireshghalla, Reza Shokri, Florian Tramèr. Feb 2022

Hallucinations in Neural Machine Translation [NeurIPS IRASL]: Katherine Lee, Orhan Firat, Ashish Agarwal, Clara Fannjiang, and David Sussillo. Dec, 2018

Talkin’ ’Bout AI Generation: Copyright and the Generative-AI Supply Chain [ssrn] [arxiv][blog]: Katherine Lee, A. Feder Cooper, James Grimmelmann

Writing

AI and Law: The Next Generation (2023)

The Devil is in the Training Data (2023)

Talkin’ ’Bout AI Generation: Copyright and the Generative-AI Supply Chain (2023)

Writing a Google AI Residency Cover letter with Ben Eysenbach (2019)

Submit to Journals [pdf] (2018)

Fun writing

Sourdough Literature Review (2020)

Invited Talks

Security and Privacy: Panel on Machine Learning, Memorization, and Privacy at Theory and Practice of Differential Privacy (TPDP), Sep 2023; Prosus AI Marketplace, Oct 2023; Red Teaming Language Models at Supporting NIST’s Development of Guidelines on Red-teaming for Generative AI, Feb 2024 [slides]; Why do we care about privacy? at PPAI, Feb 2024 [slides]
Generative AI + Law: Copyright Panel at Generative AI + Law Workshop @ ICML, Jul 2023; The Copyright Law of Generative AI at Silicon Flatirons Generative AI and Copyright Conference, Oct 2023; Berkeley CS 188. Introduction to Artificial Intelligence, Dec 2023; CS + Law, Mar 2024
Memorization in Language Models [slides] [poster][last updated: Dec 2022]: University of Toronto, May 2022; Mosaic ML, Jul 2022; GovAI, Aug 2022; ML Security & Privacy Seminar, Aug 2022; LEgally Attentive Data Scientists, Sep 2022; Cornell, NLP Seminar, Sep 2022; Cornell, C-Psyd, Sep 2022
What does Privacy in Language Modeling Mean? [slides]: UNC, Apr 2022; Cornell, Apr 2022
Dataset selection: Mosaic ML, Sep 2023

Service

Organized the Generative AI and Law Workshop (GenLaw ’23) at ICML 2023.
Helped organize the WELM workshop at ICLR 2021 and moderated a panel discussion on “Bias, safety, copyright, and efficiency.”
Reviewer for NeurIPS, ICML.
Led Brain Women, 2017-2021. I have a lot of thoughts about DEI in the workplace. Feel free to ask me.

Fun!

In my free time, I enjoy making stuff. Sometimes this is pottery, sourdough, or knitting/crocheting. I also enjoy being outdoors, listening to jazz, and dancing. Sometimes all at the same time! I used to live here.

During my undergrad in Operations Research at Princeton, I had great fun learning about how real neural networks (brains!) take in the world with Jonathan Pillow and Uri Hasson.

In the more distant past, I’ve also solved for optimal seating arrangements at Google, and I spotted ships at bay with hyperspectral sensors at the Naval Research Lab in DC.