Shannon Zejiang Shen

I am a second-year PhD student at MIT CSAIL,
working at the intersection of NLP and HCI,
advised by Prof. David Sontag.


I am interested in how humans and AI (LLMs) can collaborate on expert tasks.

My research involves developing novel NLP models and suitable interactions/interfaces to
tackle challenging HAI problems such as model hallucination and generation verification, and to
support expert tasks such as programming, doctors' writing, and legal summarization.


Check out the latest news about my research updates, talks & lectures, and more.
2024 May

Talk at Google Research

Developing User-Friendly Language Model Systems

2024 May

RSAP panel at the American Literature Association conference

LayoutParser and Historical Document Image Processing

2024 March

Talk at Ranjay Krishna’s Group @ UW

Developing User-Friendly Language Model Systems

2024 March

Talk at MIT Sloan AI/ML Conference

Towards Verifiable Text Generation for Developing Trustworthy LLMs

2024 March

Discussion on Image Extraction, hosted by Thomas Smits at University of Amsterdam

LayoutParser and Historical Document Image Processing

2024 Jan

Instructor for an MIT IAP Class

Visual Design in Scholarly Communication

2023 July

Blog Post

Introducing Chapyter

2023 April

Talk at Nigam Shah’s Group Meeting @ Stanford

Redesigning Clinical Documentation

2022 Dec

Talk at Natural Legal Language Processing workshop @ EMNLP 2022

Multi-LexSum: Real-world Summaries of Civil Rights Lawsuits at Multiple Granularities

2022 Nov

Guest Lecture in CSE 599D @ UW, hosted by Prof. Jeff Heer

Visual Content Extraction for Scientific Documents


This talk was hosted by Chiyuan Zhang and Yangsibo Huang. We focused on the Co-LLM project and took a deep dive into its methodology and experiments. Slides available upon request.


We reviewed the LayoutParser design and functionality, as well as approaches to tackle historical image processing and extraction in 2024. Slides available upon request.


We start with an analogy between web interface development and LLM development: LLMs produce raw text (like the HTML of a web page), so what are the CSS and JavaScript in the context of LLMs? We then talk about two recent projects, Co-LLM and SymGen, drawing connections between our methods and web technologies like CSS, API calls, etc. Slides available upon request.


In this short talk, we cover our latest research on SymGen, a novel approach to generating verifiable text for developing trustworthy LLMs. Slides available upon request.


We reviewed the LayoutParser design and functionality, as well as approaches to tackle historical image processing and extraction in 2024. Slides available upon request.

A series of lectures over the MIT IAP period, co-taught with Lucas Torroba Hennigen, focused on visual design in scholarly communication. Visual design is a crucial element in many forms of scientific communication, from papers and slides to videos. While researchers increasingly need to produce high-quality visuals, doing so remains time-consuming and sometimes very challenging. Despite the significant role visuals play, there is a noticeable lack of formal education dedicated to this aspect. This subject aims to cover several key topics in visual design for scholarly communication.


Chapyter is a JupyterLab extension that seamlessly connects GPT-4 to your coding environment. It features a code interpreter that can translate your natural language description into Python code and automatically execute it.


We took inspiration from our position paper on AI-supported expository writing and discussed how to apply such ideas to clinical documentation. This was a joint presentation with Monica Agrawal and Hunter Lang.

A presentation of our work on the Multi-LexSum dataset, containing real-world summaries of civil rights lawsuits at multiple granularities.


We reviewed the general problem of visual content extraction in scientific documents, as well as the current state-of-the-art methods and challenges. Slides available upon request.



Learning to Decode Collaboratively with Multiple Language Models

Shannon Zejiang Shen, Hunter Lang, Bailin Wang, Yoon Kim, and David Sontag

SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature

David Wadden, Kejian Shi, Jacob Morrison, Aakanksha Naik, Shruti Singh, Nitzan Barzilay, Kyle Lo, Tom Hope, Luca Soldaini, Shannon Zejiang Shen, Doug Downey, Hannaneh Hajishirzi, and Arman Cohan

Towards Verifiable Text Generation with Symbolic References

Lucas Torroba Hennigen, Shannon Zejiang Shen, Ani Nrusimha, Bernhard Gapp, David Sontag, and Yoon Kim

Machine learning to predict notes for chart review in the oncology setting: a proof of concept strategy for improving clinician note-writing

Sharon Jiang, Barbara Lam, Monica Agrawal, Shannon Zejiang Shen, Nicholas Kurtzman, Steven Horng, David Karger, and David Sontag

A Design Space for Intelligent and Interactive Writing Assistants

Mina Lee, Katy Ilonka Gero, John Joon Young Chung, Simon Buckingham Shum, Vipul Raheja, Hua Shen, Subhashini Venugopalan, Thiemo Wambsganss, David Zhou, Emad A. Alghamdi, Tal August, Avinash Bhat, Madiha Zahrah Choksi, Senjuti Dutta, Jin L.C. Guo, Md Naimul Hoque, Yewon Kim, Simon Knight, Seyed Parsa Neshaei, Antonette Shibani, Disha Shrivastava, Lila Shroff, Agnia Sergeyuk, Jessi Stark, Sarah Sterman, Sitong Wang, Antoine Bosselut, Daniel Buschek, Joseph Chee Chang, Sherol Chen, Max Kreminski, Joonsuk Park, Roy Pea, Eugenia Ha Rim Rho, Shannon Zejiang Shen, and Pao Siangliulue

Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E Peters, Abhilasha Ravichander, Kyle Richardson, Shannon Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah A Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, and Kyle Lo


PaperMage: A Unified Toolkit for Processing, Representing, and Manipulating Visually-Rich Scientific Documents

Kyle Lo, Shannon Zejiang Shen, Benjamin Newman, Joseph Chang, Russell Authur, Erin Bransom, Stefan Candra, Yoganand Chandrasekhar, Regan Huff, Bailey Kuehl, Amanpreet Singh, Chris Wilhelm, Angele Zamarron, Marti A. Hearst, Daniel Weld, Doug Downey, and Luca Soldaini

American Stories: A Large-Scale Structured Text Dataset of Historical US Newspapers

Melissa Dell, Jacob Carlson, Tom Bryan, Emily Silcock, Abhishek Arora, Shannon Zejiang Shen, Luca D’Amico-Wong, Quan Le, Pablo Querubin, and Leander Heldring

Conceptualizing Machine Learning for Dynamic Information Retrieval of Electronic Health Record Notes

Sharon Jiang, Shannon Zejiang Shen, Monica Agrawal, Barbara Lam, Nicholas Kurtzman, Steven Horng, David Karger, and David Sontag

Are Layout-Infused Language Models Robust to Layout Distribution Shifts? A Case Study with Scientific Documents

Catherine Chen, Shannon Zejiang Shen, Dan Klein, Gabriel Stanovsky, Doug Downey, and Kyle Lo

Beyond Summarization: Designing AI Support for Real-World Expository Writing Tasks

Shannon Zejiang Shen, Tal August, Pao Siangliulue, Kyle Lo, Jonathan Bragg, Jeff Hammerbacher, Doug Downey, Joseph Chee Chang, and David Sontag

The Semantic Reader Project: Augmenting Scholarly Documents through AI-Powered Interactive Reading Interfaces

With the Semantic Scholar Team
Kyle Lo, Joseph Chee Chang, Andrew Head, Jonathan Bragg, Amy X Zhang, Cassidy Trier, Chloe Anastasiades, Tal August, Russell Authur, Danielle Bragg, Erin Bransom, Isabel Cachola, Stefan Candra, Yoganand Chandrasekhar, Yen-Sung Chen, Evie Yu-Yen Cheng, Yvonne Chou, Doug Downey, Rob Evans, Raymond Fok, Fangzhou Hu, Regan Huff, Dongyeop Kang, Tae Soo Kim, Rodney Kinney, Aniket Kittur, Hyeonsu Kang, Egor Klevak, Bailey Kuehl, Michael Langan, Matt Latzke, Jaron Lochner, Kelsey MacMillan, Eric Marsh, Tyler Murray, Aakanksha Naik, Ngoc-Uyen Nguyen, Srishti Palani, Soya Park, Caroline Paulic, Napol Rachatasumrit, Smita Rao, Paul Sayre, Shannon Zejiang Shen, Pao Siangliulue, Luca Soldaini, Huy Tran, Madeleine van Zuylen, Lucy Lu Wang, Christopher Wilhelm, Caroline Wu, Jiangjiang Yang, Angele Zamarron, Marti A Hearst, and Daniel S Weld

The Semantic Scholar Open Data Platform

With the Semantic Scholar Team
Rodney Kinney, Chloe Anastasiades, Russell Authur, Iz Beltagy, Jonathan Bragg, Alexandra Buraczynski, Isabel Cachola, Stefan Candra, Yoganand Chandrasekhar, Arman Cohan, Miles Crawford, Doug Downey, Jason Dunkelberger, Oren Etzioni, Rob Evans, Sergey Feldman, Joseph Gorney, David Graham, Fangzhou Hu, Regan Huff, Daniel King, Sebastian Kohlmeier, Bailey Kuehl, Michael Langan, Daniel Lin, Haokun Liu, Kyle Lo, Jaron Lochner, Kelsey MacMillan, Tyler Murray, Chris Newell, Smita Rao, Shaurya Rohatgi, Paul Sayre, Shannon Zejiang Shen, Amanpreet Singh, Luca Soldaini, Shivashankar Subramanian, Amber Tanaka, Alex D Wade, Linda Wagner, Lucy Lu Wang, Chris Wilhelm, Caroline Wu, Jiangjiang Yang, Angele Zamarron, Madeleine Van Zuylen, and Daniel S Weld


Multi-LexSum: Real-World Summaries of Civil Rights Lawsuits at Multiple Granularities

Shannon Zejiang Shen, Kyle Lo, Lauren Yu, Nathan Dahlberg, Margo Schlanger, and Doug Downey

NeurIPS 2022 Datasets and Benchmarks Track

Don't Say What You Don't Know: Improving the Consistency of Abstractive Summarization by Constraining Beam Search

Daniel King, Shannon Zejiang Shen, Nishant Subramani, Daniel S. Weld, Iz Beltagy, and Doug Downey


Generating Object Stamps

Youssef Alami Mejjati, Shannon Zejiang Shen, Michael Snower,
Aaron Gokaslan, Oliver Wang, James Tompkin, and Kwang In Kim


Deep Learning based Framework for Automatic Damage Detection in Aircraft Engine Borescope Inspection

Shannon Zejiang Shen, Xili Wan, Feng Ye, Xinjie Guan, and Shuwen Liu

2019 International Conference on Computing, Networking and Communications (ICNC)



Besides research, I've worked on various open-source projects; here are a few of them:

Productivity & Utils


Chapyter: A JupyterLab extension that seamlessly connects GPT-4 to your coding environment. It features a code interpreter that can translate your natural language description into Python code and automatically execute it.


A Python package that seamlessly connects Notion databases and pandas DataFrames. It allows easy uploading and downloading of Notion databases to and from pandas DataFrames.


An Obsidian plugin that streamlines bibliography management.

Websites & Design

A platform for current and past grad students to share the statements of purpose from their applications to help future applicants. It is a full-fledged website based on Notion, and we developed an automated submission system that connects the Notion database with a Google Form (code available here).

The layout-parser project website is built with Jekyll and Bulma. Most interestingly, the layout-parser platform subpage is rendered by fetching, live, the model metadata stored in GitHub issues.

Avalanche: a personal website theme for academics

Also based on Jekyll and Bulma, the Avalanche theme can be used out of the box to create an academic site that displays a personal research description, publications, and recent news.


Whenever you have any questions regarding my research (or just want to say hi),
the best email address to find me is zejiangshen AT

You can also find me on Twitter, LinkedIn, and GitHub.