Shannon Zejiang Shen

I am a second-year PhD student at MIT CSAIL,
working at the intersection of NLP and HCI,
advised by Prof. David Sontag.

RESEARCH

I am interested in how humans and AI (LLMs) can collaborate on expert tasks.

My research involves developing novel NLP models and suitable interactions/interfaces to
tackle challenging HAI problems like model hallucination and generation verification, and to
support expert tasks like programming, doctors' writing, and legal summarization.

Co-LLM: Training LLMs to Decode Collaboratively

2024 March

We train a latent variable model that learns to call other "expert" LLMs to decode some "hard" tokens during generation. We show improvements on expert tasks like math reasoning and medical QA.

INVITED TALKS

Below are a few recent talks about my research and ideas.

If you recently saw one of these talks, I look forward to your feedback!
Here is a super simple form for thoughts, comments, and critiques.

2024 March

Ranjay Krishna’s Group @ UW

Developing User-Friendly Language Model Systems

We start with an analogy between web interface development and LLM development: LLMs produce raw text (much like the HTML of a web page), so what are the CSS and JavaScript in the context of LLMs? We then discuss two recent projects, Co-LLM and SymGen, drawing connections between our methods and web technologies like CSS and API calls. Slides available upon request.

2024 March

MIT Sloan AI/ML Conference

Towards Verifiable Text Generation for Developing Trustworthy LLMs

In this short talk, we cover our latest research on SymGen, a novel approach to generating verifiable text for developing trustworthy LLMs. Slides available upon request.

2024 March

Discussion on Image Extraction, hosted by Thomas Smits at University of Amsterdam

LayoutParser and Historical Document Image Processing

We reviewed the LayoutParser design and functionality, as well as approaches to tackling historical image processing and extraction in 2024. Slides available upon request.

2024 Jan

MIT IAP Class

Visual Design in Scholarly Communication

A series of lectures over the MIT IAP period, co-taught with Lucas Torroba Hennigen, focused on visual design in scholarly communication. Visual design is a crucial element in many forms of scientific communication, from papers and slides to videos. While researchers increasingly need to produce high-quality visuals, doing so remains a time-consuming and sometimes very challenging task. Despite the significant role visuals play, there is a noticeable lack of formal education dedicated to them. This subject aims to cover several key topics in visual design for scholarly communication.

2023 April

Nigam Shah’s Group Meeting @ Stanford

Redesigning Clinical Documentation

We took inspiration from our position paper on AI-supported expository writing and discussed how to apply these ideas to clinical documentation. This is a joint presentation with Monica Agrawal and Hunter Lang.

2022 Dec

Natural Legal Language Processing workshop @ EMNLP 2022

Multi-LexSum: Real-world Summaries of Civil Rights Lawsuits at Multiple Granularities

A presentation of our work on the Multi-LexSum dataset, containing real-world summaries of civil rights lawsuits at multiple granularities.

2022 Nov

Guest Lecture in CSE 599D @ UW, hosted by Prof. Jeff Heer

Visual Content Extraction for Scientific Documents

We reviewed the general problem of visual content extraction in scientific documents, as well as the current state-of-the-art methods and challenges. Slides available upon request.

PUBLICATIONS

2024^[4]

Learning to Decode Collaboratively with Multiple Language Models

Preprint, Tweet

Shannon Zejiang Shen, Hunter Lang, Bailin Wang, Yoon Kim, and David Sontag

A Design Space for Intelligent and Interactive Writing Assistants

Preprint, Tweet

Mina Lee, Katy Ilonka Gero, John Joon Young Chung, Simon Buckingham Shum, Vipul Raheja, Hua Shen, Subhashini Venugopalan, Thiemo Wambsganss, David Zhou, Emad A. Alghamdi, Tal August, Avinash Bhat, Madiha Zahrah Choksi, Senjuti Dutta, Jin L.C. Guo, Md Naimul Hoque, Yewon Kim, Simon Knight, Seyed Parsa Neshaei, Antonette Shibani, Disha Shrivastava, Lila Shroff, Agnia Sergeyuk, Jessi Stark, Sarah Sterman, Sitong Wang, Antoine Bosselut, Daniel Buschek, Joseph Chee Chang, Sherol Chen, Max Kreminski, Joonsuk Park, Roy Pea, Eugenia Ha Rim Rho, Shannon Zejiang Shen, and Pao Siangliulue

Conference on Human Factors in Computing Systems (CHI) 2024

A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models

Preprint, Tweet, Code

Stefan Hegselmann, Shannon Zejiang Shen, Florian Gierse, Monica Agrawal, David Sontag, and Xiaoyi Jiang

Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

Preprint, Code

Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E Peters, Abhilasha Ravichander, Kyle Richardson, Shannon Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah A Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, and Kyle Lo

2023^[8]

Towards Verifiable Text Generation with Symbolic References

Preprint, Website, Tweet

Lucas Torroba Hennigen^†, Shannon Zejiang Shen^†, Ani Nrusimha, Bernhard Gapp, David Sontag, and Yoon Kim

PaperMage: A Unified Toolkit for Processing, Representing, and Manipulating Visually-Rich Scientific Documents

Best Paper Demo | Paper, Website, Code

Kyle Lo, Shannon Zejiang Shen, Benjamin Newman, Joseph Chang, Russell Authur, Erin Bransom, Stefan Candra, Yoganand Chandrasekhar, Regan Huff, Bailey Kuehl, Amanpreet Singh, Chris Wilhelm, Angele Zamarron, Marti A. Hearst, Daniel Weld, Doug Downey, and Luca Soldaini

EMNLP 2023 Demo Track

American Stories: A Large-Scale Structured Text Dataset of Historical US Newspapers

Paper, Code

Melissa Dell, Jacob Carlson, Tom Bryan, Emily Silcock, Abhishek Arora, Shannon Zejiang Shen, Luca D’Amico-Wong, Quan Le, Pablo Querubin, and Leander Heldring

NeurIPS 2023 Datasets and Benchmarks Track

Conceptualizing Machine Learning for Dynamic Information Retrieval of Electronic Health Record Notes

Paper

Sharon Jiang, Shannon Zejiang Shen, Monica Agrawal, Barbara Lam, Nicholas Kurtzman, Steven Horng, David Karger, and David Sontag

Machine Learning for Healthcare 2023

Are Layout-Infused Language Models Robust to Layout Distribution Shifts? A Case Study with Scientific Documents

Paper, Code

Catherine Chen, Shannon Zejiang Shen, Dan Klein, Gabriel Stanovsky, Doug Downey, and Kyle Lo

ACL 2023 Findings

Beyond Summarization: Designing AI Support for Real-World Expository Writing Tasks

Paper

Shannon Zejiang Shen, Tal August, Pao Siangliulue, Kyle Lo, Jonathan Bragg, Jeff Hammerbacher, Doug Downey, Joseph Chee Chang, and David Sontag

In2Writing Workshop at CHI 2023

The Semantic Reader Project: Augmenting Scholarly Documents through AI-Powered Interactive Reading Interfaces

Paper

With the Semantic Scholar Team
Kyle Lo, Joseph Chee Chang, Andrew Head, Jonathan Bragg, Amy X Zhang, Cassidy Trier, Chloe Anastasiades, Tal August, Russell Authur, Danielle Bragg, Erin Bransom, Isabel Cachola, Stefan Candra, Yoganand Chandrasekhar, Yen-Sung Chen, Evie Yu-Yen Cheng, Yvonne Chou, Doug Downey, Rob Evans, Raymond Fok, Fangzhou Hu, Regan Huff, Dongyeop Kang, Tae Soo Kim, Rodney Kinney, Aniket Kittur, Hyeonsu Kang, Egor Klevak, Bailey Kuehl, Michael Langan, Matt Latzke, Jaron Lochner, Kelsey MacMillan, Eric Marsh, Tyler Murray, Aakanksha Naik, Ngoc-Uyen Nguyen, Srishti Palani, Soya Park, Caroline Paulic, Napol Rachatasumrit, Smita Rao, Paul Sayre, Shannon Zejiang Shen, Pao Siangliulue, Luca Soldaini, Huy Tran, Madeleine van Zuylen, Lucy Lu Wang, Christopher Wilhelm, Caroline Wu, Jiangjiang Yang, Angele Zamarron, Marti A Hearst, and Daniel S Weld

The Semantic Scholar Open Data Platform

Paper

With the Semantic Scholar Team
Rodney Kinney, Chloe Anastasiades, Russell Authur, Iz Beltagy, Jonathan Bragg, Alexandra Buraczynski, Isabel Cachola, Stefan Candra, Yoganand Chandrasekhar, Arman Cohan, Miles Crawford, Doug Downey, Jason Dunkelberger, Oren Etzioni, Rob Evans, Sergey Feldman, Joseph Gorney, David Graham, Fangzhou Hu, Regan Huff, Daniel King, Sebastian Kohlmeier, Bailey Kuehl, Michael Langan, Daniel Lin, Haokun Liu, Kyle Lo, Jaron Lochner, Kelsey MacMillan, Tyler Murray, Chris Newell, Smita Rao, Shaurya Rohatgi, Paul Sayre, Shannon Zejiang Shen, Amanpreet Singh, Luca Soldaini, Shivashankar Subramanian, Amber Tanaka, Alex D Wade, Linda Wagner, Lucy Lu Wang, Chris Wilhelm, Caroline Wu, Jiangjiang Yang, Angele Zamarron, Madeleine Van Zuylen, and Daniel S Weld

2022^[3]

Multi-LexSum: Real-World Summaries of Civil Rights Lawsuits at Multiple Granularities

Featured Paper | Website, Paper, Poster, Slides, Video

Shannon Zejiang Shen, Kyle Lo, Lauren Yu, Nathan Dahlberg, Margo Schlanger, and Doug Downey

NeurIPS 2022 Datasets and Benchmarks Track

Don't Say What You Don't Know: Improving the Consistency of Abstractive Summarization by Constraining Beam Search

Paper

Daniel King^†, Shannon Zejiang Shen^†, Nishant Subramani, Daniel S. Weld, Iz Beltagy, and Doug Downey

The GEM Workshop at EMNLP 2022

OLALA: Object-Level Active Learning for Efficient Document Layout Annotation

Paper, Code

Shannon Zejiang Shen, Jian Zhao, Melissa Dell, Yaoliang Yu, and Weining Li

5th Workshop on NLP and Computational Social Science at EMNLP 2022

2021^[3]

VILA: Improving Structured Content Extraction from Scientific PDFs Using Visual Layout Groups

Paper, Poster, Video, Code

Shannon Zejiang Shen, Kyle Lo, Lucy Lu Wang, Bailey Kuehl, Daniel S. Weld, and Doug Downey

Transactions of the Association for Computational Linguistics (TACL), Volume 10, 2022

LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis

Website, Paper, Video, Code

Shannon Zejiang Shen, Ruochen Zhang, Melissa Dell, Benjamin Charles Germain Lee, Jacob Carlson, and Weining Li

International Conference on Document Analysis and Recognition (ICDAR) 2021 (Oral)

PAWLS: PDF Annotation With Labels and Structure

Website, Paper, Poster, Video, Code

Mark Neumann, Shannon Zejiang Shen, and Sam Skjonsberg

ACL-IJCNLP 2021, Demo Track

2020^[2]

A Large Dataset of Historical Japanese Documents with Complex Layouts

Website, Paper, Slides, Video

Shannon Zejiang Shen, Kaixuan Zhang, and Melissa Dell

CVPR 2020 Workshop on Text and Documents in the Deep Learning Era

Generating Object Stamps

Website, Paper, Code

Youssef Alami Mejjati, Shannon Zejiang Shen, Michael Snower, Aaron Gokaslan, Oliver Wang, James Tompkin, and Kwang In Kim

CVPR 2020 AI for Content Creation Workshop

2019^[2]

Information Extraction from Text Regions with Complex Tabular Structure

Paper, Poster

Kaixuan Zhang, Shannon Zejiang Shen, Jie Zhou, and Melissa Dell

Workshop on Document Intelligence (DI 2019) at NeurIPS 2019

Deep Learning based Framework for Automatic Damage Detection in Aircraft Engine Borescope Inspection

Paper, Video

Shannon Zejiang Shen, Xili Wan, Feng Ye, Xinjie Guan, and Shuwen Liu

2019 International Conference on Computing, Networking and Communications (ICNC)

PROJECTS

Besides research, I've worked on various open-source projects. Here are a few of them:

Productivity & Utils

Chapyter

A JupyterLab extension that seamlessly connects GPT-4 to your coding environment. It features a code interpreter that can translate your natural language description into Python code and automatically execute it.

notion-df

A Python package that seamlessly connects Notion databases and pandas DataFrames. It makes it easy to upload and download Notion databases to and from pandas DataFrames.

Obsidian-Scholar

An Obsidian plugin that streamlines bibliography management.

Websites & Design

cs-sop.org

A platform for current and past grad students to share the statements of purpose from their applications to help future applicants. It is a full-fledged website based on Notion, and we developed an automated submission system that connects the Notion database with a Google Form (code available here).

layout-parser.github.io

The layout-parser project website is built with Jekyll and Bulma. Most interestingly, the layout-parser platform subpage is rendered by fetching, live, the model metadata stored in GitHub issues.

Avalanche: a personal website theme for academics

Also based on Jekyll and Bulma, the Avalanche theme can be used out of the box to create an academic site that beautifully displays a personal research description, publications, and recent news.

CONTACT

If you have any questions regarding my research (or just want to say hi),
the best email address to reach me is zejiangshen AT gmail.com.

You can also find me on Twitter, LinkedIn, and GitHub.
