Selected Publications
-
A Survey of Generative AI for De Novo Drug Design: New Frontiers in Molecule and Protein Generation
Xiangru Tang, Howard Dai, Elizabeth Knight, Fang Wu, Yunyang Li, Tianxiao Li, Mark Gerstein
arXiv, 2024
[PDF] [Abstract] [Bib] [GenAI4Drug]
Artificial intelligence (AI)-driven methods can vastly improve the historically costly drug design process, with various generative models already in widespread use. Generative models for de novo drug design, in particular, focus on the creation of novel biological compounds entirely from scratch, representing a promising future direction. Rapid development in the field, combined with the inherent complexity of the drug design process, creates a difficult landscape for new researchers to enter. In this survey, we organize de novo drug design into two overarching themes: small molecule and protein generation. Within each theme, we identify a variety of subtasks and applications, highlighting important datasets, benchmarks, and model architectures and comparing the performance of top models. We take a broad approach to AI-driven drug design, allowing for both micro-level comparisons of various methods within each subtask and macro-level observations across different fields. We discuss parallel challenges and approaches between the two applications and highlight future directions for AI-driven de novo drug design as a whole. An organized repository of all covered sources is available at https://github.com/gersteinlab/GenAI4Drug.
@article{tang2024gaisurvey, title={A Survey of Generative AI for De Novo Drug Design: New Frontiers in Molecule and Protein Generation}, author={Tang, Xiangru and Dai, Howard and Knight, Elizabeth and Wu, Fang and Li, Yunyang and Li, Tianxiao and Gerstein, Mark}, journal={arXiv preprint arXiv:2402.08703}, year={2024} }
-
Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science
Xiangru Tang, Qiao Jin, Kunlun Zhu, Tongxin Yuan, Yichi Zhang, Wangchunshu Zhou, Meng Qu, Yilun Zhao, Jian Tang, Zhuosheng Zhang, Arman Cohan, Zhiyong Lu, Mark Gerstein
arXiv, 2024
[PDF] [Abstract] [Bib]
Intelligent agents powered by large language models (LLMs) have demonstrated substantial promise in autonomously conducting experiments and facilitating scientific discoveries across various disciplines. While their capabilities are promising, they also introduce novel vulnerabilities that demand careful consideration for safety. However, there exists a notable gap in the literature, as there has been no comprehensive exploration of these vulnerabilities. This position paper fills this gap by conducting a thorough examination of vulnerabilities in LLM-based agents within scientific domains, shedding light on potential risks associated with their misuse and emphasizing the need for safety measures. We begin by providing a comprehensive overview of the potential risks inherent to scientific LLM agents, taking into account user intent, the specific scientific domain, and their potential impact on the external environment. Then, we delve into the origins of these vulnerabilities and provide a scoping review of the limited existing works. Based on our analysis, we propose a triadic framework involving human regulation, agent alignment, and an understanding of environmental feedback (agent regulation) to mitigate these identified risks. Furthermore, we highlight the limitations and challenges associated with safeguarding scientific agents and advocate for the development of improved models, robust benchmarks, and comprehensive regulations to address these issues effectively.
@article{tang2024prioritizing, title={Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science}, author={Tang, Xiangru and Jin, Qiao and Zhu, Kunlun and Yuan, Tongxin and Zhang, Yichi and Zhou, Wangchunshu and Qu, Meng and Zhao, Yilun and Tang, Jian and Zhang, Zhuosheng and others}, journal={arXiv preprint arXiv:2402.04247}, year={2024} }
-
Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents
Zhuosheng Zhang, Yao Yao, Aston Zhang, Xiangru Tang, Xinbei Ma, Zhiwei He, Yiming Wang, Mark Gerstein, Rui Wang, Gongshen Liu, Hai Zhao.
arXiv, 2023
[PDF] [Abstract] [Bib] [CoT-Igniting-Agent]
Large language models (LLMs) have dramatically enhanced the field of language intelligence, as demonstrably evidenced by their formidable empirical performance across a spectrum of complex reasoning tasks. Additionally, theoretical proofs have illuminated their emergent reasoning capabilities, providing a compelling showcase of their advanced cognitive abilities in linguistic contexts. Critical to their remarkable efficacy in handling complex reasoning tasks, LLMs leverage the intriguing chain-of-thought (CoT) reasoning techniques, obliging them to formulate intermediate steps en route to deriving an answer. The CoT reasoning approach has not only exhibited proficiency in amplifying reasoning performance but also in enhancing interpretability, controllability, and flexibility. In light of these merits, recent research endeavors have extended CoT reasoning methodologies to nurture the development of autonomous language agents, which adeptly adhere to language instructions and execute actions within varied environments. This survey paper orchestrates a thorough discourse, penetrating vital research dimensions, encompassing: (i) the foundational mechanics of CoT techniques, with a focus on elucidating the circumstances and justification behind its efficacy; (ii) the paradigm shift in CoT; and (iii) the burgeoning of language agents fortified by CoT approaches. Prospective research avenues envelop explorations into generalization, efficiency, customization, scaling, and safety. We hope to offer readers a comprehensive understanding of prevalent research areas such as CoT reasoning and language agents and illuminate the interconnections weaving through these areas. This paper caters to a wide audience, including beginners seeking comprehensive knowledge of CoT reasoning and language agents, as well as experienced researchers interested in foundational mechanics and engaging in cutting-edge discussions on these topics. A repository for the related papers is available at https://github.com/Zoeyyao27/CoT-Igniting-Agent.
@article{zhang2023igniting, title={Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents}, author={Zhang, Zhuosheng and Yao, Yao and Zhang, Aston and Tang, Xiangru and Ma, Xinbei and He, Zhiwei and Wang, Yiming and Gerstein, Mark and Wang, Rui and Liu, Gongshen and others}, journal={arXiv preprint arXiv:2311.11797}, year={2023} }
-
MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning
Xiangru Tang*, Anni Zou*, Zhuosheng Zhang, Yilun Zhao, Xingyao Zhang, Arman Cohan, Mark Gerstein.
arXiv, 2023
[PDF] [Abstract] [Bib] [MedAgents]
Large Language Models (LLMs), despite their remarkable progress across various general domains, encounter significant barriers in medicine and healthcare. This field faces unique challenges such as domain-specific terminologies and reasoning over specialized knowledge. To address these obstinate issues, we propose a novel Multi-disciplinary Collaboration (MC) framework for the medical domain that leverages role-playing LLM-based agents who participate in a collaborative multi-round discussion, thereby enhancing LLM proficiency and reasoning capabilities. This training-free and interpretable framework encompasses five critical steps: gathering domain experts, proposing individual analyses, summarizing these analyses into a report, iterating over discussions until a consensus is reached, and ultimately making a decision. Our work particularly focuses on the zero-shot scenario; our results on nine datasets (MedQA, MedMCQA, PubMedQA, and six subtasks from MMLU) establish that our proposed MC framework excels at mining and harnessing the medical expertise in LLMs, as well as extending their reasoning abilities. Based on these outcomes, we further conduct a human evaluation to pinpoint and categorize common errors within our method, as well as ablation studies aimed at understanding the impact of various factors on overall performance.
@article{tang2023medagents, title={MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning}, author={Tang, Xiangru and Zou, Anni and Zhang, Zhuosheng and Zhao, Yilun and Zhang, Xingyao and Cohan, Arman and Gerstein, Mark}, journal={arXiv preprint arXiv:2311.10537}, year={2023} }
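A minimal sketch of the five-step MC loop described above, assuming a hypothetical llm(prompt) completion helper (illustrative only, not the paper's released code):

    # Sketch of the Multi-disciplinary Collaboration (MC) framework (illustrative).
    def llm(prompt: str) -> str:
        raise NotImplementedError("plug in any chat/completion API here")

    def medagents_mc(question: str, n_experts: int = 3, max_rounds: int = 5) -> str:
        # Step 1: gather domain experts relevant to the question.
        experts = llm(f"List {n_experts} medical specialties relevant to: {question}").splitlines()
        # Step 2: each expert proposes an individual analysis.
        analyses = [llm(f"As a {e} specialist, analyze: {question}") for e in experts]
        # Step 3: summarize the individual analyses into a shared report.
        report = llm("Summarize these analyses into one report:\n" + "\n".join(analyses))
        # Step 4: iterate the discussion until all experts accept the report.
        for _ in range(max_rounds):
            votes = [llm(f"As a {e} specialist, reply AGREE or give revisions for:\n{report}") for e in experts]
            if all(v.strip().upper().startswith("AGREE") for v in votes):
                break
            report = llm("Revise the report using this feedback:\n" + "\n".join(votes))
        # Step 5: make the final decision from the consensus report.
        return llm(f"Using the report below, answer the question.\n{report}\nQuestion: {question}")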
-
Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity
Cunxiang Wang*, Xiaoze Liu*, Yuanhao Yue*, Xiangru Tang, Tianhang Zhang, Cheng Jiayang, Yunzhi Yao, Wenyang Gao, Xuming Hu, Zehan Qi, Yidong Wang, Linyi Yang, Jindong Wang, Xing Xie, Zheng Zhang, Yue Zhang.
arXiv, 2023
[PDF] [Abstract] [Bib] [LLM-Factuality-Survey]
This survey addresses the crucial issue of factuality in Large Language Models (LLMs). As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital. We define the Factuality Issue as the probability that LLMs produce content inconsistent with established facts. We first delve into the implications of these inaccuracies, highlighting the potential consequences and challenges posed by factual errors in LLM outputs. Subsequently, we analyze the mechanisms through which LLMs store and process facts, seeking the primary causes of factual errors. Our discussion then transitions to methodologies for evaluating LLM factuality, emphasizing key metrics, benchmarks, and studies. We further explore strategies for enhancing LLM factuality, including approaches tailored for specific domains. We focus on two primary LLM configurations, standalone LLMs and retrieval-augmented LLMs that utilize external data, and detail their unique challenges and potential enhancements. Our survey offers a structured guide for researchers aiming to fortify the factual reliability of LLMs.
@article{wang2023survey, title={Survey on factuality in large language models: Knowledge, retrieval and domain-specificity}, author={Wang, Cunxiang and Liu, Xiaoze and Yue, Yuanhao and Tang, Xiangru and Zhang, Tianhang and Jiayang, Cheng and Yao, Yunzhi and Gao, Wenyang and Hu, Xuming and Qi, Zehan and others}, journal={arXiv preprint arXiv:2310.07521}, year={2023} }
-
Meta-CoT: Generalizable Chain-of-Thought Prompting in Mixed-task Scenarios with Large Language Models
Anni Zou, Zhuosheng Zhang, Hai Zhao, Xiangru Tang.
arXiv, 2023
[PDF] [Abstract] [Bib] [Meta-CoT]
Large language models (LLMs) have unveiled remarkable reasoning capabilities by exploiting chain-of-thought (CoT) prompting, which generates intermediate reasoning chains to serve as the rationale for deriving the answer. However, current CoT methods either simply employ general prompts such as Let's think step by step, or heavily rely on handcrafted task-specific demonstrations to attain preferable performances, thereby engendering an inescapable gap between performance and generalization. To bridge this gap, we propose Meta-CoT, a generalizable CoT prompting method in mixed-task scenarios where the type of input questions is unknown. Meta-CoT firstly categorizes the scenario based on the input question and subsequently constructs diverse demonstrations from the corresponding data pool in an automatic pattern. Meta-CoT simultaneously enjoys remarkable performances on ten public benchmark reasoning tasks and superior generalization capabilities. Notably, Meta-CoT achieves the state-of-the-art result on SVAMP (93.7%) without any additional program-aided methods. Our further experiments on five out-of-distribution datasets verify the stability and generality of Meta-CoT.
@article{zou2023metacot, title={Meta-CoT: Generalizable Chain-of-Thought Prompting in Mixed-task Scenarios with Large Language Models}, author={Anni Zou and Zhuosheng Zhang and Hai Zhao and Xiangru Tang}, journal={arXiv preprint arXiv:2310.06692}, year={2023} }
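A rough sketch of the Meta-CoT pipeline (scenario categorization, automatic demonstration construction, CoT answering), with a hypothetical llm(prompt) helper and a toy demonstration pool standing in for the paper's data pools:

    import random

    def llm(prompt: str) -> str:
        raise NotImplementedError("plug in any completion API")

    # Toy stand-in for the scenario-indexed demonstration pool (hypothetical).
    DEMO_POOL = {
        "arithmetic": [("If 3 pens cost $6, what does 1 cost?", "6 / 3 = 2, so $2.")],
        "commonsense": [("Can a fish climb a tree?", "Fish lack limbs, so no.")],
    }

    def meta_cot(question: str, k: int = 1) -> str:
        # Step 1: categorize the mixed-task input into a known scenario.
        scenario = llm(f"Classify into {list(DEMO_POOL)}: {question}").strip()
        # Step 2: automatically construct demonstrations from that scenario's pool.
        pool = DEMO_POOL.get(scenario, [])
        demos = random.sample(pool, k=min(k, len(pool)))
        context = "\n".join(f"Q: {q}\nA: Let's think step by step. {r}" for q, r in demos)
        # Step 3: answer the target question with the constructed CoT prompt.
        return llm(f"{context}\nQ: {question}\nA: Let's think step by step.")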
-
ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks
Yuliang Liu*, Xiangru Tang*, Zefan Cai*, Junjie Lu, Yichi Zhang, Yanjun Shao, Zexuan Deng, Helan Hu, Kaikai An, Ruijun Huang, Shuzheng Si, Sheng Chen, Haozhe Zhao, Liang Chen, Yan Wang, Tianyu Liu, Zhiwei Jiang, Baobao Chang, Yujia Qin, Wangchunshu Zhou, Yilun Zhao, Arman Cohan, Mark Gerstein.
arXiv, 2023
[PDF] [Abstract] [Bib] [ML-Bench]
Large language models have shown promising performance in code generation benchmarks. However, a considerable divide exists between these benchmark achievements and their practical applicability, primarily attributed to real-world programming's reliance on pre-existing libraries. Instead of evaluating LLMs on coding from scratch, this work proposes a new evaluation setup where LLMs use open-source libraries to finish machine learning tasks. We therefore propose ML-Bench, an expansive benchmark developed to assess the effectiveness of LLMs in leveraging existing functions in open-source libraries, consisting of 10,044 samples spanning 130 tasks over 14 notable machine learning GitHub repositories. In this setting, given a specific machine learning task instruction and the accompanying README in a codebase, an LLM is tasked to generate code to accomplish the task. This necessitates the comprehension of long and language-code interleaved documents, as well as the understanding of complex cross-file code structures, introducing new challenges. Notably, while GPT-4 exhibits remarkable improvement over other LLMs, it manages to accomplish only 39.73% of the tasks, leaving a huge space for improvement. We address these challenges by proposing ML-Agent, designed to effectively navigate the codebase, locate documentation, retrieve code, and generate executable code. Empirical results demonstrate that ML-Agent, built upon GPT-4, results in further improvements.
@article{liu2023ml, title={ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks}, author={Liu, Yuliang and Tang, Xiangru and Cai, Zefan and Lu, Junjie and Zhang, Yichi and Shao, Yanjun and Deng, Zexuan and Hu, Helan and Yang, Zengxian and An, Kaikai and others}, journal={arXiv preprint arXiv:2311.09835}, year={2023} }
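The task setup can be pictured as follows; a simplified sketch with hypothetical llm and scoring helpers (the actual harness checks task-specific outputs, not just exit codes):

    import pathlib, subprocess, tempfile

    def llm(prompt: str) -> str:
        raise NotImplementedError("plug in any completion API")

    def solve_ml_bench_task(instruction: str, readme: str) -> bool:
        # Prompt the model with the task instruction plus the repository README.
        code = llm(f"README:\n{readme}\n\nTask: {instruction}\nWrite runnable Python:")
        script = pathlib.Path(tempfile.mkdtemp()) / "solution.py"
        script.write_text(code)
        # Simplified scoring: the generated program must run to completion.
        return subprocess.run(["python", str(script)], timeout=600).returncode == 0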
-
MolLM: A Unified Language Model for Integrating Biomedical Text with 2D and 3D Molecular Representations
Xiangru Tang, Andrew Tran, Jeffrey Tan, Mark Gerstein.
ISMB 2024 (Bioinformatics special issue)
[PDF] [Abstract] [Bib] [MolLM]
The present paradigm of deep learning models for molecular representation relies mostly on 1D or 2D formats, neglecting significant 3D structural information that offers valuable physical insight. This narrow focus inhibits the model's versatility and adaptability across a wide range of modalities. Conversely, the smaller amount of research that focuses on explicit 3D representation tends to overlook textual data within the biomedical domain. We present a unified pre-trained language model that concurrently captures biomedical text, 2D, and 3D molecular information. Our model (the three-modality molecular language model, MolLM) consists of a text Transformer encoder and a molecular Transformer encoder, which encodes both 2D and 3D molecular structures. Our approach employs contrastive learning as a supervisory signal for cross-modal information learning, and we assemble a multimodality dataset using cheminformatics-based molecular modifications and a wealth of chemical text. MolLM demonstrates robust molecular representation capabilities in numerous downstream tasks, including cross-modality molecule and text matching, property prediction, captioning, and text-prompted editing. Through ablating the 3D functionality of our model, we demonstrate that the inclusion of text, 2D, and 3D representations significantly improves performance on the downstream tasks. Our code, data, and pre-trained model weights are all available at https://github.com/gersteinlab/MolLM.
@article{tang2023mollm, title={MolLM: A Unified Language Model for Integrating Biomedical Text with 2D and 3D Molecular Representations}, author={Tang, Xiangru and Tran, Andrew and Tan, Jeffrey and Gerstein, Mark}, journal={bioRxiv}, pages={2023--11}, year={2023}, publisher={Cold Spring Harbor Laboratory} }
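For intuition, the cross-modal contrastive supervision mentioned above can be written as a symmetric InfoNCE-style objective; this is a generic sketch, not MolLM's implementation:

    import torch
    import torch.nn.functional as F

    def cross_modal_contrastive(text_emb: torch.Tensor, mol_emb: torch.Tensor, tau: float = 0.07):
        # text_emb, mol_emb: (batch, dim) outputs of the two encoders; matched
        # text/molecule pairs share a row index.
        t = F.normalize(text_emb, dim=-1)
        m = F.normalize(mol_emb, dim=-1)
        logits = t @ m.T / tau                 # pairwise cosine similarities
        labels = torch.arange(len(t))          # positives sit on the diagonal
        # Pull matched pairs together and push mismatches apart, in both directions.
        return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2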
-
BioCoder: A Benchmark for Bioinformatics Code Generation with Contextual Pragmatic Knowledge
Xiangru Tang*, Bill Qian*, Rick Gao, Jiakang Chen, Xinyun Chen, Mark Gerstein.
ISMB 2024 (Bioinformatics special issue)
[PDF] [Abstract] [Bib] [BioCoder]
Pre-trained language models like ChatGPT have significantly improved code generation. As these models scale up, there is an increasing need for the output to handle more intricate tasks. Moreover, in bioinformatics, generating functional programs poses additional notable challenges due to the amount of domain knowledge, the need for complicated data operations, and intricate functional dependencies between the operations. Here, we present BioCoder, a benchmark developed to evaluate existing pre-trained models in generating bioinformatics code. In relation to function-code generation, BioCoder covers potential package dependencies, class declarations, and global variables. It incorporates 1026 functions and 1243 methods in Python and Java from GitHub and 253 examples from the Rosalind Project. BioCoder incorporates a fuzz-testing framework for evaluation, and we have applied it to evaluate many models including InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, and ChatGPT. Our detailed analysis of these models emphasizes the importance of domain knowledge, pragmatic code generation, and contextual understanding. Our dataset, benchmark, Docker images, and scripts required for testing are all available at https://github.com/gersteinlab/biocoder.
@article{tang2023biocoder, title={BioCoder: A Benchmark for Bioinformatics Code Generation with Contextual Pragmatic Knowledge}, author={Tang, Xiangru and Qian, Bill and Gao, Rick and Chen, Jiakang and Chen, Xinyun and Gerstein, Mark}, journal={arXiv preprint arXiv:2308.16458}, year={2023} }
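The fuzz-testing idea behind the evaluation can be sketched as behavioral comparison on random inputs (simplified; the released harness runs inside Docker with per-task fixtures):

    import random

    def fuzz_equivalent(candidate, reference, gen_input, trials: int = 100) -> bool:
        # gen_input: task-specific generator mapping an RNG to a tuple of arguments.
        rng = random.Random(0)
        for _ in range(trials):
            args = gen_input(rng)
            try:
                if candidate(*args) != reference(*args):
                    return False            # observable behavioral difference
            except Exception:
                return False                # crashes count as failures
        return True                         # indistinguishable on the sampled inputs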
-
Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?
Xiangru Tang, Yiming Zong, Jason Phang, Yilun Zhao, Wangchunshu Zhou, Arman Cohan, Mark Gerstein.
NAACL 2024
[PDF] [Abstract] [Bib] [Struc-Bench]
Despite the power of Large Language Models (LLMs) like GPT-4, they still struggle with tasks that require generating complex, structured outputs. In this study, we assess the capability of current LLMs in generating complex structured data and propose a structure-aware fine-tuning approach as a solution to improve this ability. To perform a comprehensive evaluation, we propose Struc-Bench, including five representative LLMs (i.e., GPT-NeoX-20B, GPT-3.5, GPT-4, and Vicuna), and evaluate them on our carefully constructed datasets spanning raw text, HTML, and LaTeX tables. Based on our analysis of current model performance, we identify specific common formatting errors and areas of potential improvement. To address complex formatting requirements, we utilize FormatCoT (Chain-of-Thought) to generate format instructions from target outputs. Our experiments show that our structure-aware fine-tuning method, when applied to LLaMA-7B, significantly improves adherence to natural language constraints, outperforming other evaluated LLMs. Based on these results, we present an ability map of model capabilities from six dimensions (i.e., coverage, formatting, reasoning, comprehension, pragmatics, and hallucination). This map highlights the weaknesses of LLMs in handling complex structured outputs and suggests promising directions for future work. Our code and models can be found at https://github.com/gersteinlab/Struc-Bench.
@article{tang2023struc, title={Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?}, author={Tang, Xiangru and Zong, Yiming and Zhao, Yilun and Cohan, Arman and Gerstein, Mark}, journal={arXiv preprint arXiv:2309.08963}, year={2023} }
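The FormatCoT step can be pictured as a two-stage prompt, sketched here with a hypothetical llm(prompt) helper (illustrative, not the paper's code):

    def llm(prompt: str) -> str:
        raise NotImplementedError("plug in any completion API")

    def format_cot_generate(source_text: str, target_example: str) -> str:
        # Stage 1: verbalize the structural constraints of a target output
        # (e.g. "a LaTeX table with four columns and a bold header row").
        format_spec = llm("Describe, step by step, the exact format of:\n" + target_example)
        # Stage 2: generate structured output under those explicit instructions.
        return llm(f"Follow this format exactly.\nFormat: {format_spec}\nInput: {source_text}")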
-
Investigating Data Contamination in Modern Benchmarks for Large Language Models
Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, Arman Cohan.
NAACL 2024
[PDF] [Abstract] [Bib]
Recent observations have underscored a disparity between the inflated benchmark scores and the actual performance of LLMs, raising concerns about the potential contamination of evaluation benchmarks. This issue is especially critical for closed-source models and certain open-source models where training data transparency is lacking. In this paper, we study data contamination by proposing two methods tailored for both open-source and proprietary LLMs. We first introduce a retrieval-based system to explore potential overlaps between evaluation benchmarks and pretraining corpora. We further present a novel investigation protocol named Testset Slot Guessing (TS-Guessing), applicable to both open and proprietary models. This approach entails masking a wrong answer in a multiple-choice question and prompting the model to fill in the gap. Additionally, it involves obscuring an unlikely word in an evaluation example and asking the model to produce it. We find that certain commercial LLMs could surprisingly guess the missing option in various test sets. Specifically, in the TruthfulQA benchmark, we find that LLMs exhibit notable performance improvement when provided with additional metadata in the benchmark. Further, in the MMLU benchmark, ChatGPT and GPT-4 demonstrated exact match rates of 52% and 57%, respectively, in guessing the missing options in benchmark test data. We hope these results underscore the need for more robust evaluation methodologies and benchmarks in the field.
@article{deng2023investigating, title={Investigating Data Contamination in Modern Benchmarks for Large Language Models}, author={Deng, Chunyuan and Zhao, Yilun and Tang, Xiangru and Gerstein, Mark and Cohan, Arman}, journal={arXiv preprint arXiv:2311.09783}, year={2023} }
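A sketch of the TS-Guessing protocol for multiple-choice data, assuming a hypothetical llm(prompt) helper; an exact reproduction of a hidden wrong option is taken as evidence of contamination:

    def llm(prompt: str) -> str:
        raise NotImplementedError("plug in any completion API")

    def ts_guessing_exact_match(items) -> float:
        # items: dicts with "question", "options" (list of str), "answer_idx" (int).
        hits = 0
        for it in items:
            wrong = [i for i in range(len(it["options"])) if i != it["answer_idx"]]
            masked = wrong[0]                      # hide one *incorrect* option
            shown = list(it["options"])
            shown[masked] = "[MASK]"
            prompt = (f"{it['question']}\nOptions: {shown}\n"
                      "Fill in the [MASK]ed option exactly as it appears in the test set:")
            hits += llm(prompt).strip() == it["options"][masked]
        return hits / len(items)                   # exact-match rate over the test set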
-
DocMath-Eval: Evaluating Math Reasoning Capabilities of LLMs in Understanding Financial Documents
Yilun Zhao, Yitao Long, Hongjun Liu, Linyong Nan, Lyuhao Chen, Ryo Kamoi, Yixin Liu, Xiangru Tang, Rui Zhang, Arman Cohan
NAACL 2024
[PDF] [Abstract] [Bib]
Recent LLMs have demonstrated remarkable performance in solving exam-like math word problems. However, the degree to which these numerical reasoning skills are effective in real-world scenarios, particularly in expert domains, is still largely unexplored. This paper introduces DocMath-Eval, a comprehensive benchmark specifically designed to evaluate the numerical reasoning capabilities of LLMs in the context of understanding and analyzing financial documents containing both text and tables. We evaluate a wide spectrum of 27 LLMs, including those specialized in math, coding and finance, with Chain-of-Thought and Program-of-Thought prompting methods. We found that even the current best-performing system (i.e., GPT-4) still significantly lags behind human experts in solving complex numerical reasoning problems grounded in long contexts. We believe DocMath-Eval can be used as a valuable benchmark to evaluate LLMs' capabilities to solve challenging numerical reasoning problems in expert domains.
@article{zhao2023docmath, title={DocMath-Eval: Evaluating Numerical Reasoning Capabilities of LLMs in Understanding Long Documents with Tabular Data}, author={Zhao, Yilun and Long, Yitao and Liu, Hongjun and Nan, Linyong and Chen, Lyuhao and Kamoi, Ryo and Liu, Yixin and Tang, Xiangru and Zhang, Rui and Cohan, Arman}, journal={arXiv preprint arXiv:2311.09805}, year={2023} }
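The Program-of-Thought prompting used in the evaluation can be sketched as "write a program, run it, read off the answer" (hypothetical llm helper; executing model output needs sandboxing in practice):

    def llm(prompt: str) -> str:
        raise NotImplementedError("plug in any completion API")

    def program_of_thought(question: str, document: str) -> float:
        code = llm(f"{document}\nQuestion: {question}\n"
                   "Write Python that stores the final number in a variable `ans`:")
        scope: dict = {}
        exec(code, scope)       # caution: run untrusted model code in a sandbox
        return scope["ans"]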
-
OctoPack: Instruction Tuning Code Large Language Models
Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, Shayne Longpre.
ICLR 2024 (Spotlight)
[PDF] [Abstract] [Bib] [OctoPack]
Finetuning large language models (LLMs) on instructions leads to vast performance improvements on natural language tasks. We apply instruction tuning using code, leveraging the natural structure of Git commits, which pair code changes with human instructions. We compile CommitPack: 4 terabytes of Git commits across 350 programming languages. We benchmark CommitPack against other natural and synthetic code instructions (xP3x, Self-Instruct, OASST) on the 16B parameter StarCoder model, and achieve state-of-the-art performance among models not trained on OpenAI outputs, on the HumanEval Python benchmark (46.2% pass@1). We further introduce HumanEvalPack, expanding the HumanEval benchmark to a total of 3 coding tasks (Code Repair, Code Explanation, Code Synthesis) across 6 languages (Python, JavaScript, Java, Go, C++, Rust). Our models, OctoCoder and OctoGeeX, achieve the best performance across HumanEvalPack among all permissive models, demonstrating CommitPack's benefits in generalizing to a wider set of languages and natural coding tasks. Code, models and data are freely available at https://github.com/bigcode-project/octopack.
@article{muennighoff2023octopack, title={OctoPack: Instruction Tuning Code Large Language Models}, author={Niklas Muennighoff and Qian Liu and Armel Zebaze and Qinkai Zheng and Binyuan Hui and Terry Yue Zhuo and Swayam Singh and Xiangru Tang and Leandro von Werra and Shayne Longpre}, journal={arXiv preprint arXiv:2308.07124}, year={2023} }
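The 46.2% pass@1 figure is the standard HumanEval metric; for reference, the usual unbiased pass@k estimator (from the HumanEval paper, not OctoPack-specific) is:

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        # n samples per problem, c of which pass: pass@k = 1 - C(n-c, k) / C(n, k).
        if n - c < k:       # every size-k draw must contain a passing sample
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # e.g. 200 samples with 93 passing gives pass@1 = 93/200 = 0.465.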
-
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
Yujia Qin*, Shihao Liang*, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, Maosong Sun.
ICLR 2024 (Spotlight)
[PDF] [Abstract] [Bib] [ToolLLM]
Despite the advancements of open-source large language models (LLMs), e.g., LLaMA, they remain significantly limited in tool-use capabilities, i.e., using external tools (APIs) to fulfill human instructions. The reason is that current instruction tuning largely focuses on basic language tasks but ignores the tool-use domain. This is in contrast to the excellent tool-use capabilities of state-of-the-art (SOTA) closed-source LLMs, e.g., ChatGPT. To bridge this gap, we introduce ToolLLM, a general tool-use framework encompassing data construction, model training, and evaluation. We first present ToolBench, an instruction-tuning dataset for tool use, which is constructed automatically using ChatGPT. Specifically, the construction can be divided into three stages: (i) API collection: we collect 16,464 real-world RESTful APIs spanning 49 categories from RapidAPI Hub; (ii) instruction generation: we prompt ChatGPT to generate diverse instructions involving these APIs, covering both single-tool and multi-tool scenarios; (iii) solution path annotation: we use ChatGPT to search for a valid solution path (chain of API calls) for each instruction. To enhance the reasoning capabilities of LLMs, we develop a novel depth-first search-based decision tree algorithm. It enables LLMs to evaluate multiple reasoning traces and expand the search space. Moreover, to evaluate the tool-use capabilities of LLMs, we develop an automatic evaluator: ToolEval. Based on ToolBench, we fine-tune LLaMA to obtain an LLM ToolLLaMA, and equip it with a neural API retriever to recommend appropriate APIs for each instruction. Experiments show that ToolLLaMA demonstrates a remarkable ability to execute complex instructions and generalize to unseen APIs, and exhibits comparable performance to ChatGPT. Our ToolLLaMA also demonstrates strong zero-shot generalization ability in an out-of-distribution tool-use dataset: APIBench.
@article{qin2023toolllm, title={ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs}, author={Qin, Yujia and Liang, Shihao and Ye, Yining and Zhu, Kunlun and Yan, Lan and Lu, Yaxi and Lin, Yankai and Cong, Xin and Tang, Xiangru and Qian, Bill and others}, journal={arXiv preprint arXiv:2307.16789}, year={2023} }
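The depth-first, backtracking search over chains of API calls can be sketched as below, with hypothetical propose_calls/is_solved helpers standing in for the model-driven expansion and ToolEval-style checking:

    def propose_calls(instruction, path):
        # Hypothetical: the LLM ranks candidate next API calls given the path so far.
        raise NotImplementedError

    def is_solved(instruction, path) -> bool:
        # Hypothetical: checks whether the current chain fulfils the instruction.
        raise NotImplementedError

    def dfs_solution(instruction, path=(), depth: int = 6):
        if is_solved(instruction, path):
            return list(path)               # a valid solution path (chain of API calls)
        if depth == 0:
            return None
        # Explore candidate calls depth-first, backtracking when a branch fails.
        for call in propose_calls(instruction, path):
            found = dfs_solution(instruction, path + (call,), depth - 1)
            if found:
                return found
        return None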
-
QTSumm: Query-Focused Summarization over Tabular Data
Yilun Zhao, Zhenting Qi, Linyong Nan, Boyu Mi, Yixin Liu, Weijin Zou, Simeng Han, Ruizhe Chen, Xiangru Tang, Yumo Xu, Dragomir Radev, Arman Cohan.
EMNLP 2023
[PDF] [Abstract] [Bib] [QTSumm]
People primarily consult tables to conduct data analysis or answer specific questions. Text generation systems that can provide accurate table summaries tailored to users' information needs can facilitate more efficient access to relevant data insights. Motivated by this, we define a new query-focused table summarization task, where text generation models have to perform human-like reasoning and analysis over the given table to generate a tailored summary. We introduce a new benchmark named QTSumm for this task, which contains 7,111 human-annotated query-summary pairs over 2,934 tables covering diverse topics. We investigate a set of strong baselines on QTSumm, including text generation, table-to-text generation, and large language models. Experimental results and manual analysis reveal that the new task presents significant challenges in table-to-text generation for future research. Moreover, we propose a new approach named ReFactor, to retrieve and reason over query-relevant information from tabular data to generate several natural language facts. Experimental results demonstrate that ReFactor can bring improvements to baselines by concatenating the generated facts to the model input.
@misc{zhao2023qtsumm, title={QTSUMM: Query-Focused Summarization over Tabular Data}, author={Yilun Zhao and Zhenting Qi and Linyong Nan and Boyu Mi and Yixin Liu and Weijin Zou and Simeng Han and Xiangru Tang and Yumo Xu and Arman Cohan and Dragomir Radev}, year={2023}, eprint={2305.14303}, archivePrefix={arXiv}, primaryClass={cs.CL} }
-
Investigating Table-to-Text Generation Capabilities of LLMs in Real-World Information Seeking Scenarios
Yilun Zhao*, Haowei Zhang*, Shengyun Si*, Linyong Nan, Xiangru Tang, Arman Cohan.
EMNLP 2023
[PDF] [Abstract] [Bib] [LLM-T2T]
Tabular data is prevalent across various industries, necessitating significant time and effort for users to understand and manipulate for their information-seeking purposes. The advancements in large language models (LLMs) have shown enormous potential to improve user efficiency. However, the adoption of LLMs in real-world applications for table information seeking remains underexplored. In this paper, we investigate the table-to-text capabilities of different LLMs using four datasets within two real-world information seeking scenarios. These include the LogicNLG and our newly-constructed LoTNLG datasets for data insight generation, along with the FeTaQA and our newly-constructed F2WTQ datasets for query-based generation. We structure our investigation around three research questions, evaluating the performance of LLMs in table-to-text generation, automated evaluation, and feedback generation, respectively. Experimental results indicate that the current high-performing LLM, specifically GPT-4, can effectively serve as a table-to-text generator, evaluator, and feedback generator, facilitating users' information seeking purposes in real-world scenarios. However, a significant performance gap still exists between other open-sourced LLMs (e.g., Tulu and LLaMA-2) and GPT-4 models.
@misc{zhao2023investigating, title={Investigating Table-to-Text Generation Capabilities of LLMs in Real-World Information Seeking Scenarios}, author={Yilun Zhao and Haowei Zhang and Shengyun Si and Linyong Nan and Xiangru Tang and Arman Cohan}, year={2023}, eprint={2305.14987}, archivePrefix={arXiv}, primaryClass={cs.CL} }
-
RobuT: A Systematic Study of Table QA Robustness Against Human-Annotated Adversarial Perturbations
Yilun Zhao, Chen Zhao, Linyong Nan, Zhenting Qi, Wenlin Zhang, Xiangru Tang, Boyu Mi, Dragomir Radev.
ACL 2023
[PDF] [Abstract] [Bib] [RobuT]
Despite significant progress having been made in question answering on tabular data (Table QA), it's unclear whether, and to what extent, existing Table QA models are robust to task-specific perturbations, e.g., replacing key question entities or shuffling table columns. To systematically study the robustness of Table QA models, we propose a benchmark called RobuT, which builds upon existing Table QA datasets (WTQ, WikiSQL-Weak, and SQA) and includes human-annotated adversarial perturbations in terms of table header, table content, and question. Our results indicate that both state-of-the-art Table QA models and large language models (e.g., GPT-3) with few-shot learning falter in these adversarial sets. We propose to address this problem by using large language models to generate adversarial examples to enhance training, which significantly improves the robustness of Table QA models.
@inproceedings{zhao-etal-2023-robut, title = "{R}obu{T}: A Systematic Study of Table {QA} Robustness Against Human-Annotated Adversarial Perturbations", author = "Zhao, Yilun and Zhao, Chen and Nan, Linyong and Qi, Zhenting and Zhang, Wenlin and Tang, Xiangru and Mi, Boyu and Radev, Dragomir", editor = "Rogers, Anna and Boyd-Graber, Jordan and Okazaki, Naoaki", booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)", month = jul, year = "2023", address = "Toronto, Canada", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2023.acl-long.334", doi = "10.18653/v1/2023.acl-long.334", pages = "6064--6081", }
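One of the perturbation families mentioned above, column shuffling, is easy to picture (illustrative only; RobuT's perturbations are human-annotated rather than generated this way):

    import random

    def shuffle_columns(header: list, rows: list, seed: int = 0):
        # Permute table columns while keeping each row internally consistent.
        order = list(range(len(header)))
        random.Random(seed).shuffle(order)
        return [header[i] for i in order], [[row[i] for i in order] for row in rows]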
-
CONFIT: Toward Faithful Dialogue Summarization with Linguistically-Informed Contrastive Fine-tuning
Xiangru Tang, Arjun Nair, Borui Wang, Bingyao Wang, Jai Desai, Aaron Wade, Haoran Li, Asli Celikyilmaz, Yashar Mehdad, Dragomir Radev
NAACL 2022
[PDF] [Abstract] [Bib]
Factual inconsistencies in generated summaries severely limit the practical applications of abstractive dialogue summarization. Although significant progress has been achieved by using pre-trained neural language models, substantial amounts of hallucinated content are found during the human evaluation. In this work, we first devised a typology of factual errors to better understand the types of hallucinations generated by current models and conducted human evaluation on popular dialogue summarization datasets. We further propose a training strategy that improves the factual consistency and overall quality of summaries via a novel contrastive fine-tuning, called CONFIT. To tackle the top factual errors from our annotation, we introduce an additional contrastive loss with carefully designed hard negative samples and a self-supervised dialogue-specific loss to capture the key information between speakers. We show that our model significantly reduces all kinds of factual errors on both SAMSum dialogue summarization and AMI meeting summarization. On both datasets, we achieve significant improvements over state-of-the-art baselines using both automatic metrics (ROUGE and BARTScore) and human evaluation.
@inproceedings{tang-etal-2022-confit, title = "{CONFIT}: Toward Faithful Dialogue Summarization with Linguistically-Informed Contrastive Fine-tuning", author = "Tang, Xiangru and Nair, Arjun and Wang, Borui and Wang, Bingyao and Desai, Jai and Wade, Aaron and Li, Haoran and Celikyilmaz, Asli and Mehdad, Yashar and Radev, Dragomir", editor = "Carpuat, Marine and de Marneffe, Marie-Catherine and Meza Ruiz, Ivan Vladimir", booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies", month = jul, year = "2022", address = "Seattle, United States", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2022.naacl-main.415", doi = "10.18653/v1/2022.naacl-main.415", pages = "5657--5668", }
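The contrastive fine-tuning objective can be sketched as a cross-entropy over candidate-summary scores, with the faithful reference as the positive and error-injected variants as hard negatives (a simplified sketch, not the paper's implementation):

    import torch
    import torch.nn.functional as F

    def contrastive_loss(pos_logprob: torch.Tensor, neg_logprobs: torch.Tensor, tau: float = 1.0):
        # pos_logprob: scalar score (e.g. length-normalized log-likelihood) of the
        # faithful summary; neg_logprobs: (k,) scores of k hard negatives built by
        # injecting the annotated top factual-error types (e.g. wrong speaker).
        scores = torch.cat([pos_logprob.view(1), neg_logprobs]) / tau
        # Cross-entropy with the positive in slot 0: push the faithful summary
        # above every corrupted variant.
        return F.cross_entropy(scores.view(1, -1), torch.zeros(1, dtype=torch.long))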
-
Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries
Xiangru Tang, Alexander Fabbri, Haoran Li, Ziming Mao, Griffin Adams, Borui Wang, Asli Celikyilmaz, Yashar Mehdad, Dragomir Radev
NAACL 2022
[PDF] [Abstract] [Bib]
Current pre-trained models applied for summarization are prone to factual inconsistencies that misrepresent the source text. Evaluating the factual consistency of summaries is thus necessary to develop better models. However, the human evaluation setup for evaluating factual consistency has not been standardized. To determine the factors that affect the reliability of the human evaluation, we crowdsource evaluations for factual consistency across state-of-the-art models on two news summarization datasets using the rating-based Likert Scale and ranking-based Best-Worst Scaling. Our analysis reveals that the ranking-based Best-Worst Scaling offers a more reliable measure of summary quality across datasets and that the reliability of Likert ratings highly depends on the target dataset and the evaluation design. To improve crowdsourcing reliability, we extend the scale of the Likert rating and present a scoring algorithm for Best-Worst Scaling that we call value learning. Our crowdsourcing guidelines will be publicly available to facilitate future work on factual consistency in summarization.
@inproceedings{tang-etal-2022-investigating, title = "Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries", author = "Tang, Xiangru and Fabbri, Alexander and Li, Haoran and Mao, Ziming and Adams, Griffin and Wang, Borui and Celikyilmaz, Asli and Mehdad, Yashar and Radev, Dragomir", editor = "Carpuat, Marine and de Marneffe, Marie-Catherine and Meza Ruiz, Ivan Vladimir", booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies", month = jul, year = "2022", address = "Seattle, United States", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2022.naacl-main.417", doi = "10.18653/v1/2022.naacl-main.417", pages = "5680--5692", }
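For reference, the common count-based Best-Worst Scaling score is shown below (the paper's "value learning" algorithm differs; this is only the standard baseline):

    from collections import Counter

    def bws_scores(judgments):
        # judgments: iterable of (shown_items, best_item, worst_item) tuples.
        best, worst, shown = Counter(), Counter(), Counter()
        for items, b, w in judgments:
            shown.update(items)
            best[b] += 1
            worst[w] += 1
        # score(s) = (#times best - #times worst) / #times shown, in [-1, 1].
        return {s: (best[s] - worst[s]) / shown[s] for s in shown}

    # e.g. bws_scores([(("A", "B", "C"), "A", "C")]) -> {"A": 1.0, "B": 0.0, "C": -1.0}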
Recent Talks
02/2024 Talk at AI in Medicine Symposium at Yale School of Medicine.
01/2024 Talk at PSB 2024 Workshop on LLMs for Biomedicine.
07/2023 Talk at ISMB/ECCB 2023 Text Mining Section.
Professional Services
Tutorial Organizer: ISMB 2024 Tutorial on A Practical Introduction to LLMs in Biomedical Research.
Workshop Organizer: ICLR 2024 Workshop on LLM Agents, SIGDIAL/INLG 2023 Workshop on Taming LLMs.
Conference Program Committee / Reviewer: NeurIPS, ICML, ACL, EMNLP, CIKM, NAACL, INLG, IEEE BigData, COLM.
Journal Reviewer: npj Digital Medicine, Neurocomputing, BMC Bioinformatics, PLOS ONE, Health Data Science.
Workshop Reviewer: KDD 2023 Workshop on Data Mining in Bioinformatics, ACL 2023 Workshop on Building Educational Apps, ACL 2023 Workshop on Clinical NLP, ICML 2023 Workshop on Neural Conv AI, ICML 2023 Workshop on Interpretable ML in Healthcare, NAACL-HLT 2021 Workshop on Language and Vision Research.
Teaching
Teaching Fellow for CPSC 452/CPSC 552/AMTH 552/CB&B 663 Deep Learning Theory and Applications, Yale University, 2023 and 2024.
Teaching Fellow for CPSC 437/CPSC 537 Introduction to Database Systems, Yale University, 2023.
Misc.
The 12 courses I took at Yale: CPSC 523 Principles of Operating Systems, 537 Intro to Database, 539 Software Engineering, 552 Deep Learning Theory, 553 Unsupervised Learning, 569 Randomized Algorithms, 577 NLP, 583 Deep Learning on Graph, 668 Blockchain Research, 677 Adv NLP, 680 Trustworthy Deep Learning, 752 Biomedical Data Sci.