Selected Publications
-
Fast, Sensitive Detection of Protein Homologs Using Deep Dense Retrieval
Liang Hong*, Zhihang Hu*, Siqi Sun*, Xiangru Tang*, Jiuming Wang, Qingxiong Tan, Liangzhen Zheng, Sheng Wang, Sheng Xu, Irwin King, Mark Gerstein, Yu Li.
Nature Biotechnology (IF=33.1)
[PDF] [Abstract] [Bib]
The identification of protein homologs in large databases using conventional methods, such as protein sequence comparison, often misses remote homologs. Here, we offer an ultrafast, highly sensitive method, dense homolog retriever (DHR), for detecting homologs on the basis of a protein language model and dense retrieval techniques. Its dual-encoder architecture generates different embeddings for the same protein sequence and easily locates homologs by comparing these representations. Its alignment-free nature improves speed, and the protein language model incorporates rich evolutionary and structural information within DHR embeddings. DHR achieves a >10% increase in sensitivity compared to previous methods and a >56% increase in sensitivity at the superfamily level for samples that are challenging to identify using alignment-based approaches. It is up to 22 times faster than traditional methods such as PSI-BLAST and DIAMOND and up to 28,700 times faster than HMMER. The new remote homologs exclusively found by DHR are useful for revealing connections between well-characterized proteins and improving our knowledge of protein evolution, structure and function. (See the dense-retrieval sketch after this entry.)
@article{hong2024fast, title={Fast, sensitive detection of protein homologs using deep dense retrieval}, author={Hong, Liang and Hu, Zhihang and Sun, Siqi and Tang, Xiangru and Wang, Jiuming and Tan, Qingxiong and Zheng, Liangzhen and Wang, Sheng and Xu, Sheng and King, Irwin and others}, journal={Nature Biotechnology}, pages={1--13}, year={2024}, publisher={Nature Publishing Group US New York} }
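To make the dual-encoder retrieval idea concrete, below is a minimal, hedged sketch of alignment-free homolog search: embed each database sequence once, embed the query with a separate encoder, and rank candidates by cosine similarity. The `embed_query` and `embed_target` functions are hypothetical placeholders (random projections here) standing in for DHR's protein-language-model encoders; this is not the released implementation.

```python
# Minimal sketch of dual-encoder dense retrieval for homolog search.
# embed_query / embed_target are hypothetical placeholders, not DHR's encoders.
import numpy as np

def embed_query(seq: str) -> np.ndarray:
    # Placeholder: a real system would use a protein language model encoder.
    rng = np.random.default_rng(abs(hash(("q", seq))) % (2**32))
    return rng.standard_normal(128)

def embed_target(seq: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(("t", seq))) % (2**32))
    return rng.standard_normal(128)

def build_index(database: list[str]) -> np.ndarray:
    """Embed every database sequence once and L2-normalize the vectors."""
    vecs = np.stack([embed_target(s) for s in database])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def retrieve(query: str, index: np.ndarray, database: list[str], k: int = 5):
    """Return the k database sequences whose embeddings are most similar to the query."""
    q = embed_query(query)
    q = q / np.linalg.norm(q)
    scores = index @ q                      # cosine similarity, alignment-free
    top = np.argsort(-scores)[:k]
    return [(database[i], float(scores[i])) for i in top]

database = ["MKTAYIAKQR", "MSLLTEVETY", "MKVLATLLLL"]
index = build_index(database)
print(retrieve("MKTAYIAKQA", index, database, k=2))
```

In practice the precomputed index would typically live in an approximate-nearest-neighbor library, which is what makes this style of search fast at database scale.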
-
MIMIR: A Customizable Agent Tuning Platform for Enhanced Scientific Applications
Xiangru Tang*, Chunyuan Deng*, Hanmin Wang*, Haoran Wang*, Yilun Zhao, Wenqi Shi, Yi Fung, Wangchunshu Zhou, Jiannan Cao, Heng Ji, Arman Cohan, Mark Gerstein.
EMNLP 2024 (Demo)
[PDF] [Abstract] [Bib]
Recently, large language models (LLMs) have evolved into interactive agents, proficient in planning, tool use, and task execution across various tasks. However, without agent-tuning, open-source models like LLaMA2 currently struggle to match the efficiency of larger models such as GPT-4 in scientific applications due to a lack of agent-tuning datasets. In response, we introduce MIMIR, a streamlined platform offering a customizable pipeline that enables users to leverage both private knowledge and publicly available, legally compliant datasets at scale for agent tuning. Additionally, MIMIR supports the generation of general instruction-tuning datasets from the same input. This dual capability ensures LLM agents developed through the platform possess specific agent abilities and general competencies. MIMIR integrates these features into an end-to-end platform, facilitating everything from the uploading of scientific data to one-click agent fine-tuning. MIMIR is publicly released and actively maintained at https://github.com/gersteinlab/MIMIR, along with a demo video for a quick start, calling for broader development.
@inproceedings{tang-etal-2024-mimir, title = "{MIMIR}: A Customizable Agent Tuning Platform for Enhanced Scientific Applications", author = "Tang, Xiangru and Deng, Chunyuan and Wang, Hanmin and Wang, Haoran and Zhao, Yilun and Shi, Wenqi and Fung, Yi and Zhou, Wangchunshu and Cao, Jiannan and Ji, Heng and Cohan, Arman and Gerstein, Mark", editor = "Hernandez Farias, Delia Irazu and Hope, Tom and Li, Manling", booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations", month = nov, year = "2024", address = "Miami, Florida, USA", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2024.emnlp-demo.49", pages = "486--496", }
-
Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science
Xiangru Tang, Qiao Jin, Kunlun Zhu, Tongxin Yuan, Yichi Zhang, Wangchunshu Zhou, Meng Qu, Yilun Zhao, Jian Tang, Zhuosheng Zhang, Arman Cohan, Zhiyong Lu, Mark Gerstein.
ICLR 2024 Workshop on LLM Agents
[PDF] [Abstract] [Bib]
Intelligent agents powered by large language models (LLMs) have demonstrated substantial promise in autonomously conducting experiments and facilitating scientific discoveries across various disciplines. While their capabilities are promising, they also introduce novel vulnerabilities that demand careful consideration for safety. However, there exists a notable gap in the literature, as there has been no comprehensive exploration of these vulnerabilities. This position paper fills this gap by conducting a thorough examination of vulnerabilities in LLM-based agents within scientific domains, shedding light on potential risks associated with their misuse and emphasizing the need for safety measures. We begin by providing a comprehensive overview of the potential risks inherent to scientific LLM agents, taking into account user intent, the specific scientific domain, and their potential impact on the external environment. Then, we delve into the origins of these vulnerabilities and provide a scoping review of the limited existing works. Based on our analysis, we propose a triadic framework involving human regulation, agent alignment, and an understanding of environmental feedback (agent regulation) to mitigate these identified risks. Furthermore, we highlight the limitations and challenges associated with safeguarding scientific agents and advocate for the development of improved models, robust benchmarks, and comprehensive regulations to address these issues effectively.
@article{tang2024prioritizing, title={Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science}, author={Tang, Xiangru and Jin, Qiao and Zhu, Kunlun and Yuan, Tongxin and Zhang, Yichi and Zhou, Wangchunshu and Qu, Meng and Zhao, Yilun and Tang, Jian and Zhang, Zhuosheng and others}, journal={arXiv preprint arXiv:2402.04247}, year={2024} }
-
Step-Back Profiling: Distilling User History for Personalized Scientific Writing
Xiangru Tang, Xingyao Zhang, Yanjun Shao, Jie Wu, Yilun Zhao, Arman Cohan, Ming Gong, Dongmei Zhang, Mark Gerstein.
IJCAI 2024 Workshop on AI4Research (Best Paper Award)
[PDF] [Abstract] [Bib]
Large language models (LLMs) excel at a variety of natural language processing tasks, yet they struggle to generate personalized content for individuals, particularly in real-world scenarios like scientific writing. Addressing this challenge, we introduce Step-Back Profiling to personalize LLMs by distilling user history into concise profiles, including essential traits and preferences of users. For our experiments, we construct a Personalized Scientific Writing (PSW) dataset to study multi-user personalization. PSW requires the models to write scientific papers given specialized author groups with diverse academic backgrounds. Our results demonstrate the effectiveness of capturing user characteristics via Step-Back Profiling for collaborative writing. Moreover, our approach outperforms the baselines by up to 3.6 points on the general personalization benchmark (LaMP), which includes 7 personalization LLM tasks. Our extensive ablation studies validate the contributions of different components in our method and provide insights into our task definition. (See the profiling sketch after this entry.)
@article{tang2024step, title={Step-Back Profiling: Distilling User History for Personalized Scientific Writing}, author={Xiangru Tang and Xingyao Zhang and Yanjun Shao and Jie Wu and Yilun Zhao and Arman Cohan and Ming Gong and Dongmei Zhang and Mark Gerstein}, journal={arXiv preprint arXiv:2406.14275}, year={2024} }
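As a rough illustration of the profile-then-generate idea described above, the sketch below distills each collaborator's history into a short profile and then conditions the writing prompt on those profiles. `call_llm` is a hypothetical stand-in for any chat-completion client, and the prompts are illustrative, not the paper's actual templates.

```python
# Hedged sketch of profile-then-generate personalization.
# call_llm is a hypothetical placeholder for an LLM client, not a real API.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your preferred LLM client here")

def distill_profile(user_history: list[str]) -> str:
    """Compress a user's past documents into a concise trait/preference profile."""
    history = "\n---\n".join(user_history)
    prompt = (
        "Summarize the author's key research interests, writing style, and "
        "preferences as a short bulleted profile:\n" + history
    )
    return call_llm(prompt)

def personalized_write(task: str, profiles: list[str]) -> str:
    """Condition generation on the distilled profiles of all collaborating authors."""
    prompt = "Author profiles:\n" + "\n\n".join(profiles) + f"\n\nWriting task: {task}"
    return call_llm(prompt)
```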
-
BC-Design: A Biochemistry-Aware Framework for High-Precision Inverse Protein Folding
Xiangru Tang*, Xinwu Ye*, Fang Wu*, Yanjun Shao, Yin Fang, Siming Chen, Dong Xu, Mark Gerstein.
bioRxiv, 2024
"A quantum leap in inverse protein folding from 67% to 88%!"
[PDF] [Abstract] [Bib]
Inverse protein folding, which aims to design amino acid sequences for desired protein structures, is fundamental to protein engineering and therapeutic development. While recent deep-learning approaches have made remarkable progress in addressing this challenge, they typically represent biochemical properties as discrete features associated with individual residues. Here, we present BC-Design, an approach that explicitly represents these properties as decorations on randomly sampled points on exterior surfaces and within internally bound regions representing the complete molecular extent of the protein. This provides a more natural way to capture the spatial distribution of properties. We demonstrate that BC-Design significantly outperforms all current methods, improving sequence recovery from 67% to 88.37% over the state-of-the-art methods (a 21.32% absolute improvement) and reducing perplexity from 2.4 to 1.47 (a 39.51% relative improvement) on the CATH 4.2 benchmark. Notably, our model exhibits robust generalization across diverse protein characteristics, achieving consistently high performance on proteins of varying sizes (50-500 residues), structural complexity (measured by contact order), and all major CATH fold classes. Through ablation tests, we compare the relative contributions of the structure encoding and the encoded property information, and we show that both contribute substantially and roughly equally to this strong performance. Overall, this opens new avenues for computational protein engineering and drug discovery.
@article{tang2024bc, title={BC-Design: A Biochemistry-Aware Framework for High-Precision Inverse Protein Folding}, author={Tang, Xiangru and Ye, Xinwu and Wu, Fang and Shao, Yanjun and Fang, Yin and Chen, Siming and Xu, Dong and Gerstein, Mark}, journal={bioRxiv}, pages={2024--10}, year={2024}, publisher={Cold Spring Harbor Laboratory} }
-
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code
Xiangru Tang*, Yuliang Liu*, Zefan Cai*, Junjie Lu, Yichi Zhang, Yanjun Shao, Zexuan Deng, Helan Hu, Kaikai An, Ruijun Huang, Shuzheng Si, Sheng Chen, Haozhe Zhao, Liang Chen, Yan Wang, Tianyu Liu, Zhiwei Jiang, Baobao Chang, Yujia Qin, Wangchunshu Zhou, Yilun Zhao, Arman Cohan, Mark Gerstein.
arXiv, 2023
"Can LLMs do machine learning tasks?"
[PDF] [Abstract] [Bib]
Despite Large Language Models (LLMs) like GPT-4 achieving impressive results in function-level code generation, they struggle with repository-scale code understanding (e.g., coming up with the right arguments for calling routines), which requires a deeper comprehension of complex file interactions. In addition, researchers have recently developed LLM agents that attempt to interact with repository code (e.g., compiling and evaluating its execution), prompting the need to evaluate their performance. These gaps have motivated our development of ML-Bench, a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks. Addressing the need for LLMs to interpret long code contexts and translate instructions into precise, executable scripts, ML-Bench encompasses 9,641 annotated examples across 18 GitHub repositories, challenging LLMs to accommodate user-specified arguments and documentation intricacies effectively. To evaluate both LLMs and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in end-to-end task execution within a Linux sandbox environment. Our findings indicate that while GPT-4o leads with a Pass@5 rate surpassing 50%, there remains significant scope for improvement, highlighted by issues such as hallucinated outputs and difficulties with bash script generation. Notably, in the more demanding ML-Agent-Bench, GPT-4o achieves a 76.47% success rate, reflecting the efficacy of iterative action and feedback in complex task resolution. (A Pass@k computation sketch follows this entry.)
@article{tang2023ml, title={ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code}, author={Tang, Xiangru and Liu, Yuliang and Cai, Zefan and Shao, Yanjun and Lu, Junjie and Zhang, Yichi and Deng, Zexuan and Hu, Helan and Yang, Zengxian and An, Kaikai and others}, journal={arXiv preprint arXiv:2311.09835}, year={2023} }
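For readers unfamiliar with the Pass@k metric quoted above, the snippet below computes the standard unbiased estimator from the code-generation evaluation literature, given n generations per task of which c pass. It shows how a Pass@5 number is obtained in general; it is not ML-Bench's own evaluation code.

```python
# Standard unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = generations sampled per task, c = generations that pass, k = budget."""
    if n - c < k:
        return 1.0  # with fewer than k failures, any k-sample must contain a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations per task, 3 pass the tests, report pass@5.
print(round(pass_at_k(n=10, c=3, k=5), 3))
```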
-
A Survey of Generative AI for De Novo Drug Design: New Frontiers in Molecule and Protein Generation
Xiangru Tang*, Howard Dai*, Elizabeth Knight*, Fang Wu, Yunyang Li, Tianxiao Li, Mark Gerstein.
Briefings in Bioinformatics 2024 (IF=13.99)
"An introductory overview with a clear breakdown of datasets, benchmarks, & models."
[PDF] [Abstract] [Bib]
Artificial intelligence (AI)-driven methods can vastly improve the historically costly drug design process, with various generative models already in widespread use. Generative models for de novo drug design, in particular, focus on the creation of novel biological compounds entirely from scratch, representing a promising future direction. Rapid development in the field, combined with the inherent complexity of the drug design process, creates a difficult landscape for new researchers to enter. In this survey, we organize de novo drug design into two overarching themes: small molecule and protein generation. Within each theme, we identify a variety of subtasks and applications, highlighting important datasets, benchmarks, and model architectures and comparing the performance of top models. We take a broad approach to AI-driven drug design, allowing for both micro-level comparisons of various methods within each subtask and macro-level observations across different fields. We discuss parallel challenges and approaches between the two applications and highlight future directions for AI-driven de novo drug design as a whole. An organized repository of all covered sources is available at https://github.com/gersteinlab/GenAI4Drug.
@article{tang2024survey, title={A survey of generative ai for de novo drug design: new frontiers in molecule and protein generation}, author={Tang, Xiangru and Dai, Howard and Knight, Elizabeth and Wu, Fang and Li, Yunyang and Li, Tianxiao and Gerstein, Mark}, journal={Briefings in Bioinformatics}, volume={25}, number={4}, year={2024}, publisher={Oxford Academic} }
-
MolLM: A Unified Language Model for Integrating Biomedical Text with 2D and 3D Molecular Representations
Xiangru Tang, Andrew Tran, Jeffrey Tan, Mark Gerstein.
ISMB 2024 (published in Bioinformatics, IF=6.93)
[PDF] [Abstract] [Bib]
The present paradigm of deep learning models for molecular representation relies mostly on 1D or 2D formats, neglecting significant 3D structural information that offers valuable physical insight. This narrow focus inhibits the model's versatility and adaptability across a wide range of modalities. Conversely, the smaller amount of research that focuses on explicit 3D representation tends to overlook textual data within the biomedical domain. We present a unified pre-trained language model that concurrently captures biomedical text, 2D, and 3D molecular information. Our model (the three-modality molecular language model, MolLM) consists of a text Transformer encoder and a molecular Transformer encoder, which encodes both 2D and 3D molecular structures. Our approach employs contrastive learning as a supervisory signal for cross-modal information learning, and we assemble a multimodality dataset using cheminformatics-based molecular modifications and a wealth of chemical text. MolLM demonstrates robust molecular representation capabilities in numerous downstream tasks, including cross-modality molecule and text matching, property prediction, captioning, and text-prompted editing. Through ablating the 3D functionality of our model, we demonstrate that the inclusion of text, 2D, and 3D representations significantly improves performance on the downstream tasks. Our code, data, and pre-trained model weights are all available at https://github.com/gersteinlab/MolLM. (See the contrastive-loss sketch after this entry.)
@article{10.1093/bioinformatics/btae260, author = {Tang, Xiangru and Tran, Andrew and Tan, Jeffrey and Gerstein, Mark B}, title = "{MolLM: a unified language model for integrating biomedical text with 2D and 3D molecular representations}", journal = {Bioinformatics}, volume = {40}, number = {Supplement_1}, pages = {i357-i368}, year = {2024}, month = {06}, abstract = "{The current paradigm of deep learning models for the joint representation of molecules and text primarily relies on 1D or 2D molecular formats, neglecting significant 3D structural information that offers valuable physical insight. This narrow focus inhibits the models’ versatility and adaptability across a wide range of modalities. Conversely, the limited research focusing on explicit 3D representation tends to overlook textual data within the biomedical domain.We present a unified pre-trained language model, MolLM, that concurrently captures 2D and 3D molecular information alongside biomedical text. MolLM consists of a text Transformer encoder and a molecular Transformer encoder, designed to encode both 2D and 3D molecular structures. To support MolLM’s self-supervised pre-training, we constructed 160K molecule-text pairings. Employing contrastive learning as a supervisory signal for learning, MolLM demonstrates robust molecular representation capabilities across four downstream tasks, including cross-modal molecule and text matching, property prediction, captioning, and text-prompted molecular editing. Through ablation, we demonstrate that the inclusion of explicit 3D representations improves performance in these downstream tasks.Our code, data, pre-trained model weights, and examples of using our model are all available at https://github.com/gersteinlab/MolLM. In particular, we provide Jupyter Notebooks offering step-by-step guidance on how to use MolLM to extract embeddings for both molecules and text.}", issn = {1367-4811}, doi = {10.1093/bioinformatics/btae260}, url = {https://doi.org/10.1093/bioinformatics/btae260}, eprint = {https://academic.oup.com/bioinformatics/article-pdf/40/Supplement\_1/i357/58355106/btae260.pdf}, }
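The cross-modal supervision described above is, in spirit, an InfoNCE-style contrastive objective between paired text and molecule embeddings. The sketch below shows that general form in PyTorch; the random embeddings, dimensions, and temperature are placeholders rather than MolLM's released configuration.

```python
# InfoNCE-style symmetric contrastive loss between paired text/molecule embeddings.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(text_emb, mol_emb, temperature: float = 0.07):
    """text_emb, mol_emb: (batch, dim) embeddings of matched text-molecule pairs."""
    text_emb = F.normalize(text_emb, dim=-1)
    mol_emb = F.normalize(mol_emb, dim=-1)
    logits = text_emb @ mol_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(text_emb.size(0))        # matched pairs sit on the diagonal
    # Symmetric loss: text-to-molecule and molecule-to-text retrieval directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random placeholder embeddings.
loss = cross_modal_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```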
-
BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models
Xiangru Tang, Bill Qian, Rick Gao, Jiakang Chen, Xinyun Chen, Mark Gerstein.
ISMB 2024 (published in Bioinformatics, IF=6.93)
"BioCoder input covers repository-level potential package dependencies, class declarations, & global variables."
[PDF] [Abstract] [Bib]
Pre-trained language models like ChatGPT have significantly improved code generation. As these models scale up, there is an increasing need for the output to handle more intricate tasks. Moreover, in bioinformatics, generating functional programs poses additional notable challenges due to the amount of domain knowledge, the need for complicated data operations, and intricate functional dependencies between the operations. Here, we present BioCoder, a benchmark developed to evaluate existing pre-trained models in generating bioinformatics code. In relation to function-code generation, BioCoder covers potential package dependencies, class declarations, and global variables. It incorporates 1026 functions and 1243 methods in Python and Java from GitHub and 253 examples from the Rosalind Project. BioCoder incorporates a fuzz-testing framework for evaluation, and we have applied it to evaluate many models including InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, and ChatGPT. Our detailed analysis of these models emphasizes the importance of domain knowledge, pragmatic code generation, and contextual understanding. Our dataset, benchmark, Docker images, and scripts required for testing are all available at https://github.com/gersteinlab/biocoder. (A toy fuzz-testing sketch follows this entry.)
@article{10.1093/bioinformatics/btae230, author = {Tang, Xiangru and Qian, Bill and Gao, Rick and Chen, Jiakang and Chen, Xinyun and Gerstein, Mark B}, title = "{BioCoder: a benchmark for bioinformatics code generation with large language models}", journal = {Bioinformatics}, volume = {40}, number = {Supplement_1}, pages = {i266-i276}, year = {2024}, month = {06}, abstract = "{Pretrained large language models (LLMs) have significantly improved code generation. As these models scale up, there is an increasing need for the output to handle more intricate tasks and to be appropriately specialized to particular domains. Here, we target bioinformatics due to the amount of domain knowledge, algorithms, and data operations this discipline requires. We present BioCoder, a benchmark developed to evaluate LLMs in generating bioinformatics-specific code. BioCoder spans much of the field, covering cross-file dependencies, class declarations, and global variables. It incorporates 1026 Python functions and 1243 Java methods extracted from GitHub, along with 253 examples from the Rosalind Project, all pertaining to bioinformatics. Using topic modeling, we show that the overall coverage of the included code is representative of the full spectrum of bioinformatics calculations. BioCoder incorporates a fuzz-testing framework for evaluation. We have applied it to evaluate various models including InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, GPT-3.5, and GPT-4. Furthermore, we fine-tuned one model (StarCoder), demonstrating that our training dataset can enhance the performance on our testing benchmark (by \\>15\\% in terms of Pass@K under certain prompt configurations and always \\>3\\%). The results highlight two key aspects of successful models: (i) Successful models accommodate a long prompt (\\>2600 tokens) with full context, including functional dependencies. (ii) They contain domain-specific knowledge of bioinformatics, beyond just general coding capability. This is evident from the performance gain of GPT-3.5/4 compared to the smaller models on our benchmark (50\\% versus up to 25\\%).All datasets, benchmark, Docker images, and scripts required for testing are available at: https://github.com/gersteinlab/biocoder and https://biocoder-benchmark.github.io/.}", issn = {1367-4811}, doi = {10.1093/bioinformatics/btae230}, url = {https://doi.org/10.1093/bioinformatics/btae230}, eprint = {https://academic.oup.com/bioinformatics/article-pdf/40/Supplement\_1/i266/58354818/btae230.pdf}, }
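To illustrate the fuzz-testing style of evaluation mentioned above, here is a toy harness that runs a (pretend) model-generated bioinformatics function against random inputs and checks it against a trusted reference. The GC-content functions are invented examples for illustration, not items from the BioCoder dataset or its actual test framework.

```python
# Toy fuzz-testing harness: compare a candidate implementation against a reference.
import random

def reference_gc_content(seq: str) -> float:
    # Trusted reference: fraction of G/C bases in a DNA sequence.
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def generated_gc_content(seq: str) -> float:
    # Pretend this implementation came from an LLM and must be validated.
    return sum(base in "GC" for base in seq) / max(len(seq), 1)

def fuzz_test(candidate, reference, trials: int = 1000) -> bool:
    """Run the candidate on random DNA strings and flag any mismatch with the reference."""
    rng = random.Random(0)
    for _ in range(trials):
        seq = "".join(rng.choice("ACGT") for _ in range(rng.randint(0, 50)))
        if abs(candidate(seq) - reference(seq)) > 1e-9:
            return False   # counterexample found
    return True

print(fuzz_test(generated_gc_content, reference_gc_content))
```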
-
MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning
Xiangru Tang*, Anni Zou*, Zhuosheng Zhang, Yilun Zhao, Xingyao Zhang, Arman Cohan, Mark Gerstein.
ACL 2024 Findings
"The first multi-agent framework within the medical context!"
[PDF] [Abstract] [Bib]
Large Language Models (LLMs), despite their remarkable progress across various general domains, encounter significant barriers in medicine and healthcare. This field faces unique challenges such as domain-specific terminologies and reasoning over specialized knowledge. To address these issues, we propose a novel Multi-disciplinary Collaboration (MC) framework for the medical domain that leverages role-playing LLM-based agents who participate in a collaborative multi-round discussion, thereby enhancing LLM proficiency and reasoning capabilities. This training-free and interpretable framework encompasses five critical steps: gathering domain experts, proposing individual analyses, summarizing these analyses into a report, iterating over discussions until a consensus is reached, and ultimately making a decision. Our work particularly focuses on the zero-shot scenario; our results on nine datasets (MedQA, MedMCQA, PubMedQA, and six subtasks from MMLU) establish that our proposed MC framework excels at mining and harnessing the medical expertise in LLMs, as well as extending its reasoning abilities. Based on these outcomes, we further conduct a human evaluation to pinpoint and categorize common errors within our method, as well as ablation studies aimed at understanding the impact of various factors on overall performance. (See the discussion-loop sketch after this entry.)
@article{tang2023medagents, title={MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning}, author={Tang, Xiangru and Zou, Anni and Zhang, Zhuosheng and Zhao, Yilun and Zhang, Xingyao and Cohan, Arman and Gerstein, Mark}, journal={arXiv preprint arXiv:2311.10537}, year={2023} }
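A hedged sketch of the five-step collaborative loop described above (gather experts, individual analyses, summarize into a report, iterate to consensus, decide) is given below. `call_llm` is a hypothetical chat-completion wrapper and the prompts are simplified stand-ins, not the paper's actual prompt templates.

```python
# Hedged sketch of a multi-round, role-playing medical discussion loop.
# call_llm is a hypothetical placeholder for an LLM client, not a real API.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client")

def medical_mc_answer(question: str, max_rounds: int = 3) -> str:
    # Step 1: gather relevant domain experts (one per line in the reply).
    experts = call_llm(f"List the medical specialties relevant to: {question}").splitlines()
    report = ""
    for _ in range(max_rounds):
        # Step 2: each expert proposes an individual analysis.
        analyses = [
            call_llm(f"As a {e} expert, analyze this question:\n{question}\n\nCurrent report:\n{report}")
            for e in experts
        ]
        # Step 3: summarize the analyses into a shared report.
        report = call_llm("Summarize these analyses into one report:\n" + "\n\n".join(analyses))
        # Step 4: stop iterating once the experts agree.
        if "CONSENSUS" in call_llm(f"Do the analyses agree? Reply CONSENSUS or DISAGREE.\n{report}"):
            break
    # Step 5: make the final decision from the consensus report.
    return call_llm(f"Given the report below, give the final answer to: {question}\n{report}")
```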
-
Struc-Bench: Are Large Language Models Good at Generating Complex Structured Tabular Data?
Xiangru Tang, Yiming Zong, Jason Phang, Yilun Zhao, Wangchunshu Zhou, Arman Cohan, Mark Gerstein.
NAACL 2024 (Oral)
[PDF] [Abstract] [Bib]
Despite the remarkable capabilities of Large Language Models (LLMs) like GPT-4, producing complex, structured tabular data remains challenging. Our study assesses LLMs' proficiency in structuring tables and introduces a novel fine-tuning method, cognizant of data structures, to bolster their performance. We unveil Struc-Bench, a comprehensive benchmark featuring prominent LLMs (GPT-NeoX-20B, GPT-3.5, GPT-4, and Vicuna), which spans text tables, HTML, and LaTeX formats. Our proposed FormatCoT aids in crafting format-specific instructions from the intended outputs to populate this benchmark. Addressing the gap in task-centered evaluation, we propose two innovative metrics, P-Score (Prompting Score) and H-Score (Heuristical Score), to more accurately gauge LLM performance. Our experiments show that applying our structure-aware fine-tuning to LLaMA-7B leads to substantial performance gains, outshining its LLM counterparts across most measures. An in-depth error analysis and an ability map across six dimensions (coverage, formatting, reasoning, comprehension, pragmatics, and hallucination) highlight areas for future enhancement and suggest forthcoming research trajectories. Our code and models can be found at https://github.com/gersteinlab/Struc-Bench.
@inproceedings{tang-etal-2024-struc, title = "Struc-Bench: Are Large Language Models Good at Generating Complex Structured Tabular Data?", author = "Tang, Xiangru and Zong, Yiming and Phang, Jason and Zhao, Yilun and Zhou, Wangchunshu and Cohan, Arman and Gerstein, Mark", editor = "Duh, Kevin and Gomez, Helena and Bethard, Steven", booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)", month = jun, year = "2024", address = "Mexico City, Mexico", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2024.naacl-short.2", pages = "12--34", abstract = "Despite the remarkable capabilities of Large Language Models (LLMs) like GPT-4, producing complex, structured tabular data remains challenging. Our study assesses LLMs{'} proficiency in structuring tables and introduces a novel fine-tuning method, cognizant of data structures, to bolster their performance. We unveil Struc-Bench, a comprehensive benchmark featuring prominent LLMs (GPT-NeoX-20B, GPT-3.5, GPT-4, and Vicuna), which spans text tables, HTML, and LaTeX formats. Our proposed FormatCoT aids in crafting format-specific instructions from the intended outputs to populate this benchmark. Addressing the gap in task-centered evaluation, we propose two innovative metrics, P-Score (Prompting Score) and H-Score (Heuristical Score), to more accurately gauge LLM performance. Our experiments show that applying our structure-aware fine-tuning to LLaMA-7B leads to substantial performance gains, outshining its LLM counterparts across most measures. In-depth error analysis and creating an ability map across six dimensions, coverage, formatting, reasoning, comprehension, pragmatics, and hallucination, highlight areas for future enhancements and suggest forthcoming research trajectories. Our code and models can be found at https://github.com/gersteinlab/Struc-Bench.", }
-
Meta-CoT: Generalizable Chain-of-Thought Prompting in Mixed-task Scenarios with Large Language Models
Anni Zou, Zhuosheng Zhang, Hai Zhao, Xiangru Tang.
arXiv, 2023
"Bridge the gap between performance and generalization when using the CoT prompting!"
[PDF] [Abstract] [Bib]
Large language models (LLMs) have unveiled remarkable reasoning capabilities by exploiting chain-of-thought (CoT) prompting, which generates intermediate reasoning chains to serve as the rationale for deriving the answer. However, current CoT methods either simply employ general prompts such as "Let's think step by step", or heavily rely on handcrafted task-specific demonstrations to attain preferable performance, thereby engendering an inescapable gap between performance and generalization. To bridge this gap, we propose Meta-CoT, a generalizable CoT prompting method for mixed-task scenarios where the type of input questions is unknown. Meta-CoT first categorizes the scenario based on the input question and then automatically constructs diverse demonstrations from the corresponding data pool. Meta-CoT simultaneously achieves remarkable performance on ten public benchmark reasoning tasks and superior generalization capability. Notably, Meta-CoT achieves the state-of-the-art result on SVAMP (93.7%) without any additional program-aided methods. Our further experiments on five out-of-distribution datasets verify the stability and generality of Meta-CoT.
@article{zou2023metacot, title={Meta-CoT: Generalizable Chain-of-Thought Prompting in Mixed-task Scenarios with Large Language Models}, author={Anni Zou and Zhuosheng Zhang and Hai Zhao and Xiangru Tang}, journal={arXiv preprint arXiv:2310.06692}, year={2023} }
-
Aligning Factual Consistency for Clinical Studies Summarization through Reinforcement Learning
Xiangru Tang, Arman Cohan, Mark Gerstein.
Clinical Natural Language Processing Workshop at ACL 2023
[PDF] [Abstract] [Bib]
In the rapidly evolving landscape of medical research, accurate and concise summarization of clinical studies is crucial to support evidence-based practice. This paper presents a novel approach to clinical studies summarization, leveraging reinforcement learning to enhance factual consistency and align with human annotator preferences. Our work focuses on two tasks: Conclusion Generation and Review Generation. We train a CONFIT summarization model that outperforms GPT-3 and previous state-of-the-art models on the same datasets, and we collect expert and crowd-worker annotations to evaluate the quality and factual consistency of the generated summaries. These annotations enable us to measure the correlation of various automatic metrics, including modern factual evaluation metrics like QAFactEval, with human-assessed factual consistency. By employing top-correlated metrics as objectives for a reinforcement learning model, we demonstrate improved factuality in generated summaries that are preferred by human annotators.
@inproceedings{tang-etal-2023-aligning, title = "Aligning Factual Consistency for Clinical Studies Summarization through Reinforcement Learning", author = "Tang, Xiangru and Cohan, Arman and Gerstein, Mark", editor = "Naumann, Tristan and Ben Abacha, Asma and Bethard, Steven and Roberts, Kirk and Rumshisky, Anna", booktitle = "Proceedings of the 5th Clinical Natural Language Processing Workshop", month = jul, year = "2023", address = "Toronto, Canada", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2023.clinicalnlp-1.7", doi = "10.18653/v1/2023.clinicalnlp-1.7", pages = "48--58", abstract = "In the rapidly evolving landscape of medical research, accurate and concise summarization of clinical studies is crucial to support evidence-based practice. This paper presents a novel approach to clinical studies summarization, leveraging reinforcement learning to enhance factual consistency and align with human annotator preferences. Our work focuses on two tasks: Conclusion Generation and Review Generation. We train a CONFIT summarization model that outperforms GPT-3 and previous state-of-the-art models on the same datasets and collects expert and crowd-worker annotations to evaluate the quality and factual consistency of the generated summaries. These annotations enable us to measure the correlation of various automatic metrics, including modern factual evaluation metrics like QAFactEval, with human-assessed factual consistency. By employing top-correlated metrics as objectives for a reinforcement learning model, we demonstrate improved factuality in generated summaries that are preferred by human annotators.", }
-
GersteinLab at MEDIQA-Chat 2023: Clinical Note Summarization from Doctor-Patient Conversations through Fine-tuning and In-context Learning
Xiangru Tang, Andrew Tran, Jeffrey Tan, Mark Gerstein.
Clinical Natural Language Processing Workshop at ACL 2023
[PDF] [Abstract] [Bib]
This paper presents our contribution to the MEDIQA-2023 Dialogue2Note shared task, encompassing both subtask A and subtask B. We approach the task as a dialogue summarization problem and implement two distinct pipelines: (a) fine-tuning of a pre-trained dialogue summarization model and GPT-3, and (b) few-shot in-context learning (ICL) using a large language model, GPT-4. Both methods achieve excellent results in terms of ROUGE-1 F1, BERTScore F1 (deberta-xlarge-mnli), and BLEURT, with scores of 0.4011, 0.7058, and 0.5421, respectively. Additionally, we predict the associated section headers using RoBERTa and SciBERT based classification models. Our team ranked fourth among all teams, while each team is allowed to submit three runs as part of their submission. We also utilize expert annotations to demonstrate that the notes generated through the ICL GPT-4 are better than all other baselines.
@inproceedings{tang-etal-2023-gersteinlab, title = "{G}erstein{L}ab at {MEDIQA}-Chat 2023: Clinical Note Summarization from Doctor-Patient Conversations through Fine-tuning and In-context Learning", author = "Tang, Xiangru and Tran, Andrew and Tan, Jeffrey and Gerstein, Mark", editor = "Naumann, Tristan and Ben Abacha, Asma and Bethard, Steven and Roberts, Kirk and Rumshisky, Anna", booktitle = "Proceedings of the 5th Clinical Natural Language Processing Workshop", month = jul, year = "2023", address = "Toronto, Canada", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2023.clinicalnlp-1.58", doi = "10.18653/v1/2023.clinicalnlp-1.58", pages = "546--554", abstract = "This paper presents our contribution to the MEDIQA-2023 Dialogue2Note shared task, encompassing both subtask A and subtask B. We approach the task as a dialogue summarization problem and implement two distinct pipelines: (a) a fine-tuning of a pre-trained dialogue summarization model and GPT-3, and (b) few-shot in-context learning (ICL) using a large language model, GPT-4. Both methods achieve excellent results in terms of ROUGE-1 F1, BERTScore F1 (deberta-xlarge-mnli), and BLEURT, with scores of 0.4011, 0.7058, and 0.5421, respectively. Additionally, we predict the associated section headers using RoBERTa and SciBERT based classification models. Our team ranked fourth among all teams, while each team is allowed to submit three runs as part of their submission. We also utilize expert annotations to demonstrate that the notes generated through the ICL GPT-4 are better than all other baselines. The code for our submission is available.", }
-
CONFIT: Toward Faithful Dialogue Summarization with Linguistically-Informed Contrastive Fine-tuning
Xiangru Tang, Arjun Nair, Borui Wang, Bingyao Wang, Jai Desai, Aaron Wade, Haoran Li, Asli Celikyilmaz, Yashar Mehdad, Dragomir Radev.
NAACL 2022 (Oral)
[PDF] [Abstract] [Bib]
Factual inconsistencies in generated summaries severely limit the practical applications of abstractive dialogue summarization. Although significant progress has been achieved by using pre-trained neural language models, substantial amounts of hallucinated content are found during human evaluation. In this work, we first devised a typology of factual errors to better understand the types of hallucinations generated by current models and conducted human evaluation on popular dialogue summarization datasets. We further propose a training strategy that improves the factual consistency and overall quality of summaries via a novel contrastive fine-tuning, called CONFIT. To tackle top factual errors from our annotation, we introduce an additional contrastive loss with carefully designed hard negative samples and a self-supervised dialogue-specific loss to capture the key information between speakers. We show that our model significantly reduces all kinds of factual errors on both SAMSum dialogue summarization and AMI meeting summarization. On both datasets, we achieve significant improvements over state-of-the-art baselines using both automatic metrics, ROUGE and BARTScore, and human evaluation.
@inproceedings{tang-etal-2022-confit, title = "{CONFIT}: Toward Faithful Dialogue Summarization with Linguistically-Informed Contrastive Fine-tuning", author = "Tang, Xiangru and Nair, Arjun and Wang, Borui and Wang, Bingyao and Desai, Jai and Wade, Aaron and Li, Haoran and Celikyilmaz, Asli and Mehdad, Yashar and Radev, Dragomir", editor = "Carpuat, Marine and de Marneffe, Marie-Catherine and Meza Ruiz, Ivan Vladimir", booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies", month = jul, year = "2022", address = "Seattle, United States", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2022.naacl-main.415", doi = "10.18653/v1/2022.naacl-main.415", pages = "5657--5668", }
-
Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries
Xiangru Tang, Alexander Fabbri, Haoran Li, Ziming Mao, Griffin Adams, Borui Wang, Asli Celikyilmaz, Yashar Mehdad, Dragomir Radev.
NAACL 2022
[PDF] [Abstract] [Bib]
Current pre-trained models applied for summarization are prone to factual inconsistencies that misrepresent the source text. Evaluating the factual consistency of summaries is thus necessary to develop better models. However, the human evaluation setup for evaluating factual consistency has not been standardized. To determine the factors that affect the reliability of the human evaluation, we crowdsource evaluations for factual consistency across state-of-the-art models on two news summarization datasets using the rating-based Likert Scale and ranking-based Best-Worst Scaling. Our analysis reveals that the ranking-based Best-Worst Scaling offers a more reliable measure of summary quality across datasets and that the reliability of Likert ratings highly depends on the target dataset and the evaluation design. To improve crowdsourcing reliability, we extend the scale of the Likert rating and present a scoring algorithm for Best-Worst Scaling that we call value learning. Our crowdsourcing guidelines will be publicly available to facilitate future work on factual consistency in summarization. (See the Best-Worst Scaling scoring sketch after this entry.)
@inproceedings{tang-etal-2022-investigating, title = "Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries", author = "Tang, Xiangru and Fabbri, Alexander and Li, Haoran and Mao, Ziming and Adams, Griffin and Wang, Borui and Celikyilmaz, Asli and Mehdad, Yashar and Radev, Dragomir", editor = "Carpuat, Marine and de Marneffe, Marie-Catherine and Meza Ruiz, Ivan Vladimir", booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies", month = jul, year = "2022", address = "Seattle, United States", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2022.naacl-main.417", doi = "10.18653/v1/2022.naacl-main.417", pages = "5680--5692", }
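For context on Best-Worst Scaling, the snippet below computes the conventional counting-based BWS score (times chosen best minus times chosen worst, normalized by the number of appearances). This standard estimator is shown only for illustration; it is not the paper's proposed value-learning scoring algorithm.

```python
# Conventional counting-based Best-Worst Scaling scores.
from collections import Counter

def bws_scores(annotations):
    """annotations: iterable of (items_shown, best_item, worst_item) tuples."""
    best, worst, shown = Counter(), Counter(), Counter()
    for items, b, w in annotations:
        shown.update(items)
        best[b] += 1
        worst[w] += 1
    # Score = (#best - #worst) / #appearances, in [-1, 1].
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}

# Toy example: three 4-tuples of candidate summaries A-D judged by annotators.
annotations = [
    (("A", "B", "C", "D"), "A", "D"),
    (("A", "B", "C", "D"), "B", "D"),
    (("A", "C", "D", "B"), "A", "C"),
]
print(bws_scores(annotations))
```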
Recent Talks
11/2024 Talk at Takeda Pharmaceutical.
07/2024 Talk at Yale Department of Biomedical Informatics & Data Science.
07/2024 Talk at ISMB 2024 Text Mining Section.
07/2024 Talk at Multimodal Large Language Model.
02/2024 Talk at AI in Medicine Symposium at Yale School of Medicine.
01/2024 Talk at PSB 2024 Workshop on LLMs for Biomedicine.
07/2023 Talk at ISMB/ECCB 2023 Text Mining Section.
Professional Services
Area Chair: ACL ARR (ACL, EMNLP, NAACL, etc.).
Workshop Organizer: ICLR 2024 Workshop on LLM Agents, SIGDIAL/INLG 2023 Workshop on Taming LLMs.
Tutorial Organizer: ISMB 2024 Tutorial on A Practical Introduction to LLMs in Biomedical Research.
Session Chair: ACL 2024 BoF on AI for Science, NAACL 2024 BoF on LLMs for Science.
Conference Program Committee / Reviewer: NeurIPS, ICML, ACL, EMNLP, CIKM, NAACL, INLG, IEEE BigData, COLM.
Journal Reviewer: npj Digital Medicine, TPAMI, Neurocomputing, Briefings in Bioinformatics, PLOS Computational Biology, BMC Bioinformatics, PLOS ONE, Health Data Science.
Workshop Reviewer: KDD 2023 Workshop on Data Mining in Bioinformatics, ACL 2023 Workshop on Building Educational Apps, ACL 2023 Workshop on Clinical NLP, ICML 2023 Workshop on Neural Conv AI, ICML 2023 Workshop on Interpretable ML in Healthcare, NAACL-HLT 2021 Workshop on Language and Vision Research.
Teaching
Teaching Fellow - CPSC 452/CPSC 552/AMTH 552/CB&B 663 Deep Learning Theory and Applications, Yale University, 2023 Spring.
Teaching Fellow - CPSC 437/CPSC 537 Introduction to Database Systems, Yale University, 2023 Fall.
Teaching Fellow - CPSC 452/CPSC 552/AMTH 552/CB&B 663 Deep Learning Theory and Applications, Yale University, 2024 Spring.
Teaching Fellow - CPSC 437/CPSC 537 Database Systems, Yale University, 2024 Fall.
Misc.
My 12 courses at Yale: CPSC 523 Principles of Operating Systems, 537 Intro to Database, 539 Software Engineering, 552 Deep Learning Theory, 553 Unsupervised Learning, 569 Randomized Algorithms, 577 NLP, 583 Deep Learning on Graph, 668 Blockchain Research, 677 Adv NLP, 680 Trustworthy Deep Learning, 752 Biomedical Data Sci.