Haoqin (Isaac) Tu    

I'm an incoming Ph.D. student at UCSC CSE, working with Prof. Cihang Xie. I obtained my M.Eng. at UCAS.

My research interests lie in Natural Language Processing (NLP), multi-modal learning, and their applications. I'm particularly interested in efficient and controllable generation (e.g., unsupervised, Plug-and-Play), multi-modal interactions (e.g., visual dialogue, captioning), and the combination of both. My ultimate goal is to give any off-the-shelf language model the ability to understand real-world experiences and interact with people.

Specifically, I'm now working on controllable, efficient, and multimodal text generation. I'm also interested in problems in LLM-based models.

I am open to research collaborations. I am also looking for internship positions for the summer of 2025.

Email: tuisaac163(at)gmail.com  /  Google Scholar  /  Github  /  Twitter

Publications
(C: Conference, J: Journal, P: Preprint, W: Workshop, * represents equal contribution.)

2024

[P10] What If We Recaption Billions of Web Images with LLaMA-3?
Xianhang Li*, Haoqin Tu*, Mude Hui*, Zeyu Wang*, Bingchen Zhao*, Junfei Xiao, Sucheng Ren, Jieru Mei, Qing Liu, Huangjie Zheng, Yuyin Zhou, Cihang Xie
Arxiv
arxiv / data / code / website

Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption ~1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models.

[P9] Autoregressive Pretraining with Mamba in Vision
Sucheng Ren, Xianhang Li, Haoqin Tu, Feng Wang, Fangxun Shu, Lei Zhang, Jieru Mei, Linjie Yang, Peng Wang, Heng Wang, Alan Yuille, Cihang Xie
Arxiv
arxiv / code

This paper shows that Mamba's visual capability can be significantly enhanced through autoregressive pretraining, a direction not previously explored.

[P8] How Far Are We From AGI
Tao Feng*, Chuanyang Jin*, Jingyu Liu*, Kunlun Zhu*, Haoqin Tu, Zirui Cheng, Guanyu Lin, Jiaxuan You
Arxiv
arxiv / paper list

This paper delves into the pivotal questions of our proximity to AGI and the strategies necessary for its realization through extensive surveys, discussions, and original perspectives. We start by articulating the requisite capability frameworks for AGI, integrating the internal, interface, and system dimensions.

[C5][P7] Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence
RWKV Team
COLM 2024
arxiv / code / models

We present Eagle (RWKV-5) and Finch (RWKV-6). Our architectural design advancements include multiheaded matrix-valued states and a dynamic recurrence mechanism that improve expressivity while maintaining the inference efficiency characteristics of RNNs.

[C4][P5] How Many Unicorns Are In This Image? A Safety Evaluation Benchmark For Vision LLMs
Haoqin Tu*, Chenhang Cui*, Zijun Wang*, Yiyang Zhou, Bingchen Zhao, Junlin Han, Wangchunshu Zhou, Huaxiu Yao, Cihang Xie
ECCV 2024
arxiv / code / VHELM evaluation

We introduce a comprehensive safety evaluation suite, covering both out-of-distribution (OOD) generalization with four new datasets and adversarial robustness with one novel attack and two existing attack strategies.

[C3][P6] Tuning LayerNorm in Attention: Towards Efficient Multi-Modal LLM Finetuning
Bingchen Zhao*, Haoqin Tu*, Chen Wei, Jieru Mei, Cihang Xie
ICLR 2024 (Spotlight, top 5%)
arxiv / OpenReview / HF tutorial

We propose LayerNorm tuning, a simple yet effective strategy for finetuning MLLMs. Compared to LoRA tuning, LayerNorm tuning reduces the trainable parameters by a significant 41.9% while improving model performance by 20%.

2023

[C2][P3] ReSee: Responding through Seeing Fine-grained Visual Knowledge in Open-domain Dialogue
Haoqin Tu, Yitong Li, Fei Mi, Zhongliang Yang
EMNLP 2023 (Oral)
arxiv / code / slides

Two multimodal dialogue datasets, currently the most fine-grained available, with entity-level and turn-level images built on Wizard of Wikipedia and DailyDialog, together with a unified multimodal dialogue system that uses either a shared or a separate encoder-decoder setup.

[W1][P4] Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics
Haoqin Tu*, Bingchen Zhao*, Chen Wei, Cihang Xie
Instruction Workshop@NeurIPS 2023
arxiv / code / poster / twitter

Without any explicit prompting for truthful or ethical behaviors, simply tuning an LLM on multi-modal instruction datasets leads to noticeable improvements on the TruthfulQA and Ethics benchmarks.

[C1] ZeroGen: Zero-shot Multimodal Controllable Text Generation with Multiple Oracles
Haoqin Tu, Bowen Yang, Xianfeng Zhao
NLPCC 2023 (acceptance rate: 29%)
arxiv / code / poster / reddit post

A zero-shot text generation model controlled by vision and text signals without extra training on images. ZeroGen achieves SOTA performance on three vision-language tasks (two captioning tasks and controllable news generation).

[J3] FET-LM: Flow Enhanced Variational Auto-Encoder for Topic-Guided Language Modeling
Haoqin Tu, Zhongliang Yang, Jinshuai Yang, Yongfeng Huang
TNNLS'23. IEEE Transactions on Neural Networks and Learning Systems (Early Access)
Impact Factor: 14.26
IEEE Xplore / code / paper&Appendix

A VAE model for unsupervised topic modeling and controllable text generation (CTG). It employs two continuous latent spaces, with a conditional dependency between them, for topic and sequence modeling. The model builds the sequence latent space with a series of flexible Householder transformations to create plausible content.

2022

[J1] PCAE: A Framework of Plug-in Conditional Auto-Encoder for Controllable Text Generation
Haoqin Tu, Zhongliang Yang, Jinshuai Yang, Siyu Zhang, Yongfeng Huang
KBS'22. Knowledge-Based Systems
Impact Factor: 8.14
paper / code

A model-agnostic framework for flexible, semi-supervised, and controllable text generation. The framework is “plug-and-play”: only part of the pre-trained model's parameters needs to be fine-tuned.

[P1] AdaVAE: Exploring Adaptive GPT-2s in VAEs for Language Modeling
Haoqin Tu, Zhongliang Yang, Jinshuai Yang, Siyu Zhang, Yongfeng Huang
Arxiv (Submitting to TASLP)
arxiv / code

The first big VAE model with adaptive parameter-efficient PLMs that can be optimized with minimal trainable parameters. Latent Attention is proposed to better construct latent spaces in the VAE from the transformer encoder. AdaVAE achieves competitive performance in language modeling and low-resource classification with only 14.66% of parameters activated.

[P2] An Overview on Controllable Text Generation via Variational Auto-Encoders
Haoqin Tu, Yitong Li
Arxiv
arxiv / paper list / Chinese blog

This survey introduces existing generation schemes and the problems associated with text auto-encoders, reviews several controllable generation applications that instantiate these general formulations, and discusses directions for future research.

[J2] Linguistic Steganalysis Towards Social Network
Jinshuai Yang, Zhongliang Yang, Jiajun Zou, Haoqin Tu, Yongfeng Huang
T-IFS'22. IEEE Transactions on Information Forensics and Security
Impact Factor: 7.23
IEEE Xplore / code

A dataset called Stego-Sandbox that simulates real social network scenarios, and an effective linguistic steganalysis framework that integrates linguistic features and context features.

Experiences
UCSC

Sep. 2024 - Present, VLAA Lab, UC Santa Cruz

Ph.D. Student, Multimodal & AI Safety.

UCAS

Aug. 2021 - Jun. 2024, University of Chinese Academy of Sciences

M.Eng., NLG & Multimodal, Current GPA: 3.88/4.0.

THU

Jun. 2020 - Sep. 2022, NGNLab, Tsinghua University

Research Assistant, NLG & Latent Variable Models.

Miscellaneous
@Monaco
@d’If-Island
@Cambridge
@Luzern
@London
@Marseille
