Trustworthy LLMs

Large language models are remarkably capable, and we use them in daily life. At the same time, they raise many practical issues, including hallucination and harmful responses. This raises the following question.
How can we mitigate the hallucination, safety, security, and bias problems of large language models (LLMs) and large reasoning models (LRMs)?
LLMs confidently generate wrong, biased, and harmful information, which undermines trust in LLMs as a knowledge base. How can we mitigate this? One possibility is to leverage conformal prediction and selective prediction to measure uncertainty as a basis for trust (e.g., NeurIPS'24); a minimal sketch of this idea is shown below. What other possibilities are there?
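The sketch below illustrates split conformal prediction for LLM question answering under assumptions not stated in the source: each candidate answer gets a confidence in [0, 1] (e.g., from sampling frequency or token log-probabilities), and a labeled calibration set is available. All function names, numbers, and the confidence source are hypothetical.

```python
# A minimal sketch of split conformal prediction for LLM answers.
# Assumptions (not from the source): a confidence score per candidate answer
# and a held-out calibration set with reference answers.
import numpy as np

def calibrate(cal_true_answer_confidences, alpha=0.1):
    """Return the split-conformal threshold q_hat.

    cal_true_answer_confidences: confidence the model assigned to the
    reference (correct) answer of each calibration question.
    """
    scores = np.sort(1.0 - np.asarray(cal_true_answer_confidences))  # nonconformity
    n = len(scores)
    k = int(np.ceil((n + 1) * (1.0 - alpha)))   # finite-sample-corrected rank
    return np.inf if k > n else scores[k - 1]   # k-th smallest score (1-indexed)

def prediction_set(candidate_confidences, q_hat):
    """Indices of candidate answers kept in the prediction set.

    Under exchangeability, the set covers the correct answer with probability
    at least 1 - alpha; an empty or overly large set can trigger abstention,
    which is the selective-prediction side of the idea.
    """
    scores = 1.0 - np.asarray(candidate_confidences)
    return np.nonzero(scores <= q_hat)[0]

# Toy usage with synthetic confidences (purely illustrative):
rng = np.random.default_rng(0)
q_hat = calibrate(rng.uniform(0.4, 1.0, size=500), alpha=0.1)
print(q_hat, prediction_set([0.92, 0.55, 0.10], q_hat))
```

The design choice here is the standard split-conformal recipe: calibrate a single nonconformity threshold once, then apply it per question, trading a small amount of statistical efficiency for a distribution-free coverage guarantee.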
Ongoing/Potential Projects
- Hallucination mitigation for LLMs/LRMs: Reduce the hallucination of LLMs and LRMs.
- Natural harmfulness mitigation for LLMs/LRMs: Improve the safety of LLMs and LRMs against natural (non-adversarial) jailbreaking.
- Adversarial harmfulness mitigation for LLMs/LRMs: Improve the security of LLMs and LRMs against adversarial jailbreaking.
- Uncertainty quantification for LLMs/LRMs: Quantify the uncertainty, and thus the likely correctness, of the answers of LLMs and LRMs.
- RL-based Alignment: Leverage bandits and Reinforcement Learning (RL) for building trustworthy LLMs (see the bandit sketch after this list).
- Trustworthiness Control: Leverage theoretical results from bandits and RL to provide trustworthiness guarantees.
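To make the bandit component of the RL-based alignment item concrete, here is a minimal UCB1 sketch. The setup is entirely hypothetical and not from the source: each "arm" is a decoding or prompting strategy, and the reward is a 0/1 trustworthiness signal from some judge (e.g., a factuality checker). It only illustrates the bandit mechanics, not any specific alignment method.

```python
# A minimal UCB1 bandit sketch: arms = hypothetical response strategies,
# reward = hypothetical 0/1 trustworthiness signal per generated answer.
import math
import random

class UCB1:
    def __init__(self, n_arms):
        self.counts = [0] * n_arms      # pulls per arm
        self.values = [0.0] * n_arms    # running mean reward per arm

    def select(self, t):
        # Pull each arm once, then pick the arm with the highest upper confidence bound.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        return max(range(len(self.counts)),
                   key=lambda a: self.values[a]
                   + math.sqrt(2 * math.log(t) / self.counts[a]))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# Toy loop: strategy 2 is the "most trustworthy" and should dominate over time.
true_rates = [0.55, 0.60, 0.80]          # hypothetical per-strategy success rates
bandit = UCB1(len(true_rates))
for t in range(1, 2001):
    arm = bandit.select(t)
    reward = 1.0 if random.random() < true_rates[arm] else 0.0
    bandit.update(arm, reward)
print(bandit.counts, [round(v, 2) for v in bandit.values])
```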
What’s your own project idea?
Keywords
- LLMs
- LRMs
- Selective Prediction
- Conformal Prediction
- Uncertainty Quantification