Trustworthy LLMs

Large language models are remarkably capable, and we use them in daily life. At the same time, they raise many practical issues, including hallucination and harmful responses. This raises the following question.
How can we mitigate the hallucination, safety, security, and bias problems of large language models (LLMs) and large reasoning models (LRMs)?
LLMs confidently generate wrong, biased, and harmful information, which undermines trust in LLMs as a knowledge base. How can we mitigate this? One possibility is to leverage conformal prediction and selective prediction to measure uncertainty as a basis for trust (e.g., NeurIPS'24); a minimal sketch of this idea is shown below. What other possibilities are there?
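The sketch below illustrates split conformal prediction for LLM question answering under assumptions not stated in the source: each candidate answer gets a confidence in [0, 1] (e.g., from sampling frequency or token log-probabilities), and a labeled calibration set is available. All function names, numbers, and the confidence source are hypothetical.

```python
# A minimal sketch of split conformal prediction for LLM answers.
# Assumptions (not from the source): a confidence score per candidate answer
# and a held-out calibration set with reference answers.
import numpy as np

def calibrate(cal_true_answer_confidences, alpha=0.1):
    """Return the split-conformal threshold q_hat.

    cal_true_answer_confidences: confidence the model assigned to the
    reference (correct) answer of each calibration question.
    """
    scores = np.sort(1.0 - np.asarray(cal_true_answer_confidences))  # nonconformity
    n = len(scores)
    k = int(np.ceil((n + 1) * (1.0 - alpha)))   # finite-sample-corrected rank
    return np.inf if k > n else scores[k - 1]   # k-th smallest score (1-indexed)

def prediction_set(candidate_confidences, q_hat):
    """Indices of candidate answers kept in the prediction set.

    Under exchangeability, the set covers the correct answer with probability
    at least 1 - alpha; an empty or overly large set can trigger abstention,
    which is the selective-prediction side of the idea.
    """
    scores = 1.0 - np.asarray(candidate_confidences)
    return np.nonzero(scores <= q_hat)[0]

# Toy usage with synthetic confidences (purely illustrative):
rng = np.random.default_rng(0)
q_hat = calibrate(rng.uniform(0.4, 1.0, size=500), alpha=0.1)
print(q_hat, prediction_set([0.92, 0.55, 0.10], q_hat))
```

The design choice here is the standard split-conformal recipe: calibrate a single nonconformity threshold once, then apply it per question, trading a small amount of statistical efficiency for a distribution-free coverage guarantee.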
Ongoing/Potential Projects
- Hallucination mitigation for LLMs/LRMs: Reduce the hallucination of LLMs and LRMs.
- Natural harmfulness mitigation for LLMs/LRMs: Improve the safety of LLMs and LRMs against natural (non-adversarial) jailbreaking.
- Adversarial harmfulness mitigation for LLMs/LRMs: Improve the security of LLMs and LRMs against adversarial jailbreaking.
- Uncertainty quantification for LLMs/LRMs: Quantify the uncertainty, and thus the likely correctness, of the answers of LLMs and LRMs.
- RL-based Alignment: Leverage bandits and Reinforcement Learning (RL) for building trustworthy LLMs (see the bandit sketch after this list).
- Trustworthiness Control: Leverage theoretical results from bandits and RL to provide trustworthiness guarantees.
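To make the bandit component of the RL-based alignment item concrete, here is a minimal UCB1 sketch. The setup is entirely hypothetical and not from the source: each "arm" is a decoding or prompting strategy, and the reward is a 0/1 trustworthiness signal from some judge (e.g., a factuality checker). It only illustrates the bandit mechanics, not any specific alignment method.

```python
# A minimal UCB1 bandit sketch: arms = hypothetical response strategies,
# reward = hypothetical 0/1 trustworthiness signal per generated answer.
import math
import random

class UCB1:
    def __init__(self, n_arms):
        self.counts = [0] * n_arms      # pulls per arm
        self.values = [0.0] * n_arms    # running mean reward per arm

    def select(self, t):
        # Pull each arm once, then pick the arm with the highest upper confidence bound.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        return max(range(len(self.counts)),
                   key=lambda a: self.values[a]
                   + math.sqrt(2 * math.log(t) / self.counts[a]))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# Toy loop: strategy 2 is the "most trustworthy" and should dominate over time.
true_rates = [0.55, 0.60, 0.80]          # hypothetical per-strategy success rates
bandit = UCB1(len(true_rates))
for t in range(1, 2001):
    arm = bandit.select(t)
    reward = 1.0 if random.random() < true_rates[arm] else 0.0
    bandit.update(arm, reward)
print(bandit.counts, [round(v, 2) for v in bandit.values])
```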
What’s your own project idea?
Keywords
- LLMs
- LRMs
- Selective Prediction
- Conformal Prediction
- Uncertainty Quantification