Research

Research

My research focuses on the theoretical foundations of reinforcement learning, statistical learning, and control under uncertainty. In particular, I am interested in designing algorithms that remain reliable when the feedback used for learning is corrupted, heavy-tailed, asynchronous, temporally dependent, or generated by unreliable agents. A central goal of my work is to develop provably robust reinforcement learning algorithms with sharp finite-sample guarantees and matching lower bounds that clarify the fundamental limits of learning under adversarial corruption. My work has appeared in major AI venues, including ICML 2026, NeurIPS 2025, and AISTATS 2025, as well as major control venues such as IEEE CDC 2024 and IEEE ACC 2026.

[Theme 1] Learning Robust Optimal Policy from Strongly Correlated and Corrupted Observations

Can reinforcement learning still discover an optimal policy when the feedback it receives is both statistically dependent and adversarially corrupted? This line of work answers this question by developing a finite-time theory for robust optimal policy learning from unreliable trajectories, where rewards, transitions, observations, or Bellman targets may be strategically distorted. We show that standard algorithms such as vanilla Q-Learning can be fundamentally brittle under even small corruptions, and then design robust Bellman-update schemes that recover near-clean statistical performance up to an explicit and unavoidable corruption penalty. Together, this series of work, appearing across major venues including ICML 2026, IEEE CDC 24, and the NeuRIPS 2025-Reliable ML Workshop, establishes some of the first fundamental guarantees for learning optimal policies from strongly correlated and corrupted observations, combining robust statistics, stochastic approximation, and Bellman fixed-point theory to identify both what is algorithmically achievable and what is information-theoretically impossible.

Robust optimal policy learning

Recently, there has been a surge of interest in understanding the non-asymptotic behavior of model-free reinforcement learning algorithms. However, the performance of these algorithms in non-ideal environments, particularly under corrupted rewards, remains poorly understood. Motivated by this gap, we study the robustness of the celebrated Q-Learning algorithm under a strong-contamination attack model, where an adversary can arbitrarily perturb a small fraction of the observed rewards. We first show that such attacks can cause vanilla Q-Learning to incur arbitrarily large errors.

We then develop robust \(Q\)-Learning algorithms that use historical reward data to construct robust empirical Bellman operators. Our first variant is reward-statistics aware and leverages distributional information, such as reward scale, to obtain sharp finite-time guarantees. We then develop a reward-statistics agnostic version that does not require prior knowledge of the true reward distribution, using an agnostic tuning parameter \(\color{#fca5a5}{p}>1\) to control the robustness–concentration tradeoff. For both i.i.d. (\(\bar{\tau}_{\mathrm{mix}}=1\)) and Markovian data, both approaches achieve finite-time convergence rates that match known state-of-the-art bounds up to a small and inevitable corruption-dependent term, which scales with the adversarial corruption fraction.

\begin{equation} \Vert Q_t - Q^* \Vert_{\infty} \;\leq\; \widetilde{\mathcal O}\!\left(\bar{\sigma} \sqrt{\frac{\bar{\tau}_{\mathrm{mix}}}{T}} \right) + \mathcal O\!\left( \varepsilon \bar{\sigma} \right), \quad \Vert Q_t - Q^* \Vert_{\infty} \;\leq\; \widetilde{\mathcal O}\!\left(\bar{\sigma}^{1+1/\color{#fca5a5}{p}} \sqrt{\frac{\bar{\tau}_{\mathrm{mix}}}{T}} \right) + \mathcal O\!\left( \varepsilon \bar{\sigma} \right). \end{equation}
We further prove an information-theoretic lower bound showing that the dependence on the corruption fraction is unavoidable. To state this formally, let \(\mathcal{H}(\varepsilon,\bar{\sigma},\mathcal{Q})\) denote the class of all finite-state, finite-action MDPs and observation models in which the true reward distributions have bounded mean rewards, variance at most \(\bar{\sigma}^2\), and are subject to an \(\varepsilon\)-fraction strong-contamination model.
\begin{equation} \inf_{\hat{Q}_T} \sup_{Q^* \in \mathcal{H}(\varepsilon, \bar{\sigma}, \mathcal{Q})} \mathbb{P}\left( \Vert \hat{Q}_T - Q^* \Vert_{\infty} \geq \frac{\tilde{c} \bar{\sigma} \sqrt{\varepsilon}}{(1-\gamma)}\right) \geq \hat{\delta}, \quad \hat{\delta} > 0. \end{equation}
The sequence of results gradually moves from idealized sampling models toward realistic trajectory-based learning. The CDC 24 work studies robust Q-Learning in the synchronous generative-model setting, where each state-action pair can be sampled directly, and shows both the fragility of vanilla Q-Learning under reward corruption and the effectiveness of robust empirical Bellman updates. The ICML 2026 work completes this line by developing an agnostic and asynchronous theory under Markovian, strongly correlated observations, proving near-optimal finite-time upper bounds together with matching lower bounds that certify the unavoidable statistical price of adversarial corruption.

Plot 1 Plot 2 Plot 3 Plot 4
Figure: Robust \(Q\)-Learning remains reliable under adversarial corruption, Markovian data, and statistical agnosticism.

Representative Publications

[\(C_1\)] Corruption-Tolerant Asynchronous Q-Learning with Near-Optimal Rates (ICML 2026) [Paper]

[\(C_2\)] Robust Q-Learning under Corrupted Rewards (CDC 2024) [Paper]

[\(W_1\)] Agnostic Q-Learning under Corrupted Feedback (NeuRIPS 2025-Reliable ML Workshop)


[Theme 2] Robust Policy Evaluation and Temporal-Difference (TD) Learning

The second direction of my research focuses on robust policy evaluation with Markovian data and function approximation. In contrast to idealized i.i.d. sampling models, reinforcement learning data are often collected from a single trajectory, making consecutive observations strongly dependent. This temporal dependence makes robust estimation substantially more delicate, since the learner must distinguish genuine statistical fluctuations from adversarial perturbations while tracking the value function. In this line of work, which appeared at AISTATS 2025, we develop finite-time guarantees for robust temporal-difference learning under adversarial corruption and Markovian sampling, and clarify both what robust TD algorithms can achieve and what limitations are unavoidable under time-correlated corrupted data.
Robust optimal policy learning

One of the fundamental challenges in reinforcement learning (RL) is policy evaluation, which involves estimating the value function—the long-term return—associated with a fixed policy. The well-known Temporal Difference (TD) learning algorithm addresses this problem, and recent research has provided finite-time convergence guarantees for TD and its variants. However, these guarantees typically rely on the assumption that reward observations are generated from a well-behaved (e.g., sub-Gaussian) true reward distribution.

Recognizing that such assumptions may not hold in harsh, real-world environments, we revisit the policy evaluation problem through the lens of adversarial robustness. We develop a robust temporal-difference learning method and prove that, with linear function approximation, its finite-time guarantees match those of Vanilla-TD up to a small \(O(\varepsilon)\) term that precisely captures the effect of adversarial corruption.

Robust TD illustration
\begin{equation} \mathbb{E}[\Vert \theta_t - \theta^* \Vert_2^2] \;\leq\; \widetilde{\mathcal O}\!\left( \frac{\bar{\tau}_{\mathrm{mix}}\sigma^2}{T} \right) + \mathcal O\!\left( \varepsilon \sigma_1^2 \right). \end{equation} Here, \(\theta_t\) denotes the parameter vector generated by the temporal-difference learning recursion at time \(t\), and \(d_T\) measures the error of the learned parameter after \(T\) samples relative to the target TD solution.
The first term captures the clean statistical error under Markovian sampling: it decays as \(1/T\), with the mixing time \(\bar{\tau}_{\mathrm{mix}}\) and noise level \(\sigma^2\) governing the effective difficulty of estimation. The second term is the corruption-induced bias, scaling linearly with the adversarial contamination level \(\varepsilon\). We complement this upper bound with a minimax lower bound showing that such an additive corruption-induced term is unavoidable, thereby establishing that robust TD learning achieves the optimal robustness profile up to problem-dependent constants and logarithmic factors.

\begin{equation} \inf_{\hat{V}_T} \sup_{V \in \mathcal{M}(\epsilon, \rho, \mathcal{Q})} \mathbb{P}\left( \Vert \hat{V}_T - V \Vert_2 \geq \frac{\tilde{c} \rho \sqrt{\varepsilon}}{(1-\gamma)}\right) \geq \frac{1}{2}. \end{equation}

Our lower bound shows that no algorithm can completely remove the effect of adversarial corruption: even the best possible method must incur an additive error proportional to the corruption level. Thus, the corruption-dependent term in our upper bound is not a proof artifact, but a fundamental statistical barrier. To our knowledge, these results are the first of their kind in the context of adversarial robustness of stochastic approximation schemes driven by Markov noise.
Plot 1 Plot 2 Plot 3 Plot 4
Figure: Robust policy evaluation remains reliable under adversarial corruption with Markovian data and function approximation.

Representative Publications

[\(C_1\)] Adversarially-Robust TD-Learning with Markovian Data (AISTATS 2025) [Paper]


[Theme 3] Robust Federated Multi-Agent Reinforcement Learning

The third direction of my research focuses on robust federated reinforcement learning with Byzantine agents and minimal communication. In this setting, multiple agents interact with a common environment and send information to a central learner, but a fraction of the agents may be adversarial and can transmit arbitrary, corrupted messages. This creates a challenging tension between collaboration, robustness, and communication efficiency: the learner should benefit from distributed data collection while remaining resilient to malicious agents and avoiding excessive communication. In this line of work, which appeared at ACC 2026, we develop robust federated Q-Learning algorithms that achieve near-optimal finite-time guarantees with Byzantine tolerance and almost no communication, showing that collaborative speedups are still possible even when some agents are unreliable.
Robust Federated Q-Learning

We consider a federated reinforcement learning setting with multiple agents interacting with a common Markov Decision Process, while communicating through a central server to learn the optimal value function. The central question is whether one can still obtain collaborative sample-complexity gains when a small fraction of agents are Byzantine and may send arbitrary, corrupted messages. To address this, we propose a novel federated \(Q\)-Learning algorithm that blends model-based and model-free reinforcement learning with median-of-means aggregation from robust statistics.

We prove that, with high probability, our proposed algorithm converges exactly to the optimal value function in the infinite-sample limit despite adversarial agents, and achieves near-optimal finite-time rates that retain the benefits of collaboration. More precisely, our finite-time bound takes the simplified form

More precisely, our finite-time bound takes the simplified form \begin{equation} \|Q_K-Q^\star\|_\infty \;\le\; \widetilde{\mathcal O}\!\left(\frac{1}{\sqrt{MT}}\right) \;+\; \widetilde{\mathcal O}\!\left(\frac{\sqrt{\color{#fca5a5}{\varepsilon}}}{\sqrt{T}}\right). \end{equation} Here, \(M\) denotes the number of agents and drives the collaborative speedup, improving the clean statistical error from the single-agent rate \(1/\sqrt{\color{}{T}}\) to \(1/\sqrt{MT}\). The term involving \(\varepsilon\) captures the price of adversarial corruption; importantly, for any fixed corruption level, this term still decays with the number of samples \(T\), showing a vanishing corruption effect as learning progresses. Apart from that, these guarantees require only a constant number of communication rounds \(\widetilde{O}(1)\), making the method particularly appealing in federated learning settings where communication is often the dominant bottleneck.

Plot 1 Plot 2 Plot 3 Plot 4
Figure: Robust learning under adversaries, with vanishing corruption effects and collaborative gains from more agents.

[Theme 3-a] Decentralized and Networked Reinforcement Learning

A second direction of my research focuses on decentralized reinforcement learning over general communication networks. Unlike centralized or server-based federated learning, here agents collaborate only through local message passing over an underlying graph. This setting raises a rich set of questions at the interface of reinforcement learning, distributed optimization, and networked systems: how does the network topology affect learning speed? How many rounds of consensus are needed to realize collaborative gains? How do local sampling noise and network-induced mixing errors interact? My goal is to develop algorithms and finite-time analyses that make these tradeoffs explicit.
Robust Federated Q-Learning

We consider a networked reinforcement learning setting in which multiple agents interact with a common Markov Decision Process and exchange information over a communication graph to collaboratively learn the optimal state-action value function. The central question is whether agents can achieve collaborative sample-complexity gains while incurring only limited communication overhead. To address this

question, we propose a novel epoch-based distributed \(Q\)-Learning algorithm that combines local Bellman operator estimation with consensus-based information diffusion. This design blends model-based variance reduction with model-free \(Q\)-function updates, enabling agents to attain the statistical benefits of collaboration with substantially reduced communication.

Plot 1 Plot 2 Plot 3 Plot 4
Figure: Distributed learning under clean, adversarial, and robust settings.

Representative Publications

[\(C_1\)] Robust Federated Q-Learning with Almost No Communication (ACC 2026) [Paper]

[\(C_2\)] Variance-Reduced Q-Learning under Static and Time-Varying Networks (ACC 2026) [Paper]