Dynamic voltage scaling (DVS) is widely used to reduce power consumption in modern designs. Choosing the optimal operating voltage at runtime requires accounting for variations in workload, process, and environment. Because these variations are hard to predict accurately at design time, various reinforcement learning based DVS schemes have been proposed in the literature. However, none of them can be readily applied to designs with graceful degradation, where timing errors are allowed with bounded probability in exchange for further power reduction. In this paper, we propose a Q-learning based DVS scheme dedicated to designs with graceful degradation. We compare it with two deterministic DVS schemes: a stepping based scheme and a statistical modeling based scheme. Experimental results on three 45nm industrial designs show that, under a timing error probability bound of 0.01, the proposed Q-learning based scheme achieves up to 83.9% and 29.1% power reduction over the stepping based and statistical modeling based schemes, respectively. To the best of the authors' knowledge, this is the first in-depth work to explore reinforcement learning based DVS schemes for designs with graceful degradation.
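
To make the idea of Q-learning based voltage selection under an error bound concrete, the following is a minimal tabular sketch in Python. It is not the scheme proposed in this paper: the voltage levels, the per-level timing error probabilities, the penalty value, the state and action encoding, and the reward (negative V^2 as a stand-in for dynamic power at fixed frequency) are all assumptions chosen for illustration. Only the 0.01 error bound is taken from the text. The toy environment is also deterministic, whereas a real controller would estimate the error rate from observed timing errors.

```python
import random

# All values below are hypothetical and chosen only for illustration.
VOLTAGES = [0.7, 0.8, 0.9, 1.0, 1.1]         # candidate supply voltages (V)
ERROR_PROB = [0.2, 0.05, 0.008, 0.001, 0.0]  # toy timing error probability per level
ERROR_BOUND = 0.01                           # error probability bound from the text
ACTIONS = [-1, 0, +1]                        # step down, hold, step up
PENALTY = 10.0                               # assumed cost of violating the bound


def step(state, action):
    """Apply an action and return (next_state, reward)."""
    nxt = min(max(state + ACTIONS[action], 0), len(VOLTAGES) - 1)
    if ERROR_PROB[nxt] > ERROR_BOUND:
        reward = -PENALTY              # bound violated: fixed penalty
    else:
        reward = -VOLTAGES[nxt] ** 2   # power proxy: lower voltage is better
    return nxt, reward


def greedy(q, state):
    """Index of the highest-valued action in the given state."""
    return max(range(len(ACTIONS)), key=lambda a: q[state][a])


def train(episodes=3000, steps=30, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = [[0.0] * len(ACTIONS) for _ in VOLTAGES]
    for _ in range(episodes):
        state = rng.randrange(len(VOLTAGES))  # random start aids coverage
        for _ in range(steps):
            if rng.random() < epsilon:
                action = rng.randrange(len(ACTIONS))
            else:
                action = greedy(q, state)
            nxt, reward = step(state, action)
            # Standard Q-learning update toward the bootstrapped target.
            target = reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q


def settle(q, start, steps=10):
    """Follow the greedy policy from a start level; return the final voltage."""
    state = start
    for _ in range(steps):
        state, _ = step(state, greedy(q, state))
    return VOLTAGES[state]


if __name__ == "__main__":
    q_table = train()
    print(settle(q_table, start=len(VOLTAGES) - 1))
```

In this toy setting the learned greedy policy is expected to move toward the lowest voltage level whose assumed error probability stays within the bound and then hold there. A fixed penalty is only one way to encode the constraint; the relative size of the penalty and the power term determines how aggressively the agent trades error risk for power.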