Evaluating a pattern classification system is a critical step toward understanding its performance on a chosen testing dataset. In general, cross-validation can produce an 'optimal' or 'objective' classification result. However, because ground-truth datasets are typically used to simulate the system's classification performance, it is difficult to judge whether the system will deliver similar performance on future, unseen events. That is, when facing real-world cases, the system is unlikely to reproduce the classification performance observed in simulation. This paper presents an ARS evaluation framework for binary pattern classification systems that addresses this limitation of relying on ground-truth datasets during system simulation. The framework is based on accuracy, reliability, and stability testing strategies. Experimental results on a bankruptcy prediction case show that the proposed framework overcomes the limitation of using a single chosen testing set and allows a deeper understanding of the system's classification performance.
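To make the accuracy/reliability/stability idea concrete, the following is a minimal sketch in plain Python. It is an illustrative assumption, not the paper's exact formulation: accuracy is taken as the overall hit rate, reliability as the spread of accuracy across disjoint folds, and stability as the spread of accuracy across bootstrap resamples; the data generator, the threshold classifier, and the function `ars_evaluate` are all hypothetical.

```python
# Illustrative ARS-style evaluation (accuracy, reliability, stability).
# The metric definitions below are assumptions for demonstration only.
import random
import statistics

random.seed(0)

def make_data(n):
    # Synthetic binary data: label 1 if x > 0.5, with 10% label noise.
    data = []
    for _ in range(n):
        x = random.random()
        y = 1 if x > 0.5 else 0
        if random.random() < 0.1:
            y = 1 - y
        data.append((x, y))
    return data

def classify(x, threshold=0.5):
    # Trivial threshold classifier standing in for the system under test.
    return 1 if x > threshold else 0

def accuracy(data):
    return sum(classify(x) == y for x, y in data) / len(data)

def ars_evaluate(data, k=5, bootstraps=20):
    # Accuracy: overall hit rate on the dataset.
    acc = accuracy(data)
    # Reliability: spread of accuracy across k disjoint folds.
    folds = [data[i::k] for i in range(k)]
    reliability = statistics.stdev(accuracy(f) for f in folds)
    # Stability: spread of accuracy across bootstrap resamples,
    # probing sensitivity to the particular testing set chosen.
    boot_accs = [accuracy(random.choices(data, k=len(data)))
                 for _ in range(bootstraps)]
    stability = statistics.stdev(boot_accs)
    return acc, reliability, stability

acc, rel, stab = ars_evaluate(make_data(1000))
print(f"accuracy={acc:.3f} reliability(std)={rel:.3f} stability(std)={stab:.3f}")
```

Low reliability and stability values here would suggest that the measured accuracy is not an artifact of one particular testing set, which is the concern the framework targets.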