We propose a machine learning-aided framework for semantic understanding of surrounding scenes, enabling intelligent human-computer interaction in mixed reality (MR). The framework perceives semantic information from the front-view camera of MR glasses using fast and accurate machine learning-based scene text spotting models, and it allows the MR glasses to automatically generate virtual objects that align with the surrounding scene, without further user intervention. To achieve near real-time performance, the framework deploys the scene text spotting models as a remote service in a client-server architecture, bypassing the computational bottleneck of wearable devices. We demonstrate the framework on Microsoft HoloLens 2, and experimental results in self-collected real-world scenarios show that it is feasible and improves user experience. The proposed client-server architecture achieves an average computation time of 0.77 seconds per frame, which is on average 11.8 times faster than a client-only architecture and approaches real-time computation. To investigate the usability of text spotting algorithms in real-world applications, we also compare several state-of-the-art scene text spotting approaches in terms of recognition precision and computational time.