Additional file 1
Table S1. Information on the related works. MLST: multilocus sequence typing; CC: clonal cluster.
Table S2. Distribution of MLS types among the various serotypes of GBS.
Table S3. Most discriminative peaks (top 10) for each serotype. The peaks were selected by OneR and sorted by the importance score generated by OneR.
Table S4. Most discriminative peaks (top 10) for each serotype. The peaks were selected by PCC and sorted by the importance score generated by PCC.
Table S5. Data distribution of selected peaks between type Ia and non-type Ia.
Table S6. Data distribution of selected peaks between type Ib and non-type Ib.
Table S7. Data distribution of selected peaks between type III and non-type III.
Table S8. Data distribution of selected peaks between type V and non-type V.
Table S9. Data distribution of selected peaks between type VI and non-type VI.
Table S10. Number of peak pairs for each serotype under various bin sizes. The peak pairs were selected by either OneR or PCC.
Figure S1. Data distribution of the training data set, shown as pseudo-gel views.
Figure S2. Performance of the machine learning models with different numbers of features, selected and ranked by OneR.
Figure S3. Performance of the machine learning models with different numbers of features, selected and ranked by PCC.
Figure S4. ROC curves comparing the predictive models for each serotype, using OneR for feature selection, under 5-, 10-, 20-, and 30-fold cross-validation.
Figure S5. ROC curves comparing the predictive models for each serotype, using PCC for feature selection, under 5-, 10-, 20-, and 30-fold cross-validation.