«THE BULLETIN OF IRKUTSK STATE UNIVERSITY». SERIES «MATHEMATICS»
«IZVESTIYA IRKUTSKOGO GOSUDARSTVENNOGO UNIVERSITETA». SERIYA «MATEMATIKA»
ISSN 1997-7670 (Print)
ISSN 2541-8785 (Online)

Series «Mathematics». 2021. Vol. 38

On the Accuracy of Cross-Validation in the Classification Problem

Author(s)
V. M. Nedel’ko
Abstract

This work studies the accuracy of cross-validation estimates for decision functions. The main idea of the research is a statistical modeling scheme that uses real data to obtain statistical estimates that are usually obtained only from model (synthetic) distributions. The studies confirm the well-known empirical recommendation to choose the number of folds equal to 5 or more. Choosing more than 10 folds does not yield a significant increase in accuracy, and repeated cross-validation likewise provides no fundamental gain in precision. The experimental results suggest an empirical fact: the accuracy of estimates obtained by cross-validation is approximately the same as the accuracy of estimates obtained from a test sample of half the size. This result is easily explained: all objects of a test sample are independent, whereas the estimates built by cross-validation on different subsamples (folds) are not.
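The comparison described in the abstract can be illustrated with a minimal sketch. It is not the author's experimental scheme: the pool of labeled objects is generated synthetically as a stand-in for a large real dataset, the nearest-centroid rule is a hypothetical choice of decision function, and all names and parameter values (`make_pool`, `compare`, `n`, `k`, `trials`) are assumptions made for illustration. The sketch repeatedly draws a training sample from the pool, then measures how far the K-fold estimate and an estimate from an independent test sample of half the size deviate from the error rate on the remaining objects.

```python
import random
import statistics

def make_pool(size, rng):
    """Stand-in for a large real dataset: two overlapping 1-D Gaussian classes."""
    pool = []
    for _ in range(size):
        label = rng.randint(0, 1)
        pool.append((rng.gauss(1.5 * label, 1.0), label))
    return pool

def fit_centroids(train):
    """Nearest-centroid rule: remember the mean feature value of each class."""
    means = {}
    for label in (0, 1):
        xs = [x for x, y in train if y == label]
        means[label] = statistics.fmean(xs) if xs else 0.0  # fallback for an empty class
    return means

def error_rate(means, data):
    """Fraction of objects in `data` that the rule misclassifies."""
    wrong = sum(1 for x, y in data
                if min(means, key=lambda c: abs(x - means[c])) != y)
    return wrong / len(data)

def kfold_estimate(sample, k):
    """K-fold cross-validation estimate of the misclassification rate."""
    wrong = 0.0
    for i in range(k):
        test = sample[i::k]  # every k-th object forms fold i
        train = [p for j, p in enumerate(sample) if j % k != i]
        wrong += error_rate(fit_centroids(train), test) * len(test)
    return wrong / len(sample)

def compare(n=40, k=10, trials=200, pool_size=4000, seed=0):
    """Return RMS deviations from the reference error for (K-fold, half-size test)."""
    rng = random.Random(seed)
    pool = make_pool(pool_size, rng)
    cv_sq, test_sq = [], []
    for _ in range(trials):
        rng.shuffle(pool)
        sample = pool[:n]                # training sample
        half_test = pool[n:n + n // 2]   # independent test sample of half the size
        rest = pool[n + n // 2:]         # remaining objects approximate the true risk
        model = fit_centroids(sample)
        true_err = error_rate(model, rest)
        cv_sq.append((kfold_estimate(sample, k) - true_err) ** 2)
        test_sq.append((error_rate(model, half_test) - true_err) ** 2)
    return statistics.fmean(cv_sq) ** 0.5, statistics.fmean(test_sq) ** 0.5

if __name__ == "__main__":
    rms_cv, rms_test = compare()
    print(f"RMS deviation, K-fold: {rms_cv:.4f}; half-size test sample: {rms_test:.4f}")
```

Varying `k` in `compare` gives a rough way to explore the dependence on the number of folds; the numbers such a toy setup produces should not be read as a reproduction of the paper's results.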

About the Authors

Victor Nedel’ko, Cand. Sci. (Phys.–Math.), Sobolev Institute of Mathematics SB RAS, 4, Koptjuga, Novosibirsk, 630090, Russian Federation, tel.: +7(383)333-27-93, email: nedelko@math.nsc.ru

For citation

Nedel’ko V.M. On the Accuracy of Cross-Validation in the Classification Problem. The Bulletin of Irkutsk State University. Series Mathematics, 2021, vol. 38, pp. 84-95. https://doi.org/10.26516/1997-7670.2021.38.84

Keywords
K-fold cross-validation, accuracy, statistical estimates, machine learning
UDC
519.246
MSC
68T10, 62H30
DOI
https://doi.org/10.26516/1997-7670.2021.38.84
