Kybernetika 61 no. 4, 492-508, 2025

$Ibk$-means: An iterative batch $k$-means algorithm for big data clustering

Rasim Alguliyev, Ramiz Aliguliyev, Adil M. Bagirov and Mustafa Aliyev

DOI: 10.14736/kyb-2025-4-0492

Abstract:

Information technologies such as social media, mobile computing, and the industrial Internet of Things (IoT) produce huge amounts of data every day. Powerful knowledge-discovery tools are imperative for dealing with such volumes of data, and clustering methods are among the most important of these tools. Growing computational power and algorithmic developments allow us to solve clustering problems in large datasets efficiently and accurately. However, these developments remain insufficient for clustering big datasets, because such datasets cannot be processed as a whole due to hardware and computational restrictions. In this paper an iterative batch $k$-means ($ibk$-means) algorithm is proposed that yields good clustering results at low computational cost on big datasets. It is designed to cluster datasets by processing them in batches. The efficiency and accuracy of the proposed algorithm are investigated with respect to the batch size and the numbers of attributes and clusters. The algorithm is compared with the classic $k$-means and mini batch $k$-means algorithms using computational results on several real-world datasets, all of which are available from the UCI Machine Learning Repository. The smallest dataset has 500,000 data points and 2 attributes, and the largest contains 43,930,257 data points and 16 attributes. The results demonstrate that the $ibk$-means algorithm outperforms both the $k$-means and mini batch $k$-means algorithms in both efficiency and accuracy and is applicable to the clustering of big datasets. The proposed algorithm provides real-time clustering and may have direct applications in expert and intelligent systems. Furthermore, the results of this paper can guide the design of more accurate and efficient clustering algorithms for big datasets that take available computer resources into account.
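The abstract describes $ibk$-means only at a high level: the dataset is clustered batch by batch rather than as a whole. As a rough illustration of that batch-wise idea, the NumPy sketch below seeds centroids from the first batch and then refines them with a few Lloyd ($k$-means) iterations on each subsequent batch. The function name, batch count, and iteration budget are illustrative assumptions and this is not the authors' exact $ibk$-means procedure.

```python
# Minimal batch-wise k-means sketch (NumPy only).
# Illustrative assumption of the batch-processing idea in the abstract,
# not the authors' ibk-means algorithm.
import numpy as np

def batch_kmeans(batches, k, n_iters=10, rng=None):
    """Refine k centroids by running a few Lloyd iterations on each batch."""
    rng = np.random.default_rng(rng)
    centroids = None
    for batch in batches:                       # each batch: (n_i, d) array
        if centroids is None:                   # seed centroids from the first batch
            centroids = batch[rng.choice(len(batch), k, replace=False)]
        for _ in range(n_iters):
            # assign each point of the current batch to its nearest centroid
            d2 = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
            labels = d2.argmin(axis=1)
            # recompute centroids; keep the old centroid if a cluster is empty
            for j in range(k):
                pts = batch[labels == j]
                if len(pts):
                    centroids[j] = pts.mean(axis=0)
    return centroids

# usage: stream a large array in fixed-size batches (synthetic data here)
data = np.random.default_rng(0).normal(size=(100_000, 2))
batches = np.array_split(data, 20)              # 20 batches of ~5000 points each
centers = batch_kmeans(batches, k=3, rng=0)
print(centers)
```

Because each batch fits in memory, only one batch plus the $k$ centroids need to be held at a time, which is the practical motivation for batch-based clustering of big datasets.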

Keywords:

big data, $k$-means algorithm, batch clustering, mini batch $k$-means, cluster analysis

Classification:

68T09, 90B99

References:

  1. C. C. Aggarwal and C. K. Reddy: Data Clustering: Algorithms and Applications (First edition). CRC Press, Taylor and Francis Group, Boca Raton, London, New York 2014.   DOI:10.1201/b17320
  2. E. Ahmed, I. Yaqoob, I. A. T. Hashem et al.: The role of big data analytics in Internet of Things. Computer Networks 129 (2017), 2, 459-471.   DOI:10.1016/j.comnet.2017.06.013
  3. R. Alguliyev, R. Aliguliyev and L. Sukhostat: Anomaly detection in big data based on clustering. Statist. Optim. Inform. Comput. 5 (2017), 4, 325-340.   DOI:10.19139/soic.v5i4.365
  4. R. Alguliyev, R. Aliguliyev, A. Bagirov and R. Karimov: Batch clustering algorithm for big data sets. In: Proc. 2016 IEEE 10th International Conference on Application of Information and Communication Technologies, IEEE Press 2016, pp. 79-82.   DOI:10.1109/ICAICT.2016.7991657
  5. R. Alguliyev, R. Aliguliyev, Y. Imamverdiyev and L. Sukhostat: An anomaly detection based on optimization. Int. J. Intell. Systems Appl. 9 (2017), 12, 87-96.   DOI:10.5815/ijisa.2017.12.08
  6. R. Alguliyev, R. Aliguliyev, Y. Imamverdiyev and L. Sukhostat: Weighted clustering for anomaly detection in big data. Statist. Optim. Inform. Comput. 6 (2018), 2, 178-188.   DOI:10.19139/soic.v6i2.404
  7. R. M. Alguliyev, R. M. Aliguliyev and R. G. Alakbarov: Constrained $k$-means algorithm for resource allocation in mobile cloudlets. Kybernetika 59 (2023), 88-109.   DOI:10.14736/kyb-2023-1-0088
  8. R. Alguliyev, R. Aliguliyev and L. Sukhostat: Improved parallel big data clustering based on $k$-medoids and $k$-means algorithm. Probl. Inform. Technol. 15 (2024), 18-25.   DOI:10.25045/jpit.v15.i1.03
  9. R. Alguliyev, R. Aliguliyev and L. Sukhostat: Parallel batch $k$-means for big data clustering. Comput. Industr. Engrg. 152 (2021), 107023, 1-11.   DOI:10.1016/j.cie.2020.107023
  10. D. Aloise, A. Deshpande, P. Hansen and P. Popat: NP-hardness of Euclidean sum-of-squares clustering. Machine Learn. 75 (2009), 2, 245-248.   DOI:10.1007/s10994-009-5103-0
  11. S. E. Amrahov, Y. Ar, B. Tugrul, B. E. Akay and N. Kartli: A new approach to Mergesort algorithm: Divide smart and conquer. Future Gener. Comput. Syst. 157 (2024), 330-343.   DOI:10.1016/j.future.2024.03.049
  12. A. M. Bagirov, S. Taheri and B. Ordin: An adaptive $k$-medians clustering algorithm. Probl. Inform. Technol. 13 (2022), 3-15.   DOI:10.25045/jpit.v13.i2.01
  13. A. M. Bagirov, J. Ugon and D. Webb: Fast modified global $k$-means algorithm for incremental cluster construction. Pattern Recogn. 44 (2011), 866-876.   DOI:10.1016/j.patcog.2010.10.018
  14. B. Bahmani, B. Moseley, A. Vattani et al.: Scalable $k$-means++. Proc. VLDB Endowment 5 (2012), 7, 622-633.   DOI:10.14778/2180912.2180915
  15. J. Béjar: $k$-means vs mini batch $k$-means: A comparison. Technical Report, Universitat Politècnica de Catalunya, 2013.   CrossRef
  16. A. Bose, A. Munir and N. Shabani: A comparative quantitative analysis of contemporary big data clustering algorithms for market segmentation in hospitality industry. 2017.   CrossRef
  17. L. Bottou and Y. Bengio: Convergence properties of the $k$-means algorithm. In: Proc. 7th International Conference on Neural Information Processing Systems, MIT Press, Cambridge 1995, pp. 585-592.   CrossRef
  18. P. S. Bradley, U. Fayyad and C. Reina: Scaling clustering algorithms to large databases. In: Proc. Fourth International Conference on Knowledge Discovery and Data Mining, AAAI Press, New York 1998, pp. 9-15.   DOI:10.5555/3000292.3000295
  19. X. Cai, F. Nie and H. Huang: Multi-view $k$-means clustering on big data. In: Proc. Twenty-Third International Joint Conference on Artificial Intelligence, ACM Press, New York 2013, pp. 2598-2604.   DOI:10.5555/2540128.2540503
  20. A. Catherine, C. Alejandro, F. Ricardo and G. Badih: Multivariate and functional robust fusion methods for structured big data. J. Multivar. Anal. 170 (2019), 149-161.   DOI:10.1016/j.jmva.2018.06.012
  21. P. Cetin and Ö. Ö. Tanriöver: Priority rule for resource constrained project planning problem with predetermined work package durations. J. Fac. Engrg. Architect. Gazi University 35 (2020), 3, 149-161.   DOI:10.17341/gazimmfd.545873
  22. C. L. P. Chen and C.-Y. Zhang: Data-intensive applications, challenges, techniques and technologies: a survey on Big Data. Inform. Sci. 275 (2014), 314-347.   DOI:10.1016/j.ins.2014.01.015
  23. X. Cui, P. Zhu, X. Yang et al.: Optimized big data $k$-means clustering using MapReduce. J. Supercomput. 70 (2014), 3, 1249-1259.   DOI:10.1007/s11227-014-1225-7
  24. D. Dua and E. Karra Taniskidou: UCI Machine Learning Repository. Irvine, Univ. California, School of Information and Computer Science, 2017.   CrossRef
  25. A. Fahad, N. Alshatri, Z. Tari et al.: A survey of clustering algorithms for big data: taxonomy and empirical analysis. IEEE Trans. Emerging Topics Computing 2 (2014), 3, 267-279.   DOI:10.1109/TETC.2014.2330519
  26. N. Ghadiri, M. Ghaffari and M. A. Nikbakht: BigFCM: Fast, precise and scalable FCM on Hadoop. Future Gener. Computer Syst. 77 (2018), 29-39.   DOI:10.1016/j.future.2017.06.010
  27. A. Hadian and S. Shahrivari: High performance parallel $k$-means clustering for disk-resident datasets on multi-core CPUs. J. Supercomput. 69 (2014), 2, 845-863.   DOI:10.1007/s10586-017-1687-5
  28. L. Haibo and W. Zhi: Application of an intelligent early-warning method based on DBSCAN clustering for drilling overflow accident. Cluster Comput. (2019).   DOI:10.1007/s10586-017-1687-5
  29. J. Han, M. Kamber and J. Pei: Data Mining: Concepts and Techniques (Third edition). Morgan Kaufmann, 2011.   DOI:10.1016/C2009-0-61819-5
  30. I. Hassan: $I-k$-means-+: an iterative clustering algorithm based on an enhanced version of the $k$-means. Pattern Recogn. 79 (2018), 402-413.   DOI:10.1016/j.patcog.2018.02.015
  31. R. J. Hathaway, J. C. Bezdek and J. M. Huband: Scalable visual assessment of cluster tendency for large data sets. Pattern Recogn. 39 (2006), 7, 1315-1324.   DOI:10.1016/j.patcog.2006.02.011
  32. T. C. Havens and J. C. Bezdek: An efficient formulation of the improved visual assessment of cluster tendency (iVAT) algorithm. IEEE Trans. Knowl. Data Engrg. 24 (2012), 5, 813-822.   DOI:10.1109/TKDE.2011.33
  33. T. C. Havens, J. C. Bezdek and M. Palaniswami: Scalable single linkage hierarchical clustering for big data. In: Proc. 2013 IEEE Eighth International Conference on Intelligent Sensors, Sensor Networks and Information Processing, IEEE Press 2013, pp. 396-401.   DOI:10.1109/ISSNIP.2013.6529823
  34. M. Hilbert and P. López: The world's technological capacity to store, communicate, and compute information. Science 332 (2011), 6025, 60-65.   DOI:10.1126/science.1200970
  35. S. S. Ilango, S. Vimal, M. Kaliappan and P. Subbulakshmi: Optimization using Artificial Bee Colony based clustering approach for big data. Cluster Comput. (2019).   DOI:10.1007/s10586-017-1571-3
  36. Y. Imamverdiyev and F. Abdullayeva: Deep learning method for DoS attack detection based on restricted Boltzmann machine. Big Data 6 (2018), 2, 159-169.   DOI:10.1089/big.2018.29026.zob
  37. N. Karmitsa, A. M. Bagirov and S. Taheri: New diagonal bundle method for clustering problems in large data sets. European J. Oper. Res. 263 (2017), 2, 367-379.   DOI:10.1016/j.ejor.2017.06.010
  38. N. Karmitsa, A. M. Bagirov and S. Taheri: Clustering in large data sets with the limited memory bundle method. Pattern Recogn. 83 (2018), 245-249.   DOI:10.1016/j.patcog.2018.05.028
  39. D. Kumar, J. C. Bezdek, M. Palaniswami et al.: A hybrid approach to clustering in big data. IEEE Trans. Cybernet. 46 (2016), 10, 2372-2385.   DOI:10.1109/TCYB.2015.2477416
  40. S. Lloyd: Least squares quantization in PCM. IEEE Trans. Inform. Theory 28 (1982), 2, 129-137.   DOI:10.1109/TIT.1982.1056489
  41. J. B. MacQueen: Some methods for classification and analysis of multivariate observations. In: Proc. 5th Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, Berkeley, University of California Press, 1967, pp. 281-297.   CrossRef
  42. M. Marjani, F. Nasaruddin, A. Gani et al.: Big IoT data analytics: architecture, opportunities, and open research challenges. IEEE Access 5 (2017), 5247-5261.   DOI:10.1109/ACCESS.2017.2689040
  43. J. Newling and F. Fleuret: Nested mini-batch $k$-means. In: Proc. 30th International Conference on Neural Information Processing Systems, Curran Associates Inc. 2016, pp. 1360-1368.   DOI:10.48550/arXiv.1602.02934
  44. K. Peng, V. C. M. Leung and Q. Huang: Clustering approach based on mini batch $k$-means for intrusion detection system over big data. IEEE Access 6 (2018), 11897-11906.   DOI:10.1109/ACCESS.2018.2810267
  45. M. Sabo: Consensus clustering with differential evolution. Kybernetika 50 (2014), 661-678.   DOI:10.14736/kyb-2014-5-0661
  46. A. Saini, J. Minocha, J. Ubriani and D. Sharma: New approach for clustering of big data: disk-means. In: Proc. International Conference on Computing, Communication and Automation, IEEE Press, 2016, pp. 122-126.   DOI:10.1109/CCAA.2016.7813702
  47. D. Sculley: Web-scale $k$-means clustering. In: Proc. 19th International Conference on World Wide Web, ACM Press, New York 2010, pp. 1177-1178.   DOI:10.1145/1772690.1772862
  48. A. S. Shirkhorshidi, S. Aghabozorgi, T. Y. Wah and T. Herawan: Big data clustering: a review. In: Proc. International Conference on Computational Science and its Applications, LNCS 8583, Part V, Springer 2014, pp. 707-720.   DOI:10.1007/978-3-319-09156-3_49
  49. Z. Sun and P. P. Wang: A mathematical foundation of big data. New Math. Natur. Comput. 13 (2017), 2, 83-99.   DOI:10.1142/S1793005717400014
  50. Q. Tong, X. Li and B. Yuan: Efficient distributed clustering using boundary information. Neurocomput. 275 (2018), 2355-2366.   DOI:10.1016/j.neucom.2017.11.014
  51. V. Torra, Y. Endo and S. Miyamoto: On the comparison of some fuzzy clustering methods for privacy preserving data mining: towards the development of specific information loss measures. Kybernetika 45 (2009), 548-560.   CrossRef
  52. C.-W. Tsai, S.-J. Liu and Y.-C. Wang: A parallel metaheuristic data clustering framework for cloud. J. Parallel Distribut. Comput. 116 (2018), 39-49.   DOI:10.1016/j.jpdc.2017.10.020
  53. R. Xu and D. Wunsch: Survey of clustering algorithms. IEEE Trans. Neural Networks 16 (2005), 3, 645-678.   DOI:10.1109/TNN.2005.845141
  54. Q. Zhang, L. T. Yang, Z. Chen and P. Li: High-order possibilistic $c$-means algorithms based on tensor decompositions for big data in IoT. Inform. Fusion 39 (2018), 72-80.   DOI:10.1016/j.inffus.2017.04.002
  55. W.-L. Zhao, C.-H. Deng and C.-W. Ngo: $k$-means: a revisit. Neurocomput. 291 (2018), 195-206.   DOI:10.1016/j.neucom.2018.02.072