Kybernetika 61 no. 4, 492-508, 2025

$Ibk$-means: An iterative batch $k$-means algorithm for big data clustering

Rasim Alguliyev, Ramiz Aliguliyev, Adil M. Bagirov and Mustafa Aliyev

DOI: 10.14736/kyb-2025-4-0492

Abstract:

Information technologies such as social media, mobile computing, and the industrial Internet of Things (IoT) produce huge amounts of data every day. Powerful knowledge-discovery tools are imperative for dealing with such volumes of data, and clustering methods are among the most important of these tools. Growing computational power and algorithmic developments allow us to solve clustering problems in large datasets efficiently and accurately. However, these developments remain insufficient for clustering big datasets, because such datasets cannot be processed as a whole due to hardware and computational restrictions. In this paper an iterative batch $k$-means ($ibk$-means) algorithm is proposed that yields good clustering results at low computational cost on big datasets. It is designed to cluster datasets by processing them in batches. The efficiency and accuracy of the proposed algorithm are investigated with respect to the batch size and the numbers of attributes and clusters. The algorithm is compared with the classic $k$-means and mini batch $k$-means algorithms using computational results on several real-world datasets, all of which are available from the UCI Machine Learning Repository. The smallest dataset has 500,000 data points and 2 attributes, and the largest contains 43,930,257 data points and 16 attributes. The results demonstrate that the $ibk$-means algorithm outperforms both the $k$-means and mini batch $k$-means algorithms in both efficiency and accuracy and is applicable to the clustering of big datasets. The proposed algorithm provides real-time clustering and may have direct applications in expert and intelligent systems. Furthermore, the results of this paper can guide the design of more accurate and efficient clustering algorithms for big datasets that take available computer resources into account.
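The abstract describes $ibk$-means only at a high level: the dataset is clustered batch by batch rather than as a whole. As a rough illustration of that batch-wise idea, the NumPy sketch below seeds centroids from the first batch and then refines them with a few Lloyd ($k$-means) iterations on each subsequent batch. The function name, batch count, and iteration budget are illustrative assumptions and this is not the authors' exact $ibk$-means procedure.

```python
# Minimal batch-wise k-means sketch (NumPy only).
# Illustrative assumption of the batch-processing idea in the abstract,
# not the authors' ibk-means algorithm.
import numpy as np

def batch_kmeans(batches, k, n_iters=10, rng=None):
    """Refine k centroids by running a few Lloyd iterations on each batch."""
    rng = np.random.default_rng(rng)
    centroids = None
    for batch in batches:                       # each batch: (n_i, d) array
        if centroids is None:                   # seed centroids from the first batch
            centroids = batch[rng.choice(len(batch), k, replace=False)]
        for _ in range(n_iters):
            # assign each point of the current batch to its nearest centroid
            d2 = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
            labels = d2.argmin(axis=1)
            # recompute centroids; keep the old centroid if a cluster is empty
            for j in range(k):
                pts = batch[labels == j]
                if len(pts):
                    centroids[j] = pts.mean(axis=0)
    return centroids

# usage: stream a large array in fixed-size batches (synthetic data here)
data = np.random.default_rng(0).normal(size=(100_000, 2))
batches = np.array_split(data, 20)              # 20 batches of ~5000 points each
centers = batch_kmeans(batches, k=3, rng=0)
print(centers)
```

Because each batch fits in memory, only one batch plus the $k$ centroids need to be held at a time, which is the practical motivation for batch-based clustering of big datasets.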

Keywords:

big data, $k$-means algorithm, batch clustering, mini batch $k$-means, cluster analysis

Classification:

68T09, 90B99

References:

  1. C. C. Aggarwal and C. K. Reddy: Data Clustering: Algorithms and Applications (First edition). CRC Press, Taylor and Francis Group, Boca Raton, London, New York 2014.   DOI:10.1201/b17320
  2. E. Ahmed, I. Yaqoob, I. A. T. Hashem et al.: The role of big data analytics in Internet of Things. Computer Networks 129 (2017), 2, 459-471.   DOI:10.1016/j.comnet.2017.06.013
  3. R. Alguliyev, R. Aliguliyev and L. Sukhostat: Anomaly detection in big data based on clustering. Statist. Optim. Inform. Comput. 5 (2017), 4, 325-340.   DOI:10.19139/soic.v5i4.365
  4. R. Alguliyev, R. Aliguliyev, A. Bagirov and R. Karimov: Batch clustering algorithm for big data sets. In: Proc. 2016 IEEE 10th International Conference on Application of Information and Communication Technologies, IEEE Press 2016, pp. 79-82.   DOI:10.1109/ICAICT.2016.7991657
  5. R. Alguliyev, R. Aliguliyev, Y. Imamverdiyev and L. Sukhostat: An anomaly detection based on optimization. Int. J. Intell. Systems Appl. 9 (2017), 12, 87-96.   DOI:10.5815/ijisa.2017.12.08
  6. R. Alguliyev, R. Aliguliyev, Y. Imamverdiyev and L. Sukhostat: Weighted clustering for anomaly detection in big data. Statist. Optim. Inform. Comput. 6 (2018), 2, 178-188.   DOI:10.19139/soic.v6i2.404
  7. R. M. Alguliyev, R. M. Aliguliyev and R. G. Alakbarov: Constrained $k$-means algorithm for resource allocation in mobile cloudlets. Kybernetika 59 (2023), 88-109.   DOI:10.14736/kyb-2023-1-0088
  8. R. Alguliyev, R. Aliguliyev and L. Sukhostat: Improved parallel big data clustering based on $k$-medoids and $k$-means algorithm. Probl. Inform. Technol. 15 (2024), 18-25.   DOI:10.25045/jpit.v15.i1.03
  9. R. Alguliyev, R. Aliguliyev and L. Sukhostat: Parallel batch $k$-means for big data clustering. Comput. Industr. Engrg. 152 (2021), 107023, 1-11.   DOI:10.1016/j.cie.2020.107023
  10. D. Aloise, A. Deshpande, P. Hansen and P. Popat: NP-hardness of Euclidean sum-of-squares clustering. Machine Learn. 75 (2009), 2, 245-248.   DOI:10.1007/s10994-009-5103-0
  11. S. E. Amrahov, Y. Ar, B. Tugrul, B. E. Akay and N. Kartli: A new approach to Mergesort algorithm: Divide smart and conquer. Future Gener. Comput. Syst. 157 (2024), 330-343.   DOI:10.1016/j.future.2024.03.049
  12. A. M. Bagirov, S. Taheri and B. Ordin: An adaptive $k$-medians clustering algorithm. Probl. Inform. Technol. 13 (2022), 3-15.   DOI:10.25045/jpit.v13.i2.01
  13. A. M. Bagirov, J. Ugon and D. Webb: Fast modified global $k$-means algorithm for incremental cluster construction. Pattern Recogn. 44 (2011), 866-876.   DOI:10.1016/j.patcog.2010.10.018
  14. B. Bahmani, B. Moseley, A. Vattani et al.: Scalable $k$-means++. Proc. VLDB Endowment 5 (2012), 7, 622-633.   DOI:10.14778/2180912.2180915
  15. J. Béjar: $k$-means vs mini batch $k$-means: A comparison. Technical Report, Universitat Politècnica de Catalunya, 2013.   CrossRef
  16. A. Bose, A. Munir and N. Shabani: A comparative quantitative analysis of contemporary big data clustering algorithms for market segmentation in hospitality industry. 2017.   CrossRef
  17. L. Bottou and Y. Bengio: Convergence properties of the $k$-means algorithm. In: Proc. 7th International Conference on Neural Information Processing Systems, MIT Press, Cambridge 1995, pp. 585-592.   CrossRef
  18. P. S. Bradley, U. Fayyad and C. Reina: Scaling clustering algorithms to large databases. In: Proc. Fourth International Conference on Knowledge Discovery and Data Mining, AAAI Press, New York 1998, pp. 9-15.   DOI:10.5555/3000292.3000295
  19. X. Cai, F. Nie and H. Huang: Multi-view $k$-means clustering on big data. In: Proc. Twenty-Third International Joint Conference on Artificial Intelligence, ACM Press, New York 2013, pp. 2598-2604.   DOI:10.5555/2540128.2540503
  20. A. Catherine, C. Alejandro, F. Ricardo and G. Badih: Multivariate and functional robust fusion methods for structured big data. J. Multivar. Anal. 170 (2019), 149-161.   DOI:10.1016/j.jmva.2018.06.012
  21. P. Cetin and Ö. Ö. Tanriöver: Priority rule for resource constrained project planning problem with predetermined work package durations. J. Fac. Engrg. Architect. Gazi University 35 (2020), 3, 149-161.   DOI:10.17341/gazimmfd.545873
  22. C. L. P. Chen and C.-Y. Zhang: Data-intensive applications, challenges, techniques and technologies: a survey on Big Data. Inform. Sci. 275 (2014), 314-347.   DOI:10.1016/j.ins.2014.01.015
  23. X. Cui, P. Zhu, X. Yang et al.: Optimized big data $k$-means clustering using MapReduce. J. Supercomput. 70 (2014), 3, 1249-1259.   DOI:10.1007/s11227-014-1225-7
  24. D. Dua and E. Karra Taniskidou: UCI Machine Learning Repository. Irvine, Univ. California, School of Information and Computer Science, 2017.   CrossRef
  25. A. Fahad, N. Alshatri, Z. Tari et al.: A survey of clustering algorithms for big data: taxonomy and empirical analysis. IEEE Trans. Emerging Topics Computing 2 (2014), 3, 267-279.   DOI:10.1109/TETC.2014.2330519
  26. N. Ghadiri, M. Ghaffari and M. A. Nikbakht: BigFCM: Fast, precise and scalable FCM on Hadoop. Future Gener. Computer Syst. 77 (2018), 29-39.   DOI:10.1016/j.future.2017.06.010
  27. A. Hadian and S. Shahrivari: High performance parallel $k$-means clustering for disk-resident datasets on multi-core CPUs. J. Supercomput. 69 (2014), 2, 845-863.   DOI:10.1007/s10586-017-1687-5
  28. L. Haibo and W. Zhi: Application of an intelligent early-warning method based on DBSCAN clustering for drilling overflow accident. Cluster Comput. (2019).   DOI:10.1007/s10586-017-1687-5
  29. J. Han, M. Kamber and J. Pei: Data Mining: Concepts and Techniques (Third edition). Morgan Kaufmann, 2011.   DOI:10.1016/C2009-0-61819-5
  30. I. Hassan: $I-k$-means-+: an iterative clustering algorithm based on an enhanced version of the $k$-means. Pattern Recogn. 79 (2018), 402-413.   DOI:10.1016/j.patcog.2018.02.015
  31. R. J. Hathaway, J. C. Bezdek and J. M. Huband: Scalable visual assessment of cluster tendency for large data sets. Pattern Recogn. 39 (2006), 7, 1315-1324.   DOI:10.1016/j.patcog.2006.02.011
  32. T. C. Havens and J. C. Bezdek: An efficient formulation of the improved visual assessment of cluster tendency (iVAT) algorithm. IEEE Trans. Knowl. Data Engrg. 24 (2012), 5, 813-822.   DOI:10.1109/TKDE.2011.33
  33. T. C. Havens, J. C. Bezdek and M. Palaniswami: Scalable single linkage hierarchical clustering for big data. In: Proc. 2013 IEEE Eighth International Conference on Intelligent Sensors, Sensor Networks and Information Processing, IEEE Press 2013, pp. 396-401.   DOI:10.1109/ISSNIP.2013.6529823
  34. M. Hilbert and P. López: The world's technological capacity to store, communicate, and compute information. Science 332 (2011), 6025, 60-65.   DOI:10.1126/science.1200970
  35. S. S. Ilango, S. Vimal, M. Kaliappan and P. Subbulakshmi: Optimization using Artificial Bee Colony based clustering approach for big data. Cluster Comput. (2019).   DOI:10.1007/s10586-017-1571-3
  36. Y. Imamverdiyev and F. Abdullayeva: Deep learning method for DoS attack detection based on restricted Boltzmann machine. Big Data 6 (2018), 2, 159-169.   DOI:10.1089/big.2018.29026.zob
  37. N. Karmitsa, A. M. Bagirov and S. Taheri: New diagonal bundle method for clustering problems in large data sets. European J. Oper. Res. 263 (2017), 2, 367-379.   DOI:10.1016/j.ejor.2017.06.010
  38. N. Karmitsa, A. M. Bagirov and S. Taheri: Clustering in large data sets with the limited memory bundle method. Pattern Recogn. 83 (2018), 245-249.   DOI:10.1016/j.patcog.2018.05.028
  39. D. Kumar, J. C. Bezdek, M. Palaniswami et al.: A hybrid approach to clustering in big data. IEEE Trans. Cybernet. 46 (2016), 10, 2372-2385.   DOI:10.1109/TCYB.2015.2477416
  40. S. Lloyd: Least squares quantization in PCM. IEEE Trans. Inform. Theory 28 (1982), 2, 129-137.   DOI:10.1109/TIT.1982.1056489
  41. J. B. MacQueen: Some methods for classification and analysis of multivariate observations. In: Proc. 5th Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, Berkeley, University of California Press, 1967, pp. 281-297.   CrossRef
  42. M. Marjani, F. Nasaruddin, A. Gani et al.: Big IoT data analytics: architecture, opportunities, and open research challenges. IEEE Access 5 (2017), 5247-5261.   DOI:10.1109/ACCESS.2017.2689040
  43. J. Newling and F. Fleuret: Nested mini-batch $k$-means. In: Proc. 30th International Conference on Neural Information Processing Systems, Curran Associates Inc. 2016, pp. 1360-1368.   DOI:10.48550/arXiv.1602.02934
  44. K. Peng, V. C. M. Leung and Q. Huang: Clustering approach based on mini batch $k$-means for intrusion detection system over big data. IEEE Access 6 (2018), 11897-11906.   DOI:10.1109/ACCESS.2018.2810267
  45. M. Sabo: Consensus clustering with differential evolution. Kybernetika 50 (2014), 661-678.   DOI:10.14736/kyb-2014-5-0661
  46. A. Saini, J. Minocha, J. Ubriani and D. Sharma: New approach for clustering of big data: disk-means. In: Proc. International Conference on Computing, Communication and Automation, IEEE Press, 2016, pp. 122-126.   DOI:10.1109/CCAA.2016.7813702
  47. D. Sculley: Web-scale $k$-means clustering. In: Proc. 19th International Conference on World Wide Web, ACM Press, New York 2010, pp. 1177-1178.   DOI:10.1145/1772690.1772862
  48. A. S. Shirkhorshidi, S. Aghabozorgi, T. Y. Wah and T. Herawan: Big data clustering: a review. In: Proc. International Conference on Computational Science and its Applications, LNCS 8583, Part V, Springer 2014, pp. 707-720.   DOI:10.1007/978-3-319-09156-3_49
  49. Z. Sun and P. P. Wang: A mathematical foundation of big data. New Math. Natur. Comput. 13 (2017), 2, 83-99.   DOI:10.1142/S1793005717400014
  50. Q. Tong, X. Li and B. Yuan: Efficient distributed clustering using boundary information. Neurocomput. 275 (2018), 2355-2366.   DOI:10.1016/j.neucom.2017.11.014
  51. V. Torra, Y. Endo and S. Miyamoto: On the comparison of some fuzzy clustering methods for privacy preserving data mining: towards the development of specific information loss measures. Kybernetika 45 (2009), 548-560.   CrossRef
  52. C.-W. Tsai, S.-J. Liu and Y.-C. Wang: A parallel metaheuristic data clustering framework for cloud. J. Parallel Distribut. Comput. 116 (2018), 39-49.   DOI:10.1016/j.jpdc.2017.10.020
  53. R. Xu and D. Wunsch: Survey of clustering algorithms. IEEE Trans. Neural Networks 16 (2005), 3, 645-678.   DOI:10.1109/TNN.2005.845141
  54. Q. Zhang, L. T. Yang, Z. Chen and P. Li: High-order possibilistic $c$-means algorithms based on tensor decompositions for big data in IoT. Inform. Fusion 39 (2018), 72-80.   DOI:10.1016/j.inffus.2017.04.002
  55. W.-L. Zhao, C.-H. Deng and C.-W. Ngo: $k$-means: a revisit. Neurocomput. 291 (2018), 195-206.   DOI:10.1016/j.neucom.2018.02.072