Mining Imbalanced Big Data with Julia
2019-07-24, Elm B

Machine learning for data mining applications in imbalanced big data classification is a very challenging task. In this talk, we propose a new cluster-based under-sampling approach with ensemble learning for mining real-life imbalanced big data in Julia.


In this era of big data, classifying imbalanced real-life data with supervised learning is a challenging research problem. The standard data sampling methods, under-sampling and over-sampling, have several limitations when dealing with big data. Under-sampling typically removes instances from the majority class, while over-sampling generates artificial minority class instances, in both cases to balance the data. However, under-sampling may discard informative instances, and over-sampling can cause overfitting. In this talk, we present a new cluster-based under-sampling approach combined with ensemble learning (e.g. a RandomForest classifier) for classifying imbalanced data, which we implemented in Julia. We collected real telecom fraud data on illegal money transactions, which is highly imbalanced, with only 8,213 minority class instances among 6,362,620 instances. The proposed method first splits the data into majority class and minority class instances. It then groups the majority class instances into several clusters and draws a set of instances from each cluster to create several balanced sub-datasets. Finally, a classifier is trained on each of these balanced datasets, and majority voting is applied to classify unknown/new instances. We tested the proposed method on a separate test dataset, where it achieved 97% accuracy.
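
The pipeline described above (split by class, cluster the majority class, draw from each cluster, train one classifier per balanced subset, then vote) can be illustrated with a minimal sketch. This is not the authors' Julia implementation: it is written in Python using only the standard library, and several details are assumptions made for illustration, namely the use of k-means for clustering, drawing from each cluster in proportion to its size, a nearest-centroid classifier standing in for the RandomForest base learner, the synthetic data, and all function names.

```python
import random
from collections import Counter

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means (assumed clustering step); returns k lists of points."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters

def majority_samples(majority, minority, k=3, n_subsets=5, seed=0):
    """Draw n_subsets majority samples, each roughly the size of the minority class."""
    rng = random.Random(seed)
    clusters = [c for c in kmeans(majority, k, rng) if c]
    samples = []
    for _ in range(n_subsets):
        sample = []
        for c in clusters:
            # Assumption: each cluster contributes in proportion to its size.
            share = max(1, round(len(minority) * len(c) / len(majority)))
            sample.extend(rng.sample(c, min(share, len(c))))
        samples.append(sample)
    return samples

def train_centroid(maj_sample, minority):
    """Stand-in base learner: one centroid per class (0 = majority, 1 = minority)."""
    return {0: mean(maj_sample), 1: mean(minority)}

def predict(models, x):
    """Majority vote over the per-subset classifiers."""
    votes = Counter(min(m, key=lambda label: dist2(x, m[label])) for m in models)
    return votes.most_common(1)[0][0]

# Synthetic, clearly separated toy data (hypothetical, not the fraud dataset).
data_rng = random.Random(42)
majority = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(300)]
minority = [(data_rng.gauss(6, 1), data_rng.gauss(6, 1)) for _ in range(20)]

samples = majority_samples(majority, minority, k=3, n_subsets=5)
models = [train_centroid(s, minority) for s in samples]
print(predict(models, (6.0, 6.0)), predict(models, (0.0, 0.0)))
```

Using an odd number of subsets avoids tied votes in the two-class case. In a real setting the nearest-centroid stand-in would be replaced by a stronger base learner such as the RandomForest classifier the abstract mentions.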


Co-authors

Swakkhar Shatabda, Mohammad Zoynul Abedin, Md. Tarikul Islam, Md. Ishtiak Hossain

Dr. Dewan Md. Farid is an Associate Professor in the Department of Computer Science and Engineering, United International University, Bangladesh. He worked as a Postdoctoral Fellow in the following research groups: (1) Computational Modeling Lab (CoMo), Department of Computer Science, Vrije Universiteit Brussel, Belgium in 2015-2016, and (2) Computational Intelligence Group (CIG), Department of Computer Science and Digital Technology, University of Northumbria at Newcastle, UK in 2013. Dr. Farid was a Visiting Faculty member at the Faculty of Engineering, University of Porto, Portugal in June 2016. He received his PhD in Computer Science and Engineering from Jahangirnagar University, Bangladesh in 2012. Part of his PhD research was carried out at ERIC Laboratory, University Lumière Lyon 2, France through the Erasmus-Mundus ECW eLink PhD Exchange Program. He has published 73 peer-reviewed scientific articles, including 26 journal papers, in the field of machine learning and data mining. Dr. Farid received the United Group Research Award 2016 in the field of Science and Engineering. He received the following Erasmus Mundus scholarships: (1) LEADERS (Leading mobility between Europe and Asia in Developing Engineering Education and Research) in 2015, (2) cLink (Centre of excellence for Learning, Innovation, Networking and Knowledge) in 2013, and (3) eLink (east west Link for Innovation, Networking and Knowledge exchange) in 2009. Dr. Farid also received the Senior Fellowship I and II awards from the National Science & Information and Communication Technology (NSICT), Ministry of Science & Information and Communication Technology, Government of Bangladesh, in 2008 and 2011 respectively. He is a member of IEEE.

Dr. Shatabda is an Associate Professor and Undergraduate Program Co-ordinator of the Computer Science and Engineering Department.

He received his PhD from the Institute for Integrated and Intelligent Systems (IIIS), Griffith University in 2014. His thesis is titled “Local Search Heuristics for Protein Structure Prediction”. He completed his BSc in Computer Science and Engineering at Bangladesh University of Engineering and Technology (BUET) in 2007.

Dr. Shatabda's research interests include bioinformatics, optimization, search and meta-heuristics, data mining, constraint programming, approximation algorithms, and graph theory. He has a number of quality publications in both national and international conferences and journals.

He has worked as a Graduate Researcher at the Queensland Research Laboratory, NICTA, Australia. Prior to entering teaching, he worked as a Software Engineer at Vonair Inc, Bangladesh.