Mining Imbalanced Big Data with Julia
2019-07-24, 11:50–12:00, Elm B

Machine learning for data Mining applications in imbalanced big data classification is very challenging task. In this talk, we have proposed a new cluster-based under-sampling approach with ensemble learning for mining real-life imbalanced big data in Julia.


In this era of big data, classifying imbalanced real-life data in supervised learning is a challenging research issue. Standard data sampling methods: under-sampling, and over-sampling have several limitations for dealing with big data. Mostly, under-sampling approach removes data points from majority class instances and over-sampling approach engenders artificial minority class instances to make the data balanced. However, we may lose informative information/ instances using under-sampling approach, and under other conditions over-sampling approach causes overfitting problem. In this talk, we have presented a new cluster-based under-sampling approach by amalgamating ensemble learning (e.g. RandomForest classifier) for classification of imbalanced data that we implemented in Julia. We have collected actual illegal money transaction telecom fraud data, which is highly imbalanced with only 8,213 minority class instances amount 63,62,620 instances. The proposed method bifurcates the data into majority class and minority class instances. Then, clusters the majority class instances into several clusters and considers a set of instances from each cluster to create several sub-balanced datasets. Finally, a number of classifiers are generated using these balances datasets and apply majority voting technique for classifying unknown/ new instances. We have tested the proposed method on separate test dataset that achieved 97% accuracy.


Co-authors – Swakkhar Shatabda, Mohammad Zoynul Abedin, Md. Tarikul Islam, Md. Ishtiak Hossain