Behaviour-Based Quality Assessment of OpenStreetMap Data in Data Scarce Area Using Unsupervised Machine Learning
2025-10-03 , Pulag

This study introduces a behavior-based, unsupervised machine learning approach to assess the intrinsic quality of OpenStreetMap (OSM) data in Dhaka, a data-scarce and rapidly urbanizing area. Using enriched contributor metadata and Principal Component Analysis (PCA), latent behavioral patterns are identified and contributors are segmented with KMeans and HDBSCAN. The silhouette score for the KMeans clustering on the PCA-transformed features was 0.951. The results show that KMeans yields more interpretable clusters than HDBSCAN. This repeatable methodology provides a scalable, reference-free way to bring quality assurance of Volunteered Geographic Information (VGI) datasets to the forefront where authoritative data are limited or absent.


OpenStreetMap (OSM) is an important source of geospatial information in data-scarce urban areas, where official geospatial data are scarce, outdated, or not readily available. The growing need for current and accurate geospatial data in rapidly urbanizing and under-surveyed regions makes OSM an essential resource. As one of the most representative examples of Volunteered Geographic Information (VGI), OSM offers a free, editable world map to which millions of people can contribute [1]. It is an essential component of urban analytics, transport planning, disaster risk reduction, and spatial modeling worldwide [2], [3], [4]. Although widely used, OSM data vary greatly in quality across regions and contributor skill levels, and there is no unified, system-level quality assurance mechanism [5]. This heterogeneity poses a risk for users who rely on the data for precision tasks (e.g., routing, land use modeling, and infrastructure design) [2], [6].
Traditional OSM quality assessments rely on extrinsic comparisons with satellite imagery or authoritative datasets, which are often unavailable in the very regions that need the data most [7], [8]. To overcome this challenge, a reproducible, unsupervised machine learning framework is proposed that assesses OSM data quality intrinsically, based on contributor behavior metadata alone. Specifically, Dhaka, a data-scarce and fast-growing megacity in Bangladesh, is selected as the study area, under the hypothesis that distinct contributor behavioral patterns correlate with different levels of data reliability. This behavior-centric perspective builds on the insight that contributor frequency, recency, thematic focus, and spatial editing behavior can serve as meaningful proxies for feature quality [5], [9].
Roads and buildings for Dhaka are extracted from an .osm.pbf file with the Pyrosm library. An enriched feature vector is then created for each unique contributor, composed of total_edits, edit_rate, active_days, spatial_extent, pct_road, pct_building, weekday_activity, and days_since_last_edit. Principal Component Analysis (PCA) is applied for dimensionality reduction and shows that PC1 roughly represents overall mapping activity, PC2 corresponds to thematic attention (road versus building), and PC3 represents the geographical coverage of contributions. These observations are supported by a feature contribution heatmap (Figure 1(a)), which indicates that the behavioral features are interpretable and highly separable in the reduced component space. PCA also reduces noise and prepares the data for clustering [10].
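The feature engineering and PCA step can be illustrated with a minimal sketch. It is not the study's actual code: the input table (columns user, timestamp, lon, lat, kind), the synthetic data, and the exact feature definitions (for instance, spatial_extent as a bounding-box area in square degrees and edit_rate as edits per active day) are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

FEATURES = ["total_edits", "edit_rate", "active_days", "spatial_extent",
            "pct_road", "pct_building", "weekday_activity", "days_since_last_edit"]


def contributor_features(edits: pd.DataFrame, reference_date: pd.Timestamp) -> pd.DataFrame:
    """Aggregate one row per contributor.

    Assumed input columns: user, timestamp, lon, lat, kind ("road" or "building").
    The feature definitions below are illustrative assumptions.
    """
    df = edits.copy()
    df["day"] = df["timestamp"].dt.normalize()
    df["is_road"] = df["kind"].eq("road")
    df["is_building"] = df["kind"].eq("building")
    df["is_weekday"] = df["timestamp"].dt.dayofweek < 5  # Monday=0 ... Friday=4
    g = df.groupby("user")
    out = pd.DataFrame({
        "total_edits": g.size(),
        "active_days": g["day"].nunique(),
        # Bounding-box area in square degrees (assumed definition).
        "spatial_extent": (g["lon"].max() - g["lon"].min())
                          * (g["lat"].max() - g["lat"].min()),
        "pct_road": g["is_road"].mean(),
        "pct_building": g["is_building"].mean(),
        "weekday_activity": g["is_weekday"].mean(),
        "days_since_last_edit": (reference_date - g["timestamp"].max()).dt.days,
    })
    out["edit_rate"] = out["total_edits"] / out["active_days"]
    return out[FEATURES]


# Synthetic stand-in for edits extracted from an .osm.pbf file.
rng = np.random.default_rng(0)
n = 600
edits = pd.DataFrame({
    "user": rng.choice([f"user_{i}" for i in range(40)], size=n),
    "timestamp": pd.Timestamp("2024-01-01")
                 + pd.to_timedelta(rng.integers(0, 365, size=n), unit="D"),
    "lon": rng.uniform(90.33, 90.51, size=n),
    "lat": rng.uniform(23.66, 23.90, size=n),
    "kind": rng.choice(["road", "building"], size=n),
})

features = contributor_features(edits, pd.Timestamp("2025-01-01"))

# Standardize so that no feature dominates because of its scale, then project.
X = StandardScaler().fit_transform(features)
pca = PCA(n_components=3).fit(X)

# Loadings: how strongly each behavioral feature contributes to each component.
loadings = pd.DataFrame(pca.components_.T, index=FEATURES,
                        columns=["PC1", "PC2", "PC3"])
print(loadings.round(2))
print(pca.explained_variance_ratio_.round(3))
```

A loadings table of this kind is what a feature contribution heatmap visualizes; on random synthetic data the components carry no behavioral meaning, so the interpretation of PC1 to PC3 given in the text applies only to the real contributor data.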
Next, KMeans clustering (with k = 4) and HDBSCAN, a density-based clustering algorithm, are performed on the PCA-transformed feature set. The silhouette score of the KMeans model was 0.951, suggesting high cohesion within clusters and good separation between behavioral clusters. The PCA cluster scatterplot (Figure 1(c)) shows four separated clusters: (1) most contributors (Figure 1(b)) fall in cluster 0, which mainly comprises casual or one-time contributors who probably take part in sporadic mapathons or make large-scale imports; (2) clusters 1 and 2 consist of moderate to heavy contributors who are relatively stable, use richer semantic tagging, and whose edits are spatially distributed; (3) cluster 3 is a small group of “power users” characterized by high activity volume and a wide geographical distribution.
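A minimal sketch of the KMeans step with k = 4 and its silhouette score, using scikit-learn, might look as follows. The data are synthetic and deliberately imbalanced to mimic a skewed contributor population; the group sizes, random seeds, and the choice of three principal components are assumptions, so the score printed here is not comparable to the 0.951 reported for the Dhaka data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for the 8 contributor features: four groups of very
# unequal size (assumed sizes, not taken from the study).
sizes = [400, 60, 30, 10]
centers = rng.normal(0, 5, size=(4, 8))
X_raw = np.vstack([c + rng.normal(0, 1, size=(s, 8))
                   for c, s in zip(centers, sizes)])

# Standardize, reduce to three components, then cluster in the reduced space.
X = StandardScaler().fit_transform(X_raw)
scores = PCA(n_components=3, random_state=0).fit_transform(X)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(scores)

# Silhouette ranges from -1 to 1; values near 1 indicate compact,
# well-separated clusters.
sil = silhouette_score(scores, labels)
cluster_sizes = np.bincount(labels, minlength=4)

print("silhouette:", round(float(sil), 3))
print("cluster sizes:", cluster_sizes)
```

Fixing random_state and using several initializations (n_init) keeps the labels stable across runs, which matters when clusters are later interpreted as behavioral groups.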
HDBSCAN is also applied to the same dataset to analyze its ability to separate clusters of varying density from noise. HDBSCAN found small, dense clusters and labeled a large percentage of contributors as noise. Although helpful for identifying anomalies and potential vandalism, HDBSCAN did not produce clusters as clear as those of KMeans for the main contributor groups, likely because of the extreme imbalance in contributor engagement. This comparison showed that KMeans offers better interpretability and cluster stability, and it is therefore preferred for behavioral segmentation of high-volume OSM datasets.
To further verify the clustering, the changes in edit volume over time were examined for each cluster, and feature distributions were calculated per cluster. The contributor distribution bar chart (Figure 1(b)) shows that the participation structure in OSM is highly skewed, in line with previous VGI studies [11], [12]. Feature analysis showed that clusters associated with more recent, frequent, and thematically rich editing were also responsible for higher-quality contributions, consistent with prior work linking contributor experience to data quality [5], [9], [13].
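These per-cluster checks can be sketched with pandas. The two small tables below are invented toy data, and the column names and the choice of the median as summary statistic are assumptions.

```python
import pandas as pd

# Toy contributor table with cluster labels (invented values).
contributors = pd.DataFrame({
    "user": ["a", "b", "c", "d", "e"],
    "cluster": [0, 0, 0, 1, 3],
    "total_edits": [1, 2, 3, 40, 900],
    "days_since_last_edit": [700, 650, 800, 30, 2],
})

# Toy edit log (invented values).
edits = pd.DataFrame({
    "user": ["a", "b", "d", "d", "e", "e", "e"],
    "timestamp": pd.to_datetime([
        "2023-01-05", "2023-01-20", "2024-06-01", "2024-06-15",
        "2024-06-02", "2024-07-01", "2024-07-03",
    ]),
})

# Feature distribution per cluster, summarized by the median.
profile = contributors.groupby("cluster")[
    ["total_edits", "days_since_last_edit"]
].median()

# Share of contributors per cluster (the skew shown in a bar chart).
share = contributors["cluster"].value_counts(normalize=True).sort_index()

# Edit volume per month and cluster.
edits = edits.merge(contributors[["user", "cluster"]], on="user", how="left")
monthly = (
    edits.assign(month=edits["timestamp"].dt.to_period("M"))
    .groupby(["cluster", "month"])
    .size()
    .unstack("cluster", fill_value=0)
)

print(profile)
print(share)
print(monthly)
```

In this toy example, cluster 0 holds most contributors but only old, low-volume edits, while the single contributor in cluster 3 accounts for the most recent activity.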
A key contribution of this work is its extensible and repeatable approach. All data processing, feature engineering, PCA, and clustering were performed in Python (Colab) with open-source packages (scikit-learn, geopandas, pyrosm, matplotlib). The method does not need any external validation datasets, so it is particularly suited to developing countries and isolated locations where reference data are limited or unavailable [8].
This study contributes methodologically to three areas, namely geospatial data science, unsupervised machine learning, and VGI quality assurance, by showing how user behavior can be harnessed to derive intrinsic data quality. It complements the literature on behavior-based contributor profiling, incorporates dimensionality reduction to facilitate the interpretation of results, and argues for local rather than centralized quality assessment, which seems feasible even in urban settings with complex mobility patterns.
Pragmatically, this work can help NGOs, local authorities, and the OSM community direct resources for data validation and enrichment toward areas where coverage comes primarily from lower-quality contribution clusters. It also enables hybrid quality models in which behavior signals are augmented with selective extrinsic checks (such as anomaly detection or community verification). For example, contributors from Cluster 3 (power users) may be assigned higher trust weights in quality models, while edits from Cluster 0 may be flagged for further review or enrichment.
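One way such cluster-based trust weighting could be wired up is sketched below. The numeric weights, the treatment of unknown contributors, and the function and column names are hypothetical; the text only suggests that Cluster 3 may receive higher trust and that Cluster 0 edits may be flagged.

```python
import pandas as pd

# Hypothetical trust weights per behavioral cluster (illustrative values only).
TRUST_WEIGHT = {0: 0.25, 1: 0.5, 2: 0.5, 3: 1.0}
REVIEW_CLUSTERS = {0}   # clusters whose edits are flagged for review
UNKNOWN = -1            # placeholder for contributors without a cluster label


def annotate_features(features: pd.DataFrame, contributor_cluster: dict) -> pd.DataFrame:
    """Attach a trust weight and a review flag to each mapped feature.

    `features` needs a `user` column naming the last editor (assumed schema).
    """
    out = features.copy()
    out["cluster"] = [contributor_cluster.get(u, UNKNOWN) for u in out["user"]]
    # Unknown contributors get zero trust and are always sent to review.
    out["trust"] = out["cluster"].map(lambda c: TRUST_WEIGHT.get(c, 0.0))
    out["needs_review"] = out["cluster"].isin(REVIEW_CLUSTERS | {UNKNOWN})
    return out


features = pd.DataFrame({
    "feature_id": [101, 102, 103],
    "user": ["a", "e", "z"],
})
contributor_cluster = {"a": 0, "e": 3}  # "z" has no cluster label

result = annotate_features(features, contributor_cluster)
print(result)
```

In a hybrid model, the needs_review flag would select the features to send to extrinsic checks such as community verification, while the trust column could weight the remaining features.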
In conclusion, this work reports a new behaviour-based quality assessment of OSM built on unsupervised machine learning. The cluster- and PCA-driven design is transparent, interpretable, and fully reproducible. It addresses the challenges of working in data-scarce urban areas and paves the way for behavior-driven VGI quality models in the context of urban resilience, infrastructure planning, and humanitarian mapping. Future studies will incorporate spatial error measures and apply this methodology to longitudinal OSM data to monitor quality evolution.

I am a young researcher and engineer passionate about climate change impact assessment, ecosystem service assessment, disaster resilience, and environmental sustainability. With a strong academic foundation in Civil Engineering (BSc), Humanitarian Engineering (MSc), and Environmental Economics (MEcon), I bring an interdisciplinary approach to solving complex environmental and societal challenges. My work integrates hydrological modeling, GIS-based analysis, machine learning, and multi-criteria decision analysis (MCDA) to assess climate risks and develop sustainable solutions.