0.3
State of the Map 2021 - Academic Track
sotm2021-academic
2021-07-09
2021-07-11
3
00:05
https://pretalx.com
UTC
Track 1 - Talks
NLMaps Web: A Natural Language Interface to OpenStreetMap
Academic Talk
2021-07-11T10:00:00+00:00
10:00
00:20
NLMaps Web is a web interface for querying OSM with natural language questions such as “Show me where I can find drinking water within 500m of the Louvre in Paris”. Such questions are first parsed into a custom query language, which is then used to retrieve the answer via queries to Nominatim and Overpass.
sotm2021-academic-10416-nlmaps-web-a-natural-language-interface-to-openstreetmap
Simon Will
en
Nominatim and Overpass are powerful ways of querying OSM, but the Overpass Query
Language is somewhat impractical for quick queries by users unfamiliar with it. In
order to query OSM using natural language (NL) queries such as “Show me where I
can find drinking water within 500m of the Louvre in Paris”, Lawrence and
Riezler [1] created the first NLMaps dataset mapping NL queries to a custom
machine-readable language (MRL), which can then be used to retrieve the answer
from OSM via a combination of queries to Nominatim and Overpass. They extended
their dataset in a subsequent work by auto-generating synthetic queries from a
table mapping NL terms to OSM tags – calling the combined dataset NLMaps v2. [2]
The intended purpose of these datasets is training a parser that maps NL
queries to their MRL representation, as done in [2-5].
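As an illustration of the retrieval side, the following is a minimal sketch of how an interpreted query might be answered via the public Nominatim and Overpass endpoints. The helper names and the exact query shape are assumptions for illustration; the actual NLMaps MRL interpreter is more involved.

```python
import json
import urllib.parse
import urllib.request

NOMINATIM = "https://nominatim.openstreetmap.org/search"
OVERPASS = "https://overpass-api.de/api/interpreter"

def build_around_query(lat, lon, key, value, radius_m):
    """Build an Overpass QL query for nodes with key=value within radius_m
    of a coordinate (one possible target of an 'around'-style MRL query)."""
    return (
        f'[out:json][timeout:25];'
        f'node["{key}"="{value}"](around:{radius_m},{lat},{lon});'
        f'out body;'
    )

def geocode(place, user_agent="nlmaps-sketch"):
    """Resolve a place name to (lat, lon) via Nominatim (network call)."""
    url = NOMINATIM + "?" + urllib.parse.urlencode(
        {"q": place, "format": "json", "limit": 1})
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req) as resp:
        hits = json.load(resp)
    return float(hits[0]["lat"]), float(hits[0]["lon"])

# “Drinking water within 500 m of the Louvre” would then translate to
# geocoding "Louvre, Paris" and issuing an around query at its coordinate:
query = build_around_query(48.8606, 2.3376, "amenity", "drinking_water", 500)
```

The resulting query string would be POSTed to the Overpass interpreter endpoint as the `data` parameter.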
The main aim of my Master’s thesis was building a web-based NLMaps interface
that can be used to issue queries and to view the result. In addition, the web
interface should enable the user to give feedback on the returned query, either by
simply marking the parser-produced MRL query as correct or incorrect, or by
explicitly correcting it with the help of a web form. This feedback should be
directly used to improve the parser by training it in an asynchronous online
learning procedure.
After observing that parsers trained on NLMaps v2 perform poorly on new queries,
an investigation into the causes for this revealed several shortcomings in
NLMaps v2, mainly: (1) The train and test splits are extremely similar, limiting
the informativeness of evaluating on the test split. (2) Various inconsistencies
exist in the mapping from NL terms to OSM tags (e.g. “forest” sometimes mapping to
natural=wood, sometimes to landuse=forest). (3) The NL queries’ linguistic
diversity is limited since most of them were generated with a very simple
templating procedure, which leads to parsers trained on the data not being very
robust to new wordings of a query. (4) In a similar vein, there is only a small
amount of different area names in NLMaps v2 with the names “Paris”, “Heidelberg”
and “Edinburgh” being so dominant that parsers are biased towards producing
them. (5) Some generated NL queries are worded very unnaturally, making them
counter-productive learning examples. (6) Usage of OSM tags is sometimes
incorrect, which affects the usefulness of produced parses.
The detailed analysis is used to eliminate some of the shortcomings – such as
incorrect tag usage – from NLMaps v2. Additionally, a new approach of
auto-generating NL-MRL pairs with probabilistic templates is used to create a
dataset of synthetic queries that features a significantly higher linguistic
diversity and a large set of different area names. The combination of the
improved NLMaps v2 and the new synthetic queries is called NLMaps v3.
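A toy sketch of what probabilistic templating for synthetic NL-MRL pairs could look like follows. The fragment inventory, weights, and the simplified MRL string are illustrative assumptions, not the thesis's actual grammar.

```python
import random

# Hypothetical fragments; the thesis's actual template inventory and MRL
# grammar are richer than this illustration.
THING_TAGS = {"drinking water": ("amenity", "drinking_water"),
              "a pharmacy": ("amenity", "pharmacy")}
AREAS = ["Paris", "Heidelberg", "Nairobi", "Montevideo"]
# Weighted alternative phrasings to increase linguistic diversity.
NL_TEMPLATES = [
    ("Where can I find {thing} in {area}?", 0.4),
    ("Show me {thing} in {area}", 0.35),
    ("Is there {thing} anywhere in {area}?", 0.25),
]

def sample_pair(rng=random):
    """Sample one synthetic NL-MRL pair from the probabilistic templates."""
    thing = rng.choice(list(THING_TAGS))
    area = rng.choice(AREAS)
    templates, weights = zip(*NL_TEMPLATES)
    nl = rng.choices(templates, weights=weights)[0].format(thing=thing, area=area)
    key, value = THING_TAGS[thing]
    # Simplified stand-in for the NLMaps MRL.
    mrl = (f"query(area(keyval('name','{area}')),"
           f"nwr(keyval('{key}','{value}')),qtype(latlong))")
    return nl, mrl
```

Drawing many samples from such weighted templates over a large area-name list yields the higher linguistic and geographic diversity described above.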
A character-based GRU encoder-decoder model with attention [6] is used for
parsing NL queries into MRL queries using the configuration that performed best
in previous work [5]. This model is trained on NLMaps v3 and used as the parser
in the newly developed web interface. Mainly through advertising on the OSM talk
list and the OSM subreddit, 12 annotators are hired from all over the world to
use the web interface to issue new NL queries and to correct the parser-produced
MRL query if it is incorrect. They are assisted by completing a tutorial before
the annotation job and by help compiled from taginfo [7], TagFinder [8] and
custom suggestions for difficult tag combinations. The collected dataset
contains 3773 NL-MRL pairs and is called NLMaps v4.
With the help of NLMaps v4, an informative evaluation can be performed revealing
that a parser trained on NLMaps v2 achieves an exact match accuracy of
5.2 % on the MRL queries of the test split of NLMaps v4 while a parser trained
on NLMaps v3 performs significantly better with 28.9 %. Pre-training on
NLMaps v3 and fine-tuning on NLMaps v4 achieves an accuracy of 58.8 %.
Since the thesis’s goal is an online learning system – i.e. a system that
updates the parser directly after receiving feedback in the form of an NL-MRL
pair – various online learning simulations are conducted in order to find the
best setup. In all cases, the parser is pre-trained on NLMaps v3 and then
receives the NL-MRL pairs in NLMaps v4 one by one, updating the model after each
step. The simplest variant of the experiment uses only the one NL-MRL pair
for the update, another variant adds NL-MRL pairs from NLMaps v3 to the
minibatch and a third variant additionally adds further “memorized” NL-MRL pairs
from previously given feedback to the minibatch. The main findings of the
simulation are that all variants improve performance on NLMaps v4 with respect
to the pre-trained parser, but with some of them the performance on NLMaps v3
degrades. The simple variant that updates only on the one NL-MRL pair is
particularly unstable, while adding NLMaps v3 instances stabilizes the
performance on NLMaps v3 and improves the performance on NLMaps v4. Adding the
instances from memorized feedback further improves the performance to an
accuracy of 53.0 %, which is still lower than the offline batch learning
fine-tuning mentioned in the previous paragraph.
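The three minibatch variants can be sketched as follows; only the batch composition is shown, and `k` and the sampling scheme are illustrative assumptions rather than the thesis's exact hyperparameters.

```python
import random

def build_minibatch(feedback_pair, v3_pool, memory, variant, k=8, rng=random):
    """Assemble the update minibatch for one online-learning step.

    variant 1: only the new feedback pair,
    variant 2: pair + k examples sampled from NLMaps v3,
    variant 3: pair + k v3 examples + k previously memorized feedback pairs.
    """
    batch = [feedback_pair]
    if variant >= 2:
        batch += rng.sample(v3_pool, min(k, len(v3_pool)))
    if variant >= 3:
        batch += rng.sample(memory, min(k, len(memory)))
    return batch

# After each update, the new pair would join the feedback memory:
# memory.append(feedback_pair)
```

Mixing in v3 examples corresponds to the stabilizing effect reported above: the update gradient is anchored by pre-training data instead of being driven by a single pair.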
In conclusion, the thesis improves the existing NLMaps dataset and contributes
two new datasets – one of which is especially valuable since it consists of real
user queries – laying the groundwork necessary for further enhancing NLMaps
parsers. The current parser – achieving an accuracy of 58.8 % – can be used by
OSM users via the new web interface currently available at
https://nlmaps.gorgor.de/ for issuing queries and also for correcting incorrect
ones. Future work will concentrate on improving the web interface’s UX and
enhancing the parser’s performance in terms of speed and accuracy.
false
https://pretalx.com/sotm2021-academic/talk/GDMBWS/
https://pretalx.com/sotm2021-academic/talk/GDMBWS/feedback/
Track 1 - Talks
What has machine learning ever done for us?
Academic Talk
2021-07-11T10:45:00+00:00
10:45
00:20
Machine Learning is incredibly popular at this time among researchers working with OSM data and on OSM-related problems. But what impact has this work on ML had on the OSM database or OSM community? We investigate the impact, if any, that ML work within the academic research community has had on OSM over the last few years.
sotm2021-academic-10415-what-has-machine-learning-ever-done-for-us-
Peter Mooney
en
# What has machine learning ever done for us?
Peter Mooney and Edgar Galvan,<br />
Department of Computer Science, <br />
Maynooth University, Maynooth. <br />
Co. Kildare. Ireland. <br />
peter.mooney@mu.ie; edgar.galvan@mu.ie <br />
### Introduction and background
Recently, machine learning (ML) and artificial intelligence (AI) based approaches have been applied frequently to many different types of problems in OpenStreetMap (OSM). Indeed, ML and AI have been used extensively by the research community for a plethora of applications and problems both related and unrelated to OSM. Wagstaff (2012)[1] suggests ML offers "a cornucopia of useful ways to approach problems which defy manual solutions". In specific relation to the geospatial domain, ML approaches have been reported at least as early as a decade ago with work by authors such as Werder et al. (2010)[2] on interpretation of buildings in settlements and detecting road intersections from GPS traces by Fathi and Krumm (2010)[3]. Around this time, interest in the combination of ML and OSM began to emerge. Funke et al. (2015)[4] argued that many aspects of OSM data might be suitable for "extrapolation or classification using ML". Many examples have emerged with ML approaches being used to consider problems such as: predicting or recommending tagging for objects, object classification based on contextual or proximity information, tag usage checking, and automated mapping approaches, to mention some. Jennings et al. (2019)[5] showed that Facebook’s recent mapping campaign in OSM used ML to detect road networks from satellite imagery, which are then validated by OSM editors and the local OSM communities. Examples also exist where OSM is used in ML approaches for other geospatial classification problems (Wu et al. (2020)[6], Jacobs and Mitchell (2020)[7]), while authors such as Feldmeyer et al. (2020)[8] used machine and deep learning algorithms with OSM for developing socio-economic indicators. Audebert et al. (2017) provided additional examples and argued that OSM's richness means it can be used in difficult problems such as semantic labeling of aerial and satellite images.
In addition to the observations by Vargas-Munoz et al. (2021) in their recent review of ML approaches in OSM, we can usually observe ML and OSM interaction in one of three ways: (1) ML approaches are used to improve or correct OSM data, (2) instances where OSM is used as a means of training ML models for some specific task such as building segmentation, road speed estimation (Keller et al., 2020 [10]) or land use classification (Schultz et al. 2017 [9]), or (3) where the contribution patterns of OSM contributors are analysed using ML techniques as in work such as that by Jacobs and Mitchell (2020)[7]. In this submission we ask the following question: with all of the many applications and integrations of ML and AI with OSM over the past number of years, how many of these applications and approaches have been adopted or used by the OSM community? Furthermore, what are the benefits or impact of these efforts from the research community with ML and AI approaches to the OSM project and OSM community? We believe that there is significant scope for ML researchers to make impactful and helpful contributions directly within OSM on problems such as tag updating and correction, added intelligence within OSM editing software, intrinsic quality analysis, etc.
### Methodology and Findings achieved
A systematic review of approximately 60 peer-reviewed academic journal and conference papers will be reported. These papers are selected on the basis that they: (1) clearly outline an ML or AI approach using OSM data, and (2) tackle a problem known in the OSM community such as tag prediction, contribution patterns, or geometry correction. Paper metadata such as title, keywords, and abstract contents are used to select the papers. Manual checking of the papers is also undertaken to ensure that the content of each paper relates to our selection criteria. A classification of these papers will be developed based on the following set of questions:
* What are the most common ML approaches used by researchers for the three instances outlined above? For example, Learning Problems (supervised, self-supervised, reinforcement), Statistical Inference (Inductive, Deductive), etc.
* What are the most common types of problems in OSM tackled by ML approaches? For example, automated tagging, contribution pattern analysis, intrinsic quality analysis, object classification, etc.
* Are the approaches reproducible and replicable for other regions or areas within OSM? For example, is a particular ML approach limited to a specific geographical area or thematic area (such as roads, buildings, waterways, etc.) in OSM?
Based on this classification, we then report a narrative of our findings on the benefits and impacts of these efforts to the OSM project and the OSM community. We are working on this analysis at the time of writing.
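A first-pass metadata filter of the kind described could be sketched as follows; the term lists are illustrative assumptions, and, as noted above, shortlisted papers are still checked manually.

```python
import re

# Assumed keyword lists for illustration; the study's actual criteria
# combine metadata matching with manual review.
ML_TERMS = ("machine learning", "deep learning", "neural network",
            "random forest", "artificial intelligence")

def matches_criteria(paper):
    """Return True if a paper's metadata suggests it combines ML/AI and OSM.

    `paper` is a dict with optional 'title', 'abstract', 'keywords' fields.
    """
    text = " ".join([paper.get("title", ""), paper.get("abstract", ""),
                     " ".join(paper.get("keywords", ()))]).lower()
    tokens = set(re.findall(r"[a-z]+", text))
    uses_ml = any(term in text for term in ML_TERMS)
    about_osm = "openstreetmap" in tokens or "osm" in tokens
    return uses_ml and about_osm
```

Phrase matching is used for multi-word ML terms, while OSM is matched on whole tokens to avoid accidental substring hits.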
### Final Discussion of scientific contributions
As suggested by Jacobs and Mitchell (2020)[7], ML can "contribute to the diversification and quality of available assessment methods for OSM" while Feldmeyer et al. (2020)[8] argues that the application of ML to OSM can reveal the "untapped potential for knowledge generation" in OSM. In our work, we argue that we must not get carried away with the combination of ML and OSM purely for the sake of it. OSM, as a massive open geospatial database, is a very attractive source of (geo-)data for researchers and practitioners looking to train, benchmark and test ML approaches. Consequently, we can confidently state that, after well over a decade of reported results in this domain, researchers have produced many excellent research and knowledge outputs using the ML and OSM combination. Now we enter a phase of technological and scientific development with ML and OSM where we must ask how can all of this ML knowledge contribute effectively to the OSM database and OSM community.
Grinberger et al. (2019)[11] argue that efforts to establish and strengthen interaction between the research community working with or interested in OSM and the OSM community itself have generally been positive. However, opportunities exist to enhance interactions between these two communities, and perhaps ML could be the catalyst for a new interaction. Based on this, the scientific contribution of this work is multi-faceted. Firstly, this paper will stimulate debate about the contribution of these ML approaches to the improvement of OSM data and enhancement of the OSM community. Secondly, this work will highlight situations where these ML approaches have delivered genuinely new and novel outputs of interest to OSM in general. Finally, this work will issue the challenge to the academic community to apply ML to several interesting and open problems which are of mutual interest to both the academic and OSM community.
false
https://pretalx.com/sotm2021-academic/talk/RHR7Q8/
https://pretalx.com/sotm2021-academic/talk/RHR7Q8/feedback/
Track 1 - Talks
Towards a framework for measuring local data contribution in OpenStreetMap
Academic Talk
2021-07-11T11:30:00+00:00
11:30
00:20
OpenStreetMap (OSM) constitutes a new open geographic database and offers several possibilities of adding local knowledge. While the importance of local knowledge is largely acknowledged in the OSM community, relatively few scientific studies have evaluated it. This study presents a framework to measure local data contribution in OSM in three case studies. The results highlight a framework for measuring local data in OSM as well as the distinct mapping stories of local OSM communities.
sotm2021-academic-10423-towards-a-framework-for-measuring-local-data-contribution-in-openstreetmap
Maxwell Owusu
en
OpenStreetMap (OSM) has proven to be a valuable source of spatial data for many applications, including humanitarian aid. Information on buildings and roads, which can be provided by remote mapping, is of the highest concern for many humanitarian applications. However, further information, which can only be mapped on the ground, is of high importance for finer-scale humanitarian action. Road surface information, the type of material, and information on the use of a building (health site, school, ...) are highly relevant. OSM offers several possibilities of adding local knowledge [1]. Recent work deals with analyzing and classifying data production in OSM [2], and intrinsic analysis has gained popularity as an indicator for measuring the quality of OSM data [3–6]. Nevertheless, relatively few scientific studies have touched on "local knowledge" and local data in OSM in sufficient detail.
The question of how much local knowledge is added and what kind of local data is added remains unanswered. Addressing this question is important since only local knowledge provides access to the plethora of contextual information that is necessary for many purposes. The term "local knowledge" is often debated in the OSM community due to its ambiguity. Consequently, it is hardly taken into account by researchers when evaluating OSM [1]. This study presents a metric to measure local data contributions in OSM and analyzes temporal patterns of local contributions in three case studies. The aim of the metric is to identify archetypes of places representing a variety of contextual information.
Firstly, we evaluated Rebecca Firth's framework on OSM contribution types that focused on the humanitarian context (see Twitter post: https://t.co/rDaSraiVZF). Secondly, we discussed with local community working groups how to measure local data contributions ("What exactly are local OSM data to you?"). The outcome of the community discussion provided valuable information to design a generalized workflow for measuring local data contribution in OSM. Subsequently, we identified aspects on which the local communities agreed with respect to the perception of local data. Based on these first insights, we developed a classification schema for measuring local data in OSM that is "fit-for-purpose" for local OSM communities. This schema consists of four main levels and assigned OSM tags that could be used as indicators for each level. Thirdly, we explored the temporal evolution of local data in OSM for three unique regions whose mapping activities are influenced by local mapping organizations: (i) Ramani Huria in Dar es Salaam, Tanzania, focusing on flood resilience, (ii) Crowd2Map, mainly operating in the Mara region, Tanzania, and focusing on identifying features that can support the fight against female genital mutilation and protect girls and women at risk, and (iii) the power mapping project by YouthMappers in Koinadugu, Sierra Leone, focusing on mapping electrical grid infrastructure. We used the ohsome API to access the full history of OSM. We determined the density and the ratio (as the sum of all OSM tags to the number of OSM elements) per month for each region and localness level.
The outcome of the community discussion showed that local mappers/editors had different perceptions of local knowledge. The type of local data produced depends on: (1) the context within which the data is produced and (2) the character/interest of the individual mapping it. However, the local data produced could be broadly categorized as "core" or "specific". The "core" category consisted of objects that cut across almost all projects or activities (e.g., buildings, roads, place names and administrative boundaries), and the "specific" category comprised special elements mapped as a result of a particular interest or aim of the project (e.g., culverts, drains, access types, parking type).
To develop a metric for local data analysis, we classified OSM data into four main levels: level 1 consists of objects that can easily be derived by remote mapping from satellite images, such as roads and buildings (information that does not require local knowledge); level 2 focuses on place names and administrative boundaries, which are frequently imported; level 3 focuses on the presence of general (e.g., residential and commercial) or specific amenities (e.g., school, clinic, and points of interest); and level 4 focuses on micro-data that provides further contextual information about an object (e.g., maxspeed, surface condition). Levels 1 and 2 mainly fall into the "core" category whereas levels 3 and 4 mainly belong to the "specific" category (which will vary across different regions). Our results show that the amount of features in OSM decreased from level 1 to level 4. The ratio between level 1 and level 4 could be used as an indicator for how widely local information is present in OSM at a specific location. Thereby, it can provide insights on the quality of the OSM data and its fitness-for-purpose for applications that need information beyond the existence of highways or buildings.
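The level schema and the level-1-to-level-4 ratio can be sketched as a small computation; the tag-to-level assignment below is an abbreviated, assumed subset of the study's indicator lists, and the ratio direction is one possible reading of the indicator.

```python
# Hypothetical tag-to-level assignment illustrating the four-level schema.
LEVEL_TAGS = {
    1: {"building", "highway"},               # remotely mappable objects
    2: {"place", "admin_level", "boundary"},  # names / boundaries
    3: {"amenity", "shop", "landuse"},        # general or specific amenities
    4: {"surface", "maxspeed", "smoothness"}, # micro-data
}

def count_by_level(elements):
    """Count elements per localness level.

    `elements` is an iterable of OSM tag dicts,
    e.g. {"building": "yes", "surface": "asphalt"}.
    An element counts toward every level whose indicator keys it carries.
    """
    counts = {level: 0 for level in LEVEL_TAGS}
    for tags in elements:
        for level, keys in LEVEL_TAGS.items():
            if keys & tags.keys():
                counts[level] += 1
    return counts

def localness_ratio(counts):
    """Level-4 to level-1 ratio: higher values mean more local detail per
    remotely mappable object."""
    return counts[4] / counts[1] if counts[1] else 0.0
```

Computing these counts per month and region, as described above, then yields the temporal localness curves.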
From the temporal analysis, we observed that most of the mapping in the selected regions started in 2015. By digging deeper into the objects mapped, each selected region depicts unique characteristics which are largely shaped by the interests of contributors and organizations. Mapping patterns are clearly distinct between the regions with respect to the development of tags. For example, there was a large amount of local data regarding waterways, drainage, and solid waste in Dar es Salaam and very little in the Mara region and Koinadugu. This reveals the distinct mapping stories of individual contributors and organizations. Our results further show that there is no common path from level 1 to level 2 to level 3 among the different regions. In the case of Dar es Salaam, mapping of features of the three levels has happened more or less simultaneously. Mapping in the Mara region focused first on place names (level 2) and then on amenities (level 3) as well as buildings and roads (level 1). For Koinadugu, mapping of level 2 started as early as 2011 and was followed by mapping of buildings and roads (level 1) in 2014 and amenities (level 3) from 2017 on.
The classification schema helps to conceptualize a metric for measuring the localness of OSM data at different levels of detail. This metric can easily be used to group OSM data into the categories "core" and "specific". By analyzing the temporal patterns, we identified that the contribution of local data was highly unequal and largely depended on the interests of the mapper(s). The research sheds light on the richness of contextual information in OSM and provides an indication of data quality. In future research we would like to extend the results presented here by including more regions and more perspectives from local OSM communities. By doing so we hope to be able to extend the definition of local data by considering the editors' local knowledge as well.
false
https://pretalx.com/sotm2021-academic/talk/NTNSQE/
https://pretalx.com/sotm2021-academic/talk/NTNSQE/feedback/
Track 1 - Talks
Community Interactions in OSM editing
Academic Talk
2021-07-11T12:15:00+00:00
12:15
00:20
We look at interactions between Corporate and Non-Corporate Editors as reflected through co-editing patterns in the OSM data. We use Social Network Analysis on 12 networks generated from four different locations and three different timepoints, and our results show the vibrant co-production of OSM data generation. There are interactions between all editors, but Corporate Editors tend to interact at a higher rate with each other. The seniority of editors and the interactions also differ between Corporate and Non-Corporate Editors.
sotm2021-academic-10380-community-interactions-in-osm-editing
Dipto Sarkar, Jennings Anderson
en
OpenStreetMap (OSM) data is produced by a vibrant online community of mappers. To be more specific, OSM data produsers represent a plethora of individuals with different motivations, methods of data contribution, and usage (Budhathoki & Haythornthwaite, 2013; Coleman et al., 2009). Thus, OSM contributors have been aptly described as a community of communities (Solis, 2017). In recent years, corporate editing teams have introduced a new dynamic in the discussion on communities in OSM; editing teams hired by corporations such as Apple, Facebook, Microsoft, and Uber are capable of contributing thousands of changesets a day (Anderson et al., 2019; Anderson & Sarkar, 2020). Additionally, corporate editors (CEs) tend to focus their editing on particular types of map features. These two attributes of corporate editing can lead to CEs breaking off into a siloed group of their own with little or no interaction with the rest of the editors on the map.
Previous research on the OSM community using similar methods showed there was limited collaboration between editors with most objects being edited only a few times (Mooney & Corcoran, 2012). Senior editors in particular perform a majority of the mapping work on their own, but do interact with others through co-editing (Mooney & Corcoran, 2014). Since these studies were performed, the OSM community has grown significantly and the community dynamics have also evolved with more individual and organized participation (e.g. CE).
Here, we use a data driven approach to characterize the interactions between the CEs and the rest of the OSM community. We define interactions through editing patterns. That is, we construct a network of interactions where each node represents an editor, and two nodes are connected if they have edited the same map object. If the mapper of node A edits an object last edited by the mapper of node B, then an edge connecting these nodes exists and is directed from A to B.
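The edge construction described above can be sketched in a few lines of pure Python; `object_histories` is an assumed input format distilled from the per-object edit histories.

```python
from collections import defaultdict

def build_interaction_network(object_histories):
    """Build directed co-editing edge weights.

    `object_histories` maps an object id to its chronologically ordered list
    of editor usernames. An edge A -> B means A edited an object last
    touched by B; the weight counts how often this happened.
    """
    edges = defaultdict(int)
    for editors in object_histories.values():
        for prev, curr in zip(editors, editors[1:]):
            if prev != curr:  # consecutive self-edits are not interactions
                edges[(curr, prev)] += 1
    return dict(edges)
```

Node attributes (e.g. the CE / non-CE flag) would then be attached by matching usernames against the disclosed corporate editor lists.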
We utilized the OSM-Interactions tilesets to construct these networks (Anderson, 2020). These vector tiles contain the editing history of all highway and building objects at zoom level 14. They include minor changes to the geometry of objects in which only nodes are moved, but the parent way is left untouched. In this way, we are capturing the complete history of map objects in OSM, as opposed to just changes to the basic OSM elements (primarily nodes or ways).
In keeping with the objects which are primarily edited by CEs, we focused only on highway and building objects for construction of the network. The nodes are further annotated with a binary category representing whether they are a CE or not. We classify a mapper as being a CE or not by comparing usernames in the network to the disclosed lists of usernames on a corporation’s OSM wiki or Github page.
We focus on four locations: Egypt, Jamaica, Thailand, and Singapore. We create networks for each of these locations at three timepoints, 2015, 2017, and 2019, to characterize changes over time. Thus, we constructed and analyzed 12 networks. The locations were chosen as they all have different groups of CEs active.
Across all networks, the Largest Connected Component (LC) accounted for 93.6% of all nodes, highlighting significant interactions amongst all mappers. Within the LC, the rate of growth of CE nodes exceeded the rate of growth of non-CE nodes at a rate of 11:1 between 2015 and 2019. However, both types of editors (CE and non-CE) have a comparable number of in and out degrees in each place, indicating that they edit other people’s work and have their work edited at a similar rate. In terms of who edits whose work, CEs edit other CEs’ work most often, but interactions between CEs and non-CEs have also grown through time, keeping the network connected. With regards to the age of the mappers (calculated in terms of their enrollment date in OSM) and the volume of edits they perform, younger mappers in both groups tend to edit others' work at a higher rate than senior mappers, but there is more variation in these statistics for non-CE mappers. This is a finding contrary to previous research on editing interaction patterns mentioned above. Additionally, characterizing the time between edits shows that edits made by CEs persist for a slightly shorter duration than edits made by non-CEs, primarily due to other CEs editing the same object soon after.
In conclusion, the editing networks highlight the vibrancy of data co-production. The volunteer editor and CEs are interacting with each other's edits to produce the map. The per-group interaction is nuanced and shows unique editing patterns which warrant further investigation. During the timespan of this study, the rate of growth of the CE community was faster than the non-CE community, but whether the pattern will hold over time and whether other locations exhibit the same pattern require more research.
false
https://pretalx.com/sotm2021-academic/talk/PPSHC3/
https://pretalx.com/sotm2021-academic/talk/PPSHC3/feedback/
Track 1 - Talks
Towards understanding the temporal accuracy of OpenStreetMap: A quantitative experiment
Academic Talk
2021-07-11T13:00:00+00:00
13:00
00:20
This talk presents results of an experiment conducted on the temporal accuracy of OpenStreetMap, and provides insights into the temporal dynamics with which changes in real-life appear in OSM.
sotm2021-academic-10398-towards-understanding-the-temporal-accuracy-of-openstreetmap-a-quantitative-experiment
Levente Juhász
en
The ability to provide timely information compared to traditional collection methods of geographic information is generally considered one of the main advantages of volunteered geographic information (VGI) since its emergence in the 2000s (Goodchild, 2007). In addition to several anecdotal examples illustrating how VGI data can provide more up-to-date information than authoritative sources, the literature provides ample evidence on the usefulness of VGI in applications that require timely geodata, such as disaster management (Horita et al., 2013; Neis & Zielstra, 2014). For example, the Haiti earthquake relief effort in 2010 laid the foundations of how remote contributors of OpenStreetMap (OSM) and other platforms can make a difference and aid responding humanitarian agencies after a crisis (Zook et al., 2010). The Humanitarian OpenStreetMap Team has made numerous contributions and helped save lives on many occasions since (Herfort et al., 2021). However, apart from these examples, the temporal dimension of VGI has not received much research attention outside the application of disaster management, and there is a huge gap between assessing temporal accuracy and other factors of data quality, such as spatial accuracy (Antoniou & Skopeliti, 2015; Yan et al., 2020). Aubrecht et al. (2017) highlighted the lack of formal acknowledgment of temporal aspects in the concept of VGI and proposed a framework called ‘Volunteered Geo-Dynamic Information’ to fully integrate spatial and temporal aspects of VGI. Other works utilizing the temporal component in VGI often focus on the behavior of contributors rather than the currency and temporal validity of the map features they contributed (Bégin et al., 2018; Haklay et al., 2010; Neis & Zipf, 2012), or studied the evolution of data over time (Girres & Touya, 2010; Zielstra & Hochmair, 2011). While these approaches are useful, by nature they cannot provide a quantitative measure of how current OSM (or VGI in general) is.
Arsanjani et al. (2013) noted during their investigations that the temporal accuracy of OSM could not be measured using their traditional extrinsic method, because OSM data was compared to authoritative data that did not contain temporal information (i.e. most recent street configuration regardless of when road segments were built or renovated). Another project, ‘Is OSM up-to-date?’ recognizes the lack of information on temporal accuracy and developed a tool that uses an intrinsic approach to visually show features that potentially contain outdated information (Minghini & Frassinelli, 2019). However, by nature, an intrinsic approach can also not provide an absolute measure of how up-to-date OpenStreetMap is.
This research attempts to fill a gap in the literature by conducting an experiment on the currency of VGI. Using OSM data as a case study, it will measure the temporal accuracy of selected map features. This research overcomes previous limitations by using official data provided by the Florida Department of Transportation (FDOT). The dataset contains details about state-funded highway construction projects, including the date these projects were completed, therefore accurately measuring the temporal accuracy of OSM features is possible by comparing dates projects were finished with the time at which corresponding OSM edits in the database were made. This time difference describes how long it took for the OSM community to adapt to real-world changes and update the map database accordingly.
The historical record of highway construction projects was filtered to projects completed between May 15, 2016 and April 1, 2021. Further, only a subset of projects was used: those that resulted in 1) new infrastructure (new roadways, roundabouts or highway ramps), 2) new lanes in existing roadways (excluding bike lanes), or 3) new bike lanes or paths. Other construction projects, such as traffic improvements, road resurfacing and regular maintenance (e.g. bridge rehabilitation), were excluded, since a useful, high-quality road network database can be maintained without this information, which is therefore less likely to migrate into OSM. The methodology uses augmented diffs from the Overpass API to find all changes to OSM highway features (creation, modification and deletion) that are spatially and temporally close to construction projects. These changes are then matched with records from the highway construction dataset, and irrelevant changes (i.e. changes made to other highway features) are removed. This is done by manually interpreting and evaluating changes and construction projects using a description field (e.g. “SR 61 WAKULLA SPRINGS RD @ CR 2204 OAK RIDGE ROAD INTERSECTION - ROUNDABOUT”). The data extraction algorithm initially queries the Overpass API for changes up to one week beyond the completion date of a particular project. If no relevant change is found, iterative queries for 7-day time slices are made until a relevant change is found or the current date is reached. Lastly, the time difference between the end date of a construction project and the first OSM change that introduced it into OSM is calculated. For example, the description field above mentioning State Road 61 (SR61) can be found with the following Overpass query (https://overpass-turbo.eu/s/16XV) that uses the location of the highway construction project.
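The iterative time-slice search described above can be sketched as follows. The helper name `window_slices`, the query template and the bounding box are all illustrative, not from the paper; the template assumes Overpass QL's `[adiff:...]` setting for augmented diffs.

```python
from datetime import date, timedelta

# Illustrative Overpass QL template for an augmented-diff query on highways
# within a bounding box (south,west,north,east).
ADIFF_TEMPLATE = (
    '[adiff:"{start}T00:00:00Z","{end}T00:00:00Z"];'
    'way[highway]({bbox});out geom;'
)

def window_slices(completed, today, lead_days=7, step_days=7):
    """Yield (start, end) pairs: first a window ending one week after
    project completion, then successive 7-day slices up to today."""
    start = completed
    end = completed + timedelta(days=lead_days)
    while start < today:
        yield start, min(end, today)
        start, end = end, end + timedelta(days=step_days)

# Example: a project completed 2019-07-03, searched until 2019-08-01.
slices = list(window_slices(date(2019, 7, 3), date(2019, 8, 1)))
first_query = ADIFF_TEMPLATE.format(
    start=slices[0][0].isoformat(), end=slices[0][1].isoformat(),
    bbox="30.43,-84.32,30.45,-84.30")  # hypothetical bbox in Florida
```

Each generated window would be sent to the Overpass API in turn, stopping as soon as a relevant change to the matched highway feature is returned.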
Whether an extracted change is relevant can also be verified using changeset comments (https://www.openstreetmap.org/changeset/87938707). In this example, the changeset comment “Added new round about.” confirms that the OSM edit is related to the FDOT record. Comparing the construction end date (July 3, 2019) and the time this change appeared in OSM (July 13, 2020) yields 1 year and 10 days, the time it took the OSM community to adopt a real-world change and bring the database up to date.
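The lag for this example is a simple date difference; a minimal sketch:

```python
from datetime import date

construction_end = date(2019, 7, 3)   # FDOT project completion
osm_edit = date(2020, 7, 13)          # first relevant OSM change

lag = osm_edit - construction_end
print(lag.days)  # 376 days: one year (2020 is a leap year) plus 10 days
```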
This talk will be structured as follows. First, the results of a comprehensive literature review on the temporal aspect of OSM research will be given to highlight the lack of data-driven, quantitative research on the temporal component of OSM and VGI. Then, using the filtered FDOT construction dataset, which contains 23 new highways and roundabouts, 64 new bike lanes and paths, and 129 new traffic lane additions, the results of an exploratory data analysis of the currency of OSM will be presented. The summary and descriptive statistics of a reasonably large sample will provide insights into the currency of OSM and the dynamics of temporal accuracy. Lastly, limitations of the experiment will be discussed. These include the reference dataset, which does not contain federally or locally funded projects and therefore misses a large number of constructions, and the methodology, which cannot capture the diversity of the OSM community and disregards changes beyond the transportation infrastructure.
This experiment is a first attempt to investigate the timeliness and currency of Volunteered Geographic Information using large sets of data. Future work will conduct analyses using more VGI data sources outside the domain of mapping applications (e.g. Points of Interest in check-in trackers and review applications) and a new methodology using tile-reduce, OSM QA tiles and vector tiles built from other datasets. The new methodology will be scalable and will allow for analysis across world regions. Furthermore, a rule-based decision approach based on tags and semantics will be used to eliminate the need to manually check whether VGI updates correspond to the reference dataset.
false
https://pretalx.com/sotm2021-academic/talk/LY9Z8C/
https://pretalx.com/sotm2021-academic/talk/LY9Z8C/feedback/
Track 2 - Panels and Workshops
A proposal for a QGIS Plugin for Spatio-temporal analysis of OSM data quality: the case study for the city of Salvador, Brazil
Academic Talk
2021-07-11T15:00:00+00:00
15:00
00:20
A proposal for a QGIS plugin for spatio-temporal analysis of OSM data quality, with a case study in Salvador, Brazil.
sotm2021-academic-10424-a-proposal-for-a-qgis-plugin-for-spatio-temporal-analysis-of-osm-data-quality-the-case-study-for-the-city-of-salvador-brazil
Elias Nasr Naim Elias
en
The development of methodologies to evaluate geospatial data quality is one of the most important aspects to consider when obtaining such data. In developing countries such as Brazil, the lack of investment in maintaining topographic mapping, especially at large scales, is a recurrent challenge for National Mapping Agencies (NMAs) [1]. For example, studies reveal areas in Brazil that have never been mapped, and that topographic mapping at the 1:25,000 scale covers barely 5% of the country's extent [2].
Technological advances have enabled a range of methodologies for obtaining geospatial data [3]. One example is Volunteered Geographic Information (VGI) [4], where information may be updated faster and at lower cost than in traditional topographic mapping structures [5]. A successful case of VGI is the OpenStreetMap (OSM) platform, which continues to grow in both contributors and contributed features. To understand the behaviour of OSM features and their potential for integration into topographic mapping, surveys worldwide have evaluated its quality through extrinsic [6, 7] or intrinsic [8] aspects. Some studies combine both, such as [9], which evaluated the positional precision of OSM based on its edit history. More recently, research has focused on the spatial and temporal aspects of OSM contribution events [10], as well as on developing add-ons for evaluating data quality, such as [11], a QGIS toolbox for evaluating intrinsic quality parameters of OSM features.
The literature identifies data heterogeneity as one of the main challenges for integration processes, since quality may vary with the study area, the indicator used, or even spatial variation over time within the same region. In this context, to understand how OSM resources fit into topographic mapping, it is crucial to connect aspects of data quality and heterogeneity. Studies such as [1] argue that, depending on the quality obtained, resources from VGI may be used for integration, change detection, or error reporting. Therefore, classifying OSM resources according to their usability in a given region becomes essential, especially in developing countries like Brazil. Moreover, research exploring quality, heterogeneity, and contribution patterns of OSM is still not widespread in developing countries [12].
Given the importance of classifying OSM features according to their usability in a given region, and the scarcity of research exploring OSM quality, heterogeneity, and contribution patterns in Brazil, we hypothesise that understanding the extrinsic and intrinsic quality of OSM features, together with the spatiotemporal aspects of contributions in developing countries, will support decision making about how the dynamics of feature insertion affect quality.
Thus, this research aims to evaluate the extrinsic quality of OSM features for the municipality of Salvador, Bahia, Brazil (in the northeast of the country). We investigated indicators of positional accuracy, thematic accuracy and completeness, the visualisation of data heterogeneity, and the edit history. For the extrinsic evaluation, OSM features were compared to the country's topographic mapping through the Cartographic and Cadastral System of the Municipality of Salvador (SICAD, 2006) and features from the Urban Development Company of the State of Bahia (CONDER).
The analysis of positional and thematic accuracy was carried out through feature sampling; the analysis of completeness, by comparing the total number of available features. The verified categories were road network features and religious, educational, and health buildings. We divided the municipality of Salvador into sub-regions to identify different local quality patterns in the analysis of thematic accuracy and completeness. The heterogeneity of the data can be visualised through a plugin developed for QGIS, which performs the planimetric positional evaluation for point and line features. The statistical procedures behind the plugin follow the Brazilian standard for evaluating geospatial data quality [13] and the double-buffer method proposed by [14]. The plugin is available in an online repository: https://github.com/eliasnaim/AcuraciaPosicional_PEC-PCD. Even though the final results reflect aspects of Brazilian law, the procedure can be replicated to obtain discrepancies and make subsequent adjustments. We used the OHSOME Application Programming Interface (API) to identify patterns in the OSM editing history. By adapting scripts provided by researchers linked to OHSOME, we identified aspects of OSM contributions between 2008 and 2020. We also tested the generation of regression curves and calculated the number of daily contributions to identify these patterns. These verifications were performed within a 5x5 km bounding rectangle in the study area, positioned through visual analysis over the area with the largest quantity of OSM features.
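As a rough illustration of the buffer-based positional evaluation, the sketch below uses a simplified single-buffer inclusion test (a reduced variant of the double-buffer idea in [14]); the geometries and the `fraction_within` helper are hypothetical, not the plugin's actual code.

```python
from shapely.geometry import LineString

def fraction_within(osm_line, ref_line, tolerance_m):
    """Fraction of the tested OSM line that falls inside a tolerance
    buffer around the reference line (1.0 = fully within tolerance)."""
    buf = ref_line.buffer(tolerance_m)
    return osm_line.intersection(buf).length / osm_line.length

ref = LineString([(0, 0), (100, 0)])   # reference road axis (metres)
osm = LineString([(0, 3), (100, 3)])   # OSM trace, offset by 3 m
print(fraction_within(osm, ref, 5.0))  # 1.0: fully inside the 5 m buffer
```

In the real evaluation, the reference geometry would come from the SICAD/CONDER mapping and the tolerance from the Brazilian accuracy standard.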
The extrinsic evaluation highlighted variability relative to the results obtained in [15]. For positional accuracy, the equivalent scale varied from 1:20,000 to 1:30,000, while discrepancies between mapped and reference coordinates ranged from 0.12 m to 10.27 m. For completeness, the road network reached 82%, while the other feature categories varied from 29% to 46%. For thematic accuracy, the primary source of error was the absence of names in edits. The growth history of represented features followed a near-linear function, with an R² of 0.94. This suggests that contribution patterns can be modelled and associated with the saturation level of elements added in a particular area. We also observed that collaboration patterns can be affected by external variables: in 2016, more than 800 features were added in a short period, likely related to events such as data imports or mapathons.
The development of add-ons for evaluating OSM data quality, ranging from statistical procedures to the visualisation of data heterogeneity, will assist decision-making about data quality.
The magnitude of discrepancies did not present clear patterns and may vary with the editing period and the database used for contributions. This underlines the relevance of identifying quality and heterogeneity aspects in OSM contributions.
For Brazil, identifying these characteristics may numerically indicate the potential for integrating these data into the authoritative mapping, and will help estimate the influence of unusual agents, such as data imports, on contributions. Further studies are recommended to identify the causes of different growth patterns and to automate the quality procedures.
false
https://pretalx.com/sotm2021-academic/talk/3MTGA3/
https://pretalx.com/sotm2021-academic/talk/3MTGA3/feedback/
Track 2 - Panels and Workshops
Introducing OpenStreetMap User Embeddings: Promising Steps Toward Automated Vandalism and Community Detection
Academic Talk
2021-07-11T15:45:00+00:00
15:45
00:20
We develop and test user embeddings approaches to vandalism detection in OSM. We successfully demonstrate improvements to previous vandalism detection methods, and additionally how the user embeddings can further be applied to detect different communities of mappers. We validated the embedding model with a prepared vandalism corpus that we are also releasing to the OSM community.
sotm2021-academic-10188-introducing-openstreetmap-user-embeddings-promising-steps-toward-automated-vandalism-and-community-detection
Yinxiao Li, Jennings Anderson
en
With more than 11B edits from 1.6M unique mappers and openly editable by anyone, the OpenStreetMap (OSM) database inevitably contains vandalism. Our approach to detecting it leverages the analytical power and scalability of machine learning through OSM user embeddings. Embeddings are effective in capturing semantic entity similarities that are not explicitly represented by the data. Since word embeddings were first introduced based on the assumption that words adjacent to each other share similar meanings [1,2], the concept of embeddings has been extended beyond word representations to any entity, so long as one can produce a meaningful sequence of the entities. Therefore, we build OSM user embeddings with mappers as entities by constructing sequences of mappers based on shared editing histories and similar behaviors.
**Methods**
_Creating a Vandalism Corpus_
Development of automated vandalism detection methods in OSM has been slow in part because there is no published corpus of bad or vandalized edits on which to train and validate [3]. Vandalized name attributes are especially problematic because this text is rendered on the basemap. The most infamous instance of this type of vandalism was the changing of "New York City" to an ethnic slur; this name attribute was subsequently rendered on maps drawing from OSM data [4]. As part of this work, we construct and make available the first OSM vandalism corpus for the name attribute of OSM features. Potential examples of vandalism are collected from the OSM Changeset Analyzer (OSMCha) web-based validation tool. These records are then manually reviewed by the Facebook mapping team to identify egregious name changes. Negative samples (non-vandalism) were randomly sampled from a previously validated vandalism-free snapshot of OSM. All of our examples are extracted from OSM data only, with no external or conflated data sources.
_User Embeddings_
To construct meaningful sequences of OSM users where adjacent users share similar mapping patterns, we analyzed the edit history of every OSM object and the temporal/semantic editing patterns of individual mappers. These sequences were then fed into a word2vec skip-gram model to train OSM user embeddings.
**Shared object editing histories** are sequences of OSM users who have edited the same object, in chronological order of editing. These sequences represent mappers who share interest in the same objects on the map. This yields 2B sequences of mappers.
**Semantic and temporal mapping patterns** are sequences of OSM users that have shared editing characteristics with regard to how and when they edit the map. Starting with _changesets_, we extract the following keys for each OSM element edited in a given changeset when present: `addr:country`, `admin_level`, `amenity`, `building`, `highway`, `natural`, `place`, `source`. Additionally, we extract the following metadata: the presence of `name` tag, the `version` number, the editing software (e.g. iD editor, JOSM), and any hashtags (possibly denoting specific mapping campaigns). Finally, we group all of these edits by two types of temporal patterns: first, the date of the changeset, and second, the hour of the week of the changeset, per year (with 168 hours in a week, we aggregate across each _week-hour_ in a given year). This yields 30M sequences of mappers.
**Results**
_Community Detection_
OSM is comprised of many distinct groups of mappers; considering each of these groups a different sub-community makes OSM a "community of communities" [5]. The creation of the temporal and semantic editing patterns was specifically designed to create sequences of mappers with high likelihood of belonging to the same community. One type of easy-to-identify communities are corporate editing teams: groups of employees that are paid to edit OSM [6]. Results of corporate editing team detection can be easily validated against published lists of known editors.
The five largest corporate mapping teams are Apple (>1,200 mappers), Amazon (>700), Grab (>550), Facebook (>250), and Kaart (>200). These counts are based on extracting affiliation from a mapper’s OSM user profile, looking for sentences such as “I work for Amazon" and are likely an under-representation [7].
To validate the model’s ability to identify members of an editing team based on editing semantics, we used cosine similarity to compare users. First, we identified the 100 most similar users to the _top 10 most active mappers_ in each company (by number of changesets). Next, we confirmed how many of the top 100 most similar users are also on that team. This is a measure of recall for our model.
Amazon is the most identifiable team, with all 100 of the most similar editors also belonging to the Amazon Logistics data team. The mean cosine similarity (`mcs`) among these 100 mappers is 0.98. Apple is the second most identifiable with 97% of the top 100 most similar mappers also belonging to the Apple data team and an `mcs` between the top 10 and these 97 users of 0.94. Third was Kaart, with 96% and `mcs=0.88`. Facebook was fourth with 87% and `mcs=0.87`. The Grab data team, however, was more difficult to identify: only 68% of the top 100 most similar mappers were also part of the Grab data team. The `mcs` between these 68 mappers and the top 10, however, is high at 0.94.
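The recall check above can be sketched on toy embeddings as follows (the synthetic vectors, team sizes and helper names are illustrative only):

```python
import numpy as np

# Toy embeddings: 20 "team" mappers clustered around a shared center,
# plus 20 unrelated mappers with random vectors.
rng = np.random.default_rng(0)
center = rng.normal(size=16)
emb = {f"team{i}": center + 0.1 * rng.normal(size=16) for i in range(20)}
emb.update({f"other{i}": rng.normal(size=16) for i in range(20)})
team = {u for u in emb if u.startswith("team")}

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_similar(seed, k):
    scores = {u: cos(emb[seed], v) for u, v in emb.items() if u != seed}
    return sorted(scores, key=scores.get, reverse=True)[:k]

top = top_k_similar("team0", 19)
recall = len(set(top) & team) / 19  # fraction of neighbours on the team
```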
_Vandalism Detection_
To detect vandalism, we train a Gradient Boosting Decision Tree (GBDT) model on metadata, user reputation, object history, and content features. We applied OSM user embeddings to this model by creating two embedding features, `kmeans_cluster` and `cos_sim_last_5_users`. To create `kmeans_cluster`, we ran k-means clustering on OSM users, assigned a cluster to any user with an embedding, and encoded the cluster based on the average number of edited changesets within it. The idea behind `cos_sim_last_5_users` is that users who are similar to each other are more likely to edit the same objects. For an edit to an OSM object, we compute the cosine similarity between the user responsible for the edit and the previous five mappers that edited the object.
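A hypothetical sketch of the `cos_sim_last_5_users` feature; the mean aggregation over the five similarities is an assumption, as the text does not specify how they are combined:

```python
import numpy as np

def cos_sim_last_5_users(user_vec, prev_vecs):
    """Mean cosine similarity between the editing user's embedding and
    the embeddings of the previous (up to) five editors of the object."""
    if not prev_vecs:
        return None  # new object: no editing history, feature is missing
    sims = [
        float(np.dot(user_vec, p)
              / (np.linalg.norm(user_vec) * np.linalg.norm(p)))
        for p in prev_vecs[-5:]
    ]
    return sum(sims) / len(sims)

u = np.array([1.0, 0.0])
history = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
feature = cos_sim_last_5_users(u, history)  # (1.0 + 0.0) / 2 = 0.5
```

The `None` branch mirrors the coverage gap discussed below: edits that create new objects have no history, so the feature is simply absent.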
Next, we trained a new model with the embedding features injected and saw a relative improvement of 1.3% in our primary metric, the area under the receiver operating characteristic curve (AUC-ROC). The feature importance of `kmeans_cluster` ranks as high as 2/49, with a coverage of 99.9%, while `cos_sim_last_5_users` has an importance rank of 16/49, largely due to a relatively low coverage of 64%: the majority of edits in OSM create new objects, which have no editing history.
Because of the AUC improvements and high feature importance, Facebook has deployed this model in production to detect vandalism, as a part of the data validation in the Facebook Map and Daylight Map, a validated, vandalism-free distribution of OSM [8].
_Vandalism Corpus_
The accurately labeled dataset of vandalism to named elements in OSM is a tremendous asset to researchers hoping to further the work of automated vandalism detection. As part of the continual quality-assurance work at Facebook, teams of professional mappers are consistently labeling and improving this running list. As part of this work, we are publishing this fully labeled vandalism corpus for others in the OSM research community to use [9].
false
https://pretalx.com/sotm2021-academic/talk/9XQTVC/
https://pretalx.com/sotm2021-academic/talk/9XQTVC/feedback/
Track 2 - Panels and Workshops
An Automated Approach to Identifying Corporate Editing Activity in OpenStreetMap
Academic Talk
2021-07-11T16:30:00+00:00
16:30
00:20
The rise of organized editing practices in the OpenStreetMap community has outpaced research methods for identifying mappers participating in these efforts and evaluating their work. This research uses machine-learning to improve upon prior approaches to estimating corporate editing on OSM, contributing both a novel methodology as well as summary statistics that shed light on corporate editing behavior in OSM.
sotm2021-academic-10425-an-automated-approach-to-identifying-corporate-editing-activity-in-openstreetmap
Veniamin Veselovsky
en
In the past five years, the OSM community has seen a dramatic rise in organized editing on the platform, including corporate, humanitarian, and educational efforts. These new actors have continued the ongoing debate surrounding OSM’s relationship with organized editing, with new rules and best practices being implemented to align the interests of the organizations with those of the community.
We became interested in studying how the editing habits of these new actors differ from those of the community as a whole, but were quickly confronted by the challenge of producing accurate measures of their activities. In this paper we aim to fill this gap by creating computational methods for understanding different editing behaviours on OSM and classifying editors as corporate or volunteer. Classifying individual editors has been done in the past at a more local level, for example in a recent analysis of editing in Mozambique [1].
Studying corporate editing behaviour first requires a list of corporate editors. In the past, researchers have searched individual “organized editing team” webpages. Instead, our paper presents a novel method for classifying users on the platform by scraping user profiles. There are two possible approaches to extracting corporate mappers from user profiles. The first clusters the keywords within the profiles. Though effective at uncovering relations between users (such as students, programmers, Garmin editors, or Colorado mappers), this method failed to capture all known corporate groups. Instead, we performed a keyword search for corporations listed on the Organized Editing List and grouped similar users together. We then divided this list into corporate and non-corporate. This simplification was done to align with past research into corporate editing [2].
Using this extracted list, we discern features that could act as “signals” for organized editors. Specifically, which changeset features point to an editor being corporate or volunteer? Do corporate editors edit specific types of items? Do their time series signatures differ?
For the creation of these features, we relied on Jennings Anderson’s past work on corporate editing for inspiration [2]. The first set of features came from OSM changeset metadata, which is rich with descriptive data such as the editor used, comments, and source. We find that most organizations use editors like JOSM and iD. Next, we attempted to model which objects corporations edit by finding descriptive words like “service”, “road”, and “building” in changeset comments. We observed that most corporations focus on services and roads, as opposed to buildings, which tend to be dominated by volunteer mappers.
The third feature was motivated by the observation that as the interests of a corporation change, the editing of its mapping team can change as well. This has led to the well-documented phenomenon of corporate mappers having a geographically dispersed editing pattern, markedly different from many volunteer mappers, who often begin by mapping their local neighbourhoods. Using established metrics, we calculated the geographic dispersion of each user based on the latitude and longitude of their edits.
The metric we found most effective was the time series signature. Corporations have a traditional 9-to-5 mapping schedule, whereas non-corporate mappers tend to map far more haphazardly, including significant mapping on weekends. When attempting to convert the time series signature into a usable metric, we came across a problem: time zones. All changesets in OSM are normalized to UTC, meaning that a user editing at 8am in Toronto, Canada and another editing at 8pm in Beijing, China would appear to be editing at the same time in OSM. Longitude and latitude data are not an effective way to extract a mapper's time zone, since editing on OSM is increasingly done remotely, through “armchair mapping”.
To utilize this strong signal, we developed a new method for normalizing a user's time signature, based on the observation that individual corporations have several key editing patterns depending on where their employees are located. For example, Facebook has two such patterns, displaced from each other by around 8 hours. This motivated us to create a “corporate editing signature” and translate each corporate signature to find the minimal distance between the two. After this adjustment, we were able to significantly improve the alignment of the time series; in other words, we were able to recover the local time zone of most of these corporate editors. Figure 1 illustrates corporate mappers before and after adjustment.
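One way to implement this realignment is a circular shift search over a week-hour histogram. The sketch below assumes a 168-bin histogram and an L2 distance (both assumptions), with a synthetic 9-to-5, Monday-to-Friday signature standing in for the paper's corporate editing signature:

```python
import numpy as np

HOURS_PER_WEEK = 168

# Synthetic "corporate editing signature": weekday 09:00-17:00 activity.
corporate_sig = np.zeros(HOURS_PER_WEEK)
for day in range(5):                                   # Monday..Friday
    corporate_sig[day * 24 + 9 : day * 24 + 17] = 1.0  # 09:00-17:00

def best_shift(user_hist, signature):
    """Circular shift (in hours) minimizing the distance between the
    user's week-hour histogram and the signature, i.e. the recovered
    UTC offset of that user."""
    dists = [np.linalg.norm(np.roll(user_hist, -s) - signature)
             for s in range(HOURS_PER_WEEK)]
    return int(np.argmin(dists))

# A user with the same 9-to-5 pattern, recorded 8 hours later in UTC:
user = np.roll(corporate_sig, 8)
shift = best_shift(user, corporate_sig)  # recovers the 8-hour offset
```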
Figure 1. This plot shows how corporate time zones were recovered after minimizing distance between corporate actors and a “corporate mapping signature”.
Once we realigned each user with this method, we calculated the distance between a user's adjusted time signature and the “corporate signature”. This feature ended up acting as a key determinant of the likelihood of a given editor being corporate: all of the top 100 editors (those with the smallest distance to the corporate signature) belonged to corporations.
Using these features, we predict whether an editor is corporate. We experimented with several classification algorithms, including logistic regression, k-nearest neighbours, support vector machines, and neural networks. The four most important features in the prediction task, ordered by impact on the model, were the geo-score, the time series score, the first edit date, and the editor type. All models provided comparable results, offering a high recall of 96%+ and predicting between 700 and 2,000 additional corporate mappers. Examining the newly predicted mappers reveals users who map for humanitarian groups like HOT, corporate mappers that the initial scrape did not pick up, corporate mappers who reveal their association only in hashtags, users who are likely corporate mappers but cannot be confirmed, and volunteers. We remove any predicted mappers with known humanitarian associations, as these users are beyond the scope of this paper. We are now further validating the different models against a manually annotated set of users that any of the models predicted to be corporate, aiming to find the model that predicts the most corporate mappers and the fewest volunteers.
References
[1] Madubedube, A., Coetzee, S., & Rautenbach, V. (2021). A Contributor-Focused Intrinsic Quality Assessment of OpenStreetMap in Mozambique Using Unsupervised Machine Learning. ISPRS International Journal of Geo-Information, 10(3), 156. MDPI AG. Retrieved from http://dx.doi.org/10.3390/ijgi10030156
[2] Anderson, J., Sarkar, D., & Palen, L. (2019). Corporate Editors in the Evolving Landscape of OpenStreetMap. ISPRS International Journal of Geo-Information, 8(5), 232. MDPI AG. Retrieved from http://dx.doi.org/10.3390/ijgi8050232
false
https://pretalx.com/sotm2021-academic/talk/XXFEXQ/
https://pretalx.com/sotm2021-academic/talk/XXFEXQ/feedback/
Track 2 - Panels and Workshops
Involvement of OpenStreetMap in European H2020 Projects
Academic Talk
2021-07-11T17:15:00+00:00
17:15
00:20
During the past decades, the European Commission has invested billions in research through various programmes, such as H2020. In this study, we exhaustively review all the open H2020 deliverables to analyse how these public European projects rely on OpenStreetMap.
sotm2021-academic-10397-involvement-of-openstreetmap-in-european-h2020-projects
Damien Graux, Thibaud Michel
en
Since 1984, the European Commission has been supporting research through successive programmes. Most recently, from 2014 to 2020, the EU invested approximately 80 billion euros in its eighth programme, named Horizon 2020 [1]. Among various focuses such as scientific excellence and industrial secondments, H2020 emphasised an open-access policy for all research results [2]. Moreover, H2020 projects were strongly encouraged to use open source software and tools.
Practically all research domains were eligible for support under the H2020 programme, so the scopes of the projects range from computer science to philology to agriculture. As these projects almost always involve several partners from multiple institutions located in several European member states, there is often a need to deal with data coming from different places. More generally, geo-data are often involved in tagging information, whether research data, meeting locations, partner addresses, etc.
In such a context where open source tools are recommended by the European Commission, we analyse the presence of OpenStreetMap in H2020 projects. In addition, we also review the presence of other geographic services such as Google, Bing and Baidu maps, in order to better understand how researchers tend to choose one over the other.
Thanks to the open access policy, participants of the H2020 projects had to make their results available. To do so, their various types of materials were submitted to the European portal which then offers them publicly. As a consequence, for each project, one can access the articles (through DOIs), the blog posts, the slide decks, the deliverables… In particular, in our study, we decided to focus on the deliverables as they are accessible on the EC portal directly and are the common reports written by the partners to describe their approaches. Indeed, these deliverables (usually written on a regular basis during the project) report on the findings and methodology set up to achieve the project’s goals and authors explain their architectural choices in depth such as describing the tools used. As a consequence, cartographic services, if involved at some stage in the project, are likely to be mentioned in these documents either as acronyms (e.g. OSM) or as website references (e.g. https://www.openstreetmap.org/).
To obtain the deliverables together with project information, we combined two European sources covering the facets we wanted: CORDIS [3] and Data.Europa [4]. From CORDIS we extracted high-level information about the projects themselves, from their names, acronyms and durations to the specific European calls for funding from which they obtained their money. This latter category is useful for a finer-grained understanding of the domains prone to involve cartographic services. Data.Europa was then used to download the deliverables themselves, which required several days of computing resources.
Overall, during the course of the H2020 programme, 33636 projects were funded by the European Commission. Depending on the type of action set for a project, not all of them had open deliverables written (and thereby available on the Europa platform). In fact, a large part of these projects produced no deliverables per se, but rather articles or web posts. We counted 25157 projects without deliverables, which restricted our study to the remaining 8479 projects. For these, we listed a total of 92612 distinct deliverables to be analysed, representing more than 260GB.
Technically, once all these deliverables were downloaded, we searched them for various terms to determine whether cartographic services are mentioned in the text. We set up several regex rules (e.g. 'open.?street.?map' or '[^a-z0-9]osm[^a-z0-9]') which were run over the 92000+ deliverables. This allowed us to systematically count all the occurrences of the considered cartographic solutions. In the end, we found that 1840 deliverables (from 651 projects) mention OpenStreetMap. More precisely, across all the H2020 deliverables, there are approximately 18600 mentions of OSM, 2800 of GoogleMaps, 226 of BingMaps and 4 of BaiduMaps. Empirically, we notice that 1) one order of magnitude separates the occurrences of each cartographic service and 2) OpenStreetMap is by far the most represented solution, and thereby the one on which public European researchers rely the most. It is also interesting to note that not all of the 1796 deliverables mentioning “point of interest” refer to a cartographic service.
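This counting step can be sketched as follows. The two OSM patterns are the ones quoted above; the patterns for the other services, and the assumption that each deliverable has already been extracted to a plain-text file, are ours:

```python
import re
from pathlib import Path

# Case-insensitive patterns per service; the OSM entries follow the
# regexes quoted in the text, the others are illustrative assumptions.
PATTERNS = {
    "OpenStreetMap": [r"open.?street.?map", r"[^a-z0-9]osm[^a-z0-9]"],
    "GoogleMaps": [r"google.?maps"],
    "BingMaps": [r"bing.?maps"],
    "BaiduMaps": [r"baidu.?maps"],
}

def count_mentions(text):
    """Return the number of (non-overlapping) matches per service in one deliverable."""
    text = text.lower()
    return {
        service: sum(len(re.findall(p, text)) for p in patterns)
        for service, patterns in PATTERNS.items()
    }

def scan_corpus(folder):
    """Aggregate mention counts over a folder of extracted deliverable texts."""
    totals = {service: 0 for service in PATTERNS}
    for path in Path(folder).glob("*.txt"):
        for service, n in count_mentions(path.read_text(errors="ignore")).items():
            totals[service] += n
    return totals
```

The `[^a-z0-9]` guards around `osm` avoid false positives from words that merely contain the letters "osm" (e.g. "cosmos").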
Moreover, we also analysed co-occurrence cases, where different cartographic providers are jointly mentioned within a single deliverable. There are notably few of them. Only 59 deliverables mention both OSM and BingMaps, over the 226 occurrences of the latter; and only 291 deliverables mention both OSM and GoogleMaps, over the 2800 occurrences of GoogleMaps. Besides, only 39 deliverables mention OSM, GoogleMaps and BingMaps together. Such figures suggest that once a group of researchers has chosen a cartographic solution, they tend to stick with it rather than comparing alternatives.
Furthermore, regarding OpenSeaMap, we counted 312 mentions in 27 deliverables, of which 20 mention both OSM and OpenSeaMap, showing how connected the two initiatives are.
In this study, we systematically analysed all the available H2020 deliverables, searching for references to cartographic services, with a specific focus on OpenStreetMap. Our efforts show that OSM is the most used cartographic service in European H2020 projects in terms of mentions in the deliverables' texts, followed by GoogleMaps with one order of magnitude fewer mentions. It is worth noting that the projects involving OSM were backed by almost 4 billion euros of public money.
Based on these first interesting results, we plan to extend our scope of analysis along three axes. First, it could be worth also reviewing the other types of project results, such as the articles or the software source code bases. Second, we hope our approach paves the way for similar reviews of publicly funded initiatives, and we plan to apply our scripts to other European funding programmes. Third, additional cartographic services could be integrated into our pipelines, such as Apple Plans or other OSM-related initiatives like OpenCycleMap, in order to extend the covered scope.
Finally, for reproducibility purposes, we share in a public GitHub repository [5] all the scripts necessary to download the deliverables and generate the statistics. Furthermore, https://dgraux.github.io/OSM-in-H2020 provides the reader with additional detailed analyses and visualisations, which we hope will help the community better understand the impact of OSM within the public European research landscape.
false
https://pretalx.com/sotm2021-academic/talk/DGPAWN/
https://pretalx.com/sotm2021-academic/talk/DGPAWN/feedback/