Extracting Geographic Information from Social Media Data: An Approach Using NER with Colombian Spanish
Over the past decade, interest has grown in exploring the large amount of data that users around the world generate on social media. By some estimates, social networks have around 4.95 billion users, who produce huge quantities of information that can be used for research in geography and the spatial humanities. Studies of data from Twitter, Flickr, Reddit, TripAdvisor and other popular social networks are common in academic research, especially in fields such as Natural Language Processing (NLP), which has itself grown rapidly in the past few years. A source of information this large opens many lines of work, and geography and the spatial humanities should not be left out of them: because the information is generated by people, it lends itself to the study of many different topics.
Understanding geographic data from social media has become an important goal for researchers in recent years. But how can geographic information be obtained from sources that consist mainly of text and pictures? This research tries to answer that question. Its main goal is to make social media data a source of geographic information that can be used in research and decision making.
There have been several approaches to this question using social media data. For example, some researchers have used geotagged photos from Instagram and Flickr to find spatial patterns of sentiment. Another approach applies Named Entity Recognition (NER) to the text of TripAdvisor reviews. For Twitter data, there have been three principal approaches: 1) using the metadata of the tweet (so-called geo-tagged tweets); 2) inferring the geographic location of the tweet from a combination of metadata, profile data and predictions based on the language of the text, in order to estimate where the tweet originated; and 3) one of the most common approaches, using techniques such as NER.
Except for a few studies that work with data from Indonesia, China and India and focus on the local languages, most work on this task has been done in English, or has translated the words from the original language into English with automated translation methods. In this context, there is a need for models that can be trained to apply NER to Spanish, and specifically to the Spanish of Colombia. Testing such models on Spanish-language tweets from Colombia could contribute to the growth of NER work focused on identifying locations in short texts such as tweets. Furthermore, NER is a general task that covers named entities such as names, locations, roles and organizations; in this case, the main focus is on locations.
To achieve that goal, we explore supervised NER models. We first test some of the available ones, such as Stanford NER and the NER component of the spaCy library, and compare them with the results of a supervised NER model trained on Colombian Spanish and Colombian toponyms. In this way we can see how much the recognition of locations improves in this specific case. By comparing the methodological approaches and generating the corresponding models, we argue that a NER model for Colombian Spanish can be a significant contribution to several fields: 1) research on NER tasks within the scientific community interested in NLP and 2) the spatial humanities and geography communities and institutions, which gain another large geographic information resource for further research and decision making toward a geospatial understanding of the world.
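One way such a comparison between models could be scored is entity-level precision, recall and F1 over the extracted locations. The following is a minimal sketch under stated assumptions, not the evaluation used in this work: the function names, the (tweet id, location text) representation and all of the sample data are hypothetical and invented for illustration only. It uses no NER library; it only scores location sets that some model has already produced.

```python
from typing import Dict, Set, Tuple

# Hypothetical representation: one extracted location = (tweet_id, location_text).
Entity = Tuple[int, str]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so 'Bogotá ' and 'bogotá' compare equal."""
    return " ".join(text.lower().split())


def score_locations(gold: Set[Entity], predicted: Set[Entity]) -> Dict[str, float]:
    """Exact-match, entity-level precision, recall and F1 for location mentions."""
    gold_n = {(tweet_id, normalize(text)) for tweet_id, text in gold}
    pred_n = {(tweet_id, normalize(text)) for tweet_id, text in predicted}
    true_positives = len(gold_n & pred_n)
    precision = true_positives / len(pred_n) if pred_n else 0.0
    recall = true_positives / len(gold_n) if gold_n else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented toy annotations (not real results): locations a human marked in three tweets.
gold_locations = {(1, "Bogotá"), (2, "Medellín"), (2, "Parque Lleras"), (3, "Cali")}

# Invented outputs of two hypothetical models on the same three tweets.
generic_output = {(1, "Bogotá"), (3, "Cali"), (3, "Santiago")}
local_output = {(1, "bogotá"), (2, "Medellín"), (2, "Parque Lleras"), (3, "Cali")}

generic_scores = score_locations(gold_locations, generic_output)
local_scores = score_locations(gold_locations, local_output)

if __name__ == "__main__":
    print("generic model:", generic_scores)
    print("local model:  ", local_scores)
```

Exact matching after normalization is a deliberately strict choice; a real evaluation might also give credit for partial overlaps, such as "Lleras" against "Parque Lleras".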
This work is part of a larger effort to understand geographical space in Colombia through data found in texts (short texts, in the case of Twitter) processed with NLP. Testing NER with Colombian Spanish to extract geographic locations is one of the first steps of that effort. Future work will therefore explore unsupervised approaches such as topic modeling and, finally, try to summarize the extracted locations alongside topic and sentiment analysis of the tweets, all as a contribution to the spatial humanities and digital humanities.