Named Entity Recognition (NER) plays a crucial role in many Natural Language Processing and Information Retrieval tasks, such as document search, clustering and information extraction. The task is especially challenging on tweets because of their noisy nature: non-standard spelling and grammar, code-switching, and informal or unstructured text. Given the multilingual nature of Twitter, the lack of resources for Scandinavian languages poses additional challenges. In this project the students will explore the current challenges of performing NER on Norwegian tweets. The main tasks are to compile a corpus of Norwegian tweets annotated with their entities and then to apply different machine learning algorithms to train the recognizer. One of the approaches to be explored is to use entity linking against knowledge bases such as Wikipedia to find the entities in these tweets.
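As a rough illustration of the knowledge-base idea, the sketch below tags tweet tokens by looking them up in a small dictionary of entity names and emits BIO labels, a common annotation format for NER corpora. Everything in it is an assumption made for illustration: the GAZETTEER dictionary is a tiny hand-made stand-in for Wikipedia article titles, the label set (PER, LOC, ORG) is hypothetical, and the function names tokenize and tag_tweet are invented here. It is not the method the project prescribes, only one simple baseline a trained recognizer could be compared against.

```python
import re

# Hypothetical, hand-made gazetteer standing in for Wikipedia article titles.
# Keys are lowercased token tuples; values are illustrative entity labels.
GAZETTEER = {
    ("oslo",): "LOC",
    ("erna", "solberg"): "PER",
    ("rosenborg",): "ORG",
}
MAX_ENTITY_LEN = max(len(key) for key in GAZETTEER)


def tokenize(tweet):
    """Split a tweet into word tokens, keeping a leading # or @ attached."""
    return re.findall(r"[#@]?\w+", tweet)


def tag_tweet(tokens):
    """Return one BIO tag per token using longest-match gazetteer lookup."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(MAX_ENTITY_LEN, len(tokens) - i), 0, -1):
            # Normalise: drop hashtag/mention markers and lowercase.
            key = tuple(t.lstrip("#@").lower() for t in tokens[i:i + n])
            label = GAZETTEER.get(key)
            if label is not None:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


if __name__ == "__main__":
    tokens = tokenize("Erna Solberg besøker #Oslo i dag")
    for token, tag in zip(tokens, tag_tweet(tokens)):
        print(token, tag)
```

A pure dictionary lookup like this cannot resolve ambiguous names or find entities missing from the knowledge base, which is where the machine learning approaches mentioned above would come in.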
The project is part of the SmartMedia program at IDI. SmartMedia collaborates with the Norwegian media industry and investigates the use of semantics and linked data in large-scale, real-time news recommendation.
The project is supervised by Cristina Marco and Jon Atle Gulla.