Department of Computer and Information Science


Avoiding biases in building training datasets for supervised algorithms in NLP

RQ: How can we build a training dataset that is unbiased with respect to vocabulary (the words used), sentiment (positive and negative connotation) and topic, for the purpose of training a natural language processing engine to extract keywords or perform topic modelling (optimizing for TF-IDF and LDA)?

The problem domain is related to supervised machine learning, topic modelling and keyword extraction in the Natural Language Processing domain.

The students will have access to a dataset of a couple of hundred annotated data points. Their goal will be to analyze the dataset, form smaller datasets of examples, and answer the following sub-RQs:

What key properties should each data point have in order to be put into the training dataset?

What should the relationship between the data points be, and how can we automate the process of selecting a data point for a dataset, without manual labor and with minimal risk of bias?

Iris AI will provide guidance on Python libraries that could be used for the experiments.
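
As a possible starting point for such experiments, here is a minimal sketch of TF-IDF keyword scoring in plain Python. The function name `tfidf_keywords`, the sample documents and the naive whitespace tokenization are illustrative assumptions and are not part of the Iris AI dataset or tooling.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest-scoring TF-IDF terms for each document.

    Hypothetical helper for illustration only; tokenization is a naive
    lowercase whitespace split.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Score = relative term frequency * log(inverse document frequency).
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score, then alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_k])
    return results


# Made-up sample documents, not taken from the annotated dataset.
sample_docs = [
    "solar cells convert light into energy",
    "wind turbines convert wind into energy",
    "neural networks learn from data",
]
keywords = tfidf_keywords(sample_docs, top_k=2)
```

In this sketch, a term that appears in every document gets a score of zero, so a training set dominated by one vocabulary would suppress exactly the words that characterize it. This is one way the composition of the dataset can skew keyword extraction.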

Your profile
Coding experience: Ideally Python, but C or Java experience could be accepted too. Experience in Machine Learning: Preferably courses taken in Machine Learning and Big Data (or Data Science).

We are a startup company, and we’d love to engage with students who are at least intrigued by the idea of working with a small, fast-moving team with big ambitions.

Solid English skills are expected. Where you write your thesis from is irrelevant, as we’re already a distributed team and work mainly on Hangouts and Slack.

If we like each other and Iris AI progresses well, there will be opportunities after completion of your master's thesis.

About Iris AI
Iris AI is an Artificial Intelligence that will read all of the world's research and help us connect the dots.

Our first goal is an AI assistant to help tech entrepreneurs and innovators navigate the world of science. It addresses the problem that the massive amount of knowledge we have is impossible to navigate for someone who does not have deep knowledge of the terminology of the fields and doesn't quite know what they're looking for yet, as they're exploring opportunities. You should be able to input any scientific text over 500 words and immediately find all relevant research, across disciplines.

The first baby step product is live on our site, where you can explore the science around a TED talk. The next version of our tool is scheduled for launch in early September. The basic version will always be available for free for individual users, and we are targeting corporations as clients.

We are a fairly young startup; the team was formed in August 2015 at Singularity University at NASA Ames Research Park in Silicon Valley. We are a Norwegian-registered company, but our team is entirely distributed across Norway, Sweden, Finland, Spain and Ukraine. In these first 9 months we have built our first tool, sold it to our first customer, been part of 500 Startups’ Nordic accelerator program, secured a seed investment of €300,000, launched an AI Training program with more than 300 trainers and initiated pilot partnerships with several multinational corporations.

We’re moving very fast, we’re sincerely ambitious and we believe we can have a positive impact on the world. And we’d like you to be part of this adventure.


Anders Kofod-Petersen
Adjunct Professor
360 IT-bygget
NTNU