The Finnish News Agency STT has made its news archive available to researchers. A total of 2.7 million news articles produced from 1992 to 2018 have been deposited in the Language Bank of Finland, where researchers can use them to develop new language technology based on machine learning and artificial intelligence.
The Language Bank of Finland is a service offered by the FIN-CLARIN research infrastructure, a part of the European CLARIN research infrastructure for language-related resources in Humanities and Social Sciences. National and regional Research & Education networks provide access to the infrastructure. Finnish R&E network CSC is responsible for the technical maintenance of the Language Bank of Finland.
Researchers from the University of Helsinki have already developed Valtteri, a news bot that has written over 750,000 news articles on the 2017 Finnish municipal election results in Finnish, Swedish, and English. Valtteri is part of the Immersive Automation research project, which aims to automate news production.
So far, automated news reporting has been used primarily for sports, elections, and similar topics, where a story can be generated in a structured manner from structured data. As an example, a game of ice hockey consists of three periods resulting in a specific number of goals, which makes it suitable for robot reporting. With automated reporting, online media can cover a large number of events, such as local hockey games, and thus cater to small niche audiences.
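To illustrate how a story can be generated from structured data, here is a minimal template-based sketch. It is not the method used by Valtteri or Immersive Automation; the `GameResult` record, its field names, the `write_report` function, and the team names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GameResult:
    # Hypothetical record of one ice hockey game (illustrative field names).
    home: str
    away: str
    # One (home goals, away goals) pair per period.
    period_scores: List[Tuple[int, int]]


def write_report(game: GameResult) -> str:
    """Fill a fixed sentence template from the structured result."""
    home_total = sum(h for h, _ in game.period_scores)
    away_total = sum(a for _, a in game.period_scores)

    # Pick the lead sentence according to who scored more goals.
    if home_total > away_total:
        lead = f"{game.home} beat {game.away} {home_total}-{away_total}."
    elif away_total > home_total:
        lead = f"{game.away} beat {game.home} {away_total}-{home_total}."
    else:
        lead = (f"{game.home} and {game.away} were tied "
                f"{home_total}-{away_total} after regulation.")

    periods = ", ".join(f"{h}-{a}" for h, a in game.period_scores)
    return f"{lead} Period scores (home-away): {periods}."


# Made-up teams and scores, purely for demonstration.
game = GameResult("Lions", "Bears", [(1, 0), (2, 1), (0, 1)])
print(write_report(game))
```

Because the input always has the same shape, the same template can be applied to any number of games, which is what makes covering many small local events feasible.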
The University of Helsinki researchers participating in the Immersive Automation project are now contributing to a new EU-funded research project named Embeddia, using the STT news archive to further develop enhanced news reporter applications.
The project takes its name from machine learning technologies known as “word embeddings”, where computers find relations between words and phrases based on the contexts in which they occur. Embeddia aims to develop multilingual word-embedding models that let computers find connections between texts written in different languages, and to develop methods for automatic text generation across language barriers.
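The idea that words are related when they occur in similar contexts can be sketched with simple co-occurrence counts. This is a deliberately simplified, count-based stand-in, not Embeddia's models: real word embeddings are dense vectors learned from large corpora. The function names `context_vectors` and `cosine` and the tiny corpus are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """For each word, count the words appearing within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counter objects)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


# Toy corpus, invented for illustration only.
corpus = [
    "the team won the game",
    "the team lost the game",
    "the council approved the budget",
]
vecs = context_vectors(corpus)

# "won" and "lost" share the same neighbours here, so they score higher
# than "won" and "budget", which share only the word "the".
print(cosine(vecs["won"], vecs["lost"]))
print(cosine(vecs["won"], vecs["budget"]))
```

In this toy setting the two verbs end up close together purely because they appear between the same words, which is the intuition behind finding relations from contexts of occurrence.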
The project focuses on smaller language areas, such as Finland, Slovenia, and Croatia, where language technology lags behind what is already available for English. One goal is to make it easier to search online news across language barriers: to find and combine news written in several languages, and thus to improve people’s access to information. The researchers are also looking into ways to make computers write more vividly, for example by using metaphors to add a bit of colour to the language.
Embeddia brings together six European universities, the Finnish News Agency STT and three other media companies.