Oda Nedregård: NLP

This page lists useful resources for students and researchers interested in text-as-data.

Corpus

Collection of Norwegian corpora suitable for training large language models or conducting independent research. The collection contains government reports, Stortingsforhandlingene (parliamentary proceedings), evaluation reports, laws and NOUs, online newspapers, Wikipedia, and out-of-copyright books from the Norwegian National Library.

Norwegian Parliamentary Debates Dataset 1945–2024

with Jon Fiva and Henning Øien, Accepted, Nature Scientific Data (2024)

Dataset of all Norwegian parliamentary speeches from 1945 to 2024. We also include speaker and speech metadata (e.g., committee membership, district, minister, elected, deputy…). Can be merged with Fiva and Smith (2022) for comprehensive background data on national-level politicians.
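
A minimal sketch of such a merge in Python with pandas, assuming both datasets share a politician identifier. The column names (politician_id, birth_year) and the toy rows are hypothetical placeholders, not the actual variable names in either dataset; check the respective codebooks for the real key.

```python
import pandas as pd

# Toy stand-ins for the two datasets. All column names and values here
# are hypothetical; the real files may use a different identifier.
speeches = pd.DataFrame({
    "speech_id": [1, 2, 3],
    "politician_id": ["A01", "B02", "A01"],
    "year": [1950, 1950, 1954],
    "text": ["first speech", "second speech", "third speech"],
})
politicians = pd.DataFrame({
    "politician_id": ["A01", "B02"],
    "birth_year": [1901, 1910],
})

# Left join keeps every speech; validate raises an error if the
# background data has duplicate rows for the same politician.
merged = speeches.merge(
    politicians,
    on="politician_id",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Speeches without a match in the background data, worth inspecting by hand.
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```

A left join with an indicator column makes it easy to see which speakers lack background data (for example, non-elected ministers) instead of silently dropping their speeches.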

Norwegian NLP resources

List of useful Norwegian NLP resources, covering both data/corpora and methods that support NLP in Norwegian.

Methods

Intro to Quanteda

Well-documented and intuitive introduction to quantitative text analysis in R using the Quanteda package

Text Algorithms in Economics (Ash and Hansen, 2023)

Excellent overview of text algorithms in economics

Text as Data (Gentzkow, Kelly, and Taddy, 2019)

One of the classic overviews of text-as-data in economic research

Multilanguage Word Embeddings for Social Scientists (Wirsching et al., 2024)

Multi-language “a la carte” word embeddings (also available for Norwegian)

Other useful resources

Oslo-Bergen Tagger

Morphological and syntactic tagger for Norwegian. Useful for identifying grammatical morphemes and for part-of-speech tagging.

Friends Don’t Let Friends Make Bad Graphs

Some data visualization pitfalls to avoid