This page lists useful resources for students and researchers interested in text-as-data.
Corpora
Norwegian Colossal Corpus
- Collection of multiple Norwegian corpora suitable for training large language models or conducting independent research. The corpus contains government reports, Stortingsforhandlingene (parliamentary proceedings), evaluation reports, laws and NOUs, online newspapers, Wikipedia, and out-of-copyright books from the Norwegian National Library.
Norwegian Parliamentary Debates 1981 – 2021 (coming soon)
- All parliamentary debates from the Norwegian Parliament (Storting) in the period 1981 – 2021. The data also include information on committee membership, language, and other metadata, and can be merged with Fiva and Smith (2022) for comprehensive background data on national-level politicians. Parliamentary speeches from 1994 – 2021 are available in the replication package for ‘How Does Party Discipline Affect Legislative Behavior?’, Quarterly Journal of Political Science, 2024, which can be downloaded here.
Norwegian NLP resources
- List of useful Norwegian NLP resources, covering both datasets/corpora and methods that support NLP in Norwegian.
Methods
Intro to Quanteda
- Well-documented and intuitive introduction to quantitative text analysis in R using the Quanteda package.
Text Algorithms in Economics (Ash and Hansen, 2023)
- Excellent overview of text algorithms in economics.
Text as Data (Gentzkow, Kelly, and Taddy, 2019)
- One of the classic overviews of text-as-data in economic research.
Other useful resources
Oslo-Bergen Tagger
- Morphological and syntactic tagger for Norwegian. Useful for identifying grammatical morphemes and for part-of-speech tagging.