Helle Sjøvaag, Truls Pedersen and Eirik Stavelin were in Fukuoka, Japan, at the annual ICA conference with the paper “Diversity as Proxy for Measuring the Quality of News: Operationalizations for a Large-Scale Research Project on the Norwegian Media Landscape”. The paper was presented at the pre-conference “Media Performance and Democracy: Defining and Measuring the Quality of News” on Thursday, 9 June. The presentation is reproduced here in its entirety.
The aim of the project is to evaluate the effectiveness (or the impact) of Norwegian media regulation in sustaining pluralist democracy, the infrastructure of which it is the state’s responsibility to uphold, according to the constitution. Democratic pluralist principles include representation and freedom of expression, ideals that come together to create a ‘marketplace of ideas’ where conflict resolution or political consensus is arrived at through debate. For this debate to be conducive to democracy, we need to ensure representation and quality of information. Because editorial media are largely privately owned commercial enterprises, there is a risk of market failure (in terms of a monopoly of ideas), which is why the media are regulated towards common good principles.
This is the theoretical basis for operationalizing quality of news as diversity. Diversity is here understood as heterogeneity, mobilized to analyze representation as the frequency and distribution of voices and topics in news and current affairs across the Norwegian media landscape. Our aim is to find out where in the media landscape diversity of voices and topics can be found, and which media structures enable media diversity.
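To make "heterogeneity of a distribution" concrete, here is a minimal sketch of one common way to quantify it: Shannon entropy over category counts. The presentation does not say which diversity index the project uses, so the choice of entropy, the function names and the example labels are all assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Return the Shannon entropy (in bits) of a list of category labels.

    Assumption: entropy is used here only as an illustrative heterogeneity
    measure; the project does not name a specific index.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def normalized_entropy(labels):
    """Scale entropy to the range 0..1 by dividing by its maximum, log2(k)."""
    k = len(set(labels))
    if k <= 1:
        return 0.0  # zero or one category: no heterogeneity
    return shannon_entropy(labels) / math.log2(k)


# Hypothetical topic labels for the stories of two outlets.
outlet_a = ["politics", "sport", "culture", "economy"]
outlet_b = ["sport", "sport", "sport", "politics"]
print(normalized_entropy(outlet_a), normalized_entropy(outlet_b))
```

An outlet whose stories are spread evenly over many topics scores close to 1, while an outlet dominated by a single topic scores close to 0.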
We aim to measure the landscape overall, with a data set of 189 editorial outlets, including newspapers, television, radio, and pure players published at the local, regional and national level. Data are collected from websites automatically using custom-written scrapers, and data collection runs for three-month periods (from October to December), starting in 2015 and continuing over the next three years. The data are collected as HTML files, stored on a local server, and assembled (or archived) using automatic coding of manifest content such as time stamps, page type, URL etc., and textual elements including the title, lead-in, body, byline, and captions.
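The coding step described above can be sketched roughly as follows, using only Python's standard `html.parser`. This is not the project's scraper. The markup conventions it assumes (title in `<h1>`, time stamp in a `<time datetime="...">` attribute, body text in `<p>` elements) and the names `ArticleFieldParser` and `extract_fields` are hypothetical, and real outlets would each need their own rules.

```python
from html.parser import HTMLParser


class ArticleFieldParser(HTMLParser):
    """Collect the <h1> text, <p> text and the first <time datetime=...> value."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.body_parts = []
        self.timestamp = None
        self._current = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._current = tag
        elif tag == "time" and self.timestamp is None:
            self.timestamp = dict(attrs).get("datetime")

    def handle_endtag(self, tag):
        if tag == self._current:
            if tag == "p":
                self.body_parts.append(" ")  # keep paragraphs separated
            self._current = None

    def handle_data(self, data):
        if self._current == "h1":
            self.title_parts.append(data)
        elif self._current == "p":
            self.body_parts.append(data)


def extract_fields(html, url):
    """Return a flat record of manifest and textual fields for one stored page."""
    parser = ArticleFieldParser()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "timestamp": parser.timestamp,
        # Collapse runs of whitespace left over from the markup.
        "title": " ".join("".join(parser.title_parts).split()),
        "body": " ".join("".join(parser.body_parts).split()),
    }
```

Each stored HTML file would yield one such record, which can then be archived alongside the outlet-level metadata.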
The problem with operationalizing quality as diversity indicators is that doing it digitally and automatically means you first have to translate rather normative diversity parameters into measurable (and extractable) digital units (or measuring points), that is, things that can be captured and classified according to rules a computer can understand. Then you have to carefully consider the aggregation that is enabled by the research design and the automated analysis of extracted features.
We define diversity as these four measurable things: 1. The range of sources present, including political affiliation, gender and geography. 2. The distribution of topics, with standard content analysis categorizations from hard to soft news. 3. Depth of coverage, including hyperlinking, story length, readability and format. 4. Agenda setting, including origin, diffusion and overlap.
These are the questions we need to answer to find out where in the Norwegian media landscape diversity is most present; or what types of media contribute most to media diversity. A second stage here will therefore be to do statistical analyses using the metadata we have on the location, resources and ownership of each media outlet. More importantly, though, these are the questions that we think that we can answer using automatic, computer-assisted methods, methods developed or customized by researchers at our department. This is how we intend to do that:
For the range of sources present we use named entity chunking to automatically extract from the collected textual elements (header, body, byline) the persons and organizations mentioned in the text. We have an algorithm to determine gender, which enables us to ascertain the presence of male and female voices in the news; and for geography we run a place approximation algorithm (and we also have structural metadata on the reach of each publication). The analysis of political affiliation is based on named entities combined with semantic network analysis. We do this analysis to answer questions about representation, to find out who speaks in the media, and who speaks in which media.
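The presentation does not describe how the gender algorithm works. As a rough illustration of the general idea, the sketch below pairs a naive stand-in for named entity chunking (a regular expression that matches runs of capitalized words) with a first-name lookup. The lookup table, the pattern and the function names are hypothetical; a real pipeline would use a trained chunker and name statistics.

```python
import re
from collections import Counter

# Hypothetical, tiny lookup table used for illustration only.
FIRST_NAME_GENDER = {
    "helle": "female",
    "kari": "female",
    "truls": "male",
    "eirik": "male",
    "ola": "male",
}

# Naive stand-in for named entity chunking: two or more consecutive
# capitalized words. It will also match non-person phrases such as
# organization names, which a real chunker would label separately.
NAME_PATTERN = re.compile(r"\b[A-ZÆØÅ][a-zæøå]+(?: [A-ZÆØÅ][a-zæøå]+)+\b")


def extract_person_candidates(text):
    """Return candidate person names in order of appearance."""
    return NAME_PATTERN.findall(text)


def gender_counts(text):
    """Count candidate names by the gender associated with their first name."""
    counts = Counter()
    for name in extract_person_candidates(text):
        first_name = name.split()[0].lower()
        counts[FIRST_NAME_GENDER.get(first_name, "unknown")] += 1
    return counts
```

Names missing from the lookup are counted as "unknown" rather than guessed, which keeps the share of unclassified voices visible in the aggregate figures.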
As for the distribution of topical categories, we have two approaches at our disposal. We run basic topic modeling to establish a rough distribution of themes across the corpus; and we run an automatic content analysis algorithm that Eirik Stavelin has developed, which has a precision above .90 (well above human coders). The aim of the content analysis is to find out what is being talked about in the media, and where in the landscape most of the hard news occurs.
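For readers unfamiliar with the metric, precision for a category is the share of items assigned to that category that truly belong to it. The sketch below computes it per category against a manually coded sample. The function name and the example labels are made up for illustration and say nothing about how the classifier itself was evaluated.

```python
from collections import Counter


def precision_per_class(predicted, actual):
    """Return {label: true positives / all items predicted as that label}."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    predicted_totals = Counter(predicted)
    true_positives = Counter(p for p, a in zip(predicted, actual) if p == a)
    return {
        label: true_positives[label] / total
        for label, total in predicted_totals.items()
    }


# Hypothetical labels: classifier output versus manual coding.
predicted = ["hard", "hard", "soft", "hard", "soft"]
actual = ["hard", "soft", "soft", "hard", "soft"]
print(precision_per_class(predicted, actual))
```

In this toy sample two of the three stories labelled "hard" are correct, so precision for "hard" is 2/3, while both stories labelled "soft" are correct.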
For depth of coverage we do a basic word count to find out the distribution of story length, from short press releases to longer features and reports, across outlet types. We also apply a basic readability test to ascertain the complexity of news coverage across the corpus. From the metadata we have information about format and page type, which helps us understand the presence of interactive features such as videos, quizzes, chatting, live feeds etc. in the data. From the collected hyperlinks we also perform a network analysis to ascertain the central nodes in the digital news ecology in Norway.
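The presentation does not name the readability test. One index widely used for Scandinavian languages is LIX, defined as the average sentence length plus the percentage of words longer than six letters. The sketch below assumes LIX purely for illustration, and its sentence splitting (counting runs of `.`, `!` and `?`) is a simplification.

```python
import re

# Words are runs of letters (Unicode-aware, so æ, ø and å are included).
WORD_RE = re.compile(r"[^\W\d_]+")
SENTENCE_END_RE = re.compile(r"[.!?]+")


def word_count(text):
    """Return the number of words in the text (a basic story-length measure)."""
    return len(WORD_RE.findall(text))


def lix(text):
    """Return the LIX readability score: words/sentences + 100 * long/words."""
    words = WORD_RE.findall(text)
    if not words:
        return 0.0
    # Treat text without terminal punctuation as a single sentence.
    sentences = max(1, len(SENTENCE_END_RE.findall(text)))
    long_words = sum(1 for word in words if len(word) > 6)
    return len(words) / sentences + 100 * long_words / len(words)


sample = "The government presented the budget. Opposition parties criticized it."
print(word_count(sample), round(lix(sample), 1))
```

Higher scores indicate longer sentences and more long words, that is, more demanding text.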
For the agenda setting analysis we have metadata (time stamps and hyperlinks) that allow us to do a news diffusion analysis, seeing how the news breaks and spreads across the landscape. We will also do an origin analysis using automatically extracted byline data, enabling us to find out how much of the news is agency material, and to find out how much overlap there is between newspapers belonging to the same corporation. Here we also plan to run a textual overlap analysis using text recognition algorithms, or perhaps by trying the existing plagiarism detection tools used by universities.
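The overlap method is still undecided in the presentation. One simple, well-known technique for this kind of comparison is to split each text into overlapping word n-grams ("shingles") and compute the Jaccard similarity of the two sets. The sketch below assumes that technique for illustration; the function names and the shingle size of three words are arbitrary choices.

```python
import re


def shingles(text, n=3):
    """Return the set of overlapping word n-grams in a lower-cased text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard_overlap(text_a, text_b, n=3):
    """Return shared shingles divided by all distinct shingles (0.0 to 1.0)."""
    a, b = shingles(text_a, n), shingles(text_b, n)
    if not a and not b:
        return 0.0  # both texts are shorter than n words
    return len(a & b) / len(a | b)


# Hypothetical example: an agency sentence and a lightly edited copy.
agency = "the council approved the new budget on monday"
edited = "the council approved the new budget late on monday"
print(jaccard_overlap(agency, edited))
```

Identical texts score 1.0 and unrelated texts score close to 0.0, so a threshold on this score could flag likely agency material or copy shared within a corporation.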
This is our basic research design. It is designed to answer questions about heterogeneity in representation of interests and identities in the digital media landscape in Norway. It is also designed to answer questions about what types of media are most in need of regulatory protection (whether it is local newspapers, public service broadcasting, weekly papers, or radio, etc.). But relying on computational methods, while they allow for scale and precision, also comes with limitations. The main challenge here is reassembling the digital units into aggregate measurements. We need to make sure that the features we are able to extract can be used to say something about the quality of news enabled by the media policies that regulate media infrastructures.