Presentation @ ASNAC 2023: Measuring information disorder
In this paper, I introduce a robust framework for quantifying ‘information disorder’ within diverse information spaces, detailing its measurement methodology and demonstrating its applicability through two case studies. I define ‘information disorder’ as the probability of encountering consistent information for effective decision-making. An ‘information space’ is conceptualised as a multilayer graph, where each node signifies semantic content, such as a sentence in a Wikipedia article or a social media post. Nodes within this graph are connected through four types of edges that signify 1) agreement, 2) disagreement, 3) redundancy (i.e. equivalence), and 4) shared reference (i.e. pointing to the same domain/content).
I calculate the disorder metric based on a weighted sample of content, where weights are determined by the (approximated) probability that a recommender system will serve the content. This sampling approach not only reduces computational effort but also enables assessment when full data access is unavailable (which, of course, is generally the case). The density of each layer of the multilayer graph resulting from the sampling is used to compute a measure of information disorder using this simple formula: (agreement - disagreement) + redundant + SameSource. By definition, a negative score will indicate information disorder, as conflicting information outweighs concordant and redundant information and the presence of common sources. The same sample is also used to measure the degree of singularism and pluralism, providing additional insights about information diffusion within an information space. The degree of singularism/pluralism is computed by dividing the number of communities of a graph resulting from flattening the agreement, redundant and SameSource layers by the number of nodes. A value approaching zero will indicate extreme pluralism, and conversely, a value approaching 1 extreme singularism.
We apply this methodology to assess information disorder at multiple time points in different information spaces: the entire English Wikipedia at the onset of the COVID-19 pandemic and selected Reddit threaded conversations. The relations between semantic content nodes needed to define layers 1), 2) and 3) are coded using the text-embedding-ada-002 through OpenAI’s API using this prompt: “Considering their meaning, are the following two sentences in agreement, disagreement, or equivalent?”. Instead, the relations of layer 4) are defined by comparing the domain of the embedded URLs (when present). A measure of the disorder of an information system like a social media platform that is scalable because automatically generated leveraging the growing capabilities of LLMs is useful as it allows monitoring in near-real-time the overall health of an information space. But it becomes critical during an information crisis (e.g.a pandemic or an election) when disorders can cause severe harm to people and institutions.