Abraham Dada

A Short Essay: The Hidden Link Between Reinforcement Learning and Consciousness

Published: 14/03/2025

The Unexplored Link

Reinforcement learning (RL) involves an agent learning to maximise rewards through trial and error, strategically balancing exploitation (using known strategies to gain rewards efficiently) against exploration (searching for new strategies when existing ones fail). However, this framework may offer a deeper insight into the nature of consciousness itself. In biological systems, deterministic behaviours work well in stable, low-entropy environments, but when those behaviours fail, when an organism encounters unpredictability, 'stochastic-heuristic exploration' becomes necessary. The failure of deterministic exploitation may be what triggers this meta-level heuristic exploration, a process that may be fundamental to phenomenological consciousness. This shift from exploitation to exploration could mirror how humans become consciously aware of problems: consciousness isn't always "on", but rather emerges when deterministic strategies break down, forcing the brain to generate novel abstractions to solve high-entropy problems.
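To make this trigger concrete, here is a minimal sketch in Python, assuming a toy multi-armed bandit setting. The agent exploits its learned value estimates deterministically while prediction error stays low, and falls back to stochastic exploration once those estimates start failing; the threshold and names are illustrative, not a claim about neural implementation.

```python
import random

SURPRISE_THRESHOLD = 0.5  # illustrative: prediction error above this triggers exploration

class SwitchingAgent:
    """Toy bandit agent that switches from exploitation to exploration on failure."""

    def __init__(self, n_arms):
        self.values = [0.0] * n_arms  # learned value estimates (the 'deterministic policy')
        self.counts = [0] * n_arms

    def act(self, last_error):
        # Deterministic exploitation while the learned strategy keeps working...
        if last_error < SURPRISE_THRESHOLD:
            return max(range(len(self.values)), key=lambda a: self.values[a])
        # ...stochastic exploration once it breaks down, mirroring the proposed
        # trigger for conscious, novel problem-solving.
        return random.randrange(len(self.values))

    def update(self, arm, reward):
        # Prediction error: how badly the current estimate missed reality.
        error = abs(reward - self.values[arm])
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
        return error
```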

This suggests that consciousness is not a passive state, but an adaptive mechanism that enables organisms to navigate uncertainty. Just as an RL agent explores new strategies when its learned policies fail, human consciousness may arise as a way to construct hypothetical models of reality—mental simulations that help resolve unpredictable scenarios (phenomenological consciousness). These models, whether linguistic, visual, or auditory, manifest as thoughts. A key question is whether these thoughts are completely novel or merely recombinations of existing memory structures. I align with the latter view: thoughts are not generated from nothing but emerge as recombinations of stored cognitive patterns, structured as 'cognitive hypergraphs' within memory. In this context, a cognitive hypergraph is a high-dimensional representation of interconnected concepts, memories, and experiences. Unlike traditional networks, where connections are strictly pairwise (node-to-node), hypergraphs allow for multi-way relationships—capturing the complex, non-linear ways in which concepts recombine to form new abstractions.
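As a concrete illustration of the data structure, here is a minimal Python sketch, assuming hyperedges can be represented as sets of concept labels; the labels themselves are illustrative.

```python
class Hypergraph:
    """A minimal cognitive hypergraph: hyperedges relate any number of concepts."""

    def __init__(self):
        self.nodes = set()
        self.hyperedges = []  # each hyperedge is a frozenset, not a pairwise link

    def relate(self, *labels):
        # Record a multi-way relationship; a graph edge could only hold two labels.
        edge = frozenset(labels)
        self.nodes.update(edge)
        self.hyperedges.append(edge)
        return edge

memory = Hypergraph()
memory.relate("face", "Mum", "comfort")        # stored experience
memory.relate("hot surface", "pain", "avoid")  # stored experience
# A 'thought' as recombination: one new hyperedge tying existing nodes together,
# rather than content generated from nothing.
thought = memory.relate("Mum", "avoid", "pain")
```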

In this context, I speculate that concepts most people regard as irreducible and abstract, such as 'morality', 'ideas', or 'beliefs', can in fact be explained mathematically:

The face is a root node in the cognitive hypergraph, deeply ingrained in human physiology. From birth, humans exhibit an innate bias toward detecting faces—newborns track face-like patterns more readily than other stimuli, suggesting a hardwired neural mechanism optimised for social recognition. This initial root node forms the foundation for an expanding network of abstractions.

➝ Basic sensory input → Pattern recognition (contrast, edges, shapes) → Face detection (innate physiological bias, newborn tracking response) → Recognisable faces ('Mum', 'Dad', caregiver, familiar individuals: faces the infant is repeatedly exposed to) → Grouping of faces (family, familiar people, strangers, in-groups, out-groups) → Generalised concept of people (social categories, demographics, populations, cultures, nations, collective identity)

Another root node emerges separately—rules. Initially, rules exist as purely reflexive, biological responses. The first rule a newborn follows is the withdrawal reflex—when touching a hot surface, the nervous system enforces an immediate reaction, encoding the fundamental axiom: avoid pain. Over time, this hypergraph evolves through reinforcement mechanisms. A child learns behavioural rules through direct conditioning—if they engage in an action and receive punishment, the association strengthens. Similarly, reward-based reinforcement solidifies positive behaviours. These rules extend beyond immediate caregivers as the child observes others in society, learning through vicarious reinforcement. Eventually, this expands into understanding structured systems of rules, culminating in abstract moral reasoning.

➝ Instinctual reflex (pain avoidance, nervous system enforcement) → Parental reinforcement (punishment, correction, reward) → Observed social reinforcement (watching others get punished or rewarded, vicarious learning, social conditioning) → Abstract rule formation (understanding authority, social norms, implicit and explicit behavioural expectations) → Generalised rule-based systems (laws, ethical structures, philosophical morality, universal moral codes, justice systems)
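The reinforcement dynamics just described can be sketched in a few lines, assuming a rule's entrenchment can be summarised as a single scalar; the rates and rule labels are illustrative.

```python
DIRECT_RATE = 0.5     # first-hand punishment or reward updates strongly
VICARIOUS_RATE = 0.2  # watching others be punished or rewarded updates more weakly

# Strength of each behavioural rule, from near-reflexive axioms to learned norms.
rule_strength = {
    "avoid pain": 0.95,       # withdrawal reflex: effectively hardwired
    "share your toys": 0.10,  # not yet reinforced
}

def reinforce(rule, signal, vicarious=False):
    """signal > 0 when following the rule is rewarded (or breaking it punished)."""
    rate = VICARIOUS_RATE if vicarious else DIRECT_RATE
    rule_strength[rule] = min(1.0, max(0.0, rule_strength[rule] + rate * signal))

reinforce("share your toys", +1.0)                  # direct parental reward
reinforce("share your toys", +1.0, vicarious=True)  # seeing a sibling praised
```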

At a certain level of abstraction, these two initially separate hypergraphs—people and rules—begin to interconnect. Humans do not perceive others as isolated entities; we recognise them within a framework of expected behaviour. As the "people" hypergraph deepens, it integrates with rule-based systems, forming the foundation of morality. The concept of morality is not an isolated construct—it is an emergent hypergraph that arises from the structured overlap of learned social behaviour and the categorical understanding of people.

➝ People hypergraph (identifying individuals, generalising to groups, societies, cultures)
➝ Rules hypergraph (learning actions and consequences, abstracting into higher-order ethics and law)
➝ Intersection of both → Morality (a deep abstraction emerging from the interplay of social interactions and structured behavioural expectations)

In this sense, morality is not an independent or innate entity—it is a layered construct built upon the interplay between recognising social agents and understanding structured behavioural constraints. Just as early hypergraphs begin with sensory biases (e.g., face recognition) and expand into higher-order abstractions (e.g., cultural identity), morality is a high-dimensional emergent structure arising from the fusion of two fundamental cognitive networks, developed through reinforcement.
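To pin down what 'emergent intersection' means here, a self-contained sketch follows, assuming each hypergraph can be flattened into the set of abstractions it supports; every label is illustrative.

```python
# High-level abstractions reachable in each hypergraph by this developmental stage.
people = {"individuals", "groups", "cultures", "social categories"}
rules = {"pain avoidance", "social norms", "laws", "ethical structures"}

# Morality modelled as a new hyperedge spanning nodes from BOTH networks, rather
# than a primitive node in either one: it exists only where the understanding of
# social agents overlaps with structured behavioural constraints.
morality = frozenset({"social categories", "social norms", "ethical structures"})

assert morality & people and morality & rules  # draws on both parent hypergraphs
```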

Humans vs. AIXI: Infinite Explorability

AIXI, proposed by Marcus Hutter, is a theoretical model of an optimal reinforcement learning agent that maximises expected rewards in any computable environment. It is based on Solomonoff Induction, meaning it assigns probabilities to all possible future outcomes based on all computable hypotheses, updating these probabilities as it interacts with the world. Formally, AIXI selects actions \(a_t\) that maximise the expected reward:

\[a_t = \arg\max_{a} \sum_{q} 2^{-\ell(q)} V^{q}(a, h_t)\]

where the sum ranges over computable hypotheses \(q\), \(\ell(q)\) is the length of \(q\) in bits (favouring shorter, more probable programs), \(V^q\) is the value function estimating expected rewards, and \(h_t\) is the agent's interaction history up to time \(t\). However, AIXI is uncomputable because it requires enumerating all possible Turing machines, making it infeasible in practice.
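Because the true sum runs over every computable hypothesis, AIXI cannot be executed, but the shape of the action rule can be illustrated with a finite hypothesis set standing in for the infinite one. The tables below are invented for illustration; only the \(2^{-\ell(q)}\) weighting mirrors the formula above.

```python
# Each hypothesis: (description length in bits, toy value table over actions).
# The history h_t is implicit because these value tables are fixed.
hypotheses = [
    (3, {"left": 1.0, "right": 0.0}),  # short program: prior weight 2^-3
    (8, {"left": 0.0, "right": 5.0}),  # longer program: prior weight 2^-8
]

def aixi_like_action(actions):
    def expected_value(a):
        # Sum over hypotheses q of 2^{-l(q)} * V^q(a): shorter programs dominate.
        return sum(2 ** -length * values[a] for length, values in hypotheses)
    return max(actions, key=expected_value)

print(aixi_like_action(["left", "right"]))  # 'left': 0.125 vs 'right': ~0.0195
```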

Despite this, AIXI shares a deep conceptual similarity with human cognition: both exist within an infinitely explorative space but do not act on infinite exploration. Humans are infinitely explorative in potential, but rather than evaluating all possibilities, we navigate this space efficiently through heuristics such as emotion and pre-existing cognitive hypergraphs.

Unlike AIXI, which theoretically considers all possibilities, humans operate under bounded rationality. We prioritise actions using emotion, intuition, memory, and salience weighting. Emotional salience helps assign weights to exploration paths, reducing computational overhead by guiding decisions toward high-value exploratory avenues. Additionally, our cognitive hypergraphs allow for rapid recombination of knowledge, making our exploration constrained yet efficient. This contrasts with AIXI's brute-force optimisation, as humans do not need to search the entire space of possibilities to find meaningful solutions. Evolution has equipped us with a dynamic balance between deterministic exploitation and stochastic exploration, allowing us to handle high-entropy environments without computational infeasibility.
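One way to picture salience weighting is a softmax over value plus an emotional salience prior, so that exploration concentrates on high-salience avenues instead of sweeping the whole option space. This is a sketch under that assumption; the options and numbers are invented.

```python
import math
import random

options = {
    # option: (learned value estimate, emotional salience weight)
    "familiar route": (1.0, 3.0),
    "risky shortcut": (0.8, 1.5),
    "unexplored path": (0.0, 0.2),
}

def sample_option(temperature=1.0):
    # Softmax over value + salience: high-salience options are explored far more
    # often, cutting the search down from AIXI-style exhaustive evaluation.
    scores = {o: (v + s) / temperature for o, (v, s) in options.items()}
    z = sum(math.exp(s) for s in scores.values())
    r, acc = random.random(), 0.0
    for option, score in scores.items():
        acc += math.exp(score) / z
        if r <= acc:
            return option
    return option  # guard against floating-point rounding

print(sample_option())  # usually 'familiar route', occasionally the others
```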