Beyond LLMs: Towards text understanding and generation via Hypernetwork Autoencoders

A hypernetwork-based implicit text autoencoder
Figure 1: HyperNetwork Text Autoencoder architecture (left to right). Input tokens flow through the encoder (word embedding, vocab_size → embed_dim = 64; bidirectional LSTM, 64 → 256 via concatenated forward and backward states; linear projection, 256 → 64-dimensional latent z). The hypernetwork reads z and generates the dynamic decoder weights W₁ (128×32) and b₁ (128). The decoder combines these weights with sinusoidal positional encodings (pos_dim = 32), computing W₁ × PE + b₁ → ReLU → projection, to produce output logits over the vocabulary.

Author

Beyond Team

Date

May 13, 2026

Code

PyTorch

The era of Large Language Models (LLMs) has revolutionized how we interact with text. Models like GPT-4 are unparalleled at predicting the next token, retrieving information, and synthesizing known concepts. However, when it comes to true scientific discovery—inventing fundamentally new architectures, activation functions, or loss metrics—autoregressive next-token prediction often falls short. LLMs are inherently biased toward the distribution of their training data; they are designed to replicate the known, not necessarily to chart the unknown. If we want to discover entirely new machine learning concepts, we need a different approach. We need to step away from autoregressive generation and move toward representation learning.

Our motivation is simple: What if we could encode the entirety of machine learning literature into a continuous, mathematical latent space? By doing so, we could perform "latent walks"—mathematically interpolating between a paper on Activation Functions and a paper on Loss Functions—to see what novel, hybrid concepts emerge in the space between them. To achieve this, we've built a proof-of-concept for a novel text autoencoding architecture. Instead of relying on standard Transformers, this model utilizes a Hypernetwork to dynamically generate an Implicit Neural Representation for decoding text. Here is a deep dive into the architecture, our initial experiments on ML literature, and the promising results showing that this model can map and separate complex concepts without any human labeling.

Model Architecture

Standard sequence-to-sequence autoencoders typically use an RNN or a Transformer for both the encoder and decoder. Our architecture, the HyperNetworkTextAutoencoder, completely reimagines the decoding phase. Instead of passing hidden states step-by-step through a recurrent layer, our model uses a Hypernetwork to generate the literal weights of the decoder on the fly, conditioned entirely on the latent representation of the text.

Here is how the pipeline works:
  1. The Encoder (Compressing the Concept): We start with a standard Bidirectional LSTM. A sequence of tokens is passed through an embedding layer and then processed by the BiLSTM. We extract the final hidden states, concatenate them, and project them down via a linear layer into a fixed-size vector, z. This z is our latent representation—the "DNA" of the sentence.
  2. The Hypernetwork (Building the Decoder): This is where the magic happens. The vector z is passed into a set of linear layers (gen_W1 and gen_b1). However, these layers do not output probabilities or next-word logits. Instead, they output the weights (W1) and biases (b1) for a completely separate neural network.
  3. The Decoder (Implicit Neural Representation): The generated network acts as an implicit function, mapping time (position) to vocabulary. We pass fixed, sinusoidal positional encodings (similar to those in Transformers) into this dynamically generated layer. Mathematically, for a given sequence of positions $P$, the hidden states of the decoder are computed as $\text{Hidden} = \mathrm{ReLU}(W_1 P + b_1)$. These hidden states are then projected across the vocabulary space to retrieve the text. A minimal code sketch of the full pipeline follows this list.
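To make the data flow concrete, here is a minimal PyTorch sketch of the pipeline above, using the dimensions from the architecture diagram (64-dimensional embeddings and latent z, a 128×32 W₁, pos_dim = 32). The class and generator names follow the post (HyperNetworkTextAutoencoder, gen_W1, gen_b1); the remaining method and attribute names are illustrative rather than the exact identifiers in our codebase.

```python
import math

import torch
import torch.nn as nn


def sinusoidal_positional_encoding(seq_len, pos_dim):
    """Fixed sinusoidal encodings: one pos_dim-dimensional vector per position."""
    position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)         # (T, 1)
    div_term = torch.exp(torch.arange(0, pos_dim, 2, dtype=torch.float)
                         * (-math.log(10000.0) / pos_dim))                   # (pos_dim/2,)
    pe = torch.zeros(seq_len, pos_dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe                                                                 # (T, pos_dim)


class HyperNetworkTextAutoencoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, latent_dim=64, hidden_dim=128, pos_dim=32):
        super().__init__()
        # Encoder: embedding -> BiLSTM -> linear projection to the latent z.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.encoder_rnn = nn.LSTM(embed_dim, 128, batch_first=True, bidirectional=True)
        self.to_latent = nn.Linear(2 * 128, latent_dim)
        # Hypernetwork: z -> the weights and biases of the decoder layer.
        self.gen_W1 = nn.Linear(latent_dim, hidden_dim * pos_dim)  # -> W1, shape (128, 32)
        self.gen_b1 = nn.Linear(latent_dim, hidden_dim)            # -> b1, shape (128,)
        # Static output projection from decoder hidden states to vocabulary logits.
        self.out_proj = nn.Linear(hidden_dim, vocab_size)
        self.hidden_dim, self.pos_dim = hidden_dim, pos_dim

    def encode(self, tokens):                                # tokens: (B, T) integer ids
        emb = self.embedding(tokens)                         # (B, T, embed_dim)
        _, (h_n, _) = self.encoder_rnn(emb)                  # h_n: (2, B, 128)
        h = torch.cat([h_n[0], h_n[1]], dim=-1)              # concat fwd + bwd -> (B, 256)
        return self.to_latent(h)                             # z: (B, latent_dim)

    def decode(self, z, seq_len):
        batch = z.size(0)
        # The hypernetwork generates the decoder's weights from z.
        W1 = self.gen_W1(z).view(batch, self.hidden_dim, self.pos_dim)  # (B, 128, 32)
        b1 = self.gen_b1(z).view(batch, self.hidden_dim, 1)             # (B, 128, 1)
        # Fixed positional encodings are the decoder's only "input".
        pe = sinusoidal_positional_encoding(seq_len, self.pos_dim).to(z.device)
        pe = pe.t().unsqueeze(0)                                         # (1, 32, T)
        hidden = torch.relu(torch.matmul(W1, pe) + b1)                   # (B, 128, T)
        return self.out_proj(hidden.transpose(1, 2))                     # (B, T, vocab_size)

    def forward(self, tokens):
        z = self.encode(tokens)
        return self.decode(z, tokens.size(1)), z
```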

Why do this? By forcing the model to generate the weights of the decoder, we place massive pressure on the latent vector z. It cannot simply pass along superficial text features. It must capture the fundamental structure and semantic meaning of the sentence, as z literally dictates the physics (the weights) of the network that will reconstruct the text.

We also expect this decoding scheme to learn more stably than autoregressive generation because it does not suffer from error propagation. In LLMs, a single wrong token prediction early in the sequence can derail the entire rest of the generation. Our implicit decoder generates the entire sequence in parallel, conditioned only on the latent vector z. It is less "greedy" and more "deliberative", as the sketch below illustrates.
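The following minimal training-step sketch builds on the class above: the decoder emits logits for every position at once, so reconstruction is a single cross-entropy over the whole sequence. The optimizer choice and learning rate shown here are illustrative placeholders, not our exact training configuration.

```python
import torch
import torch.nn as nn

# Illustrative training step: the decoder produces logits for every position in
# parallel, so reconstruction is one cross-entropy over the full sequence.
model = HyperNetworkTextAutoencoder(vocab_size=558)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # illustrative choice
criterion = nn.CrossEntropyLoss()

def train_step(tokens):                                      # tokens: (B, T) integer ids
    logits, _ = model(tokens)                                # (B, T, vocab), all positions at once
    loss = criterion(logits.reshape(-1, logits.size(-1)),    # (B*T, vocab)
                     tokens.reshape(-1))                     # (B*T,) -- reconstruct the input
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```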

Test Results

To test whether this architecture could serve as a foundation for concept discovery, we created a dataset (data.csv) consisting of dense definitions of Machine Learning concepts. We focused on two distinct topics: Activation Functions (e.g., ReLU, Swish, GELU, Mish) and Loss Functions (e.g., MSE, Cross-Entropy, Huber, Triplet Margin). Note: the model is fed only the text statements. The "topic" labels are completely hidden from the model during training; the learning process is 100% unsupervised. We trained the model for 400 epochs on a modest vocabulary of 558 unique tokens.

Result 1: Perfect Reconstruction

Because the dataset is highly specific, our first goal was to ensure the Hypernetwork could actually solve the text generation routing problem. The model achieved a training loss of exactly 0.0000, demonstrating perfect memorization and reconstruction.

Result 2: Unsupervised Topic Separation in Latent Space

Perfect reconstruction is great, but a lookup table could do that. The true test of an autoencoder for concept discovery is whether its latent space geometry is meaningful. Does z group similar concepts together? We tested the cosine similarity of the z embeddings between three previously unseen sentences:

KL divergence loss quantifies how much one probability distribution q diverges from a target distribution p by computing $D_{KL}(p\|q) = \sum_x p(x) \log\frac{p(x)}{q(x)}$, the expected extra information needed to encode samples from p when using q instead....
MSE loss is the average of the squared differences between predicted values and true values, $MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$, which penalizes larger errors much more heavily than small ones....
The Parametric Rectified Linear Unit (PReLU) is a type of activation function that extends ReLU by introducing a learnable parameter $\alpha$ for negative input values, allowing the network to automatically determine the optimal slope for negative signals instead of defaulting them to zero....
| Example 1 Comparison          | Cosine Similarity | Cosine Distance |
|-------------------------------|-------------------|-----------------|
| Same Topic Pair (A vs B)      | 0.3978            | 0.6022          |
| Different Topic Pair (A vs C) | 0.0235            | 0.9765          |

Table 1: Cosine similarity and cosine distance between the latent representations of the test sentences. Sentences A (KL divergence) and B (MSE) both describe loss functions and form the same-topic pair; sentence C (PReLU) describes an activation function.
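A comparison like this can be computed by encoding each sentence and measuring the cosine similarity between the resulting z vectors; the cosine distance in the table is simply 1 − similarity. The sketch below assumes the model sketched earlier; sentence_to_ids and the sentence_a/b/c variables are hypothetical helpers, not part of the released code.

```python
import torch
import torch.nn.functional as F

# sentence_to_ids is a hypothetical helper that tokenizes a sentence and maps the
# tokens to vocabulary ids; sentence_a/b/c hold the three test sentences above.
ids_a = sentence_to_ids(sentence_a).unsqueeze(0)   # (1, T)
ids_b = sentence_to_ids(sentence_b).unsqueeze(0)
ids_c = sentence_to_ids(sentence_c).unsqueeze(0)

with torch.no_grad():
    z_a, z_b, z_c = (model.encode(ids) for ids in (ids_a, ids_b, ids_c))

sim_same = F.cosine_similarity(z_a, z_b).item()    # A vs B: both loss functions
sim_diff = F.cosine_similarity(z_a, z_c).item()    # A vs C: loss vs activation
print(f"same topic:      sim={sim_same:.4f}  dist={1 - sim_same:.4f}")
print(f"different topic: sim={sim_diff:.4f}  dist={1 - sim_diff:.4f}")
```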

Discussion

Even though the model was never told what an "activation function" or a "loss function" is, it naturally pushed Sentence A and Sentence B closer together in the latent space (a clearly positive cosine similarity of 0.3978) while pushing Sentence C into a nearly orthogonal region (a near-zero cosine similarity of 0.0235). The Hypernetwork forced the semantic "vibe" of the text into the geometry of the space. The model learned that the syntactic and semantic structures required to describe a loss metric (penalties, errors, targets) require a fundamentally different set of generated decoder weights than those describing an activation function (non-linearity, gradients, bounding).

Next Steps

This proof-of-concept successfully validates our core hypothesis: a Hypernetwork-based implicit text autoencoder can accurately reconstruct dense scientific text and naturally organize concepts into a topologically meaningful latent space.

This is step one. By proving that Topic A (Activations) and Topic B (Losses) occupy distinct neighborhoods in the latent space, we have paved the way for our ultimate goal: the Latent Walk. If we scale this model to ingest thousands of ML papers, we can pick two arbitrary vectors and mathematically interpolate a path between them. As we sample new z vectors along this path, we can pass them through the Hypernetwork to decode the text. What lies exactly halfway between the concept of "Rectified Linear Unit" and "Triplet Margin Loss"? Because our decoder generates text from continuous coordinate spaces rather than rigid autoregressive tokens, these interpolations might yield entirely novel, mathematically coherent sentences describing architectures that haven't been invented yet.

LLMs are the ultimate tools for exploiting the known. But architectures like the Hypernetwork Autoencoder might just become our compass for exploring the unknown.
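Appendix: a minimal sketch of what such a latent walk could look like with the model sketched earlier. The greedy argmax readout and the ids_to_text detokenizer are illustrative assumptions; how best to decode interpolated z vectors is still an open question.

```python
import torch

# z_relu and z_triplet would be the latent codes of two concept descriptions,
# e.g. z_relu = model.encode(ids_relu) and z_triplet = model.encode(ids_triplet).
num_steps, seq_len = 9, 32
for alpha in torch.linspace(0.0, 1.0, num_steps).tolist():
    # Walk along the straight line between the two concepts in latent space.
    z_interp = (1 - alpha) * z_relu + alpha * z_triplet    # (1, latent_dim)
    logits = model.decode(z_interp, seq_len)               # (1, seq_len, vocab_size)
    token_ids = logits.argmax(dim=-1).squeeze(0)           # greedy per-position readout
    print(f"alpha={alpha:.2f}: {ids_to_text(token_ids)}")  # hypothetical detokenizer
```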