JEPA is an example of a predictive coding architecture. Predictive coding is the idea that the brain works by predicting the next sensory input and learning from the error between the prediction and the actual input. Hierarchies of this kind of prediction allow higher-level elements to predict the outputs of lower-level elements, building up deeper and more complex semantics.
The core idea in JEPA is to take two related things (say, consecutive frames of a video) x and y, encode each of them into a latent space (an embedding), and then predict the embedding of y, s(y), from s(x). The encoders can be Transformer-based models; in practice, models like I-JEPA train the x (context) encoder by backpropagation and maintain the y (target) encoder as an exponential moving average of the context encoder's weights.
The learning signal is not based on how well the end result matches the target (e.g. how closely the pixels of the next frame are predicted). Instead, it is based on how well the latent representation of the next frame is predicted.
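To make this concrete, here is a minimal sketch of a JEPA-style training step in PyTorch. The architectures, dimensions, and hyperparameters are placeholders (simple MLPs rather than the Transformers used in practice), but the shape of the computation is the same: encode x, encode y with an EMA-updated target encoder, predict s(y) from s(x), and take the loss in latent space.

```python
# Illustrative JEPA-style training step (not the actual I-JEPA code).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

dim_in, dim_latent = 256, 64

context_encoder = nn.Sequential(nn.Linear(dim_in, 128), nn.ReLU(), nn.Linear(128, dim_latent))
target_encoder = copy.deepcopy(context_encoder)   # same architecture, EMA-updated weights
for p in target_encoder.parameters():
    p.requires_grad_(False)                       # no gradients flow into the target encoder

predictor = nn.Sequential(nn.Linear(dim_latent, dim_latent), nn.ReLU(), nn.Linear(dim_latent, dim_latent))
optimizer = torch.optim.AdamW(list(context_encoder.parameters()) + list(predictor.parameters()), lr=1e-3)

def train_step(x, y, ema_decay=0.996):
    s_x = context_encoder(x)                      # embedding of the context (e.g. current frame)
    with torch.no_grad():
        s_y = target_encoder(y)                   # embedding of the target (e.g. next frame)
    s_y_hat = predictor(s_x)                      # predict the target embedding from the context embedding
    loss = F.mse_loss(s_y_hat, s_y)               # error measured purely in latent space, not pixels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Update the target encoder as an exponential moving average of the context encoder.
    with torch.no_grad():
        for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
            p_t.mul_(ema_decay).add_(p_c, alpha=1 - ema_decay)
    return loss.item()

# Stand-in data: x and y represent two related inputs, e.g. consecutive video frames.
x = torch.randn(32, dim_in)
y = torch.randn(32, dim_in)
print(train_step(x, y))
```

The stop-gradient on the target encoder plus the EMA update is one of the standard tricks for keeping this kind of setup from collapsing to a trivial solution.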
The advantage of making the prediction in latent space is that the model can choose what level of detail it wants to capture, discarding some aspects and focusing on more foundational concepts. This helps build a more robust world model, with the hope being that training in this way will then allow easier generalization to more tasks with less data required.
Similarities
This is somewhat similar to autoencoders. Autoencoders take an input, compress it into a latent space, then reconstruct the original from the latent space and backpropagate the reconstruction error. JEPA does a similar process across two different items with separate encoders, and only cares about error within the latent space.
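For contrast, here is a minimal autoencoder sketch (again with placeholder MLPs and dimensions): the loss is reconstruction error back in the input space, whereas JEPA never leaves the latent space.

```python
# Illustrative autoencoder: the loss is reconstruction error in the input
# (e.g. pixel) space, not in the latent space.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim_in, dim_latent = 256, 64
encoder = nn.Sequential(nn.Linear(dim_in, 128), nn.ReLU(), nn.Linear(128, dim_latent))
decoder = nn.Sequential(nn.Linear(dim_latent, 128), nn.ReLU(), nn.Linear(128, dim_in))
optimizer = torch.optim.AdamW(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.randn(32, dim_in)          # stand-in for a batch of inputs
z = encoder(x)                       # compress into the latent space
x_hat = decoder(z)                   # reconstruct the original input
loss = F.mse_loss(x_hat, x)          # error measured in input space
optimizer.zero_grad()
loss.backward()
optimizer.step()
```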
Contrastive models embed two different items into the same space and try to pull the embeddings of things known to be similar closer together while pushing them away from other items. This is used in CLIP and other multimodal text-image encoders, where the text and the image embed into the same space so that a text caption and a matching image are close in embedding space. This requires a lot of pairwise comparisons, while JEPA's training is a more straightforward s(x) -> s(y) prediction.
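Here is a rough sketch of what those pairwise comparisons look like, using a simplified CLIP-style (InfoNCE) loss. The real CLIP training code differs, but the key point is that every text embedding is scored against every image embedding in the batch, with matching pairs on the diagonal.

```python
# Simplified CLIP-style contrastive loss over a batch of text/image embeddings.
import torch
import torch.nn.functional as F

batch, dim = 32, 64
text_emb = F.normalize(torch.randn(batch, dim), dim=-1)   # stand-in text embeddings
image_emb = F.normalize(torch.randn(batch, dim), dim=-1)  # stand-in image embeddings

temperature = 0.07
logits = text_emb @ image_emb.T / temperature   # batch x batch pairwise similarities
targets = torch.arange(batch)                   # matching pairs sit on the diagonal
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```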
Challenges
Because JEPA models leave you with a latent representation, they need to be paired with a generator to produce an observable, human-viewable output, which is a per-domain challenge. This also makes it harder to evaluate how well the model is learning beyond measuring the loss.
Training stability can also be tricky: it is possible for the model to collapse and learn trivial representations that minimize prediction error. Even without complete collapse, it can require some experimentation to ensure the model is learning at a deep enough conceptual level. For example, I-JEPA, which works on images, found that predicting sufficiently large masked target blocks was important for the model to capture a semantic level of detail.
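One simple diagnostic for collapse (a common heuristic for joint-embedding methods generally, not something specific to the JEPA papers) is to track the per-dimension standard deviation of embeddings across a batch; if it falls toward zero, the encoder is mapping every input to nearly the same point.

```python
# Collapse diagnostic: low per-dimension standard deviation across a batch
# suggests the encoder is producing (near-)constant embeddings.
import torch

def embedding_std(embeddings: torch.Tensor) -> float:
    """Mean per-dimension standard deviation across the batch."""
    return embeddings.std(dim=0).mean().item()

healthy = torch.randn(256, 64)                    # varied embeddings
collapsed = torch.randn(1, 64).repeat(256, 1)     # every input maps to the same vector
print(embedding_std(healthy))    # noticeably above zero
print(embedding_std(collapsed))  # ~0, a sign of collapse
```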