## 1 Introduction

With the release of the seminal paper “Attention Is All You Need” attention, the field of natural language processing transformed into an area completely dominated by transformer architectures capable of distilling the world’s massive amount of unlabeled data into a formidable knowledge base. Gone were the days of recurrent and even convolutional neural networks, as the attention-based transformer model led to such advancements as BERT bert and the GPT series of models gpt, gpt2, gpt3. The overwhelming success of these massive isotropic models (which maintain the same representation shape through all layers) left researchers wondering if the same techniques could be applied to computer vision, leading to the development of the image-GPT model igpt.

In addition to the isotropic architecture and unsupervised training objective, the image-GPT model is also unique among vision models in that it eliminates most, if not all, of the inductive biases traditionally associated with computer vision problems. Instead of continuous pixel values as input, image-GPT discretizes 24-bit RGB pixel values by clustering them into a 9-bit (512-entry) palette encoded as one-hot vectors. Rather than locality-considering convolutions, image-GPT performs position-agnostic self-attention operations.

Instead of relying on these inductive biases, image-GPT’s extreme parameter count and depth allow it to learn from scratch the critical relationships inherent in the natural world. With the smallest model proposed in the paper igpt having 76 million parameters and the largest having 6.8 billion, coupled with the quadratic memory usage of the self-attention operation, there is unfortunately no way for the typical AI researcher to evaluate, let alone train, these massive models.

To facilitate the advancement of AI research and avert a potential AI winter, it is necessary to distill advancements made possible by massive amounts of computing power into their core ideas that can be further studied by the community at large. As such, it is critical to understand the features of these huge transformer models that can be adequately studied by the typical AI researcher. In this work, we explore small to moderately sized networks in the style of image-GPT that can be trained with our modest computing resources, and we examine the effects of including, removing, or even replacing the inductive biases excised by image-GPT.

## 2 Related Works

Given that the world is full of immeasurable amounts of unlabeled data, the goal of unsupervised representation learning is quite attractive. While most of the recent advancements in unsupervised learning have spawned from natural language processing, computer vision researchers have also developed methods of training on unlabeled data through self-supervised learning.

#### Unsupervised Natural Language Processing

While the transformer model was originally proposed for paired sequence-to-sequence translation attention, it was quickly adapted to the BERT objective to learn representations in a fully unsupervised manner by masking out certain tokens (words) in the input sequence with the goal of predicting these same tokens in the output sequence bert. The transformer model was also used to learn representations by training on the unsupervised auto-regressive objective, wherein the model simply has to predict the next token in an incomplete sequence, leading to the development of the Generative Pre-Training (GPT) series of models gpt, gpt2, gpt3. It was these two unsupervised objectives coupled with the transformer architecture that eventually led to the development of image-GPT: a Generative Pre-Training model for images that is functionally the same as BERT or GPT in that it simply treats pixels as words.

#### Contrastive Visual Learning

An aspect of computer vision that does not translate to natural language processing is the idea of data augmentation, or applying arbitrary transformations to an image. The idea of data augmentation led to one of the most popular methods of self-supervised representation learning for vision: contrastive learning simclr. In this method of representation learning, sequences of specific stochastic transformations are applied independently to each image. The model then compares the embeddings of two transformed images. If they come from the same source image, the model minimizes the distance between their embeddings, and if not, the model maximizes that distance. Since this method effectively assigns pseudo-labels for a discriminative task, it is considered to be self-supervised learning rather than pure unsupervised learning.
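As a rough illustration of the contrastive idea described above, the following NumPy sketch implements a normalized-temperature cross-entropy loss in the style of the SimCLR paper. The function name `nt_xent_loss`, the temperature of 0.5, and the toy embeddings are assumptions made for this example and are not taken from the present work.

```python
import numpy as np

def nt_xent_loss(z_a, z_b, temperature=0.5):
    """Contrastive loss for N positive pairs (z_a[i], z_b[i]).

    Every other pairing inside the batch is treated as a negative.
    The temperature value is an illustrative assumption.
    """
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)              # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # unit length: dot product = cosine similarity
    sim = (z @ z.T) / temperature                       # (2N, 2N)
    np.fill_diagonal(sim, -np.inf)                      # an embedding is never contrasted with itself
    # Index of the positive partner for every row.
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Row-wise log-softmax, computed stably.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), positives].mean()

rng = np.random.default_rng(0)
z_a = rng.normal(size=(8, 16))
# Two views of the same sources (nearly identical embeddings) versus unrelated embeddings.
aligned = nt_xent_loss(z_a, z_a + 0.01 * rng.normal(size=(8, 16)))
unrelated = nt_xent_loss(z_a, rng.normal(size=(8, 16)))
```

The loss is lower when paired embeddings agree, which is the behaviour the pseudo-labels are meant to encourage.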

## 3 Model Architectures

Figure 2: Distance Preservation in Input Encodings. Panels: One-Hot, Random Fourier, Continuous. Each pixel represents the cosine similarity between the encodings of the two pixel values given by its (x, y) position, with the top left representing (0, 0) and the bottom right (255, 255). A one-hot representation destroys all distance while a continuous representation perfectly preserves it. A random Fourier embedding allows us to preserve distance in a local band while also substantially increasing rank.

While the image-GPT model exclusively uses transformer blocks in its layers, we also explore two other types of isotropic blocks: mixer and convolutional. We also consider different methods of encoding the input that each carry their own level of inductive bias. Furthermore, image-GPT was trained on both the auto-regressive and BERT bert objectives while here, we only focus on the latter.

### 3.1 Block Types

Like image-GPT, all of our models are isotropic in shape such that they keep the same size representation throughout all layers. After an initial layer to encode the input to the desired dimensionality, the models then consist of blocks. In this work, we explore networks consisting of three types of blocks: transformer blocks attention, MLP-mixer blocks mixer, and traditional convolutional blocks.

For comparison’s sake, the three block types are kept as similar as possible such that they have the same number of non-linearities and the same activation dimensionality.

#### Convolutional Block

Each type of block contains a different level of inductive bias. The convolutional block carries the locality bias that comes with the filter in the convolution operation itself. The forward pass of the convolutional block with intermediate skip-connection resnet point can be described by the following equations:

(1) |

where GELU is the Gaussian Error Linear Unit activation gelu as used in image-GPT.
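To make the locality bias concrete, here is a minimal NumPy sketch of a residual convolutional block. The exact layout (two 3×3 convolutions with a single skip connection and no normalization), the helper names, and the tensor sizes are assumptions for illustration and may differ from the block used in this work.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def conv2d_same(x, w):
    """Zero-padded 'same' convolution. x: (H, W, C_in), w: (k, k, C_in, C_out)."""
    k = w.shape[0]
    pad = k // 2
    height, width = x.shape[:2]
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((height, width, w.shape[-1]))
    for i in range(k):
        for j in range(k):
            # Each kernel offset contributes a shifted, channel-mixed copy of the input.
            out += xp[i:i + height, j:j + width] @ w[i, j]
    return out

def conv_block(x, w1, w2):
    """Hypothetical isotropic block: x + conv(GELU(conv(x))). Output shape equals input shape."""
    return x + conv2d_same(gelu(conv2d_same(x, w1)), w2)

rng = np.random.default_rng(0)
channels = 8
w1 = rng.normal(scale=0.1, size=(3, 3, channels, channels))
w2 = rng.normal(scale=0.1, size=(3, 3, channels, channels))
x = rng.normal(size=(16, 16, channels))

# Perturb a single pixel and record which output positions move.
x_perturbed = x.copy()
x_perturbed[8, 8] += 1.0
delta = np.abs(conv_block(x_perturbed, w1, w2) - conv_block(x, w1, w2)).max(axis=-1)
rows, cols = np.nonzero(delta > 1e-12)
```

With two stacked 3×3 filters, the perturbation can only reach a 5×5 neighbourhood around the changed pixel, which is the locality prior in action.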

#### Mixer Block

We adapt this block from the recently proposed MLP-Mixer architecture mixer. The mixer block is similar to the transformer block except that it performs a token-mixing MLP instead of the self-attention operation. As such, it still retains the long-range connections that make self-attention desirable without the quadratic activation memory cost. The mixer block carries a positional bias in its token-mixing MLP, as the MLP carries different weights for each token position in the representation. The forward pass of the mixer block (where the representations are 2D matrices made of column-stacked tokens) is described by:

(2) |
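For intuition, a mixer block in the standard formulation of the cited MLP-Mixer paper can be sketched in NumPy as follows. The pre-normalization layout, hidden width, and weight names are assumptions for this example; tokens are stacked as columns to match the convention above.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    """Normalize each token (column) over its channels."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def mixer_block(x, w_tok_in, w_tok_out, w_ch_in, w_ch_out):
    """x: (channels, tokens) with tokens stacked as columns."""
    # Token mixing: the weights are indexed by token position, hence the positional bias.
    x = x + gelu(layer_norm(x) @ w_tok_in) @ w_tok_out
    # Channel mixing: the same MLP is applied to every token independently.
    x = x + w_ch_out @ gelu(w_ch_in @ layer_norm(x))
    return x

rng = np.random.default_rng(0)
channels, tokens, hidden = 16, 10, 32
x = rng.normal(size=(channels, tokens))
weights = (
    rng.normal(scale=0.1, size=(tokens, hidden)),    # w_tok_in
    rng.normal(scale=0.1, size=(hidden, tokens)),    # w_tok_out
    rng.normal(scale=0.1, size=(hidden, channels)),  # w_ch_in
    rng.normal(scale=0.1, size=(channels, hidden)),  # w_ch_out
)
y = mixer_block(x, *weights)

# Shifting the token order changes the result beyond a mere reordering.
perm = np.roll(np.arange(tokens), 1)
y_from_permuted = mixer_block(x[:, perm], *weights)
position_sensitive = not np.allclose(y_from_permuted, y[:, perm])
```

No tokens-by-tokens activation is ever formed here; the only token-sized objects are the mixing weights themselves.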

#### Transformer Block

Lastly, the transformer carries no such bias, as the self-attention operator is completely agnostic to the position of its input tokens. The transformer is also typically a gatekeeper for training hardware due to the quadratic memory usage of the self-attention activations. The forward pass of the transformer block (where the representations are again 2D matrices made of column-stacked tokens) is described by:

(3) |
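The position-agnostic behaviour can be checked with a small NumPy sketch of a single-head, pre-norm transformer block without positional encodings. The layout follows the standard formulation from the cited attention paper; the dimensions and weight names are assumptions for illustration.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    """Normalize each token (column) over its channels."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def transformer_block(x, w_q, w_k, w_v, w_o, w_in, w_out):
    """x: (channels, tokens) with tokens stacked as columns."""
    h = layer_norm(x)
    q, k, v = w_q @ h, w_k @ h, w_v @ h
    scores = (q.T @ k) / np.sqrt(q.shape[0])   # (tokens, tokens): quadratic in sequence length
    attn = softmax(scores)                     # row i holds the weights used by query token i
    x = x + w_o @ (v @ attn.T)
    x = x + w_out @ gelu(w_in @ layer_norm(x))
    return x

rng = np.random.default_rng(0)
channels, tokens, head_dim, hidden = 16, 10, 8, 32
x = rng.normal(size=(channels, tokens))
weights = (
    rng.normal(scale=0.1, size=(head_dim, channels)),  # w_q
    rng.normal(scale=0.1, size=(head_dim, channels)),  # w_k
    rng.normal(scale=0.1, size=(head_dim, channels)),  # w_v
    rng.normal(scale=0.1, size=(channels, head_dim)),  # w_o
    rng.normal(scale=0.1, size=(hidden, channels)),    # w_in
    rng.normal(scale=0.1, size=(channels, hidden)),    # w_out
)
y = transformer_block(x, *weights)

# Reordering the input tokens simply reorders the output tokens.
perm = np.roll(np.arange(tokens), 1)
equivariant = np.allclose(transformer_block(x[:, perm], *weights), y[:, perm])
```

The `scores` matrix is the tokens-by-tokens activation responsible for the memory cost, and the equivariance check shows that nothing in the block depends on where a token sits.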

### 3.2 Input Encodings

In this work, we experiment with three different input encodings: one-hot pix2vec, continuous space, and random Fourier features. The image-GPT model learns a one-hot pix2vec embedding from scratch, where any and all preserved distance from the input space must be learned by the model itself. While this idea may seem counter-intuitive at first, giving the network a one-hot input seeds it with a representation that is already of maximal rank, allowing for easier discriminative tasks down-stream. With our other input encodings, we aim to further explore this trade-off between distance preservation and rank escalation.

#### pix2vec

Just like the word2vec embeddings used in language models, image-GPT learns a pix2vec embedding from one-hot encoded pixels to a vector space matching the latent dimensionality. To do this, image-GPT clusters all possible 24-bit RGB pixel values into a 9-bit one-hot space. Instead, we learn a separate 8-bit pix2vec embedding for each channel. In this case, the model must learn any required distance preservation separately for each channel.
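A per-channel pix2vec lookup can be sketched as below. The tables are random stand-ins for learned weights, the function name is hypothetical, and how the three channel embeddings are subsequently combined is deliberately left open since it is not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 128
# One 256-entry table per colour channel; during training these would be learned.
tables = rng.normal(scale=0.02, size=(3, 256, latent_dim))

def pix2vec(image, tables):
    """image: (H, W, 3) uint8 -> (H, W, 3, latent_dim), one embedding per channel value."""
    return np.stack([tables[c][image[..., c]] for c in range(3)], axis=2)

image = rng.integers(0, 256, size=(4, 4, 3)).astype(np.uint8)
embedded = pix2vec(image, tables)

# Equivalent view: multiplying a one-hot vector by the table selects one of its rows.
one_hot_red = np.eye(256)[image[..., 0]]      # (4, 4, 256)
via_matmul = one_hot_red @ tables[0]          # (4, 4, latent_dim)
```

Because each row of a table is free to move independently, nothing forces the embeddings of neighbouring intensities such as 100 and 101 to be close unless training makes them so.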

#### Random Fourier Features

Inspired by the usage of positional encoding via exponential Fourier features in NeRF mildenhall2020nerf, we experiment with encoding the pixel values themselves with random Fourier features. Borrowing notation from the NeRF paper, the random Fourier feature encoding for a given number of frequencies is formally described by the following function:

(4) |

where the frequencies are drawn at random. By using random Fourier features instead of exponential Fourier features, we can have arbitrarily many frequencies without destroying all distance. Here, the scale of the random frequencies directly acts as a slider between distance preservation and rank escalation, as seen in Figure 3.

Figure 3: Distance Preservation in Random Fourier Encodings.
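The trade-off between distance preservation and rank can be explored with a small NumPy sketch. The specific form used here (sine and cosine of 2π times a Gaussian frequency times a pixel value scaled to [-1, 1]), the 64 frequencies, and the two scales are assumptions for illustration and are not necessarily the encoding used in this work.

```python
import numpy as np

def random_fourier_encode(p, freqs):
    """p: (n,) values in [-1, 1]; freqs: (k,) random frequencies. Returns (n, 2k)."""
    angles = 2.0 * np.pi * np.outer(p, freqs)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def cosine_similarity_matrix(enc):
    """Pairwise cosine similarity between the rows of enc."""
    unit = enc / np.linalg.norm(enc, axis=1, keepdims=True)
    return unit @ unit.T

rng = np.random.default_rng(0)
pixels = (np.arange(256) - 127.5) / 127.5
base = rng.normal(size=64)          # shared draw so that only the scale differs

enc_low = random_fourier_encode(pixels, 0.5 * base)    # small frequency scale
enc_high = random_fourier_encode(pixels, 8.0 * base)   # large frequency scale

sim_low = cosine_similarity_matrix(enc_low)
sim_high = cosine_similarity_matrix(enc_high)
sim_one_hot = cosine_similarity_matrix(np.eye(256))

rank_continuous = np.linalg.matrix_rank(pixels[:, None])   # a single scalar per pixel value
rank_high = np.linalg.matrix_rank(enc_high)
```

Under this form the similarity between two values depends only on their difference, so the matrices show a band around the diagonal that narrows as the frequency scale grows, while the rank rises well above that of the raw scalar input.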

#### Raw Continuous Input

For our continuous input, we do not perform any per-pixel normalization. Instead, all 8-bit pixel values are simply mapped to the range [-1, 1] by subtracting 127.5 and then dividing by the same. This method of encoding perfectly preserves all distance between pixel values but does not provide an inherent rank escalation.
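The mapping above amounts to a single affine transform; a minimal sketch (the function name is hypothetical) is:

```python
import numpy as np

def to_continuous(pixels_uint8):
    """Map 8-bit pixel values to floats by subtracting 127.5 and dividing by 127.5."""
    return (pixels_uint8.astype(np.float32) - 127.5) / 127.5

x = to_continuous(np.array([0, 127, 128, 255], dtype=np.uint8))
```

The extremes 0 and 255 land on the two ends of the range, and the two middle values 127 and 128 sit symmetrically around zero.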

As stated in the original BERT paper bert, the acronym stands for Bidirectional Encoder Representations from Transformers. To refer to the masked token prediction as the BERT objective even when discussing non-transformer architectures may be a potential misnomer, but we will continue referring to it as such for simplicity.

While other methods of self-supervised learning in vision rely heavily on pseudo-labels or input transformations, the BERT objective originates from natural language processing where input transformations do not make sense. As such, the BERT objective provides a goal for pure unsupervised learning by simply requiring the prediction of the masked parts of the input.

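To describe the BERT objective, let $M$ be the set of masked input indices, where each input index has a fixed chance of being in $M$. Keeping with precedent bert, igpt, we fix this chance at 15%. Then the training objective is to minimize the negative log-likelihood of the predicted discrete values of the masked pixels given the unmasked pixels. Formally, over the training set $X$, the BERT objective is to minimize

$$\mathcal{L}_{\mathrm{BERT}} = \mathbb{E}_{x \sim X}\, \mathbb{E}_{M} \sum_{i \in M} \left[ -\log p\left(x_i \mid x_{\setminus M}\right) \right] \tag{5}$$

where $x_{\setminus M}$ denotes the unmasked pixels of $x$.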

Note that by this objective, our model must predict the discrete, one-hot encoded pixels regardless of the input encoding. While this is a much harder objective than a regression, it prevents the model from simply predicting the average of the surrounding pixels and, in doing so, forces the model to learn a meaningful representation of the input data.
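A minimal NumPy sketch of the masked objective described above follows. The function names, the per-channel flattening of one CIFAR-10 image, and the dummy uniform predictions are assumptions for illustration; a real model would produce the logits from the unmasked input.

```python
import numpy as np

def sample_mask(n, rng, mask_prob=0.15):
    """Mask each of n positions independently with probability mask_prob."""
    return rng.random(n) < mask_prob

def masked_nll(logits, targets, mask):
    """Sum of -log p(target) over the masked positions only.

    logits: (n, 256) scores over discrete pixel values,
    targets: (n,) integer pixel values, mask: (n,) boolean.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return nll[mask].sum()

rng = np.random.default_rng(0)
n = 32 * 32 * 3                        # hypothetical flattening: one value per pixel channel
targets = rng.integers(0, 256, size=n)
mask = sample_mask(n, rng)

# A model that predicts a uniform distribution pays log(256) nats per masked value.
uniform_loss = masked_nll(np.zeros((n, 256)), targets, mask)
```

Unmasked positions contribute nothing to the loss, so the model is rewarded only for filling in what it cannot see.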

## 4 Experiments

For our experiments, we implement our models and training procedures in TensorFlow tensorflow2015-whitepaper using the higher-level Keras wrapper chollet2015keras. Our training hardware consists of 4 NVIDIA TITAN Xp GPUs with 12GB of VRAM each for a total of 48GB of VRAM.

### 4.1 Dataset

In the image-GPT paper, the authors train their models on the unsupervised objective using the entirety of the ~1.3 million sample ImageNet dataset deng2009imagenet before evaluating them via linear probing and fine-tuning for ImageNet classification. Given our modest amount of compute and our focus on such scenarios, we instead train our unsupervised models on CIFAR-10 cifar and evaluate exclusively with linear probing on the same dataset.

### 4.2 Training Procedure

To pre-train our models on the unsupervised BERT objective, we use the Adam optimizer kingma2014adam. We begin with one epoch of linearly warming up the learning rate from 0 to 0.01 before decaying back to 0 over the next 50 epochs using a cosine schedule cosine.

After we have pre-trained on the unsupervised task, we train linear probes to evaluate the efficacy of the learned representation. We use the same optimizer and learning rate as when training the unsupervised models, but we instead decay over 100 epochs or until convergence.
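The pre-training schedule (one warm-up epoch to a peak of 0.01, then 50 epochs of cosine decay to 0) can be written as a small function of a possibly fractional epoch. The function and parameter names are hypothetical, and in practice the schedule would be evaluated per optimizer step.

```python
import math

def learning_rate(epoch, peak=0.01, warmup_epochs=1.0, decay_epochs=50.0):
    """Linear warm-up from 0 to peak, then cosine decay from peak back to 0."""
    if epoch < warmup_epochs:
        return peak * epoch / warmup_epochs
    # Fraction of the decay phase completed, clipped at the end of training.
    progress = min((epoch - warmup_epochs) / decay_epochs, 1.0)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))

# Sample the schedule at the start of every epoch of pre-training.
schedule = [learning_rate(e) for e in range(52)]
```

Passing `decay_epochs=100` gives the longer decay used for the linear probes.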

Unlike classical auto-encoders that contain an information bottleneck layer, our models are isotropic, so it is not immediately clear which layer would provide the best features. The image-GPT paper also notes that their best unsupervised features come from a combination of several layers. To avoid this combinatorial search space given our modest amount of compute, we simply consider the linear probes from each isolated layer. Since we are analyzing the effects of different inductive biases rather than reaching for state-of-the-art results, we consider this sufficient here, though exploring layer combinations would be interesting future work.

For all except the 4-headed transformer models, we use a batch size of 128. For the 4-headed transformer models, we had to reduce the batch size to 32 to fit on our GPUs.

### 4.3 Selected Models

For all our models, we choose a latent channel dimension of 128. We experiment with our various input encodings (pix2vec, random Fourier features, raw continuous) and block types (convolutional, mixer, transformer) for networks of size (number of blocks) 1, 3, and 6. However, training the multi-headed transformer models and then linearly probing each layer became prohibitively expensive with our hardware once we reached size 6, so we halted this training after the first few to focus on other settings.

## 5 Results

We present our results on networks of size 1, 3, and 6. Results for networks of size 12 can be found in the supplementary material.

### 5.1 Architecture Class

Figure 5: Distance Preservation in Learned pix2vec Embeddings. Panels: Convolution, MLP-Mixer, 1-Head Transformer, 4-Head Transformer.

Looking at the left part of Figure 6, we see that transformer networks consistently have the worst performance on the unsupervised task while convolutional networks perform the best. As noted by the image-GPT authors, we see a strong correlation between the unsupervised generative model’s performance and the accuracy of the best linear probe. Accordingly, the transformer networks also perform the worst on the classification task, while convolutional networks again perform the best by a wide margin.

We can see the direct result of the different model classes’ unsupervised validation loss in Figure 1. In the convolutional setting (top), we see that the horse is nearly perfectly reconstructed by the network. In the MLP-mixer setting (middle), we see that all the colors in the reconstructed frog seem to be from the correct palette, but some of them are still misplaced. Lastly, in the transformer setting, we see the model repeatedly predict completely incorrect colors in the reconstructed bird, explaining the transformer models’ consistently much higher unsupervised loss.

If we look only at models that used the pix2vec input embeddings, we can gain further insight into this phenomenon. Figure 5 shows the learned pix2vec embeddings for the various types of architectures. We see that the level of distance preservation is highest in the convolutional models, with that of the MLP-mixer models being slightly worse. However, the distances learned by the transformer models degrade to almost random patterns aside from a very narrow local band just off the main diagonal. Note that it is not possible for there to be 0 similarity between all pairs of pixel values here as in Figure 2 because the learned embedding is an under-complete transformation of the one-hot encoding. In other words, it is not possible to have a set of 256 orthogonal vectors in a 128-dimensional space.
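The rank argument can be verified numerically. In the sketch below, a random 256×128 matrix stands in for a learned pix2vec table (an assumption for illustration): its similarity matrix has rank at most 128, whereas the identity matrix produced by 256 mutually orthogonal vectors would need rank 256.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding = rng.normal(size=(256, 128))     # stand-in for a learned pix2vec table

unit = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)
similarity = unit @ unit.T                  # (256, 256) cosine similarities

embedding_rank = np.linalg.matrix_rank(embedding)
similarity_rank = np.linalg.matrix_rank(similarity)
off_diagonal = similarity[~np.eye(256, dtype=bool)]
largest_overlap = np.abs(off_diagonal).max()
```

Since the similarity matrix cannot reach rank 256, some off-diagonal entries must be nonzero regardless of how the table is trained.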

While there is certainly more going on in later layers in the network, it is interesting to note that the level of distance preservation in the learned pix2vec embedding strongly correlates with the relative performance of those model classes as a whole.

### 5.2 Encoding Method

Looking at the right part of Figure 6, we see that the best overall model in terms of both unsupervised performance and classification accuracy used the learned pix2vec embedding from the one-hot pixels. However, if we consider a fixed unsupervised loss, it appears the network using the continuous raw pixels would perform the best, with the pix2vec network actually performing the worst. Furthermore, the classification performance of the random Fourier encoded networks seems to lie in the space between that of the continuous and one-hot pix2vec encoded networks. Examining Figure 3, we can see how lowering the scale of the random frequencies would eventually lead to perfect distance preservation, while raising it would converge to a random over-complete transformation of a one-hot embedding.

Having examined the overall trends of varying the input encoding method, we make further observations by examining the effect of varying the encoding method over a fixed architecture class. In the left part of Figure 7, we see that the random Fourier encoded network actually outperforms the other two. Similarly, in the right part of Figure 7, we see that the one-hot pix2vec encoded network outperforms the others. These results stand in contrast to the global trend of the continuous-encoded networks performing the best overall, suggesting that specific inductive biases (encoding and architectural) may pair better with some partners than with others.

### 5.3 General Trends

As stated earlier, one extremely noticeable trend from our experiments is the correlation between the performance of the unsupervised generative model and that of the down-stream classification task just as in image-GPT igpt. This bodes well for the research community, as it implies breakthroughs discovered on small-scale isotropic unsupervised models may also transfer well to extremely large-scale models.

Another point of note is the abject failure of transformer models to compete with MLP-mixer or convolutional models as a whole. While transformer models are state-of-the-art for unsupervised visual learning with large-scale isotropic networks, it seems like this does not transfer well to small-scale problems. With our limited computing resources, it is difficult to determine if this is a data or architecture problem, though it is likely a combination of both.

Lastly, we identify the overall effect of inductive biases on our objective. Note that out of the encoding types, the raw continuous encoding performed the best on the classification task (with respect to unsupervised generative performance). Furthermore, out of the architecture classes, the convolutional networks outperformed both the MLP-mixer and transformer networks. Interestingly, this means that the two “hardest” inductive biases (locality in convolutions and perfect distance in continuous pixels) performed the best on our small-scale problem, while both of these biases are removed in the extremely large-scale image-GPT igpt. This supports the idea that small-scale networks do not have the capacity to re-learn these truths of nature from scratch and benefit greatly from having them explicitly baked into the model.

## 6 Conclusions

In this work, we study small-scale isotropic networks for unsupervised visual learning and analyze the impacts of the inductive biases that have been ablated out in extremely large-scale isotropic networks for the same task igpt. While these massive networks can learn these truths from scratch over immense amounts of data, smaller networks seem to benefit greatly from having these inductive biases baked into the model. Specifically, we have shown that using a convolutional architecture with a continuous input outperforms using a transformer with one-hot input on small-scale networks, in contrast to large-scale networks like image-GPT.

Furthermore, our experimental results show that different types of input encodings yield better results with different types of isotropic network architectures, suggesting the possibility of as-of-yet untested inductive biases that may yield improved results when added to large-scale isotropic networks as well.

We have also shown that the general trend of a better unsupervised generative model yielding better results for down-stream classification translates well to small-scale isotropic networks regardless of model architecture or input encoding, suggesting that improvements made to small-scale isotropic generative models may also lead to improvements for large-scale models.

Overall, we have provided insight as to the benefits of inductive biases in small-scale isotropic networks and illuminated ways for the typical AI researcher with modest compute to join in on current advancements and continue investigating these isotropic networks for pure unsupervised visual learning. Rather than throwing as much data and compute power as possible at an extremely large network, building a better understanding of the model’s components, even at a small scale, will lead to quicker and greater advancements in the field.

## 7 Acknowledgements

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE1745016.
