Neural Style Transfer with TensorFlow

AI is sweeping every field it touches, and art is no exception. Here, we see how to implement neural style transfer with TensorFlow.

This is a portrait of a gentleman named Edmond Belamy. It went under the hammer at Christie's, the famous auction house, for a jaw-dropping $432,500. Eagle-eyed readers, however, will have spotted that the signature in the bottom right corner looks very strange.

The first piece of art generated by an artificial intelligence

The signature is this equation:

$$\min \limits_{G} \max \limits_{D} \mathbb{E}_{x} [\log (D(x))] + \mathbb{E}_{z} [\log (1- D(G(z)))]$$

The artwork was made by a generative adversarial network (GAN), which is just a fancy term for a neural network that tries to get better by fighting itself. The network has two parts: a generator ($G$ in the equation), which makes an educated guess, and a discriminator ($D$), which tries to tell the unwanted images from the good ones. This information flows back to the generator, which then attempts to produce better guesses. The two adversarial components thus improve the entire model. The image above is from a series of art pieces on the fictional Belamy family, by Obvious Art.

The tutorial requires some knowledge of TensorFlow (ideally you should have built CNNs before trying this out).

Implementing Neural Style Transfer

In this article however, we focus on a different type of generation, with neural style transfer techniques. This was first covered in the paper by Gatys et al., A Neural Algorithm of Artistic Style. The process is quite different from GANs. We make use of a traditional convolutional network to transfer features from an image, and then rebuild them with stylistic content fed from another image.

If you have messed around with convolutional neural networks, you would know that they work by learning features from the images in the dataset, and then attempt to extract those features from a new image shown to it. Similarly, we can extract these features from the middle of the network, and try to reconstruct the original image with them. No doubt, the result would be slightly different. Now what if we modify these features, and instead add a stylistic component to them? The reconstructed image would be quite different, and would show traces of the style modification to its features. This is the key idea behind neural style transfer.

We run into a tiny problem. To extract features from a network, we should have already trained it. For this purpose, instead of designing a model ourselves, and then training it from scratch, we will use the VGG model from TensorFlow. To follow along with the code used for this article in one place, you can go here.

Visual Geometry Group Model

The VGG model is a set of deep convolutional networks which was the first runner up in the ILSVRC-2014, losing by a narrow margin to GoogLeNet. The VGG model takes in an input $224 \times 224 \times 3$ image and applies a series of convolutions and poolings to it, in order to extract features. The model can have varying depths. For the purpose of this style transfer, the VGG19 model was used but feel free to experiment with the 16 layers deep version as well!

VGG16 model

We can visualize what kind of features are extracted from an image, using this picture of the BITS Pilani Rotunda taken from reddit as an example. Below the picture, is a set of selected images from its feature map, and in it, we can see how the neurons in the network fire. It consists of outlines, light spots, dark spots and so on, and I do encourage you to try this out with a variety of images. As we go into the deeper layers of the model, we find the patterns getting more obscure, as they simply start representing just the presence or absence of a feature. In our style transfer, we will simply take these features from the content image to rebuild our generated image.

The BITS Pilani Rotunda
Transition of features across layers of the network

To build this yourself, you'll need the following code. First we import all the libraries and initialize an instance of the VGG model. We then dump out all the layers which we won't require and consider only 5 layers for our needs. We then redefine our model with only those layers.

from tensorflow.keras.applications.vgg19 import VGG19, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model
import matplotlib.pyplot as plt
from numpy import expand_dims

model = VGG19()

ixs = [2, 5, 10, 15, 20]
outputs = [model.layers[i+1].output for i in ixs]
model = Model(inputs=model.inputs, outputs=outputs)
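It may help to see which layers those indices actually pick out. The sketch below writes out the layer order of the Keras VGG19 model by hand and applies the same i+1 offset. The layer names are an assumption based on the usual blockN_convM / blockN_pool convention (the input layer's name in particular varies between versions), so it is worth checking against model.summary(). Under that assumption, the five selected outputs are the pooling layers that close each block.

```python
# Layer order of the Keras VGG19 model, written out by hand.
# Assumption: names follow the usual blockN_convM / blockN_pool convention;
# the input layer's name differs between Keras versions, so a placeholder is used.
convs_per_block = [2, 2, 4, 4, 4]

layer_names = ['input']
for block, n_convs in enumerate(convs_per_block, start=1):
    for conv in range(1, n_convs + 1):
        layer_names.append(f'block{block}_conv{conv}')
    layer_names.append(f'block{block}_pool')
layer_names += ['flatten', 'fc1', 'fc2', 'predictions']

# Same indices and the same i+1 offset as in the code above.
ixs = [2, 5, 10, 15, 20]
selected = [layer_names[i + 1] for i in ixs]
print(selected)
```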

Then we run some standard Keras and NumPy code to convert our image into a form the model can understand. You will have to define image_path yourself and set it to the path of the target image. With its classification layers included, the VGG model expects input of shape (None, 224, 224, 3), so we load the image at that size. We convert it to an array, add a batch dimension, and preprocess it for the model.

image = load_img(image_path, target_size=(224, 224))
image = img_to_array(image)
image = expand_dims(image, axis=0)
image = preprocess_input(image)
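For intuition, here is roughly what preprocess_input does for VGG, sketched in NumPy. To the best of our knowledge the VGG variant flips the channels from RGB to BGR and subtracts the per-channel ImageNet means without any further scaling. The function name vgg_preprocess_sketch is hypothetical and the mean values are written out by hand, so treat this as an illustration rather than the library's implementation.

```python
import numpy as np

# Per-channel ImageNet means in BGR order (assumed values, as used by the
# 'caffe'-style preprocessing that Keras applies for VGG).
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def vgg_preprocess_sketch(batch_rgb):
    """Hypothetical stand-in for preprocess_input: expects RGB values in [0, 255]."""
    batch = np.asarray(batch_rgb, dtype=np.float32)
    batch_bgr = batch[..., ::-1]            # RGB -> BGR
    return batch_bgr - IMAGENET_BGR_MEAN    # zero-centre each channel

# A single pure-red pixel as a 1 x 1 x 1 x 3 batch.
red = np.array([[[[255.0, 0.0, 0.0]]]])
processed = vgg_preprocess_sketch(red)
print(processed.shape, processed[0, 0, 0])
```

This is also why the style model later multiplies its inputs by 255 before calling preprocess_input: the function expects pixel values on the 0 to 255 scale.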

We now put the model to work and get it to predict on the image. Because we modified the model, the predict method returns the feature maps of the image we fed in, rather than the actual class prediction.

feature_maps = model.predict(image)

Finally, we iterate through the maps and use matplotlib to show them to us.

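square = 8  # each figure shows an 8 x 8 grid, i.e. the first 64 channels

for fmap in feature_maps:
  fig = plt.figure(figsize=(32, 19))
  ix = 0
  for _ in range(square):
    for _ in range(square):
      ax = fig.add_subplot(square, square, ix+1)
      # hide the axis ticks so only the feature map is visible
      ax.set_xticks([])
      ax.set_yticks([])
      ax.imshow(fmap[0, :, :, ix], cmap='gray')
      ix += 1
  plt.show()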

Starting with Style Transfer

Now, let's get around to the task of the actual style transfer. For this, we need two images - a content image and a style image. The content image would be broken down into its features and reconstructed from it. As usual, we start off by importing the necessary libraries.

import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.applications.vgg19 import preprocess_input
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import ImageGrid
import time
import PIL.Image

We now write some code to load our images in. Again, content_path and style_path are two variables which you will have to define to the path of the image.

def load_image(image):
  dim = 224
  image = plt.imread(image)
  img = tf.image.convert_image_dtype(image, tf.float32)
  img = tf.image.resize(img, [dim, dim])
  img = img[tf.newaxis, :]
  return img
content = load_image(content_path)
style = load_image(style_path)

Let's also build a method to go the other way round, a method to convert the tensor to an image.

def tensor_to_image(tensor):
  tensor = tensor * 255
  tensor = np.array(tensor, dtype=np.uint8)
  if np.ndim(tensor) > 3:
    assert tensor.shape[0] == 1
    tensor = tensor[0]
  return PIL.Image.fromarray(tensor)

We import the VGG model, and as before, make some changes to it for the purpose of style transfer.

vgg_model = tf.keras.applications.VGG19(include_top=False, weights='imagenet')
vgg_model.trainable = False

We also initialize the layers we want to consider for the transfer. To remove a layer, simply add a # just before the layer to comment it, thus ignoring it from the code. Generally, you would want to block out most of the content layers, but you can experiment with this yourself.

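content_layers = ['block1_conv2']

# Assumption: one conv layer per block, matching the keys of the
# style_weights dict defined later
style_layers = ['block1_conv1',
                'block2_conv1',
                'block3_conv1',
                'block4_conv1',
                'block5_conv1']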

num_content_layers = len(content_layers)
num_style_layers = len(style_layers)

We will now write a function to make the custom model, flexible to the selections of the above block.

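def custom_vgg_model(layer_names, model):
  # collect the outputs of the requested layers from the base model
  outputs = [model.get_layer(name).output for name in layer_names]
  # build on the model that was passed in, not on the global vgg_model
  return Model([model.input], outputs)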

We make use of a "Gram matrix" to delocalise the features from the style image. To build it, we take the feature map that a convolutional layer produces from the style image. Let's assume it is of size $m \times n \times f$, with $f$ feature filters. This is reshaped into an $f \times mn$ matrix, which is then multiplied with its transpose to yield a final matrix of size $f \times f$. Each entry measures how strongly two filters fire together across the whole image, regardless of where in the image that happens.

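def gram_matrix(tensor):
  # drop the batch dimension: (1, m, n, f) -> (m, n, f)
  temp = tf.squeeze(tensor, axis=0)
  # flatten the spatial positions to (m*n, f), then transpose to (f, m*n)
  features = tf.transpose(tf.reshape(temp, [-1, temp.shape[-1]]))
  # (f, m*n) x (m*n, f) -> (f, f)
  result = tf.matmul(features, features, transpose_b=True)
  gram = tf.expand_dims(result, axis=0)
  return gram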
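The delocalising effect can be checked on a toy example. The NumPy sketch below (gram_matrix_np is a hypothetical helper, not part of the article's model) builds the $f \times f$ matrix described above, then shuffles the spatial positions of the feature map. The Gram matrix stays the same, which is exactly why it captures style without capturing layout.

```python
import numpy as np

def gram_matrix_np(feature_map):
    """Gram matrix of an (m, n, f) feature map, returned with shape (f, f)."""
    m, n, f = feature_map.shape
    flat = feature_map.reshape(m * n, f).T   # (f, m*n): one row per filter
    return flat @ flat.T                     # (f, f)

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 5, 3))        # toy feature map: m=4, n=5, f=3

# Shuffle the spatial positions while keeping each position's f values together.
flat_positions = features.reshape(-1, 3)
shuffled = flat_positions[rng.permutation(len(flat_positions))].reshape(4, 5, 3)

gram = gram_matrix_np(features)
gram_shuffled = gram_matrix_np(shuffled)
print(gram.shape)                            # (3, 3)
print(np.allclose(gram, gram_shuffled))      # True
```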

Finally, since we want to make a model, we put all of this into a class. We write a standard constructor for the class, and a member function to re-scale the images, which will enable us to pull out the features from them.

class Style_Model(tf.keras.models.Model):
  def __init__(self, style_layers, content_layers):
    super(Style_Model, self).__init__()
    self.vgg =  custom_vgg_model(style_layers + content_layers, vgg_model)
    self.style_layers = style_layers
    self.content_layers = content_layers
    self.num_style_layers = len(style_layers)
    self.vgg.trainable = False

  def call(self, inputs):
    inputs = inputs*255.0
    preprocessed_input = preprocess_input(inputs)
    outputs = self.vgg(preprocessed_input)
    style_outputs, content_outputs = (outputs[:self.num_style_layers],
                                      outputs[self.num_style_layers:])
    style_outputs = [gram_matrix(style_output)
                     for style_output in style_outputs]

    content_dict = {content_name:value
                    for content_name, value
                    in zip(self.content_layers, content_outputs)}

    style_dict = {style_name:value
                  for style_name, value
                  in zip(self.style_layers, style_outputs)}

    return {'content':content_dict, 'style':style_dict}

We also add an instance of the class, and proceed to initialize the extraction functions.

extractor = Style_Model(style_layers, content_layers)
style_targets = extractor(style)['style']
content_targets = extractor(content)['content']

Building the Loss Function

If the image generated is very different from the content image, that's definitely not desirable. We describe this difference as the content loss. Let's assume $T$ is the target image generated and $C$ is the original content image.

$$ L_{content} = \frac{1}{2} \sum  (T-C)^2 $$

The content loss is relatively simple. We usually stick to only one layer, as the inclusion of too many layers confuses the network. It is different with the style image, where we only want to capture the stylistic details and not the shapes in the image. We take a blend of different layers to capture the stylistic strokes. The Gram matrix we made helps ensure that these strokes don't get tied to their position in the image. For this, we can take a layer-weighted average for the loss, with the weights being $w_i$, the style information (Gram matrix) of the generated image at layer $i$ being $T_i$, and that of the style image being $S_i$.

$$ L_{style} = \sum w_i (T_i - S_i)^2 $$
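Written out with the Gram matrices, the style term can be read as follows. Here $F^{i}$ is a symbol introduced only for this illustration: it stands for the $f \times mn$ reshaped feature map of layer $i$.

$$ G^{i}_{jk}(X) = \sum_{p=1}^{mn} F^{i}_{jp}(X) \, F^{i}_{kp}(X), \qquad T_i = G^{i}(\text{generated image}), \qquad S_i = G^{i}(\text{style image}) $$

$$ L_{style} = \sum_i w_i \sum_{j,k} \left( T_{i,jk} - S_{i,jk} \right)^2 $$

The code in this article takes a mean instead of the inner sum and divides by the number of style layers, which only changes the effective scale of each $w_i$.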

However, on running the model, we find another problem. There are a lot of style artifacts which get repeated across the image. These high frequency stylings can be reduced by introducing another regularization term called the variation loss. If we run an edge detector such as a Sobel filter over the styled image, these artifacts show up as edges. Adding a measure of that edge strength to the loss therefore reduces them by smoothing them out.
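The loss code further down uses tf.image.total_variation, which, as its documentation describes it, sums the absolute differences between neighbouring pixels rather than applying a Sobel filter. A minimal NumPy sketch of that idea (total_variation_np is a hypothetical helper for a single image) shows why it penalises high frequency artifacts: a smooth gradient scores low, a checkerboard scores high.

```python
import numpy as np

def total_variation_np(image):
    """Sum of absolute differences between neighbouring pixels of an (h, w, c) image."""
    vertical = np.abs(image[1:, :, :] - image[:-1, :, :]).sum()
    horizontal = np.abs(image[:, 1:, :] - image[:, :-1, :]).sum()
    return vertical + horizontal

# A smooth horizontal gradient versus a checkerboard of the same size.
smooth = np.tile(np.linspace(0.0, 1.0, 8)[None, :, None], (8, 1, 1))
rows, cols = np.indices((8, 8))
checker = ((rows + cols) % 2).astype(float)[:, :, None]

print(total_variation_np(smooth))   # small: neighbouring pixels barely differ
print(total_variation_np(checker))  # large: every neighbouring pair differs by 1
```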

We now define some weights for the various losses to control their influence. You can add another dict for weighting the content layer, but that might not exactly work well with the model. Again, you are free to try out such an implementation. You do not have to comment out bits here, as the weights will simply be ignored if the layer doesn't exist. The final equation for the total loss would look like this.

$$ L_{total} = a L_{content} + b L_{style} + c L_{variational} $$

style_weight = 2e-5
content_weight = 5e5
total_variation_weight = 10

style_weights = {'block1_conv1': 1,
                 'block2_conv1': 2,
                 'block3_conv1': 7,
                 'block4_conv1': 1,
                 'block5_conv1': 4}

The final custom loss function for this would be the code implementation of the math above.

def total_loss(outputs, image):
    style_outputs = outputs['style']
    content_outputs = outputs['content']
    style_loss = tf.add_n([style_weights[name] * tf.reduce_mean((style_outputs[name]-style_targets[name])**2)
                           for name in style_outputs.keys()])
    style_loss *= style_weight / num_style_layers

    content_loss = tf.add_n([tf.reduce_mean((content_outputs[name]-content_targets[name])**2)
                             for name in content_outputs.keys()])
    content_loss *= content_weight / num_content_layers
    variation_loss = total_variation_weight * tf.image.total_variation(image)
    loss = style_loss + content_loss + variation_loss
    return loss

The Style Transfer!

After getting the loss function sorted, we simply have to reduce the loss, as with all neural networks. We use the trusty Adam optimizer for the task. You can tune Adam's hyper parameters yourself, for an image. The abnormally high value of the epsilon seemed to work well for this one.

opt = tf.optimizers.Adam(learning_rate=0.02, beta_1=0.99, epsilon=1e-1)
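To get a feel for what the large epsilon does, we can compute the size of the very first Adam update by hand. The sketch below uses the textbook Adam formulas; the Keras implementation may arrange the terms slightly differently, and beta_2=0.999 is assumed as the default since the call above does not set it. With a tiny epsilon, every pixel moves by roughly the learning rate no matter how small its gradient is. With epsilon=1e-1, pixels with small gradients move far less, which damps noisy updates.

```python
import numpy as np

def adam_first_step(grad, learning_rate=0.02, beta_1=0.99, beta_2=0.999, epsilon=1e-1):
    """Size of the first textbook Adam update for a single gradient value."""
    m = (1 - beta_1) * grad            # first moment, starting from zero
    v = (1 - beta_2) * grad ** 2       # second moment, starting from zero
    m_hat = m / (1 - beta_1)           # bias correction at step t = 1
    v_hat = v / (1 - beta_2)
    return learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)

# Compare a large and a small gradient under a tiny and a large epsilon.
for grad in (1.0, 0.01):
    print(grad, adam_first_step(grad, epsilon=1e-7), adam_first_step(grad, epsilon=1e-1))
```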

Let's use this penciled image of a house as our style image and the same picture of the Rotunda as the content image.

Style Image - Pencil Shading

To train the model, we build another function for it. To prevent the function from being agonizingly slow, we also decorate it with tf.function, which compiles it into a graph. After each update we clip the image to force its values to remain between 0 and 1, in order to prevent it from getting washed out or darkened. GradientTape is the wonderful overseer provided by TensorFlow. It perches above the running of the model and records all of its mathematical operations to enable automatic differentiation, which is vital for calculating the updates.

@tf.function()
def train_step(image):
  with tf.GradientTape() as tape:
    outputs = extractor(image)
    loss = total_loss(outputs, image)

  # differentiate the loss with respect to the image pixels
  grad = tape.gradient(loss, image)
  opt.apply_gradients([(grad, image)])
  # keep the pixel values in the valid [0, 1] range
  image.assign(tf.clip_by_value(image, clip_value_min=0.0, clip_value_max=1.0))
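The part that tends to surprise people is that the network's weights never change here: the image itself is the variable being optimised. A stripped-down NumPy sketch of the same loop is shown below, with loud simplifications: three "pixels" stand in for the image, a squared error towards a made-up target stands in for the VGG-based loss, and plain gradient descent stands in for Adam.

```python
import numpy as np

# Hypothetical stand-ins: three "pixels" and a target that the loss pulls them towards.
target = np.array([0.2, 0.9, 1.3])   # the last value lies outside the valid range
image = np.zeros(3)                  # the "image" is the variable being optimised
learning_rate = 0.1

for _ in range(200):
    grad = 2.0 * (image - target)            # gradient of sum((image - target)**2)
    image = image - learning_rate * grad     # plain gradient descent step
    image = np.clip(image, 0.0, 1.0)         # same clipping idea as in train_step

print(np.round(image, 3))
```

The first two pixels settle on their targets, while the third is held at 1.0 by the clipping, since the loss keeps pushing it towards a value that is not a valid pixel intensity.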

Finally, we let the training happen, and look at the result. The variables epochs and steps_per_epoch can be modified according to the needs of the user.

target_image = tf.Variable(content)
epochs = 4
steps_per_epoch = 50

step = 0
outputs = []

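for n in range(epochs):
  tic = time.time()
  for m in range(steps_per_epoch):
    step += 1
    train_step(target_image)
  toc = time.time()
  print("Epoch " + str(n+1) + " took " + str(toc-tic) + " sec(s)")
  # keep a snapshot after every epoch to view the transition later
  outputs.append(tensor_to_image(target_image))

Generated Image - Take 1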

Shoot. The generated image is absolutely terrible. To improve it, we will need to tweak the hyper parameters for the model. Let's take a look at the hyper parameters we have: the various loss weights, and the choice of the layers to consider. Changing the style layers might do something creative, with the first layers targeting low level arbitrary features, while the last layers focus on intense, major features of the style image. Increasing the weight of the style loss or the content loss increases the influence of that component on the image, at the expense of the other. Too much tweaking would however confuse the network, resulting in components of the style image getting imposed on the content image.

Generated Image - Take 69

Finally, after delicate adjustment to find that narrow range where all goes well, we settle on the third and fifth convolutional layers as the heaviest weighted ones. We also give the content around $10^{10}$ times more weight than the style, logarithmically symmetrical around 1. At the end we get a sort of color pencil-ized version of the original image.

Generated Image - Take 420

Finally, in case you get too tired with the hyper parameter search, fear not, for the good folks at TensorFlow maintain a module called tensorflow_hub. This module has many reusable models, one of which we used for style transfer.

import tensorflow_hub as hub
hub_module = hub.load('')
stylized_image = hub_module(tf.constant(content), tf.constant(style))[0]
img = tensor_to_image(stylized_image)

Stylized BITS Pilani Rotunda, by tensorflow_hub

The transition of the image along the style transfer, whose code eagle-eyed observers would have spotted in the training loop, is available with the full code in one place here.

The original paper for neural style transfer by Gatys et al.: A Neural Algorithm of Artistic Style

A paper by the Tencent AI Lab for real time style transfer from video input: Real Time Neural Style Transfer for Videos

A paper by Li et al. with the more mathematical aspects, to help in choosing the right hyper parameters for your model

A neural style transfer tutorial by TensorFlow

The documentation of the tensorflow_hub implementation of fast style transfer