A Neural Algorithm of Artistic Style - L. A. Gatys

This is one of the first papers on style transfer.

Paper: https://arxiv.org/pdf/1508.06576.pdf

Summary:

This paper shows how to transfer the style of one image onto another image while preserving the content of the latter. A photograph is chosen as the content image, a famous painting is chosen as the style image, and the style of the painting is then applied to the content photograph.

Key Points:

  • When a CNN is trained for object recognition, the input image is transformed, as we go deeper into the network, into representations that increasingly capture the actual content of the image rather than its detailed pixel values.

  • Higher layers in the network capture the high-level content in terms of objects and their arrangement in the input image.

  • The paper uses back-propagation to update the input image instead of the network weights: the loss is differentiated with respect to the input image rather than the weights, and the image is changed to minimise the loss (a minimal sketch follows this list).

  • Max pooling is replaced by average pooling, which gives better gradient flow and produces slightly smoother results.
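
A minimal PyTorch sketch of optimising the image rather than the weights. Everything here is illustrative: the stand-in loss below is replaced by the content and style losses described under Method.

```python
import torch

# Stand-ins for real, preprocessed content and style images
# (any 1x3xHxW tensors work for this sketch).
content_img = torch.rand(1, 3, 224, 224)
style_img = torch.rand(1, 3, 224, 224)

# The generated image itself is the "parameter" being optimised:
# requires_grad=True makes backprop produce d(loss)/d(pixels).
x = torch.rand(1, 3, 224, 224, requires_grad=True)

optimizer = torch.optim.Adam([x], lr=0.05)  # the network weights are never updated

for step in range(200):
    optimizer.zero_grad()
    # Stand-in loss; the real method uses alpha*L_content + beta*L_style
    # computed from VGG feature maps (see Method below).
    loss = (x - content_img).pow(2).mean() + (x - style_img).pow(2).mean()
    loss.backward()   # gradients flow into the image pixels, not into weights
    optimizer.step()
```
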

Method:

Figure 1: Convolutional Neural Network (CNN). A given input image is represented as a set of filtered images at each processing stage in the CNN. While the number of different filters increases along the processing hierarchy, the size of the filtered images is reduced by some downsampling mechanism (e.g. max-pooling), leading to a decrease in the total number of units per layer of the network. Content Reconstructions: the information at different processing stages in the CNN can be visualised by reconstructing the input image from only knowing the network's responses in a particular layer. The input image is reconstructed from layers 'conv1_1' (a), 'conv2_1' (b), 'conv3_1' (c), 'conv4_1' (d) and 'conv5_1' (e) of the original VGG network. Reconstruction from lower layers is almost perfect (a,b,c); in higher layers of the network, detailed pixel information is lost while the high-level content of the image is preserved (d,e). Style Reconstructions: on top of the original CNN representations a new feature space is built that captures the style of an input image. The style representation computes correlations between the different features in different layers of the CNN. The style of the input image is reconstructed from style representations built on increasing subsets of CNN layers ('conv1_1' (a); 'conv1_1' and 'conv2_1' (b); 'conv1_1' through 'conv3_1' (c); 'conv1_1' through 'conv4_1' (d); 'conv1_1' through 'conv5_1' (e)). This creates images that match the style of the given image on an increasing scale while discarding information about the global arrangement of the scene.

The network used is VGG19: only its 16 convolutional and 5 pooling layers are used; the fully connected layers are discarded.
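
A sketch of how such a truncated VGG19 feature extractor could be set up with torchvision, including the max-to-average pooling swap mentioned above; the `extract` helper and the layer indexing are assumptions for illustration, not part of the paper.

```python
import torch.nn as nn
from torchvision.models import vgg19

# Only the convolutional part of VGG19 is needed; `.features` already excludes
# the fully connected classifier layers.
# (Newer torchvision versions use weights=... instead of pretrained=True.)
features = vgg19(pretrained=True).features.eval()

# Replace every max-pooling layer with average pooling, as the paper suggests.
for i, layer in enumerate(features):
    if isinstance(layer, nn.MaxPool2d):
        features[i] = nn.AvgPool2d(kernel_size=2, stride=2)

# Freeze the weights: only the input image will be optimised.
for p in features.parameters():
    p.requires_grad_(False)

def extract(image, layer_indices):
    """Return the feature maps of `image` at the requested layer indices."""
    outputs, h = {}, image
    for i, layer in enumerate(features):
        h = layer(h)
        if i in layer_indices:
            outputs[i] = h
    return outputs
```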

Let $\vec{p}$ be the photograph and $\vec{a}$ the artwork. The loss that is minimised is

$$L_{total}(\vec{p},\vec{a},\vec{x}) = \alpha L_{content}(\vec{p},\vec{x}) + \beta L_{style}(\vec{a},\vec{x})$$

Let $\vec{p}$ and $\vec{x}$ be the original content image and the generated image, and $P^l$ and $F^l$ their respective feature representations in layer $l$. The squared-error loss between the two feature representations is

$$L_{content}(\vec{p},\vec{x},l) = \frac{1}{2}\sum_{i,j}\left(F_{ij}^{l} - P_{ij}^{l}\right)^{2}$$

from which the gradient with respect to the image $\vec{x}$ can be computed using standard error back-propagation. The initially random image $\vec{x}$ can therefore be changed until it generates the same response in a certain layer of the CNN as the original image $\vec{p}$.
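
The content loss translates almost directly into code. A sketch, assuming the `extract` helper from the previous snippet; index 21 is assumed to correspond to 'conv4_2' in torchvision's VGG19 layer ordering, the layer the paper uses for the content representation.

```python
def content_loss(F_l, P_l):
    """L_content(p, x, l) = 1/2 * sum_{i,j} (F_ij^l - P_ij^l)^2

    F_l: layer-l feature maps of the generated image x
    P_l: layer-l feature maps of the content photograph p
    """
    return 0.5 * (F_l - P_l).pow(2).sum()

# Usage sketch: P_l is computed once from the photograph and detached,
# while F_l is recomputed from the current generated image at every step.
# content_layer = 21                                    # 'conv4_2' (assumed index)
# P_l = extract(photo, {content_layer})[content_layer].detach()
# F_l = extract(x, {content_layer})[content_layer]
# loss = content_loss(F_l, P_l)
```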

On top of the CNN responses in each layer of the network, a style representation is built that computes the correlations between the different filter responses, where the expectation is taken over the spatial extent of the input image. These feature correlations are given by the Gram matrix $G^l \in \mathbb{R}^{N_l \times N_l}$, where $G_{ij}^l$ is the inner product between the vectorised feature maps $i$ and $j$ in layer $l$:

$$G_{ij}^{l} = \sum_{k} F_{ik}^{l} F_{jk}^{l}$$

To generate a texture that matches the style of a given image (Fig 1, style reconstructions), gradient descent from a white-noise image is used to find another image that matches the style representation of the original image. This is done by minimising the mean-squared distance between the entries of the Gram matrix of the original image and the Gram matrix of the image to be generated. Let $\vec{a}$ and $\vec{x}$ be the original image and the generated image, and $A^l$ and $G^l$ their respective style representations in layer $l$. The contribution of layer $l$ to the total loss is

$$E_l = \frac{1}{4 N_l^{2} M_l^{2}} \sum_{i,j}\left(G_{ij}^{l} - A_{ij}^{l}\right)^{2}$$

and the total style loss is

$$L_{style}(\vec{a},\vec{x}) = \sum_{l=0}^{L} w_l E_l$$

where $w_l$ are weighting factors for the contribution of each layer to the total loss.
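
The same definitions as a code sketch; the choice of style layers (conv1_1 through conv5_1, with torchvision indices assumed) and the equal weights $w_l$ follow the setup described in the paper, while the helper names are hypothetical.

```python
def gram_matrix(feat):
    """G_ij^l = sum_k F_ik^l F_jk^l for one layer's feature maps.

    feat: tensor of shape (1, N_l, H, W), so N_l filters and M_l = H*W positions.
    """
    _, n, h, w = feat.shape
    f = feat.view(n, h * w)   # vectorise each feature map
    return f @ f.t()          # (N_l, N_l) matrix of filter correlations

def style_loss(gen_feats, style_feats, weights):
    """L_style = sum_l w_l * E_l, with E_l = 1/(4 N_l^2 M_l^2) * sum_ij (G_ij^l - A_ij^l)^2."""
    loss = 0.0
    for g, a, w in zip(gen_feats, style_feats, weights):
        n, m = g.shape[1], g.shape[2] * g.shape[3]
        G, A = gram_matrix(g), gram_matrix(a)
        loss = loss + w * (G - A).pow(2).sum() / (4 * n ** 2 * m ** 2)
    return loss

# Usage sketch (assumed torchvision indices for conv1_1 ... conv5_1):
# style_layers = [0, 5, 10, 19, 28]
# A = [extract(artwork, set(style_layers))[i].detach() for i in style_layers]
# G = [extract(x, set(style_layers))[i] for i in style_layers]
# L_style = style_loss(G, A, weights=[1 / len(style_layers)] * len(style_layers))
# L_total = alpha * content_loss(F_l, P_l) + beta * L_style
```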

