Robust Statistical Frontalization of Human and Animal Faces

The unconstrained acquisition of facial data in real-world conditions may result in face images with significant pose variations, illumination changes, and occlusions, affecting the performance of facial landmark localization and recognition methods. In this paper, a novel method, robust to pose, illumination variations, and occlusions, is proposed for joint face frontalization and landmark localization. Unlike state-of-the-art methods for landmark localization and pose correction, which require a large amount of manually annotated images or 3D facial models, the proposed method relies on a small set of frontal images only. By observing that the frontal facial image of both humans and animals is the one having the minimum rank among all poses, a model is devised which jointly recovers the frontalized version of the face as well as the facial landmarks. To this end, a suitable optimization problem is solved, concerning minimization of the nuclear norm (the convex surrogate of the rank function) and of the matrix $\ell_1$ norm, which accounts for occlusions. The proposed method is assessed in frontal view reconstruction of human and animal faces, landmark localization, pose-invariant face recognition, face verification in unconstrained conditions, and video inpainting, by conducting experiments on 9 databases. The experimental results demonstrate the effectiveness of the proposed method in comparison to state-of-the-art methods for the target problems.


Introduction
Face frontalization refers to the recovery of the frontal view of faces from images captured in unconstrained conditions. Accurate face frontalization is a cornerstone for many face analysis problems. For example, it has recently been shown that well-designed face frontalization can help in achieving state-of-the-art performance in face recognition in unconstrained conditions (Taigman et al. 2014; Hassner et al. 2015).1 An essential step towards face frontalization is facial landmark localization. State-of-the-art landmark localization methods (Tzimiropoulos et al. 2013; Saragih et al. 2011; Asthana et al. 2013; Xiong and De la Torre 2013; Ren et al. 2014; Kazemi and Sullivan 2014) model the problem discriminatively by capitalizing on the availability of annotated data, in terms of facial landmarks (Sagonas et al. 2013b, a). Unfortunately, the annotation of facial landmarks is a laborious, expensive, and time-consuming process. This is even more the case for faces that are not in frontal pose. In many cases, even accurate 2D landmark localization is not enough for successful face frontalization. That is, the frontalization step often requires both landmark localization and pose correction, usually by resorting to 3D face models (Taigman et al. 2014; Yi et al. 2013; Sun et al. 2014; Hassner et al. 2015). In general, 3D model-based methods employ a dense 3D surface model in order to compute the 3D face shape, as well as the pose of the face depicted in an image. Then, the recovered shape is used to synthesize the frontal view of the face.

1 Some recent works based on deep learning have shown that it may not be necessary to perform face frontalization in order to achieve state-of-the-art performance (Schroff et al. 2015). Nevertheless, we believe that even in these cases face frontalization is beneficial and could lead to further performance improvement.
However, such methods cannot be widely applied, since they require: (a) a method for accurate landmark localization in various poses, (b) fitting a learned generic 3D face model, which is expensive to build, and (c) a robust image warping algorithm for frontal view reconstruction (Taigman et al. 2014). As an alternative to this process, the authors of (Hassner et al. 2015) propose to avoid 3D face model fitting by employing a single 3D reference mesh.
In contrast to the 3D model-based methods, the patch-based methods approximate 3D pose transformations as a set of linear transformations of 2D image patches. For instance, the Lucas-Kanade algorithm is employed to align patches of non-frontal faces to the corresponding ones in frontal facial images (Ashraf et al. 2008). In (Chai et al. 2007; Li et al. 2012a), face frontalization is obtained via locally linear regression of patches, while (Ho and Chellappa 2013) employs a Markov Random Field (MRF). The main drawback of the latter is that for each non-frontal image, an exhaustive batch-based alignment algorithm (trained on frontal patches) is applied, resulting in a time-consuming procedure. In addition, the semantic correspondence between the non-frontal (test) and frontal (train) patches can be lost when significant pose variations occur. It is worth mentioning that the patch-based methods are not able to adequately handle the local non-linear deformations which appear within a patch.
Furthermore, pose normalization is beneficial for fine-grained categorization (i.e., subcategory recognition) in different classes of objects, e.g., cats and dogs (Parkhi et al. 2012; Liu et al. 2012), flowers (Angelova et al. 2013; Nilsback and Zisserman 2006), birds (Gavves et al. 2013), and cars (Lin et al. 2014). Current state-of-the-art methods (Branson et al. 2014; Zhang et al. 2014) rely upon the use of 2D annotations in order to build convolutional neural networks for pose-normalized representations of objects. However, these methods cannot be applied widely across different objects, since object-specific annotations are required. Clearly, such a procedure is cost-prohibitive. On the other hand, the use of 3D models is limited to 3D CAD car models, while 3D models of other arbitrary objects such as cats, dogs, and rabbits are either limited or do not exist at all, and are in general expensive to acquire.
In this paper, we propose a unified method for joint face frontalization (pose correction) and landmark localization, using a small set of frontal images only. The key motivational observation is that, for facial images lying in a linear symmetric space, the rank of a frontal facial image is much smaller than the rank of facial images in other poses. To demonstrate the above observation, 'Neutral' images of twenty subjects from the Multi-PIE database (Gross et al. 2010) under poses −45° to 45° were warped into a reference frontal-pose frame, and the nuclear norm (convex surrogate of the rank) of each shape-free texture was computed. In Fig. 1a the average value of the nuclear norm for different poses is reported. Clearly, the frontal pose has the smallest nuclear norm value compared to the corresponding values computed for other poses. Furthermore, the above observation was verified in the case where the faces are not warped into a reference frame. To this end, the images used in the previous experiment were aligned based on the outer corners of their eyes. Then, using the landmark points of each aligned face, we found the corresponding face convex hull and set to zero all the pixels that do not belong to it. Subsequently, the same bounding box was used in order to crop the face area in each image. In Fig. 1b the average value of the nuclear norm computed from the cropped images for different poses is reported. As can be observed, the frontal pose again has the smallest nuclear norm value compared to the corresponding values computed for other poses. However, severe deviations from the above linear facial model occur in the presence of pose, occlusions, expressions, and illumination changes.
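The rank observation can be illustrated numerically. The following sketch uses synthetic matrices standing in for shape-free textures (not actual face data): at equal Frobenius energy, the lower-rank matrix has the smaller nuclear norm, which is the quantity Fig. 1 measures across poses.

```python
import numpy as np

def nuclear_norm(X):
    """Sum of singular values of X -- the convex surrogate of rank(X)."""
    return float(np.linalg.svd(X, compute_uv=False).sum())

# Rank-1 matrix: all the energy sits in a single singular value,
# so the nuclear norm coincides with the Frobenius norm.
u = np.array([[1.0], [2.0], [2.0], [0.0]])
A = u @ u.T / np.linalg.norm(u)   # rank 1, Frobenius norm = 3
print(round(nuclear_norm(A), 6))  # -> 3.0

# Full-rank matrix with the SAME Frobenius norm: the energy is spread
# over four singular values, so the nuclear norm is strictly larger.
B = 1.5 * np.eye(4)               # Frobenius norm = 3, nuclear norm = 6
print(round(nuclear_norm(B), 6))  # -> 6.0
```

The frontal (more redundant, lower-rank) texture plays the role of `A` here, and a posed texture of equal energy plays the role of `B`.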
The proposed method: (a) approximately removes deformations due to pose and expressions by exploiting a motion model, (b) models occlusions/specular highlights and warping errors as noise (that is sparser than the actual signal), and (c) handles illumination variations by employing in-the-wild frontal facial images, by solving a suitable optimization problem involving the minimization of the nuclear norm and the matrix $\ell_1$ norm. The flowchart of the proposed method (coined RSF: Robust Statistical Face Frontalization) is depicted in Fig. 2.
The most closely related work to the RSF is the Transform Invariant Low-rank Textures (TILT) (Zhang et al. 2012), where texture rectification is obtained by applying a global affine transformation onto a low-rank term modelling the texture. By blindly imposing low-rank constraints without regularization, opposite effects may occur in non-rigid alignment. As recently demonstrated (Cheng et al. 2013a; Sagonas et al. 2014; Cheng et al. 2013b), non-rigid deformable models cannot be straightforwardly combined with optimization problems (Peng et al. 2012) that involve low-rank terms without proper regularization. To overcome the aforementioned problems, a model of frontal images is employed in this work.
The contributions of the paper are summarized as follows:

- A novel method, i.e., the RSF, for joint landmark localization and face frontalization is proposed, which adequately models pose, occlusions, expressions, and illumination variations using a statistical model of frontal images, low-rank structure, and sparsity. Furthermore, the RSF is extended to F-RSF for handling multi-channel image representations (i.e., features such as SIFT, IGO, HoG, etc.) and to RSF-V for joint frontalization and alignment in a batch of images or a video.
- The performance of the RSF is assessed by conducting extensive experiments using human faces, cat faces, and face sketches from 9 databases. The effectiveness of the RSF-V is demonstrated in video-based face verification and video inpainting.
- We demonstrate, for the first time, that it is possible to improve the state-of-the-art results in generic landmark localization, pose-invariant face recognition, and unconstrained image and video face verification by using a model of frontal images only. This finding is surprising, since it implies that when the relevant phenomena are properly modelled, simple statistical linear models suffice.
The proposed methodology can aid the design of applications in two ways: (a) in case many annotated data are available, it can largely boost the performance of learning-based recognition methods (as frontalization achieves in (Taigman et al. 2014)), and (b) it can aid in achieving state-of-the-art (or competitive) results in challenging settings where there is still a lack of data [e.g., the restricted protocols of LFW (Huang et al. 2007)] or in cases in which annotated data are expensive to acquire (e.g., landmark localisation). The remainder of the paper is organized as follows. In Sect. 2, basic notations and definitions are introduced. The RSF and the F-RSF and RSF-V methods are detailed in Sects. 3 and 4, respectively. In Sect. 5, the experimental results are presented. Section 6 concludes the paper.

Notations and Preliminaries
Throughout the paper, scalars are denoted by lower-case letters, and vectors (matrices) are denoted by lower-case (upper-case) boldface letters, i.e., $\mathbf{x}$ ($\mathbf{X}$). $\mathbf{I}$ denotes the identity matrix. The $i$th column of $\mathbf{X}$ is denoted by $\mathbf{x}_i$. A vector $\mathbf{x} \in \mathbb{R}^{m \cdot n}$ (matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$) is reshaped into a matrix (vector) via the reshape operator $\mathcal{R}_{m \times n}(\cdot)$ ($\operatorname{vec}(\cdot)$). $\operatorname{rank}(\mathbf{X})$ is the rank of a matrix $\mathbf{X}$ (i.e., the maximum number of linearly independent rows or columns of $\mathbf{X}$). The $\ell_1$ and $\ell_2$ norms of $\mathbf{x}$ are defined as $\|\mathbf{x}\|_1 = \sum_i |x_i|$ and $\|\mathbf{x}\|_2 = \sqrt{\sum_i x_i^2}$, respectively. The matrix $\ell_1$ norm is defined as $\|\mathbf{X}\|_1 = \sum_i \sum_j |x_{ij}|$, where $|\cdot|$ denotes the absolute value. The Frobenius norm is defined as $\|\mathbf{X}\|_F = \sqrt{\sum_i \sum_j x_{ij}^2}$, and the nuclear norm of $\mathbf{X}$ (i.e., the sum of its singular values) is denoted by $\|\mathbf{X}\|_*$. $\mathbf{X}^T$ is the transpose of $\mathbf{X}$. If $\mathbf{X}$ is a square matrix, $\mathbf{X}^{-1}$ is its inverse, provided that it exists. The $i$th vector of the standard basis in $\mathbb{R}^{m \cdot n}$ is denoted by $\mathbf{q}^{(i)}$.

A shape instance consisting of $N$ landmark points is denoted by $\mathbf{s} = [x^{(1)}, y^{(1)}, \ldots, x^{(N)}, y^{(N)}]^T$. A small set of shape instances $\{\mathbf{s}_i\}$ is used to learn a point distribution model (PDM). First, all the shapes are put into correspondence by removing the global similarity transforms via Generalized Procrustes Analysis. Then, principal component analysis (PCA) is applied on the aligned shapes, resulting in $N_S$ eigen-shapes $\mathbf{U}_S$ and the mean shape $\bar{\mathbf{s}}$. Given a PDM $\mathcal{S} = \{\bar{\mathbf{s}}, \mathbf{U}_S \in \mathbb{R}^{2N \times N_S}\}$, a new instance is generated as $\mathbf{s} = \bar{\mathbf{s}} + \mathbf{U}_S \mathbf{p}$, where $\mathbf{p}$ is the $N_S \times 1$ vector of shape parameters.
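As an illustrative sketch (toy dimensions and a random orthonormal basis, not a trained PDM), generating a shape instance $\mathbf{s} = \bar{\mathbf{s}} + \mathbf{U}_S \mathbf{p}$ and recovering the parameters by projection may be written as:

```python
import numpy as np

def generate_shape(mean_shape, U_S, p):
    """PDM shape instance: s = s_bar + U_S p (cf. Sect. 2)."""
    return mean_shape + U_S @ p

# Toy example: N = 3 landmarks (vectors of length 2N = 6), N_S = 2 eigen-shapes.
s_bar = np.zeros(6)
U_S = np.linalg.qr(np.random.default_rng(0).standard_normal((6, 2)))[0]  # orthonormal columns
p = np.array([0.5, -1.0])

s = generate_shape(s_bar, U_S, p)
# Orthonormality of U_S lets us recover the parameters by projection:
# p = U_S^T (s - s_bar).
p_rec = U_S.T @ (s - s_bar)
print(np.allclose(p_rec, p))  # -> True
```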
The warp function $\mathbf{X}(\mathcal{W}(\mathbf{z}; \mathbf{p}))$ denotes the warping of each 2D point $\mathbf{z} = [x, y]^T$ within a shape instance to its corresponding location in a reference frame. To simplify the notation, $\mathbf{X}(\mathbf{p})$ will be used throughout the paper instead of $\mathbf{X}(\mathcal{W}(\mathbf{z}; \mathbf{p}))$, and $\mathbf{x}(\mathbf{p})$ for its vectorized counterpart. Finally, the reference frame is defined when $\mathbf{p} = \mathbf{0}$, such that $\mathbf{X}(\mathbf{0}) = \mathbf{X}$.

Problem Statement
Let $\mathbf{X} \in \mathbb{R}^{h \times r}$ be an image depicting a non-frontal view of a face and $\mathbf{s} \in \mathbb{R}^{2N \times 1}$ an initial estimate of the $N$ landmark points describing the shape. To create a shape-free texture, the input image is warped into a frontal-pose reference frame by employing a warp function $\mathcal{W}(\cdot)$. In many cases, the warped image $\mathbf{X}(\mathbf{p}) \in \mathbb{R}^{m \times n}$ can be corrupted by sparse errors of large magnitude. Such sparse errors indicate that only a small fraction of the image pixels may be corrupted by non-Gaussian noise and occlusions. In this paper, the goal is to recover the clean frontal view (i.e., a low-rank image $\mathbf{L} \in \mathbb{R}^{m \times n}$) of $\mathbf{X}(\mathbf{p})$ such that $\mathbf{X}(\mathbf{p}) = \mathbf{L} + \mathbf{E}$, where $\mathbf{E} \in \mathbb{R}^{m \times n}$ is a sparse error matrix accounting for gross errors. This formulation leads to the following optimization problem:

$$\min_{\mathbf{L}, \mathbf{E}, \mathbf{p}} \ \operatorname{rank}(\mathbf{L}) + \lambda \|\mathbf{E}\|_0 \quad \text{s.t.} \quad \mathbf{X}(\mathbf{p}) = \mathbf{L} + \mathbf{E}. \quad (1)$$

In (Zhang et al. 2012), TILT relaxes the above non-convex problem into a convex one (Candès et al. 2011) and subsequently solves the relaxed problem efficiently in an alternating fashion (Bertsekas 1982). However, minimizing the non-regularized rank of the image ensemble tends to unnaturally deform the subject's facial appearance, resulting in false face alignment (Sagonas et al. 2014; Cheng et al. 2013a). Figure 3a, b show the initial position of the landmarks used as initialization (Zhu and Ramanan 2012) and the corresponding result obtained by TILT, respectively. As can be seen, the result is very poor, which is expected due to the lack of regularization in the rank constraint. To address this problem and ensure that unnatural faces are not created, a statistical model built from frontal images is utilized. In particular, based on the observation that the frontal view of a face lies in a low-rank subspace (cf. Fig. 1), it can be expressed as a linear combination of a small number of precomputed orthonormal bases (i.e., $\mathbf{U} = [\mathbf{u}_1 | \mathbf{u}_2 | \cdots | \mathbf{u}_k] \in \mathbb{R}^{m \cdot n \times k}$, $\mathbf{U}^T\mathbf{U} = \mathbf{I}$) that span a generic (clean) frontal view subspace, that is, $\mathbf{L} = \sum_{i=1}^{k} \mathcal{R}_{m \times n}(\mathbf{u}_i) c_i$.
Therefore, the deformed corrupted input image is written as $\mathbf{X}(\mathbf{p}) = \sum_{i=1}^{k} \mathcal{R}_{m \times n}(\mathbf{u}_i) c_i + \mathbf{E}$. To match the specifications of the frontal image and the sparse error, one can find the low-rank frontal image, the linear combination coefficients, the increments of the warp parameters, and the error matrix by solving the following optimization problem:

$$\min_{\mathbf{L}, \mathbf{c}, \Delta\mathbf{p}, \mathbf{E}} \ \operatorname{rank}(\mathbf{L}) + \lambda \|\mathbf{E}\|_0 \quad \text{s.t.} \quad \mathbf{X}(\mathbf{p} + \Delta\mathbf{p}) = \mathbf{L} + \mathbf{E}, \ \ \mathbf{L} = \sum_{i=1}^{k} \mathcal{R}_{m \times n}(\mathbf{u}_i) c_i, \quad (2)$$

where $\lambda$ is a positive weighting parameter that balances the rank of $\mathbf{L}$ and the sparsity of $\mathbf{E}$. Problem (2) is difficult to solve since: (a) both the rank function and the $\ell_0$ norm are non-convex, discrete-valued functions whose minimization is NP-hard (Natarajan 1995; Vandenberghe and Boyd 1996), and (b) the constraint $\mathbf{X}(\mathbf{p} + \Delta\mathbf{p}) = \mathbf{L} + \mathbf{E}$ is non-linear.
To alleviate this problem, the nuclear and $\ell_1$ norms are adopted as convex surrogates of the rank function and the $\ell_0$ norm, respectively (Fazel 2002; Donoho 2006). To address the non-linearity of the above-mentioned equality constraint, a first-order Taylor approximation is applied to its vectorized form: $\mathbf{x}(\mathbf{p} + \Delta\mathbf{p}) \approx \mathbf{x}(\mathbf{p}) + \mathbf{J}(\mathbf{p})\Delta\mathbf{p}$, where $\operatorname{vec}(\mathbf{L} + \mathbf{E}) = \mathbf{U}\mathbf{c} + \mathbf{e}$ and $\mathbf{J}(\mathbf{p}) = \nabla\mathbf{x}(\mathbf{p}) \frac{\partial \mathcal{W}}{\partial \mathbf{p}}$ is the Jacobian matrix with the steepest descent images as its columns. Consequently, the RSF solves the following optimization problem:

$$\min_{\mathbf{L}, \mathbf{c}, \Delta\mathbf{p}, \mathbf{e}} \ \|\mathbf{L}\|_* + \lambda \|\mathbf{e}\|_1 \quad \text{s.t.} \quad \mathbf{x}(\mathbf{p}) + \mathbf{J}(\mathbf{p})\Delta\mathbf{p} = \mathbf{U}\mathbf{c} + \mathbf{e}, \ \ \mathbf{L} = \sum_{i=1}^{k} \mathcal{R}_{m \times n}(\mathbf{u}_i) c_i. \quad (3)$$
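To illustrate the Taylor linearization on a toy problem (a synthetic one-parameter warp of our own construction, not the actual piecewise-affine warp used by the RSF), a Gauss-Newton style iteration based on $\mathbf{x}(\mathbf{p} + \Delta\mathbf{p}) \approx \mathbf{x}(\mathbf{p}) + \mathbf{J}(\mathbf{p})\Delta\mathbf{p}$ might look as follows:

```python
import numpy as np

# Toy "warped texture": a sampled sinusoid shifted by the warp parameter p.
t = np.linspace(0.0, 2 * np.pi, 200)

def x(p):
    return np.sin(t - p)

def jacobian(p):
    # d x(p) / d p, computed analytically for this toy warp.
    return -np.cos(t - p)[:, None]

p, p_true = 0.0, 0.3
target = x(p_true)
for _ in range(5):
    # Linearized step: solve the least-squares problem J dp = target - x(p),
    # then accumulate the increment, as in the outer loop of the method.
    J = jacobian(p)
    dp = np.linalg.lstsq(J, target - x(p), rcond=None)[0]
    p += float(dp[0])
print(abs(p - p_true) < 1e-6)  # -> True
```

Each outer iteration of the RSF plays the same game in many more dimensions: relinearize the warp at the current `p`, solve for the increment, and re-warp.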

Alternating-Direction Based-Method Algorithm
To solve (3), the augmented Lagrangian (Bertsekas 1982) is introduced:

$$\mathcal{L}(\mathbf{L}, \mathbf{c}, \Delta\mathbf{p}, \mathbf{e}, \mathcal{M}) = \|\mathbf{L}\|_* + \lambda\|\mathbf{e}\|_1 + \mathbf{a}^T \mathbf{H}^{(1)}(\Delta\mathbf{p}, \mathbf{c}, \mathbf{e}) + \operatorname{tr}\!\big(\mathbf{B}^T \mathbf{H}^{(2)}(\mathbf{L}, \mathbf{c})\big) + \frac{\mu}{2}\Big(\|\mathbf{H}^{(1)}(\Delta\mathbf{p}, \mathbf{c}, \mathbf{e})\|_2^2 + \|\mathbf{H}^{(2)}(\mathbf{L}, \mathbf{c})\|_F^2\Big), \quad (4)$$

where $\mathbf{H}^{(1)}(\Delta\mathbf{p}, \mathbf{c}, \mathbf{e}) = \mathbf{x}(\mathbf{p}) + \mathbf{J}(\mathbf{p})\Delta\mathbf{p} - \mathbf{U}\mathbf{c} - \mathbf{e}$ and $\mathbf{H}^{(2)}(\mathbf{L}, \mathbf{c}) = \mathbf{L} - \sum_{i=1}^{k}\mathcal{R}_{m\times n}(\mathbf{u}_i)c_i$ are the residuals of the two equality constraints, $\mathcal{M} = \{\mathbf{a} \in \mathbb{R}^{m \cdot n}, \mathbf{B} \in \mathbb{R}^{m \times n}\}$ are the Lagrange multipliers for the equality constraints in (3), and $\mu > 0$ is a penalty parameter. Equivalently, by completing the squares, (4) can be rewritten as (5). By employing the alternating direction method of multipliers (ADMM) (Bertsekas 1982), (3) is solved by minimizing (4) with respect to each variable in an alternating fashion. Finally, the Lagrange multipliers are updated at each iteration.
Let $t$ be the iteration index. For notational convenience, we write $\mathcal{L}(\mathbf{L}^{[t]})$ when all the variables except $\mathbf{L}^{[t]}$ are kept fixed, and similarly for the other variables. Accordingly, given $\mathbf{L}^{[t]}$, $\mathbf{c}^{[t]}$, $\Delta\mathbf{p}^{[t]}$, $\mathbf{e}^{[t]}$, $\mathcal{M}^{[t]}$ and $\mu^{[t]}$, each iteration minimizes $\mathcal{L}(\mathbf{L}^{[t]})$, $\mathcal{L}(\mathbf{c}^{[t]})$, $\mathcal{L}(\Delta\mathbf{p}^{[t]})$, and $\mathcal{L}(\mathbf{e}^{[t]})$ in turn.
Algorithm 1: Solving (4) by the ADMM method.
Data: Test image $\mathbf{X}$, initial shape parameters $\mathbf{p}$, clean frontal-view face subspace $\mathbf{U}$, and the parameter $\lambda$.
Result: The low-rank clean frontal image $\mathbf{L}$, the sparse error $\mathbf{e}$, the coefficient vector $\mathbf{c}$, and the shape parameters $\mathbf{p}$.
In the outer loop, the image is warped and normalized, $\mathbf{X}(\mathbf{p})$, and the Jacobian matrix $\mathbf{J}(\mathbf{p})$ is computed; the inner (ADMM) loop then iterates the following steps, and upon convergence the update $\mathbf{p} \leftarrow \mathbf{p} + \Delta\mathbf{p}$ is applied.

Step 1: Update $\mathbf{L}$: the nuclear norm regularized least-squares subproblem (10) is solved in closed form by the Singular Value Thresholding (SVT) operator.
Step 2: Update $\mathbf{c}$: (12) is a quadratic problem which admits a closed-form solution, given element-wise.
Step 3: Update $\Delta\mathbf{p}$: the increment of the parameters $\Delta\mathbf{p}$ is computed by solving the least-squares problem (14).
Step 4: Update $\mathbf{e}$: the closed-form solution of (16) is given by applying the shrinkage operator element-wise.
Step 5: Update the Lagrange multipliers $\mathbf{a}$, $\mathbf{B}$, and the parameter $\mu$ by (18).

Convergence criteria: the inner loop of Algorithm 1 terminates when the residuals of the equality constraints fall below the thresholds $\epsilon_1$ and $\epsilon_2$. Algorithm 1 terminates when the change of $\|\mathbf{L}\|_* + \lambda\|\mathbf{E}\|_1$ between two successive iterations is smaller than a predefined threshold $\epsilon_3$, or when the maximum number of outer-loop iterations is reached.
Computational complexity: the dominant cost of each iteration of Algorithm 1 is that of the Singular Value Decomposition (SVD) involved in the computation of the SVT operator in the update of $\mathbf{L}$ (Step 1). Consequently, the computational complexity of Algorithm 1 is $O(T(\min(m, n)^3 + n^2 m))$, where $T$ is the total number of iterations until convergence.
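The two closed-form proximal operators used in Steps 1 and 4 can be sketched in a few lines of numpy (a minimal illustration with our own variable names, not the paper's implementation):

```python
import numpy as np

def shrink(X, tau):
    """Element-wise soft-thresholding (shrinkage) operator of Step 4."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    """Singular Value Thresholding of Step 1: shrink the singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(shrink(s, tau)) @ Vt

A = np.diag([3.0, 1.0, 0.2])
L = svt(A, 0.5)
# Singular values 3, 1, 0.2 become 2.5, 0.5, 0: the smallest one is zeroed,
# so the result has lower rank than the input.
print(np.linalg.matrix_rank(L))  # -> 2
```

Applied to the residual of the corresponding subproblem with threshold $1/\mu$ (for SVT) and $\lambda/\mu$ (for shrinkage), these operators yield the closed-form updates of $\mathbf{L}$ and $\mathbf{e}$.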
Convergence: Regarding the convergence of Algorithm 1, there is currently no known theoretical proof for the ADMM in problems with more than two blocks of variables. However, the ADMM has been applied successfully to non-convex optimization problems in practice (Sagonas et al. 2014; Peng et al. 2012; Panagakis et al. 2015; Georgakis et al. 2016; Papamakarios et al. 2014). In addition, the thorough experimental evaluation of the proposed method, presented in Sect. 5, indicates that Algorithm 1 converges empirically on all the data on which the RSF was tested. In Fig. 4, the empirical convergence curves of the inner loop of Algorithm 1 for the cases of human and cat faces are depicted. The low-rank and sparse error matrices produced after 30, 50 and 117 iterations, respectively, are also shown.

Feature-Based RSF (F-RSF)
In this section, we extend the RSF so that it can be applied to images represented by multi-channel features, e.g., SIFT (Lowe 1999), HoG (Dalal and Triggs 2005), IGO (Tzimiropoulos et al. 2012), etc. The proposed extension is coined Feature-based RSF (F-RSF). Given an input image $\mathbf{Q} \in \mathbb{R}^{h \times r}$ and a feature extraction function $\mathcal{K}: \mathbb{R}^{h \times r} \to \mathbb{R}^{h \cdot r \times G}$, the feature-based representation of the image is defined as $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_G] \in \mathbb{R}^{h \cdot r \times G}$, where $G$ is the number of channels. Then, the problem of recovering the clean frontal view in the feature space is formulated as in (20), where $\mathbf{L}_j$ is the low-rank image, $\mathbf{c}_j$ the linear combination coefficients, $\mathbf{e}_j$ the sparse error, and $\mathbf{J}_j$ the Jacobian for each channel $j = \{1, 2, \ldots, G\}$. The shape parameters $\mathbf{p}$ and the corresponding increments $\Delta\mathbf{p}$ are the same for all channels. Furthermore, $\mathbf{U}_j$ are basis matrices computed using the $j$th channel of expressionless clean frontal images. To minimize (20), the ADMM method is applied on the augmented Lagrangian (21), where $\mathcal{M}_j = \{\mathbf{a}_j, \mathbf{B}_j\}_{j=1}^{G}$ are the Lagrange multipliers. Similarly to Algorithm 1, the proposed ADMM-based solver (outlined in Algorithm 2) minimizes (21) with respect to each variable in an alternating fashion, and finally the Lagrange multipliers are updated at each iteration.
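A minimal sketch of a multi-channel representation follows. The toy gradient-orientation channels here stand in for SIFT/HoG/IGO; the function and binning scheme are illustrative assumptions of ours, not the descriptors used in the paper.

```python
import numpy as np

def orientation_channels(img, G=4):
    """Map an h x r image to a (h*r) x G multi-channel representation:
    gradient magnitude scattered into G orientation bins."""
    gy, gx = np.gradient(img.astype(float))
    angle = np.mod(np.arctan2(gy, gx), np.pi)   # orientation in [0, pi)
    mag = np.hypot(gx, gy)
    bins = np.minimum((angle / np.pi * G).astype(int), G - 1)
    chans = [np.where(bins == j, mag, 0.0).ravel() for j in range(G)]
    return np.column_stack(chans)               # shape (h*r, G)

# A diagonal ramp: constant gradient direction, so all energy lands in one bin.
img = np.add.outer(np.arange(8.0), np.arange(8.0))
X = orientation_channels(img)
print(X.shape)  # -> (64, 4)
```

In the F-RSF setting, each column $\mathbf{x}_j$ of such a representation would get its own basis $\mathbf{U}_j$ built from the corresponding channel of the frontal training images.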

Robust Face Frontalization in Videos
Recognizing faces in videos is a task of paramount importance due to the wide range of commercial and surveillance applications. In recent years, the increasing popularity of commercial cameras, smart-phones, and video repositories such as YouTube has led to an increase of videos taken under uncontrolled (in-the-wild) conditions. The major problem in recognizing a person in an in-the-wild video is that the appearance of the face changes dramatically under different poses, expressions, occlusions, and illumination conditions. To tackle these issues, the method proposed in Sect. 3 can be applied independently to each frame of the video. Therefore, given a video sequence $\{\mathbf{X}^{(i)} \in \mathbb{R}^{h \times r}\}_{i=1}^{F}$ and the initial position of the landmarks in each frame, the corresponding low-rank images and corrected landmarks are produced. Then, recognition can be performed by employing only the frontalized images $\{\mathbf{L}^{(i)}\}_{i=1}^{F}$ (Fig. 5). However, by processing each frame independently rather than all frames together, we do not take into consideration the temporal correlation among the frames. In the case where all the frames are well-aligned, the image ensemble $\mathbf{D} = [\operatorname{vec}(\mathbf{X}^{(1)}(\mathbf{p}^{(1)})) | \cdots | \operatorname{vec}(\mathbf{X}^{(F)}(\mathbf{p}^{(F)}))] \in \mathbb{R}^{m \cdot n \times F}$ lies in a low-rank subspace. By exploiting that fact, the problem of face frontalization in video can be formulated as the constrained low-rank minimization problem (22).

Algorithm 2: Solving (21) by the ADMM method.
Data: Feature-based representation of the test image $\mathbf{X}$, initial shape parameters $\mathbf{p}$, clean frontal-view face subspaces $\{\mathbf{U}_j\}_{j=1}^{G}$, and the parameter $\lambda$.
Result: The low-rank clean frontal images $\mathbf{L}_j$, the sparse errors $\mathbf{e}_j$, the coefficient vectors $\mathbf{c}_j$, and the shape parameters $\mathbf{p}$, $j = \{1, 2, \ldots, G\}$.
In the outer loop, for each channel $j = 1, \ldots, G$, the image channel is warped and normalized, $\mathbf{x}_j(\mathbf{p})$, and the Jacobian $\mathbf{J}_j(\mathbf{p})$ is computed. The inner loop then updates each variable in turn, updates the Lagrange multipliers $\mathbf{a}_j$, $\mathbf{B}_j$ and $\mu \leftarrow \min(\rho\mu, 10^{10})$, and checks the convergence conditions; upon convergence, $\mathbf{p} \leftarrow \mathbf{p} + \Delta\mathbf{p}$.

Fig. 5 Robust face frontalization in video: given a video sequence consisting of F frames of the same subject, the results from a detector, and a statistical model U, a constrained low-rank minimization problem is solved. The frontalized images, the increments of the parameters, and the sparse error matrices $\{\mathbf{L}^{(i)}, \Delta\mathbf{p}^{(i)}, \mathbf{E}^{(i)}\}_{i=1}^{F}$ are computed, subject to the constraint that the frontalized version of each frame is a low-rank image and that the ensemble of all frontalized images is also low-rank.

In (22), $\mathbf{q}^{(1)}, \ldots, \mathbf{q}^{(F)}$ are the standard basis vectors of $\mathbb{R}^{F}$ and $\mathbf{O} \in \mathbb{R}^{m \cdot n \times F}$ is a sparse error matrix. To minimize (22), the ADMM is applied on the augmented Lagrangian (23), yielding a procedure similar to Algorithm 1; the corresponding Lagrange multipliers appear in (23).
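The batch observation behind RSF-V can be checked numerically on toy data: well-aligned frames, stacked as columns of $\mathbf{D}$, span a low-dimensional subspace. The sketch below uses synthetic "frames" of our own construction, not real video.

```python
import numpy as np

rng = np.random.default_rng(0)
face = rng.random((32, 32))                      # toy "identity" appearance

# Well-aligned frames differing only by illumination-like global offsets.
aligned = [face + 0.01 * k for k in range(10)]
D = np.column_stack([f.ravel() for f in aligned])  # ensemble, shape (1024, 10)

# Every column lies in span{face, constant image}, so the ensemble is rank 2,
# far below the ambient dimension min(1024, 10) = 10.
print(np.linalg.matrix_rank(D))  # -> 2
```

Misalignment (per-frame shifts or warps) would break this column-space structure and inflate the rank, which is why RSF-V couples the per-frame frontalization with a joint low-rank constraint on the ensemble.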

Experimental Evaluation
The performance of the RSF is assessed in five different tasks, using, among others, the FS (Zhang et al. 2011b; Wang and Tang 2009) and CAT (Zhang et al. 2008) databases. Furthermore, the YTF (Wolf et al. 2011) database is employed in order to evaluate the performance of the RSF-V in the video face verification task.

Data Description
Let us first provide a brief description of the databases used in the evaluation studies.  CAT: The CAT (Zhang et al. 2008) database consists of 10,000 cat images obtained from flickr.com. Annotations regarding 9 points for each cat head are provided. A subset of 350 images was used in the conducted experiments. The selected images were re-annotated by employing a dense mark-up scheme consisting of 48 points (Sagonas et al. 2015).

Experimental Setup
In all the experiments, the orthonormal clean frontal subspace U was constructed by employing only frontal-view face images without occlusions. The images were warped into a reference frame by using the warp W (cf. Sect. 2). Subsequently, PCA was applied on the warped shape-free textures, and the first k eigen-images with the highest variance were used to form U. In Table 1, information regarding the construction of U, as used in our experimental evaluation, is provided.

Reconstruction of Frontal View
The ability of the RSF to reconstruct the frontal view from non-frontal images of unseen faces is investigated in this section. Given the test image and the initial landmarks, a warped shape-free texture is produced and (3) is solved (inner loop of Algorithm 1); the warp is then updated and (3) is solved again. Finally, after the convergence of Algorithm 1, the final frontalized test image, the locations of the landmarks, and the sparse error matrix are produced. All the frontalizations presented in this section were created by using $\mathbf{U}_W$, $\mathbf{U}_C$, and $\mathbf{U}_S$. Unless otherwise stated, throughout the experiments the parameters of Algorithm 1 were fixed as follows: $\lambda = 0.3$, $\rho = 1.1$, $\epsilon_1 = 10^{-5}$, $\epsilon_2 = 10^{-7}$, and $\epsilon_3 = 10^{-3}$.
In Fig. 6a, b the frontalized views of unseen faces from the LFPW, Helen, AFW and LFW databases are illustrated. Figure 6c, d depict the reconstructed frontal views from the non-frontal images of the subject with id '00268' from FERET, and from Multi-PIE images with (a) 'Surprise' at −30°, (b) 'Scream' at −15°, (c) 'Squint' at 0°, (d) 'Neutral' at +15°, and (e) 'Smile' at +30°. The efficacy of the RSF is also assessed by creating the frontal view of face sketches and cat faces. The obtained reconstructions for these objects are depicted in Fig. 6e, f. By visually inspecting the results, it is clear that the RSF is robust to many variations, such as pose, expression, and sparse occlusions. This is attributed to the fact that the matrix $\ell_1$ norm was adopted for sparse non-Gaussian noise characterization.
In order to assess the effectiveness of the RSF in handling different illumination conditions, we conducted the following experiment. We selected 'Neutral' images of three subjects from the Multi-PIE database under poses −15° to 15°. For each pose and subject, 11 images captured under 11 different illumination conditions were used. Then, the images of each subject (30 in total) were frontalized by employing the RSF with the basis matrix $\mathbf{U}_W$. The obtained frontalized views of all subjects are depicted in Fig. 8. As can be observed, the RSF successfully reconstructs the frontal view of the unseen subject and in most cases removes the illumination effects.
As an additional example, 100 images (10 images for each subject) of 10 subjects from the CACD database were frontalized by employing Algorithm 1. In Fig. 7 the averages of the input images, the frontalized images, and the sparse error matrices are depicted. As can be observed, the average faces after frontalization are much sharper and more detailed than the average input images, indicating the frontalization quality achieved by the RSF.
To quantitatively assess the quality of the frontalized images, the following experiment was conducted. 'Neutral' images of 20 different subjects from Multi-PIE under poses −30° to 30° (5 for each subject, 100 in total) were selected. The images of each subject were frontalized by employing the RSF. The Root Mean Square Error (RMSE) between each frontalized image and the real frontal image of the subject is used as the evaluation metric. The average RMSE of the RSF is 0.0817, compared with an average RMSE of 0.1025 achieved by the frontalization method of DeepFace (Taigman et al. 2014). It is worth noting that, even though DeepFace employs a 3D model to handle out-of-plane rotations, the RSF performs better without using any kind of 3D information.
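The RMSE metric used above can be sketched as follows (for images with intensities scaled to [0, 1]; the toy images are our own, purely for illustration):

```python
import numpy as np

def rmse(frontalized, ground_truth):
    """Root Mean Square Error between a frontalized image and the true frontal view."""
    diff = frontalized.astype(float) - ground_truth.astype(float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Sanity check: a constant intensity offset of 0.1 yields an RMSE of exactly 0.1.
gt = np.full((64, 64), 0.5)
est = gt + 0.1
print(round(rmse(est, gt), 6))  # -> 0.1
```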

Landmark Localization
The performance of the RSF for the generic alignment problem is assessed by conducting experiments on (a) in-the-wild faces, (b) sketch faces, and (c) cat faces. To this end, the performance of the RSF is compared to that obtained by TILT (Zhang et al. 2012), AAMs (Matthews and Baker 2004), CLMs (Saragih et al. 2011), and the SDM (Xiong and De la Torre 2013). In order to fairly compare the competing methods, the same training data (the same images which were used to build $\mathbf{U}_W$), initialization, and feature representation were employed. For all experiments, the simple representation of pixel intensities (PIs) was used. The average point-to-point Euclidean distance of the $N$ landmark points, normalized by the Euclidean distance between the outer corners of the eyes, is used as the evaluation measure. More specifically, denoting the ground truth and fitted shapes of an image as $\mathbf{s}_{gt}$ and $\mathbf{s}_f$ respectively, and the Euclidean distance between the outer corners of the eyes as $d_{\text{outer}}$, the fitting error is given by $\frac{1}{N d_{\text{outer}}} \sum_{j=1}^{N} \sqrt{(x_f^{(j)} - x_{gt}^{(j)})^2 + (y_f^{(j)} - y_{gt}^{(j)})^2}$. In addition, the cumulative error distribution (CED) curve for each method was computed as the fraction of test images for which the average error was smaller than a threshold.
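The evaluation measure and the CED computation can be sketched as follows (toy landmark coordinates; the eye-corner index arguments are illustrative assumptions of ours):

```python
import numpy as np

def normalized_error(s_fit, s_gt, left_eye_idx, right_eye_idx):
    """Average point-to-point error normalized by the outer-eye-corner distance.

    Shapes are (N, 2) arrays of landmark coordinates.
    """
    d_outer = np.linalg.norm(s_gt[left_eye_idx] - s_gt[right_eye_idx])
    per_point = np.linalg.norm(s_fit - s_gt, axis=1)
    return float(per_point.mean() / d_outer)

def ced(errors, thresholds):
    """Cumulative error distribution: fraction of images with error below each threshold."""
    errors = np.asarray(errors)
    return [float((errors <= t).mean()) for t in thresholds]

# Toy example: 3 landmarks, fit displaced by 1 pixel each, inter-ocular distance 50.
s_gt = np.array([[0.0, 0.0], [50.0, 0.0], [25.0, 30.0]])
s_fit = s_gt + np.array([1.0, 0.0])
print(normalized_error(s_fit, s_gt, 0, 1))  # -> 0.02
print(ced([0.01, 0.02, 0.08], [0.05]))      # 2 of 3 images below 0.05
```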

Aligning in-the-Wild Face Images
The in-the-wild face databases LFPW, HELEN and AFW were employed in order to assess the performance of the RSF on the problem of generic face alignment. The results produced by the detector in (Zhu and Ramanan 2012) were used to initialize all the methods. The annotations provided in (Sagonas et al. 2013b, a, 2016) have been employed for evaluation purposes. The error for each method was computed based on N = 49 interior landmark points (excluding the points corresponding to the face boundary). Finally, the basis matrices $\mathbf{U}_L$, $\mathbf{U}_H$ and $\mathbf{U}_W$ were used by the RSF. The CEDs produced by all methods for the LFPW (test set), the HELEN (test set), and the AFW databases are depicted in Fig. 9a. Clearly, the RSF outperforms TILT-PIs, AAMs-PIs, CLMs-PIs, and SDM-PIs. More specifically, for a normalized error of 0.05, the RSF yields a 20.1, 21.5 and 24.6% improvement over the AAMs-PIs in the LFPW, HELEN and AFW databases, respectively. TILT performs worst overall, which can be explained by the fact that it minimizes the unconstrained rank of the image ensemble. The discriminative methods SDM and CLMs yield poor performance because they were trained with only 500 frontal images; in general, discriminative methods require a large amount of annotated data in order to yield powerful classifiers and functional mappings. In contrast, AAMs, which are generative models, achieved better results than the CLMs and SDM. In Table 2 the proportions of images with normalized error lower than 0.02, 0.03, and 0.05 for the competing methods are reported. A few fitting examples from the test databases are depicted in Fig. 12. Furthermore, we computed the average time, in CPU seconds, that each method requires to fit one image. By inspecting Table 3, we observe that the CLMs, AAMs, and SDM are faster than the RSF. This is attributed to the high computational cost of the SVD involved in the SVT operator (Step 1).
The computational complexity of the RSF can be reduced by using fast variants of the Singular Value Thresholding operator, e.g., (Cai and Osher 2010; Oh et al. 2015), in order to solve the nuclear norm regularized least-squares problem (10). However, such a modification is outside the scope of this paper.
We also compared the RSF to the state-of-the-art methods SDM (Xiong and De la Torre 2013), LBF (Ren et al. 2014), and ERT (Kazemi and Sullivan 2014). For the SDM, the pre-trained model and code provided by the authors were used, while the LBF and ERT were trained and tested by using the available implementations. In particular, the LBF and ERT were trained using the AFW and the train sets of LFPW and HELEN. The parameters were set as explained in the corresponding papers. The CEDs from this experiment are shown in Fig. 9b. The RSF achieves comparable performance with that obtained by the competing methods, but it uses only a small set of frontal images for training. This is in contrast to all the other methods, which were trained on thousands of images captured under several variations, including different poses, illuminations, and expressions (i.e., the train sets of the used databases). Furthermore, the SDM takes full advantage of SIFT, a powerful hand-crafted feature, while the RSF employs only simple PIs. Figure 12a illustrates fitting examples produced by the RSF.
The performance of the F-RSF on generic face alignment is also assessed by conducting experiments on the LFPW and HELEN databases. To this end, the same initializations and procedure described above were followed. Dense SIFT features with G = 36 channels were used by the F-RSF. In order to build the basis matrices U_j, j = {1, 2, ..., G}, we computed the dense SIFT features of the clean frontal images, and then the images corresponding to each channel j were used to compute U_j. The performance of the F-RSF is compared against that obtained by the RSF-PIs and the state-of-the-art methods SDM, LBF, and ERT. The CEDs produced by the competing methods are presented in Fig. 9b. As can be seen, the F-RSF outperforms the RSF-PIs, SDM, and LBF, while performing very close to the state-of-the-art ERT.
Even though the intrinsic motivation of the RSF is to deal with gross but sparse, non-Gaussian noise that often appears in face images acquired under real-world conditions (e.g., device artifacts such as pixel corruptions, missing and incomplete data such as partial texture occlusions, or localization errors), the RSF can implicitly handle data contaminated by Gaussian noise by vanishing the error term, that is, by setting the weighting parameter in optimization problem (2) to λ → ∞, i.e., E = 0. In this case, the ℓ2-norm term (μ/2)||H^(1)(Δp, c)||_2^2 appearing in the augmented Lagrangian function (5) is deemed the appropriate regularizer for handling Gaussian noise.
The effectiveness of the RSF-PIs under Gaussian noise is assessed in face frontalization and landmark localization. In both experiments the parameter λ was set equal to 10,000. In Fig. 10 the frontalized faces obtained by the ℓ1-RSF-PIs and the ℓ2-RSF-PIs using U_W are depicted in rows 2 and 3, respectively. As can be seen, the faces produced by the

Aligning Cat and Sketch Face Images
The RSF is a general technique, and we demonstrate this through its ability to align face sketches and cat faces. To this end, we use the FS and CAT databases. The basis matrices U_C and U_S were employed, and the fitting error in the case of CAT was calculated based on the N = 37 interior landmark points (excluding the boundary points). The results obtained by the compared methods are summarized in Fig. 9c, d and Table 2, while the quality of the fitting results produced by the methods can be seen in Fig. 12. The RSF outperforms all the other methods, demonstrating its ability to handle any face-like object.

Pose-Invariant Face Recognition
The performance of the RSF on pose-invariant face recognition with one gallery image per person is assessed by conducting experiments on the Multi-PIE and FERET databases. The experiment proceeds as follows. First, the frontal views of all images used in this experiment were reconstructed following the methodology described in Sect. 5.3 by employing U_W. In order to remove the surrounding black pixels, the reconstructed frontal views were cropped. Subsequently, Image Gradient Orientations (IGOs) features (Tzimiropoulos et al. 2012) were used for image representation. Let us denote an image in vectorial form as v of size d × 1, where d is the number of pixels. Moreover, g_x and g_y denote the image gradients and φ = arctan(g_x / g_y) the corresponding gradient orientation vector. The normalized gradient extraction function F : R^{d×1} → R^{2d×1} is defined as F(v) = (1/√d)[cos(φ)^T, sin(φ)^T]^T, where cos(φ) = [cos(φ(1)), ..., cos(φ(d))]^T and sin(φ) = [sin(φ(1)), ..., sin(φ(d))]^T. The dimensionality of the IGOs was reduced by applying PCA. Finally, the classification was performed by employing the Collaborative Representation based Classifier (CRC) in (Zhang et al. 2011a). The performance of the RSF is compared to 2D-based methods: LGBP (Zhang et al. 2005) and PIMRF (Ho and Chellappa 2013); 3D-based methods: 3DPN (Asthana et al. 2011), EGFC (Li et al. 2012b), and PAF (Yi et al. 2013); as well as deep learning-based methods: SPAE (Kan et al. 2014) and DIPFS. It should be noted that all methods were evaluated under the fully automatic scenario, where both the bounding box of the face region and the facial landmarks were located automatically.
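A minimal sketch of the IGO representation described above. The four-quadrant arctan2 is used here instead of arctan for numerical robustness; this, and the toy image, are implementation assumptions rather than the paper's exact formulation.

```python
import numpy as np

def igo_features(image):
    """Image Gradient Orientation (IGO) descriptor: the per-pixel gradient
    orientation mapped onto the unit circle, concatenated and scaled by
    1/sqrt(d), following Tzimiropoulos et al. (2012)."""
    gy, gx = np.gradient(image.astype(float))  # image gradients
    phi = np.arctan2(gy, gx)                   # orientation at each pixel
    d = image.size
    return np.concatenate([np.cos(phi).ravel(),
                           np.sin(phi).ravel()]) / np.sqrt(d)

img = np.arange(64, dtype=float).reshape(8, 8)  # toy stand-in for a face crop
f = igo_features(img)                           # length 2*d descriptor
```

By construction the descriptor has unit ℓ2 norm, since cos²(φ(i)) + sin²(φ(i)) = 1 at every pixel, which makes the representation insensitive to gradient magnitude.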

Results on FERET
One frontal image, denoted as 'ba', from each of the 200 subjects was used to form the gallery set, while the images captured at 6 different poses, i.e., −40° to 40°, were selected as the probe images. Before comparing the RSF with existing methods, the impact of the number of eigen-images k on the recognition performance was investigated. To this end, the clean frontal subspace U_W with k ∈ {50, 150, 250, 350, 450} was used in order to frontalize the images. Figure 13 shows the recognition accuracy obtained for each k. It is clear that the more eigen-images are used, the better the performance. In particular, a steep improvement is observed at large poses such as −40° and 40°. The self-occlusions appearing at large poses result in high variability of the textures in these cases, which explains why using more eigen-images leads to improved performance.
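The clean frontal subspace used above is, in essence, the span of the top-k eigen-images of the frontal training faces. A minimal sketch of how such a basis and a rank-k projection could be computed; random arrays stand in for the actual frontal images, and the mean handling is deliberately simplified.

```python
import numpy as np

def eigenimage_basis(frontal_images, k):
    """Top-k eigen-images: the leading left singular vectors of the
    mean-centred matrix whose columns are the vectorized frontal faces."""
    X = np.stack([im.ravel() for im in frontal_images], axis=1).astype(float)
    X = X - X.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k]

def project_onto_basis(image, U):
    """Project a (vectorized) face onto the span of the eigen-images
    (mean subtraction/addition omitted for brevity)."""
    v = image.ravel().astype(float)
    return (U @ (U.T @ v)).reshape(image.shape)

rng = np.random.default_rng(0)
faces = [rng.random((12, 10)) for _ in range(20)]  # stand-ins for frontal faces
U5 = eigenimage_basis(faces, k=5)
recon = project_onto_basis(faces[0], U5)
```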
In Table 4 the recognition rates achieved by the competing methods at the different poses are reported. Clearly, the RSF (recognition accuracy 98.58%) outperforms both the 2D and 3D state-of-the-art methods. It is worth mentioning that the PIMRF employs 200 images from the FERET database (different from the test set) in order to train the frontal synthesizer; consequently, the different lighting conditions of the database are taken into account. This is not the case for the RSF, where only frontal images from generic in-the-wild databases (i.e., the LFPW and HELEN) have been used. Even though the RSF does not use any kind of 3D information, it performs comparably to the PAF, where an elaborate 3D model (trained on 4624 facial scans) has been used.

Results on Multi-PIE
The images of 137 subjects (Subject IDs 201-346) with 'Neutral' expression and poses −30° to +30°, captured over 4 different sessions, were selected. The gallery was created from the frontal images of the earliest session for each subject, while the rest of the images, including frontal and non-frontal views, formed the probe set; the recognition rates are reported in Table 5. The RSF outperforms four out of the five methods it is compared to. The RSF also performs comparably to the DIPFS, despite using only 500 frontal images from outside Multi-PIE. It should be noted that in the DIPFS the positions of the eyes, which were used to align both the train and test images, were located manually. On the contrary, the RSF is a fully automatic method and all the landmarks were detected automatically. Furthermore, the U_LFW used by the RSF was built from images outside Multi-PIE, while only images from Multi-PIE were used by the DIPFS to build the deep-learning feature extractor.

Image Face Verification on LFW Database
The performance of the RSF in face verification under in-the-wild conditions is assessed by conducting experiments on the LFW database, using the image-restricted, no-outside-data setting. The standard evaluation protocol, which splits the View 2 dataset into 10 folds, with each fold consisting of 300 intra-class pairs and 300 inter-class pairs, was employed. In Fig. 14 sample image pairs of the same and different persons are depicted. As can be seen, in the case of the same-person pair there is a large change in the appearance of the subject (different pose and illumination conditions, sunglasses).
In this experiment the basis U_W and the detector in (Zhu and Ramanan 2012) were not used, since they are based on images outside the database. To create the initializations and a new basis U_LFW, the method for the automatic construction of deformable models presented in (Antonakos and Zafeiriou 2014) was employed. The goal of this method is to build a deformable model using only a set of images with the corresponding face bounding boxes. To define the face bounding boxes without using a pre-trained detector, the deep-funneled images of the LFW were employed; since these images are aligned, the exact face bounding box is known. Subsequently, a deformable model was built automatically from the training images of each fold. The created model was fitted to all images, and those (from the training images) with fitted shapes similar to the mean shape were selected to build the basis U_LFW. In each fold the images were frontalized using U_LFW and subsequently cropped. The gradient orientations φ_1, φ_2 of each image pair were extracted, and the cosine of the difference between them, Δφ = φ_1 − φ_2, was normalized to the range [0, 1]; the verification results are reported in Table 6. In order to make the table self-contained, the results achieved using multiple descriptors and flipped training images are also reported. By inspecting Table 6, it can be seen that the RSF outperforms the APEM-SIFT, MRF-MLBP, Eigen-PEP, and Spartans, and performs comparably to the recently published MRF-MLBP-CSKDA and POP-PEP. It is worth mentioning that the MRF-MLBP-CSKDA employs an MRF. Recently, a new frontalized version of the LFW, named LFW3D, was proposed in (Hassner et al. 2015). In order to compare the quality of the frontalizations between the RSF and LFW3D, the same classification framework as before was applied to LFW3D. The achieved accuracy is 79.28%, while the accuracy achieved by the RSF is 88.81%.
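The core of the pair feature described above, the per-pixel cosine of the gradient-orientation difference rescaled to [0, 1], can be sketched as follows. The subsequent PCA and classification stages are omitted, and both the use of arctan2 and the toy images are implementation assumptions.

```python
import numpy as np

def orientation_similarity_map(img1, img2):
    """Per-pixel cosine of the gradient-orientation difference between two
    images, rescaled from [-1, 1] to [0, 1]; values near 1 indicate
    locally similar image structure."""
    def orientations(im):
        gy, gx = np.gradient(im.astype(float))
        return np.arctan2(gy, gx)
    dphi = orientations(img1) - orientations(img2)
    return (np.cos(dphi) + 1.0) / 2.0

# Toy check: an image compared against itself gives a map of ones.
a = np.add.outer(np.arange(8.0), np.arange(8.0))
s_same = orientation_similarity_map(a, a)
s_diff = orientation_similarity_map(a, a.T[::-1])  # orientations rotated by 90°
```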
This is quite an interesting result, since the proposed RSF method does not use any kind of 3D information. It is due to the fact that in the RSF sparse noise, such as occlusions and illumination artifacts, is removed from the frontalized images.

Video Face Verification on YouTube Faces Database
The YTF (Wolf et al. 2011) was employed in order to assess the performance of the RSF-V in the problem of video-based face verification. The standard restricted evaluation protocol of 10 folds, with each fold consisting of 250 intra-class and 250 inter-class pairs, was adopted. The experiment proceeds as follows. First, the RSF-V was employed in order to frontalize the frames of each video. Then, the mean appearance of each video was computed based on the frontalized frames. Subsequently, for each pair of videos the Δφ were extracted from the corresponding mean appearances and their dimensionality was reduced by applying PCA. Finally, an RBF-SVM classifier was used in order to predict the labels of the test pairs.
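The first two stages of this pipeline (the mean appearance per video and the Δφ pair feature) can be sketched as follows; the PCA and RBF-SVM stages are omitted, and the random frames are placeholders for actual frontalized video frames.

```python
import numpy as np

def mean_appearance(frames):
    """Average the (already frontalized) frames of a video into a
    single image."""
    return np.mean(np.stack(frames, axis=0), axis=0)

def delta_phi_feature(mean1, mean2):
    """Difference of gradient orientations between two mean appearances,
    flattened into a pair feature; dimensionality reduction and the
    classifier would follow in the paper's pipeline."""
    def phi(im):
        gy, gx = np.gradient(im.astype(float))
        return np.arctan2(gy, gx)
    return (phi(mean1) - phi(mean2)).ravel()

rng = np.random.default_rng(1)
video_a = [rng.random((16, 16)) for _ in range(5)]  # placeholder frames
video_b = [rng.random((16, 16)) for _ in range(5)]
feat = delta_phi_feature(mean_appearance(video_a), mean_appearance(video_b))
```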
Given that the RSF-V was trained using only the provided images, we chose to compare its performance against that obtained by methods trained without flipped images. As shown in Table 7, the RSF-V outperforms all the compared methods that use only the provided images of the database. Please note that the RSF-V achieves state-of-the-art results by employing only frontal images and IGO features computed at a single scale. Furthermore, in order to show the effectiveness of the video-based RSF (i.e., the RSF-V) against the single-frame RSF, the following experiment was conducted. We followed the same procedure as before but, instead of producing the frontalized frames using the RSF-V, we applied the RSF to each frame independently. Then, the frontalized frames were used to compute the mean appearance of each video, and the same feature extraction and classification steps were applied. The classification accuracy achieved by the frame-by-frame RSF is 0.8051 ± 0.025, while the accuracy of the RSF-V is 0.8320 ± 0.015. This improvement indicates that the incorporation of temporal information in the RSF-V leads to frontalizations of better quality.

Video Inpainting
The ultimate goal of video inpainting is to restore damaged areas or to remove unwanted elements from an image sequence. In order to investigate the effectiveness of the proposed method in this task, two image sequences were used: one from the movie 300 and another depicting a woman during a make-up session (acquired from YouTube). The selected sequences are very challenging due to the presence of variations in pose, expression, illumination conditions, image quality, and occlusions. More specifically, occlusions due to hands, fingers, brushes, rings, and earrings are present in the videos. In addition, the use of different powders and creams changed the appearance of the face.
The aim of this experiment was to remove the unwanted elements from the faces in the whole sequence and produce a clean version of it. To this end, the position of the face in each frame was found by the detector in (Zhu and Ramanan 2012), and then the methods presented in Sects. 3 and 4 were employed in order to generate the clean frontal version of the face in each frame. Subsequently, the frontalized images were warped from the reference frame back to the original frame by using the corrected landmark points and the inverse warp function W^{-1}. Figure 15 depicts results obtained for some representative frames of the test video: the detector in (Zhu and Ramanan 2012) locates the face in each frame (Fig. 15a); the frontalized and error images recovered by the RSF and the RSF-V are presented in Fig. 15b, c, respectively; and, by using the landmark points obtained by the RSF-V (Fig. 15d), the frontalized clean image is back-warped into the input frame (Fig. 15e). As can be observed (specifically inside the red dotted boxes), the results of the RSF-V are of better quality than those of the RSF, which is attributed to the fact that all the faces of the subject span a low-rank subspace. By visually inspecting the results of the inverse warping (Fig. 15e) it can be noticed that all occlusions have been removed and the recovered face is of high quality. A video demonstrating the RSF-V is available at: https://www.youtube.com/watch?v=kSnFehb55O4&fmt=22 (when watching the video, please make sure that full quality and resolution are enabled).

Conclusions
In this paper, to the best of our knowledge, we presented the first method that jointly performs landmark localization and face frontalization using only a simple statistical model built from a few hundred frontal images. The proposed method outperforms state-of-the-art methods for facial landmark localization that were trained on thousands of images in many poses, and achieves comparable results in pose-invariant face recognition and verification without using elaborate 3D models or features extracted by deep learning methodologies.