Implementing a face beautification algorithm with G-API
In this tutorial you will learn:
- Basics of a sample face beautification algorithm;
- How to infer different networks inside a pipeline with G-API;
- How to run a pipeline on a video stream.
This sample requires:
- the following topologies from OpenVINO™ Open Model Zoo:
  - face-detection-adas-0001;
  - facial-landmarks-35-adas-0002.
We will implement a simple face beautification algorithm using a combination of modern Deep Learning techniques and traditional Computer Vision. The general idea behind the algorithm is to make the face skin smoother while preserving the contrast of face features like the eyes and the mouth. The algorithm identifies parts of the face using DNN inference, applies different filters to the parts found, and then combines them into the final result using basic image arithmetic.
Briefly, the algorithm can be described as follows:
- the input image is passed through an unsharp-mask filter and a bilateral filter, producing a sharpened and a smoothed version of the frame;
- an SSD-based face detector finds faces on the frame, and a facial landmarks detector locates face elements for every face found;
- based on the landmarks, three non-intersecting masks are generated: for the areas to be sharpened (eyes, mouth), for the areas to be smoothed (the rest of the face skin), and for the background;
- the final image is composed by applying each mask to the corresponding image and summing the results.
Generating face element masks based on a limited set of features (just 35 per face, including all its parts) is not trivial; it is described in the sections below.
This sample uses two DNN detectors. Every network takes one input and produces one output. In G-API, networks are defined with the G_API_NET() macro:
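For example, the two networks of this sample could be declared as follows (the tag strings are arbitrary unique identifiers; both signatures here assume one cv::GMat input and one cv::GMat output):

```cpp
#include <opencv2/gapi/infer.hpp>

namespace custom {
// A network is described by its type signature (output(input)) and a unique tag:
G_API_NET(FaceDetector,  <cv::GMat(cv::GMat)>, "face_detector");
G_API_NET(LandmDetector, <cv::GMat(cv::GMat)>, "landmarks_detector");
} // namespace custom
```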
To get more information, see the "Declaring Deep Learning topologies" section in the "Face Analytics pipeline" tutorial.
The code below generates a graph for the algorithm above:
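A condensed sketch of such a graph follows; it is not a complete program. Helper operations like GFacePostProc and the custom bilateral filter are introduced in the sections below, applyMask() is a hypothetical helper standing for the sample's mask arithmetic, and the constants are illustrative:

```cpp
// Illustrative constants (tune to taste):
const int    kBSize        = 9;      // bilateral filter diameter
const double kBSigmaCol    = 30.0;   // bilateral filter color sigma
const double kBSigmaSp     = 30.0;   // bilateral filter spatial sigma
const int    kUnshSigma    = 3;      // unsharp mask median blur size
const float  kUnshStrength = 0.7f;   // unsharp mask strength
const float  kConfThresh   = 0.7f;   // face detector confidence threshold

cv::GMat gimgIn;  // input frame

// Smoothed and sharpened versions of the whole frame; GBilatFilter is assumed
// to be a custom bilateral filter operation, unsharpMask() is defined below:
cv::GMat gimgBilat = custom::GBilatFilter::on(gimgIn, kBSize, kBSigmaCol, kBSigmaSp);
cv::GMat gimgSharp = custom::unsharpMask(gimgIn, kUnshSigma, kUnshStrength);

// Face detection and SSD output parsing (GFacePostProc is defined below):
cv::GMat detected = cv::gapi::infer<custom::FaceDetector>(gimgIn);
cv::GArray<cv::Rect> faces = custom::GFacePostProc::on(detected, gimgIn, kConfThresh);

// Landmarks inference for every face (the ROI-list-oriented infer overload):
cv::GArray<cv::GMat> landmarks = cv::gapi::infer<custom::LandmDetector>(faces, gimgIn);

// Placeholders for the three masks; their construction from the landmarks is
// shown in the sections below:
cv::GMat mskSharpG, mskBlurFinal, mskNoFaces;

// Final composition: apply each mask to its image and sum the results:
cv::GMat gimgOut = custom::applyMask(gimgSharp, mskSharpG)
                 + custom::applyMask(gimgBilat, mskBlurFinal)
                 + custom::applyMask(gimgIn,    mskNoFaces);

cv::GComputation pipeline(cv::GIn(gimgIn), cv::GOut(gimgOut));
```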
The resulting graph is a mixture of G-API's standard operations, user-defined operations (namespace custom::), and DNN inference. The generic function cv::gapi::infer<>() allows triggering inference within the pipeline; networks to infer are specified as template parameters. The sample code uses two versions of cv::gapi::infer<>():
- a frame-oriented one, which runs the network on the whole input frame (used for the face detector);
- an ROI-list oriented one, which runs the network on every region from a list (used for the landmarks detector, once per detected face).
More on this in "Face Analytics pipeline" (Building a GComputation section).
The unsharp mask \(U\) for image \(I\) is defined as:
\[U = I - s * L(M(I)),\]
where \(M()\) is a median filter, \(L()\) is the Laplace operator, and \(s\) is a strength coefficient. While G-API doesn't provide this function out-of-the-box, it is expressed naturally with the existing G-API operations:
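A sketch of such a function, close to what the sample does (here GLaplacian is assumed to be a custom Laplacian operation, since standard G-API had none at the time):

```cpp
// Unsharp mask: U = I - s * L(M(I)), expressed on G-API types.
inline cv::GMat custom::unsharpMask(const cv::GMat &src,
                                    const int       sigma,
                                    const float     strength)
{
    cv::GMat blurred   = cv::gapi::medianBlur(src, sigma);        // M(I)
    cv::GMat laplacian = custom::GLaplacian::on(blurred, CV_8U);  // L(M(I))
    return (src - (laplacian * strength));                        // I - s*L(M(I))
}
```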
Note that the code snippet above is a regular C++ function defined with G-API types. Users can write functions like this to simplify graph construction; when called, such a function just adds the relevant nodes to the pipeline it is used in.
The face beautification graph uses custom operations extensively. This chapter focuses on the most interesting kernels; refer to the G-API Kernel API for general information on defining operations and implementing kernels in G-API.
A face detector output is converted to an array of faces with the following kernel:
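A simplified sketch of such a kernel (the operation declaration and the parsing follow the SSD output layout described later on this page; names are illustrative):

```cpp
#include <opencv2/gapi/cpu/gcpukernel.hpp>

namespace custom {
// The operation: takes the detector output and the input frame, returns face ROIs.
G_API_OP(GFacePostProc, <cv::GArray<cv::Rect>(cv::GMat, cv::GMat, float)>,
         "custom.faceDetectPostProc")
{
    static cv::GArrayDesc outMeta(const cv::GMatDesc&, const cv::GMatDesc&, float)
    {
        return cv::empty_array_desc();
    }
};

// The OpenCV-based implementation: parse the [1x1x200x7] SSD blob, where every
// detection is 7 floats: [image_id, label, confidence, x_min, y_min, x_max, y_max].
GAPI_OCV_KERNEL(GCPUFacePostProc, GFacePostProc)
{
    static void run(const cv::Mat &inDetectResult, const cv::Mat &inFrame,
                    const float faceConfThreshold, std::vector<cv::Rect> &outFaces)
    {
        outFaces.clear();
        const int      kObjectSize = 7;
        const cv::Size upScale     = inFrame.size();
        const cv::Rect borders({0, 0}, upScale);    // to clip ROIs to the frame
        const float   *data = inDetectResult.ptr<float>();
        for (int i = 0; i < inDetectResult.size[2]; i++)
        {
            const float faceConfidence = data[i * kObjectSize + 2];
            if (faceConfidence < faceConfThreshold) continue;
            // Denormalize the box coordinates to pixels of the source frame:
            const int x1 = static_cast<int>(data[i * kObjectSize + 3] * upScale.width);
            const int y1 = static_cast<int>(data[i * kObjectSize + 4] * upScale.height);
            const int x2 = static_cast<int>(data[i * kObjectSize + 5] * upScale.width);
            const int y2 = static_cast<int>(data[i * kObjectSize + 6] * upScale.height);
            outFaces.push_back(cv::Rect(cv::Point(x1, y1), cv::Point(x2, y2)) & borders);
        }
    }
};
} // namespace custom
```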
The algorithm infers locations of face elements (like the eyes, the mouth and the head contour itself) using a generic facial landmarks detector from OpenVINO™ Open Model Zoo. However, the detected landmarks as-is are not enough to generate masks: this operation requires regions of interest on the face represented by closed contours, so some interpolation is applied to get them. This landmarks processing and interpolation is performed by the following kernel:
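A condensed sketch of this kernel (its G_API_OP declaration mirrors the one above; only the eyes and the face contour are shown here, and the landmark indices are illustrative):

```cpp
namespace custom {
using Contour = std::vector<cv::Point>;   // as noted below, a Contour is a vector of points

GAPI_OCV_KERNEL(GCPUGetContours, GGetContours)
{
    static void run(const std::vector<Contour> &vctPtsFaceElems, // 18 landmarks per face
                    const std::vector<Contour> &vctCntJaw,       // 17 jaw points per face
                    std::vector<Contour>       &vctElemsContours,
                    std::vector<Contour>       &vctFaceContours)
    {
        vctElemsContours.clear();
        vctFaceContours.clear();
        for (size_t i = 0; i < vctCntJaw.size(); i++)
        {
            const Contour &elems = vctPtsFaceElems[i];
            // Eyes: restore each eye's bottom side by a half-ellipse built on
            // the two corner landmarks (see getEyeEllipse() below):
            vctElemsContours.push_back(getEyeEllipse(elems[1], elems[0]));
            vctElemsContours.push_back(getEyeEllipse(elems[2], elems[3]));
            // (Eyebrows, the nose and the mouth are interpolated similarly.)
            // Face contour: the forehead half-ellipse plus the jaw points:
            Contour cntFace = getForeheadEllipse(vctCntJaw[i].front(),
                                                 vctCntJaw[i].back(),
                                                 vctCntJaw[i][8]);  // the lowest jaw point
            std::copy(vctCntJaw[i].begin(), vctCntJaw[i].end(),
                      std::back_inserter(cntFace));
            vctFaceContours.push_back(cntFace);
        }
    }
};
} // namespace custom
```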
The kernel takes two arrays of denormalized landmarks coordinates and returns an array of elements' closed contours and an array of faces' closed contours; in other words, the first output is an array of contours of image areas to be sharpened, and the second is an array of contours of areas to be smoothed.
Here and below, Contour is a vector of points.
Eye contours are estimated with the following function:
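A sketch of this function (the angle-delta literal is illustrative):

```cpp
// Approximates an eye contour with the bottom half-ellipse built on the two
// eye corner points.
inline Contour custom::getEyeEllipse(const cv::Point &ptLeft, const cv::Point &ptRight)
{
    Contour cntEyeBottom;
    const cv::Point ptEyeCenter((ptRight + ptLeft) / 2);
    const int angle = static_cast<int>(getLineInclinationAngleDegrees(ptLeft, ptRight));
    const int axisX = static_cast<int>(cv::norm(ptRight - ptLeft) / 2.0);
    // According to the assumption, the average eye width is 1/3 of its length:
    const int axisY = axisX / 3;
    // Take the bottom half of the ellipse (arc from 0 to 180 degrees):
    cv::ellipse2Poly(ptEyeCenter, cv::Size(axisX, axisY), angle, 0, 180,
                     /*angle delta*/ 1, cntEyeBottom);
    return cntEyeBottom;
}
```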
Briefly, this function restores the bottom side of an eye by a half-ellipse based on two points at the left and right eye corners. In fact, cv::ellipse2Poly() is used to approximate the eye region, and the function only defines the ellipse parameters based on just two points:
- the ellipse center and the \(X\) half-axis, calculated from the two eye corner points;
- the \(Y\) half-axis, calculated under the assumption that an average eye width is about \(1/3\) of its length;
- the start and the end angles, which are 0 and 180 (see the cv::ellipse() documentation);
- the angle delta: how many points to produce in the contour;
- the inclination angle of the axes.

The use of atan2() instead of just atan() in the function custom::getLineInclinationAngleDegrees() is essential: it takes the signs of both x and y into account, so we can get the right angle even in the case of an upside-down face arrangement (if we put the points in the right order, of course).
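A possible implementation of this helper (a sketch; the degree conversion is an assumption):

```cpp
// Returns the inclination angle (in degrees) of the line through two points.
// std::atan2() keeps the signs of both arguments, so the angle is correct even
// for an upside-down face arrangement.
inline float custom::getLineInclinationAngleDegrees(const cv::Point &ptLeft,
                                                    const cv::Point &ptRight)
{
    const cv::Point residual = ptRight - ptLeft;
    if (residual.y == 0 && residual.x == 0)
        return 0.0f;
    return static_cast<float>(std::atan2(static_cast<double>(residual.y),
                                         static_cast<double>(residual.x)) * 180.0 / CV_PI);
}
```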
The following function approximates the forehead contour:
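A sketch of this function, following the description below (helper names match the sketches above; the exact jaw-point choice in the real sample may differ):

```cpp
// Approximates the forehead with the upper half-ellipse built on three jaw points.
inline Contour custom::getForeheadEllipse(const cv::Point &ptJawLeft,
                                          const cv::Point &ptJawRight,
                                          const cv::Point &ptJawLower)
{
    Contour cntForehead;
    // The ellipse center is the middle between the left and right jaw points:
    const cv::Point ptFaceCenter((ptJawLeft + ptJawRight) / 2);
    const int angFace = static_cast<int>(getLineInclinationAngleDegrees(ptJawLeft, ptJawRight));
    // The forehead width is assumed to be equal to the jaw width:
    const int axisX = static_cast<int>(cv::norm(ptJawRight - ptJawLeft) / 2.0);
    // The forehead height is assumed to be about 2/3 of the jaw height:
    const int axisY = static_cast<int>(cv::norm(ptFaceCenter - ptJawLower) * 2.0 / 3.0);
    // Take the upper half of the ellipse (arc from 180 to 360 degrees):
    cv::ellipse2Poly(ptFaceCenter, cv::Size(axisX, axisY), angFace, 180, 360,
                     /*angle delta*/ 1, cntForehead);
    return cntForehead;
}
```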
As we have only jaw points among the detected landmarks, we have to get a half-ellipse based on three points of the jaw: the leftmost, the rightmost and the lowest one. The jaw width is assumed to be equal to the forehead width, and the latter is calculated using the left and right points. As for the \(Y\) axis, we have no points to derive it directly; instead, we assume that the forehead height is about \(2/3\) of the jaw height, which can be figured out from the face center (the midpoint between the left and right points) and the lowest jaw point.
When we have all the contours needed, we are able to draw masks:
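In G-API terms, the mask drawing could look like this (a condensed sketch; GFillPolyGContours stands for a custom operation wrapping cv::fillPoly(), and the constants are illustrative):

```cpp
const cv::Size kGKernelSize(5, 5);  // illustrative Gaussian kernel size
const double   kGSigma = 0.0;       // sigma derived from the kernel size

// "Sharp" mask: fill the element contours and blur them:
cv::GMat mskSharp  = custom::GFillPolyGContours::on(gimgIn, garElsConts);
cv::GMat mskSharpG = cv::gapi::gaussianBlur(mskSharp, kGKernelSize, kGSigma);

// "Bilateral" mask: fill the face contours, blur, and remove the "sharp" areas:
cv::GMat mskBlur      = custom::GFillPolyGContours::on(gimgIn, garFaceConts);
cv::GMat mskBlurG     = cv::gapi::gaussianBlur(mskBlur, kGKernelSize, kGSigma);
cv::GMat mskBlurFinal = mskBlurG - (mskBlurG & mskSharpG);

// Background mask: merge the two masks above, binarize and invert:
cv::GMat mskFacesGaussed = mskBlurFinal + mskSharpG;
cv::GMat mskFacesWhite   = cv::gapi::threshold(mskFacesGaussed, 0, 255, cv::THRESH_BINARY);
cv::GMat mskNoFaces      = cv::gapi::bitwise_not(mskFacesWhite);
```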
The steps to get the masks are:
- the "sharp" mask calculation:
  - fill the contours that should be sharpened;
  - blur that to get the "sharp" mask (mskSharpG);
- the "bilateral" mask calculation:
  - fill all the face contours fully;
  - blur that;
  - subtract the areas which intersect with the "sharp" mask to get the "bilateral" mask (mskBlurFinal);
- the background mask calculation:
  - add the two previous masks;
  - set all non-zero pixels of the result to 255 (by cv::gapi::threshold());
  - invert the output (by cv::gapi::bitwise_not) to get the background mask (mskNoFaces).
Once the graph is fully expressed, we can finally compile it and run it on real data. G-API graph compilation is the stage where the G-API framework actually understands which kernels and networks to use. This configuration happens via G-API compilation arguments.
This sample uses the OpenVINO™ Toolkit Inference Engine backend for DL inference, which is configured the following way:
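A sketch of this configuration (the path and device variables are placeholders, e.g. filled from the command line):

```cpp
#include <opencv2/gapi/infer/ie.hpp>

auto faceParams  = cv::gapi::ie::Params<custom::FaceDetector>
{
    faceXmlPath,   // path to the topology's IR (.xml)
    faceBinPath,   // path to the weights (.bin)
    faceDevice     // device to run on, e.g. "CPU"
};
auto landmParams = cv::gapi::ie::Params<custom::LandmDetector>
{
    landmXmlPath, landmBinPath, landmDevice
};
```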
Every cv::gapi::ie::Params<> object is related to the network specified in its template argument. We should pass there the network type we have defined with G_API_NET() at the very beginning of the tutorial.
Network parameters are then wrapped into a network package (a cv::gapi::GNetPackage):
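Here cv::gapi::networks() produces the package from the parameters defined above:

```cpp
auto networks = cv::gapi::networks(faceParams, landmParams);
```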
More details in "Face Analytics Pipeline" (Configuring the pipeline section).
In this example we use a lot of custom kernels; in addition, we use the Fluid backend to optimize memory consumption for G-API's standard kernels where applicable. The resulting kernel package is formed like this:
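A sketch of the package formation (the kernel names here follow the sketches above, plus illustrative ones for the remaining custom operations):

```cpp
auto customKernels = cv::gapi::kernels<custom::GCPUBilatFilter,       // illustrative
                                       custom::GCPULaplacian,         // illustrative
                                       custom::GCPUFillPolyGContours, // illustrative
                                       custom::GCPUFacePostProc,
                                       custom::GCPUGetContours>();
auto kernels = cv::gapi::combine(cv::gapi::core::fluid::kernels(),
                                 customKernels);
```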
G-API optimizes execution for video streams when compiled in the "Streaming" mode.
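For example, with the kernels and networks defined above (pipeline is the cv::GComputation built earlier):

```cpp
cv::GStreamingCompiled stream = pipeline.compileStreaming(
        cv::compile_args(kernels, networks));
```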
More on this in "Face Analytics Pipeline" (Configuring the pipeline section).
In order to run the G-API streaming pipeline, all we need is to specify the input video source, call cv::GStreamingCompiled::start(), and then fetch the pipeline processing results:
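A sketch of this loop (the video path is a placeholder):

```cpp
#include <opencv2/gapi/streaming/cap.hpp>

stream.setSource(cv::gapi::wip::make_src<cv::gapi::wip::GCaptureSource>(videoPath));
stream.start();

cv::Mat imgShow;
while (stream.pull(cv::gout(imgShow)))  // blocks until the next result is ready
{
    cv::imshow("Face Beautification", imgShow);
    if (cv::waitKey(1) >= 0) break;     // stop on any key
}
```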
Once results are ready and can be pulled from the pipeline, we display them on the screen and handle GUI events.
See Running the pipeline section in the "Face Analytics Pipeline" tutorial for more details.
The tutorial has two goals: to show the use of brand new features of G-API introduced in OpenCV 4.2, and to give a basic understanding of a sample face beautification algorithm.
The result of the algorithm application:
On the test machine (Intel® Core™ i7-8700), the G-API-optimized video pipeline outperforms its serial (non-pipelined) version by a factor of 2.7, meaning that for such a non-trivial graph, proper pipelining can bring an almost 3x increase in performance.
<!-- The general idea is to implement real-time video stream processing that detects faces and applies some filters to make them look beautiful (more or less). The pipeline is the following:
Two topologies from OMZ are used in this sample: face-detection-adas-0001 and facial-landmarks-35-adas-0002.
The face detector takes the input image and, after the inference, returns a blob with the shape [1,1,200,7] (200 is the maximum number of faces it can detect). In order to process every face individually, we need to convert this output to a list of regions on the image.
The masks for different filters are built based on facial landmarks, which are inferred for every face. The result of the inference is a blob with 35 landmarks: the first 18 of them are facial elements (eyes, eyebrows, a nose, a mouth) and the last 17 describe the jaw contour. Landmarks are floating-point coordinates normalized relative to the input ROI (not the original frame). In addition, for our further goals we need contours of eyes, mouths, faces, etc., not the landmarks themselves, so some post-processing of the Mat is also required here. The process is split into two parts: denormalization of the landmarks' coordinates to real pixel coordinates of the source frame, and construction of the necessary closed contours based on these coordinates.
The last step of processing the inference data is drawing masks using the calculated contours. In this demo the contours don't need to be pixel-accurate, since the masks are blurred with a Gaussian filter anyway. Another point worth mentioning is that we get three masks (for areas to be smoothed, for areas to be sharpened, and for the background) which have no intersections with each other; this approach allows us to apply the calculated masks to the corresponding images prepared beforehand and then simply sum them to get the output image without any other actions.
As we can see, this algorithm is well suited to illustrate the convenience and efficiency of G-API in the context of solving a real CV/DL problem.
(On detector post-proc) Some points to be mentioned about this kernel implementation:
- it takes a cv::Mat from the detector and a cv::Mat from the input; it returns an array of ROIs where faces have been detected;
- cv::Mat data parsing by a pointer to float is used here;
- the detector can return box coordinates slightly outside of the frame; an auxiliary rectangle borders covering the whole frame is created and then intersected with the face rectangle (by operator&()) to handle such cases and keep an ROI which is guaranteed to be inside the frame.

Data parsing after the facial landmarks detector happens according to the same scheme, with minor adjustments.
There are some points in the algorithm that could be improved.
The input of the facial landmarks detector is a square ROI, but the face detector generally gives non-square rectangles. If we let the backend within the inference API squeeze the rectangle to a square by itself, a loss of inference accuracy can be noticed in some cases. There is a solution: we can pass a square ROI that circumscribes the rectangular one to the landmarks detector, so there will be no need to squeeze the ROI, which leads to an accuracy improvement. Unfortunately, another problem occurs if we do that: if the rectangular ROI is near the border, the circumscribing square will probably go out of the frame, which leads to errors of the landmarks detector. To avoid such a mistake, we have to implement an algorithm that, firstly, circumscribes every rectangle with a square, then computes how far the squares extend beyond the frame and, finally, pads the source image with borders (e.g. single-colored) of the computed size. After that frame adjustment, it is safe to take square ROIs for the facial landmarks detector.
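A sketch of this idea (hypothetical, not part of the sample; names and the padding color are illustrative):

```cpp
// Make every face ROI square and pad the frame so that all squares fit inside it.
static std::vector<cv::Rect> makeSquareROIs(const std::vector<cv::Rect> &faces,
                                            cv::Mat &frame)
{
    std::vector<cv::Rect> squares;
    int pad = 0;
    for (const auto &r : faces)
    {
        // Circumscribe the rectangle with a square around its center:
        const int side = std::max(r.width, r.height);
        const cv::Point center(r.x + r.width / 2, r.y + r.height / 2);
        const cv::Rect sq(center.x - side / 2, center.y - side / 2, side, side);
        squares.push_back(sq);
        // Track the farthest coordinate outside of the frame:
        pad = std::max({pad, -sq.x, -sq.y,
                        sq.x + sq.width  - frame.cols,
                        sq.y + sq.height - frame.rows});
    }
    if (pad > 0)
    {
        // Pad the frame and shift the ROIs accordingly:
        cv::copyMakeBorder(frame, frame, pad, pad, pad, pad,
                           cv::BORDER_CONSTANT, cv::Scalar::all(0));
        for (auto &sq : squares) { sq.x += pad; sq.y += pad; }
    }
    return squares;
}
```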
-->