How Robots Perceive an Object and Estimate its Pose

Robots need a number of preparations to perform tasks in everyday spaces such as homes and offices. Unlike industrial robots that repeat fixed tasks, collaborative robots frequently encounter human contact and unexpected changes in their environment. It is therefore important for collaborative robots to perceive their surroundings and understand the current situation. This article introduces the robot vision technology used for object perception and pose estimation.

 

1. Research Outline 
 

AMBIDEX demo – object perception and dishwashing task execution with vision technology

For a robot, perceiving its surroundings and target objects means recovering information about objects in 3D space from 2D images captured with a camera. Understanding this 3D information amounts to estimating a 6 degrees of freedom (DoF) pose, made up of the object's 3DoF position and 3DoF orientation. Ultimately, object pose estimation can be defined as finding the most accurate relation between an object seen by the robot in a 2D image and its coordinates in real space.
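To make the 6DoF representation concrete, here is a minimal sketch (illustrative only, not NAVER LABS' implementation) that packs a 3DoF position and a 3DoF orientation into a single rigid-body transform; the pose values and the `make_pose` helper are hypothetical.

```python
# Minimal sketch: a 6DoF pose is 3DoF position + 3DoF orientation, packed here into
# a single 4x4 homogeneous transform T_cam_obj that maps points from object
# coordinates into camera coordinates.
import numpy as np
from scipy.spatial.transform import Rotation

def make_pose(xyz, rpy):
    """xyz: 3D position (m), rpy: roll/pitch/yaw orientation (rad)."""
    T = np.eye(4)
    T[:3, :3] = Rotation.from_euler("xyz", rpy).as_matrix()  # 3DoF orientation
    T[:3, 3] = xyz                                           # 3DoF position
    return T

# Example: a mug 0.5 m in front of the camera, rotated 90 deg about its z-axis
T_cam_obj = make_pose([0.0, 0.0, 0.5], [0.0, 0.0, np.pi / 2])
point_obj = np.array([0.05, 0.0, 0.0, 1.0])   # a point on the object, homogeneous
point_cam = T_cam_obj @ point_obj             # the same point in camera coordinates
```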

Figure 2. Pose estimation pipeline

The figure above shows the overall object pose estimation pipeline. The algorithms fall largely into two categories: image retrieval-based methods and keypoint detection-based methods. Both use deep learning and therefore require training data. However, open datasets containing 3D information are scarce, and collecting such data directly is quite difficult. To address this, NAVER LABS is researching dataset generation methods based on 3D models and simulation. Each step of the process is explained in detail below.
 

2. Training Dataset Generation 

Estimating an object's pose with deep learning requires two types of training labels. One is the mask, the object region segmented from the image, and the other is the object's 6DoF pose relative to the camera coordinate system. Deep learning also needs a vast amount of training data, yet generating mask and 6DoF pose annotations by hand takes considerable time and effort, and it is very difficult for a person to measure the 6DoF pose of an object shown in an image.

To solve this issue, NAVER LABS first created a precise 3D model of the object and generated training data through various simulation methods. Training dataset generation is divided into two approaches: synthetic training data generation, which renders images of the object in a virtual space, and real training data generation, which captures the object in real space.

1) Synthetic Training Data Generation
In this method, a 3D model of the object is first placed in a virtual space. Camera viewpoints are then generated so that every side of the object is covered, and an image is rendered from each viewpoint. Since both the camera pose and the object pose are known, the mask and 6DoF pose labels can be obtained easily. Depending on how closely the rendered images resemble real-world images, this approach is further divided into a simple rendering-based method and a 3D physics engine-based method.
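As a rough illustration of how such viewpoints and labels could be generated, the sketch below samples camera positions on a sphere around the model and computes the object's pose in each camera frame; the `look_at` helper, the sampling ranges, and the coordinate conventions are assumptions, not the actual simulator code.

```python
# Minimal sketch: sample camera viewpoints on a sphere around the object's 3D model
# and derive the ground-truth object pose in each camera frame. Because both world
# poses are chosen by us, the label comes "for free":
#   T_cam_obj = inv(T_world_cam) @ T_world_obj
import numpy as np

def look_at(cam_pos, target=np.zeros(3), up=np.array([0.0, 0.0, 1.0])):
    """Build a world-from-camera transform with the camera z-axis pointing at the target."""
    z = target - cam_pos; z /= np.linalg.norm(z)
    x = np.cross(up, z);  x /= np.linalg.norm(x)
    y = np.cross(z, x)
    T = np.eye(4)
    T[:3, :3] = np.stack([x, y, z], axis=1)
    T[:3, 3] = cam_pos
    return T

T_world_obj = np.eye(4)                       # object placed at the world origin
radius = 0.8                                  # camera distance in metres (assumed)
labels = []
for elev in np.linspace(0.2, 1.2, 5):         # elevation angles (rad)
    for azim in np.linspace(0, 2 * np.pi, 36, endpoint=False):
        cam_pos = radius * np.array([np.cos(azim) * np.cos(elev),
                                     np.sin(azim) * np.cos(elev),
                                     np.sin(elev)])
        T_world_cam = look_at(cam_pos)
        T_cam_obj = np.linalg.inv(T_world_cam) @ T_world_obj  # 6DoF pose label
        labels.append(T_cam_obj)
```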

a. Simple Rendering-Based Method
As shown in the figure below, the rendering-based method generates images of the object simply by varying the camera viewpoint, without modeling background or lighting. If necessary, a random background can be generated and inserted behind the object's mask area. Although this makes training data relatively easy to produce, additional data augmentation is required for successful training.

Figure 3. Rendering from various camera viewpoints
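A minimal sketch of the background-insertion step mentioned above, assuming the rendered image, its object mask, and a background image of the same size are already available; the function name is hypothetical.

```python
# Minimal sketch: paste a rendered object over a random background image using its
# mask, one way to fill the empty background behind the object.
import numpy as np

def composite(render_rgb, mask, background_rgb):
    """render_rgb, background_rgb: HxWx3 uint8 arrays; mask: HxW boolean array."""
    out = background_rgb.copy()
    out[mask] = render_rgb[mask]      # keep object pixels, replace everything else
    return out

# Usage with random noise as a stand-in background:
h, w = 480, 640
render = np.zeros((h, w, 3), np.uint8)          # rendered image (object on black)
mask = np.zeros((h, w), bool); mask[100:300, 200:400] = True
background = np.random.randint(0, 256, (h, w, 3), np.uint8)
train_image = composite(render, mask, background)
```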

b. Effective Training Data Generation via 3D Engine Simulation
The YCB-Video and LineMOD datasets are among the open datasets used to train deep neural networks for 6DoF object pose estimation. These datasets were built from sensor measurements captured in real environments. However, as the variety of target objects grows in type, size, weight, price, and so on, the time and effort needed for data capture and manual annotation increase sharply.

Figure 4. 3D synthetic training dataset generated with 3D physics engine

To overcome this, we built a simulator on top of Unreal Engine, an engine typically used for game development. By importing meshes and materials created with 3D modeling tools, the simulator constructs varied virtual environments with randomized object placement, textures, and lighting, and captures realistic RGB images. Because all spatial information in the 3D virtual environment is known, it can also automatically generate depth, segmentation, and 6DoF pose labels for every object.

Figure 5. Depth map, instance mask and surface normal map data derived from 3D engine simulation

Built on Unreal Engine and V-Ray, NAVER LABS' physics engine simulator can simulate the motion of objects under gravity while generating an unlimited number of random conditions, including different materials, lighting, textures, and positions and rotations, allowing users to obtain realistic images.
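Conceptually, this amounts to resampling the scene parameters for every rendered frame. The sketch below illustrates that idea only; the parameter names and ranges are invented for illustration and do not correspond to the actual Unreal Engine / V-Ray interface.

```python
# Minimal sketch of the domain-randomization idea: for every rendered frame, scene
# parameters are resampled so the network never sees the same material, lighting,
# or object placement twice. (Hypothetical parameters, not an engine API.)
import random

def sample_scene_params():
    return {
        "object_position":  [random.uniform(-0.3, 0.3),      # metres on the table
                             random.uniform(-0.3, 0.3),
                             random.uniform(0.2, 0.5)],      # dropped from this height
        "object_rotation":  [random.uniform(0, 360) for _ in range(3)],  # degrees
        "material_id":      random.randrange(100),            # pick from a material bank
        "light_intensity":  random.uniform(0.3, 3.0),
        "light_color_temp": random.uniform(3000, 8000),       # Kelvin
        "camera_distance":  random.uniform(0.5, 1.2),
    }

# One parameter set per frame; the simulator applies it, lets physics settle the
# object under gravity, then renders RGB, depth, masks, and 6DoF pose labels.
frame_params = [sample_scene_params() for _ in range(10000)]
```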

By applying image modifications such as rotation, flipping, noise injection, RGB channel adjustment, and affine transforms to the 3D engine simulator output with various image processing algorithms, we can further improve generalization.
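A minimal sketch of such an augmentation pass, using plain OpenCV and NumPy; the specific parameter ranges are assumptions, and in practice geometric transforms also require the corresponding adjustment of the 6DoF label.

```python
# Minimal sketch of the augmentations listed above. Geometric transforms are applied
# identically to the mask (with the pose label adjusted accordingly); photometric
# ones touch the RGB image alone.
import cv2
import numpy as np

def augment(image, mask):
    h, w = image.shape[:2]
    # Rotation / affine transform (applied to image and mask alike)
    angle = np.random.uniform(-15, 15)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale=1.0)
    image = cv2.warpAffine(image, M, (w, h))
    mask = cv2.warpAffine(mask.astype(np.uint8), M, (w, h)).astype(bool)
    # Horizontal flip
    if np.random.rand() < 0.5:
        image = cv2.flip(image, 1)
        mask = cv2.flip(mask.astype(np.uint8), 1).astype(bool)
    # Additive Gaussian noise
    image = np.clip(image + np.random.normal(0, 8, image.shape), 0, 255).astype(np.uint8)
    # RGB channel adjustment
    image = np.clip(image + np.random.uniform(-20, 20, size=3), 0, 255).astype(np.uint8)
    return image, mask
```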

Figure 6. NAVER LABS-customized dataset generation screen for robot grasping

 

2) Real Training Data Generation
Although a nearly unlimited amount of training data can be generated with the synthetic methods above, it is difficult to render images fully identical to reality, even with a 3D physics engine. Because of this so-called domain gap, training data captured in real environments is also required.

To collect real training data, we created a marker board consisting of multiple visual markers and measured the camera's 6DoF pose relative to the board's reference coordinate system. Mask and object 6DoF pose labels were then obtained by projecting the object's 3D model onto the real image.
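The projection step could look roughly like the sketch below, which assumes the camera pose with respect to the board (rvec/tvec), the measured object pose on the board, the camera intrinsics K, and the model vertices are already available; the convex-hull mask is a deliberately coarse stand-in for proper mesh rasterization.

```python
# Minimal sketch of the projection step: once the camera pose relative to the marker
# board is known (e.g. from solvePnP on detected marker corners) and the object's
# pose on the board has been measured, the 3D model can be projected into the real
# image to produce mask and pose labels automatically.
import cv2
import numpy as np

def project_model(model_points, T_board_obj, rvec_board, tvec_board, K, dist):
    """model_points: Nx3 vertices of the object's 3D model in object coordinates."""
    # Object points expressed in board coordinates
    pts_board = (T_board_obj[:3, :3] @ model_points.T).T + T_board_obj[:3, 3]
    # Project into the image using the camera pose w.r.t. the board
    img_pts, _ = cv2.projectPoints(pts_board, rvec_board, tvec_board, K, dist)
    return img_pts.reshape(-1, 2)

def mask_from_projection(img_pts, image_shape):
    """Rasterize the projected vertices' convex hull as a (coarse) object mask."""
    mask = np.zeros(image_shape[:2], np.uint8)
    hull = cv2.convexHull(img_pts.astype(np.int32))
    cv2.fillConvexPoly(mask, hull, 255)
    return mask > 0
```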

The figure below shows an example of real training data generated with this method. On the left is an image taken in a real environment, and on the right are the mask and object pose visualized as a bounding box.

Figure 7. Training data generation in real-environments


3. Object Pose Estimation

1) Instance Segmentation
Instance segmentation is the task of segmenting each individual object (instance) of pre-defined types (classes) in an image at the pixel level. It is the first stage in estimating each object's 6DoF pose. For robotic applications, instance segmentation must run in real time, so we used real-time algorithms and experimented with the YCB dataset, an open dataset containing multiple objects and their poses.

Several problems arose during the experiments. First, although the YCB dataset contains a variety of objects, it was difficult to use directly for instance segmentation training because no image contains multiple objects of the same type. In addition, when an untrained object appeared in the scene, the object of interest was sometimes misclassified or missed entirely due to occlusion. Differences in lighting and other conditions between the training data and the experiment environment also had a large effect on the recognition rate.

To overcome these limitations, we introduced additional variation into the training data. We placed random objects in front of and behind the object of interest, overlaid multiple objects of the same type, and added the resulting images to the training set. Signal-level perturbations such as blur and color changes were also applied. As a result, the gap in recognition rate between the training data and the actual experiment environment was significantly reduced.
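The following sketch illustrates the same-type overlay idea in its simplest form; the helper and its arguments are hypothetical, and a production pipeline would handle scaling, blending, and label bookkeeping more carefully.

```python
# Minimal sketch of the copy-paste idea above: paste several masked instances of the
# same object into one image so the network learns to separate identical instances
# and to handle partial occlusion.
import numpy as np

def paste_instances(background, object_rgb, object_mask, n_copies=3, rng=None):
    """Paste n_copies of a masked object crop at random offsets onto a background."""
    rng = rng or np.random.default_rng()
    canvas = background.copy()
    h, w = background.shape[:2]
    oh, ow = object_rgb.shape[:2]
    instance_masks = []
    for _ in range(n_copies):
        y = rng.integers(0, h - oh)
        x = rng.integers(0, w - ow)
        region = canvas[y:y + oh, x:x + ow]
        region[object_mask] = object_rgb[object_mask]     # later copies occlude earlier ones
        for prev in instance_masks:
            prev[y:y + oh, x:x + ow] &= ~object_mask      # remove occluded pixels from earlier masks
        inst = np.zeros((h, w), bool)
        inst[y:y + oh, x:x + ow] = object_mask
        instance_masks.append(inst)
    return canvas, instance_masks
```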

Figure 8. Comparison of before (left) and after (right) improvement

Figure 9. Our segmentation algorithm in motion 


2) Image Retrieval-Based Method
The next stage is to estimate an object's 3D position and orientation from the instance segmentation results. For this, we leveraged visual localization (VL) technology, in which an image is analyzed to estimate exactly where it was taken from; NAVER LABS' VL accuracy is at a world-class level. Although visual localization and object pose estimation appear unrelated, they are mathematically the same problem: both recover the relationship between the camera and the subject. The only difference is the data domain, i.e. whether the photo is of a place or of an object.

Figure 10. Image retrieval-based pose estimation algorithm

We implemented object pose estimation by adopting the image retrieval and local feature matching techniques used in visual localization. Image retrieval is the problem of selecting, from a dataset, the reference image most similar to the query image. We first generate a reference dataset by rendering images of the target object's 3D model from various directions. When an input image arrives, the algorithm retrieves the reference image with the most similar viewpoint, then performs local feature matching to find corresponding points between the input image and the retrieved reference. From these matched correspondences, the algorithm finally estimates the position and orientation of the object.
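A compact sketch of this retrieval-plus-matching pipeline is shown below. ORB features and a toy thumbnail descriptor stand in for the learned local and global descriptors a real system would use, and the reference entries are assumed to store, for each keypoint, the corresponding 3D point known from the rendered 3D model.

```python
# Minimal sketch: retrieve the most similar reference view, match local features,
# then recover the 6DoF pose with PnP. Each reference image is rendered from the 3D
# model, so every reference keypoint has a known 3D point on the object.
import cv2
import numpy as np

orb = cv2.ORB_create(1000)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def global_desc(image):
    # Toy global descriptor: a downsampled grayscale thumbnail.
    return cv2.resize(cv2.cvtColor(image, cv2.COLOR_BGR2GRAY), (16, 16)).ravel().astype(np.float32)

def retrieve(query, references):
    q = global_desc(query)
    dists = [np.linalg.norm(q - global_desc(r["image"])) for r in references]
    return references[int(np.argmin(dists))]           # most similar viewpoint

def estimate_pose(query, references, K, dist_coeffs=None):
    ref = retrieve(query, references)                   # ref: {"image", "desc", "points3d"}
    kp_q, desc_q = orb.detectAndCompute(query, None)
    matches = matcher.match(desc_q, ref["desc"])
    pts2d = np.float32([kp_q[m.queryIdx].pt for m in matches])
    pts3d = np.float32([ref["points3d"][m.trainIdx] for m in matches])
    # PnP + RANSAC recovers the object's 6DoF pose in the camera frame
    ok, rvec, tvec, _ = cv2.solvePnPRansac(pts3d, pts2d, K, dist_coeffs)
    return (rvec, tvec) if ok else None
```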

Figure 11. Object pose estimation algorithm demonstration video


This is a demonstration video of the completed algorithm. The two images at the top are the input RGB image and depth map. The bottom left shows the instance segmentation result. Using the segmentation and the input data, 6DoF pose estimation is performed, shown on the bottom right, where the estimated 6DoF pose is drawn as a 3D bounding box.

3) Keypoint Estimation-Based Method of 6DoF Object Pose Estimation
Another way to estimate the 6DoF pose is a keypoint-based approach. Given only an image as real-time input, a neural network predicts which object class each pixel belongs to (semantic segmentation) and the 2D locations of a predefined set of keypoints on the object. The 6DoF pose is then estimated with a perspective-n-point (PnP) algorithm that relates these 2D keypoints to their corresponding 3D points on the object model.
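The final PnP step can be sketched as follows, assuming the network has already predicted the 2D locations of a fixed set of keypoints; here the 3D keypoints are taken, purely for illustration, to be the eight corners of the model's bounding box.

```python
# Minimal sketch of the keypoint-to-pose step (network details omitted; the 3D
# keypoints are assumed to be the eight corners of the object model's bounding box).
import cv2
import numpy as np

# 3D keypoints defined once on the object model, in object coordinates (metres)
w, h, d = 0.10, 0.08, 0.06
keypoints_3d = np.float32([[x, y, z] for x in (-w/2, w/2)
                                     for y in (-h/2, h/2)
                                     for z in (-d/2, d/2)])

def pose_from_keypoints(keypoints_2d, K):
    """keypoints_2d: 8x2 pixel locations predicted by the network, same ordering."""
    ok, rvec, tvec = cv2.solvePnP(keypoints_3d, np.float32(keypoints_2d), K, None,
                                  flags=cv2.SOLVEPNP_EPNP)
    return (rvec, tvec) if ok else None
```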

However, when the network was trained only on images of objects rendered on black backgrounds, results on real data were unsatisfactory because an adequate segmentation mask could not be obtained. We therefore synthesized backgrounds from the COCO dataset, which allowed the semantic segmentation module to train successfully and produced reasonable results on real data as well.

Figure 12. Training data with black background (top), synthetic background (center), and enhanced color jitter (bottom);
training data (leftmost column) and pose estimation results (remaining columns)

4) Pose Refinement
When a robot makes contact with an object, the object's pose inevitably changes. In addition, both the image retrieval-based and keypoint-based estimators can produce errors when local feature matching or keypoint detection is inaccurate. The left side of the figure below shows an inaccurate estimate visualized on the input image; refining the estimated pose, as shown on the right, is therefore necessary for more accurate results.
 

Figure 13. Example of before (left) and after (right) pose refinement


The pose estimated with the methods above is used as the initial value, and random noise is added to it within a certain range. Using the same rendering approach as in simple rendering-based training data generation, a virtual image of the object's 3D model is rendered at each perturbed pose. Edges are then extracted from both the input image and the virtual image; when the edges from the virtual image align with those from the input image, the corresponding pose is taken as the refined pose. The video below shows pose refinement using the Markov chain Monte Carlo (MCMC) method.
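A simplified sketch of this refinement loop is given below, written as a Metropolis-Hastings-style random walk over the pose; `render_edges`, the perturbation scale, and the acceptance temperature are placeholders rather than the actual implementation.

```python
# Minimal sketch of MCMC-style pose refinement: perturb the pose, render the model
# at the perturbed pose, and keep poses whose rendered edges better match the edges
# of the input image.
import cv2
import numpy as np

def edge_score(input_edges, rendered_edges):
    """Fraction of rendered edge pixels that land on an input edge pixel."""
    overlap = np.logical_and(input_edges > 0, rendered_edges > 0).sum()
    return overlap / max((rendered_edges > 0).sum(), 1)

def refine_pose(initial_pose, input_image, render_edges, n_iters=500, sigma=0.01, rng=None):
    rng = rng or np.random.default_rng()
    gray = cv2.cvtColor(input_image, cv2.COLOR_BGR2GRAY)
    input_edges = cv2.Canny(gray, 50, 150)
    pose = np.asarray(initial_pose, float)           # e.g. 6-vector: rvec + tvec
    score = edge_score(input_edges, render_edges(pose))
    for _ in range(n_iters):
        candidate = pose + rng.normal(0, sigma, size=pose.shape)   # random perturbation
        cand_score = edge_score(input_edges, render_edges(candidate))
        # Accept better candidates always, worse ones with some probability (MCMC step)
        if cand_score > score or rng.random() < np.exp((cand_score - score) / 0.05):
            pose, score = candidate, cand_score
        # (A real implementation would anneal sigma and the temperature over iterations.)
    return pose
```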

Figure 14. Example of pose refinement algorithm implementation, with MCMC method
 

4. Conclusion

NAVER LABS is advancing vision technology with the aim of providing robot services that can interact with dynamically changing external environments. Perception in three-dimensional space in particular is an area of active research, and its performance continues to improve with the development of deep learning-based techniques. NAVER LABS is also continuously expanding its vision research, for example into tighter integration with deep learning algorithms and perception of unknown or asymmetric objects. We welcome anyone who wishes to join us in this exciting challenge.
 
