From depth images to complete point clouds
By Rafael Heid, Aleksandra Kornivetc, Angelie Kraft, and Stefanie Stoppel
About us
We are a subgroup of the Computer Vision Master Project 2019/20 and recently completed our work, which we are glad to present at the EXPO.
Introduction & Motivation
Imagine you own a household robot that cleans, cooks, and runs errands for you. When you tell it to fetch your favorite coffee mug, the robot not only needs to know what that mug looks like in order to locate it in your apartment, but also has to anticipate its full 3D shape in order to grasp it and bring it to you. We humans have no problem identifying and grabbing a mug, since we know from experience what a "typical" mug looks like from all sides and can therefore interact with it easily. To transfer some of these abilities to robotic grasping, we created an end-to-end pipeline that takes an RGB-D image of an object as input and infers its complete 3D representation in the form of a point cloud.
Figure 1: Example of the intermediate and end results of our pipeline for the power drill. From left to right: a) shows the segmentation result for all objects, b) displays the partial point cloud created from the segmentation mask and depth information, c) shows the completed point cloud and d) shows the ground truth point cloud for comparison.
Dataset
Our work is based on the YCB-Video dataset, which consists of 92 videos and 320,000 synthetic images of household objects captured with an RGB-D camera. The featured objects are commonly used in robotic grasping tasks and were hence a good fit for our work. We chose four objects from the dataset for our project: the banana, the power drill, the scissors, and the bleach cleanser bottle.
Deep Neural Networks
Figure 2: Visualization of inputs and outputs of the two neural networks.
In order to infer a completed point cloud representation from an RGB-D image, our system needs to tackle two tasks:
- Locating and extracting the pixel coordinates of a specific object in an RGB image. Combining these segmented pixels with the depth information yields a partial point cloud of the object of interest (see the sketch after this list). This task is handled by the first network: the Vanilla SegNet (VSN).
- Completing the partial point cloud into a full, 360-degree shape, which we call a completed point cloud. Our second network, the Morphing and Sampling Network (MSN), is responsible for this part of the pipeline.
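To make the hand-off between the two tasks concrete, here is a minimal sketch of how segmented depth pixels can be back-projected into a partial point cloud using the pinhole camera model and Open3D. The function name, the intrinsics (fx, fy, cx, cy), and the depth scale are illustrative assumptions rather than our exact implementation.

```python
import numpy as np
import open3d as o3d

def mask_to_partial_cloud(depth, mask, fx, fy, cx, cy, depth_scale=10000.0):
    """Back-project the depth pixels selected by a binary object mask
    into a partial point cloud in camera coordinates (hypothetical helper)."""
    v, u = np.nonzero(mask)                          # pixel rows / columns of the object
    z = depth[v, u].astype(np.float32) / depth_scale
    valid = z > 0                                    # drop missing depth readings
    u, v, z = u[valid], v[valid], z[valid]
    x = (u - cx) * z / fx                            # pinhole camera model
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=1).astype(np.float64)

    cloud = o3d.geometry.PointCloud()
    cloud.points = o3d.utility.Vector3dVector(points)
    return cloud
```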
End-to-end pipeline
Figure 3: Visualization of our end-to-end pipeline and how our three system components interact.
Our system consists of a pipeline with three main components, each containerized using Docker. This makes the pipeline portable across platforms and removes the need to install additional dependencies (such as specific CUDA versions or Python packages) on the machine running it, since everything is packaged together with the individual components.
All components read from and write to the local file system and register watchers on the same shared directory. This way, each process is triggered automatically by file changes produced by the components earlier in the pipeline.
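The snippet below sketches how such a trigger can be set up with the watchdog package; the watched directory, file suffix, and handler logic are assumptions for illustration, not our actual component code.

```python
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

WATCH_DIR = "/shared/pipeline"  # directory mounted into every container (assumed)

class PartialCloudHandler(FileSystemEventHandler):
    def on_created(self, event):
        # React only to new partial point clouds written by the previous stage.
        if event.is_directory or not event.src_path.endswith("_partial.ply"):
            return
        print(f"New partial cloud detected: {event.src_path}")
        # ... trigger the next pipeline stage here ...

observer = Observer()
observer.schedule(PartialCloudHandler(), WATCH_DIR, recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()
```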
1. Web Component
This component features a RESTful web client for uploading YCB-Video frames and specifying the object one wants to complete. Upon successful completion, the web interface displays the partial and completed point clouds in separate views.
Technologies: Flask framework, Vue.js, three.js, Docker
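For illustration, a minimal Flask endpoint for such an upload could look like the sketch below; the route name, form fields, and target directory are assumed for the example and do not necessarily match our implementation.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)
UPLOAD_DIR = "/shared/pipeline"  # watched directory shared with the other components (assumed)

@app.route("/upload", methods=["POST"])  # hypothetical route
def upload_frame():
    # Expect an RGB frame, the matching depth frame, and the name of the object to complete.
    rgb = request.files["rgb"]
    depth = request.files["depth"]
    target_object = request.form["object"]

    rgb.save(f"{UPLOAD_DIR}/{target_object}_rgb.png")
    depth.save(f"{UPLOAD_DIR}/{target_object}_depth.png")
    return jsonify({"status": "queued", "object": target_object})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```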
2. VSN Component
The second component contains the Vanilla SegNet (VSN) neural network model for semantic segmentation and its dependencies.
Technologies: PyTorch, Open3D, Docker
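In essence, the component loads the trained model and turns its per-pixel class scores into a binary mask for the requested object. The sketch below assumes a generic PyTorch segmentation model and a hypothetical class index; it is not the exact VSN code.

```python
import numpy as np
import torch

def segment_object(model, rgb, class_id, device="cuda"):
    """Run a trained segmentation model on an HxWx3 uint8 RGB image and
    return a binary mask for the object with the given class index."""
    model.eval()
    x = torch.from_numpy(rgb).float().permute(2, 0, 1).unsqueeze(0) / 255.0
    with torch.no_grad():
        logits = model(x.to(device))                  # shape: (1, num_classes, H, W)
    labels = logits.argmax(dim=1).squeeze(0).cpu().numpy()
    return (labels == class_id).astype(np.uint8)      # 1 = object pixel, 0 = background
```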
3. MSN Component
The last component contains the Morphing and Sampling Network (MSN) for point cloud completion and all of its dependencies.
Technologies: PyTorch, Open3D, Docker
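Conceptually, this component reads the partial point cloud produced by the VSN component, runs it through the trained MSN, and writes the completed cloud back to the shared directory. The following sketch assumes a hypothetical `msn` model object, a fixed input size, and an output of shape (1, M, 3); it is not our exact code.

```python
import numpy as np
import open3d as o3d
import torch

def complete_cloud(msn, partial_path, output_path, n_points=5000, device="cuda"):
    """Complete a partial point cloud with a (hypothetical) trained MSN model."""
    partial = o3d.io.read_point_cloud(partial_path)
    pts = np.asarray(partial.points, dtype=np.float32)

    # Resample to the fixed input size the network expects (assumed).
    idx = np.random.choice(len(pts), n_points, replace=len(pts) < n_points)
    x = torch.from_numpy(pts[idx]).unsqueeze(0).to(device)   # (1, n_points, 3)

    msn.eval()
    with torch.no_grad():
        completed = msn(x)                                    # (1, M, 3), assumed output shape

    out = o3d.geometry.PointCloud()
    out.points = o3d.utility.Vector3dVector(
        completed.squeeze(0).cpu().numpy().astype(np.float64))
    o3d.io.write_point_cloud(output_path, out)
```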
Results
Figure 4 below shows example outputs of the networks for each of the objects. In each row, b) is the output of the semantic segmentation network for input a); the differently colored areas indicate which pixels were assigned to the same class. As you can see, the silhouettes appear visually accurate and detailed. c) shows the incomplete point cloud extracted from the original RGB-D image with the help of the object mask; at this point, the point cloud is not only incomplete but also noisy. Nevertheless, the completed version in d) looks sound and comes close to the ground truth in e).
To evaluate the quality of the networks, we used common metrics that quantify the difference between a network's output and the ground truth: IoU (Intersection over Union, higher is better) for the semantic segmentation task and EMD (Earth Mover's Distance, lower is better) for the point cloud completion task. Please refer to our project report, referenced below, for more details on the quantitative analysis.
Figure 4: Exemplary outputs for each object on input images that the networks had not seen before. a) RGB input. b) Semantic segmentation mask. c) Incomplete point cloud. d) Completed point cloud. e) Ground truth point cloud.
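For intuition, the IoU of a predicted segmentation mask against the ground truth can be computed as in the generic sketch below; this is not our evaluation script, and the EMD for the completion task is typically computed with an approximate solver.

```python
import numpy as np

def iou(pred_mask, gt_mask):
    """Intersection over Union of two binary masks (1 = object, 0 = background)."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return intersection / union if union > 0 else 1.0
```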
References
- If you’d like to read about our work in-depth, please refer to our project report.
- Find our repository here if you want to try out the system yourself. Note that you need a GPU to run it.
- (Vanilla) SegNet paper by Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla.
- Morphing and Sampling Network paper by Minghua Liu, Lu Sheng, Sheng Yang, Jing Shao, and Shi-Min Hu.