Deep Dexterous Grasping of Novel Objects from a
Single View
Umit Rusen Aktas1, Chao Zhao1, Marek Kopicki1, Ales Leonardis1 and Jeremy L. Wyatt1
Abstract—Dexterous grasping of a novel object given a single
view is an open problem. This paper makes several contributions
to its solution. First, we present a simulator for generating and
testing dexterous grasps. Second, we present a data set, generated
by this simulator, of 2.4 million simulated dexterous grasps of
variations of 294 base objects drawn from 20 categories. Third,
we present a basic architecture for generation and evaluation of
dexterous grasps that may be trained in a supervised manner.
Fourth, we present three different evaluative architectures, employing ResNet-50 or VGG-16 as their visual backbone. Fifth, we
train and evaluate seventeen variants of generative-evaluative
architectures on this simulated data set, showing improvement
from 69.53% grasp success rate to 90.49%. Finally, we present a
real robot implementation and evaluate the four most promising
variants, executing 196 real robot grasps in total. We show that
our best architectural variant achieves a grasp success rate of
87.8% on real novel objects seen from a single view, improving
on a baseline of 57.1%.
Index Terms—Deep learning, generative-evaluative learning,
grasping.
I. INTRODUCTION
If robots are to be widely deployed in human populated
environments then they must deal with unfamiliar situations.
An example is the case of grasping and manipulation. Humans
grasp and manipulate hundreds of objects each day, many
of which are previously unseen. Yet humans are able to
dexterously grasp these novel objects with a rich variety of
grasps. In addition, we do so from only a single, brief view
of each object. To operate in our world, dexterous robots must
replicate this ability.
This is the motivation for the problem tackled in this paper,
which is planning of (i) a dexterous grasp, (ii) for a novel
object, (iii) given a single view of that object. We define
dexterous as meaning that the robot employs a variety of dexterous grasp types across a set of objects. The combination of
constraints (i)-(iii) makes grasp planning hard because surface
reconstruction will be partial, yet this cannot be compensated
for by estimating pose for a known object model. The novelty
of the object, together with incomplete surface reconstruction,
and uncertainty about object mass and coefficients of friction,
renders infeasible the use of grasp planners which employ
classical mechanics to predict grasp quality. Instead, we must
employ a learning approach.
This in turn raises the question as to how we architect the
learner. Grasp planning comprises two problems: generation
and evaluation. Candidate grasps must first be generated
*This work was primarily supported by FP7-ICT-600918
1 University of Birmingham, School of Computer Science, UK.
jeremy.l.wyatt@gmail.com
[Fig. 1 graphic: Scene Depth Image; Learned Generative Model (grasps ranked by generative model likelihood); Learned Evaluative Model (grasps ranked by predicted success probability); Execution of Top Grasp.]
Fig. 1: The basic architecture of a generative-evaluative
learner. When shown a novel object the learned generative
model (GM) produces many grasps according to its likelihood
model. These are then each evaluated by a learned evaluative
model (EM), which predicts the probability of grasp success.
The grasps are then re-ranked according to the predicted
success probability and the top ranked grasp is executed.
according to some distribution conditioned on sensed data.
Then each candidate grasp must be evaluated, so as to produce
a grasp quality measure (e.g., maximum resistible wrench),
the probability of grasp success, the likely in-hand slip or
rotation, and so on. These measures are then used to rank grasps
so as to select one to execute. Either the generative model, the
evaluative model, or both may be learned. If only a generative model is
learned then evaluation must be carried out using mechanically
informed reasoning, which, as we noted, cannot easily be
applied to the case of novel objects seen from a single view.
If only an evaluative model is learned then grasp generation
must proceed by search. This is challenging for true dexterous
grasping as the hand may have between nine and twenty
actuated degrees of freedom. Thus, for dexterous grasping of
novel objects from a single view, it becomes appealing to learn
both the generative and the evaluative model.
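To make this generate-then-rank pipeline concrete, the following Python sketch shows one way it could be organised. It is an illustration only: the GenerativeModel and EvaluativeModel interfaces and the Grasp fields are assumptions, not the paper's implementation.

# Illustrative sketch (not the paper's implementation): a generative model
# proposes grasps for a depth image and an evaluative model re-ranks them.
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class Grasp:
    wrist_pose: np.ndarray      # 6-DoF wrist pose (position + orientation)
    hand_config: np.ndarray     # joint angles of the dexterous hand
    likelihood: float = 0.0     # score under the generative model
    p_success: float = 0.0      # success probability from the evaluative model


def plan_grasp(depth_image: np.ndarray,
               generative_model,            # hypothetical: samples grasps given sensed data
               evaluative_model,            # hypothetical: predicts success probability
               n_candidates: int = 100) -> Grasp:
    """Generate candidates, score them with the evaluative model, return the best."""
    candidates: List[Grasp] = generative_model.sample(depth_image, n_candidates)
    for g in candidates:
        g.p_success = evaluative_model.predict(depth_image, g)
    # Re-rank by predicted success probability rather than generative likelihood.
    return max(candidates, key=lambda g: g.p_success)

The design point illustrated is that the generative model supplies a prior over plausible dexterous grasps, so the evaluative model never has to score arbitrary points in the high-dimensional hand configuration space.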
The contributions of this paper are as follows. First, we
present a data-set of 2.4 million dexterous grasps in simulation
that may be used to evaluate dexterous grasping algorithms.
Second, we release the source code of the dexterous grasp
simulator, which can be used to visualise the dataset and
gather new data.1 Third, we present a generative-evaluative
architecture that combines data efficient learning of the generative model with data intensive learning in simulation of an
evaluative model. Fourth, we present multiple variations of the
evaluative model. Fifth, we present an extensive evaluation of
all these models on our simulated data set. Finally, we compare
the two most promising variants on a real robot with a data-set
of objects in challenging poses.
The model variants are organised in three dimensions. First,
we employ two different generative models (GM1 [1] and
GM2 [2]), one of which (GM2) is designed specifically for
single view grasping. Second, we use two different backbones
for the evaluative model, VGG-16 and ResNet-50. Third, we
experiment with two optimisation techniques, gradient ascent
(GA) and stochastic annealing (SA), to search for better grasps
using the evaluative model as an objective function.
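As a rough illustration of the second kind of search, the sketch below refines a grasp parameter vector by an annealing-style local search that treats the evaluative model's output as the objective. The em_score interface, the cooling schedule, and the perturbation scale are assumptions for illustration, not the settings used in the paper.

# Minimal annealing-style refinement sketch, assuming the evaluative model
# exposes a scalar objective em_score(depth_image, grasp_vector).
import numpy as np


def anneal_grasp(depth_image, grasp_vec, em_score,
                 steps=200, init_temp=1.0, sigma=0.02, seed=0):
    rng = np.random.default_rng(seed)
    best = current = np.asarray(grasp_vec, dtype=float)
    best_score = current_score = em_score(depth_image, current)
    for t in range(steps):
        temp = init_temp * (1.0 - t / steps)                 # linear cooling schedule
        proposal = current + rng.normal(0.0, sigma, size=current.shape)
        score = em_score(depth_image, proposal)
        # Always accept improvements; accept worse proposals with a probability
        # that shrinks as the temperature decreases.
        if score > current_score or rng.random() < np.exp((score - current_score) / max(temp, 1e-8)):
            current, current_score = proposal, score
            if score > best_score:
                best, best_score = proposal, score
    return best, best_score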
The paper is structured as follows. First, we discuss related
work. Second, the basic generative model is described in
detail and the main features of the extended generative model
are sketched. Third, we describe the design of the grasp
simulation and the generation of the data set. Fourth, we describe
the different architectures employed for the evaluative model.
Fifth, we describe the evaluative model training, the optimisation variants for the evaluative model and the simulated
experimental study. Finally, we present the real robot study.
II. BACKGROUND AND RELATED WORK
There are four broad approaches to grasp planning. First,
we may employ analytic mechanics to evaluate grasp quality.
Second, we may engineer a mapping from sensing to grasp.
Third, we may learn this mapping, such as learning a generative model. Fourth, we may learn a mapping from sensing and
a grasp to a grasp success prediction. See [3] and [4] for recent
reviews of data driven and analytic methods respectively.
Analytic approaches use mechanical models to predict grasp
outcome [5], [6], [7], [8]. This requires models of both object
(mass, mass distribution, shape, and surface friction) and
manipulator (kinematics, exertable forces and torques). Several
grasp quality metrics can be defined using these [9], [10],
[11] under a variety of mechanical assumptions. These have
been applied to dexterous grasp planning [12], [13], [14], [15],
[16], [17]. The main drawback of analytic approaches is that
estimation of object properties is hard. Even a small error in
estimated shape, friction or mass will render a grasp unstable
[18]. There is also evidence that grasp quality metrics are not
well correlated with actual grasp success [19], [20], [21].
An alternative is learning for robot grasping, which has
made steady progress. There are probabilistic machine learning
techniques employed for surface estimation for grasping [22];
data efficient methods for learning dexterous grasps from
demonstration [23], [1], [24]; logistic regression for classifying
grasp features from images [25]; extracting generalisable parts
for grasping [26] and for autonomous grasp learning [27].
1The code and simulated grasp dataset are available at
https://rusen.github.io/DDG. The web page explains how to download
the dataset, install the physics simulator and re-run the grasps in simulation.
The simulator acts as a client alongside a simple web server to gather new
grasp data in a distributed setup.
Deep learning is a recent approach to grasping. Most work is
for two finger grippers. Approaches either learn an evaluation
function for an image-grasp pair [28], [29], [30], [31], [32],
[33], learn to predict the grasp parameters [34], [35] or jointly
estimate both [36]. The quantity of real training grasps can be
reduced by mixing real and simulated data [37].
A small number of papers have explored deep learning as
a method for dexterous grasping [43], [44], [45], [42], [41].
All of these use simulation to generate the training set for
learning. Kappler [41] showed the ability of a CNN to predict
grasp quality for multi-fingered grasps, but uses complete point
clouds as object models and only varies the wrist pose for the
pre-grasp position, leaving the finger configurations the same.
Varley [44] and later Zhou [42] went beyond this by varying
the hand pre-shape, and predicting from a single image of the
scene. Each of these posed search for the grasp as a pure
optimisation problem (using simulated annealing or quasi-Newton methods) on the output of the CNN. They also take
the approach of learning an evaluative model, and generate
candidates for evaluation uninfluenced by prior knowledge.
Veres [45], in contrast, learns a deep generative model. Finally,
Lu [43] learns an evaluative model, and then, given an input
image, optimises the inputs that describe the wrist pose and
hand pre-shape to this model via gradient ascent, but does not
learn a generative model. In addition, the grasps start with
a heuristic grasp which is varied within a limited envelope.
Of the papers on dexterous grasp learning with deep networks
only two approaches [44], [43] have been tested on real grasps,
with eight and five test objects each, producing success rates
of 75% and 84% respectively. A key restriction of both of
these methods is that they only plan the pre-grasp, not the
finger-surface contacts, and are thus limited to power-grasps.
Thus, in each case, either an evaluative model is learned but
there is no learned prior over the grasp configuration able to be
employed as a generative model; or a generative grasp model
is learned, but there is no evaluative model learned to select
the grasp. Our technical novelty is thus to bring together a
data-efficient method of learning a good generative model with
an evaluative model. As with others, we learn the evaluative
model from simulation, but the generative model is learned
from a small number of demonstrated grasps. Table I compares
the properties of the learning methods reviewed above against
this paper. Most works concern pinch grasping. Of the eight
papers on learning methods for dexterous grasping, two [44],
[43] are limited to power grasps. Of the remaining six, three
have no real robot results [45], [42], [41]. Of the remaining
three, two we directly build on here, the third being an extension
of one of those grasp methods with active vision. Finally,
our real robot evaluation is extensive in comparison with
competitor works on dexterous grasping, comprising 196 real
grasps of 40 different objects.
III. DATA EFFICIENT LEARNING OF A GENERATIVE GRASP
MODEL FROM DEMONSTRATION
This section describes the generative model learning upon
which the paper builds. We employ two related grasp generation techniques [1], [2], which both learn a generative
Columns: References; grasp type (2-fing., >2-finger power, >2-finger dexterous); Robot results; Clutter; Model free; Novel objects.
[26], [25], [27], [29], [31], [33], [38]: ✓ ✓ ✓ ✓
[32], [37], [39], [40]: ✓ ✓ ✓ ✓ ✓
[41]: ✓ ✓
[42]: ✓ ✓ ✓
[43], [44]: ✓ ✓ ✓ ✓
[23]: ✓ ✓ ✓
[45], [42], [41]: ✓ ✓ ✓
[1], [2], [24]: ✓ ✓ ✓ ✓
[46]: ✓ ✓ ✓ ✓ ✓
This paper: ✓ ✓ ✓ ✓
TABLE I: Qualitative comparison of grasp learning methods (✓ marks that a method has the corresponding property).
model of a dexterous grasp from a demonstration (LfD). Those
papers both posed the problem as one of learning a factored
probabilistic model from a single example. The method is split
into a model learning phase, a model transfer phase, and the
grasp generation phase.
A. Model learning
The model learning is split into three parts: acquiring an
object model; using this object model, with a demonstrated
grasp, to build a contact model for each finger link in contact
with the object; and acquiring a hand configuration model
from the demonstrated grasp. After learning, the object model
can be discarded.
1) Object model: First, a point cloud of the object used
for the demonstrated grasp is acquired by a depth camera,
from several views. Each point is augmented with the estimated
principal curvatures at that point and a surface normal.
Thus, the $j$th point in the cloud gives rise to a feature
$x_j = (p_j, q_j, r_j)$, with the components being its position
$p_j \in \mathbb{R}^3$, orientation $q_j \in SO(3)$ and principal curvatures
$r_j = (r_{j,1}, r_{j,2}) \in \mathbb{R}^2$. The orientation $q_j$ is defined by
$k_{j,1}, k_{j,2}$, which are the directions of the principal curvatures.
For later convenience we use $v = (p, q)$ to denote position
and orientation combined. These features $x_j$ allow the object
model to be defined as a kernel density estimate of the joint
density over $v$ and $r$:
$$O(v, r) \equiv \mathrm{pdf}_O(v, r) \simeq \sum_{j=1}^{K_O} w_j K(v, r \mid x_j, \sigma_x) \tag{1}$$
where $O$ is short for $\mathrm{pdf}_O$, the bandwidth is $\sigma_x = (\sigma_p, \sigma_q, \sigma_r)$, $K_O$
is the number of features $x_j$ in the object model, all weights
are equal, $w_j = 1/K_O$, and $K$ is defined as a product:
$$K(x \mid \mu, \sigma) = \mathcal{N}_3(p \mid \mu_p, \sigma_p)\,\Theta(q \mid \mu_q, \sigma_q)\,\mathcal{N}_2(r \mid \mu_r, \sigma_r) \tag{2}$$
where $\mu$ is the kernel mean point, $\sigma$ is the kernel bandwidth,
$\mathcal{N}_n$ is an $n$-variate isotropic Gaussian kernel, and $\Theta$ corresponds
to a pair of antipodal von Mises-Fisher distributions.
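A simplified Python sketch of evaluating the density in Eqs. 1 and 2 is given below. It assumes orientations are unit quaternions, uses a plain antipodally symmetric exponential kernel as a stand-in for the pair of von Mises-Fisher distributions, and all bandwidths are illustrative values, not those of the paper.

# Simplified sketch of the object-model kernel density O(v, r) in Eqs. (1)-(2).
import numpy as np


def position_kernel(p, mu_p, sigma_p):
    # 3-variate isotropic Gaussian (unnormalised)
    return np.exp(-0.5 * np.sum((p - mu_p) ** 2) / sigma_p ** 2)


def orientation_kernel(q, mu_q, kappa):
    # Antipodal symmetry: q and -q describe the same orientation.
    dot = abs(float(np.dot(q, mu_q)))
    return np.exp(kappa * (dot - 1.0))


def curvature_kernel(r, mu_r, sigma_r):
    # 2-variate isotropic Gaussian over principal curvatures (unnormalised)
    return np.exp(-0.5 * np.sum((r - mu_r) ** 2) / sigma_r ** 2)


def object_density(v, r, features, sigma_p=0.01, kappa=50.0, sigma_r=0.05):
    """Evaluate O(v, r) at query pose v = (p, q) and curvatures r."""
    p, q = v
    total = 0.0
    for (p_j, q_j, r_j) in features:          # surface features x_j from the point cloud
        total += (position_kernel(p, p_j, sigma_p)
                  * orientation_kernel(q, q_j, kappa)
                  * curvature_kernel(r, r_j, sigma_r))
    return total / len(features)              # equal weights w_j = 1 / K_O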
2) Contact models: When a grasp is demonstrated the final
hand pose is recorded. This is used to find all the finger links
$L$ and surface features $x_j$ that are in close proximity. A contact
model $M_i$ is built for each finger link $i$. Each feature in the
object model that is within some distance $\delta_i$ of finger link $L_i$
contributes to the contact model $M_i$ for that link. This contact
model is defined for finger link $i$ as follows:
$$M_i(u, r) \equiv \mathrm{pdf}_{M_i}(u, r) \simeq \frac{1}{Z} \sum_{j=1}^{K_{M_i}} w_{ij} K(u, r \mid x_j, \sigma_x) \tag{3}$$
where $u$ is the pose of $L_i$ relative to the pose $v_j$ of the $j$th
surface feature, $K_{M_i}$ is the number of surface features in the
neighbourhood of link $L_i$, $Z$ is the normalising constant, and
$w_{ij}$ is a weight that falls off exponentially as the distance
between the feature $x_j$ and the closest point $a_{ij}$ on finger link
$L_i$ increases:
$$w_{ij} = \begin{cases} \exp(-\lambda \lVert p_j - a_{ij} \rVert^2) & \text{if } \lVert p_j - a_{ij} \rVert < \delta_i \\ 0 & \text{otherwise.} \end{cases} \tag{4}$$
The key property of a contact model is that it is conditioned
on local surface features likely to be found on other objects,
so that the grasp can be transferred. We use the principal
curvatures r, but many local surface descriptors would do.
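The following sketch illustrates the weighting in Eq. 4 for a single finger link. It assumes, for illustration only, that the link geometry is represented by a set of sampled surface points, and the parameter values are placeholders rather than those used in the paper.

# Sketch of the contact-model weights w_ij (Eq. 4) for one finger link.
import numpy as np


def contact_model_weights(feature_positions, link_surface_points, lam=10.0, delta=0.02):
    """Return one weight per object feature for a single finger link."""
    weights = []
    for p_j in feature_positions:
        # Distance from feature p_j to the closest point a_ij on the link
        # (brute force over sampled link surface points).
        d = np.linalg.norm(link_surface_points - p_j, axis=1).min()
        weights.append(np.exp(-lam * d ** 2) if d < delta else 0.0)
    w = np.asarray(weights)
    return w / w.sum() if w.sum() > 0 else w   # normalisation by the constant Z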
B. Hand configuration model
In addition to a contact model for each finger link, a model
of the hand configuration $h_c \in \mathbb{R}^D$ is recorded, where $D$ is the
number of DoF in the hand. $h_c$ is recorded for several points
on the demonstrated grasp trajectory as the hand closed. The
learned model is:
$$C(h_c) \equiv \sum_{\gamma \in [-\beta, \beta]} w(h_c(\gamma))\, \mathcal{N}_D(h_c \mid h_c(\gamma), \sigma_{h_c}) \tag{5}$$
where $w(h_c(\gamma)) = \exp(-\alpha \lVert h_c(\gamma) - h_c^g \rVert^2)$; $\gamma$ is a parameter
that interpolates between the beginning ($h_c^t$) and end ($h_c^g$)
points on the trajectory, governed via Eq. 6 below; and $\beta$ is a
parameter that allows extrapolation of the hand configuration:
$$h_c(\gamma) = (1 - \gamma)\, h_c^g + \gamma\, h_c^t \tag{6}$$
C. Grasp Transfer
When presented with a new object $o_{\text{new}}$ the contact models
must be transferred to that object. A partial point cloud of
$o_{\text{new}}$ is acquired (from a single view) and recast as a density,
$O_{\text{new}}$, again using Eq. 1. The transfer of each contact model