Deep Dexterous Grasping of Novel Objects from a
Single View
Umit Rusen Aktas1, Chao Zhao1, Marek Kopicki1, Ales Leonardis1 and Jeremy L. Wyatt1
Abstract—Dexterous grasping of a novel object given a single
view is an open problem. This paper makes several contributions
to its solution. First, we present a simulator for generating and
testing dexterous grasps. Second, we present a data set, generated
by this simulator, of 2.4 million simulated dexterous grasps of
variations of 294 base objects drawn from 20 categories. Third,
we present a basic architecture for generation and evaluation of
dexterous grasps that may be trained in a supervised manner.
Fourth, we present three different evaluative architectures, employing ResNet-50 or VGG-16 as their visual backbone. Fifth, we
train and evaluate seventeen variants of generative-evaluative
architectures on this simulated data set, showing improvement
from 69.53% grasp success rate to 90.49%. Finally, we present a
real robot implementation and evaluate the four most promising
variants, executing 196 real robot grasps in total. We show that
our best architectural variant achieves a grasp success rate of
87.8% on real novel objects seen from a single view, improving
on a baseline of 57.1%.
Index Terms—Deep learning, generative-evaluative learning,
grasping.
I. INTRODUCTION
If robots are to be widely deployed in human populated
environments then they must deal with unfamiliar situations.
An example is the case of grasping and manipulation. Humans
grasp and manipulate hundreds of objects each day, many
of which are previously unseen. Yet humans are able to
dexterously grasp these novel objects with a rich variety of
grasps. In addition, we do so from only a single, brief view
of each object. To operate in our world, dexterous robots must
replicate this ability.
This is the motivation for the problem tackled in this paper,
which is planning of (i) a dexterous grasp, (ii) for a novel
object, (iii) given a single view of that object. We define
dexterous as meaning that the robot employs a variety of dexterous grasp types across a set of objects. The combination of
constraints (i)-(iii) makes grasp planning hard because surface
reconstruction will be partial, yet this cannot be compensated
for by estimating pose for a known object model. The novelty
of the object, together with incomplete surface reconstruction,
and uncertainty about object mass and coefficients of friction,
renders infeasible the use of grasp planners which employ
classical mechanics to predict grasp quality. Instead, we must
employ a learning approach.
This in turn raises the question as to how we architect the
learner. Grasp planning comprises two problems: generation
and evaluation. Candidate grasps must first be generated
*This work was primarily supported by FP7-ICT-600918
1 University of Birmingham, School of Computer Science, UK.
jeremy.l.wyatt@gmail.com
[Fig. 1 graphic: Scene Depth Image; Learned Generative Model (grasps ranked by generative model likelihood); Learned Evaluative Model (grasps ranked by predicted success probability); Execution of Top Grasp.]
Fig. 1: The basic architecture of a generative-evaluative
learner. When shown a novel object the learned generative
model (GM) produces many grasps according to its likelihood
model. These are then each evaluated by a learned evaluative
model (EM), which predicts the probability of grasp success.
The grasps are then re-ranked according to the predicted
success probability and the top ranked grasp is executed.
according to some distribution conditioned on sensed data.
Then each candidate grasp must be evaluated, so as to produce
a grasp quality measure (e.g., maximum resistible wrench),
the probability of grasp success, the likely in-hand slip or
rotation, and so on. These measures are then used to rank grasps
so as to select one to execute. Either the generative model, the
evaluative model, or both may be learned. If only a generative model is
learned then evaluation must be carried out using mechanically
informed reasoning, which, as we noted, cannot easily be
applied to the case of novel objects seen from a single view.
If only an evaluative model is learned then grasp generation
must proceed by search. This is challenging for true dexterous
grasping as the hand may have between nine and twenty
actuated degrees of freedom. Thus, for dexterous grasping of
novel objects from a single view, it becomes appealing to learn
both the generative and the evaluative model.
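To make this generate-then-rank pipeline concrete, the following Python sketch shows one way it could be organised. It is an illustration only: the GenerativeModel and EvaluativeModel interfaces and the Grasp fields are assumptions, not the paper's implementation.

# Illustrative sketch (not the paper's implementation): a generative model
# proposes grasps for a depth image and an evaluative model re-ranks them.
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class Grasp:
    wrist_pose: np.ndarray      # 6-DoF wrist pose (position + orientation)
    hand_config: np.ndarray     # joint angles of the dexterous hand
    likelihood: float = 0.0     # score under the generative model
    p_success: float = 0.0      # success probability from the evaluative model


def plan_grasp(depth_image: np.ndarray,
               generative_model,            # hypothetical: samples grasps given sensed data
               evaluative_model,            # hypothetical: predicts success probability
               n_candidates: int = 100) -> Grasp:
    """Generate candidates, score them with the evaluative model, return the best."""
    candidates: List[Grasp] = generative_model.sample(depth_image, n_candidates)
    for g in candidates:
        g.p_success = evaluative_model.predict(depth_image, g)
    # Re-rank by predicted success probability rather than generative likelihood.
    return max(candidates, key=lambda g: g.p_success)

The design point illustrated is that the generative model supplies a prior over plausible dexterous grasps, so the evaluative model never has to score arbitrary points in the high-dimensional hand configuration space.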
The contributions of this paper are as follows. First, we
present a data-set of 2.4 million dexterous grasps in simulation
that may be used to evaluate dexterous grasping algorithms.
Second, we release the source code of the dexterous grasp
simulator, which can be used to visualise the dataset and
gather new data.1 Third, we present a generative-evaluative
architecture that combines data efficient learning of the generative model with data intensive learning in simulation of an
evaluative model. Fourth, we present multiple variations of the
evaluative model. Fifth, we present an extensive evaluation of
all these models on our simulated data set. Finally, we compare
the two most promising variants on a real robot with a data-set
of objects in challenging poses.
The model variants are organised in three dimensions. First,
we employ two different generative models (GM1 [1] and
GM2 [2]), one of which (GM2) is designed specifically for
single view grasping. Second, we use two different backbones
for the evaluative model, VGG-16 and ResNet-50. Third, we
experiment with two optimisation techniques, gradient ascent
(GA) and stochastic annealing (SA), to search for better grasps
using the evaluative model as an objective function.
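As a rough illustration of the second kind of search, the sketch below refines a grasp parameter vector by an annealing-style local search that treats the evaluative model's output as the objective. The em_score interface, the cooling schedule, and the perturbation scale are assumptions for illustration, not the settings used in the paper.

# Minimal annealing-style refinement sketch, assuming the evaluative model
# exposes a scalar objective em_score(depth_image, grasp_vector).
import numpy as np


def anneal_grasp(depth_image, grasp_vec, em_score,
                 steps=200, init_temp=1.0, sigma=0.02, seed=0):
    rng = np.random.default_rng(seed)
    best = current = np.asarray(grasp_vec, dtype=float)
    best_score = current_score = em_score(depth_image, current)
    for t in range(steps):
        temp = init_temp * (1.0 - t / steps)                 # linear cooling schedule
        proposal = current + rng.normal(0.0, sigma, size=current.shape)
        score = em_score(depth_image, proposal)
        # Always accept improvements; accept worse proposals with a probability
        # that shrinks as the temperature decreases.
        if score > current_score or rng.random() < np.exp((score - current_score) / max(temp, 1e-8)):
            current, current_score = proposal, score
            if score > best_score:
                best, best_score = proposal, score
    return best, best_score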
The paper is structured as follows. First, we discuss related
work. Second, the basic generative model is described in
detail and the main features of the extended generative model
are sketched. Third, we describe the design of the grasp
simulation and the generation of the data set. Fourth, we describe
the different architectures employed for the evaluative model.
Fifth, we describe the evaluative model training, the optimisation variants for the evaluative model and the simulated
experimental study. Finally, we present the real robot study.
II. BACKGROUND AND RELATED WORK
There are four broad approaches to grasp planning. First,
we may employ analytic mechanics to evaluate grasp quality.
Second, we may engineer a mapping from sensing to grasp.
Third, we may learn this mapping, such as learning a generative model. Fourth, we may learn a mapping from sensing and
a grasp to a grasp success prediction. See [3] and [4] for recent
reviews of data driven and analytic methods respectively.
Analytic approaches use mechanical models to predict grasp
outcome [5], [6], [7], [8]. This requires models of both object
(mass, mass distribution, shape, and surface friction) and
manipulator (kinematics, exertable forces and torques). Several
grasp quality metrics can be defined using these [9], [10],
[11] under a variety of mechanical assumptions. These have
been applied to dexterous grasp planning [12], [13], [14], [15],
[16], [17]. The main drawback of analytic approaches is that
estimation of object properties is hard. Even a small error in
estimated shape, friction or mass will render a grasp unstable
[18]. There is also evidence that grasp quality metrics are not
well correlated with actual grasp success [19], [20], [21].
An alternative is learning for robot grasping, which has
made steady progress. There are probabilistic machine learning
techniques employed for surface estimation for grasping [22];
data efficient methods for learning dexterous grasps from
demonstration [23], [1], [24]; logistic regression for classifying
grasp features from images [25]; extracting generalisable parts
for grasping [26] and for autonomous grasp learning [27].
1The code and simulated grasp dataset are available at
https://rusen.github.io/DDG. The web page explains how to download
the dataset, install the physics simulator and re-run the grasps in simulation.
The simulator acts as a client alongside a simple web server to gather new
grasp data in a distributed setup.
Deep learning is a recent approach to grasping. Most work is
for two finger grippers. Approaches either learn an evaluation
function for an image-grasp pair [28], [29], [30], [31], [32],
[33], learn to predict the grasp parameters [34], [35] or jointly
estimate both [36]. The quantity of real training grasps can be
reduced by mixing real and simulated data [37].
A small number of papers have explored deep learning as
a method for dexterous grasping [43], [44], [45], [42], [41].
All of these use simulation to generate the training set for
learning. Kappler [41] showed the ability of a CNN to predict
grasp quality for multi-fingered grasps, but uses complete point
clouds as object models and only varies the wrist pose for the
pre-grasp position, leaving the finger configurations the same.
Varley [44] and later Zhou [42] went beyond this by varying
the hand pre-shape, and predicting from a single image of the
scene. Each of these posed search for the grasp as a pure
optimisation problem (using simulated annealing or quasi-Newton methods) on the output of the CNN. They also take
the approach of learning an evaluative model, and generate
candidates for evaluation uninfluenced by prior knowledge.
Veres [45], in contrast, learns a deep generative model. Finally,
Lu [43] learns an evaluative model, and then, given an input
image, optimises the inputs that describe the wrist pose and
hand pre-shape to this model via gradient ascent, but does not
learn a generative model. In addition, the grasps start with
a heuristic grasp which is varied within a limited envelope.
Of the papers on dexterous grasp learning with deep networks
only two approaches [44], [43] have been tested on real grasps,
with eight and five test objects each, producing success rates
of 75% and 84% respectively. A key restriction of both of
these methods is that they only plan the pre-grasp, not the
finger-surface contacts, and are thus limited to power-grasps.
Thus, in each case, either an evaluative model is learned but
there is no learned prior over the grasp configuration able to be
employed as a generative model; or a generative grasp model
is learned, but there is no evaluative model learned to select
the grasp. Our technical novelty is thus to bring together a
data-efficient method of learning a good generative model with
an evaluative model. As with others, we learn the evaluative
model from simulation, but the generative model is learned
from a small number of demonstrated grasps. Table I compares
the properties of the learning methods reviewed above against
this paper. Most works concern pinch grasping. Of the eight
papers on learning methods for dexterous grasping, two [44],
[43] are limited to power grasps. Of the remaining six, three
have no real robot results [45], [42], [41]. Of the remaining
three, two we directly build on here, the third being an extension
of one of those grasp methods with active vision. Finally,
our real robot evaluation is extensive in comparison with
competitor works on dexterous grasping, comprising 196 real
grasps of 40 different objects.
III. DATA EFFICIENT LEARNING OF A GENERATIVE GRASP
MODEL FROM DEMONSTRATION
This section describes the generative model learning upon
which the paper builds. We employ two related grasp generation techniques [1], [2], which both learn a generative
Columns: References; grasp type (2-fing., >2-finger power, >2-finger dexterous); Robot results; Clutter; Model free; Novel objects.
[26], [25], [27], [29], [31], [33], [38]: ✓ ✓ ✓ ✓
[32], [37], [39], [40]: ✓ ✓ ✓ ✓ ✓
[41]: ✓ ✓
[42]: ✓ ✓ ✓
[43], [44]: ✓ ✓ ✓ ✓
[23]: ✓ ✓ ✓
[45], [42], [41]: ✓ ✓ ✓
[1], [2], [24]: ✓ ✓ ✓ ✓
[46]: ✓ ✓ ✓ ✓ ✓
This paper: ✓ ✓ ✓ ✓
TABLE I: Qualitative comparison of grasp learning methods (✓ marks that a method has the corresponding property).
model of a dexterous grasp from a demonstration (LfD). Those
papers both posed the problem as one of learning a factored
probabilistic model from a single example. The method is split
into a model learning phase, a model transfer phase, and the
grasp generation phase.
A. Model learning
The model learning is split into three parts: acquiring an
object model; using this object model, with a demonstrated
grasp, to build a contact model for each finger link in contact
with the object; and acquiring a hand configuration model
from the demonstrated grasp. After learning, the object model
can be discarded.
1) Object model: First, a point cloud of the object used
for the demonstrated grasp is acquired by a depth camera,
from several views. Each point is augmented with the estimated
principal curvatures at that point and a surface normal.
Thus, the $j$th point in the cloud gives rise to a feature
$x_j = (p_j, q_j, r_j)$, with the components being its position
$p_j \in \mathbb{R}^3$, orientation $q_j \in SO(3)$ and principal curvatures
$r_j = (r_{j,1}, r_{j,2}) \in \mathbb{R}^2$. The orientation $q_j$ is defined by
$k_{j,1}, k_{j,2}$, which are the directions of the principal curvatures.
For later convenience we use $v = (p, q)$ to denote position
and orientation combined. These features $x_j$ allow the object
model to be defined as a kernel density estimate of the joint
density over $v$ and $r$:
$$O(v, r) \equiv \mathrm{pdf}_O(v, r) \simeq \sum_{j=1}^{K_O} w_j K(v, r \mid x_j, \sigma_x) \tag{1}$$
where $O$ is short for $\mathrm{pdf}_O$, the bandwidth is $\sigma_x = (\sigma_p, \sigma_q, \sigma_r)$, $K_O$
is the number of features $x_j$ in the object model, all weights
are equal, $w_j = 1/K_O$, and $K$ is defined as a product:
$$K(x \mid \mu, \sigma) = \mathcal{N}_3(p \mid \mu_p, \sigma_p)\,\Theta(q \mid \mu_q, \sigma_q)\,\mathcal{N}_2(r \mid \mu_r, \sigma_r) \tag{2}$$
where $\mu$ is the kernel mean point, $\sigma$ is the kernel bandwidth,
$\mathcal{N}_n$ is an $n$-variate isotropic Gaussian kernel, and $\Theta$ corresponds
to a pair of antipodal von Mises-Fisher distributions.
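A simplified Python sketch of evaluating the density in Eqs. 1 and 2 is given below. It assumes orientations are unit quaternions, uses a plain antipodally symmetric exponential kernel as a stand-in for the pair of von Mises-Fisher distributions, and all bandwidths are illustrative values, not those of the paper.

# Simplified sketch of the object-model kernel density O(v, r) in Eqs. (1)-(2).
import numpy as np


def position_kernel(p, mu_p, sigma_p):
    # 3-variate isotropic Gaussian (unnormalised)
    return np.exp(-0.5 * np.sum((p - mu_p) ** 2) / sigma_p ** 2)


def orientation_kernel(q, mu_q, kappa):
    # Antipodal symmetry: q and -q describe the same orientation.
    dot = abs(float(np.dot(q, mu_q)))
    return np.exp(kappa * (dot - 1.0))


def curvature_kernel(r, mu_r, sigma_r):
    # 2-variate isotropic Gaussian over principal curvatures (unnormalised)
    return np.exp(-0.5 * np.sum((r - mu_r) ** 2) / sigma_r ** 2)


def object_density(v, r, features, sigma_p=0.01, kappa=50.0, sigma_r=0.05):
    """Evaluate O(v, r) at query pose v = (p, q) and curvatures r."""
    p, q = v
    total = 0.0
    for (p_j, q_j, r_j) in features:          # surface features x_j from the point cloud
        total += (position_kernel(p, p_j, sigma_p)
                  * orientation_kernel(q, q_j, kappa)
                  * curvature_kernel(r, r_j, sigma_r))
    return total / len(features)              # equal weights w_j = 1 / K_O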
2) Contact models: When a grasp is demonstrated the final
hand pose is recorded. This is used to find all the finger links
$L$ and surface features $x_j$ that are in close proximity. A contact
model $M_i$ is built for each finger link $i$. Each feature in the
object model that is within some distance $\delta_i$ of finger link $L_i$
contributes to the contact model $M_i$ for that link. This contact
model is defined for finger link $i$ as follows:
$$M_i(u, r) \equiv \mathrm{pdf}_{M_i}(u, r) \simeq \frac{1}{Z} \sum_{j=1}^{K_{M_i}} w_{ij} K(u, r \mid x_j, \sigma_x) \tag{3}$$
where $u$ is the pose of $L_i$ relative to the pose $v_j$ of the $j$th
surface feature, $K_{M_i}$ is the number of surface features in the
neighbourhood of link $L_i$, $Z$ is the normalising constant, and
$w_{ij}$ is a weight that falls off exponentially as the distance
between the feature $x_j$ and the closest point $a_{ij}$ on finger link
$L_i$ increases:
$$w_{ij} = \begin{cases} \exp(-\lambda \lVert p_j - a_{ij} \rVert^2) & \text{if } \lVert p_j - a_{ij} \rVert < \delta_i \\ 0 & \text{otherwise.} \end{cases} \tag{4}$$
The key property of a contact model is that it is conditioned
on local surface features likely to be found on other objects,
so that the grasp can be transferred. We use the principal
curvatures r, but many local surface descriptors would do.
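The following sketch illustrates the weighting in Eq. 4 for a single finger link. It assumes, for illustration only, that the link geometry is represented by a set of sampled surface points, and the parameter values are placeholders rather than those used in the paper.

# Sketch of the contact-model weights w_ij (Eq. 4) for one finger link.
import numpy as np


def contact_model_weights(feature_positions, link_surface_points, lam=10.0, delta=0.02):
    """Return one weight per object feature for a single finger link."""
    weights = []
    for p_j in feature_positions:
        # Distance from feature p_j to the closest point a_ij on the link
        # (brute force over sampled link surface points).
        d = np.linalg.norm(link_surface_points - p_j, axis=1).min()
        weights.append(np.exp(-lam * d ** 2) if d < delta else 0.0)
    w = np.asarray(weights)
    return w / w.sum() if w.sum() > 0 else w   # normalisation by the constant Z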
B. Hand configuration model
In addition to a contact model for each finger link, a model
of the hand configuration $h_c \in \mathbb{R}^D$ is recorded, where $D$ is the
number of DoF in the hand. $h_c$ is recorded for several points
on the demonstrated grasp trajectory as the hand closed. The
learned model is:
$$C(h_c) \equiv \sum_{\gamma \in [-\beta, \beta]} w(h_c(\gamma))\, \mathcal{N}_D(h_c \mid h_c(\gamma), \sigma_{h_c}) \tag{5}$$
where $w(h_c(\gamma)) = \exp(-\alpha \lVert h_c(\gamma) - h_c^g \rVert^2)$; $\gamma$ is a parameter
that interpolates between the beginning ($h_c^t$) and end ($h_c^g$)
points on the trajectory, governed via Eq. 6 below; and $\beta$ is a
parameter that allows extrapolation of the hand configuration:
$$h_c(\gamma) = (1 - \gamma)\, h_c^g + \gamma\, h_c^t \tag{6}$$
C. Grasp Transfer
When presented with a new object $o_{\text{new}}$ the contact models
must be transferred to that object. A partial point cloud of
$o_{\text{new}}$ is acquired (from a single view) and recast as a density,
$O_{\text{new}}$, again using Eq. 1. The transfer of each contact model