Learning Football Body-Orientation as a Matter of Classification
Adrià Arbués-Sangüesa1, Adrián Martín1, Paulino Granero2, Coloma Ballester1 and Gloria Haro1
1Universitat Pompeu Fabra, 2Russian Football Union
adria.arbues@upf.edu
Abstract
Orientation is a crucial skill for football players that becomes a differential factor in a large set of events, especially the ones involving passes. However, existing orientation estimation methods, which are based on computer-vision techniques, still have a lot of room for improvement. To the best of our knowledge, this article presents the first deep learning model for estimating orientation directly from video footage. By approaching this challenge as a classification problem where classes correspond to orientation bins, and by introducing a cyclic loss function, a well-known convolutional network is refined to provide player orientation data. The model is trained by using ground-truth orientation data obtained from wearable EPTS devices; these angles are individually compensated with respect to the orientation perceived in the current frame. The obtained results outperform previous methods; in particular, the absolute median error is less than 12 degrees per player. An ablation study is included in order to show the potential generalization to any kind of football video footage.
1 Introduction
Although deep learning (DL) has been an active field of research over the last decade, its application on top of sports data has had a slow start. The lack of universal sports datasets made it an impossible challenge for many researchers, professional clubs were not aware of the unlocked potential of data-driven tools, and companies were highly focused on manual video analysis.
However, over the last five years, the whole paradigm shifted and, for instance, in the case of football, complete datasets like SoccerNet have been publicly shared [Giancola et al., 2018; Deliège et al., 2020], hence providing researchers with valid resources to work with [Cioppa et al., 2019; Cioppa et al., 2020]. At the same time, top European clubs created their own departments of data scientists while publishing their findings [Fernández and Bornn, 2018; Llana et al., 2020], and companies also shifted to data-driven products based on trained large-scale models. Companies such as SciSports [Decroos et al., 2019; Bransen and Haaren, 2020], Sport Logiq [Sanford et al., 2020], Stats Perform [Sha et al., 2020; Stöckl et al., 2022] or Genius Sports [Quiroga et al., 2020] made a huge investment in research groups (in some cases, in collaboration with academia), and other companies are also sharing valuable open data [MetricaSports, 2020; StatsBomb, 2020; SkillCorner, 2020]. All these facts prove that DL is currently both trendy and useful within the context of sports analytics, thus creating a need for plug-and-play models that could be exploited by researchers, clubs, or companies.
Recently, expected possession value (EPV) and expected goals models proved to produce realistic outcomes [Spearman et al., 2017; Spearman, 2018; Fernández et al., 2019], which can be directly used by coaches to optimize their tactics. Furthermore, in this same field, Arbués-Sangüesa et al. spotted a specific literature gap regarding the presented models: player body-orientation. The authors claimed that, by merging existing methods with player orientation, the precision of existing models would improve [Arbués-Sangüesa et al., 2020a], especially in pass events. Defining orientation as the projected normal vector right in the middle of the upper torso, the authors proposed a sequential computer vision pipeline to obtain orientation data [Arbués-Sangüesa et al., 2020b]. Their model stems from pose estimation, which is obtained with existing models [Ramakrishna et al., 2014; Wei et al., 2016; Cao et al., 2017], and achieves an absolute median error of 28 degrees, which indicates that, despite being a solid baseline, there is still room for improvement.
Therefore, in this article, a novel deep learning model that obtains orientation out of any player's bounding box is presented. By (1) using sensor-based orientation data as ground truth, (2) turning this estimation into a classification problem, (3) compensating angles with respect to the camera's viewing direction, and (4) introducing a cyclic loss function based on soft labels, the network is able to estimate orientation with a median error of less than 12 degrees per player.
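Although the loss is formally introduced in Section 3, a minimal sketch of how soft labels over orientation bins can encode this cyclic structure is given below; the number of bins, the label width, and all names are illustrative assumptions rather than the configuration actually used in this work.

```python
import torch
import torch.nn.functional as F

NUM_BINS = 12   # illustrative: 12 bins of 30 degrees each (assumed, see Section 3)
SIGMA = 1.0     # label softness, in units of one bin width (assumed value)

def cyclic_soft_labels(angle_deg: torch.Tensor) -> torch.Tensor:
    """Turn ground-truth angles in degrees (shape [B]) into soft labels over
    NUM_BINS orientation bins, using a Gaussian on the wrap-around distance
    so that the first and last bins are treated as neighbours."""
    bin_width = 360.0 / NUM_BINS
    centers = torch.arange(NUM_BINS, dtype=angle_deg.dtype,
                           device=angle_deg.device) * bin_width
    diff = (angle_deg.unsqueeze(1) - centers.unsqueeze(0)).abs() % 360.0
    diff = torch.minimum(diff, 360.0 - diff)        # cyclic angular distance
    return F.softmax(-0.5 * (diff / (SIGMA * bin_width)) ** 2, dim=1)

def cyclic_loss(pred_logits: torch.Tensor, angle_deg: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the predicted bin logits and the soft labels."""
    target = cyclic_soft_labels(angle_deg)
    return -(target * F.log_softmax(pred_logits, dim=1)).sum(dim=1).mean()
```

The key property is the wrap-around distance: a prediction of 355 degrees for a ground truth of 5 degrees is penalized as a 10-degree error, not a 350-degree one.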
The rest of the paper is organized as follows: in Section 2 the main data structures and types of datasets are detailed; the proposed fine-tuning process is explained in Section 3, together with the appropriate details about the loss function and angle compensation. Results are shown in Section 4 and, finally, conclusions are drawn in Section 5.
Figure 1: Several domains are merged in this research: (left) sensor-, (middle) field-, and (right) image-domain. By using corners and
intersection points of field lines, the corresponding homographies are used to map data across domains into one same reference system.
2 Data Sources
Before introducing the proposed method, a detailed description of the materials required to train this model is given. Moreover, since data from different sources are mixed, their corresponding domains should be listed as well (a schematic sketch of these structures follows the list):
• Image-domain, which includes all kinds of data related to the associated video footage, that is: (i1) the video footage itself, (i2) player tracking, and (i3) corners' positions. Note that the result of player tracking in the image-domain consists of a set of bounding boxes, expressed in pixels; similarly, the corners' locations are also expressed in pixels. In this research, full-HD resolution (1920 x 1080) is considered, together with a temporal resolution of 25 frames per second.
• Sensor-domain, which gathers all pieces of data generated by wearable EPTS devices. In particular, data include: (s4) player tracking, and (s5) orientation data. In this case, players are tracked according to the universal latitude and longitude coordinates, and orientation data are captured with a gyroscope as XYZ Euler angles. In this work, sensor data were gathered with RealTrack Wimu wearable devices [RealTrack, 2018], which generate GPS and orientation data at 100 and 10 samples per second, respectively.
• Field-domain, which expresses all variables in terms of
a fixed two-dimensional football field, where the top-left
corner is the origin.
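As announced above, the following sketch makes this inventory concrete; all field names, types, and the metric units of the field-domain are assumptions for illustration, not the actual data schema used in this work.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ImageSample:                                # image-domain (i1-i3)
    frame_idx: int                                # at 25 fps, 1920 x 1080 footage
    boxes: List[Tuple[int, int, int, int]]        # (i2) per-player boxes, in pixels
    corners: List[Tuple[float, float]]            # (i3) visible corners, in pixels

@dataclass
class SensorSample:                               # sensor-domain (s4-s5)
    player_id: str
    lat: float                                    # (s4) latitude, 100 samples/s
    lon: float                                    # (s4) longitude
    euler_xyz: Tuple[float, float, float]         # (s5) gyroscope angles, 10 samples/s

@dataclass
class FieldSample:                                # field-domain, origin at top-left
    player_id: str
    x: float                                      # along the field (assumed metres)
    y: float
```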
Once data are gathered and synchronized from different
sources, two possible scenarios are faced:
• The complete case, in which all variables (i1, i2, i3, s4,
s5) are available. Note that both image- and sensor-data
include unique identifiers, which are easy to match by
inspecting a small subset of frames.
• The semi-complete case, where only part of the information is available (i1, s4, s5). In order to estimate the missing pieces (i2, i3) and match data across domains, a sequential pipeline is proposed in Section 2.2.
In this article, both a complete and a semi-complete dataset are used, each one containing data from a single game. In particular, the complete dataset contains a full game of F.C. Barcelona's youth team, recorded with a tactical camera with almost no panning and without zoom; this dataset will be named FCBDS. The semi-complete dataset contains a full preseason match of CSKA Moscow's professional team, recorded in a practice facility (without fans) with a single static camera that zooms quite often and pans severely; this second dataset will be named CSKADS. Furthermore, intersection and corner image-coordinates were manually identified and labelled in more than 4000 frames of CSKADS (1 frame every 37, i.e., one every 1.5 seconds), with a mean of 8.3 ground-truth field-spots per frame (34000 annotations).
2.1 Homography Estimation
Since the reference systems of the image- and the sensor-domain are not the same, the corners' positions (or line intersections) are used to translate all coordinates into the field-domain. On the one hand, obtaining field locations in the sensor-domain is straightforward: since its gathered coordinates are expressed with respect to the universal latitude/longitude system, the corners' locations are fixed. By using online tools such as the Satellite View of Google Maps, and by accurately picking field intersections, the corners' latitude and longitude coordinates are obtained. On the other hand, the corners' positions in the image-domain (in pixels) depend on the camera shot and change across frames; although several literature methods [Citraro et al., 2020] could be implemented in order to get the location of these field spots or the camera pose, our proposal leverages homographies computed from manual annotations. From now on, the homography that maps latitude/longitude coordinates into the field will be named HSF, whereas the one that converts pixels in the image into field coordinates will be named HIF. The complete homography-mapping process is illustrated in Figure 1.
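As an illustration of this mapping, and not necessarily the exact implementation used here, both homographies can be estimated from four or more annotated point correspondences with OpenCV; in homogeneous coordinates, a point x in one domain maps to a point proportional to Hx in the other. All coordinate values below are placeholders, and the 105 x 68 m field dimensions are an assumption.

```python
import cv2
import numpy as np

# Placeholder correspondences for one frame: pixel positions of the four
# pitch corners (top-left, top-right, bottom-left, bottom-right) ...
corners_px = np.float32([[320, 210], [1600, 205], [120, 980], [1800, 960]])
# ... the same corners in the field-domain, assuming a 105 x 68 m pitch
# with the origin at the top-left corner ...
corners_fld = np.float32([[0, 0], [105, 0], [0, 68], [105, 68]])
# ... and their latitude/longitude, e.g. picked from satellite imagery
corners_ll = np.float32([[55.79120, 37.55880], [55.79150, 37.56040],
                         [55.79060, 37.55905], [55.79090, 37.56065]])

H_IF, _ = cv2.findHomography(corners_px, corners_fld)   # image  -> field
H_SF, _ = cv2.findHomography(corners_ll, corners_fld)   # sensor -> field

def to_field(points: np.ndarray, H: np.ndarray) -> np.ndarray:
    """Apply a 3x3 homography to an (N, 2) array of points."""
    pts = points.reshape(-1, 1, 2).astype(np.float32)
    return cv2.perspectiveTransform(pts, H).reshape(-1, 2)

foot_px = np.float32([[640, 700]])      # e.g. the foot point of a bounding box
print(to_field(foot_px, H_IF))          # the same point in field coordinates
```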
2.2 Automatic Dataset Completion
In this Subsection, the complete process to convert a
semi-complete dataset into a complete one is described. It
has to be remarked that the aim is to detect players in the
image-domain and to match them with sensor data, hence
pairing orientation and identified bounding boxes. Note that
this procedure has been applied to CSKADS, which did
not contain ground-truth data in the image-domain. The
proposed pipeline is also displayed in Figure 2.
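Purely as an assumption on our part (the actual pipeline is the one depicted in Figure 2 and detailed in the remainder of this subsection), one plausible final matching step pairs detections and sensor tracks in the field-domain with a Hungarian assignment:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_boxes_to_sensors(box_fld: np.ndarray, sensor_fld: np.ndarray):
    """Pair detected players with sensor tracks for a single frame.

    box_fld:    (N, 2) field coordinates of bounding-box foot points (via HIF).
    sensor_fld: (M, 2) field coordinates of the EPTS tracks (via HSF).
    Returns index arrays (box_idx, sensor_idx) of the minimum-cost pairing."""
    # Cost: pairwise Euclidean distances on the field.
    cost = np.linalg.norm(box_fld[:, None, :] - sensor_fld[None, :, :], axis=2)
    return linear_sum_assignment(cost)
```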