Computer vision is difficult.

Actually, forget that. Computer vision is no problem – you hook up a webcam and boom. The machine can see. However, making a computer understand what it’s seeing is very difficult indeed. The Leap Motion Controller provides a solution. By interpreting the data coming from its cameras, it presents structured information to the computer – telling it there’s a hand here, with fingers there, moving this way and that.

However, while the Leap Motion Controller allows the machine to identify hand movements, for many applications there still remains the challenge of interpreting what those movements actually mean.


Hungry? It’s hungry, right?

For example, though the Leap Motion Controller provides extremely detailed data, like “hand #0 is moving in direction (234, 12, 3) at speed x with a rotation of y,” for many applications it would be easier to make immediate use of recognizable human gestures, like “the user is waving” or “the user is giving a thumbs-up.”

Now, recognizing gestures is difficult – and systems that can do it seem magical.

One way to build such a magical system is to hire a couple of engineers, put them to work, and a week or two later, you’ll have an algorithm that can recognize a gesture. However, you had best hope that this algorithm will be flexible enough to spot my wave – which is different from yours because my hands are bigger or smaller, or in an unusual position, or moving faster or slower than expected. You should also hope you have enough time and budget to keep those engineers on the clock long enough to write algorithms to recognize all the gestures you’ll need to create a motion interface for your system.

In fact, the real challenge is building motion interfaces reliably and quickly enough to be practical – without resorting to wizardry.


Not today, Dr. Bizarro.

“It’s Lasers, Right? Lasers are the Solution?”

Almost always – but in this case, the problem is that interpreting motion algorithmically is a long and difficult task. Worse, when we make hand gestures we’re hopelessly imprecise. Our hands fly around, more or less, here and there – varying depending on how we’re positioned relative to the computer, whether we’re tired or full of energy – really, any number of factors.

So, while tracking coherent motion over time is bad enough, writing an algorithm that can reliably filter all the variations in instances of the same motion is downright heinous. Unless all our users have the eerily steady hands of a neurosurgeon, we’re unlikely to get the results we need.

“So… Not Lasers Then? Or Just Not Enough Lasers?” (Wink.)

No, but you have a point. This is software, so we definitely want the solution to sound as futuristic as possible – which brings us to machine learning.

Briefly, machine learning systems perform statistical analysis to identify patterns shared between known data samples and newly collected ones. The “learning,” in our case, is the gathering of labeled data, which gives the algorithms a baseline against which to measure the similarity of new input.

In our case, a known piece of data is a recording of a gesture we want our system to recognize.

We provide this recording during development, at which time we also assign it a name – “wave,” “tap,” “pew-pew” (patent pending), whatever.


Pew-pew (patent pending).

At runtime, our system compares the gestures it knows to gestures the user performs. When patterns in the known and new gestures are found to be sufficiently similar, we have recognition. This is how LeapTrainer works.

“That Sounds Suspiciously Easy. Are You a Wizard?”

Yes, but very few chickens were sacrificed to develop LeapTrainer. Practically none, in fact – the recording, processing, and pattern recognition functions work in essentially simple, non-supernatural ways.

First, we use the LeapTrainer UI to record samples of the gestures that we want to form the motion interface of our application.


Exhibit A: The LeapTrainer UI

From here, LeapTrainer monitors the motion data stream, looking for events that indicate the start and end of gestures. For example, if your hand is still, then suddenly starts to move, and then becomes still again – this period of movement is captured as a gesture:


“Why does it fade away cinematically at the start?” You shut your face, you.
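In code, this kind of start/end detection can be sketched as a velocity threshold plus a stillness counter. To be clear, this is an illustrative sketch and not LeapTrainer’s actual implementation – the threshold values, frame format, and function names here are all assumptions:

```javascript
// Hypothetical gesture segmentation: record frames while the hand is moving,
// and emit the captured frames once it has been still for a while.
const MOVE_THRESHOLD = 80;   // mm/s above which we consider the hand "moving"
const STILL_FRAMES   = 10;   // consecutive quiet frames that end a gesture

function createSegmenter(onGesture) {
  let recording = false;
  let quiet = 0;
  let frames = [];

  // Called once per frame with a [x, y, z] position and velocity vector.
  return function onFrame(position, velocity) {
    const speed = Math.hypot(velocity[0], velocity[1], velocity[2]);

    if (!recording) {
      if (speed > MOVE_THRESHOLD) {   // stillness broken: start recording
        recording = true;
        quiet = 0;
        frames = [position];
      }
      return;
    }

    frames.push(position);

    if (speed < MOVE_THRESHOLD) {
      if (++quiet >= STILL_FRAMES) {  // hand has settled: gesture complete
        recording = false;
        // Drop the trailing still frames before handing the gesture over.
        onGesture(frames.slice(0, frames.length - STILL_FRAMES));
      }
    } else {
      quiet = 0;                      // movement resumed: reset the counter
    }
  };
}
```

The two constants are exactly the kind of “dials and levers” mentioned later – too low a threshold and hand tremor triggers phantom gestures; too high and slow gestures get cut off.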

Next, we assign a name to the gesture, and LeapTrainer performs some analysis to prepare the sample for recognition. You can find the full details of how the UI works and how gesture data is prepared on the LeapTrainer GitHub site. Plus, if you’re the hands-on type, you can try a live copy of the training interface, ready to use, right here.

So, how does it work? Essentially, the geometric positioning data is resampled, scaled, and translated to the origin. As a result, while you might perform a big, fancy wave (because you might be a big, fancy type of person), and I might do a small, timid wave (because I’m the shy, retiring type – like a movie star on holiday), once the system has resampled and scaled our gestures, their similarities are amplified and their differences minimized.
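Under stated assumptions (this is a sketch of the technique, not LeapTrainer’s actual source), that resample–scale–translate pipeline might look like the following, working on gestures stored as arrays of [x, y, z] points:

```javascript
// Illustrative gesture pre-processing: resample to a fixed point count,
// scale into a unit cube, and translate the centroid to the origin.
function dist(a, b) {
  return Math.hypot(a[0] - b[0], a[1] - b[1], a[2] - b[2]);
}

function pathLength(pts) {
  let d = 0;
  for (let i = 1; i < pts.length; i++) d += dist(pts[i - 1], pts[i]);
  return d;
}

// Walk the path and emit n evenly spaced points along it.
function resample(pts, n) {
  const interval = pathLength(pts) / (n - 1);
  const src = pts.map(p => p.slice());
  const out = [src[0].slice()];
  let acc = 0;
  for (let i = 1; i < src.length; i++) {
    const d = dist(src[i - 1], src[i]);
    if (acc + d >= interval && d > 0) {
      const t = (interval - acc) / d;
      const q = [0, 1, 2].map(k => src[i - 1][k] + t * (src[i][k] - src[i - 1][k]));
      out.push(q);
      src.splice(i, 0, q);  // treat the new point as the next source point
      acc = 0;
    } else {
      acc += d;
    }
  }
  while (out.length < n) out.push(src[src.length - 1].slice());
  return out.slice(0, n);
}

// Divide by the largest bounding-box extent, so big and small waves match.
function scaleToUnit(pts) {
  const mins = [0, 1, 2].map(k => Math.min(...pts.map(p => p[k])));
  const maxs = [0, 1, 2].map(k => Math.max(...pts.map(p => p[k])));
  const size = Math.max(...[0, 1, 2].map(k => maxs[k] - mins[k])) || 1;
  return pts.map(p => p.map(c => c / size));
}

// Subtract the centroid, so where the gesture was performed doesn't matter.
function translateToOrigin(pts) {
  const centroid = [0, 1, 2].map(k =>
    pts.reduce((sum, p) => sum + p[k], 0) / pts.length);
  return pts.map(p => p.map((c, k) => c - centroid[k]));
}

function preprocess(pts, n = 32) {
  return translateToOrigin(scaleToUnit(resample(pts, n)));
}
```

After this, every gesture – big or small, fast or slow, left of the sensor or right of it – becomes a fixed-length, unit-scale, origin-centered list of points.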


Since pre-processing has ensured that the samples are the same length and scale, recognition is just a question of calculating the geometric distances between points in each gesture, and averaging them to calculate an overall distance between the two samples. If this distance is low enough (or, to put it another way, if similarity is high enough) – then boom – gesture recognition!
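A minimal sketch of that matching step – again illustrative, with hypothetical names and an arbitrary threshold, rather than LeapTrainer’s real API – could look like:

```javascript
// Average the pointwise distances between two equal-length, pre-processed
// gestures; lower is more similar.
function averageDistance(a, b) {
  let total = 0;
  for (let i = 0; i < a.length; i++) {
    total += Math.hypot(a[i][0] - b[i][0], a[i][1] - b[i][1], a[i][2] - b[i][2]);
  }
  return total / a.length;
}

// Compare a candidate against every known gesture and return the closest
// match under the threshold, or null if nothing is similar enough.
function recognize(candidate, knownGestures, threshold = 0.25) {
  let best = null;
  for (const [name, sample] of Object.entries(knownGestures)) {
    const d = averageDistance(candidate, sample);
    if (d < threshold && (best === null || d < best.distance)) {
      best = { name, distance: d };
    }
  }
  return best;
}
```

The threshold is another of those tunable dials: tighten it and the system becomes pickier; loosen it and waving vaguely at the sensor starts firing “pew-pew” events.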

“Well that sounds just swell, wizard – but how do I actually use it?”

Easy-peasy – just decouple the plasma manifolds, realign the core, fire up the time circuits, and watch my super-simple video showing exactly how to use LeapTrainer to transform a mousey-clicky HTML page into a motion web interface. In case you missed it above:

“And what can we expect next from LeapTrainer?”

Well, right now, the system can learn and recognize relatively simple gestures and poses.

There are lots of dials and levers that can be used to configure how each stage in training, processing, and recognition operates. This is because JavaScript-based gesture recognition (and indeed, gesture interfaces in general) is a genuinely new technology – so there’s lots of experimentation and optimization to be done, and nobody’s quite sure yet what the best configurations might be.

I’ve thoroughly documented each variable in the system and provided a way to modify them through the training UI – and also provided a simple extension mechanism with which to sub-class the LeapTrainer controller – so that you can create your own new and improved versions of the system, or even just experiment with Leap Motion-powered gesture recognition in general.

The most interesting upgrade for the next LeapTrainer release should be the capability to recognize multi-stroke gestures. The JSON exported from the training UI currently contains a mysterious “stroke” variable:


This variable isn’t used at all in the current version, but is intended to store the stroke index for gestures composed of several movements in rapid succession – something like a Z for Zorro, or more practically, the kind of multi-stroke gestures that might be used in authentication systems or even in sign-language recognition.
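To illustrate the idea (the actual export format isn’t reproduced here, and every field name besides “stroke” is a guess), a hypothetical Z-for-Zorro entry might tag each point with the stroke it belongs to:

```json
{
  "name": "zorro",
  "data": [
    { "stroke": 0, "x": -0.5, "y":  0.5, "z": 0.0 },
    { "stroke": 0, "x":  0.5, "y":  0.5, "z": 0.0 },
    { "stroke": 1, "x":  0.5, "y":  0.5, "z": 0.0 },
    { "stroke": 1, "x": -0.5, "y": -0.5, "z": 0.0 },
    { "stroke": 2, "x": -0.5, "y": -0.5, "z": 0.0 },
    { "stroke": 2, "x":  0.5, "y": -0.5, "z": 0.0 }
  ]
}
```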

So take a look at the code – it’s all open-source, under the MIT license – and hopefully it’s well documented enough that you can get a clear idea of how to use it, or even modify it.

Pull requests, suggestions, comments, and contributions are always welcome!


Except from you, Bizarro.

Rob O’Leary is an Irish software engineer based in Rome – where he learned that Italian is all about hand movements. Rob became interested in the Leap Motion Controller after seeing its accessible approach and open developer APIs. The latest version of his JavaScript gesture learning and recognition framework lets you upgrade a standard web interface to a motion interface in just a few minutes.

Bizarro credit: Alex E. Proimos