Liam Welsh and Team win Kaggle 2022 Big Data Derby

January 13, 2023 by Michaela Drouillard

Congratulations to Liam Welsh (PhD Candidate, Department of Statistical Sciences), Brendan Kumagai (MS 2022, Simon Fraser University), Kimberly Kroetch (MS 2023, Simon Fraser University), and Gurashish Bagga (MS 2023, Simon Fraser University), as well as Dr. Tyrell Stokes (Postdoctoral Research Fellow at NYU Langone Health), for winning the 2022 Kaggle Big Data Derby. The team developed a Bayesian velocity model for simulating horse races by analyzing the frame-by-frame movement of horses on a racetrack. In total, 108 teams competed in the 2022 Kaggle Big Data Derby, in which participants had to create a model that interpreted one aspect of a dataset provided by The New York Racing Association (NYRA) and the New York Thoroughbred Horsemen's Association (NYTHA).

Welsh and his team developed a model that treats horses’ movement as a combination of two types of movement, forward and side-to-side (lateral), at a frame-by-frame level. By analyzing movement at the frame level, the team was able to model and better understand the complicated way in which a horse's movement depends on the movement and positioning of all other horses on the track.

Typically, in racing events, the main approach taken is to model rankings. Modelling ranks rather than race times captures the dynamics of how horses perform when in competition: strategic effects, such as lane effects or drafting effects, contribute to performance in ways that rating horses on time alone doesn’t capture. For instance, a horse can run in slow races but place well, or run in fast races but perform poorly. Modelling rank captures this where race times cannot.

The team realized that with tracking data available to them at the finer-grained level of horses’ movements, they could come up with a more complex model that could generate more insights than ranking alone. By modelling forward distance and sideways distance as two separate parts, the model could account for some of the more complicated strategic effects that rank modelling captures, but in greater detail.

The team drew on Mark Glickman and Jonathan Che’s work on developing rating systems while building their model, specifically using splines to model the forward movement of horses. Splines can estimate unknown functions by dividing the domain at a set of points and fitting smooth polynomials or lines between each pair of adjacent points. This spline structure allowed them to generate additional results, such as individual horse behavior and the influence of different jockeys on performance. The team then simulated from the model to recreate whole races, exploring different types of race paths and starting to capture some of those strategic effects.
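
To make the spline idea concrete, here is a minimal sketch of the simplest case: a piecewise-linear spline that joins straight lines between a handful of points. It is not the team's model, which is Bayesian and estimates its curves from tracking data; this example only interpolates through values that are given up front. The function name, the points and the speed values are all hypothetical, chosen for illustration.

```python
from bisect import bisect_right


def linear_spline(knots, values):
    """Return a function that linearly interpolates between (knot, value) pairs.

    Hypothetical helper for illustration only; it interpolates given values
    and does not fit anything to data.
    """
    if len(knots) != len(values) or len(knots) < 2:
        raise ValueError("need at least two knots, one value per knot")
    if any(b <= a for a, b in zip(knots, knots[1:])):
        raise ValueError("knots must be strictly increasing")

    def evaluate(x):
        # Clamp x to the domain covered by the knots.
        x = min(max(x, knots[0]), knots[-1])
        # Find the segment [knots[i], knots[i + 1]] that contains x.
        i = min(bisect_right(knots, x) - 1, len(knots) - 2)
        x0, x1 = knots[i], knots[i + 1]
        y0, y1 = values[i], values[i + 1]
        # Straight line between the two neighbouring points.
        t = (x - x0) / (x1 - x0)
        return y0 + t * (y1 - y0)

    return evaluate


# Made-up example: forward speed (m/s) at fractions of the race completed.
knots = [0.0, 0.25, 0.5, 0.75, 1.0]
speeds = [14.0, 17.0, 16.5, 16.0, 15.0]
speed_at = linear_spline(knots, speeds)

print(speed_at(0.125))  # halfway between the first two points
print(speed_at(0.625))  # halfway between the third and fourth points
```

Smoother splines replace the straight lines with low-degree polynomials that are constrained to join smoothly at each point, but the idea of splitting the domain into segments is the same.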

The collaboration started in August, when Welsh was invited to join the team by a friend at SFU, and by November, they had a model ready to submit.
However, the process was not without its hiccups. The data they had to work with was messy, with many discontinuities and tracking errors that produced erroneous results. In one case, a horse appeared to run 320 meters per second for two seconds, which, it goes without saying, is much faster than the highest recorded speed for any horse, ever.
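
A check of the following kind can surface tracking errors like this one. It is a minimal sketch and not the team's actual cleaning procedure: it assumes positions are recorded in meters at a fixed time step, and the function name, the 0.25-second frame interval, the 25 m/s cutoff and the sample coordinates are all assumptions made for illustration.

```python
import math


def flag_implausible_speeds(positions, dt, max_speed):
    """Return (frame_index, speed) for every step faster than max_speed.

    positions: list of (x, y) coordinates in meters, one per frame.
    dt: assumed time between consecutive frames, in seconds.
    max_speed: assumed plausibility cutoff, in meters per second.
    """
    flagged = []
    for i in range(1, len(positions)):
        (x0, y0), (x1, y1) = positions[i - 1], positions[i]
        # Straight-line distance covered between two consecutive frames.
        speed = math.hypot(x1 - x0, y1 - y0) / dt
        if speed > max_speed:
            flagged.append((i, speed))
    return flagged


# Made-up track: steady 16 m/s, with one 80-meter jump in a single frame.
track = [(0.0, 0.0), (4.0, 0.0), (8.0, 0.0), (88.0, 0.0), (92.0, 0.0)]
flagged = flag_implausible_speeds(track, dt=0.25, max_speed=25.0)
print(flagged)  # [(3, 320.0)]
```

Flagging is only the first step; what to do with a flagged frame (drop it, interpolate over it, or discard the race) is a separate modelling decision.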

The team members coordinated the project through late-night Slack messages sent across time zones, over the course of months. Despite these challenges, they were ultimately able to create a successful model by drawing on each other’s strengths, which earned them the $20,000 grand prize.

Liam Welsh commented, "Prior to the project, my experience with horse racing was limited to the fact that my mom had watched Secretariat and Seabiscuit, so I’d hear it in the background sometimes. So you know, we weren’t experts, but we all came in with a different set of skills that we were able to apply in ways that optimized the work we did together. And eventually, it got us this very nice result for us as a team. So yeah, I’m very proud of the work we did."