OZmium Sports Betting and Horse Racing Forums - View Single Post

You could use either Multiple (Linear) Regression or (Binomial) Logistic Regression, depending on what sort of output you wanted.

You would use Multiple Regression if you wanted the combination of all your independent variables (Sky Rating, track odds, barrier number, career runs, etc) - each multiplied by its own coefficient - to sum to some total (the dependent variable). For example, looking at your historical data, you could define the dependent variable as 100 minus 4 times the lengths behind in that race, treating a winning margin as negative lengths behind. If a horse won by 2 lengths, you'd be finding the coefficients that, multiplied by your variables and summed, give 108 (100 - 4*(-2)). If a horse lost by 15 lengths, you'd be fitting to 40 (100 - 4*15). You could then look at upcoming races and find horses whose variables indicate a score greater than 100 - indicating that conditions are good for a win - or a score a certain margin greater than those of the other horses in the race.
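
As a rough illustration of the above, here is a minimal Python sketch (Python is my choice here, the same thing can be done in Excel). The history rows, the variable choice and the function names are all made up for the example. It fits the coefficients by ordinary least squares via the normal equations, which is one standard way of doing multiple regression, not necessarily the one your software uses.

```python
def target_score(lengths_behind):
    """Score from the example: 100 minus 4 * lengths behind.
    A winning margin is entered as negative lengths behind."""
    return 100 - 4 * lengths_behind


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_multiple_regression(rows, targets):
    """Ordinary least squares; returns [intercept, coef_1, coef_2, ...]."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 gives the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_score(coefs, row):
    """Intercept plus each variable multiplied by its coefficient."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# Made-up history: (sky_rating, barrier_number, career_runs), lengths_behind
history = [
    ((95, 2, 10), -2.0),   # won by 2 lengths  -> score 108
    ((80, 8, 25), 6.0),
    ((88, 5, 4), 1.5),
    ((70, 12, 30), 15.0),  # lost by 15 lengths -> score 40
    ((92, 1, 18), -0.5),
    ((75, 9, 7), 9.0),
]
rows = [h[0] for h in history]
targets = [target_score(h[1]) for h in history]
coefs = fit_multiple_regression(rows, targets)

# Hypothetical runner in an upcoming race
print(predict_score(coefs, (90, 3, 12)))
```

Six rows is far too few for anything real; it is only there to show the mechanics.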

You would use Binomial Logistic Regression if you wanted the combination of your independent variables - again each multiplied by its own coefficient - to determine the probability of an either/or event occurring. In this case, the probability of a horse winning (or not winning).
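
A minimal sketch of the logistic version, again in Python with made-up data and hypothetical function names. The weighted sum of the variables is passed through the logistic function so the output lands between 0 and 1. I've fitted it with plain gradient ascent on the log-likelihood because that keeps the code short; real packages normally use faster solvers. The rescaling of the inputs (rating minus 80 over 10, barrier minus 6 over 4) is just an assumption to keep the numbers small.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_regression(rows, outcomes, learning_rate=0.1, iterations=5000):
    """Fit [intercept, coef_1, ...] by gradient ascent on the log-likelihood."""
    weights = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(iterations):
        gradient = [0.0] * len(weights)
        for row, won in zip(rows, outcomes):
            x = [1.0] + list(row)  # leading 1.0 gives the intercept
            p = sigmoid(sum(w * v for w, v in zip(weights, x)))
            for j, v in enumerate(x):
                gradient[j] += (won - p) * v
        weights = [w + learning_rate * g / n for w, g in zip(weights, gradient)]
    return weights


def win_probability(weights, row):
    """Predicted probability of a win for one runner."""
    x = [1.0] + list(row)
    return sigmoid(sum(w * v for w, v in zip(weights, x)))


# Made-up history: ((sky_rating - 80) / 10, (barrier - 6) / 4), won (1) or not (0)
rows = [
    (1.5, -1.0), (1.2, -1.25), (0.8, 0.5), (0.9, -0.25), (0.0, 0.5),
    (-0.5, 0.75), (-1.0, 1.5), (0.5, -0.5), (0.3, 0.0), (-0.2, -0.75),
]
outcomes = [1, 1, 0, 1, 0, 0, 0, 0, 1, 0]

weights = fit_logistic_regression(rows, outcomes)

# Hypothetical runners: one well rated from an inside barrier, one poorly rated out wide
print(win_probability(weights, (1.5, -1.0)))
print(win_probability(weights, (-1.0, 1.5)))
```

Note this treats every runner separately, so the probabilities for the horses in one race won't necessarily add up to 1.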

By applying either method to a large enough dataset you would determine the statistical significance of individual variables contributing to the final result. You may find some variables just don't matter.
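
To give a feel for what "significance" means here, a simplified Python sketch follows. It looks at one variable at a time against the score (a simple regression, not the full multiple regression, which is an assumption made to keep the example short) and computes the t-statistic of the slope: the coefficient divided by its standard error. The data and names are made up.

```python
import math


def slope_t_statistic(xs, ys):
    """Simple linear regression of ys on xs.
    Returns (slope, standard_error, t_statistic)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # n - 2 degrees of freedom: two parameters (intercept and slope) were estimated
    standard_error = math.sqrt(residual_ss / (n - 2) / sxx)
    return slope, standard_error, slope / standard_error


# Made-up history for six runs
scores = [108, 76, 94, 40, 102, 64]      # 100 - 4 * lengths behind
sky_rating = [95, 80, 88, 70, 92, 75]
career_runs = [10, 25, 4, 30, 18, 7]

for name, values in [("sky_rating", sky_rating), ("career_runs", career_runs)]:
    slope, se, t = slope_t_statistic(values, scores)
    print(name, round(slope, 3), round(se, 3), round(t, 2))
```

With 6 rows there are 4 degrees of freedom, and the usual two-sided 5% cut-off for the t-statistic is about 2.776. A variable whose t-statistic is well inside that range is a candidate for one that "just doesn't matter", at least on this data.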

The problem with both methods is that they work best when the independent variables really are independent of one another, and here they aren't. For example, Betting Odds depend somewhat on barrier draw and jockey, much like Sky Ratings do. You can still use the methods, but the individual coefficients become harder to trust, so the results will be more useful as a rule of thumb than as a firm guide.
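
One quick way to see how bad the overlap is would be to check the correlation between each pair of input variables before fitting anything. A minimal Python sketch, with made-up numbers and a 0.8 cut-off that is purely an assumption for illustration:

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values for six runners
predictors = {
    "sky_rating": [95, 80, 88, 70, 92, 75],
    "track_odds": [2.5, 9.0, 4.5, 21.0, 3.2, 13.0],
    "career_runs": [10, 25, 4, 30, 18, 7],
}

names = list(predictors)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        r = pearson_correlation(predictors[a], predictors[b])
        # 0.8 is an arbitrary threshold chosen for this example
        flag = "  <- strongly related" if abs(r) > 0.8 else ""
        print(f"{a} vs {b}: {r:+.2f}{flag}")
```

If two inputs are flagged, one option is to drop one of them, or at least not to read too much into either coefficient on its own.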

Beton's approach of looking at specific field sizes on specific tracks over specific distances under specific track conditions from specific barriers would likely lead to more accurate models, but you may find yourself in turn limited by the size of your resulting dataset.

There are a bunch more reasons why any model won't be accurate (perhaps you're missing important variables, like blood counts or training performance; or outliers are skewing your data; or the relationship between the variables isn't linear at all and is logarithmic, polynomial or exponential instead; etc). Still, I certainly feel it's worthwhile applying some mathematical rigour to your processes. You can do it in Excel and there are heaps of how-tos available.