Bime and the MCFC Analytics Project with Opta

When Manchester City Football Club announced they were working with Opta to release a dataset of every event in every game in the Premiership last year, the Bime Team were delighted!  Last season, we created a series of dashboards for the Premiership and Ligue 1 based on match-by-match statistics from (and occasionally, painstakingly hand-crafted spreadsheets collating data from a variety of other sources!) but this is different - over 10,000 rows of data for everything done by 539 players in 380 matches last season.

You can sign up to join the MCFC Analytics community to get hold of the data for yourself, and have a look at more results on the OptaPro website.

Click here for a group of dashboards showing our initial queries.

To explain a little about our calculation approach - the dataset contains over 150 metrics, plus player and team details, so we used Bime to navigate, organise, aggregate and investigate the data.  Bime is primarily a Business Intelligence tool, but data is data, so it can be employed for all sorts of projects!  The system picks up the column headers or data fields and organises them as measures (values / quantities) and attributes (qualities), which can then be arranged on a pivot table and visualised in a range of ways.  Bime also has a very powerful calculation engine that allows for custom calculations and the creation of groups, conditionals, etc.  Terms in italics below refer to the type of calculation used to get the required result.

For example, data is provided in the file for passes backwards, forwards, left and right, so it was a simple matter for the Bime engine to create a new calculated measure 'sideways passes' to use in the combined measure pie chart on the second tab, Approach.  This also meant we could investigate which teams play the most 'tiki-taka' game, i.e. which have the highest proportion of sideways passes.  From the results, this style would seem to be a good idea, as the top four were, well, the Top 4!  Swansea and Wigan, who both received plaudits for their attractive style of play, join them in the top six.
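Outside Bime, the same 'sideways passes' measure and its share of all passes can be sketched in a few lines of Python.  This is only an illustration: the column names and the sample rows are made up, not the real headers or figures from the Opta file.

```python
from collections import defaultdict

# Hypothetical rows; column names and numbers are assumptions for illustration.
rows = [
    {"team": "Team A", "passes_forward": 30, "passes_backward": 10,
     "passes_left": 25, "passes_right": 35},
    {"team": "Team A", "passes_forward": 20, "passes_backward": 10,
     "passes_left": 30, "passes_right": 40},
    {"team": "Team B", "passes_forward": 50, "passes_backward": 20,
     "passes_left": 15, "passes_right": 15},
]


def sideways_share_by_team(rows):
    """Return {team: sideways passes / all passes}."""
    sideways = defaultdict(int)
    total = defaultdict(int)
    for r in rows:
        # The new calculated measure: left + right passes.
        side = r["passes_left"] + r["passes_right"]
        sideways[r["team"]] += side
        total[r["team"]] += side + r["passes_forward"] + r["passes_backward"]
    # Skip teams with no passes at all to avoid dividing by zero.
    return {team: sideways[team] / total[team] for team in total if total[team]}


shares = sideways_share_by_team(rows)
# Teams ordered from most to least 'tiki-taka'.
ranking = sorted(shares, key=shares.get, reverse=True)
```

Summing per team before dividing gives a season-level proportion, rather than an average of per-row proportions, which would weight quiet games as heavily as busy ones.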

It then seemed reasonable to look into what the other teams were up to, and also to see if certain stereotypes were true - so, we added together successful and unsuccessful long balls, then created a fixed calculated measure for the average per team, and put that up next to our tiki-taka visual.  You will see a distinct correlation between the top of one chart and the bottom of the other.  Unfortunately there isn't quite enough data here to prove if Stoke are particularly doughty on a particular weekday in certain climatic conditions...
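As a rough sketch of the long-ball calculation, assuming 'average per team' means the league-wide mean of each team's total (our reading, not a documented Bime definition), with hypothetical column names and sample figures:

```python
# Hypothetical rows; column names and numbers are assumptions for illustration.
rows = [
    {"team": "Team A", "long_balls_ok": 12, "long_balls_failed": 18},
    {"team": "Team A", "long_balls_ok": 10, "long_balls_failed": 20},
    {"team": "Team B", "long_balls_ok": 5, "long_balls_failed": 5},
]


def long_balls_by_team(rows):
    """Sum successful and unsuccessful long balls for each team."""
    totals = {}
    for r in rows:
        attempted = r["long_balls_ok"] + r["long_balls_failed"]
        totals[r["team"]] = totals.get(r["team"], 0) + attempted
    return totals


totals = long_balls_by_team(rows)
# One fixed value for the whole league, usable as a reference line on a chart.
league_average = sum(totals.values()) / len(totals)
```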

Other steps helped with presentation, such as changing the formationID field from the file into readable formations using the rename elements option, to build up a picture of how teams set up in the fourth tab, Formations (with thanks to Simon Austin at Opta for confirming the key).  This also involved creating a new calculated attribute to identify individual matches: as the dataset is event by event, we needed some form of match 'label' to aggregate all the relevant events, which we did by concatenating team + opponent + venue, and then using a simple DCount in a calculated measure to create the metric used.  The same concatenation approach was used to give each player a unique name, so Thierry and Karl didn't get aggregated in the shooting statistics (oops).
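The match 'label' and distinct count can be illustrated in Python as follows.  The field names, the sample events and the formation key are all placeholders (the real key came from Opta and is not reproduced here), and the set-based count simply stands in for what DCount does in Bime.

```python
from collections import defaultdict

# Placeholder key only; NOT the actual Opta formationID mapping.
FORMATION_NAMES = {1: "formation one", 2: "formation two"}

# Hypothetical events; several rows can belong to the same match.
events = [
    {"team": "Team A", "opponent": "Team B", "venue": "Home", "formation_id": 1},
    {"team": "Team A", "opponent": "Team B", "venue": "Home", "formation_id": 1},
    {"team": "Team A", "opponent": "Team B", "venue": "Away", "formation_id": 2},
    {"team": "Team A", "opponent": "Team C", "venue": "Home", "formation_id": 1},
]


def match_label(event):
    """Concatenate team + opponent + venue into one label per match."""
    return f'{event["team"]} v {event["opponent"]} ({event["venue"]})'


def player_label(first_name, surname):
    """Same idea for players, so two players sharing a surname stay separate."""
    return f"{first_name} {surname}"


def matches_per_formation(events):
    """Distinct count of match labels for each (team, formation) pair."""
    seen = defaultdict(set)
    for e in events:
        formation = FORMATION_NAMES.get(e["formation_id"], "unknown")
        seen[(e["team"], formation)].add(match_label(e))
    return {key: len(labels) for key, labels in seen.items()}


counts = matches_per_formation(events)
```

Counting distinct labels rather than rows is what stops a match with many events from being counted many times.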

When looking at the performance of individual players, it is very important to focus in on the most relevant part of the dataset.  For example, on the first tab, Striking, we look at conversion rates (goals / shots), but if everyone is included, some odd results pop up.  Three players, for example, had a 100% shot conversion ratio last season - before anybody jumps in to buy one of them as their star striker before the end of the transfer window, it should be noted that one of them was Tim Howard.  Thus, the scatter graph shows the 100 players who had the most shots (created using a filtered attribute), and their success rates - the table on the right cuts this down further to the 10 most frequent shooters for more detail.  As the dataset also includes minutes played, it seemed sensible to look at shooting efficiency from another angle, that of time, and by showing these two together, another view is seen; only one player (Wayne Rooney) appears in both top tens.  Similarly a 'floor' was put on the tackle scatter on tab 3, Tackling and Keeping, using post-processing to specify a minimum of ten tackles won to focus in on the most relevant players for this metric - which still leaves one with a 100% success rate, Connor Wickham.
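To show why the filtering matters, here is a small Python sketch with invented players and numbers (none of these figures come from the dataset, and the field names are assumptions).  It computes conversion rate and minutes per shot, keeps only the most frequent shooters, and applies a minimum 'floor' to tackles won.

```python
# Invented sample data for illustration only.
players = [
    {"name": "Striker A", "shots": 120, "goals": 24, "minutes": 2700, "tackles_won": 5},
    {"name": "Striker B", "shots": 80, "goals": 20, "minutes": 2400, "tackles_won": 12},
    {"name": "Keeper C", "shots": 1, "goals": 1, "minutes": 3420, "tackles_won": 0},
    {"name": "Midfielder D", "shots": 40, "goals": 2, "minutes": 3000, "tackles_won": 60},
]


def conversion_rate(p):
    """Goals divided by shots; 0.0 for players who never shot."""
    return p["goals"] / p["shots"] if p["shots"] else 0.0


def minutes_per_shot(p):
    """Minutes played per shot taken; infinite for players who never shot."""
    return p["minutes"] / p["shots"] if p["shots"] else float("inf")


def top_by(players, field, n):
    """Keep the n players with the highest value of the given field."""
    return sorted(players, key=lambda p: p[field], reverse=True)[:n]


def with_floor(players, field, minimum):
    """Keep only players at or above a minimum value of the given field."""
    return [p for p in players if p[field] >= minimum]


# Without a filter, the one-shot goalkeeper tops the conversion table.
unfiltered_best = max(players, key=conversion_rate)
# Restricting to the most frequent shooters gives a more meaningful leader.
filtered_best = max(top_by(players, "shots", 3), key=conversion_rate)
# The tackling floor works the same way, with a threshold instead of a top-n cut.
regular_tacklers = with_floor(players, "tackles_won", 10)
```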

Using statistics to analyse football is fraught with difficulty; far too much weight can be placed on individual metrics, and it can be tempting to confuse correlation with causation.  The key strength of the statistical approach is perhaps not to answer questions, but to provoke new ones, for a greater understanding of the game - for example, why are seven of the top ten best tacklers not defenders?  What's up with Liverpool's strikers exactly?  Why doesn't Berbatov play more often?  And many, many more.  Not all of these questions can even be approached with statistics, but having this level of detail available is a fascinating help in investigating.

Update 28/08/2012:

Three new tabs have been added to the group, looking at aerial prowess, the 'footedness' of players, and various disciplinary issues.

In the first, we can see a high correlation between the long-ball game and the proportion of goals that are scored in the air - compare and contrast the bar chart here with that on the 'Approach' tab.  We can also see a scatter of player performance in the air, with colour identifying position; compared to the trend line, the majority of defenders are above the line, i.e. successful in aerial duels.  The size of the dot shows the number of headed goals scored, so two specialist strikers can be identified, Peter Crouch and (less successfully in terms of goals but winning more balls in the air than anyone else) Andy Carroll.  A point made by Michael Cox of Zonal Marking in a recent Guardian podcast can also be seen, that Marouane Fellaini does indeed win a lot of balls in the air in midfield, with Steve N'Zonzi also impressive in this department.
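For anyone wanting to put a number on that relationship rather than eyeball two bar charts, a Pearson correlation is one option.  The sketch below is not how the dashboard was built; the team figures are invented and the field names are assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented team totals for illustration only.
teams = [
    {"team": "Team A", "long_balls": 2500, "goals": 40, "headed_goals": 12},
    {"team": "Team B", "long_balls": 2000, "goals": 50, "headed_goals": 10},
    {"team": "Team C", "long_balls": 1500, "goals": 60, "headed_goals": 6},
]

long_balls = [t["long_balls"] for t in teams]
# Proportion of each team's goals that were headers.
headed_share = [t["headed_goals"] / t["goals"] for t in teams]
r = pearson(long_balls, headed_share)
```

A value of r close to 1 would back up the visual impression, though with only 20 teams in a season it would still say nothing about cause.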

On 'footedness', there is a bias towards the right foot, but interestingly (perhaps) this is more pronounced amongst attacking players (the 100 taking the most shots). To calculate a player's preference, a conditional calculated measure was used - if right-foot shots exceeded left-foot shots by 25% or more, they 'favour right' and vice versa for 'favour left'. Initially this was set at a 10% tolerance but under this, only one player was 'neutral' - Heidar Helguson, with 8 shots on each foot.
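The conditional can be written out as a small Python function.  We assume here that 'by 25% or more' means shots on one foot are at least 1.25 times the shots on the other; that is one reasonable reading rather than a confirmed definition, and the function name is our own.

```python
def foot_preference(right_shots, left_shots, tolerance=0.25):
    """Classify a player as 'favour right', 'favour left' or 'neutral'.

    Assumption: one foot is favoured when its shot count is at least
    (1 + tolerance) times the other foot's count.
    """
    if right_shots == left_shots:
        # Covers the no-shots case as well as an exact split.
        return "neutral"
    if right_shots >= left_shots * (1 + tolerance):
        return "favour right"
    if left_shots >= right_shots * (1 + tolerance):
        return "favour left"
    return "neutral"


# 8 shots on each foot is neutral under any tolerance.
even_split = foot_preference(8, 8)
# A 10 v 9 split is neutral at 25% but counts as a preference at 10%.
at_25 = foot_preference(10, 9, tolerance=0.25)
at_10 = foot_preference(10, 9, tolerance=0.10)
```

This shows why the narrower tolerance left almost nobody in the 'neutral' group: even a one-shot difference tips most players over a 10% threshold.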

In the disciplinary tab, looking at offsides, we noticed that the worst offenders were 'mid-table' in terms of approach, i.e. on the two bar charts for the long-ball game and 'tiki-taka rate'.  Liverpool, Everton and West Brom were all caught offside over 100 times, perhaps because playing a less consistent style means attackers are less prepared for the type of ball coming forward to them.  On the other hand, Arsenal and Tottenham do have a highly defined style and their chief strikers, Van Persie and Adebayor, have very high 'personal' offside ratings and make up 50%+ of their teams' totals - and factoring time played into things, Javier Hernandez was the most frequent offender, being caught offside every 35 minutes played.
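Both offside measures, a player's share of the team total and minutes played per offside, are simple ratios.  A Python sketch with invented players and numbers (the field names are assumptions too):

```python
from collections import defaultdict

# Invented sample data for illustration only.
players = [
    {"name": "Forward A", "team": "Team A", "offsides": 40, "minutes": 2800},
    {"name": "Winger B", "team": "Team A", "offsides": 20, "minutes": 2000},
    {"name": "Forward C", "team": "Team B", "offsides": 30, "minutes": 1050},
]


def team_offside_totals(players):
    """Total offsides per team."""
    totals = defaultdict(int)
    for p in players:
        totals[p["team"]] += p["offsides"]
    return dict(totals)


def minutes_per_offside(p):
    """Minutes played for each time caught offside; infinite if never caught."""
    return p["minutes"] / p["offsides"] if p["offsides"] else float("inf")


totals = team_offside_totals(players)
# Each player's share of their own team's offsides.
share = {p["name"]: p["offsides"] / totals[p["team"]] for p in players}
# The most frequent offender has the fewest minutes between offsides.
most_frequent = min(players, key=minutes_per_offside)
```

Dividing by minutes rather than comparing raw counts is what lets a player with limited playing time show up as the most frequent offender.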