Expected Goals (xG)

Introduction

Expected Goals (xG) is a statistical model that estimates the probability that a given goal-scoring opportunity will result in a goal. Instead of evaluating a match only by the number of goals scored, the model evaluates the quality of the scoring opportunities created. Each shot is assigned a value between 0 and 1, representing the probability that the attempt results in a goal. For example, a long-range shot with several defenders in front of the goal typically has a very low probability of scoring, while a shot taken close to the goal with a clear view of the goal mouth has a much higher one.

By assigning probabilities to individual shot events, the model gives a more reliable measure of attacking performance than goal counts alone. It allows analysts to distinguish between teams that consistently create high-quality opportunities and teams that rely on low-probability shots.

At KoraStats, the Expected Goals model was developed from historical match data using statistical modeling techniques. It is based on several characteristics describing the shot event and the situation in which it occurs.

Data and Dataset

The Expected Goals model was developed using match data collected from the Egyptian Premier League. The dataset contains goal-scoring opportunities recorded across four seasons of competition. Each observation in the dataset represents a single shot event taken during a match. The dataset contains a total of 25,716 shot events.
Season        Number of Events
2015 / 2016   6,668
2016 / 2017   7,026
2017 / 2018   7,109
2018 / 2019   4,913
The final season contains data collected until April 2, 2019. Each shot event contains a set of attributes describing the characteristics of the opportunity. These attributes allow the model to analyze how different variables influence the probability of scoring. The dataset includes information such as:
  • shot location on the pitch
  • type of scoring opportunity
  • contextual match information
  • outcome of the shot (goal or no goal)
Because goals are relatively rare compared to the total number of shot attempts, the dataset exhibits a strong class imbalance. Most shots do not result in goals, meaning that the number of non-goal observations significantly exceeds the number of goal observations. To address this issue during model development, the dataset was balanced using down-sampling techniques, where the number of non-goal observations used during training was reduced. This helps the logistic regression model learn meaningful relationships between the variables and the scoring outcome.
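The down-sampling step described above can be sketched as follows. This is a minimal illustration rather than the KoraStats implementation: the record layout (an `is_goal` key), the `ratio` parameter, and its 1:1 default are assumptions, since the source does not state the ratio that was used.

```python
import random


def downsample_non_goals(shots, ratio=1.0, seed=0):
    """Keep every goal and a random subset of the non-goals.

    shots: list of dicts with a truthy/falsy "is_goal" key (hypothetical schema).
    ratio: non-goals kept per goal (assumed value, not taken from the source).
    """
    goals = [s for s in shots if s["is_goal"]]
    non_goals = [s for s in shots if not s["is_goal"]]

    # Never ask for more non-goals than are available.
    keep = min(len(non_goals), int(round(ratio * len(goals))))

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = goals + rng.sample(non_goals, keep)
    rng.shuffle(balanced)
    return balanced


# Toy data: 10 goals among 100 shots.
shots = [{"shot_id": i, "is_goal": i < 10} for i in range(100)]
balanced = downsample_non_goals(shots)
```

With the toy data, all 10 goals are kept and the 90 non-goals are reduced to 10, so the training set contains the two classes in equal numbers.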

Methodology

The Expected Goals model was implemented using logistic regression, which is commonly used for binary classification problems. In the context of Expected Goals modeling, each shot event can produce one of two possible outcomes:
Outcome   Description
Goal      The shot resulted in a goal
No Goal   The shot did not result in a goal
Logistic regression estimates the probability that the outcome of a shot event belongs to the Goal class. The probability is calculated using the logistic function:
$$P(goal) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}}$$
Where:
Symbol     Meaning
P(goal)    probability that the shot results in a goal
β0         intercept
β1 … βn    regression coefficients
X1 … Xn    model variables
These variables describe the characteristics of the scoring opportunity.
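To show how the logistic function turns a weighted sum of variables into a probability, the hypothetical helper `goal_probability` below evaluates it directly. The coefficient values are placeholders chosen for illustration only; the fitted KoraStats coefficients are not given in the source.

```python
import math


def goal_probability(intercept, coefficients, features):
    """Evaluate P(goal) = 1 / (1 + exp(-(b0 + b1*x1 + ... + bn*xn)))."""
    if len(coefficients) != len(features):
        raise ValueError("coefficients and features must have the same length")

    z = intercept + sum(b * x for b, x in zip(coefficients, features))

    # Two algebraically equal forms, chosen so exp() never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Placeholder numbers, not the fitted model.
p = goal_probability(intercept=0.0, coefficients=[-1.0, 0.5], features=[2.0, 1.0])
```

Whatever the coefficients, the result always lies strictly between 0 and 1, and a weighted sum of zero maps to a probability of 0.5.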

Distance to Goal Mouth

One of the most important variables used in the model is the distance between the shot location and the goal mouth. The center of the goal is defined as the coordinate:
(50, 0)
If a shot is taken from location (x, y), the distance to the goal mouth is calculated using the Euclidean distance formula:
$$Distance(x, y) = \sqrt{(50 - x)^2 + (0 - y)^2}$$
Distance has a strong influence on the probability of scoring. As the distance between the shooter and the goal increases, the probability of scoring decreases. Instead of using the raw distance directly, the model uses the logarithm of the distance, which provides better statistical behavior for the regression model.
$$LogDistance = \log(Distance)$$
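The distance and log-distance calculations can be sketched as below. The function names are hypothetical, and two details are assumptions because the source does not specify them: the logarithm is taken to be the natural logarithm, and a shot located exactly at the goal center (distance 0) is rejected rather than transformed.

```python
import math

GOAL_CENTER = (50.0, 0.0)  # goal-center coordinate given in the text


def distance_to_goal(x, y):
    """Euclidean distance from the shot location (x, y) to the goal center."""
    return math.hypot(GOAL_CENTER[0] - x, GOAL_CENTER[1] - y)


def log_distance(x, y):
    """Natural log of the distance (log base is an assumption)."""
    d = distance_to_goal(x, y)
    if d <= 0.0:
        # log(0) is undefined; how the real model treats this case is unknown.
        raise ValueError("shot location coincides with the goal center")
    return math.log(d)


d = distance_to_goal(47.0, 4.0)   # offsets of 3 and 4 give a distance of 5
ld = log_distance(47.0, 4.0)
```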

Shooting Angle

Another important variable is the shooting angle, which represents how much of the goal mouth is visible from the position of the shot. The shooting angle is defined as the angle between the lines connecting the shot location to the two goal posts. The coordinates of the goal posts are defined as:
Left Post  = (50, 4)
Right Post = (50, -4)
The width of the goal used in the model is:
GoalWidth = 8
The angle is calculated using the distances between the shot location and each of the goal posts.
$$Angle = \cos^{-1}\left(\frac{d_L^2 + d_R^2 - GoalWidth^2}{2 \times d_L \times d_R}\right)$$
Where:
Variable   Description
d_L        distance from shot location to the left post
d_R        distance from shot location to the right post
A larger shooting angle means that a greater portion of the goal is visible to the player, which increases the probability of scoring.
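The angle formula is an application of the law of cosines to the triangle formed by the shot location and the two posts. A minimal sketch follows; the function name is hypothetical, the result is returned in degrees on the assumption that it feeds the degree-based angle ranges used by the model, and rejecting a shot located exactly on a post is also an assumption.

```python
import math

LEFT_POST = (50.0, 4.0)    # post coordinates given in the text
RIGHT_POST = (50.0, -4.0)
GOAL_WIDTH = 8.0


def shooting_angle(x, y):
    """Angle in degrees subtended by the goal mouth at the shot location."""
    d_left = math.hypot(LEFT_POST[0] - x, LEFT_POST[1] - y)
    d_right = math.hypot(RIGHT_POST[0] - x, RIGHT_POST[1] - y)
    if d_left == 0.0 or d_right == 0.0:
        # The formula divides by both distances; this case is not covered by the source.
        raise ValueError("shot location coincides with a goal post")

    cos_angle = (d_left**2 + d_right**2 - GOAL_WIDTH**2) / (2.0 * d_left * d_right)
    # Guard against rounding pushing the value slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle))


central = shooting_angle(46.0, 0.0)    # 4 units straight in front of the goal center
collinear = shooting_angle(50.0, 10.0)  # in line with both posts
```

In this example the central shot sees the goal under a right angle, while the shot in line with both posts sees no goal mouth at all and gets an angle of zero.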

[Figure: shooting angle diagram showing the shot location (x, y), the left goal post (50, 4), the right goal post (50, -4), the two lines connecting the shot to each post, and the angle between those lines.]

Angle Buckets

To better capture the relationship between shooting angle and scoring probability, the model groups angles into discrete ranges known as angle buckets.
Bucket   Angle Range
0        0° – 10°
1        10° – 20°
2        20° – 30°
3        30° – 40°
4        40° – 50°
4+       ≥ 50°
These categories allow the model to capture non-linear relationships between angle and scoring probability.
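Assigning an angle to a bucket can be sketched as below. The bucket labels follow the table above. Two points are assumptions, since the source does not state them: each range is treated as including its lower bound and excluding its upper bound, and the buckets enter the regression as one indicator variable per bucket (one-hot encoding).

```python
BUCKET_LABELS = ["0", "1", "2", "3", "4", "4+"]  # labels as listed in the table


def angle_bucket(angle_deg):
    """Map an angle in degrees to its bucket label (half-open ranges assumed)."""
    if angle_deg < 0:
        raise ValueError("angle must be non-negative")
    if angle_deg >= 50:
        return "4+"
    return str(int(angle_deg // 10))


def one_hot_bucket(angle_deg):
    """One indicator per bucket; exactly one entry is 1 (encoding is an assumption)."""
    label = angle_bucket(angle_deg)
    return [1 if label == b else 0 for b in BUCKET_LABELS]


encoded = one_hot_bucket(23.5)  # falls in the 20° – 30° range
```

Because each bucket gets its own coefficient, the fitted effect of the angle does not have to rise or fall at a constant rate, which is how bucketing captures a non-linear relationship.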

Chance Type Variables

The model also incorporates several binary variables describing the type of scoring opportunity. Each variable takes the value:
1 = true
0 = false
Variable     Description
isHeader     shot was taken with the head
isPenalty    shot resulted from a penalty kick
isOneOnOne   player was in a one-on-one situation with the goalkeeper
isFreeKick   shot resulted from a direct free kick
These variables allow the model to capture systematic differences between types of chances.
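As a small illustration of the 1/0 encoding, the sketch below converts a shot record into the four chance-type indicators in a fixed order. The variable names come from the table above; the dictionary-based record layout, the helper name, and the choice to treat a missing flag as false are assumptions.

```python
CHANCE_TYPE_FLAGS = ("isHeader", "isPenalty", "isOneOnOne", "isFreeKick")


def encode_chance_type(shot):
    """Return the chance-type flags as a list of 1/0 values in a fixed order.

    shot: dict that may contain any of the flags (hypothetical record layout).
    A missing flag is treated as false, which is an assumption.
    """
    return [1 if shot.get(flag, False) else 0 for flag in CHANCE_TYPE_FLAGS]


header_shot = encode_chance_type({"isHeader": True})
penalty_shot = encode_chance_type({"isPenalty": True, "isHeader": False})
```

Keeping the order fixed matters because each position in the list must line up with the same regression coefficient for every shot.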

Summary

The KoraStats Expected Goals model was developed using historical match data from four seasons of the Egyptian Premier League. The model estimates the probability that a shot results in a goal using logistic regression and a set of variables describing the characteristics of the scoring opportunity. Key variables used by the model include:
  • distance to the goal mouth, entered as the logarithm of the distance
  • shooting angle
  • angle bucket classification
  • type of scoring opportunity
By combining these variables within a probabilistic framework, the model assigns a probability value to each shot attempt. These probability values represent the Expected Goals value of the opportunity, allowing KoraStats analysts to evaluate attacking performance based on the quality of chances created rather than solely on the number of goals scored.