(Kaggle) Optiver - Trading at the Close¶
This post is a summary of my participation in this Kaggle competition.
In general, it is a time series prediction competition with a heavy finance background, to the point that I spent a lot of time figuring out the finance terms.
At the time of writing, the final results haven't come out yet, but my final public score is 5.3341, ranking 186th. My code is available on my GitHub.
Some useful links for this competition
Overview of this competition: a quick way to get to know the competition.
Submission Demo: this notebook teaches you how to use the Optiver2023 API and submit your predictions.
Competition idea sharing: I think you can learn everything here (ideas, explanations of terms, background on Nasdaq, etc.). It's a post in Chinese.
Features as explained to a kid: this post explains the features in the training data using ChatGPT, but I think the explanations are still not very clear to a newbie.
Weights of the Synthetic Index: a brilliant finding about how to approximate the index using Linear Regression.
Baseline step by step¶
I copied this public notebook as my baseline code. The most impressive part of the code is the feature engineering, and we will go through all the derived features.
Feature Engineering¶
V1 - Baseline features¶
# V1
from itertools import combinations

prices = ["reference_price", "far_price", "near_price", "ask_price", "bid_price", "wap"]
sizes = ["matched_size", "bid_size", "ask_size", "imbalance_size"]

df["volume"] = df.eval("ask_size + bid_size")
df["mid_price"] = df.eval("(ask_price + bid_price) / 2")
df["liquidity_imbalance"] = df.eval("(bid_size - ask_size) / (bid_size + ask_size)")
df["matched_imbalance"] = df.eval("(imbalance_size - matched_size) / (matched_size + imbalance_size)")
df["size_imbalance"] = df.eval("bid_size / ask_size")

# Pairwise imbalance between every pair of price columns
for c in combinations(prices, 2):
    df[f"{c[0]}_{c[1]}_imb"] = df.eval(f"({c[0]} - {c[1]}) / ({c[0]} + {c[1]})")

# Triplet imbalance features (the numba-accelerated helper is shown below)
for c in [['ask_price', 'bid_price', 'wap', 'reference_price'], sizes]:
    triplet_feature = calculate_triplet_imbalance_numba(c, df)
    df[triplet_feature.columns] = triplet_feature.values
Here we can separate these features into three parts. The first part consists of simple combinations of bid/ask prices and bid/ask sizes. The second and third parts are combination features: the former are constructed from pairs of original features and the latter from triples.
For the triplet features, the generation code is shown below. It uses numba to accelerate the computation.
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def compute_triplet_imbalance(df_values, comb_indices):
    num_rows = df_values.shape[0]
    num_combinations = len(comb_indices)
    imbalance_features = np.empty((num_rows, num_combinations))
    for i in prange(num_combinations):
        a, b, c = comb_indices[i]
        for j in range(num_rows):
            max_val = max(df_values[j, a], df_values[j, b], df_values[j, c])
            min_val = min(df_values[j, a], df_values[j, b], df_values[j, c])
            mid_val = df_values[j, a] + df_values[j, b] + df_values[j, c] - min_val - max_val
            if mid_val == min_val:  # Prevent division by zero
                imbalance_features[j, i] = np.nan
            else:
                imbalance_features[j, i] = (max_val - mid_val) / (mid_val - min_val)
    return imbalance_features
About numba and prange
Numba is a Just-In-Time (JIT) compiler for Python that translates a portion of Python code into machine code at runtime. It is designed to accelerate numerical and scientific Python code, making it run at speeds comparable to compiled languages like C or Fortran.
The prange function in Numba is used for parallelizing loops. It is similar to Python's built-in range, but it is designed to work efficiently with Numba's parallel execution capabilities. When used in a parallel loop, prange can distribute the loop iterations across multiple threads, taking advantage of multicore processors to speed up computations.
The essential formula behind the triplet features is \(\frac{max\_val - mid\_val}{mid\_val - min\_val}\).
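As a quick, hypothetical sanity check of this formula (the numbers below are made up, not taken from the competition data): for ask_price = 1.02, bid_price = 0.98 and wap = 1.01, the imbalance is (1.02 - 1.01) / (1.01 - 0.98) ≈ 0.33.
# Hypothetical check of the triplet-imbalance formula (made-up values)
vals = [1.02, 0.98, 1.01]                            # e.g. ask_price, bid_price, wap
max_val, mid_val, min_val = sorted(vals, reverse=True)
print((max_val - mid_val) / (mid_val - min_val))     # ~0.3333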
V2 - Imbalance features¶
Next, we build some imbalance-related features, as shown below.
# V2
df["stock_weights"] = df["stock_id"].map(weights)
df["weighted_wap"] = df["stock_weights"] * df["wap"]
df['wap_momentum'] = df.groupby('stock_id')['weighted_wap'].pct_change(periods=6)
df["imbalance_momentum"] = df.groupby(['stock_id'])['imbalance_size'].diff(periods=1) / df['matched_size']
df["price_spread"] = df["ask_price"] - df["bid_price"]
df["spread_intensity"] = df.groupby(['stock_id'])['price_spread'].diff()
df['price_pressure'] = df['imbalance_size'] * (df['ask_price'] - df['bid_price'])
df['market_urgency'] = df['price_spread'] * df['liquidity_imbalance']
df['depth_pressure'] = (df['ask_size'] - df['bid_size']) * (df['far_price'] - df['near_price'])
df['spread_depth_ratio'] = (df['ask_price'] - df['bid_price']) / (df['bid_size'] + df['ask_size'])
df['mid_price_movement'] = df['mid_price'].diff(periods=5).apply(lambda x: 1 if x > 0 else (-1 if x < 0 else 0))
df['micro_price'] = ((df['bid_price'] * df['ask_size']) + (df['ask_price'] * df['bid_size'])) / (df['bid_size'] + df['ask_size'])
df['relative_spread'] = (df['ask_price'] - df['bid_price']) / df['wap']

for func in ["mean", "std", "skew", "kurt"]:
    df[f"all_prices_{func}"] = df[prices].agg(func, axis=1)
    df[f"all_sizes_{func}"] = df[sizes].agg(func, axis=1)

# V3
for col in ['matched_size', 'imbalance_size', 'reference_price', 'imbalance_buy_sell_flag']:
    for window in [1, 2, 3, 5, 10]:
        df[f"{col}_shift_{window}"] = df.groupby('stock_id')[col].shift(window)
        df[f"{col}_ret_{window}"] = df.groupby('stock_id')[col].pct_change(window)

for col in ['ask_price', 'bid_price', 'ask_size', 'bid_size',
            'wap', 'near_price', 'far_price']:
    for window in [1, 2, 3, 5, 10]:
        df[f"{col}_diff_{window}"] = df.groupby("stock_id")[col].diff(window)
Some advanced usage of pandas
As you can see in the code above, we use df.groupby, df.diff, df.agg and df.pct_change.
df.diff(periods=) calculates the difference between \(x_{i}\) and \(x_{i-periods}\), while df.pct_change(periods=) calculates the relative change \(\frac{x_i - x_{i-periods}}{x_{i-periods}}\).
df.agg("mean | std | skew | kurt", axis=) aggregates the data: we can compute the mean, standard deviation, skewness and kurtosis of a group of values. Among these statistics, skewness is the standardized 3rd central moment and kurtosis is the standardized 4th central moment.
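To make these methods concrete, here is a minimal toy example (purely synthetic data, not from the competition) showing how they behave on grouped data:
import pandas as pd

# Toy frame: two stocks, three time steps each (made-up numbers)
toy = pd.DataFrame({
    "stock_id": [0, 0, 0, 1, 1, 1],
    "wap":      [1.00, 1.02, 1.01, 2.00, 1.98, 2.02],
})

# diff: x_i - x_{i-1} within each stock
print(toy.groupby("stock_id")["wap"].diff(periods=1))

# pct_change: (x_i - x_{i-1}) / x_{i-1} within each stock
print(toy.groupby("stock_id")["wap"].pct_change(periods=1))

# agg with axis=1: row-wise statistics across several columns
toy["bid_price"] = toy["wap"] * 0.999
toy["ask_price"] = toy["wap"] * 1.001
print(toy[["bid_price", "ask_price"]].agg("mean", axis=1))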
V3 - Global features¶
We can also add some global features at the stock_id level, as well as features in terms of time.
global_stock_id_feats = {
    "median_size": df_train.groupby("stock_id")["bid_size"].median() + df_train.groupby("stock_id")["ask_size"].median(),
    "std_size": df_train.groupby("stock_id")["bid_size"].std() + df_train.groupby("stock_id")["ask_size"].std(),
    "ptp_size": df_train.groupby("stock_id")["bid_size"].max() - df_train.groupby("stock_id")["bid_size"].min(),
    "median_price": df_train.groupby("stock_id")["bid_price"].median() + df_train.groupby("stock_id")["ask_price"].median(),
    "std_price": df_train.groupby("stock_id")["bid_price"].std() + df_train.groupby("stock_id")["ask_price"].std(),
    "ptp_price": df_train.groupby("stock_id")["bid_price"].max() - df_train.groupby("stock_id")["ask_price"].min(),
}
for key, value in global_stock_id_feats.items():
    df[f"global_{key}"] = df["stock_id"].map(value.to_dict())

df["dow"] = df["date_id"] % 5
df["dom"] = df["date_id"] % 20
df["seconds"] = df["seconds_in_bucket"] % 60
df["minute"] = df["seconds_in_bucket"] // 60
Model Training¶
We use a single LightGBM model. First we divide the training dataset into 5 folds and apply an innovative CV strategy for this temporal prediction problem, namely Purged K-Fold Validation. You can find the details in the section Some tricks.
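Below is a minimal sketch of the training loop, assuming the engineered features are collected in feature_cols and the fold masks come from the purged K-Fold split described in Some tricks; the parameter values are illustrative, not my exact configuration.
import lightgbm as lgb

# Illustrative parameters; the actual submissions tuned n_estimators,
# subsample and colsample_bytree as noted in the version history below.
params = {
    "objective": "regression_l1",   # the competition metric is MAE
    "n_estimators": 5000,
    "learning_rate": 0.01,
    "subsample": 0.7,
    "colsample_bytree": 0.7,
}

models = []
for train_idx, valid_idx in folds:   # folds: list of (train, valid) boolean masks
    model = lgb.LGBMRegressor(**params)
    model.fit(
        df.loc[train_idx, feature_cols], df.loc[train_idx, "target"],
        eval_set=[(df.loc[valid_idx, feature_cols], df.loc[valid_idx, "target"])],
        callbacks=[lgb.early_stopping(100), lgb.log_evaluation(500)],
    )
    models.append(model)
At inference time, the per-fold models' predictions can simply be averaged.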
Updates in later versions¶
I made some small changes on top of the baseline code. Here are some notes.
Calculate the synthetic index¶
Here the target can be calculated using the formula below: $$ Target = (\frac{StockWAP_{t+60}}{StockWAP_t} - \frac{IndexWAP_{t+60}}{IndexWAP_{t}}) * 10000 $$
Assume that \(\frac{IndexWAP_{t+60}}{IndexWAP_t}\) can be expressed as a linear combination of the \(\frac{StockWAP_{t+60}}{StockWAP_t}\) terms of all stocks. So we can derive the following formula: $$ \frac{IndexWAP_{t+60}}{IndexWAP_t} = a \times \frac{StockWAPA_{t+60}}{StockWAPA_t} + b \times \frac{StockWAPB_{t+60}}{StockWAPB_t} + \cdots $$
This discussion has already explained the detailed process of calculating the weights \(a, b, \dots\). Of course, this can also serve as a method to estimate the Index.
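The sketch below is my own rough paraphrase of that idea, assuming df_train contains wap and target per (stock_id, date_id, seconds_in_bucket); column and variable names are illustrative. It recovers the index return from the target definition, pivots the per-stock returns into a matrix, and fits a non-negative linear regression without an intercept.
import pandas as pd
from sklearn.linear_model import LinearRegression

d = df_train.copy()

# 60-second-ahead WAP ratio per stock (rows are 10 seconds apart, so shift by 6)
d["stock_ret"] = d.groupby(["stock_id", "date_id"])["wap"].shift(-6) / d["wap"]

# From target = (stock_ret - index_ret) * 10000  =>  index_ret = stock_ret - target / 10000
d["index_ret"] = d["stock_ret"] - d["target"] / 10_000

# One row per timestamp: stock returns as columns, the (shared) index return as the label
X = d.pivot_table(index=["date_id", "seconds_in_bucket"], columns="stock_id", values="stock_ret")
y = d.groupby(["date_id", "seconds_in_bucket"])["index_ret"].mean()
mask = X.notna().all(axis=1) & y.notna()

reg = LinearRegression(fit_intercept=False, positive=True)
reg.fit(X.loc[mask], y.loc[mask])
weights = pd.Series(reg.coef_, index=X.columns)   # approximate index weight per stock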
Version 1 (LB: 5.3366)¶
The baseline code scored 5.3365 on the public leaderboard.
Here we change LGBMRegressor's n_estimators from 5000 to 7000.
Version 2 (LB: 5.3341)¶
This is one of my final submissions.
Add some new features:
for col in ['wapdiff', 'wap']:
    for window in [1, 2, 3, 5, 10]:
        df[f"{col}_shift_{window}"] = df.groupby('stock_id')[col].shift(window)
        df[f"{col}_ret_{window}"] = df.groupby('stock_id')[col].pct_change(window)

# Calculate diff features for specific columns
for col in ['weighted_wap', 'price_spread', 'wapdiff']:
    for window in [1, 2, 3, 5, 10]:
        df[f"{col}_diff_{window}"] = df.groupby("stock_id")[col].diff(window)
We also change n_estimators to 6300, subsample to 0.7, and colsample_bytree to 0.7.
Version 3 (LB: 5.3389)¶
This is one of my final submissions.
I replaced the purged K-Fold with an ordinary K-Fold.
Some tricks¶
Just like in the last competition I took part in, I learned some useful tricks and tips in Optiver.
Reduce Memory¶
Data processing always consumes a lot of memory, but we can use the method below to downcast the dtype of each column and reduce memory usage.
import logging

import numpy as np

logger = logging.getLogger(__name__)

def reduce_mem_usage(df, verbose=0):
    """
    Iterate through all numeric columns of a dataframe and modify the data type
    to reduce memory usage.
    """
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                # Downcast integers to the smallest type that holds the value range
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # All float branches cast to float32 (float16 is not used here)
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float32)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float32)
    if verbose:
        logger.info(f"Memory usage of dataframe is {start_mem:.2f} MB")
        end_mem = df.memory_usage().sum() / 1024**2
        logger.info(f"Memory usage after optimization is: {end_mem:.2f} MB")
        decrease = 100 * (start_mem - end_mem) / start_mem
        logger.info(f"Decreased by {decrease:.2f}%")
    return df
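Typical usage, assuming logging is configured so that the messages are visible:
logging.basicConfig(level=logging.INFO)
df = reduce_mem_usage(df, verbose=1)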
Purged K-Fold Validation¶
The standard practice in time-series analysis is to use rolling or expanding window cross-validation. This strategy, however, divides the entire dataset into five distinct folds based on date_id, which spans 480 days in total.
A "purge" period is a gap around the boundary between the training and validation sets, meant to prevent information from the validation set leaking into the training set. For each fold, one block of dates serves as the validation set, the remaining dates serve as the training set, and the purged dates around the block's boundaries are excluded from both.
# 480 days split into 5 folds (values taken from the description above)
num_folds = 5
fold_size = 480 // num_folds
date_ids = df["date_id"].values   # one entry per row

for i in range(num_folds):
    start = i * fold_size
    end = start + fold_size

    # Define the purged ranges around the fold boundaries
    purged_before_start = start - 2
    purged_before_end = start + 2
    purged_after_start = end - 2
    purged_after_end = end + 2

    # Rows whose date_id falls inside either purged range
    purged_set = ((date_ids >= purged_before_start) & (date_ids <= purged_before_end)) | \
                 ((date_ids >= purged_after_start) & (date_ids <= purged_after_end))

    # Validation (test) indices: the fold's date range, excluding the purged set
    test_indices = (date_ids >= start) & (date_ids < end) & ~purged_set
    # Training indices: everything else, also excluding the purged set
    train_indices = ~test_indices & ~purged_set
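Inside the loop, these boolean masks can then be used to slice out each fold's training and validation data, for example (a sketch reusing the illustrative feature_cols and params from the model training section above):
    # Still inside the fold loop
    X_train, y_train = df.loc[train_indices, feature_cols], df.loc[train_indices, "target"]
    X_valid, y_valid = df.loc[test_indices, feature_cols], df.loc[test_indices, "target"]

    model = lgb.LGBMRegressor(**params)
    model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)],
              callbacks=[lgb.early_stopping(100)])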