Project for Financial Engineering Club at UIUC - Spring 2025 semester
I am currently working on this locally on my PC with an NVIDIA GeForce RTX 4060. I am using PyTorch 2.6.0+cu126 (the build for CUDA 12.6).
The goal of this project is to develop a machine learning-based system for pricing European options and estimating implied volatility using freely available data from OptionDX.com. The project applies Random Forest, XGBoost, and Neural Network models to historical option data to predict option prices and implied volatility.
- Data Collection: Gather historical European option data from OptionDX.com (free), including strike price, expiration date, underlying asset price, and other relevant variables. SPX options are European-style options on the S&P 500 Index.
- Model Development: Implement three machine learning models:
  - Random Forest: a tree-based ensemble model, useful for feature importance analysis.
  - XGBoost: an ensemble model that implements efficient gradient boosting.
  - Neural Networks (e.g., Multi-Layer Perceptron): for capturing complex non-linear patterns in option pricing.
- Model Evaluation: Compare the performance of the models using metrics such as RMSE, R-squared, and computation time.
- Visualization: Create visualizations that compare model predictions with actual implied volatility surfaces.
- Documentation & Deployment: Document findings and deploy a user-friendly interface for real-world application.
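
The Model Evaluation step can be sketched as below. This is a minimal illustration rather than the project's actual evaluation code: `evaluate` and `predict_fn` are hypothetical names, the metrics are written with the standard library only, and the implied volatility values are made up for demonstration.

```python
import math
import time


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def evaluate(predict_fn, features, y_true):
    """Time one prediction pass and score it.

    predict_fn is a hypothetical stand-in for a fitted model's predict method.
    """
    start = time.perf_counter()
    y_pred = predict_fn(features)
    elapsed = time.perf_counter() - start
    return {
        "rmse": rmse(y_true, y_pred),
        "r2": r_squared(y_true, y_pred),
        "seconds": elapsed,
    }


# Made-up implied volatilities, purely for illustration (not project results).
iv_true = [0.18, 0.20, 0.22, 0.25]
iv_pred = [0.19, 0.20, 0.21, 0.26]
scores = evaluate(lambda rows: iv_pred, features=[None] * 4, y_true=iv_true)
print(scores)
```

Calling `evaluate` once per model on the same held-out set would give directly comparable RMSE, R-squared, and prediction-time figures.
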
- A robust machine learning framework for European option pricing constructed in Python (Google Colab link).
- Insights into the relative performance of different algorithms for financial data analysis.
- A deployable tool for traders and risk managers to estimate implied volatility quickly.
- Compare the machine learning approaches with numerical methods for calculating implied volatility (numerically inverting the Black-Scholes model). Is there a speed advantage to using a machine learning model trained on this data over the numerical method? Is there a loss of accuracy?
- Compare the machine learning models when trained on training sets of different sizes. How much accuracy is gained or lost as the training set size changes? What is the trade-off in additional training time?
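
For reference, the numerical baseline mentioned above could look like the following sketch. It assumes a European call with no dividends and inverts the Black-Scholes formula by bisection; the function names, the bracketing interval `[lo, hi]`, and the tolerance are illustrative choices, not settings taken from the project. Other root finders, such as Newton's method using vega, are common alternatives.

```python
import math


def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call_price(spot, strike, rate, sigma, maturity):
    """Black-Scholes price of a European call (assumes no dividends)."""
    vol_sqrt_t = sigma * math.sqrt(maturity)
    d1 = (math.log(spot / strike) + (rate + 0.5 * sigma ** 2) * maturity) / vol_sqrt_t
    d2 = d1 - vol_sqrt_t
    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)


def implied_vol_bisection(price, spot, strike, rate, maturity,
                          lo=1e-6, hi=5.0, tol=1e-8, max_iter=200):
    """Solve bs_call_price(sigma) = price for sigma by bisection.

    Works because the call price is increasing in sigma.
    """
    price_lo = bs_call_price(spot, strike, rate, lo, maturity)
    price_hi = bs_call_price(spot, strike, rate, hi, maturity)
    if not price_lo <= price <= price_hi:
        raise ValueError("price is not attainable for sigma in [lo, hi]")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if bs_call_price(spot, strike, rate, mid, maturity) < price:
            lo = mid  # model price too low, so volatility must be higher
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


# Round trip with illustrative inputs: price a call at 20% vol, then recover the vol.
call_price = bs_call_price(spot=100.0, strike=100.0, rate=0.05, sigma=0.2, maturity=1.0)
recovered_vol = implied_vol_bisection(call_price, spot=100.0, strike=100.0, rate=0.05, maturity=1.0)
print(call_price, recovered_vol)
```

Timing this solver over the same option quotes that a trained model predicts on (for example with `time.perf_counter`) would give one way to measure the speed and accuracy trade-off.
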