From Idea to Execution: Building an Algorithmic Crypto Trading Bot (Part 2)
Crafting the Foundations — Building Utilities and First Attempts
In the previous article, we explored why I started this crypto trading bot project and how I gathered and structured historical data. Now it’s time to see how everything fits together. This second part focuses on the behind-the-scenes utilities — like the data loading pipeline and the initial attempt to train and validate a trading strategy.
Why Focus on Utilities?
Building a trading bot isn’t just about writing a fancy algorithm. You need a robust foundation to handle:
- Data loading and preprocessing
- Configuration management (so you don’t hardcode everything)
- Logging of runs, errors, and performance metrics
- Backtesting so you can gauge how well your bot might do in real markets
I discovered these essentials the hard way. The more my project grew, the more crucial it became to have reusable utilities I could rely on — especially once I moved toward distributed training (we’ll get to that in Part 4).
DataLoader: The Heart of Data Management
One of the first utilities I built was a dedicated class to manage data import, resampling, and indicator calculations. Here's a snippet from src/data_loader.py that shows how I load 1-minute data and resample it into multiple timeframes:
# data_loader.py (excerpt)
import os

import pandas as pd
import pandas_ta as ta

class DataLoader:
    def __init__(self):
        self.tick_data = {}
        self.timeframes = ['1min', '5min', '15min', '30min', '1h', '4h', '1d']
        self.base_timeframe = '1min'
        self.data_folder = 'output_parquet/'

    def import_ticks(self):
        # Load the 1-minute base data and index it by timestamp
        tick_data_1m = pd.read_parquet(
            os.path.join(self.data_folder, 'BTCUSDT-tick-1min.parquet')
        )
        tick_data_1m['timestamp'] = pd.to_datetime(tick_data_1m['timestamp'])
        tick_data_1m.set_index('timestamp', inplace=True)
        tick_data_1m.sort_index(inplace=True)
        # (Additional date filtering omitted for brevity)
        self.tick_data[self.base_timeframe] = tick_data_1m

    def resample_data(self):
        # Derive every higher timeframe from the 1-minute base data
        base_data = self.tick_data[self.base_timeframe]
        for tf in self.timeframes:
            if tf == self.base_timeframe:
                continue
            resampled_data = base_data.resample(tf).agg({
                'open': 'first',
                'high': 'max',
                'low': 'min',
                'close': 'last',
                'volume': 'sum'
            }).dropna()
            self.tick_data[tf] = resampled_data
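For context, here is how these two methods fit together in practice. A minimal usage sketch, assuming the Parquet file from above is in place:

loader = DataLoader()
loader.import_ticks()    # load the 1-minute base data from Parquet
loader.resample_data()   # derive the 5min/15min/.../1d views

# Each timeframe is now its own OHLCV DataFrame
df_4h = loader.tick_data['4h']
print(df_4h.tail())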
Memory Constraints
One challenge I faced early on was RAM usage. Loading a full year of 1-second or 1-minute data with dozens of indicators can easily exceed the limits of a typical desktop. To mitigate this:
- I used Parquet (columnar format) to store compressed data.
- I selectively loaded timeframes (not all at once; see the sketch after this list).
- I used distributed methods later when I needed to scale (Part 4 discusses that in detail).
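To illustrate that selective-loading point: because Parquet is columnar, you can pull in only the columns you actually need rather than the whole file. A minimal sketch, assuming the OHLCV schema from Part 1 (adjust the column names to your own file):

import pandas as pd

# Read only the OHLCV columns; any extra indicator columns stay on disk
df = pd.read_parquet(
    'output_parquet/BTCUSDT-tick-1min.parquet',
    columns=['timestamp', 'open', 'high', 'low', 'close', 'volume'],
)

# Downcasting 64-bit floats to 32-bit roughly halves memory usage
float_cols = df.select_dtypes('float64').columns
df[float_cols] = df[float_cols].astype('float32')

print(f"{df.memory_usage(deep=True).sum() / 1e6:.1f} MB in memory")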
Configuration: A Single Source of Truth
Rather than hardcode dates, file paths, or hyperparameters, I created a YAML configuration file and a helper class Config in src/config_loader.py:
# config_loader.py (excerpt)
import yaml
import os

class Config:
    def __init__(self, config_path: str = 'config.yaml'):
        if not os.path.exists(config_path):
            raise FileNotFoundError(f"Configuration file {config_path} not found.")
        with open(config_path, 'r') as file:
            self.config = yaml.safe_load(file)

    def get(self, section: str, key: str = None):
        # Return a single key within a section, or the whole section
        if key:
            return self.config.get(section, {}).get(key)
        return self.config.get(section, {})
This approach let me quickly change the start or end date, tweak model thresholds, or adjust timeframes without digging through multiple scripts.
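To make that concrete, here is a short usage sketch. The YAML shown in the comment is hypothetical (the section and key names are illustrative, not my exact file), and the import path depends on your project layout:

from src.config_loader import Config

# A hypothetical config.yaml:
#
# data:
#   start_date: '2022-01-01'
#   end_date: '2022-12-31'
# strategy:
#   threshold_buy: 0.6
#   threshold_sell: 0.6

config = Config('config.yaml')
start_date = config.get('data', 'start_date')   # single key
strategy_cfg = config.get('strategy')           # whole section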
First Attempt at a Trading Strategy
At this stage, I wanted a neural network (or simple model) that looked at indicators and decided whether to buy or sell. The naive approach:
- Two Models — One model for “buy signals,” another for “sell signals” (a training sketch follows this list).
- Threshold — If the buy model output was above a threshold, we went long. If the sell model output was above a threshold, we closed or reversed the position.
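Before wiring these into a strategy class, the two models themselves can start out as simple scikit-learn classifiers. A minimal sketch of the idea, with dummy data standing in for real indicators and labels (the labeling scheme here is illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Dummy stand-ins: in practice X holds indicator values per timestep,
# and y_buy / y_sell flag timesteps where entering / exiting paid off
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y_buy = (rng.random(1000) > 0.5).astype(int)
y_sell = (rng.random(1000) > 0.5).astype(int)

model_buy = LogisticRegression(max_iter=1000).fit(X, y_buy)
model_sell = LogisticRegression(max_iter=1000).fit(X, y_sell)

# Probabilities that get compared against the strategy thresholds
buy_prob = model_buy.predict_proba(X[-1:])[0, 1]
sell_prob = model_sell.predict_proba(X[-1:])[0, 1]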
Snippet: TradingStrategy (First Iteration)
Below is a condensed excerpt of the TradingStrategy class from src/trading_strategy.py, illustrating how I attempted to manage buy/sell signals:
# trading_strategy.py (excerpt)
class TradingStrategy:
    def __init__(self, data_loader, config, model_buy, model_sell):
        self.data_loader = data_loader
        self.config = config
        self.model_buy = model_buy
        self.model_sell = model_sell
        # Config.get takes (section, key); the 'strategy' section name is
        # illustrative, with fallbacks applied when a key is missing
        self.threshold_buy = config.get('strategy', 'threshold_buy') or 0.6
        self.threshold_sell = config.get('strategy', 'threshold_sell') or 0.6
        # Set up initial balances, last price, etc.
        self.initial_balance = config.get('strategy', 'initial_balance') or 100000000
        self.available_balance = self.initial_balance
        # More initialization...

    def calculate_profit(self):
        # Simulate trades until read_next_trade() signals end of data (-1)
        while self.last_price >= 0:
            self.read_next_trade()
            if self.last_price == -1:
                return self.write_orders()
            # Evaluate indicators for this timestep
            if self.current_timestamp in self.features.index:
                indicator_values = self.features.loc[self.current_timestamp].values
                buy_prob = self.analyze_buy(indicator_values)
                sell_prob = self.analyze_sell(indicator_values)
                if self.current_order['quantity'] == 0 and buy_prob > self.threshold_buy:
                    self.place_order_buy(self.order_qty)
                elif self.current_order['quantity'] != 0 and sell_prob > self.threshold_sell:
                    self.place_order_sell(abs(self.current_order['quantity']))
        return self.write_orders()
Why This Approach Stalled
I ran into a few hurdles:
- Training Issues: The models weren't producing reliable buy/sell signals; they needed better-labeled data or a more robust reinforcement learning approach.
- Indicator Explosion: Loading too many indicators simultaneously bogged down memory and CPU time, especially for year-long data.
- Overfitting: The model occasionally memorized segments of historical data, leading to poor generalization.
Validating Performance
Even with an imperfect strategy, I still needed a quick way to measure performance. My validation consisted of:
- Genetic Algorithm Fitness: Each “individual” in the GA population represented a unique set of indicator parameters (e.g., RSI length, MACD fast/slow).
- Neural Network-Level Backtest: For each individual's configuration, I ran a simulation. The Simulation class in src/simulation.py simply calls the strategy's calculate_profit():
# src/simulation.py
class Simulation:
    def __init__(self, strategy):
        self.strategy = strategy

    def run(self):
        result = self.strategy.calculate_profit()
        return result
Result: The final account balance (or total profit) after the backtest became the “fitness” metric.
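To show how the pieces connect, here is a minimal sketch of that fitness loop with the geneticalgorithm library. The parameter bounds are illustrative, build_models_from_params is a hypothetical helper that maps one individual's parameters to trained buy/sell models, and I assume run() returns the final balance:

import numpy as np
from geneticalgorithm import geneticalgorithm as ga

def fitness(params):
    # params: e.g. [rsi_length, macd_fast, macd_slow] for one individual
    model_buy, model_sell = build_models_from_params(params)  # hypothetical
    strategy = TradingStrategy(data_loader, config, model_buy, model_sell)
    final_balance = Simulation(strategy).run()
    # The library minimizes, so negate the profit
    return -(final_balance - strategy.initial_balance)

# Illustrative bounds: RSI length, MACD fast, MACD slow
varbound = np.array([[5, 30], [5, 20], [20, 60]])
model = ga(function=fitness,
           dimension=3,
           variable_type='int',
           variable_boundaries=varbound)
model.run()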
Early Results
- The GA framework generated many parameter sets, but the neural net didn't converge on meaningful buy/sell signals. I had to re-think how to combine these two methods, ultimately deciding to switch from the geneticalgorithm library to DEAP and from a naive feed-forward approach to a reinforcement learning agent (more on that in Part 3).
Key Libraries & Dependencies
- pandas & pandas_ta: For data manipulation and indicator calculation.
- geneticalgorithm (later replaced by DEAP): Handled parameter tuning via evolutionary approaches.
- PyArrow & Parquet: Efficient file formats for reading/writing large datasets.
- NumPy: For numerical computations and array manipulations.
- scikit-learn: For initial “buy vs. sell” classification models (e.g., LogisticRegression).
Lessons Learned (So Far)
- Utility Organization: Creating separate files/classes (data_loader.py, config_loader.py, etc.) saved me endless hours of troubleshooting.
- Memory Limitations Are Real: Precomputing indicators for large timeframes can be resource-intensive.
- Naive Buy/Sell Approaches: Basic threshold-based neural nets may not be enough. A more advanced or dedicated RL approach can pay off.
- Logging: Although not shown in detail here, logging to a dedicated file or console stream is invaluable for debugging runs, especially when you move to multi-processing or distributed setups (a minimal setup is sketched below).
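For reference, here is a minimal sketch of the kind of logging setup I mean, using only Python's standard library (the file name and format string are illustrative):

import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(processName)s %(levelname)s %(message)s',
    handlers=[
        logging.FileHandler('trading_bot.log'),  # persistent run log
        logging.StreamHandler(),                 # mirror to the console
    ],
)

logger = logging.getLogger(__name__)
logger.info('Backtest started with balance=%s', 100_000_000)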
Wrapping Up Part 2
In this article, we covered how I set up the foundational utilities — data loading, configuration management, and my first attempt at a naive buy/sell neural network. Despite early limitations, these structures proved essential as the project evolved.
Up Next (Part 3): I'll share how I swapped out the geneticalgorithm library for DEAP and built a more sophisticated reinforcement learning agent. That pivot changed the entire trajectory of the bot's evolution—one step closer to a truly adaptive trading system!
What’s Next?
- Part 3: Leveling Up — Replacing Genetic Algorithm with DEAP and Reinforcement Learning
- Part 4: Scaling Up — Distributed Training for Maximum Efficiency
Have questions about structuring your utilities or want to share your own first attempts? Feel free to drop a comment below!