From Idea to Execution: Building an Algorithmic Crypto Trading Bot (Part 1)

From Concept to Data — Preparing to Train a Crypto Trading Bot

Teo Miscia
5 min read · Jan 4, 2025
Photo by Dylan Calluy on Unsplash

Have you ever wanted to build a trading bot that not only trades crypto automatically but also optimizes its own strategy? That’s precisely what this series aims to show. In Part 1, we’ll explore the motivation behind this bot and walk through how we gather, merge, and preprocess the massive amounts of historical crypto data required to train a machine learning model.

Why Build an Algorithmic Crypto Trading Bot?

A few years ago, I attempted to use genetic algorithms to optimize trading parameters for a specific strategy. Although that project remained unfinished, it fueled my desire to start fresh. This time around, the goal is to:

  • Automatically optimize parameters across multiple timeframes and indicators.
  • Automatically discover a winning strategy that can adapt to the ever-volatile crypto market.

This challenge not only helps me refine my machine learning skills but also leverages hardware I have for computationally heavy tasks (like backtesting and distributed training, which we’ll explore later in the series).

My Background in Crypto and Algo Trading

Before beginning this project, I had some experience with day trading on Binance and BitMEX and even wrote Ethereum smart contracts to operate on decentralized exchanges (DEXes). While that foundation helped me navigate the crypto landscape, I wanted a more advanced, automated approach — something that could learn and adapt on its own.

Data Sources: Where I Get My Historical Crypto Data

For this project, Binance is my primary source of historical crypto data due to its high-volume BTC/USDT futures market. I gathered all trades from 2021-01-01 to 2024-12-01, creating a massive dataset. If you’d like to replicate this, there are two main avenues:

  1. Public GitHub Projects
    Several repositories provide scripts to pull historical data directly from Binance’s APIs (a minimal example of this approach follows the list below).
  2. Paid Data Services
    For a small monthly fee, you can subscribe to services offering neatly packaged historical data.
    One service I used is https://www.cryptodatadownload.com/ (not a paid referral).
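
For the first avenue, here is a minimal sketch of what pulling candles from Binance’s public REST API can look like. It is not the author’s script: it uses the spot klines endpoint and the requests library, fetches pre-aggregated candles rather than raw trades, and leaves out the pagination loop, rate-limit handling, and file writing that a multi-year download would need (futures data lives under the separate fapi.binance.com host).

# fetch_binance_klines.py (illustrative sketch, not part of the original project)
import requests
import pandas as pd

KLINES_URL = "https://api.binance.com/api/v3/klines"  # spot endpoint; USDT-M futures use fapi.binance.com

def fetch_klines(symbol="BTCUSDT", interval="1m", start_ms=None, limit=1000):
    """Fetch one page of OHLCV candles from Binance's public klines endpoint."""
    params = {"symbol": symbol, "interval": interval, "limit": limit}
    if start_ms is not None:
        params["startTime"] = start_ms
    rows = requests.get(KLINES_URL, params=params, timeout=10).json()
    # Binance returns a list of lists; keep only the OHLCV fields used in this series.
    cols = ["timestamp", "open", "high", "low", "close", "volume",
            "close_time", "quote_volume", "trades", "taker_base", "taker_quote", "ignore"]
    df = pd.DataFrame(rows, columns=cols)
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms")
    return df[["timestamp", "open", "high", "low", "close", "volume"]].astype(
        {"open": float, "high": float, "low": float, "close": float, "volume": float}
    )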

How I Gathered and Merged the Data

Once I had the raw data, I needed to convert and merge it into a more manageable format. Below is a snippet from my merge_timeframe_files.py script showing how I combined multiple CSV files for different timeframes into consolidated Parquet files:

# merge_timeframe_files.py (excerpt)
import os
from glob import glob
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from tqdm import tqdm

def merge_timeframe_files(data_directory, output_directory):
    # Ensure output directory exists
    os.makedirs(output_directory, exist_ok=True)

    # Get list of all CSV files
    all_files = glob(os.path.join(data_directory, 'BTCUSDT-tick-*.csv'))
    timeframes = set()

    # Identify unique timeframes from file names
    for file in all_files:
        filename = os.path.basename(file)
        parts = filename.split('-')
        timeframe = parts[2]
        timeframes.add(timeframe)

    # Process each timeframe
    for timeframe in sorted(timeframes):
        timeframe_files = sorted(glob(os.path.join(data_directory, f'BTCUSDT-tick-{timeframe}-*.csv')))
        df_list = []
        for file in tqdm(timeframe_files, desc=f"Reading files for {timeframe}"):
            try:
                df = pd.read_csv(file, parse_dates=['timestamp'], dtype={
                    'open': 'float64', 'high': 'float64', 'low': 'float64', 'close': 'float64', 'volume': 'float64'
                })
                df_list.append(df)
            except Exception as e:
                print(f"Error reading {file}: {e}")
                continue
        if df_list:
            combined_df = pd.concat(df_list, ignore_index=True)
            combined_df.drop_duplicates(subset=['timestamp'], inplace=True)
            combined_df.sort_values(by='timestamp', inplace=True)
            combined_df.reset_index(drop=True, inplace=True)
            output_file = os.path.join(output_directory, f'BTCUSDT-tick-{timeframe}.parquet')
            table = pa.Table.from_pandas(combined_df)
            pq.write_table(table, output_file)
            print(f"Saved {output_file}")

Why Parquet?

  • Efficiency: Parquet is a columnar storage format, which often reads and writes faster than CSV (see the quick read-back example after this list).
  • Size Reduction: Combined CSVs of 7 GB can sometimes compress down to 2 GB or less in Parquet.
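
To see the read side in action, loading a consolidated file back is a one-liner, and because Parquet is columnar you can pull only the columns you need (the file name comes from the merge step above; the column selection is just an example):

import pandas as pd

# Only the requested columns are read from disk, something a CSV reader cannot skip.
closes = pd.read_parquet("output_parquet/BTCUSDT-tick-1min.parquet", columns=["timestamp", "close"])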

Cleaning and Preprocessing with DataLoader

After merging data into Parquet files, I still needed a clean, uniform dataset for training. That’s where my DataLoader class comes in. Below is a snippet showing how it imports 1-minute data (the “base timeframe”) and resamples it to other timeframes:

# data_loader.py (excerpt)
import os
import pandas as pd
import pandas_ta as ta

class DataLoader:
    def __init__(self):
        self.tick_data = {}
        self.timeframes = ['1min', '5min', '15min', '30min', '1h', '4h', '1d']
        self.base_timeframe = '1min'
        self.data_folder = 'output_parquet/'

    def import_ticks(self):
        """
        Imports tick data for the base timeframe (1 minute) from Parquet.
        """
        tick_data_1m = pd.read_parquet(
            os.path.join(self.data_folder, 'BTCUSDT-tick-1min.parquet')
        )
        tick_data_1m['timestamp'] = pd.to_datetime(tick_data_1m['timestamp'])
        tick_data_1m.set_index('timestamp', inplace=True)
        tick_data_1m.sort_index(inplace=True)

        # Filter data by user-defined start/end date (config-based)
        # Then store in self.tick_data
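        # Sketch of that filter: it assumes `from config_loader import Config` at the top of
        # this file and 'start_date'/'end_date' keys in config.yaml (both names are my guesses).
        cfg = Config()
        start, end = cfg.get('start_date'), cfg.get('end_date')
        if start and end:
            tick_data_1m = tick_data_1m.loc[start:end]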
        self.tick_data[self.base_timeframe] = tick_data_1m

    def resample_data(self):
        """
        Resamples the base timeframe data into multiple specified timeframes.
        """
        base_data = self.tick_data[self.base_timeframe]
        for tf in self.timeframes:
            if tf == self.base_timeframe:
                continue
            resampled_data = base_data.resample(tf).agg({
                'open': 'first', 'high': 'max', 'low': 'min', 'close': 'last', 'volume': 'sum'
            }).dropna()
            self.tick_data[tf] = resampled_data
  1. import_ticks() loads 1-minute Parquet data into a Pandas DataFrame, sets timestamps as the index, and sorts by time.
  2. resample_data() creates higher timeframes (e.g., 5 min, 15 min, 1 hour) by aggregating the 1-minute data.
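
A minimal usage sketch (the class, folder, and timeframe names come from the excerpts above; the calling order is my assumption):

loader = DataLoader()
loader.import_ticks()    # reads output_parquet/BTCUSDT-tick-1min.parquet
loader.resample_data()   # builds 5min, 15min, 30min, 1h, 4h and 1d frames from the 1-minute base
hourly = loader.tick_data['1h']
print(hourly[['open', 'high', 'low', 'close', 'volume']].tail())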

Configuration Management

I also rely on a YAML-based configuration file to handle dynamic parameters like start/end dates for filtering. Here’s a snippet from config_loader.py:

# config_loader.py (excerpt)
import os
import yaml

class Config:
    def __init__(self, config_path: str = 'config.yaml'):
        if not os.path.exists(config_path):
            raise FileNotFoundError(f"Configuration file {config_path} not found.")
        with open(config_path, 'r') as file:
            self.config = yaml.safe_load(file)

    def get(self, key, default=None):
        return self.config.get(key, default)

This makes it easy to change date ranges, file paths, or hyperparameters without hard-coding them in multiple scripts.
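
As an illustration, a config.yaml and the matching lookup might look like this; the key names are placeholders I chose, not the project’s actual schema:

# config.yaml (hypothetical contents)
#   start_date: "2021-01-01"
#   end_date: "2024-12-01"
#   data_folder: "output_parquet/"

config = Config('config.yaml')
start_date = config.get('start_date', '2021-01-01')  # the second argument is the fallback default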

Feature Engineering: Technical Indicators and “Brute Forcing”

The plan is to let a genetic algorithm (and later a reinforcement learning agent) explore a variety of technical indicator parameters to figure out the best strategy. Indicators I’m currently using include:

  • Moving Averages (SMA/EMA)
  • RSI
  • MACD
  • ATR
  • Stochastics

All of these are computed through libraries like pandas_ta or TA-Lib. This portion of code in data_loader.py illustrates how we create a single indicator dataframe:

# data_loader.py (excerpt)
from typing import Any, Dict  # required for the type hints below

def calculate_indicator(self, indicator_name: str, params: Dict[str, Any], timeframe: str) -> pd.DataFrame:
    data = self.tick_data[timeframe].copy()
    if indicator_name == 'sma':
        length = int(params['length'])
        data[f'SMA_{length}'] = ta.sma(data['close'], length=length).astype(float)
        result = data[[f'SMA_{length}']].dropna()
    # ...
    elif indicator_name == 'macd':
        fast = int(params['fast'])
        slow = int(params['slow'])
        signal = int(params['signal'])
        macd_df = ta.macd(data['close'], fast=fast, slow=slow, signal=signal).astype(float)
        result = macd_df.dropna()
    # etc.
    return result

By storing each indicator in a separate DataFrame and merging them as needed, I can quickly switch them on or off and experiment with new features.
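
Here is what that can look like in practice; the parameter values and the join are illustrative, not taken from the original code:

sma_df = loader.calculate_indicator('sma', {'length': 50}, timeframe='1h')
macd_df = loader.calculate_indicator('macd', {'fast': 12, 'slow': 26, 'signal': 9}, timeframe='1h')
# Left-join the indicator columns onto the hourly OHLCV frame to build a feature set.
features = loader.tick_data['1h'].join([sma_df, macd_df], how='left')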

Putting It All Together

Data merging + cleaning + indicator generation = a robust dataset ready for machine learning. In your own environment, the workflow boils down to three steps (a combined sketch follows the list):

  1. Gather/merge all trade data with scripts like merge_timeframe_files.py and merge_trades.py.
  2. Import & resample with DataLoader to unify everything into consistent timeframes.
  3. Compute any desired technical indicators (SMA, RSI, MACD, etc.).
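
Strung together, and reusing the pieces sketched earlier, the whole preparation pass fits in a few lines. Step 1 is a one-off script run per data refresh; the rest happens in-process:

# Step 1: consolidate raw CSVs into one Parquet file per timeframe.
merge_timeframe_files(data_directory="data_csv", output_directory="output_parquet")

# Steps 2 and 3: load, resample, and derive indicator features.
loader = DataLoader()
loader.import_ticks()
loader.resample_data()
sma_4h = loader.calculate_indicator('sma', {'length': 50}, timeframe='4h')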

Conclusion & What’s Next

Data is the bedrock of any trading bot. In this first part, you’ve seen how I acquire, merge, and clean large historical datasets. By the end of this stage, I have Parquet files that are much smaller, faster to load, and structured in a way that’s friendly for machine learning experiments.

In Part 2, we’ll explore how I set up the foundational utilities — like logging, backtesting frameworks, and the initial attempts at trading strategies. We’ll see what worked, what didn’t, and how I tested each iteration. Stay tuned!

Quick Recap

  • Motivations for building a self-optimizing crypto bot.
  • Using Binance as a primary data source.
  • Merging CSV files into efficient Parquet format.
  • Preprocessing with a custom DataLoader.
  • Setting up basic indicator calculations.

What’s Next?

  • Part 2: Crafting the Foundations — Building Utilities and First Attempts
  • Part 3: Leveling Up — Replacing Genetic Algorithm with DEAP and Reinforcement Learning
  • Part 4: Scaling Up — Distributed Training for Maximum Efficiency

If you have questions or suggestions about data gathering or preprocessing, feel free to leave a comment below. Let’s keep building!
