Nischal Baidar

Split Your Dataset for Training, Validation and Testing: Using split-folders


When training machine learning models, especially for image classification, it’s important to split your dataset into training, validation, and test sets.

Instead of writing manual scripts to shuffle and copy files, you can use a simple yet powerful Python library called split-folders.


What is split-folders?

split-folders is a Python utility that automatically splits a folder of images (or any other files) into subfolders, typically train, val, and test, according to a ratio you specify. The input folder is expected to contain one subfolder per class.

Example folder before splitting:

Dataset/
 ├── with_mask/
 └── without_mask/

After splitting:

splitted/
 ├── train/
 │   ├── with_mask/
 │   └── without_mask/
 ├── val/
 │   ├── with_mask/
 │   └── without_mask/
 └── test/
     ├── with_mask/
     └── without_mask/
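
After a split, it can be reassuring to check how many files ended up in each folder. The following is a minimal sketch using only the standard library; the helper name count_files is hypothetical and not part of split-folders, and it assumes the output layout shown above (split folder, then class folder, then files).

```python
from pathlib import Path


def count_files(split_root):
    """Return {split_name: {class_name: file_count}} for a split output folder."""
    counts = {}
    for split_dir in sorted(Path(split_root).iterdir()):
        if not split_dir.is_dir():
            continue  # ignore stray files next to train/val/test
        counts[split_dir.name] = {
            class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
            for class_dir in sorted(split_dir.iterdir())
            if class_dir.is_dir()
        }
    return counts


if __name__ == "__main__":
    root = Path("splitted")  # output folder name used in this article
    if root.is_dir():
        for split_name, per_class in count_files(root).items():
            print(split_name, per_class)
```

Running it from the directory that contains splitted prints one line per split with the file count of each class.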

Installing split-folders

First, install it inside your virtual environment:

pip install split-folders

Two Ways to Split a Dataset with split-folders

Method 1: Python Usage

You can call it directly from a Python file, for example split_dataset.py:

import splitfolders

splitfolders.ratio(
    "Dataset",                         # Input folder
    output="splitted",                 # Output folder
    seed=42,                           # For reproducibility
    ratio=(0.7, 0.2, 0.1)              # Train, Val, Test split
)

Then run:

python split_dataset.py
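
To get a feel for what ratio=(0.7, 0.2, 0.1) means in file counts, the arithmetic can be sketched by hand. This is only an approximation of the idea, assuming each class folder is split on its own; the function expected_counts is hypothetical, and the library's exact rounding may differ by a file or so.

```python
def expected_counts(n_files, ratio=(0.7, 0.2, 0.1)):
    """Approximate (train, val, test) file counts for one class folder."""
    if abs(sum(ratio) - 1.0) > 1e-9:
        raise ValueError("ratio values should sum to 1")
    n_train = round(n_files * ratio[0])
    # Never hand out more files than remain after the training share
    n_val = min(round(n_files * ratio[1]), n_files - n_train)
    n_test = n_files - n_train - n_val  # test takes whatever is left
    return n_train, n_val, n_test


print(expected_counts(1000))  # (700, 200, 100)
```

With very small class folders, one of the splits can end up with zero files, so it is worth checking the counts before training.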

Why the Python approach works well:

  • Works inside Jupyter or scripts

  • Easy to adjust ratios and output names

  • Easier to reuse in automation or experiments

Method 2: Command Line Usage

If you prefer running it directly from the terminal, you can use:

splitfolders --output splitted --ratio .7 .2 .1 --seed 42 -- Dataset

Here:

  • --output splitted → the destination folder

  • --ratio .7 .2 .1 → 70% train, 20% val, 10% test

  • --seed 42 → ensures repeatability

  • -- Dataset → path to your dataset; the bare -- marks the end of the options, so Dataset is not read as another --ratio value

The same command is also available under the name split_folders:

split_folders --output splitted --ratio .7 .2 .1 --seed 42 -- Dataset

You can also change the ratios and the output name. For example, this writes a 70% train, 10% val, 20% test split to checksplit:

splitfolders --output checksplit --ratio .7 .1 .2 --seed 42 -- Dataset

  • The CLI is great for quick, one-time use

  • It is less convenient if you want to reuse or track configurations later
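
One way to keep track of CLI configurations is to store them in a small JSON file and build the command from it. The sketch below is only an illustration: build_command, save_config, and the config keys are hypothetical names, and it uses only the flags shown in this article (--output, --ratio, --seed).

```python
import json
from pathlib import Path


def build_command(config):
    """Translate a config dict into the splitfolders CLI argument list."""
    return [
        "splitfolders",
        "--output", config["output"],
        "--ratio", *[str(r) for r in config["ratio"]],
        "--seed", str(config["seed"]),
        "--", config["input"],
    ]


def save_config(config, path):
    """Write the config next to the experiment so the split can be repeated."""
    Path(path).write_text(json.dumps(config, indent=2))


config = {"input": "Dataset", "output": "splitted", "ratio": [0.7, 0.2, 0.1], "seed": 42}
print(" ".join(build_command(config)))
# splitfolders --output splitted --ratio 0.7 0.2 0.1 --seed 42 -- Dataset
```

The resulting list can be passed to subprocess.run(build_command(config), check=True) once split-folders is installed, and the saved JSON file documents exactly which split an experiment used.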

PRO TIP:

Always fix a random seed (like seed=42) to ensure you get the same split every time.
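
The effect of a fixed seed can be illustrated with Python's own random module. This is a sketch of the general principle, not of split-folders' internals; seeded_order is a hypothetical helper and the file names are made up.

```python
import random


def seeded_order(names, seed):
    """Return the names in a shuffled order that depends only on the seed."""
    ordered = sorted(names)  # start from a stable order
    random.Random(seed).shuffle(ordered)
    return ordered


files = [f"img_{i:03d}.jpg" for i in range(10)]
first = seeded_order(files, seed=42)
second = seeded_order(files, seed=42)
print(first == second)  # True: same seed, same order
```

Because the shuffled order is identical on every run, slicing it into train, val, and test portions gives the same files in each split every time.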


Note:

split-folders does not modify or split annotation files (like JSON, CSV, or XML) automatically.
It only splits folders containing images, videos or data files organized by class names.
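
If your labels live in a CSV file, one possible workaround is to split the images first and then divide the annotation rows according to where each image landed. The sketch below assumes a CSV with a column named filename (a hypothetical name, adjust it to your file) and image file names that are unique across classes; split_annotations is not part of split-folders.

```python
import csv
from pathlib import Path


def split_annotations(csv_path, split_root, out_dir, filename_column="filename"):
    """Write one CSV per split containing only the rows for that split's files."""
    split_root = Path(split_root)

    # Map each file name to the split folder (train/val/test) it was copied into
    split_of = {}
    for path in split_root.rglob("*"):
        parts = path.relative_to(split_root).parts
        if path.is_file() and len(parts) > 1:
            split_of[path.name] = parts[0]

    # Group the annotation rows by the split of the file they describe
    rows_by_split = {}
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        for row in reader:
            split_name = split_of.get(row[filename_column])
            if split_name is not None:  # skip rows whose file was not found
                rows_by_split.setdefault(split_name, []).append(row)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for split_name, rows in rows_by_split.items():
        with open(out_dir / f"{split_name}.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
    return {name: len(rows) for name, rows in rows_by_split.items()}
```

For example, split_annotations("labels.csv", "splitted", "annotations") would write annotations/train.csv, annotations/val.csv, and annotations/test.csv, where labels.csv is a placeholder for your own annotation file.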