The art of toy data

How and why to make small test data

By Faith Okamoto

A recent class project had me creating and documenting a small Python script (bach). As part of this process, I made toy data. This post will go over the process I used.

What is toy data?

Toy data is a very small example dataset. It probably isn’t “real”, and it’s probably so small as to have no value to science. Its value lies in enabling a short and simple test of the program. The benefits are manyfold:

  • Check if the program was installed correctly
  • Demonstrate standard syntax by example
  • Quick test case for sanity-checking changes to the code

Ideally, the toy dataset is small enough that the program’s algorithm can be worked through by hand. If an end-user has difficulty understanding what a program does, they can use the toy dataset to figure out the kinks.

How to make toy data

What does the program do? What is the simplest case to demonstrate? Multiple datasets are fine. Better to have many examples than an overly complicated one.

Toy data should, if at all possible, be human-readable. Only use binary or compressed files if there absolutely necessary. (My toy data uses BAM binary files because the human-readable counterpart SAM can’t be sorted/indexed as bach required.) It should be very very tiny. So tiny that uploading it to e.g. GitHub, and having everyone who downloads the tool also download the data, is no problem at all.

Handmade toy data is usually the best option. You can carefully choose data such that exactly what you want will occur, and nothing else. For bach, I handmade the [VCF][VCFspec] using an example VCF as a base, but I didn’t want to manually generate BAM files with enough reads to make sense, so those were made by a simple helper script.

Toy data goes in a clearly labeled directory (e.g. test). It should be made before the actual program, and it should be a test case during development.

How to use toy data

First, it’s a [test case during development][TestDrivenDevelopment], right? Great. Then, once you’re more or less done with a minimum viable product, document it! Create a model command which uses the toy data. (I actually made two, demonstrating different modes of bach.) Document the command. Make sure it works. Publish it.

Congratulations, you’ve just made toy data.

[TestDrivenDevelopment] [VCFspec]: http://samtools.github.io/hts-specs/VCFv4.2.pdf

Share: X (Twitter) Facebook LinkedIn