Organizing files for code

Today I’ll discuss a meta-issue: not what you put in the code, but where you put the code. I’m not so arrogant as to think my way of organizing is the best. Instead, I’ll discuss what I do, and explain why. Take what you’d like.

My file organization

Upon starting a project, I make a new directory. It’s named something relevant, e.g. y_mt_assoc for an association study of the Y and MT chromosomes. The first file is a README. It explains the directory’s purpose.

Within the main directory, I create code/, data/, log/, and results/. Each does what it says on the tin: logfiles are in log/, data in data/, etc. Typically there will be sub-directories, especially in data/ and results/, but those are project-specific and also often not immediately obvious.

Each script is named for what it does. For example, find_biased_regions.py is a script which finds “biased regions”. Its script docstring would explain the scan and define what a biased region was. However, such details are too wordy for the filename.

Sometimes code files are “helpers”: only used from other files. For example, if many of scripts use the same custom input/output functions, I’ll make a helper which consolidates those. This abides by the DRY principle. Helper files have the suffix _helpers, e.g. io_helpers.py.

Why do I use this organization method?

I love consistency. It’s why I can get around my apartment in pitch black: I know where everything is. If I need to quickly grab a logfile, I know exactly where to look. Same for searching through all the code from a project because I vaguely remember that one file had some useful function.

Besides finding files I need, this also makes writing code easier. I don’t have to spend any thought on where I should write results to. They all go in results/. I/O is one of the most finicky parts of any program, and this reduces overhead. While I still document assumptions (e.g. “GRM filenames are relative to data/GRM/”), it’s easier when the assumptions are standard across scripts. I might later add robustness by e.g. accepting files as command-line arguments, but in early development I just want a thing that works now.

A consistent naming format makes it easier to find the script I need, and makes it easier to know what everything does at a glance.

Plus, the README. I cannot stress enough, please make a README.

Tags: process file-level