A Google Scholar alert sent me some clickbait recently. Well, clickbait for folks who work on structural variation (SV) genotyping: a preprint claiming to genotype SVs from short reads just as well as standard long-read methods do. I prefer not to critique preprints, but lucky me: the preprint’s repository links to prior papers whose code has similar issues.
This month’s paper: Rajaby R and Sung WK. SurVIndel2: improving copy number variant calling from next-generation sequencing using hidden split reads. Nature Communications 2024. doi: 10.1038/s41467-024-53087-7
Original code
This paper’s code is on GitHub. (Note: the paper’s code, not the preprint’s.) I focus on
survindel.py because it seems to be a main entry point.
Critique
Standard disclaimer: issues with published code are not necessarily anyone’s fault, and often are due to nothing more nefarious than time constraints.
Magic numbers could use some explanation
The top of the file defines a few magic numbers:
MAX_READS = 1000
GEN_DIST_SIZE = 100000
MAX_ACCEPTABLE_IS = 20000
I can see what these magic numbers are pretty clearly, but I have no idea why they have those values. Why do we need a maximum insert size? Why no more than 1k reads? Also, what is 1k reads the maximum for? Typical sequencing runs have way more than 1k reads, so this probably is for some sort of subset. But no context is provided.
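For what it’s worth, here’s a sketch of what I’d rather see at the top of the file. The values are straight from the script; the reasons in the comments are placeholders I invented, since the whole problem is that I don’t know the real ones:

# Sketch only: the values come from the script; the "why" text is hypothetical,
# standing in for the explanation I wish were there.

# Max reads to sample per region when estimating the insert-size distribution.
# Why 1000? e.g. "enough for a stable estimate without scanning the whole BAM".
MAX_READS = 1000

# Number of template lengths to collect before characterising the distribution.
# Why 100000? e.g. "mean/stddev estimates stabilise well before this point".
GEN_DIST_SIZE = 100000

# Template lengths above this are treated as artefacts and ignored.
# Why 20000? e.g. "far beyond any plausible fragment length for this library prep".
MAX_ACCEPTABLE_IS = 20000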
… and also names
While I got some hope from those magic number definitions, later on we get:
while len(general_dist) < GEN_DIST_SIZE:
    chr, pos = random_positions[rnd_i]
    rnd_i += 1
    if pos > len(reference_fa[chr])-10000:
        continue
    i = 0
    for read in bam_file.fetch(contig=chr, start=pos, end=pos+10000):
        if read.is_proper_pair and not read.is_secondary and not read.is_supplementary and \
                0 < read.template_length < MAX_ACCEPTABLE_IS:
            if i > 100:
                break
            i += 1
            general_dist.append(read.template_length)
Um, what’s that 10k doing? It seems we’re searching in 10kbp windows; why? There
are other magic numbers which get names (e.g. GEN_DIST_SIZE), so why is 10k
a bare number? I can make guesses about the answers, or look elsewhere, but why
should I have to?
As-is, I look at that 10k, and I think, “why?”
The same goes for the other magic numbers. For example, why do we keep at most
100 reads per window? Why not more? Why not fewer? And I had to think a bit to
figure out that that’s what the if i > 100: break bit meant; an explicit
explanation would have been better.
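To make that concrete, here’s roughly how I’d want the loop to read. The constant names and the reasons in the comments are my guesses, not anything the repository actually says:

# Hypothetical rewrite of the sampling loop above; names and comments are
# illustrative guesses, not the authors' documented rationale.

SAMPLING_WINDOW = 10000      # why 10 kbp windows? explain the choice here
MAX_READS_PER_WINDOW = 100   # why stop after ~100 proper pairs per window?

while len(general_dist) < GEN_DIST_SIZE:
    chr, pos = random_positions[rnd_i]
    rnd_i += 1
    # Skip windows that would run off the end of the chromosome.
    if pos > len(reference_fa[chr]) - SAMPLING_WINDOW:
        continue
    sampled = 0
    for read in bam_file.fetch(contig=chr, start=pos, end=pos + SAMPLING_WINDOW):
        if read.is_proper_pair and not read.is_secondary and not read.is_supplementary and \
                0 < read.template_length < MAX_ACCEPTABLE_IS:
            if sampled > MAX_READS_PER_WINDOW:
                break
            sampled += 1
            general_dist.append(read.template_length)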
Don’t always rely on self-documenting names
The end of the file has:
clip_consensus_builder_cmd = SURVINDEL_PATH + "/clip_consensus_builder %s %s %s %s" % (cmd_args.bam_file, cmd_args.workdir, cmd_args.reference, sample_name)
run_cmd(clip_consensus_builder_cmd)
normalise_cmd = SURVINDEL_PATH + "/normalise %s/sr.vcf.gz %s/sr.norm.vcf.gz %s" % \
                (cmd_args.workdir, cmd_args.workdir, cmd_args.reference)
run_cmd(normalise_cmd)
merge_identical_calls_cmd = SURVINDEL_PATH + "/merge_identical_calls %s/sr.norm.vcf.gz %s/sr.norm.dedup.vcf.gz" % \
                            (cmd_args.workdir, cmd_args.workdir)
run_cmd(merge_identical_calls_cmd)
dp_clusterer = SURVINDEL_PATH + "/dp_clusterer %s %s %s %s" % (cmd_args.bam_file, cmd_args.workdir, cmd_args.reference, sample_name)
run_cmd(dp_clusterer)
Here everything is nicely named, but it’s also a bunch of names thrown around at once, with no explanatory comments. Why run the consensus building before the normalization? What exactly is being normalized, and how, and why?
Right now this is a bare list of steps. Some context would be nice.
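Even a minimal pass of comments would help. Below is a sketch of the kind of annotation I mean; the descriptions are guesses based purely on the binary names, and the questions mark exactly where a real explanation belongs:

# Hypothetical annotations for the pipeline above. Nothing here is the authors'
# actual rationale; it's the scaffolding where that rationale should live.

# Build consensus sequences from clipped reads, apparently producing sr.vcf.gz.
# Why does this run first? What do the later steps need from it?
run_cmd(SURVINDEL_PATH + "/clip_consensus_builder %s %s %s %s"
        % (cmd_args.bam_file, cmd_args.workdir, cmd_args.reference, sample_name))

# Normalise the split-read calls. Normalised how, and why before deduplication?
run_cmd(SURVINDEL_PATH + "/normalise %s/sr.vcf.gz %s/sr.norm.vcf.gz %s"
        % (cmd_args.workdir, cmd_args.workdir, cmd_args.reference))

# Collapse identical calls, presumably so downstream steps see each event once.
run_cmd(SURVINDEL_PATH + "/merge_identical_calls %s/sr.norm.vcf.gz %s/sr.norm.dedup.vcf.gz"
        % (cmd_args.workdir, cmd_args.workdir))

# Cluster discordant pairs ("dp"?) into candidate calls. Again, a guess.
run_cmd(SURVINDEL_PATH + "/dp_clusterer %s %s %s %s"
        % (cmd_args.bam_file, cmd_args.workdir, cmd_args.reference, sample_name))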
“Why” comments are incredibly important for someone else trying to understand, well, why the code is like that. I looked through a 165-line Python script here. That’s a famously readable language. And I still came away with a bunch of “why” questions. (Not so much “how” or “what”, but then… “why” is the basis of a bunch of science, y’know.)
If there’s a recent paper you’d like me to look through, shoot me an email. Address in my CV.