Technically this paper isn’t published yet, but it’s on biorXiv and has gone through revision with a journal, so I’m using it. Also I’ve just spent the better part of a week trying to use its software.
This month’s paper: Walia S et al.. Compressive Pangenomics Using Mutation-Annotated Networks. bioRxiv 2024. doi: 10.1101/2024.07.02.601807
Original code
This is a marker paper for a software tool. The tool, panmanUtils
(which works
with the novel PanMAN data structure) is on GitHub. The authors
“welcome contributions from the community”. My current rotation project is to
extend panmanUtils
with more functionality (GitHub), so I guess I’m
the community.
Critique
Keep your documentation up to date
I’ll be approaching this critique from the perspective of someone who wants to
improve/extend panmanUtils
. They’ll have been excited by the paper
and want to get their hands dirty with this cool new data structure. Step one
would be to read the availabe documentation.
Unfortunately for this hypothetical person, that’s sort of the exact wrong move. The README, as far as I can tell, is largely fine, if incomplete. That would be fine. The repository links to an external wiki with more details. However, that wiki is riddled with enough errors that it’s hard to know what to trust.
A few examples:
- Multiple commands which use files in the
test/
directory are included as examples. Several of these commands use files which do not exist intest/
. - At least one option (
--fasta
) has a different description in the wiki vs. by the tool itself (when running--help
). The wiki’s description is incorrect. - Copy-paste errors, e.g. two options in Table 1 with the same description.
- Numerous small English grammar errors.
Users should be able to trust documentation as they start on the steep learning curve of figuring out how the code-base works. When that trust is violated, what is there left to rely on? Reading the code? Well, that brings me to the next issue…
Explain what functions do, especially similar ones
If you don’t have time for docstrings, at least throw a brief comment defining what a function does. This goes double if there are multiple similar functions. The user wants to know which they should use. If there are subtle differences, then explain what they are and why they exist.
An example are the functions in fitchSankoff.cpp (in order):
nucFitchForwardPassOpt()
nucFitchForwardPass()
nucFitchBackwardPassOpt()
nucFitchBackwardPass()
nucFitchAssignMutations()
nucFitchAssignMutationsOpt()
blockFitchForwardPassNew()
blockFitchBackwardPassNew()
blockFitchAssignMutationsNew()
nucSankoffForwardPassOpt()
nucSankoffForwardPass()
nucSankoffBackwardPass()
nucSankoffBackwardPassOpt()
nucSankoffAssignMutationsOpt()
nucSankoffAssignMutations()
blockSankoffForwardPass()
blockSankoffBackwardPass()
blockSankoffAssignMutations()
Some of the differences are intuitive. Fitch
functions use the algorithm from
Fitch 1971, and the Sankoff
ones Sankoff 1975.
Some of these functions work with mutations of “blocks” (a concept defined in
the paper) and some of them work with nucleotides.
But what does it mean if New()
or Opt()
are appended to a normal function
name? Which of these functions should I use, if I want to have the Fitch
algorithm as a subroutine? If I want to build something similar to the Fitch
algorithm, which should I emulate?
These functions are not user-facing, but anyone trying to develop around them
will have a headache. If e.g. the Opt()
versions are experimental and under
active development, then perhaps at least put a comment in the .hpp
file to
warn about that.
This problem is multiplied over all the many functions which work together to create the complete tool. Untangling that web would be much easier if each part were labeled.
Eventually I figured out what was happening, via liberal use of external references and staring at nothing. If it wasn’t my project, I might’ve given up. You will lose potential contributors if the barrier to entry is too high. Ease their way in, please. Signpost hat’s already there and how it can be used.
If there’s a recent paper you’d like me to look through, shoot me an email. Address in my CV.