(Note: This is written from the context of academic writing in Computer
Science and related fields—I understand much of this may not generalize
to people who don’t write all of their articles in something like LaTeX or
pandoc.)
When writing an academic article with multiple collaborators, you really
want to be using a git repository to manage your LaTeX sources and/or
other related materials.
I think this is common advice given out to grad students
in CS and related departments, but a lot of people don’t really seem to be
taking it to heart. Using tools like Overleaf,
ShareLaTeX, or even a shared sync folder like
Dropbox seems to be the norm. While I’m in agreement that these
tools can really reduce the friction for getting started, I think that it’s
in your best interest as a graduate student learning how to write
effectively in an academic setting to use a git repository instead.
Here are my reasons why.
1. Learning Through diffs
When you use git to manage your paper’s sources, you are more or less
forced to create a built-in CHANGELOG for your paper. What was changed? In
what section? Why was the change made? These things, even when
incredibly brief, make their way into your (and others’) commit messages
and provide important context for the changes being made in a paper.
You also get a very explicit denotation of what exactly was changed
between two versions of the paper, rather than having to implicitly
discover them by occasionally stumbling upon words that you don’t think you
initially wrote.
Why is this important? One of the implicit goals of getting a research degree (a thesis Master’s or a PhD) is to learn how to communicate complex ideas clearly and efficiently. As a graduate student (especially at the early stages), most of what we are doing is imitation: we read a bunch of papers that we like, and we do our best to write in a way that emulates that “voice”. This can be an effective strategy, but it’s sort of like trying to learn how to cook by eating food at a restaurant. Sure, you can really learn to appreciate what constitutes a great meal, but you aren’t developing any real sense of the process by which that success was created. Nobody becomes a Michelin-starred chef solely by being a food critic.
If you use version control, and you pay attention to what your more experienced collaborators and advisor are doing, you can start to pick up on strategies and patterns that they employ, often without even knowing that they are employing them, to construct well-written arguments. This goes beyond even the most concrete advice and examples they can give you, because you can see exactly what changes they made to your writing to improve it. This is the most concrete and actionable advice you can get, and if you’re not using version control most of this advice is just being lost. Don’t throw away valuable learning opportunities. Version control your sources, and learn from the patches made by people with more experience than you.
2. Versioning of Arguments
When you use a standard git repo, you automatically create a version
history of the argument and framing you are using in the paper. If at any
point in time you want to go back to something that used to be there, you
can. This gives you a degree of freedom that I feel a lot of people are
uncomfortable with, but it is very liberating once you recognize you have it.
If you’re using version control, you don’t need to comment out sentences or
use \ignore{}-style macros when they’ve been replaced or when they become
vestigial. Just delete them. You can always get them back, because they
are part of the version history. The current document should reflect
exactly what is in the text of the current version, not some strange
mishmash of things that used to be and things that currently are. I see
this a lot in Dropbox or Overleaf documents where people are afraid of
losing material when they remove it from a version of the paper. This is
because the versioning in these systems is hidden from you and is not
explicit. git gives you exactly what changes were made between each
version, by whom, and for what reason.
As a collaborator, this empowers you to make wording changes wherever you
feel it might help. If the main author doesn’t like them, they can always go
back to what was previously there with minimal effort, without you having
to make the source an ugly mess with \ignore{}s and commented-out text
everywhere. Make the changes you want to make, justify them in your commit
message, and your intent is made clear to all authors so they can continue
to improve upon what you’ve done.
Being able to quickly get a sense of how a paper has evolved between
editing sessions is also invaluable as a collaborator. What did my
co-authors change since the last time I viewed the paper? Where should I
focus my efforts? If you’re using a git repository, the answer is just a
git fetch && git diff origin/master away. If you’re using pretty much
anything else, you’re more or less forced to keep an old copy of
the generated PDF sitting around somewhere to manually compare things. This
slows down the process of editing and providing feedback.
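As a minimal sketch of that check, the script below builds two throwaway repositories in a temporary directory, has a "co-author" change the introduction, and then fetches and inspects the change from your copy. Every file name, identity, and commit message here is made up for illustration. It uses git diff HEAD origin/master, a small variation on the command above that puts the incoming text on the + side; plain git diff origin/master shows the same change with the signs flipped, and also includes any uncommitted edits in your working tree.

```shell
#!/bin/sh
set -eu

# Throwaway sandbox; every path, name, and sentence below is hypothetical.
work=$(mktemp -d)
cd "$work"

# Identity for the demo commits, so no global git config is needed.
export GIT_AUTHOR_NAME="Advisor" GIT_AUTHOR_EMAIL="advisor@example.com"
export GIT_COMMITTER_NAME="Advisor" GIT_COMMITTER_EMAIL="advisor@example.com"

# "shared" stands in for the repository your co-authors work in.
git init -q shared
git -C shared symbolic-ref HEAD refs/heads/master
printf 'We propose a new method.\n' > shared/intro.tex
git -C shared add intro.tex
git -C shared commit -q -m "Add introduction draft"

# Your local copy of the paper.
git clone -q shared mine

# A co-author edits the introduction while you are away.
printf 'We propose a fast new method.\n' > shared/intro.tex
git -C shared commit -q -am "Emphasize speed in introduction"

# Back at your desk: fetch, then ask what changed and why.
cd mine
git fetch -q
incoming_log=$(git log --oneline HEAD..origin/master)
incoming_diff=$(git diff HEAD origin/master)
printf '%s\n\n%s\n' "$incoming_log" "$incoming_diff"
```

The log line carries the co-author's stated reason for the change, and the diff shows the exact wording that moved, which is the context an old PDF cannot give you.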
3. Versioning of Figures
This is a big one for me. When you’re polishing a manuscript, you are going to spend a lot of time generating figures and plots for the paper. These will be regenerated numerous times, changing things like the coloration, the label sizes, the positioning of the legend(s), and sometimes even the plot type itself.
A lot of the time when dealing with a Dropbox or Overleaf manuscript,
these figures are “magic”. They appear out of nowhere, with no context for
how they were generated, as files uploaded into a figures/ folder in the
project. By using a git repository, it is much more natural to start to
consider the fact that the “source” for a paper ought to include the
“source” for the figures as well. If you adhere to this, you commit the
code used to generate a figure alongside the actual figure itself. You can
now go back to any previous iteration of a figure effortlessly, and you can
also see the progression of a figure from initial conception to the final
polished version. Doing this also makes more explicit the exact data used
to generate the figure [1], so you don’t come back to a
paper months later having no idea what data was used for generating the
figure.
If you don’t version your figures and their associated sources, you don’t actually have a full history for your paper. Version your figures!
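Here is a rough sketch of what that can look like in practice. The "plotting script" is a stand-in shell script that writes a one-bar SVG, chosen only so the example runs without a plotting library; in a real paper it would be your actual plotting code. The file layout, the data, and the commit messages are all hypothetical.

```shell
#!/bin/sh
set -eu

# Throwaway sandbox; all file names and contents below are hypothetical.
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME="Student" GIT_AUTHOR_EMAIL="student@example.com"
export GIT_COMMITTER_NAME="Student" GIT_COMMITTER_EMAIL="student@example.com"

git init -q paper
cd paper
mkdir -p figures/data

# Processed data the plot needs, small enough to commit.
printf 'method,accuracy\nours,90\n' > figures/data/results.csv

# Stand-in for a real plotting script: draws one bar as an SVG.
cat > figures/make_plot.sh <<'EOF'
#!/bin/sh
height=$(awk -F, 'NR == 2 { print $2 }' figures/data/results.csv)
printf '<svg xmlns="http://www.w3.org/2000/svg" width="40" height="100">'
printf '<rect width="40" height="%s" fill="gray"/></svg>\n' "$height"
EOF

# First iteration: data, script, and generated figure go in together.
sh figures/make_plot.sh > figures/accuracy.svg
git add figures
git commit -q -m "Add accuracy figure with its script and data"

# Second iteration: restyle the bar and regenerate the figure.
sed 's/fill="gray"/fill="steelblue"/' figures/make_plot.sh > make_plot.tmp
mv make_plot.tmp figures/make_plot.sh
sh figures/make_plot.sh > figures/accuracy.svg
git commit -q -am "Recolor accuracy bar for print legibility"

# The figure's history, and the first version pulled back out of it.
figure_log=$(git log --oneline -- figures/accuracy.svg)
git show HEAD~1:figures/accuracy.svg > "$work/accuracy_v1.svg"
printf '%s\n' "$figure_log"
```

Because the script and its output change in the same commit, the history of the figure doubles as a record of how each version was produced.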
4. Offline Work and Conflict Resolution
$ ls
"sec_introduction (XXX's conflicted copy YYYY-MM-DD).tex"
"sec_introduction.tex"
"abstract (XXX's conflicted copy YYYY-MM-DD).tex"
"abstract.tex"
"sec_experiments (XXX's conflicted copy YYYY-MM-DD).tex"
"sec_experiments (YYY's conflicted copy YYYY-MM-DD).tex"
"sec_experiments.tex"
...
This sort of disaster is common when using Dropbox to collaborate. In an academic setting, your advisor is likely to spend a ton of time at about 35,000 feet above Earth’s surface. Internet connectivity on flights is starting to get better, but is typically still spotty at best even when flying with business or first-class accommodations.
Now, you can’t avoid conflicts if two people edit the same passage at the
same time offline. Every collaboration system will have this problem. But
with a true version control system like git, you at least get somewhat
sane conflict resolution built into the tool. And, in the event that
nobody worked on the same paragraph at the same time, you’re likely to
avoid conflicts altogether and get a nice, clean merge when your
collaborator finally gets an Internet connection at their hotel or the
conference venue.
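The clean-merge case can be sketched with a small simulation. The script below is only an illustration under made-up assumptions: a twelve-line paper.tex, a bare repository standing in for the group's shared remote, and two clones named student and advisor. The advisor commits an edit "offline" while the student pushes an edit to a different part of the file.

```shell
#!/bin/sh
set -eu

# Throwaway sandbox; names, paths, and text are hypothetical.
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

# Seed a paper with twelve one-sentence lines and publish it to a bare
# repository that stands in for the group's shared remote.
git init -q seed
git -C seed symbolic-ref HEAD refs/heads/master
for i in 1 2 3 4 5 6 7 8 9 10 11 12; do
  printf 'Sentence %s of the paper.\n' "$i"
done > seed/paper.tex
git -C seed add paper.tex
git -C seed commit -q -m "Add first draft"
git clone -q --bare seed shared.git

# Student and advisor each take a copy before the flight.
git clone -q shared.git student
git clone -q shared.git advisor

# In the air, the advisor rewrites the opening line and commits locally.
sed '1s/.*/We study an important problem./' advisor/paper.tex > tmp.tex
mv tmp.tex advisor/paper.tex
git -C advisor commit -q -am "Sharpen opening sentence"

# Meanwhile the student edits the last line and pushes.
sed '12s/.*/We conclude with open questions./' student/paper.tex > tmp.tex
mv tmp.tex student/paper.tex
git -C student commit -q -am "Rewrite closing sentence"
git -C student push -q origin master

# At the hotel, the advisor pulls. The edits touch different parts of the
# file, so git merges them without asking anyone to resolve anything.
git -C advisor pull -q --no-rebase --no-edit origin master
git -C advisor push -q origin master

merged_first=$(sed -n '1p' advisor/paper.tex)
merged_last=$(sed -n '12p' advisor/paper.tex)
printf '%s\n%s\n' "$merged_first" "$merged_last"
```

Had both people rewritten the same line, the pull would have stopped with conflict markers in paper.tex instead, which is still easier to sort out than a folder full of "conflicted copy" files.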
Online-only systems like Overleaf or ShareLaTeX lock out your advisor and
collaborators from working on the paper while they’re in the air or
otherwise away from a stable Internet connection. In grad school, you’re
going to want to maximize the amount of time your advisor and
collaborators can spend looking at your paper. Using a simple
git repository ensures they can still be helpful when traveling abroad.
5. Learning From Each Other
The utility of this depends on your exact git setup, but the
configuration I’m seeing become a growing trend is for research groups to
have either a private git hosting solution set up on-site, or to be using
a GitHub organization or similar on some shared repository hosting website.
In that kind of setting, you can learn a lot about writing from your lab
mates should you be able to view their papers’ git histories, too.
When things are locked away in random Dropbox folders or in unorganized Overleaf/ShareLaTeX links thrown around in emails or messenger platforms, it becomes harder to develop some shared institutional knowledge about writing since it isn’t all together in one organized place.
But Overleaf gives me a git repo!
Yes, it does, but it completely misses the point. The
advantage of having a git repository is that you have a complete,
versioned history of all of the changes that were made. Overleaf’s
git history is basically useless:
$ git log --pretty=oneline
01bf697c4422a417e7a2751095cf33d609c7e8f7 (HEAD -> master, origin/master, origin/HEAD) Update on Overleaf.
a0873e520e0a7f9b510646c165c68ae5f8dbf4a4 Update on Overleaf.
cce528e94b0ce0a8e5faddadd47f98f0e5a9202a Update on Overleaf.
9b28a1069c4134ab63a6029d4e97666161d1ce39 Update on Overleaf.
fed08a3b8aa632e9be63c87b04a91723f31c77ed Update on Overleaf.
6b71abd84014e5fe0b554eeeb8709d376be79674 Update on Overleaf.
c11b6f08403c3fb7724ff08db5177855c3b51a2f Update on Overleaf.
...
What was the update? Why was it made? In what section? What was the
rationale behind the change? All of this is missing in the git repo
provided by Overleaf, which all but negates the
advantages of using a git repository. Multiple commits are made for one
change, including false starts and mid-sentence revisions. The history, as
a result, is full of meaningless diffs that defeat the purpose of
maintaining a version history.
Advice for Maximizing git Effectiveness
Most of the discussion above is generally applicable even if you are using
git poorly, but following good practices when using version control can
magnify those benefits. Here are some concrete pieces of
advice that can supercharge the benefit of using git over these other
solutions:
- Commit early, often, and with small changes. This allows you to create a more granular version history and gives more opportunities to provide yourself and your collaborators context for the changes that were made.

  Note that the distinction between committing and pushing in git allows you to do a lot of work locally, and then later piece together the chunks of changes that logically go together by using git add -p filename. The “right” way is to be making commits as you go, but I’ve found that I tend to forget to do that. git add -p is my crutch in those cases.

  You want to make each commit a logical change. This is no different than when you’re using git for version control of a programming project. Each commit should explain what the commit is for, what the change does, and why the change was made. They should be as small and self-contained as humanly possible.

- Use good commit messages. git commit -m "Rewrite introduction" is less useful than doing a full git commit, opening up your text editor, and writing a long-form message like:

      Reframe introduction

      Emphasize the impact and problem importance earlier to hook the
      reader and be very explicit about the concrete contributions to
      avoid burying the lede.

  The more context you provide here, the more information you give anyone viewing the repository’s changeset (which includes your advisor and other collaborators).

  Doing this right is hard, I know, but it is very useful. Strive to do the best you can.

- Use --word-diff or --color-words in commands that print diffs, like git diff or git log -p. This is especially nice when viewing diffs of hard-wrapped text.

- Push and pull often. You want to make sure you’re up-to-date with your collaborators as you’re making changes. Going a long time between pulling in changes is a bad idea, especially when you know multiple people are working at once during crunch time. Frequent incorporation of changes can often avoid conflicts entirely, and when they do occur they are smaller in size and more manageable.

- Version as much as possible. This includes code and data used to generate figures, but also should include things like requirements.txt or Pipfiles for your code to capture the exact libraries and their versions used. You want to be able to come back to the repository potentially years later and still be able to compile the paper from scratch and update figures. Remember: if it’s your first-author paper, it’s likely going to be part of your thesis. You’ll thank yourself later if you make porting its contents to the thesis format easier.

  Try to document how to use your tools, what version of the Python interpreter you were using at the time, etc. This is useful for both you and your lab mates, should your group dynamic be to share completed results (and it should be).
[1] You should also ideally version control the data used for the figure generation as well. There’s an important distinction here between the raw data used for generating a figure and the relevant data used by the plot itself. I see a lot of “figure” creation scripts that do a lot of heavy processing before doing the actual work of making the plot. Save those intermediate results! If you can’t commit the raw data, at least commit the processed version of the data that has the relevant information for regenerating a plot.