Overleaf Considered Harmful: Why a Simple git Repository is Best for Academic Writing

Posted: February 22, 2018

(Note: This is written from the context of academic writing in Computer Science and related fields—I understand much of this may not generalize to people who don’t write all of their articles in something like LaTeX or pandoc.)

When writing an academic article with multiple collaborators, you really want to be using a git repository to manage your LaTeX sources and/or other related materials.

I think this is common advice given out to grad students in CS and related departments, but a lot of people don’t really seem to be taking it to heart. Using tools like Overleaf, ShareLaTeX, or even a shared sync folder like Dropbox seems to be the norm. While I’m in agreement that these tools can really reduce the friction for getting started, I think that it’s in your best interest as a graduate student learning how to write effectively in an academic setting to use a git repository instead. Here are my reasons why.

1. Learning Through diffs

When you use git to manage your paper’s sources, you are more or less forced to create a built-in CHANGELOG for your paper. What was changed? In what section? Why was the change made? These things, even when incredibly brief, make their way into your (and others’) commit messages and provide important context for the changes being made in a paper. You also get an explicit record of exactly what changed between two versions of the paper, rather than having to discover changes by occasionally stumbling upon words that you don’t think you wrote.

Why is this important? One of the implicit goals of getting a research degree (a thesis Master’s or a PhD) is to learn how to communicate complex ideas clearly and efficiently. As a graduate student (especially at the early stages), most of what we are doing is imitation: we read a bunch of papers that we like, and we do our best to write in a way that emulates that “voice”. This can be an effective strategy, but it’s sort of like trying to learn how to cook by eating food at a restaurant. Sure, you can really learn to appreciate what constitutes a great meal, but you aren’t developing any real sense of the process by which that success was created. Nobody becomes a Michelin star chef by solely being a food critic.

If you use version control and pay attention to what your more experienced collaborators and advisor are doing, you can start to pick up on strategies and patterns they use, often without realizing it, to construct well-written arguments. This goes beyond even the most concrete advice and examples they can give you, because you can see exactly what changes they made to your writing to improve it. Feedback doesn’t get more actionable than that, and if you’re not using version control most of it is simply lost. Don’t throw away valuable learning opportunities. Version control your sources, and learn from the patches made by people with more experience than you.

2. Versioning of Arguments

When you use a standard git repo, you automatically create a version history of the argument and framing you are using in the paper. If at any point in time you want to go back to something that used to be there, you can. This gives you a degree of freedom that I feel a lot of people are uncomfortable with, but is very freeing once you recognize you have it.

If you’re using version control, you don’t need to comment out sentences or use \ignore{}-style macros when they’ve been replaced or when they become vestigial. Just delete them. You can always get them back, because they are part of the version history. The current document should reflect exactly what is in the text of the current version, not some strange mishmash of things that used to be and things that currently are. I see this a lot in Dropbox or Overleaf documents where people are afraid of losing material when they remove it from a version of the paper. This is because the versioning in these systems is hidden from you and is not explicit. git gives you exactly what changes were made between each version, by whom, and for what reason.
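As a minimal sketch of what “you can always get them back” looks like in practice: git log -S (the “pickaxe” search) lists the commits that added or removed a given string, and git show can print a file as it was at an earlier commit. The throwaway repository, the file name intro.tex, and the sentences below are all hypothetical, made up only so the example runs on its own.

```shell
# Hypothetical demo: build a throwaway repository, delete a sentence,
# then find and recover it from history.
demo=$(mktemp -d)
cd "$demo" || exit 1
git init -q .
git config user.name "Demo Author"
git config user.email "demo@example.com"

printf 'Our method is fast.\nIt also scales to large graphs.\n' > intro.tex
git add intro.tex
git commit -q -m "Draft introduction"

# Delete the second sentence outright instead of commenting it out.
printf 'Our method is fast.\n' > intro.tex
git commit -q -am "Drop scaling claim from introduction"

# Pickaxe search: which commits added or removed this phrase?
git log --oneline -S'scales to large graphs' -- intro.tex

# Print the file as it was one commit ago and pull the sentence back out.
recovered=$(git show HEAD~1:intro.tex | grep 'scales to large graphs')
echo "$recovered"
```

The search lists both the commit that introduced the sentence and the one that removed it, so the commit message explaining why it was dropped is right there next to the text.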

As a collaborator, this empowers you to make wording changes wherever you feel it might help. If the main author doesn’t like them, they can always go back to what was previously there with minimal effort, without you having to make the source an ugly mess with \ignore{}s and commented out text everywhere. Make the changes you want to make, justify them in your commit message, and your intent is made clear to all authors so they can continue to improve upon what you’ve done.

Being able to quickly get a sense of how a paper has evolved between editing sessions is also invaluable as a collaborator. What did my co-authors change since the last time I viewed the paper? Where should I focus my efforts? If you’re using a git repository, this is as simple as a git fetch && git diff origin/master away. If you’re using pretty much anything else, you’re more or less forced to keep an old copy of the generated PDF sitting around somewhere and compare things manually. This slows down the process of editing and providing feedback.
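Here is a rough, self-contained sketch of that review loop. Everything in it is hypothetical: a bare repository in a temporary directory stands in for your git host, and two clones stand in for you and a co-author. One detail worth knowing: git diff origin/master compares origin/master against your working tree, so your co-author’s additions show up as removals. Writing git diff HEAD origin/master instead makes the diff read in the direction of “what they changed”.

```shell
# Hypothetical setup: a bare "server" repository plus two clones.
work=$(mktemp -d)
git init -q --bare "$work/paper.git"
git --git-dir="$work/paper.git" symbolic-ref HEAD refs/heads/master

# Your clone: first commit and push.
git clone -q "$work/paper.git" "$work/me"
cd "$work/me" || exit 1
git symbolic-ref HEAD refs/heads/master
git config user.name "Me"
git config user.email "me@example.com"
printf 'We study X.\n' > abstract.tex
git add abstract.tex
git commit -q -m "Add abstract"
git push -q origin master

# Your co-author's clone: they edit the abstract and push.
git clone -q "$work/paper.git" "$work/coauthor"
cd "$work/coauthor" || exit 1
git config user.name "Co-author"
git config user.email "coauthor@example.com"
printf 'We study X and show Y.\n' > abstract.tex
git commit -q -am "Abstract: state the main result"
git push -q origin master

# Back in your clone: what came in since you last looked, and why?
cd "$work/me" || exit 1
git fetch -q
incoming=$(git log --format=%s HEAD..origin/master)
echo "$incoming"

# The change itself, read as "from my version to theirs".
git diff HEAD origin/master
```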

3. Versioning of Figures

This is a big one for me. When you’re polishing a manuscript, you are going to spend a lot of time generating figures and plots for the paper. These will be regenerated numerous times, changing things like the coloration, the label sizes, the positioning of the legend(s), and sometimes even the plot type itself.

A lot of the time when dealing with a Dropbox or Overleaf manuscript, these figures are “magic”. They appear out of nowhere, with no context for how they were generated, as files uploaded into a figures/ folder in the project. With a git repository, it becomes much more natural to treat the “source” for a paper as including the “source” for its figures as well. If you adhere to this, you commit the code used to generate a figure alongside the actual figure itself. You can now go back to any previous iteration of a figure effortlessly, and you can also see the progression of a figure from initial conception to the final polished version. Doing this also makes the exact data used to generate the figure more explicit (see footnote 1), so you don’t come back to a paper months later having no idea what data was used for generating the figure.

If you don’t version your figures and their associated sources, you don’t actually have a full history for your paper. Version your figures!
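To make this concrete, here is a small sketch of committing a figure together with the data and code that produced it, then pulling an older rendering back out of history. All names are hypothetical, and the “plot script” is a toy shell script that writes an SVG; in a real paper it would be whatever plotting code you actually use.

```shell
# Hypothetical demo repository; file names and data are made up.
fig=$(mktemp -d)
cd "$fig" || exit 1
git init -q .
git config user.name "Demo Author"
git config user.email "demo@example.com"

mkdir figures
printf '3\n5\n' > figures/accuracy.dat

# Stand-in for a real plotting script: one <rect> per data row.
cat > figures/make_accuracy.sh <<'EOF'
#!/bin/sh
echo '<svg xmlns="http://www.w3.org/2000/svg">'
while read -r v; do
  printf '<rect width="10" height="%s" fill="gray"/>\n' "$v"
done < figures/accuracy.dat
echo '</svg>'
EOF

# Commit the data, the script, and the rendered figure together.
sh figures/make_accuracy.sh > figures/accuracy.svg
git add figures
git commit -q -m "Add accuracy figure with its data and generating script"

# Restyle the figure: the script changes and the output is regenerated,
# and both land in the same commit.
sed 's/gray/steelblue/' figures/make_accuracy.sh > figures/make_accuracy.new
mv figures/make_accuracy.new figures/make_accuracy.sh
sh figures/make_accuracy.sh > figures/accuracy.svg
git commit -q -am "Accuracy figure: switch bars to blue for contrast"

# The progression of the figure, with reasons attached.
git log --oneline -- figures/

# Retrieve the previous rendering without touching the working tree.
git show HEAD~1:figures/accuracy.svg
```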

4. Offline Work and Conflict Resolution

$ ls
"sec_introduction (XXX's conflicted copy YYYY-MM-DD).tex"
"sec_introduction.tex"
"abstract (XXX's conflicted copy YYYY-MM-DD).tex"
"abstract.tex"
"sec_experiments (XXX's conflicted copy YYYY-MM-DD).tex"
"sec_experiments (YYY's conflicted copy YYYY-MM-DD).tex"
"sec_experiments.tex"
...

This sort of disaster is common when using Dropbox to collaborate. In an academic setting, your advisor is likely to spend a ton of time at about 35,000 feet above Earth’s surface. Internet connectivity on flights is starting to get better, but is typically still spotty at best even when flying with business or first-class accommodations.

Now, you can’t avoid conflicts if two people edit the same passage at the same time offline. Every collaboration system will have this problem. But with a true version control system like git, you at least get somewhat sane conflict resolution built into the tool. And, in the event that nobody worked on the same paragraph at the same time, you’re likely to avoid conflicts altogether and get a nice, clean merge when your collaborator finally gets an Internet connection at their hotel or the conference venue.
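A quick sketch of the clean-merge case, under some simplifying assumptions: the two authors are modelled as two branches in one throwaway repository (in reality they would be separate clones), and the file name and text are hypothetical. Because the edits touch lines that are not next to each other, git combines them without asking anyone anything.

```shell
# Hypothetical demo: two people edit different paragraphs offline.
m=$(mktemp -d)
cd "$m" || exit 1
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo Author"
git config user.email "demo@example.com"

printf 'Intro paragraph.\n\nMethods paragraph.\n\nResults paragraph.\n' > paper.tex
git add paper.tex
git commit -q -m "Initial draft"

# The advisor rewrites the introduction on the plane.
git checkout -q -b advisor
printf 'Sharper intro paragraph.\n\nMethods paragraph.\n\nResults paragraph.\n' > paper.tex
git commit -q -am "Intro: tighten opening"

# Meanwhile, you update the results.
git checkout -q master
printf 'Intro paragraph.\n\nMethods paragraph.\n\nResults paragraph, now with numbers.\n' > paper.tex
git commit -q -am "Results: add numbers"

# Once the advisor is back online, their work is merged into yours.
git merge -q --no-edit advisor
cat paper.tex
```

Had both of you edited the same line, git merge would instead stop and mark the competing versions in the file for you to resolve by hand.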

Online-only systems like Overleaf or ShareLaTeX lock out your advisor and collaborators from working on the paper while they’re in the air or otherwise away from a stable Internet connection. In grad school, you’re going to want to maximize the time your advisor and collaborators can spend looking at your paper. Using a simple git repository ensures they can still be helpful when traveling abroad.

5. Learning From Each Other

The utility of this depends on your exact git setup, but a growing trend I’m seeing is for research groups to either run a private git hosting solution on-site or use a GitHub organization (or similar) on a shared repository hosting website. In that kind of setting, you can learn a lot about writing from your lab mates if you are able to view their papers’ git histories, too.

When things are locked away in random Dropbox folders or in unorganized Overleaf/ShareLaTeX links thrown around in emails or messenger platforms, it becomes harder to develop some shared institutional knowledge about writing since it isn’t all together in one organized place.

But Overleaf gives me a git repo!

Yes, it does, but it completely misses the point. The advantage of having a git repository is that you have a complete, versioned history of all of the changes that were made. Overleaf’s git history is basically completely useless:

$ git log --pretty=oneline
01bf697c4422a417e7a2751095cf33d609c7e8f7 (HEAD -> master, origin/master, origin/HEAD) Update on Overleaf.
a0873e520e0a7f9b510646c165c68ae5f8dbf4a4 Update on Overleaf.
cce528e94b0ce0a8e5faddadd47f98f0e5a9202a Update on Overleaf.
9b28a1069c4134ab63a6029d4e97666161d1ce39 Update on Overleaf.
fed08a3b8aa632e9be63c87b04a91723f31c77ed Update on Overleaf.
6b71abd84014e5fe0b554eeeb8709d376be79674 Update on Overleaf.
c11b6f08403c3fb7724ff08db5177855c3b51a2f Update on Overleaf.
...

What was the update? Why was it made? In what section? What was the rationale behind the change? All of this is missing in the git repo provided by Overleaf, which basically completely negates all of the advantages of using a git repository. Multiple commits are made for one change, including false starts and mid-sentence revisions. The history, as a result, is full of meaningless diffs that defeat the purpose of maintaining a version history.

Advice for Maximizing git Effectiveness

Most of the discussion above applies even if you are using git poorly, but following good version control practices can magnify those benefits. Here are some concrete pieces of advice for getting the most out of git over these other solutions:

  1. Commit early, often, and with small changes. This allows you to create a more granular version history and gives more opportunities to provide yourself and your collaborators context for the changes that were made.

    Note that the distinction between committing and pushing in git allows you to do a lot of work locally, and then piece together chunks of changes that you made that logically go together later by using git add -p filename. The “right” way is to be making commits as you go, but I’ve found that I tend to forget to do that. git add -p is my crutch in those cases.

    You want to make each commit a logical change. This is no different than when you’re using git for version control of a programming project. Each commit should explain what the commit is for, what the change does, and why the change was made. They should be as small and self-contained as humanly possible.

  2. Use good commit messages. git commit -m "Rewrite introduction" is less useful than doing a full git commit and opening up your text editor and writing a long-form message like:

    Reframe introduction
    
    Emphasize the impact and problem importance earlier to hook the reader
    and be very explicit about the concrete contributions to avoid burying
    the lede.
    

    The more context you provide here, the more information you give anyone viewing the repository’s changeset (which includes your advisor and other collaborators).

    Doing this right is hard, I know, but it is very useful. Strive to do the best you can.

  3. Use --word-diff or --color-words in commands that print diffs, such as git diff or git log -p. This is especially nice when viewing diffs of hard-wrapped text.

  4. Push and pull often. You want to make sure you’re up-to-date with your collaborators as you’re making changes. Going a long time between pulling in changes is a bad idea, especially when you know multiple people are working at once during crunch time. Frequent incorporation of changes can often avoid conflicts entirely, and when they do occur they are smaller in size and more manageable.

  5. Version as much as possible. This includes code and data used to generate figures, but also should include things like requirements.txt or Pipfiles for your code to capture the exact libraries and their versions used. You want to be able to come back to the repository potentially years later and still be able to compile the paper from scratch and update figures. Remember: if it’s your first-author paper it’s likely going to be part of your thesis. You’ll thank yourself later if you make porting its contents to the thesis format easier.

    Try to document how to use your tools, what version of the Python interpreter you were using at the time, etc. This is useful for both you and your lab mates, should your group dynamic be to share completed results (and it should be).

  1. Ideally, you should also version control the data used for figure generation. There’s an important distinction here between the raw data used for generating a figure and the relevant data used by the plot itself. I see a lot of “figure” creation scripts that do a lot of heavy processing before doing the actual work of making the plot. Save those intermediate results! If you can’t commit the raw data, at least commit the processed version of the data that has the relevant information for regenerating a plot.