Systems Programming in Data Science

Posted: September 18, 2015

A former student of mine recently asked a good question:

Is <systems programming course> really beneficial for everyone? I want to do NLP and <systems programming course> makes me feel like the political science major in a math class complaining “WHEN WILL I EVER NEED TO USE THIS MULTIPLICATION STUFF?”

This speaks to me. Throughout my undergraduate education, I always needed to find an answer to “why study this” before I could really get into the material. Too often, courses just presuppose that students understand why the material is useful and offer little to no explanation for the applicability of what’s being taught.

I’m sure I’m just as guilty of this as anyone else.

This question is born of asking “how does this apply to me?”, which is a good question to be asking. In fact, one of the things my advisor always stresses is that, for any presentation you ever listen to, you should be able to think of at least one application of the presented technique to your own work. He’s some sort of deity at this: without fail, he will ask some question that ties the presenter’s research in with his own in a way that could easily be a stepping stone toward collaboration.

Here’s my answer to “why should I study systems-level programming for a degree/career in data science”. I define “data science” loosely to include the NLP mentioned above, but this applies just as well to the entire field of Database and Information Systems.

Real-world performance matters

Suppose you design a new algorithm for solving some inference task. An easy-to-overlook but very important aspect of actually implementing that algorithm is the question: is it parallelizable? Note that there are two systems programming aspects here: threads and synchronization (parallelization on one machine) as well as networking/distributed systems (parallelization across several machines). Both of these considerations require, at a bare minimum, a rudimentary understanding of systems-level programming. If your algorithm gives us a 1% better F1 score but takes 100% more time, chances are it won’t be adopted. Time constraints in data science are very real.
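
To make the single-machine case concrete, here’s a minimal sketch (not from any particular paper or library) of fanning work out across threads and combining per-thread partial results. The score_document function is a hypothetical stand-in for whatever per-item computation your algorithm actually does.

    #include <cstddef>
    #include <numeric>
    #include <string>
    #include <thread>
    #include <vector>

    // Hypothetical per-item work: a stand-in for your real inference step.
    double score_document(const std::string& doc)
    {
        return static_cast<double>(doc.size());
    }

    double total_score(const std::vector<std::string>& docs, unsigned num_threads)
    {
        std::vector<double> partial(num_threads, 0.0);
        std::vector<std::thread> workers;

        for (unsigned t = 0; t < num_threads; ++t)
        {
            workers.emplace_back([&docs, &partial, t, num_threads]() {
                // Each thread walks a strided slice of the data, accumulates
                // locally, and writes its result to its own slot at the end,
                // so no locking is needed inside the hot loop.
                double local = 0.0;
                for (std::size_t i = t; i < docs.size(); i += num_threads)
                    local += score_document(docs[i]);
                partial[t] = local;
            });
        }

        for (auto& w : workers)
            w.join();

        // Combine the per-thread partials on the main thread.
        return std::accumulate(partial.begin(), partial.end(), 0.0);
    }

Even this toy version forces the questions that matter at scale: how do you partition the data, and where do partial results get merged without threads stepping on each other?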

A perfect example of this is the KenLM project that came out of Kenneth Heafield’s PhD work. His thesis, “Efficient Language Modeling Algorithms with Applications to Statistical Machine Translation”, was motivated by the observation that statistical machine translation performance improves monotonically with the amount of data used to train the target-side language model. Unfortunately, existing toolkits were too slow in practice for real MT systems to throw web-scale data at their language models, unless you were Google and could throw 100+ machines and several entire DAYS at estimating your language model. And even then, you would still need approximations.

That is, before KenLM was implemented.

Sharing your results as a real-world system

Suppose you’ve designed a cool system that you want to show the world. Probably, you’re going to set up some sort of web-based demo. Suddenly you’re faced with the task of integrating your algorithm with a web-facing application. There are a ton of ways to do this, but the simplest is to design some basic HTTP-based API for your tool. Now you’re dealing with networking again. Granted, you’re probably going to use some library to abstract things away from the raw BSD socket API, but the same fundamental concepts apply. How will your service handle multiple clients at the same time? Threads? A reactor/event loop? Now you’ve got to think about how your algorithm ties into that. Are your models thread safe? Can they be safely integrated into a non-blocking, evented IO loop?
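
As an entirely hypothetical sketch of that last question, here’s what a thread-safe wrapper around shared model state might look like using a readers-writer lock (C++17’s std::shared_mutex): many request-handler threads can read concurrently, while the rare update takes an exclusive lock. None of these names come from a real framework; they’re just illustrations.

    #include <shared_mutex>
    #include <string>
    #include <unordered_map>

    // Hypothetical service wrapping a model that many request-handler
    // threads will hit concurrently.
    class TaggerService
    {
      public:
        // Readers: every request thread can call this at the same time,
        // since a shared (read) lock permits concurrent access.
        std::string tag(const std::string& sentence) const
        {
            std::shared_lock<std::shared_mutex> lock{mutex_};
            auto it = cache_.find(sentence);
            return it != cache_.end() ? it->second : run_model(sentence);
        }

        // Writer: storing new results (or swapping in a retrained model)
        // takes an exclusive lock so readers never see half-updated state.
        void store(const std::string& sentence, const std::string& tags)
        {
            std::unique_lock<std::shared_mutex> lock{mutex_};
            cache_[sentence] = tags;
        }

      private:
        // Placeholder for actual inference.
        std::string run_model(const std::string& sentence) const
        {
            return sentence;
        }

        mutable std::shared_mutex mutex_;
        std::unordered_map<std::string, std::string> cache_;
    };

If your server is instead a single-threaded event loop, the locking goes away, but then run_model had better not block that loop for long.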

These are the same issues practitioners will encounter when/if they try to leverage your models in real systems. It’s important to think about your algorithms in the bigger context of some real system: if it’s impossible to integrate, it won’t be used.

My own personal anecdote

In general, you can’t get away with ignoring systems stuff when designing real systems. Now, if you want to be the guy who spends all of his time in theory land and mathematics, go right ahead (malloc is constant time! I have all the memory! What’s “out-of-core”?).

But at the end of the day, someone like me has to implement the algorithm on a real computer, and these systems-level things come into play all the time. My short list:

  • Threading and synchronization: threads, mutexes, condition variables, semaphores, deadlock avoidance, readers-writers locks, …
  • Caching: both on the actual CPU and for your own results
  • Networking: sockets, TCP vs UDP, blocking vs non-blocking IO, linearizability and eventual consistency, …
  • Filesystems and Serialization: in-memory to on-disk model storage (hint: you deal with bytes, endianness, and compression!), buffering, transaction-safety, … (see the sketch after this list)

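To pick one item off that list, here’s a small, hypothetical sketch of the serialization point: writing model weights to disk with an explicit byte order, so a file written on one machine can be read back on another. The format here (a length prefix followed by raw IEEE-754 doubles, all little-endian) is made up purely for illustration.

    #include <cstdint>
    #include <cstring>
    #include <fstream>
    #include <string>
    #include <vector>

    // Write a 64-bit value as little-endian bytes regardless of the host's
    // native endianness, so the on-disk format is machine-independent.
    void write_le64(std::ofstream& out, std::uint64_t value)
    {
        char bytes[8];
        for (int i = 0; i < 8; ++i)
            bytes[i] = static_cast<char>((value >> (8 * i)) & 0xFF);
        out.write(bytes, sizeof(bytes));
    }

    // Hypothetical on-disk format: a length prefix followed by raw weights.
    void save_weights(const std::string& path, const std::vector<double>& weights)
    {
        std::ofstream out{path, std::ios::binary};
        write_le64(out, weights.size());
        for (double w : weights)
        {
            std::uint64_t bits;
            std::memcpy(&bits, &w, sizeof(bits)); // grab the double's raw bytes
            write_le64(out, bits);
        }
    }
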
Here’s a simple, real example from my experience: by understanding how to use tools like profilers, coupled with the knowledge that “geez, malloc() is really expensive!”1, I was able to nearly double the throughput of the text analysis pipeline in MeTA.
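
The actual MeTA change isn’t reproduced here, but the general pattern is the kind of thing a profiler points you at. Here’s a made-up tokenizing loop to illustrate: hoisting the token buffer out of the hot loop (and reserving capacity once) removes a heap allocation per token.

    #include <cstddef>
    #include <string>
    #include <vector>

    // Hypothetical hot loop over whitespace-separated tokens. Constructing a
    // new std::string per token means a heap allocation per token; reusing one
    // buffer whose capacity grows to the largest token seen avoids almost all
    // of them.
    std::size_t count_long_tokens(const std::vector<std::string>& docs)
    {
        std::size_t count = 0;
        std::string token;   // hoisted out of the loop and reused
        token.reserve(64);   // avoid reallocations for typical token sizes

        for (const auto& doc : docs)
        {
            std::size_t pos = 0;
            while (pos < doc.size())
            {
                auto end = doc.find(' ', pos);
                if (end == std::string::npos)
                    end = doc.size();
                token.assign(doc, pos, end - pos); // reuses existing capacity
                if (token.size() > 5)
                    ++count;
                pos = end + 1;
            }
        }
        return count;
    }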

When you’re dealing with truly vast amounts of data, a few milliseconds per data item adds up. Constants matter in real-world systems.

For every polynomial-time algorithm you have, there is an exponential algorithm that I would rather run. — Alan Perlis

A better understanding of the system will lead to better-performing algorithms, often in surprising ways. Things like cache effects and false sharing become incredibly important as you scale things up.
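
False sharing in particular is easy to demonstrate with a toy benchmark (hypothetical, but representative): two threads incrementing two different counters still contend heavily if those counters share a cache line, and padding each counter onto its own line typically makes the same code noticeably faster.

    #include <atomic>
    #include <cstdint>
    #include <thread>

    // Two counters that land on the same 64-byte cache line: each thread only
    // touches its own counter, but the hardware still ping-pongs the line
    // between cores on every write.
    struct SharedLine
    {
        std::atomic<std::uint64_t> a{0};
        std::atomic<std::uint64_t> b{0};
    };

    // Forcing each counter onto its own cache line removes the false sharing.
    struct PaddedLines
    {
        alignas(64) std::atomic<std::uint64_t> a{0};
        alignas(64) std::atomic<std::uint64_t> b{0};
    };

    template <class Counters>
    void hammer(Counters& c)
    {
        std::thread t1{[&c] { for (int i = 0; i < 10000000; ++i) c.a.fetch_add(1); }};
        std::thread t2{[&c] { for (int i = 0; i < 10000000; ++i) c.b.fetch_add(1); }};
        t1.join();
        t2.join();
    }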

  1. Seriously, strings are BAD