samgeo.codes

Apparently, lexing and parsing are the the slowest part of compiling.

I was reading through some old discussions of abstraction and optimization over the weekend and stumbled upon an interesting take from Joel Spolsky's Back to Basics post:

As every compiler writer knows, lexing and parsing are the slowest part of compiling. Suffice it to say that it involves a lot of string stuff, which we discovered is slow, and a lot of memory allocation stuff, which we discovered is slow, as we lex, parse, and build an abstract syntax tree in memory. That assumes that you have enough memory to load the whole thing at once.

This quip is part of a discussion of why XML is a poor choice for performance sensitive systems or when handling a very large quantity of data.

I generally enjoy a lot of Joel's discussion of software design and how to run an effective engineering organization. But I really thought this short snippet has aged poorly (and probably wasn't even very accurate at the time):

  • my impression is that lexing and parsing is almost always one of the fastest parts of compiling, or at least simple enough to optimize and parallelize that it should rarely be the bottleneck;
  • "string stuff" is not slow (a "naive" SIMD implementation of HTML scanning proceeds faster than we can read bytes from almost any single SSD);
  • a lot of modern parsers emit offsets that consume only a tiny fraction of the memory used to represent the original document.

XML and most other text-based formats still have a significant drawback (that Joel notes correctly): we may still need to parse a significant portion of the file (if not all of the file) to retrieve a small piece of data; or we'll need to store metadata alongside the file containing offsets to various important pieces of data. I wish that Joel stayed focused on this more fundamental idea instead of ranting about "string stuff" and memory allocation.

It's interesting to see how systems programming languages have evolved over the years to tackle some of the issues that Joel describes with string processing:

  • Modern C++ offers std::string and std::string_view
  • Rust has String and &str (and, more generally, "fat pointers")
  • Zig has a notion of slices, but also supports sentinel termination

And we now have a reasonable variety of formats that support efficient seek operations (think riegeli or arrow). Joel suggests that relational databases support moving from record to record in a single CPU instruction:

With relational databases, the performance of moving from record to record is fixed and is, in fact, one CPU instruction. That’s very much by design. And thanks to memory mapped files you only have to load the pages of disk that you are actually going to use.

I'm not aware of any RDBMS system that supports single instruction steps. But we do have plenty of binary file formats that provide reasonable tradeoffs to enable efficient implementation of "seek".