Application Performance

[ARCHIVED] Last updated by Joe Schaefer on Tue, 23 Apr 2024

Many developers fall into the trap of thinking performance optimization is about making each line of code as efficient as possible, or about writing their entire app in the fastest programming language they can find.

It’s actually the opposite. You start with the architectural constraints of the application, and use them to drill down to the observed “slowest” part of the program. That part’s implementation guides all the other performance choices you need to make. Anything that isn’t as slow as that part need not be optimized further. For the rest of your program’s code, focus instead on human expression, and on the simplicity and clarity of the implementation for non-expert readers, over the SSDLC of the software.
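To make the drill-down concrete, here’s a toy sketch: time each stage of a pipeline in isolation, and only the stage that dominates the total is worth optimizing. The three stages below are placeholders, not from any real application:

    # a minimal sketch of the drill-down: time each stage alone so the
    # slowest one is unambiguous (the stages are stand-ins)
    seq 1000000 > /tmp/input
    time sort -n /tmp/input > /tmp/sorted            # stage 1
    time uniq -c /tmp/sorted > /tmp/counted          # stage 2
    time awk '{s+=$1} END {print s}' /tmp/counted    # stage 3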

You can iterate on this playbook, but I’ve never needed to go beyond 3 iterations in my professional career.

So go ahead and use an elegant programming language like Python 3 or JavaScript/TypeScript, and let the Subject Matter Experts (SMEs) out in the open-source world give you powerful C/C++ native bindings for your special-purpose needs. Nothing you do for business logic needs more speed than any dynamic programming language can give you out of the box.

Even a dependency-free bash script is a workable solution for many basic tasks. Here’s one I wrote for Augmented Reality firm Magic Leap years ago, to replace a clumsy OpenGrok service with something that leverages multiprocessor parallelization with xargs -P, and supports PCRE search with simple Emacs/Vim bindings.

https://github.com/joesuf4/home/blob/wsl/bin/pffxg.sh

That script is an order of magnitude faster than the usual suspects on GitHub, which were all written in static, compiled programming languages. By identifying the exact bottleneck in bash (a loop issuing high-volume fork+exec calls in the middle) and handing that work to xargs instead, you get a script that looks a lot like this one, with the core algorithm implemented in 10 lines of shell.
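To see the two shapes side by side, here’s a minimal sketch of the bottleneck and its fix (the pattern and file filter are placeholders; this is not the actual pffxg.sh source):

    # the slow shape: one fork+exec of grep per file (and the unquoted
    # command substitution also breaks on whitespace in filenames)
    for f in $(find . -type f -name '*.c'); do
        grep -Pn 'PATTERN' "$f" /dev/null   # /dev/null forces filename prefixes
    done

    # the fast shape: batch filenames into a few parallel grep processes
    find . -type f -name '*.c' -print0 \
        | xargs -0 -P "$(nproc)" -n 128 grep -Pn 'PATTERN' /dev/null

The second form runs the same greps, but it amortizes process startup across 128 files at a time and keeps every core busy.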

It’s also using the open-source community of SMEs in a smart way, instead of the way the other “filtered recursive grep” implementations on GitHub did. Rather than internally adopting and maintaining my own (threaded) implementations of find, xargs, and grep, I just reuse, as-is, the pre-installed executables other SMEs have been perfecting for decades. I don’t need to master their implementations, just their CLIs. I don’t even want to master them; that’s their bailiwick. Performance deltas only matter when they are several seconds or more, given the application’s expected (human) use cases.

To see the opposite tack, where everything is done in-house and fully microoptimized, yet still can’t beat this script with the default search options, and with no caching system available, here is a fine example: https://github.com/BurntSushi/ripgrep

Just to pull the first #performance #benchmark off that page, and scale it up from a toy sample tree (the Linux kernel sources) to a heterogeneous tree that’s 23GB (best runs after 3 iterations; LANG=en_US.UTF-8):

    % du -sh .
    23G .
    % time rg -uuniw '[A-Z]+_SUSPEND' | wc -l
    6259
    rg -uuniw '[A-Z]+_SUSPEND' 9.46s user 16.08s system 261% cpu 9.759 total
    wc -l 0.00s user 0.07s system 0% cpu 9.759 total
    % time pffxg.sh -- -wnE '[A-Z]+_SUSPEND' | wc -l
    5855
    pffxg.sh -- -wnE '[A-Z]+_SUSPEND' 16.66s user 2.68s system 429% cpu 4.501 total
    wc -l 0.00s user 0.00s system 0% cpu 4.501 total

It’s quite silly to microoptimize something whose performance is deeply tied to the state of the kernel’s filesystem cache. The variation in performance timings is dominated by the access speed to the corpus of files’ contents, which is an order of magnitude more relevant than any other factor to the end results. Being on an NVMe drive helps, but nothing in this space beats RAM itself.
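You can watch that dominance directly by dropping the kernel’s caches between two otherwise identical runs (Linux-only, needs root; the search is just an example):

    # cold run: dominated by disk reads
    sync && echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
    time rg -uuniw '[A-Z]+_SUSPEND' > /dev/null

    # warm run: the same search, now served from the filesystem cache in RAM
    time rg -uuniw '[A-Z]+_SUSPEND' > /dev/null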

That’s why an in-memory, compressed cache for a large corpus of files will stabilize the performance timings. It is surprising that no one else thought this was important enough to support.
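Here’s a hypothetical sketch of the idea, not pffxg.sh’s actual implementation: mirror the corpus into tmpfs as lzop-compressed files once, then let every search decompress from RAM.

    # seed once (a slow fork+exec loop is fine here; it runs once a day)
    find . -type f | while IFS= read -r f; do
        mkdir -p "/dev/shm/corpus-cache/$(dirname "$f")"
        lzop -c "$f" > "/dev/shm/corpus-cache/$f.lzo"
    done

    # every search now decompresses from RAM instead of touching disk
    # (filenames omitted from the output, for brevity)
    find /dev/shm/corpus-cache -name '*.lzo' -print0 \
        | xargs -0 -P "$(nproc)" -n 64 \
            sh -c 'lzop -dc "$@" | grep -E "[A-Z]+_SUSPEND"' _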

Take the second #performance #benchmark off that page, and scale it up as before (same 23GB tree):

    % time rg -tc -uuuiwn '[A-Z]+_SUSPEND' | wc -l
    5629
    rg -tc -uuuiwn '[A-Z]+_SUSPEND' 3.51s user 1.71s system 1141% cpu 0.457 total
    wc -l 0.00s user 0.05s system 11% cpu 0.457 total
    % time LANG=C pffxg.sh --cache /tmp/pffxg-$USER --workers 32 --cc -- -wE '[A-Z]+_SUSPEND' | wc -l
    5628
    LANG=C pffxg.sh --cache /tmp/pffxg-$USER --workers 32 --cc -- -wE  3.14s user 0.88s system 1055% cpu 0.381 total
    wc -l 0.00s user 0.00s system 0% cpu 0.381 total

A tuned pffxg.sh is still faster, despite all the work put into microoptimizing ripgrep for this C-file lookup.

The way I used this script with AOSP was to schedule a repo sync and a subsequent pffxg.sh lzop-compressed-cache seed-to-tmpfs run every morning before work (via crontab), with PFFXG_CACHE=... set in my ~/.pffxg.conf file. Thus any pffxg.sh invocations I ran throughout the workday would use the compressed cache in tmpfs, no matter what the state of the kernel’s filesystem cache was at the time.
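As a sketch, the crontab entries looked something like this (the times, the tree location, and the assumption that a match-everything -l '' search is enough to seed the cache are mine for illustration; --cache and PFFXG_CACHE are the real knobs):

    # hypothetical morning schedule: sync at 6, seed the tmpfs cache at 7
    0 6 * * 1-5    cd "$HOME/aosp" && repo sync -q
    0 7 * * 1-5    cd "$HOME/aosp" && pffxg.sh --cache /dev/shm/pffxg-cache -- -l '' > /dev/null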

0.25M LOC between ripgrep and ugrep. 632 LOC for pffxg.sh. Silly.

Because it’s such a small shell program, pffxg.sh can give you powerful hooks into its internals with almost zero effort. Even the grep command itself is customizable: any command that can accept a list of filenames appended to the end of its arguments is fair game to run over a select corpus of files. Here’s a “total line count in MiLOC” exercise on the Linux kernel git repo:

    % time find * -type f | xargs wc -l | awk '$2 == "total" {a+=$1} END {print a/1024**2}'
    28.451
    find * -type f 0.00s user 0.06s system 2% cpu 2.733 total
    xargs wc -l 0.53s user 1.02s system 54% cpu 2.853 total
    awk '$2 == "total" {a+=$1} END {print a/1024**2}' 0.23s user 0.59s system 28% cpu 2.853 total

    % time pffxg.sh --workers 8 --cmd wc --all -- -l | awk '$2 == "total" {a+=$1} END {print a/1024**2}'
    28.4506
    pffxg.sh --workers 8 --cmd wc --all -- -l 0.92s user 0.66s system 826% cpu 0.192 total
    awk '$2 == "total" {a+=$1} END {print a/1024**2}' 0.02s user 0.00s system 11% cpu 0.192 total

ripgrep version:

    % time rg -c \$ | awk -F : '{a+=$2} END {print a/1024**2}'
    28.4284
    rg -c \$ 2.12s user 2.19s system 276% cpu 1.564 total
    awk -F : '{a+=$2} END {print a/1024**2}' 0.58s user 0.45s system 66% cpu 1.564 total

Here it is restricted to C files (same Linux tree):

    % time pffxg.sh --workers 8 --cc --cmd wc -- -l | awk '$2 == "total" {a+=$1} END {print a/1024**2}'
    25.3935
    pffxg.sh --workers 8 --cc --cmd wc -- -l 0.76s user 0.54s system 734% cpu 0.177 total
    awk '$2 == "total" {a+=$1} END {print a/1024**2}' 0.02s user 0.00s system 9% cpu 0.177 total

and the ripgrep version:

    % time rg -tc -c \$ | awk -F : '{a+=$2} END {print a/1024**2}'
    25.3844
    rg -tc -c \$ 3.49s user 1.54s system 441% cpu 1.140 total
    awk -F : '{a+=$2} END {print a/1024**2}' 0.38s user 0.38s system 66% cpu 1.140 total

Real Application Performance comes from balance, flexibility, and functional programming techniques; it does not come from a fixation on imperative microoptimization tactics in static, compiled programming languages that are a bear to work with from the balance and flexibility perspectives. Such overhyped, imperative languages are great targets for very specific problem domains, but they are terrible for system-wide application performance.

pffxg.sh is not a product, and this is not a sales pitch for it. It’s an example that illustrates my point in a very dramatic way. If you are familiar with the long history of filtered-recursive-grep solutions on GitHub, they are all premised on the notion that the problem with Andy Lester’s original Perl implementation, ack, was that it was written in Perl. The only real problem, from a performance standpoint, was that the Perl was written by Andy, who didn’t seem to have any knack for systems performance concepts (like farming out the find parallelization work to a purpose-built C binary), but instead aimed for lazy portability by keeping the whole thing a single-threaded, pure-Perl oddity.

May a thousand flowers bloom, no matter how silly they seem!