A Thing I Learned: counting bits

Jul 24 2023

How many bits are set to one in a bit string? It’s a problem that pops up in a number of areas: bitsets, cryptography, error correction, and most of all interview questions. This problem is also known by many names: hamming weight, population count, popcount, sideways sum just to name a few. Almost any software engineer can write a trivial implementation using bit shifts in minutes, yet this problem has a long history and, even today, you will find a variety of different implementations for this seemingly simple problem. I recently revisited this problem in the context of writing a simple bitset and found myself falling down a rabbit hole when I tried to answer the question: how is std::popcount implemented in C++20?

There are many ways to count bits

Let’s start with the most straight-forward approach. This is the solution you are most likely to get if you were to ask a software engineer in an interview to count the number of bits set to one in a 64-bit unsigned integer. You simply shift the value right and accumulate the sum of each right most bit until the value is zero.

int count_bits(uint64_t v) {
    int count = 0;
    for (; v != 0; v >>= 1) {
        count += v & 1;
    }
    return count;
}

There is also a similar, but more clever, version commonly known as Brian Kernighan’s Algorithm. Kernighan is credited with this strategy because of this exercise that appears in the 2nd Edition of The C Programming Language.

Exercise 2-9. In a 2’s complement number system, x &= (x - 1) deletes the rightmost 1-bit in x. Explain why. Use this observation to write a faster version of bitcount.

The approach uses the expression above and counts the number of rightmost bits that are deleted until x is zero (no more 1-bits). So unlike the earlier approach, this one executes on the order of the number of 1-bits, not the number of bits in the word.

int count_bits(uint64_t v) {
    int count = 0;
    for (; v != 0; count++) {
        v &= v - 1;
    }
    return count;
}

The most widely used implementation, though, seems to be the approach that was included in the 2nd edition of Hacker’s Delight by Henry Warren. This strategy is kind of hard to explain but the book includes an explanation and a proof. This is what you will find in Go’s math/bits implementation of OnesCount64. Here’s what the code looks like in the Go source tree currently (with some of the comments removed).

const m0 = 0x5555555555555555 // 01010101 ...
const m1 = 0x3333333333333333 // 00110011 ...
const m2 = 0x0f0f0f0f0f0f0f0f // 00001111 ...
const m3 = 0x00ff00ff00ff00ff // etc.
const m4 = 0x0000ffff0000ffff
// OnesCount64 returns the number of one bits ("population count") in x.
func OnesCount64(x uint64) int {
	// Implementation: Parallel summing of adjacent bits.
	// See "Hacker's Delight", Chap. 5: Counting Bits.
	// The following pattern shows the general approach:
	//
	//   x = x>>1&(m0&m) + x&(m0&m)
	//   x = x>>2&(m1&m) + x&(m1&m)
	//   x = x>>4&(m2&m) + x&(m2&m)
	//   x = x>>8&(m3&m) + x&(m3&m)
	//   x = x>>16&(m4&m) + x&(m4&m)
	//   x = x>>32&(m5&m) + x&(m5&m)
	//   return int(x)
	const m = 1<<64 - 1
	x = x>>1&(m0&m) + x&(m0&m)
	x = x>>2&(m1&m) + x&(m1&m)
	x = (x>>4 + x) & (m2 & m)
	x += x >> 8
	x += x >> 16
	x += x >> 32
	return int(x) & (1<<7 - 1)
}

There is an instruction for this

All those clever approaches above, it turns out we don’t really need any of them now. As I was working on my bitset implementation, I used std::popcount from C++20 and looking at the generated code on my M1 Mac, the assembly is surprisingly concise.

<count_bits>:
  fmov d0, x0
  cnt.8b v0, v0
  uaddlv.8b h0, v0
  fmov w0, s0
  ret

The CNT instruction on arm64 is a population count instruction. It is a little strange in that it’s a SIMD instruction so you have to move the value from a general register into a vector register and also sum across the vector register. But none of the clever tricks from the long history of clever tricks are needed on ARM processors. And indeed, compilers on my Mac M1 are very happy to use the CNT instruction. This swift code, for instance, uses CNT,

func countCounts(_ x: UInt64) -> Int {
    x.nonzeroBitCount
}

As does this rust code,

fn count_bits(v: u64) -> u32 {
    v.count_ones()
}

But what about x86-64? It also has a POPCNT instruction and it was introduced a long time ago in SSE4.2 (2008). I have a number of x86-64 chips in my house and they support POPCNT (You can usually check with the linux lscpu command). For a more exhaustive look at the history of the population count CPU instruction, I suggest Vaibhav Sagar’s post You Won’t Believe This One Weird CPU Instruction! which also includes a number of historical links.

Here's another thing that will bake your noodle. If you were to compile the Kernighan version of count_bits above with clang or gcc, you will very likely find that it also uses the CNT instruction on arm64. These compilers are smart enough to recognize a few common implementations of population count and replace your code with its own builtin version that is optimized for the target CPU.

As it turns out, counting the number of 1-bits in a bit string isn't as simple as it seems. if you want to read more, I recommend the following links:

Wikipedia: Hamming Weight
Vaibhav Sagar’s You Won’t Believe This One Weird CPU Instruction!
Daniel Lemire's The surprising cleverness of modern compilers