Bloom Filters and the Life List Problem
Phoebe Snetsinger’s life list held 8,398 species. When you spot bird number 8,399, you need to know: have I logged this before? A simple list means scanning thousands of names. A hash table works, but at scale it’s memory-hungry. Burton Bloom’s 1970 solution was elegantly pessimistic: a bit array with multiple hash functions that can tell you “definitely not seen” or “probably seen.”
The trade-off is perfect for birding: false positives (thinking you’ve seen a species when you haven’t) are annoying but harmless—you double-check your notes. False negatives (missing a species you’ve actually logged) would corrupt your data. Bloom filters guarantee zero false negatives.
Here’s the bit array approach in Pascal, showing three hash functions salting the same input:
program LifeListCheck;
var bits: array[0..999] of boolean;
i: integer;
function hash(s: string; salt: integer): integer;
begin hash := (length(s) * 31 + ord(s[1]) + salt) mod 1000; end;
procedure add(species: string);
begin
bits[hash(species, 0)] := true;
bits[hash(species, 17)] := true;
bits[hash(species, 83)] := true;
end;
function seen(species: string): boolean;
begin
seen := bits[hash(species, 0)] and
bits[hash(species, 17)] and
bits[hash(species, 83)];
end;
begin
for i := 0 to 999 do bits[i] := false;
add('Buteo regalis');
writeln(seen('Buteo regalis')); { TRUE }
writeln(seen('Pica hudsonia')); { FALSE }
end.
R’s vector operations make the same logic compact:
bits <- logical(1000)
hash <- function(s, salt) (nchar(s)*31 + utf8ToInt(s)[1] + salt) %% 1000 + 1
add <- function(species) {
bits[hash(species, 0)] <<- TRUE
bits[hash(species, 17)] <<- TRUE
bits[hash(species, 83)] <<- TRUE
}
seen <- function(species) {
all(bits[c(hash(species,0), hash(species,17), hash(species,83))])
}
add("Buteo regalis")
print(seen("Buteo regalis")) # TRUE
print(seen("Pica hudsonia")) # FALSE
With 1,000 bits and three hash functions, you can store about 100 species with a 1% false positive rate. For Snetsinger’s 8,398 species, you’d need roughly 84,000 bits—10 kilobytes. The entire life list of the most prolific birder in history fits in less memory than this paragraph of text.