Why does the collection library for Dart use a bit mask for hashing collections?
I was implementing a hashing function for a class and I took a minute to look at the first-party collection package for Dart to see how they implemented their hashing function for collections. They use the Jenkins one-at-a-time algorithm to calculate the hashing, but I noticed something peculiar: the function was applying a bit mask to every operation that modified the running hash value: (I'm using the ListEquality implementation here for ease of comprehension, but every collection hash function uses this bit mask. Also, the _elementEquality.hash just calculates a hash code for an element if a custom hashing function is specified, and defaults to using the element's own hashing function otherwise.) const _hashMask = 0x7fffffff; ... int hash(List? list) { if (list == null) return null.hashCode; // Jenkins's one-at-a-time hash function. // This code is almost identical to the one in IterableEquality, except // that it uses indexing instead of iterating to get the elements. var hash = 0; for (var i = 0; i < list.length; i++) { var c = _elementEquality.hash(list[i]); hash = (hash + c) & _hashMask; hash = (hash + (hash > 6; } hash = (hash + (hash > 11; hash = (hash + (hash

I was implementing a hashing function for a class and I took a minute to look at the first-party collection
package for Dart to see how they implemented their hashing function for collections. They use the Jenkins one-at-a-time algorithm to calculate the hashing, but I noticed something peculiar: the function was applying a bit mask to every operation that modified the running hash value:
(I'm using the ListEquality
implementation here for ease of comprehension, but every collection hash function uses this bit mask. Also, the _elementEquality.hash
just calculates a hash code for an element if a custom hashing function is specified, and defaults to using the element's own hashing function otherwise.)
const _hashMask = 0x7fffffff;
...
int hash(List? list) {
if (list == null) return null.hashCode;
// Jenkins's one-at-a-time hash function.
// This code is almost identical to the one in IterableEquality, except
// that it uses indexing instead of iterating to get the elements.
var hash = 0;
for (var i = 0; i < list.length; i++) {
var c = _elementEquality.hash(list[i]);
hash = (hash + c) & _hashMask;
hash = (hash + (hash << 10)) & _hashMask;
hash ^= hash >> 6;
}
hash = (hash + (hash << 3)) & _hashMask;
hash ^= hash >> 11;
hash = (hash + (hash << 15)) & _hashMask;
return hash;
}
I'm not sure what function this bit mask serves, and more than that, I worry that it runs the risk of generating collisions. Consider this (albeit contrived) example:
import 'package:collection/collection.dart';
class MyInteger {
final int x;
const MyInteger(this.x);
@override
int get hashCode => x;
}
void main() {
final a = [MyInteger(0x0000000000000000)];
final b = [MyInteger(0xffffffff80000000)];
print(a[0].hashCode); // Prints: 0
print(b[0].hashCode); // Prints: -2147483648
final equality = const ListEquality();
print(equality.hash(a)); // Prints: 0
print(equality.hash(b)); // Prints: 0
}
As I suspected, masking the top bit leads to collisions in hash codes that differ above the 31st bit. However, if I reimplement the hashing function with the bit mask removed, it works as expected:
// Reimplementation of the package hashing function with the bit mask removed
int hashAll(List? list) {
if (list == null) return null.hashCode;
var hash = 0;
for (var i = 0; i < list.length; i++) {
var c = list[i].hashCode;
hash += c;
hash += hash << 10;
hash ^= hash >> 6;
}
hash += hash << 3;
hash ^= hash >> 11;
hash += hash << 15;
return hash;
}
void main() {
final a = [MyInteger(0x0000000000000000)];
final b = [MyInteger(0xffffffff80000000)];
print(hashAll(a)); // Prints: 0
print(hashAll(b)); // Prints: 659537742195965952
}
This strikes me as a bizarre design choice. In Dart, int
s are 64-bit by definition, so this bit mask would have the effect of repeatedly chopping off the 33 most significant bits every time the hash was updated. By reducing the effective hash resolution by more than half, I'd imagine that this would significantly increase the chance of collisions, especially in collections with more .
The only thing I can think of is that since this utility was created back in 2016, it has its origins in the days when Dart was designed to be a transpiled language for Javascript. What seems to support this assumption is the fact that running that third code example in a compiled setting vs a dart2js setting. (For example, in DartPad, the generated unmasked hash is 2987540480
.)
But even if that was the case, why limit the int
to 31 bits when integers in Javascript are safe up to 53 bits, and why not make this an implementation detail for just dart2js? And if not, what purpose does the bit mask serve that justifies this increased chance of collision?