blob: e7d679fc7a0d1659f776d62ecde366f62c7d11d1 [file] [log] [blame]
CityHash, a family of hash functions for strings.
Introduction
============
CityHash provides hash functions for strings. The functions mix the
input bits thoroughly but are not suitable for cryptography. See
"Hash Quality," below, for details on how CityHash was tested and so on.
We provide reference implementations in C++, with a friendly MIT license.
CityHash64() and similar return a 64-bit hash.
CityHash128() and similar return a 128-bit hash and are tuned for
strings of at least a few hundred bytes. Depending on your compiler
and hardware, it's likely faster than CityHash64() on sufficiently long
strings. It's slower than necessary on shorter strings, but we expect
that case to be relatively unimportant.
CityHashCrc128() and similar are variants of CityHash128() that depend
on _mm_crc32_u64(), an intrinsic that compiles to a CRC32 instruction
on some CPUs. However, none of the functions we provide are CRCs.
CityHashCrc256() is a variant of CityHashCrc128() that also depends
on _mm_crc32_u64(). It returns a 256-bit hash.
Performance on long strings
===========================
We are most excited by the performance of CityHash64() and its variants on
short strings, but long strings are interesting as well.
CityHash is intended to be fast, under the constraint that it hash very
well. For CPUs with the CRC32 instruction, CRC is speedy, but CRC wasn't
designed as a hash function and shouldn't be used as one. CityHashCrc128()
is not a CRC, but it uses the CRC32 machinery.
On a single core of a 2.67GHz Intel Xeon X5550, CityHashCrc256 peaks at
about 6.03 bytes/cycle. The other CityHashCrc functions are wrappers around
CityHashCrc256 and should have similar performance on long strings.
CityHash128 peaks at about 4.58 bytes/cycle. The fastest Murmur variant on
that hardware, Murmur3F, peaks at about 2.4 bytes/cycle. We expect the peak
speed of CityHash128 to dominate CityHash64, which is aimed more toward short
strings or use in hash tables.
For long strings, a couple new functions by Bob Jenkins, Spooky and Akron,
appear to be right around the same speed as CityHash128, and they also seem
targeted to the same level of quality as CityHash and Murmur. However, it
isn't clear whether these functions have been "released" or are still works
in progress. We'll try to find out, and update this.
Performance on short strings
============================
For short strings, e.g., most hash table keys, CityHash64 is faster than
CityHash128, and probably faster than all the aforementioned functions,
depending on the mix of string lengths. Here are a few results from that
same hardware, where we (unrealistically) tested a single string length over
and over again:
Hash Results
------------------------------------------------------------------------------
CityHash64 7ns for 1 byte, or 6ns for 8 bytes, or 9ns for 64 bytes
Murmur2 (64-bit) 6ns for 1 byte, or 6ns for 8 bytes, or 15ns for 64 bytes
Murmur3F 14ns for 1 byte, or 15ns for 8 bytes, or 23ns for 64 bytes
Installation
============
We provide reference implementations of several CityHash functions, written
in C++. The build system is based on autoconf. It defaults the C++
compiler flags to "-g -O2", which is probably slower than -O3 if you are
using gcc. YMMV.
On systems with gcc, we generally recommend:
./configure
make all check CXXFLAGS="-g -O3"
sudo make install
Or, if your system has the CRC32 instruction, and you want to build everything:
./configure --enable-sse4.2
make all check CXXFLAGS="-g -O3 -msse4.2"
sudo make install
Note that our build system doesn't try to determine the appropriate compiler
flag for enabling SSE4.2. For gcc it is "-msse4.2". The --enable-sse4.2
flag to the configure script controls whether citycrc.h is installed when
you "make install." In general, picking the right compiler flags can be
tricky, and may depend on your compiler, your hardware, and even how you
plan to use the library.
For generic information about how to configure this software, please try:
./configure --help
Failing that, please work from city.cc and city*.h, as they contain all the
necessary code.
Usage
=====
The above installation instructions will produce a single library. It will
contain CityHash64() and CityHash128() and their variants, and possibly
CityHashCrc128(), CityHashCrc128WithSeed(), and CityHashCrc256(). The
functions with Crc in the name are declared in citycrc.h; the rest are
declared in city.h.
Limitations
===========
1) The current version of CityHash is intended for little-endian 64-bit CPUs.
Functions that do use the CRC32 instruction should also work, slowly, in
little-endian 32-bit code. CityHash should work on big-endian CPUs as well;
if you try it, please let us know.
2) CityHash is fairly complex. As a result of its complexity, it may not
perform as expected on some compilers. For example, preliminary reports
suggest that some Microsoft compilers compile CityHash to assembly that's
10-20% slower than it could be.
Hash Quality
============
We like to test hash functions with SMHasher, among other things.
SMHasher isn't perfect, but it seems to find any significant flaw.
SMHasher is available at http://code.google.com/p/smhasher/
SMHasher is designed to pass a 32-bit seed to the hash functions it tests.
No CityHash function is designed to work that way, so we adapt as follows:
For our functions that accept a seed, we use the given seed directly (padded
with zeroes); for our functions that don't accept a seed, we hash the
concatenation of the given seed and the input string.
The CityHash functions have the following flaws according to SMHasher:
(1) CityHash64: some cases of poor avalanching, which we suspect don't matter in
practice, with some inputs <= 16 bytes.
(2) CityHash64WithSeed: one or two minor issues with inputs containing at most
two non-zero bytes and length <= 24 bytes.
(3) CityHash64WithSeeds: did not test
(4) CityHash128: none
(5) CityHash128WithSeed: none
(6) CityHashCrc128: none
(7) CityHashCrc128WithSeed: none
(8) CityHashCrc256: none
For now we are treating the flaws in our 64-bit functions as harmless, as we
expect the primary use of these functions will be in hash tables. A
"perfect" hash function is overkill there. Other opinions are welcome,
especially if you have real-world examples of these flaws causing trouble!
We hold our 128-bit and 256-bit hash functions to higher quality standards.
For more information
====================
http://code.google.com/p/cityhash/
cityhash-discuss@googlegroups.com
Please feel free to send us comments, questions, bug reports, or patches.