Scala Murmurhash3 Library Not Matching Python Mmh3 Library
Solution 1:
Scala uses Java strings, which are encoded as UTF-16. These are packed two at a time into an Int; Python uses a char* (8 bits), so it packs in four characters at a time instead of two.
Edit: Scala also packs the chars in MSB order, i.e. (s.charAt(i) << 16) | (s.charAt(i+1)). You might need to switch to an array of shorts and then swap every pair of them if it's really important to get exactly the same answer. (Or port the Scala code to Python or vice versa.) It also finalizes with the string length; I'm not sure how Python incorporates length data, if it does at all. (This is important so you can distinguish the strings "\u0000" and "\u0000\u0000".)
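To make the packing difference concrete, here is a small Python sketch. The helper names utf16_pair_block and byte_quad_block are mine, purely for illustration, and I am assuming the byte-oriented hash reads each group of four bytes as a little-endian 32-bit word, which is the usual layout for MurmurHash3 x86_32.

```python
def utf16_pair_block(s: str, i: int = 0) -> int:
    """Pack code units i and i+1 MSB-first, like (s.charAt(i) << 16) | s.charAt(i+1)."""
    units = s.encode("utf-16-be")  # 2 bytes per UTF-16 code unit, no BOM
    hi = int.from_bytes(units[2 * i:2 * i + 2], "big")
    lo = int.from_bytes(units[2 * i + 2:2 * i + 4], "big")
    return ((hi << 16) | lo) & 0xFFFFFFFF


def byte_quad_block(data: bytes, i: int = 0) -> int:
    """Read four 8-bit chars starting at i as one little-endian 32-bit word (assumed layout)."""
    return int.from_bytes(data[i:i + 4], "little")


text = "Fidd"
print(hex(utf16_pair_block(text)))                 # 0x460069: only "Fi" is consumed
print(hex(byte_quad_block(text.encode("utf-8"))))  # 0x64646946: all of "Fidd" is consumed
```

The two schemes feed different 32-bit words into the mixer from the very first block, so the final hashes have no reason to agree.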
Solution 2:
This is due to the difference in implementation between Scala's MurmurHash3.stringHash and MurmurHash3.bytesHash. MurmurHash3.bytesHash and Python's mmh3.hash pass 8-bit characters (bytes) to the hashing mixer in groups of 4, but MurmurHash3.stringHash mixes 16-bit characters in groups of 2. This means that the two hash functions return completely different outputs:
import scala.util.hashing.MurmurHash3
val testString = "FiddlyString"
MurmurHash3.stringHash(testString) /* Returns an Int */
MurmurHash3.bytesHash(testString.getBytes("UTF-8")) /* Returns a different Int */
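For reference, the groups-of-4 scheme can be sketched in pure Python. This is my own minimal version of MurmurHash3 x86_32 over bytes, not the code of mmh3 or of Scala's bytesHash; the function name murmur3_32 and its helpers are illustrative, and it assumes little-endian block reads and a signed 32-bit result.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def _mix_k(k: int) -> int:
    """Scramble one 32-bit block before it is folded into the hash."""
    k = (k * 0xCC9E2D51) & MASK32
    k = _rotl32(k, 15)
    return (k * 0x1B873593) & MASK32


def _fmix32(h: int) -> int:
    """Final avalanche step."""
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86_32 over bytes, returned as a signed 32-bit int."""
    h = seed & MASK32
    n_full = len(data) // 4 * 4
    # Body: consume the input four bytes at a time.
    for i in range(0, n_full, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        h ^= _mix_k(k)
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32
    # Tail: the remaining 1 to 3 bytes, mixed without the rotate/multiply step.
    tail = data[n_full:]
    if tail:
        h ^= _mix_k(int.from_bytes(tail, "little"))
    # Finalization: fold in the byte length, then avalanche.
    h ^= len(data) & MASK32
    h = _fmix32(h)
    return h - (1 << 32) if h & 0x80000000 else h


print(murmur3_32("FiddlyString".encode("utf-8")))
```

One thing to watch when comparing outputs: if I read the Scala standard library correctly, the one-argument bytesHash falls back to a non-zero default seed (arraySeed), while mmh3.hash defaults to seed 0. Passing the seed explicitly on both sides, e.g. MurmurHash3.bytesHash(bytes, 0) and mmh3.hash(s, 0), avoids that trap.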
So if you need Python's and Scala's MurmurHash3 values to match exactly, either:
- Use MurmurHash3.bytesHash(myString.getBytes("UTF-8")) instead of MurmurHash3.stringHash() together with mmh3.hash(), making sure both sides use the same encoding and the same seed, or
- Use MurmurHash3.stringHash with the pymmh3.string_hash function that I adapted from wc-duck's pure-python implementation of MurmurHash3 to be compatible with Scala's MurmurHash3.stringHash.
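To give an idea of what the second option involves, here is a rough Python sketch of a stringHash-style function. It is not the pymmh3.string_hash code itself; it is reconstructed from my reading of Scala's standard library, so the name scala_string_hash, the STRING_SEED constant (assumed to be Scala's default stringSeed) and the exact mixing order are assumptions to validate against real Scala output before relying on them.

```python
MASK32 = 0xFFFFFFFF
STRING_SEED = 0xF7CA7FD2  # assumption: default seed used by Scala's stringHash


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def _mix_k(k: int) -> int:
    """Scramble one 32-bit block before it is folded into the hash."""
    k = (k * 0xCC9E2D51) & MASK32
    k = _rotl32(k, 15)
    return (k * 0x1B873593) & MASK32


def _fmix32(h: int) -> int:
    """Final avalanche step."""
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def scala_string_hash(s: str, seed: int = STRING_SEED) -> int:
    """Hash UTF-16 code units two at a time, MSB-first, finalizing with the unit count."""
    raw = s.encode("utf-16-be")
    units = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
    h = seed & MASK32
    i = 0
    # Body: pairs of 16-bit chars packed as (first << 16) | second.
    while i + 1 < len(units):
        data = ((units[i] << 16) | units[i + 1]) & MASK32
        h ^= _mix_k(data)
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32
        i += 2
    # Odd trailing char: mixed without the rotate/multiply step.
    if i < len(units):
        h ^= _mix_k(units[i])
    # Finalization uses the number of chars, not the number of bytes.
    h ^= len(units) & MASK32
    h = _fmix32(h)
    return h - (1 << 32) if h & 0x80000000 else h


# The length term keeps these two apart even though every char is zero.
print(scala_string_hash("\u0000"), scala_string_hash("\u0000\u0000"))
```

Note that len(units) counts UTF-16 code units, so characters outside the Basic Multilingual Plane count twice, as they do for String.length on the JVM.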
I'd advise the first option, especially if your use case requires better performance or you need to hash massive strings, since the second option relies on a pure-Python implementation.