Class BloomFilter
- java.lang.Object
-
- org.apache.flink.runtime.operators.util.BloomFilter
-
public class BloomFilter extends Object
BloomFilter is a probabilistic data structure for set membership check. BloomFilters are highly space efficient when compared to using a HashSet. Because of the probabilistic nature of bloom filter false positive (element not present in bloom filter but test() says true) are possible but false negatives are not possible (if element is present then test() will never say false). The false positive probability is configurable depending on which storage requirement may increase or decrease. Lower the false positive probability greater is the space requirement. Bloom filters are sensitive to number of elements that will be inserted in the bloom filter. During the creation of bloom filter expected number of entries must be specified. If the number of insertions exceed the specified initial number of entries then false positive probability will increase accordingly.Internally, this implementation of bloom filter uses MemorySegment to store BitSet, BloomFilter and BitSet are designed to be able to switch between different MemorySegments, so that Flink can share the same BloomFilter/BitSet object instance for different bloom filters.
Part of this class refers to the implementation from Apache Hive project https://github.com/apache/hive/blob/master/common/src/java/org/apache/hive/common/util/BloomFilter.java
-
-
Field Summary
Fields Modifier and Type Field Description protected BitSet
bitSet
protected int
numHashFunctions
-
Constructor Summary
Constructors Constructor Description BloomFilter(int expectedEntries, int byteSize)
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method Description void
addHash(int hash32)
static double
estimateFalsePositiveProbability(long inputEntries, int bitSize)
Compute the false positive probability based on given input entries and bits size.static BloomFilter
fromBytes(byte[] bytes)
Deserializing bytes array to BloomFilter.static byte[]
mergeSerializedBloomFilters(byte[] bf1Bytes, byte[] bf2Bytes)
static int
optimalNumOfBits(long inputEntries, double fpp)
Compute optimal bits number with given input entries and expected false positive probability.void
reset()
void
setBitsLocation(MemorySegment memorySegment, int offset)
boolean
testHash(int hash32)
static byte[]
toBytes(BloomFilter filter)
Serializing to bytes, note that only heap memory is currently supported.String
toString()
-
-
-
Field Detail
-
bitSet
protected BitSet bitSet
-
numHashFunctions
protected int numHashFunctions
-
-
Method Detail
-
setBitsLocation
public void setBitsLocation(MemorySegment memorySegment, int offset)
-
optimalNumOfBits
public static int optimalNumOfBits(long inputEntries, double fpp)
Compute optimal bits number with given input entries and expected false positive probability.- Parameters:
inputEntries
-fpp
-- Returns:
- optimal bits number
-
estimateFalsePositiveProbability
public static double estimateFalsePositiveProbability(long inputEntries, int bitSize)
Compute the false positive probability based on given input entries and bits size. Note: this is just the math expected value, you should not expect the fpp in real case would under the return value for certain.- Parameters:
inputEntries
-bitSize
-- Returns:
-
addHash
public void addHash(int hash32)
-
testHash
public boolean testHash(int hash32)
-
reset
public void reset()
-
toBytes
public static byte[] toBytes(BloomFilter filter)
Serializing to bytes, note that only heap memory is currently supported.
-
fromBytes
public static BloomFilter fromBytes(byte[] bytes)
Deserializing bytes array to BloomFilter. Currently, only heap memory is supported.
-
mergeSerializedBloomFilters
public static byte[] mergeSerializedBloomFilters(byte[] bf1Bytes, byte[] bf2Bytes)
-
-