Bloom Filter Vs Feature Hashing → Big Data Partnership → Unlock Value from Complex Data

Bloom Filter Vs Feature Hashing

Bloom Filter

A Bloom filter is a space-efficient probabilistic data structure that is used to efficiently encode sets and perform set membership tests, whether an element is a member of a set. False positives are possible, but false negatives are strictly not possible. i.e. a query returns either “inside set (may be wrong)�? or “definitely not in set�?.

How it works,

Let h1, h2, h3 … hk be the set of hash functions, where hash function is a compression function.
A Bloom filter calculates hi(x) = d and sets the d’th bit in the m-length bit-vector for all hash functions ‘hi’ and for each feature value ‘x’.

To test whether any element ‘x’ is present in the feature-set, you check that hi(x) bit is set for all hash functions ‘hi’. If any bit is not set, then ‘x’ is not in the set.

Implications

Bloom filters have false-positives due to hash collisions, but never have a false-negative. By adding more hash functions we can reduce the false-positives.

Removing an element from a simple Bloom filter is impossible because false negatives are not permitted, An element is mapped to k bits, and setting any one of those k bits to zero will remove the element, as a side-effect it also results in removing any other elements that was mapped onto that bit. Clearing any of the bits would introduce the possibility for false-negatives and which is not permitted.

One-time removal of an element from a Bloom filter can be simulated by having a second Bloom filter that contains items that have been removed. However, false positives in the second filter become false negatives in the composite filter, which are not permitted. In this approach re-adding a previously removed item is not possible, as one would have to remove it from the “removed�? filter.

If the false-positive rate is too high then the filter need to be regenerated.

Feature Hashing

Feature hashing is similar to Bloom filters where we compute h(x) = d and set the d’th bit, except that only a single hash function is used unlike using k hash functions in the bloom filter. Increasing number of hash functions will increase error, which is why Bloom filters don’t work well.
Error increases as ‘k’ increases, minimum error achieved at k = 1, which is different from the usual Bloom filter set membership tests. Feature hashing is relatively a good approach where only one hash function is used. Using only one hash function also improve the performance of learning algorithms.

Locality sensitive hashing (LSH)

The main idea in Locality sensitive hashing (LSH) is to define a hash function ‘h’ such that h(s1) = h(s2) if the two sets ‘s1′ and ‘s2′ are similar. LSH is complementary to feature hashing, because it reduces the number of items while feature hashing reduces the number of features both feature hashing and locality-sensitive hashing could be combined in a real system.

Posted on February 29, 2012 in Blog, Mahout, Science, Technology

Share the Story

Leave a reply

Back to Top