Updated on Jan 10, 2005:

Many thanks to Dr. Luciana Buriol, I recently updated the code to fix a few bugs. Please find the recent version of code here. The result from a test run:

An Implementation of Min-wise Independent Permutation Family

Project Report by Jerry Zhao for Prof. Ashish Goel's CS599, 2000

Introduction

Min-wise Independent Permutation Families are newly defined on [1]. One application of  such a family (under some relaxations) is essential to the algorithm used in practice by the AltaVista web index software to detect and filter near­duplicate documents. There are also potential applications of such families in de-randomization. A permutation family F (subset of all permutations over [1..n]) is exactly min-wise independent if for any subset X of [1..n], and any x in X, when p is chosen at random from F we have

Pr{min{p(X)}= p(x)}=1/|X|.

In practice, we can and sometimes have to allow certain relaxations. One of the relaxed definition is approximate restricted min-wise permutation family. We say that F is approximately min­wise independent with relative error e, restricted for sets up to size k, if for any set subset X of [1..n] with |X|<k and any x in X, when p is chosen at random in F we have

|Pr{min{p(X)}= p(x)}-1/|X||<e/|X|.

 

This project implements an permutation generator to construct an approximate min-wise (1/2^err away), restricted (by 2^k) permutation family for [0..2^n-1], given parameters: n, k, err. This C++ implementation consists of several classes:

Implementation Decomposition

Approx Restricted Min-wise Independent Permutation Construction (from [1])

For given parameters n, k, and err, we construct hi to be almost ki-wise independent hashing function from [0..2^n-1] to [0..2^ri-1], where

k1=k

k(i+1)=ki-1

We choose ri so that hi maps any set of size 2^ki into [2^ri] in such a way so that no bucket has size greater than 2^k(i+1) with probability at least 1-1/(2*2^e*k). hi will be close enough to ki-wise independent so that the difference adds a error probability 1/(2*k*2^e).

            ri=2+2/ki*(e+1+log2(k))

 

hi can decide the first significant ri bits for a input x over the permutation. Each hashing function can be implemented by get a string from an almost 2^ki*ri-wise independent distribution on 2^n*ri bits.

The space cost for this construction will be O(loglogN + logK + log(1/E)), where N=2^n, K=2^k, E=1/2^err.

Approx k-wise Independent distribution on n bits (from [2])

We use a LFSR implementation to generate the distribution. A LFSR is defined as:

Given two m-bit sequence s and f. the shift register sequence generated by feedback rule f and start sequence s is

ri=si, if i<m

or sum{fj*r(I-m+j)} for j=0...m-1, if i>=m

In [2] the authors states that if f is a non-degenerate, i.e., f0=1 and f(t)=t^m+sum{fj*t^j} 0<j<m is a reducible polynomial, such m-LFSR family can generate approx (e away in L1 form) k-wise Independent distribution over n bits, if m>=k*2log(k*log(n)/2e).

 

The space cost for this construction will be O(k+2log(k*log(N)/(2*e)).

Irreducible Polynomial Testing (from [3])

Since irreducible polynomials are quite common (somewhat like prime numbers), it is quite easy to find them using the following efficient algorithm for testing for irreducibility over GF(2):

int poly_irreducible( const vlong & x )
{
  unsigned d = x.bits()-1;
  vlong u = 2;
  for (unsigned i=1;i<d;i+=1)
  {
    u = poly_rem( poly_mul(u,u), x );
    vlong g = poly_gcd( u^2, x );
    if ( g != 1 ) return 0;
  }
  return 1;
}

“Here poly_mul, poly_rem denote polynomial multiplication and remainder, and poly_gcd finds the greatest common factor of two polynomials (essentially Euclid's algorithm). Note that the function is using a mixture of normal 2's complement arithmetic and polynomial routines in a slightly obscure way; x.bits() is 1+log2(x), and ^ is the 'C' exclusive-or operation not exponentiation.”

The Time complexity is O(d^3) for testing a polynomial with degree d.

Misc.

In this implementation, we use an implementation of the "minimal standard" multiplicative linear congruential generator from [4]. Also there is a piece of ugly written code to test the permutation generator.

  Complexity

Space:

To store an approx restricted min-wise permutation, the space cost is O(loglogN + logK + log(1/E)), where N=2^n, K=2^k, E=1/2^err.

Time:

To construct an approx restricted min-wise permutation, the time cost is in the same order as O(loglogN + logK + log(1/E)),  where N=2^n, K=2^k, E=1/2^err. To compute the permuted position map(x) for input x, the cost is O(N*(loglogN+logK+log(1/E)).

Performance:

Please note this implementation serves mainly as a class project to understand the concept of min-wise independent permutation. Currently, it can only handle n<32, also the space cost is not well managed. A more practical version can be implemented upon  well-done mathematical libraries, such as NTL[5] or matlab. The following figure shows how well this implementation approximates min-wise independent permutation. “eps-n-k-err-t” is defined as

Eps-n-k-err=|Pr{min{p(X)}= p(0)}*|X|-1|. Where X={0,1,…,2^k-1}

Pr{min{p(X)}= p(0)} is the observed probability for t permutations constructed using mwPermutation class.

Ideally, eps-n-k-err-t will be bounded by 1/2^err after a large number of trials. This characteristic can be observed clearly by “eps-8-3-5-4000” and “eps-10-4-5-4000”. However for larger n, k, e.g. “eps-10-5-5-4000”, this is not obvious, partially because our 4000 sampling trials only cover a very small faction of permutation space.

References

1.     Min-wise independent Permutations, Andrei Z. Broder, Moses Charikar, Alan M. Frieze, Michael Mitzenmacher, 1999

2.     Simple Constructions of Almost k-wise Independent Random Variables, Noga Alon, Oded Goldreich, Johan, Hastad, Rene Peralta, 1992

3.     Elliptic curve cryptography FAQ v1.12 22nd December 1997

4.     "Random Number Generators: Good Ones are Hard to Find", Park, S.K. and Miller, K.W.,  CACM 31:10, Oct. 88, pp. 1192-1201.

5.     NTL: A Library for doing Number Theory, Victor Shoup, http://www.shoup.net/ntl/

 

Comments and Suggestions are welcome. Jerry Zhao. 2000/05/04