Many thanks to Dr. Luciana Buriol, I recently updated the code to fix a few bugs. Please find the recent version of code here. The result from a test run:

Min-wise
Independent Permutation Families are newly defined on [1]. One application
of such a family (under some
relaxations) is essential to the algorithm used in practice by the AltaVista
web index software to detect and filter nearduplicate documents. There are
also potential applications of such families in de-randomization. A permutation
family *F* (subset of all permutations over [1..*n*]) is *exactly
min-wise independent* if for any subset *X* of [1..*n*], and any *x*
in *X*, when p
is chosen at random from *F* we have

*Pr{min{**p(X)}=
**p(x)}=1/|X|.*

In
practice, we can and sometimes have to allow certain relaxations. One of the
relaxed definition is approximate restricted min-wise permutation family. We
say that *F *is *approximately minwise independent* with relative
error *e*, *restricted* for sets up to size *k*, if for any set
subset *X* of [1..*n*] with |*X*|<*k* and any *x*
in *X*, when p
is chosen at random in *F* we have

|*Pr{min{**p(X)}=
**p(x)}-1/|X||<e/|X|.*

This
project implements an permutation generator to construct an approximate
min-wise (*1/2^err* away), restricted (by *2^k*) permutation family
for [0..*2^n-1*], given parameters: *n, k, err*. This C++ implementation consists of several
classes:

- mwPermutation: An class represents a permutation in an approximate min-wise (1/2^err away), restricted (by k) permutation family for [0..2^n-1].
- mwPermutaion(int n, int k, int err): Constructor for such a permutation, given n ,k, err.
- INTTYPE map(INTTYPE x): Return the permuted location for x, where 0<=x<=2^n-1
- kwHashFunction: An approx (1/2^err away), k-wise independent hashing function from [0..2^n-1] to [0..2^r-1].
- kwHashFunction(int n, int r, int k, int err): Constructor, given n, k, err.
- INTTYPE hash(INTTYPE x): Return the hashing value for x, where 0<=x<=2^n-1
- LFSR: An implementation of Linear Feedback Shift Register with m-bit source sequence and m-bit feedback sequence.
- LFSR(int m): Constructor.
- void getBits(INTTYPE x, int len, BIT *bits): Return "len" bits from position x.
- INTTYPE getBits(INTTYPE x, int len): Return "len" bits from position x, in form of a INTTYPE.
- irreduciblePolynomial: A irreducible Polynomial of degree d over GF[2].
- irreduciblePolynomial(int d): Constructor.

For given parameters *n*, *k*, and *err*, we
construct *hi* to be almost *ki*-wise independent hashing function
from [0..2^*n*-1] to [0..2^*ri*-1], where

k1=k

*k(i+1)=ki-1*

We choose *ri* so that *hi* maps any set of size *2^ki*
into *[2^ri]* in such a way so that no bucket has size greater than *2^k(i+1)*
with probability at least *1-1/(2*2^e*k)*. *hi* will be close enough
to *ki*-wise independent so that the difference adds a error probability *1/(2*k*2^e).*

*ri=2+2/ki*(e+1+log2(k))*

*hi* can decide the first significant *ri* bits for
a input x over the permutation. Each hashing function can be implemented by get
a string from an almost *2^ki*ri*-wise independent distribution on *2^n*ri*
bits.

The space cost for this construction will be *O(loglogN +
logK + log(1/E)),* where *N=2^n, K=2^k, E=1/2^err*.

We use a LFSR implementation to generate the distribution. A LFSR is defined as:

Given two m-bit sequence s and f. the shift register sequence generated by feedback rule f and start sequence s is

*ri=si*, if *i<m*

or *sum{fj*r(I-m+j)} *for *j=0...m-1*,
if* i>=m*

In [2] the authors states that if *f *is a
non-degenerate, i.e., *f0=1* and *f(t)=t^m+sum{fj*t^j} 0<j<m*
is a reducible polynomial, such m-LFSR family can generate approx (e away in L1
form) k-wise Independent distribution over n bits, if m>=k*2log(k*log(n)/2e).

The space cost for this
construction will be *O(k+2log(k*log(N)/(2*e)).*

Since irreducible polynomials are quite
common (somewhat like prime numbers), it is quite easy to find them using the
following efficient algorithm for testing for irreducibility over GF(2):

int poly_irreducible( const vlong & x )

{

` unsigned d = x.bits()-1;`

` vlong u = 2;`

` for (unsigned i=1;i<d;i+=1)`

` {`

` u = poly_rem( poly_mul(u,u), x );`

` vlong g = poly_gcd( u^2, x );`

` if ( g != 1 ) return 0;`

` }`

` return 1;`

}

“Here poly_mul, poly_rem denote
polynomial multiplication and remainder, and poly_gcd finds the greatest common
factor of two polynomials (essentially Euclid's algorithm). Note that the function
is using a mixture of normal 2's complement arithmetic and polynomial routines
in a slightly obscure way; x.bits() is 1+log2(x), and ^ is the 'C' exclusive-or
operation not exponentiation.”

The Time complexity is O(*d^3*) for
testing a polynomial with degree *d*.

In this implementation, we use an implementation of the "minimal standard" multiplicative linear congruential generator from [4]. Also there is a piece of ugly written code to test the permutation generator.

To
store an approx restricted min-wise permutation, the space cost is *O(loglogN
+ logK + log(1/E)),* where *N=2^n, K=2^k, E=1/2^err*.

To
construct an approx restricted min-wise permutation, the time cost is in the
same order as *O(loglogN + logK + log(1/E))*, where *N=2^n, K=2^k, E=1/2^err*. To compute the
permuted position map(x) for input x, the cost is *O(N*(loglogN+logK+log(1/E))*.

Please
note this implementation serves mainly as a class project to understand the
concept of min-wise independent permutation. Currently, it can only handle *n<32*,
also the space cost is not well managed. A more practical version can be
implemented upon well-done
mathematical libraries, such as NTL[5] or matlab. The following figure shows
how well this implementation approximates min-wise independent permutation.
“eps-n-k-err-t” is defined as

*Eps-n-k-err=|Pr{min{**p(X)}=
**p(0)}*|X|-1|*.
Where *X={0,1,…,2^k-1}*

*Pr{min{**p(X)}=
**p(0)}*
is the observed probability for *t* permutations constructed using
mwPermutation class.

Ideally,
eps-n-k-err-t will be bounded by *1/2^err* after a large number of trials.
This characteristic can be observed clearly by “eps-8-3-5-4000” and “eps-10-4-5-4000”.
However for larger n, k, e.g. “eps-10-5-5-4000”, this is not obvious, partially
because our 4000 sampling trials only cover a very small faction of permutation
space.

1. Min-wise independent Permutations, Andrei Z. Broder, Moses Charikar, Alan M. Frieze, Michael Mitzenmacher, 1999

2. Simple Constructions of Almost k-wise Independent Random Variables, Noga Alon, Oded Goldreich, Johan, Hastad, Rene Peralta, 1992

3. Elliptic curve cryptography FAQ v1.12 22nd December 1997

4. "Random Number Generators: Good Ones are Hard to Find", Park, S.K. and Miller, K.W., CACM 31:10, Oct. 88, pp. 1192-1201.

5. NTL: A Library for doing Number Theory, Victor Shoup, http://www.shoup.net/ntl/

*Comments
and Suggestions are welcome. Jerry Zhao.
2000/05/04 *