Information Theory & Statistical Mechanics

Summer Term 2017, University of Cologne


Sheet 3 / Problem 12 - Sampling Bias
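
In brief, the experiment works as follows: draw blocks $X=(X_1,\dots,X_n)$ of i.i.d. $\mathrm{Unif}[0,1]$ numbers, keep only the blocks that fulfill the constraints

$$\frac{1}{n}\sum_{i=1}^{n} X_i \;\ge\; \alpha \qquad\text{and}\qquad \widehat{\operatorname{Var}}(X) \;\ge\; \beta,$$

and histogram the values of all retained blocks.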

In [28]:
using Plots
using Distributions
using StatsBase

pyplot()
Out[28]:
Plots.PyPlotBackend()
In [2]:
plt = plot(legend=true)

α=0.5
β=0.1
N=5*10^6 #Number of sample blocks

X = Float64[] #This array holds all the samples that fulfill the constraints
tmp = Float64[] #temporary buffer for one block of draws

# We sample for different lengths of the sequence and see how the resulting histogram
# approaches an exponential form. Note in particular how for n=1 the distribution is 
# simply uniform but cut-off.
for n in [1,2,4,8] #length of the sequence X=(X_1,...,X_n)
    for k in 1:N
        tmp = rand(n) #draw 
        if mean(tmp)>=α && (var(tmp)>=β || n==1) #the sample variance is NaN for n=1, so skip that constraint
            append!(X,tmp)
        end
    end
    
    # For convenience I used Julia's StatsBase implementation to create the histogram.
    # Feel free to write your own!
    hist = fit(Histogram, X, nbins=20,closed=:left) 
    
    # Plot the histograms for all n in one plot 
    plot!(plt, midpoints(hist.edges[1]), hist.weights/length(X), alpha=0.6,lw=0.5,marker=:auto,lab="n=$n")
    
    empty!(X)
end
    
display(plt)

Let's try to find the parameters of the distribution by fitting a curve to the histogram.

Again for convenience, I use an algorithm from the LsqFit package to perform a least-squares fit to the data. The package uses the Levenberg-Marquardt algorithm, a damped variant of Gauss-Newton. If you'd like to know more, consult e.g. the corresponding Wikipedia page.
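
For orientation, here is a minimal sketch of the update such a fitter iterates (the names lm_step, r, J are hypothetical stand-ins, not LsqFit's API; written in the Julia-0.5 style of this notebook):

# One Levenberg-Marquardt step for a residual function r(p) with Jacobian J(p).
# Illustrative only -- curve_fit below handles all of this internally,
# including the adaptation of the damping parameter lambda between steps.
function lm_step(r, J, p, lambda)
    Jp = J(p)                              # Jacobian of the residuals at p
    A  = Jp'*Jp + lambda*eye(size(Jp,2))   # damped normal-equation matrix
    p + (A \ (-Jp'*r(p)))                  # solve (J'J + λI)δ = -J'r and update
end

For $\lambda \to 0$ this reduces to a plain Gauss-Newton step, while large $\lambda$ pushes it towards (scaled) gradient descent.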

In [3]:
using LsqFit

Increase $n$ quite a bit to produce a result closer to the exponential form.
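
For reference, the exponential form we expect: conditioned on the constraints, maximum entropy (resp. the conditional limit theorem) predicts that a single entry of the block is distributed as

$$p^*(x) = \frac{1}{Z(\lambda)}\, e^{\lambda_1 x + \lambda_2 x^2}, \qquad Z(\lambda) = \int_0^1 e^{\lambda_1 x + \lambda_2 x^2}\, \mathrm{d}x,$$

with $\lambda_1,\lambda_2$ fixed by the constraints on the mean and the variance. This is the form we fit below.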

In [4]:
α=0.55
β=0.1

N = 10^8 #Number of sample blocks
n = 200
tmp = Vector{Float64}(n) #temporary buffer, reused for every draw
Y = Vector{Float64}() #This array holds all the samples that fulfill the constraints

@time for k in 1:N
    rand!(tmp) #draw in-place, avoiding an allocation per iteration
    if mean(tmp)>=α && var(tmp)>=β
        append!(Y, tmp)
    end
end
 41.032830 seconds (403.55 M allocations: 7.557 GB, 2.83% gc time)
In [5]:
# Acceptance rate: fraction of all drawn numbers that fulfilled the constraints
length(Y)/(n*N)
Out[5]:
2.49e-6

Fit a rather dense histogram ($nBins = 100$).

In [6]:
nBins = 100
hist = fit(Histogram, Y, nbins=nBins,closed=:left)

# midpoints is a convenience function returning the centers of the histogram bins.
midpts = midpoints(hist.edges[1])
Out[6]:
0.005:0.01:0.995

Define the function we want to fit to the data. p is a vector of parameters to be determined by the fit.

In [7]:
# Model to fit: exp(λ0 + λ1*x + λ2*x^2), applied elementwise to the bin midpoints.
exp_func(x,p) = exp.(p[1] .+ x.*p[2] .+ (x.^2).*p[3])
Out[7]:
exp_func (generic function with 1 method)
In [8]:
# Normalize the histogram counts to a density (bin width 1/nBins) before fitting.
expFit = curve_fit(exp_func, midpts, hist.weights/length(Y)*nBins, [+1.0,-1.0,1.0])
Out[8]:
LsqFit.LsqFitResult{Float64,1}(97,[0.238002,-2.77549,3.31022],[-0.100075,0.0552333,-0.0568559,-0.00470088,0.0469598,-0.00435848,0.00994854,-0.00857869,-0.00377717,0.0985922  …  -0.0076427,-0.00067788,-0.0853072,0.0318085,0.0516504,-0.0938638,-0.0290283,-0.0106524,0.10206,0.0523323],[1.25133 0.00625665 3.12833e-5; 1.21788 0.0182683 0.000274024; … ; 2.04584 2.01515 1.98492; 2.12462 2.114 2.10343],true,Float64[])

Here are the parameters $(\lambda_0,\lambda_1,\lambda_2)$ of the fit:

In [9]:
expFit.param
Out[9]:
3-element Array{Float64,1}:
  0.238002
 -2.77549 
  3.31022 

Histogram and its fit.
Observe that even though we attempted $N=10^8$ sample blocks, the histogram is still quite noisy: only a tiny fraction of the blocks fulfills the constraints. There is a tradeoff between making $n$ large to get close to the theoretical prediction and keeping the simulation feasible.
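
Heuristically, this is a large-deviations effect: by Sanov's theorem the probability that a block of length $n$ fulfills the constraints decays roughly like

$$P(\text{accept}) \sim e^{-n\, D\left(p^*\,\Vert\, \mathrm{Unif}[0,1]\right)},$$

so the accepted samples become exponentially rare as $n$ grows.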

In [21]:
plot(midpts, hist.weights/length(Y)*nBins, markercolor=:green, alpha=0.5,lw=0.0,marker=:auto,lab="Histogram")
plot!(midpts, x->exp_func(x,expFit.param),lab="Fit to histogram", lw=2.)
Out[21]:

Let's extract the parameters from the model directly, i.e. determine them such that the constraints are fulfilled numerically.

This is a system of non-linear equations, and various algorithms to solve such problems exist. If you'd like to know more about solving & optimization in Julia, have a look at the corresponding packages in the Julia ecosystem.

Here we use the package NLsolve, which implements a trust-region algorithm.
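
Concretely, writing $\langle x^k\rangle_\lambda = \frac{1}{Z(\lambda)}\int_0^1 x^k\, e^{\lambda_1 x + \lambda_2 x^2}\, \mathrm{d}x$, we have to solve

$$\langle x\rangle_\lambda = \alpha = 0.55, \qquad \langle x^2\rangle_\lambda - \langle x\rangle_\lambda^2 = \beta = 0.1$$

for $(\lambda_1,\lambda_2)$. The helper functions below compute exactly these moments.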

In [11]:
using NLsolve

Define helper functions. quadgk is a numerical integration routine (adaptive Gauss-Kronrod quadrature).
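
A quick illustration of its return value (a hypothetical check, not part of the exercise): it gives the integral estimate together with an estimated error bound.

# Integrate x^2 over [0,1]; the exact value is 1/3.
val, err = quadgk(x -> x^2, 0., 1.)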

In [12]:
# normalization
Z(p) = quadgk(x->exp(p[1]*x+p[2]*x^2),0.,1.,order=10)[1]
Out[12]:
Z (generic function with 1 method)
In [13]:
# and its inverse
iZ(p) = 1/Z(p)
Out[13]:
iZ (generic function with 1 method)
In [14]:
# k-th moment of the distribution
mom(k::Int,p;order=10) = quadgk(x->x^k*exp(p[1]*x+p[2]*x^2),0.,1.,order=order)[1]*iZ(p)
Out[14]:
mom (generic function with 1 method)

Target function

In [26]:
# NLsolve expects an in-place function that writes the residuals into `store`.
function f!(p,store)
    store[1] = mom(1,p) - 0.55             #constraint on the mean
    store[2] = mom(2,p) - mom(1,p)^2 - 0.1 #constraint on the variance
end
Out[26]:
f! (generic function with 1 method)

Solve the system, arbitrarily starting at [0.0,0.0].

In [27]:
opt = nlsolve(f!, [0.,0.], autodiff=true, iterations=1000, show_trace=false, extended_trace=false)
Out[27]:
Results of Nonlinear Solver Algorithm
 * Algorithm: Trust-region with dogleg and autoscaling
 * Starting Point: [0.0,0.0]
 * Zero: [-2.61891,3.11317]
 * Inf-norm of residuals: 0.000000
 * Iterations: 4
 * Convergence: true
   * |x - x'| < 0.0e+00: false
   * |f(x)| < 1.0e-08: true
 * Function Calls (f): 5
 * Jacobian Calls (df/dx): 5

The fit parameters agree reasonably well with this exact solution. (The fit has an additional first parameter, $\lambda_0$, which plays the role of the log-normalization $-\log Z$ and has no counterpart in opt.zero.)

In [18]:
opt.zero, expFit.param
Out[18]:
([-2.61891,3.11317],[0.238002,-2.77549,3.31022])
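
As an extra sanity check (a small sketch reusing iZ from above): since the fitted model is $e^{\lambda_0+\lambda_1 x+\lambda_2 x^2}$, the fitted prefactor $e^{\lambda_0}$ should roughly reproduce the inverse normalization $1/Z$ evaluated at the fitted $(\lambda_1,\lambda_2)$.

# exp(λ0) from the fit vs. 1/Z at (λ1,λ2); agreement up to histogram noise.
exp(expFit.param[1]), iZ(expFit.param[2:3])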
In [22]:
let p = opt.zero
    # Overlay the exactly-normalized model density on the previous plot.
    plot!(x->exp(p[1]*x+p[2]*x^2)*iZ(p),lab="Opt", line=(2.0, :dash, :blue))
end
Out[22]: