Solution 4

Exercise 4.1

The Pareto distribution with shape \(\alpha>0\) and scale \(\tau>0\) has density \[ f(x; \alpha, \tau) = \alpha x^{-\alpha-1}\tau^\alpha \mathsf{I}(x > \tau). \] It can be used to model power laws in insurance and finance, or in demography. The uscitypopn data set in the hecbayes package contains the population size of cities above 200K inhabitants in the United States, from the 2020 census.

  1. Using improper priors, write the joint posterior for a simple random sample of size \(n\) and derive the conditional distributions \(p(\alpha \mid \boldsymbol{y}, \tau)\) and \(p(\tau \mid \alpha, \boldsymbol{y})\).
  2. The mononomial distribution \(\mathsf{Mono}(a,b)\) has density \(p(x) \propto x^{a-1}\mathsf{I}(0 \leq x \leq b)\). Find the normalizing constant for the distribution and obtain the quantile function to derive a sampler.
  3. Implement Gibbs sampling for this problem for the uscitypopn data. Draw enough observations to obtain an effective sample size of at least 1000. What is the accuracy of your estimates?

Solution. With improper priors \(p(\alpha, \tau) \propto 1\), the joint posterior is proportional to the product of the likelihood contributions, so \[ p(\alpha, \tau \mid \boldsymbol{y}) \propto \alpha^n \left(\prod_{i=1}^n y_i\right)^{-\alpha-1} \tau^{n\alpha} \mathsf{I}(\min_i y_i > \tau). \] Viewed as a function of \(\alpha\) alone, the conditional density of \(\alpha\) given the rest is \[\begin{align*} p(\alpha \mid \boldsymbol{y}, \tau) \propto \alpha^n \left( \frac{\prod_{i=1}^n y_i}{\tau^n}\right)^{-\alpha} = \alpha^{(n+1)-1} \exp\left\{-\alpha \left(\sum_{i=1}^n\log y_i - n\log \tau\right) \right\}, \end{align*}\] which is \(\mathsf{Gamma}\big(n+1, \sum_{i=1}^n \log y_i - n \log \tau \big)\). For the second conditional, we have \[\begin{align*} p(\tau \mid \alpha, \boldsymbol{y}) \propto \tau^{n\alpha} \mathsf{I}(\min_{i} y_i > \tau), \end{align*}\] a mononomial distribution with parameters \(a=n\alpha+1\) and \(b = \min_{i} y_i\).

To find the normalizing constant of the mononomial distribution, we integrate the unnormalized density: if \(c = \int g(x) \mathrm{d} x\) with \(c < \infty\) and \(g(x) \geq 0\) for all \(x\), then \(g(x)/c\) integrates to one and is a valid density. For \(a > 0\), we find \[c= \int_0^b x^{a-1}\mathrm{d} x = \left[\frac{x^{a}}{a}\right]_{0}^b= \frac{b^{a}}{a}.\] The distribution function is therefore \(G(x) = (x/b)^{a}\) for \(x \in [0,b]\), and inverting it gives the quantile function \(G^{-1}(u) = u^{1/a}b\).

qmono <- function(u, a, b, log = FALSE){
  stopifnot(a > 0, b > 0, u >= 0, u <= 1)
  # Quantile function on the log scale: log G^{-1}(u) = log(u)/a + log(b)
  logq <- log(u)/a + log(b)
  if(log){ return(logq) } else { return(exp(logq)) }
}
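
As a quick sanity check (not part of the original solution), we can verify that inverse-transform draws from qmono follow the distribution function \(G(x) = (x/b)^a\); the test values \(a=3\) and \(b=2\) are arbitrary.

# Compare empirical and theoretical distribution functions
set.seed(80601)
test <- qmono(runif(1e5), a = 3, b = 2)
xs <- c(0.5, 1, 1.5, 1.9)
rbind(empirical = ecdf(test)(xs), theoretical = (xs/2)^3)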


# Load data
data("uscitypopn", package = "hecbayes")
y <- uscitypopn$population
n <- length(y)
# Summary statistics appearing in the posterior distribution
sumlogy <- sum(log(y))
miny <- min(y)
# MCMC via Gibbs sampling
B <- 1e4L
chains <- matrix(0, nrow = B, ncol = 2)
colnames(chains) <- c("alpha", "tau")
# Initial values for (alpha, tau); tau must not exceed min(y)
curr <- c(2, 2e5)
for(b in seq_len(B)){
  # alpha | tau, y is Gamma(n + 1, sum(log y) - n log tau)
  chains[b,1] <- curr[1] <- rgamma(n = 1, shape = n+1, rate = sumlogy - n*log(curr[2]))
  # tau | alpha, y is Mono(n*alpha + 1, min(y)), sampled by inverse transform
  chains[b,2] <- curr[2] <- qmono(runif(1), a = n*curr[1]+1, b = miny)
}
chains <- coda::as.mcmc(chains)
# Compute effective sample size
coda::effectiveSize(chains)
    alpha       tau 
 9590.174 10000.000 
summary(chains)

Iterations = 1:10000
Thinning interval = 1 
Number of chains = 1 
Sample size per chain = 10000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

           Mean        SD  Naive SE Time-series SE
alpha 1.386e+00    0.1297  0.001297       0.001324
tau   1.991e+05 1257.8213 12.578213      12.578213

2. Quantiles for each variable:

           2.5%       25%       50%      75%     97.5%
alpha 1.142e+00 1.298e+00 1.382e+00 1.47e+00 1.651e+00
tau   1.957e+05 1.986e+05 1.995e+05 2.00e+05 2.004e+05

The autocorrelation is minimal, so the sampler is quite efficient: the effective sample sizes are close to the total number of draws. As for the accuracy of the estimates, the time-series standard errors in the summary are the Monte Carlo standard errors of the posterior means: about 0.0013 for \(\alpha\) and 12.6 for \(\tau\), both negligible relative to the posterior standard deviations.
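
As a supplementary check (not in the original output), the autocorrelations can be inspected directly with coda:

# Autocorrelation of each chain at selected lags;
# values near zero indicate nearly independent draws
coda::autocorr.diag(chains, lags = c(1, 5, 10))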

Exercise 4.2

Implement the Bayesian LASSO for the diabetes data from package lars. See Park & Casella (2008) for the details of the Gibbs sampler.

  1. Fit the model for a range of values of \(\lambda\) and produce parameter estimate paths to replicate Figure 2 of the paper.
  2. Check the effective sample size and comment on the mixing. Is it impacted by the tuning parameter?
  3. Implement the method of section 3.1 from Park & Casella (2008) by adding \(\lambda\) as a parameter.
  4. For three models with different values of \(\lambda\), compute the widely applicable information criterion (WAIC) and use it to assess predictive performance.

Solution. We first set up a Gibbs sampler that runs either with a fixed value of \(\lambda\), or with the empirical Bayes estimator of section 3.1 of Park & Casella (2008). The effective sample size for fixed \(\lambda\) is good. If we let the parameter vary, the performance degrades: we obtain an effective sample size shy of 1000 out of 10K iterations for \(\lambda\), but comfortably above 5000 for the other parameters.

data(diabetes, package = "lars")
bayeslasso <- function(lambda = NULL, 
                       B = 1e4L,
                       x = diabetes$x, 
                       y = diabetes$y){
  stopifnot(is.matrix(x), is.vector(y))
  # Center the covariates and the response, removing the intercept
  x <- scale(x, center = TRUE, scale = FALSE)
  y <- y - mean(y)
  # Check method
  if(is.null(lambda)){
    method <- "empbayes"
  } else{
    method <- "fixed" 
  }
  burnin <- 250L
  # Precompute quantities and dimensions
  xtx <- crossprod(x)
  p <- ncol(x)
  n <- nrow(x)
  # Obtain initial estimates
  linmod <- lm(y ~ x - 1)
  betaols <- coef(linmod)
  beta.curr <- betaols
  sigmasq.curr <- mean(residuals(linmod)^2)
  tausqinv.curr <- rep(1, p)
  # Value reported in Park & Casella (2008) for the optimal parameter: lambda = 0.237
  # Column indices of each parameter block in the storage matrix
  beta.ind <- 1:p
  sigmasq.ind <- p + 1L
  tausq.ind <- seq(from = p + 2L, length.out = p, by = 1L)
  chains <- matrix(0, nrow = B, ncol = p + 1 + p + 
                     ifelse(method == "fixed", 0,1))
  if(method == "fixed"){
    colnames(chains) <- c(paste0("beta", 1:p), "sigmasq",
                          paste0("tausq", 1:p))
    lambdasq.curr <- lambda[1]^2
  } else{
    colnames(chains) <- c(paste0("beta", 1:p), "sigmasq", 
                          paste0("tausq", 1:p), "lambda")
    # Initial value for lambda^2, based on the starting guess of section 3.1
    lambdasq.curr <- p*sqrt(sigmasq.curr)/sum(abs(betaols))
    lambdasq.ind <- ncol(chains)
  }
  # MCMC loop; during burn-in, draws overwrite the first row of storage
  for(b in seq_len(B + burnin)){
    ind <- max(1, b - burnin)
    # beta | sigmasq, tausq, y is Gaussian
    Ainv <- solve(xtx + diag(tausqinv.curr))
    beta.curr <- chains[ind, beta.ind] <- as.numeric(
      mvtnorm::rmvnorm(
        n = 1,
        mean = as.numeric(Ainv %*% t(x) %*% y),
        sigma = sigmasq.curr * Ainv))
    # sigmasq | beta, tausq, y is inverse gamma
    sigmasq.curr <- chains[ind, sigmasq.ind] <- 1/rgamma(
      n = 1,
      shape = (n - 1 + p)/2,
      rate = sum((y - x %*% beta.curr)^2)/2 +
        sum(beta.curr^2 * tausqinv.curr)/2)
    # 1/tausq_j | beta, sigmasq, y is inverse Gaussian; for the update
    # of section 3.1, accumulate Monte Carlo estimates of E(tausq_j)
    sumexpect <- 0
    for(j in 1:p){
      tausqinv.curr[j] <- actuar::rinvgauss(
        n = 1,
        mean = sqrt(lambdasq.curr * sigmasq.curr)/abs(beta.curr[j]),
        dispersion = 1/lambdasq.curr)
      if(method != "fixed"){
        sumexpect <- sumexpect + mean(1/actuar::rinvgauss(
          n = 1000,
          mean = sqrt(lambdasq.curr * sigmasq.curr)/abs(beta.curr[j]),
          dispersion = 1/lambdasq.curr))
      }
    }
    chains[ind, tausq.ind] <- 1/tausqinv.curr
    if(method != "fixed"){
      # Update of section 3.1: lambda^2 = 2p / sum_j E(tausq_j)
      lambdasq.curr <- chains[ind, lambdasq.ind] <- 2*p/sumexpect
    }
  }
  if(method != "fixed"){
    # Store lambda rather than lambda^2
    chains[, lambdasq.ind] <- sqrt(chains[, lambdasq.ind])
  }
  # Cast Markov chains to an mcmc class object
  chains.mcmc <- coda::as.mcmc(chains)
  # Effective sample size
  ess <- as.integer(round(coda::effectiveSize(chains.mcmc), 0))

  # Compute WAIC from the log pointwise predictive density
  lppd <- 0
  penalty <- 0
  for(i in seq_len(n)){
    logdens_i <- dnorm(
      x = y[i],
      mean = as.numeric(chains[, beta.ind] %*% c(x[i,])),
      sd = sqrt(chains[, sigmasq.ind]),
      log = TRUE)
    # log of the average posterior density, via log-sum-exp for stability
    maxld <- max(logdens_i)
    lppd <- lppd + maxld + log(mean(exp(logdens_i - maxld)))
    penalty <- penalty + var(logdens_i)
  }
  # Per-observation WAIC; multiply by 2n to report on the deviance scale
  waic <- (-lppd + penalty)/n
  l1norm <- mean(rowSums(abs(chains[, beta.ind])))

  # Parameter estimates and 95% equitailed credible intervals
  quant <- t(apply(chains, 2, quantile, probs = c(0.025, 0.5, 0.975)))
  regpar <- as.data.frame(cbind(quant,
                                colMeans(chains),
                                coda::batchSE(chains.mcmc),
                                ess))
  regpar$pars <- rownames(quant)
  rownames(regpar) <- NULL
  colnames(regpar) <- c("lower", "median", "upper",
                        "mean", "se", "ess", "par")
  regpar <- regpar[, c(7, 4:5, 1:3, 6)]
  attr(regpar, "waic") <- waic
  attr(regpar, "l1norm") <- l1norm
  return(regpar)
}

# Call the MCMC sampler
set.seed(2023)
lasso_empbayes <- bayeslasso(lambda = NULL)
# Extract the value of WAIC
waic <- attr(lasso_empbayes, "waic")
Figure 1: Standardized median posterior estimates of the coefficients for the Bayesian LASSO with 95 percent equitailed credible intervals, with \(\lambda\) estimated using empirical Bayes. Ordinary least square estimates are denoted by crosses.

The plot corresponds to Figure 2 of Park & Casella (2008) and the posterior summaries, reported in Table 1, are also in line with those of the paper.
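
The figure itself is not reproduced here; a minimal base R sketch of such a plot, assuming the regpar layout returned by bayeslasso() above, is the following.

# Posterior medians and 95% credible intervals for the coefficients,
# with ordinary least squares estimates added as crosses
betapar <- lasso_empbayes[1:10, ]  # rows beta1, ..., beta10
xc <- scale(diabetes$x, center = TRUE, scale = FALSE)
yc <- diabetes$y - mean(diabetes$y)
ols <- coef(lm(yc ~ xc - 1))
plot(seq_len(10), betapar$median, pch = 20, xaxt = "n",
     ylim = range(betapar$lower, betapar$upper),
     xlab = "", ylab = "coefficient")
axis(1, at = seq_len(10), labels = betapar$par, las = 2)
segments(x0 = seq_len(10), y0 = betapar$lower, y1 = betapar$upper)
points(seq_len(10), ols, pch = 4)
abline(h = 0, lty = 2)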

Table 1: Estimated posterior summaries from the Bayesian LASSO, based on 10K draws: posterior means with batch standard errors, posterior medians with equitailed 95 percent credible intervals, and effective sample sizes.
par mean se lower median upper ess
beta1 -3.14 0.58 -112.92 -2.34 105.68 10000
beta2 -214.04 0.58 -331.89 -213.49 -97.58 9012
beta3 523.41 0.76 393.47 523.87 656.33 9062
beta4 307.64 0.72 177.38 307.69 439.33 8957
beta5 -187.12 2.60 -587.60 -170.71 125.58 5089
beta6 7.79 1.92 -272.14 -1.62 344.59 6056
beta7 -154.43 1.62 -385.12 -153.23 72.40 5259
beta8 96.44 1.60 -131.62 87.10 353.40 6460
beta9 524.09 1.26 331.93 520.90 727.05 6630
beta10 64.21 0.69 -49.05 61.87 188.78 8897
sigmasq 2952.78 2.00 2585.93 2945.02 3370.20 9511
tausq1 20.55 0.26 0.24 11.13 93.70 10000
tausq2 34.75 0.33 3.42 25.19 122.06 8868
tausq3 58.44 0.36 13.42 49.09 155.41 9375
tausq4 41.99 0.39 6.09 32.35 133.19 8972
tausq5 34.17 0.45 1.03 24.45 122.56 5159
tausq6 26.82 0.38 0.57 17.18 105.39 7441
tausq7 31.06 0.35 1.04 21.46 118.21 8954
tausq8 27.47 0.33 0.56 17.94 107.05 8551
tausq9 59.02 0.47 12.95 49.73 160.42 7805
tausq10 23.35 0.30 0.40 13.66 102.21 9703
lambda 0.24 0.00 0.21 0.24 0.26 1038

For the last part, we run the sampler for three values of \(\lambda\) and retain the one that yields the lowest WAIC; the per-observation WAIC returned by bayeslasso is multiplied by \(2n\) to report it on the deviance scale.

set.seed(2023)
blasso1 <- bayeslasso(lambda = 0.1)
blasso2 <- bayeslasso(lambda = 0.2)
blasso3 <- bayeslasso(lambda = 1)
2*length(diabetes$y)*c(attr(blasso1, "waic"), attr(blasso2, "waic"), attr(blasso3, "waic"))
[1] 4802.449 4801.923 4804.496
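
Among the three candidates, \(\lambda = 0.2\) achieves the smallest WAIC, in line with the empirical Bayes estimate of about 0.24. For the estimate paths of part 1, a minimal sketch (reusing bayeslasso() and its l1norm attribute; the grid of \(\lambda\) values and the number of draws per fit are arbitrary choices) is the following.

# Coefficient paths: posterior medians of beta over a grid of lambda
# values, plotted against the relative L1 norm of the coefficients
lambda_grid <- exp(seq(log(0.01), log(10), length.out = 20))
fits <- lapply(lambda_grid, function(l) bayeslasso(lambda = l, B = 2500L))
paths <- t(sapply(fits, function(f) f$median[1:10]))
l1 <- sapply(fits, function(f) attr(f, "l1norm"))
matplot(l1/max(l1), paths, type = "l", lty = 1,
        xlab = "relative L1 norm", ylab = "posterior median")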

References

Park, T., & Casella, G. (2008). The Bayesian Lasso. Journal of the American Statistical Association, 103(482), 681–686. https://doi.org/10.1198/016214508000000337