What do the powerball numbers look like?

The odds to win a lottery jackpot

Every time, any of the US lotto jackpots go over 500 millions, the whole world is overexciting. At least, this is the case in my big family with all my bros from the other side of the earth throwing red cashes (RMB) to me to change them for green bucks and then, eventually, lotto tickets full of the magic numbers (potentially). However, what is the odd to win a lottery jackpot? Using Powerball as an example, each time, five white number balls are drawn out of 64 number balls ( checking the range of data is 1-64) using non-replacement sampling method and one red number ball is drawn separately from 42 red number balls (checking the range of the data is 1-42). Thus, the odds of matching all six numbers are about 1.8e-11.

# odds of winning powerball jackpot 
x <- 69 
y <- x
# first five numbers
for (i in 1: 4){ 
  y <- y*(x-i)
}
z <- 1/42 # red ball
odds <- (1/y)*z 

Are all the numbers picked randomly?

This is clearly a very very very small odds that, for me, it seems not a reasonable decision to burn your cash on lottery tickets. Anyway, I am not really want to talk about the craziness of buying lotto, people may have different reasons to do that. But when I first time decided to try powerball, I went the official website and found a file including all the historical winning numbers (I assume, it started from 1997), which is updated every week after the latest drawing. This dataset immediately attracted me. I want to see what it looks like. Any specific patterns? The answer is highly like “NO”! But how about the randomness? Are the drawing completely random? Or there are some biases that certain numbers do have been drawn in higher frequencies?

First of all, fetching the data and formatting into a standard table

library(dplyr)
library(reshape2)
library(ggplot2)
library(RCurl)
library(splitstackshape)
library(animation)
source("/Users/Xin/Desktop/datagame/pb_analysis_func.R")

URL <- "http://www.powerball.com/powerball/winnums-text.txt"
outputdir1 <- "/Users/Xin/Desktop/datagame/sum_pb.png"
outputdir2 <- "/Users/Xin/Desktop/datagame/random_pb.png"
outputdir3 <- "/Users/Xin/Desktop/datagame/"
x <- getURL(URL)
dat <- read.delim(textConnection(x), stringsAsFactors = F, header = F, sep = "")
pn <- dat[-1,2:8]
names(pn) <- c("wb1", "wb2", "wb3", "wb4", "wb5", "rb", "pp")
v1 <- dat[-1,1] 
v1 <- cSplit(as.data.table(v1), splitCols="v1", sep = "/")
v1 <- as.data.frame(v1)
names(v1) <- c("mm", "dd", "yy")
pn <- cbind(v1, pn)
ball.sample <- pn[, 4:9]

alt text][Imgur

Next, to plot the frequency of the all the white ball numbers and red ball numbers

#define a function to calculate the count and percentage of the presence of each number
#function to calculate the count and percent of numbers
number_count <- function (sample){  
  wb <- stack(as.data.table(sample[, 1:5]))
  wb.count <- tally(group_by(wb, values), sort = F)
  names(wb.count) <- c("num", "count")
  percent <- (wb.count[, 2]/sum(wb.count[, 2]))*100
  names(percent) <- "weight"
  wb.count <- cbind(wb.count, percent, group = "wb")
  wb.count <- as.data.frame(wb.count)
  rb <- as.data.frame(sample[, 6])
  names(rb) <- "values"
  rb.count <- tally(group_by(rb, values), sort = F)
  names(rb.count) <- c("num", "count")
  percent2 <- (rb.count[, 2]/sum(rb.count[, 2]))*100
  names(percent2) <- "weight"
  rb.count <- cbind(rb.count, percent2, group = "rb")
  rb.count <- as.data.frame(rb.count)
  pb.count <- rbind(wb.count, rb.count)
  return (pb.count)
} 
# call the function number_count()
pb.count <- number_count(ball.sample)

g <- ggplot(pb.count, aes(x = num, y = weight, color = group)) + 
  geom_segment(aes(xend = num, colour = group), yend=0, alpha = 0.5) +
  geom_point(size = 6, alpha = 0.8) +
  scale_color_manual(values = c("white", "red"), 
                     labels=c("white ball", "read ball")) + 
  theme(panel.background = element_rect(fill = "black"), 
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        legend.key = element_rect(fill = "black")) + xlab("number") + 
  ylab("weight (%)") 
g

ggsave(outputdir, plot = g, dpi = 600, width = 16, height = 9)

alt text][Imgur

Surprisingly, it seems that frequencies of being drawn for all the balls are not that even. Particularly, the large number 60-69 have been picked very often. If we run statistic test to check the significance of the bias, we could first simulate the powerball sampling in complete random way.

# simulate the pb drawing results! n is the number of simulations, m is the number of drawing
pb_simulator <- function (n, m, wb.pool, rb.pool, num.wb, num.rb){
  sample.matrix <- matrix(NA, nrow = (wb.pool + rb.pool), ncol = (2+2*n))
  wb.samples <- matrix(NA,nrow = m, ncol = num.wb)
  rb.samples <- matrix(NA,nrow = m, ncol = num.rb)

  for (j in seq(2, ((2+2*n)-1), by = 2)){
    for (i in 1 : m){
      wb.samples[i, ] <- sample.int(wb.pool, num.wb, replace = F)
      rb.samples[i, ] <- sample.int(rb.pool, num.rb, replace = F)
    }
    sim.sample <- cbind(wb.samples, rb.samples)
    sim.count <- number_count(sim.sample)
    sample.matrix[, j:(j+1)] <-unlist(sim.count[, 2:3])
  }
  sample.matrix[, 1] <- unlist(sim.count[, 1])
  sample.matrix[, (2+2*n)] <- unlist(sim.count[, 4])
  sample.matrix <- as.data.frame(sample.matrix)
  colname.list <- vector("character", length = n)
  for (k in 1 : n) {
    colname.list[(2*k-1):(2*k)] <- c(paste("count", k, sep = ""), 
                                     paste("percent", k, sep = ""))
  }
  names(sample.matrix) <- c("num", colname.list, "group")
  return(sample.matrix)
}

# to plot the sim and obs data
plot.sim <- function(dat.sim, dat.obs, n, title = ""){
  sample.matrix.wb <- filter(dat.sim, dat.sim[, (2+2*n)] == 1)
  percent.col.num <- seq(3, ((2+2*n)-1), by = 2)
  wb.count <- filter(dat.obs, group == "wb")
  y.max.wb <- max(sample.matrix.wb[, percent.col.num], wb.count[, 3]) + 0.5
  matplot(sample.matrix.wb[, 1], sample.matrix.wb[, percent.col.num], pch = 19, 
          col = "grey", xlim = c(1, 69), ylim = c(0, y.max.wb), xlab = "numbers", 
          ylab = "weight(%)", main = title)
  points(wb.count[, 3], pch = 1, col = 1, cex = 1.5)
  p.array <- vector("numeric", length = 69)
  plot.p(sample.matrix.wb, wb.count, percent.col.num, y.max.wb, 69)

  sample.matrix.rb <- filter(dat.sim, dat.sim[, (2+2*n)] == 2)
  rb.count <- filter(dat.obs, group == "rb")
  y.max.rb <- max(sample.matrix.rb[, percent.col.num], rb.count[, 3]) + 0.5
  matplot(sample.matrix.rb[, 1], sample.matrix.rb[, percent.col.num], pch = 19, 
          col = "grey", xlim = c(1, 42), ylim = c(0, y.max.rb), xlab = "numbers", 
          ylab = "weight(%)")

  points(rb.count[, 3], pch = 19, col = "red", cex = 1.5)
  p.array <- vector("numeric", length = 42)
  plot.p(sample.matrix.rb, rb.count, percent.col.num, y.max.rb, 42)

}

# to run t.test and plot the significant p as a *
plot.p <- function(dat1, dat2, col.list, y, num) { 
  for (k in 1 : num) {
    weight.diff <- dat1[k, col.list] - dat2[k, 3]
    t <- t.test(weight.diff)
    p <- t[[3]]
    if(p < 0.05) {
      points(k, y, pch = "*", col = "black")
    }
  }
}
# call the simulator and plot the results
# lottary machine simulator 
n <- 20 # times of simulation
m <- length(pn[, 2])
sample.matrix <- pb_simulator(n, m, 69, 42, 5, 1)
png(outputdir2, width = 16, height = 12, units = "in", res = 500)
par(mfrow = c(2, 1))
plot.sim(sample.matrix, pb.count, 20)
dev.off()

alt text][Imgur

I tried to repeat the simulation 20 times and run the t – test to compare the simulated data and the observed data. For the white ball numbers, all the frequencies of being drawn are significantly different than the simulated frequencies (p < 0.05, labelled with *). And most of the red balls also have some significant bias, particularly for the large numbers. Then, I wonder whether this bias is because the changes of rules, maybe in recently years, powerball increasing the numbers in the sampling pool. Is this the reason?
An animation could quick show the historical results of drawing:

## R package animation is needed
## difference in years
saveGIF({
  ani.options(interval=.1)
  for (i in 1: length(pn[, 1])){
    mm <- pn[i, 1]
    dd <- pn[i, 2]
    yy <- pn[i, 3]
    date <- paste(yy, mm, dd, sep = "-")
    v <- as.numeric(pn[i ,4:9])

    plot(v, axes=FALSE, frame.plot=TRUE, ylim=c(1, 69), xlab = date, 
         ylab = "numbers", pch = 15)
    Axis(side=1, at = seq(1, 6, by=1), 
         labels= c("wb1", "wb2", "wb3", "wb4", "wb5", "rb"))
    Axis(side=2, at=seq(1, 69, by = 1), labels =T )
    print(i)
  }}, outpath = outputdir3, movie.name = "num_dynamics.gif")

alt text][Imgur

Or a more clear way to show the trend may just plot the frequencies of the number by years.

# slice the table by year
year.range <- (range(pn[, 3])[1] : range(pn[, 3])[2])
saveGIF({
  ani.options(interval=.5)
  for (a in year.range) {
    yy.sample <- filter(pn, yy == a)
    yy.count <- number_count(yy.sample[, 4:9])
    v14 <- cbind(pb.count[, 1], as.character(pb.count[, 4]))
    v14 <- as.data.frame(v14)
    names(v14) <- c("num", "group")
    yy.count.full <- right_join(yy.count, v14)
    yy.count.full[is.na(yy.count.full)] <-0 
    par(mfrow = c(2, 1))
    plot.sim(sample.matrix, yy.count.full, 20, title = a)
  }}, outpath = outputdir3, movie.name = "random_yearly.gif", ani.width = 900, 
  ani.height = 800)

alt text][Imgur

Ha, here is the truth! Through the almost 20 years, the white ball sampling pool is constantly increased every few years starting from likely 49 to nowadays 69; meanwhile, the red ball sampling pool is slight decreased (the pattern in 2016 is likely influenced by the sample size).
So maybe we should rewrite a generic formula to calculate the odds:

# odds of winning powerball jackpot 
pb_odds <- function(wb.dr, rb.dr, num.wb, num.rb){
# first five numbers from white balls
for (i in 1: (wb.dr-1)){ 
  tmp1 <- tmp1*(num.wb-i)
}
tmp2 <- 1/num.rb # red balls
odds <- (1/tmp1)*tmp2
}
return(odds)

Leave a comment