Voter fraud and bayesian inference – part 3

Here’s part1 and part2.

Welcome back. In the previous posts we saw how to do inference using the beta-binomial to get probabilities for the proportion of fake ballots in an election, as well as an upper bound on the probability that the election result is incorrect. We briefly mentioned the hypergeometric distribution but did discuss it further nor use it.

Like the binomial (and beta-binomial), the hypergeometric distrbution can be used to model the number of successes in a series of sampling events with a binary outcome. The distinction is that the binomial models sampling with replacement, whereas the hypergeometric models sampling without replacement. In other words, if we are sampling from a box, the binomial applies when the sample is returned to the box before drawing more samples. The hypergeometric applies when the sample is not returned. But wait, doesn’t that mean that we’ve been doing it wrong?

When auditing ballots we keep track of those already checked, a ballot is never audited twice. Shouldn’t we then be using the hypergeometric distribution? It turns out that the binomial distribution approaches the hypergeometric distribution in the limit of a large total number of items compared to the number sampled. This fits our case, as we can only audit a limited number of ballots compared to all those cast.

beta5
Hypergeometric for increasing values of N. The bottom right is the corresponding beta-binomial.

As we saw in the previous post, the beta distribution is a conjugate prior for the binomial, which makes inference very easy. Unfortunately this is no the case for the hypergeometric. But because of the converging behaviour seen above, we can stick to the beta-binomial’s easy calculations without sacrificing accuracy. For the sake of completeness we will quickly show the posterior for the hypergeometric, following [1]. Incidentally this “manual calculation” is what allowed us to obtain the images above, through the javascript implementation in the jsfiddle.

hyp

Again, this is just Bayes theorem with the hypergeometric likelihood function and a uniform prior. In [1] it is also pointed out that the normalization factor can be computed directly with this expression

hyp2

We use this in the jsfiddle implementation to normalize. Another thing to note is that the hypergeometric posterior is 0 at positions that are inconsistent with evidence. One cannot have less successes than have been observed, nor more than are possible given the evidence. These conditions are checked explicitly in the implementation. Finally, the jsfiddle does not contain an implementation for obtaining the upper bound on the probablity of election error, only code for the posterior is present. How about forking it and adding that yourself?

In these three posts we have used bayesian inference to calculate probabilities over proportion of fake ballots, and from there to calculate probabilities that an election result was incorrect. These probabilities could be used to achieve trust from stakeholders in that everything went well, or conversely to detect a possible fraud and invalidate an election.

I’ll finish the post by mentioning that the techniques we have seen here can be generalized beyond the special case of detecting fake ballots for plurality votes. For example, one could use bayesian inference to conduct ballot audits for the sake of checking tally correctness, not just from failures in authentication, but from errors in counting. See [2] for this kind of more general treatment

 


[1] http://www.wmbriggs.com/public/HGDAmstat4.pdf

[2] https://www.usenix.org/system/files/conference/evtwote12/rivest_bayes_rev_073112.pdf

In this work, because results of audits are not just binary, and because tallies are not only plurality, the authors use dirichlet distrbutions and sampling using posteriors to project possible alternative tallies.

Hackaton Real

71.000 personas han votado ya en la iniciativa #referendumrealya. El referendo que termina mañana 19 de junio sirve como protesta y reclamación del derecho a decidir de la ciudadanía sobre la jefatura de estado. Este proceso que culmina mañana con una participación récord comenzó a fraguarse minutos después de la abdicación del Rey.

Hacktivismo Coral

wKu2pJw

La cada vez más engrasada maquinaria del activismo social se coordinó a través de chats, listas de correo y redes sociales para organizar un referéndum ciudadano (http://referendumrealya.com/#adhesiones). Agoravoting también ha estado presente y el pasado sábado se organizó un Hackaton bastante espontáneo en MediaLab para poner la herramienta a punto para esta iniciativa.

Minshu

Mossos-retiran-multireferendum-Junta-Electoral_TINIMA20140525_0662_3

La iniciativa Multirreferendum que se celebró Cataluña ha servido de referencia para este proyecto, tanto en metodología como en avances técnicos. Al igual que en aquella ocasión, se pretende coordinar el voto presencial y el voto electrónico, esto nos obliga a cotejar los votantes y para ello se creo la aplicación Minshu (democracia en japonés) que permite a los apoderados en la mesa comprobar si ya se ha votado anteriormente. Y que permite combinar tres formas de votación.

3 en 1

#referendumrealya permitirá votar de tres maneras:

  1. a través de la página web, donde el usuario podrá registrarse con su nombre, apellidos y número de identificación (DNI, pasaporte…)
  2. mesas electorales con voto digital, en este caso las mesas sirven para asistir a los votantes a utilizar el sistema online
  3. urnas, también se podrá votar de manera tradicional, este el caso donde se utilizará la aplicación Minshu, ya que los votantes mostrarán su identificación, se comprobará que no hayan votado anteriormente y votarán físicamente

Servidores de Occupy Wall Street

occupy-wall-street-logo2Global Revolution forma parte del movimiento Occupy Wall Street y ha cedido el uso de los servidores, ha puesto 24 GB de RAM a disposición de la iniciativa.

Participación

Aunque la participación ciudadana será el principal logro de esta iniciativa, es notable cómo las relaciones entre colectivos son cada vez más estrechas y tienen capacidad para moverse rápido y sin siglas, con una campaña concreta en cada caso. Como dicen los ingleses, ¿cómo se come un elefante? trozo a trozo.

 

La empresa Agora Voting SL (Software Libre)

Pablo Iglesias se ha referido a la empresa Agora Voting en una rueda de prensa para anunciar la votación de la comisión provisional que gestionará Podemos hasta su asamblea constituyente.

“La empresa se llama Agora Voting, es la misma que coordinó/nos dio soporte técnico para realizar las primarias abiertas que fueron las más participadas en la historia de unas elecciones europeas.”

Hemos escrito empresa en cursiva y negrita, porque aunque Pablo Iglesias la haya definido correctamente en la primera acepción del término, (Acción o tarea que entraña dificultad y cuya ejecución requiere decisión y esfuerzo) no se corresponde con el segundo, más habitual y temido significado del término.

¿Quién es Agoravoting?

En el aspecto jurídico, en el que muchos lo habrán entendido, Agora Voting no es una empresa, es una comunidad de software libre sin ninguna identidad jurídica, simplemente un grupo de personas que se identifican con un proyecto y que colaboran en su desarrollo: bienvenidos a la cultura del software libre.

Agoravoting es un repositorio (una biblioteca pública de software), el fruto de la colaboración voluntaria de muchos. se organiza como otras muchas comunidades horizontales, con una lista de correo pública, Facebook, Twitter

Wadobo, la empresa de Podemos

La confusión generada se debe probablemente a que Podemos ha tenido que pagar por utilizar agoravoting, tal y como se puede consultar en su contabilidad transparente. Es una empresa cooperativa de software libre y que ha servido en esta ocasión de paraguas para la facturación de las primarias de podemos. Pero no es más que un CIF, detrás se encuentran varios programadores que son a su vez miembros de la comunidad.

Factura por servicios

factura_wadobo

La gran diferencia respecto a una factura de una empresa de sofware privativo, como podría ser Microsoft, Apple o Scytl (en el caso del software de votaciones), es que no se factura ninguna propiedad intelectual. Si tu contratas a un informático para que instale Windows en tu ordenador, el informático te cobrará el tiempo que ha dedicado a realizar la instalación y la licencia del software. Esa licencia es un valor inmaterial muy abierto a la especulación.

El software libre elimina ese coste, ya que todas las mejoras y modificaciones realizadas en un proyecto así son de dominio público. La propiedad intelectual no deja de existir, se transforma en un bien universal, y una herramienta que cualquiera puede utilizar.

Sostenibilidad

Agoravoting surgió en un partido político (partidodeinternet.es) con un único punto en su programa, la democracia directa y la esperanza de que las nuevas tecnologías la hagan posible. Desde entonces ha crecido gracias a la colaboración voluntaria de toda la comunidad, pero a medida que aumentan los desafíos, aumenta la necesidad de recursos.

Abundan las opciones, desde los proyectos que funcionan en base a donativos (Wikipedia) hasta empresas como Red Hat, muy similares a cualquier corporación al uso. Podemos ha abierto una nueva vía de desarrollo al invertir en personas y desarrollo en lugar de comprar un producto acabado.

y futuro…

El futuro no está escrito. No hay más destino que el que hacemos nosotros mismos. Agoravoting explora todas las opciones con tranquilidad y curiosidad, porque no hay nada más poderoso que una idea a la que le ha llegado su momento.

 

Voter fraud and bayesian inference – part 2

We left off the discussion with

We want to calculate the proportion of fake ballots in an election based on the results of limited audits. We have seen how the binomial and hypergeometric distributions give probabilities for the results of an audit given an assumption about the proportion of fake ballots. Bayes theorem can be used to calculate the inverse probability that we are after, once we have specified a prior.

Bayesian inference is a process that takes prior information, and adds evidence to obtain a posterior distribution. In our case this posterior distribution will be over the possible proportion of fake ballots in the set of all ballots. Let’s begin with the binomial case. What prior should we use? One answer is that, since we know nothing about the proportion of fake ballots we should be indifferent about each possibility. This translates into a uniform prior, where all proportions are equally likely. For example

P(proportion = fake ballots/ total ballots) = 1 / (total ballots + 1)

Since there are n + 1 possibilities for the number of fake ballots, we give each of them the same weight, which is 1 / (n + 1).

Beta + Binomial = Beta-Binomial

Before plugging this into bayes, a small technical detour.  Notice how the prior is itself a probability distribution, defined over the 0.0 – 1.0 interval. That is, the minimum proportion (0.0) is no fake ballots and maximum (1.0) is all fake ballots. It turns out there is a paramateric probability distribution one can use for this interval, it’s called the Beta distribution. The Beta distribution has two parameters, alpha and beta. The case of our neutral prior we defined above is equivalent to the Beta distribution with parameters (1, 1)

P(proportion ) = 1 / (n + 1) = Beta(1, 1)

We could express other knowledge with different choices of alpha and beta. But what’s the point of using the Beta, besides having a convenient way to specify priors? The point is that the Beta distribution is a conjugate prior of the binomial distribution. This means that the posterior distribution, once having taken into account available evidence, is also a Beta distribution. Meaning that the calculation of the posterior is much easier, as inference is just a matter of mapping the values of parameters of the Beta to some other values. Here is the posterior of the Beta distribution when it is used as the prior of the binomial (this is called the beta-binomial model).

beta

Equations taken from [1]. The first line is just Bayes theorem, but the payoff is that the last line corresponds to a beta distribution, but with different parameters. In summary

beta2

with a beta prior, bayesian inference reduces to remapping the initial parameters alpha and beta, to alpha + k and beta + n – k, where k is the number of successes and n is the number of trials. Conjugate priors are an algebraic convenience that allow obtaining analytic expressions for posteriors easily. End of detour, please refer to [1] for further details.

Armed with our use of the beta-binomial obtaining the posterior given some audit results is simple. If we audited 10 ballots and 3 of them were fake our posterior would simply be

P(proportion = p | fake audit count = 3 out of 10)

= Beta(1 + 3, 1 + 10 – 3)

= Beta(4, 8)

here’s what Beta(4, 8) looks like

beta3

note how the peak of the distribution is at 0.3, it makes sense since in the sample 3 out 10 ballots where invalid. Evidence has transformed our initial uniform prior into the distribution seen above. This meets our original objective, a way to judge how many ballots are fake in the entire set of ballots based on limited audits. But it’s not the end of the story. What we would like also is to have an estimate as to whether or not the election result is correct. As we said previously, this estimation can be used either as a guarantee that all went well or in the opposite case to detect a problem and even invalidate the results.

fraud

The probablity that an election result was correct, given uncertainty about fake ballots, depends on two things. One is the proportion of ballots that are fake, this is what we already have a calculation for. The other thing is the actual election outcome; specifically a measure of how close the result was. The reason is simple, if the election was close, a small number of invalid ballots could cast doubts on its correctness. Conversely, if the result was a landslide, the presence of fake votes has no bearing on its correctness. For our purposes we will stick with a simple example in which the election decides between two options via simple plurality.

Call the difference between the winning and losing option d

d = winner votes – loser votes

In order for the election to be wrong, there must be a minimum of d fake votes. The existence of d fake votes does not imply that the result was wrong, but d fake votes are a necessary condition. Thus a probability that the number of fake votes is greater than or equal to d represents an upper bound on probability that the election was incorrect. Call this E (for error)

P(proportion of fake votes >= d / total votes) = E

(upper limit on the probability that the election was wrong)

We have P(proportion), it is the posterior we got above. How do we get P(proportion >= some constant)? Through the beta distribution’s cumulative distribution function, which is defined in general as

In order to reverse the inequality, we just need to subtract it from 1 (gives us the tail distribution). We finally have

Probability of incorrect result

= P(proportion >= d / total ballots)

= 1 – Cumulative Distribution Function of P(d / total ballots)

 One final correction. Because we have sampled a number of votes with known results, we must apply our calculations to the remaining ballots.

P(E) = 1 – CDF(d – sampled ballots / total ballots – sampled ballots)

Let’s try an example, an election between option A and option B with the following numbers.

Votes for A = 550

Votes for B = 450

Total ballots = 1000

Audited ballots = 100

Audited fake ballots = 4

which gives

Posterior = Beta(5, 97)

d = 100

Minimum fraction of fake votes required to change result = (100 – 4) / (1000 – 10) = 0.1066

Upper bound on probability of error

= 1 – CDF(Beta(5, 97), 0.1066)

= 0.01352

In conclusion, the probability of error due to fake ballots in this election is less than or equal to 1.35%.

beta4

You can find a javascript implementation for everything we’ve seen until now in this jsfiddle. Calculations for the binomial, beta, hypergeometric and cumulative distribution function are done with the jStat javascript library. Wait, you say, what about the hypergeometric? We’ll leave that for the next post, which should be pretty short.

 


[1] http://www.cs.cmu.edu/~10701/lecture/technote2_betabinomial.pdf

Voter fraud and bayesian inference

Wikipedia says

Personation  is a term used in law for the specific kind of voter fraud where an individual votes in an election, whilst pretending to be a different elector.

when someone practices personation multiple times to cast multiple votes we are talking about ballot stuffing. In this post we will consider an election in which authentication is not 100% secure, where personation is difficult but not impossible.  Furthermore we will assume there is some available, but costly method by which  ballots can be audited to determine whether or not they were cast via personation or were in fact valid.

What makes the problem non trivial is that ballot auditing is costly and cannot in principle be performed for the entirety of the ballots cast. Hence we would like to estimate, from a limited number of audited ballots, how severe ballot stuffing was for an election. This estimation can be used either as a guarantee that all went well or in the opposite case to detect a problem and even invalidate the results.

What we need is a mathematical model that given some information about the results of an auditing processes allows us to estimate the proportion of “fake” ballots in the set of all those cast. In other words, we are talking about statistical inference; in this post will use a bayesian approach. Let’s get to work.

Imagine we have a box with all the ballots for an election, and the auditing process consists in randomly taking one out and determining whether it is valid or not, recording the result, and then repeating a limited number of times. After we have recorded the results of all the audits, we would like to know how many of ballots in the entire box are fake. Two things come to mind. First, that the result of each audit is binary, we either get FAKE or VALID. Second, that if the proportion of fake ballots in the box is p, then probability that a randomly chosen ballot is fake is p; the probability that it is valid is 1 – p.

The auditing process as a whole yields a count of fake ballots and a count of valid ballots. If we have probablity p for the result of a single audit , can we assign a probablity to the count resulting from the complete audit process? Yes, the binomial distribution and its brother the hypergeometric distribution do just that[1]. Here they are

In our example, k above corresponds to the count of fake ballots. So these distributions give us a way to calculate the probability that a specific number of fake ballots is obtained in the audit assuming a certain proportion of fake ballots in the entire election. For example, let’s say we have 100 ballots total and we know that 10 of them are fake. What is the probability that if we audited 10 ballots we would obtain a fake count of 3?

P(X = 3) = 0.057395628 (binomial)

P(X = 3) = 0.0517937053324283 (hypergeometric)

Only 5%, it is unlikely we’d find 3 fake ballots with only 10 audits, given that there are only 10 out of 100 in total.

We have a way to calculate the probability of some outcome given some assumption about the proportion of fake ballots. But remember, what we want is exactly the opposite: given a certain result for an audit, we’d like to estimate the proportion of fake ballots in the entire set. It is this inversion that makes the problem a case of bayesian inference, and our friend Bayes theorem gives us the relationship between what we have and what we want.

in our case, it translates down to

P(Proportion of fake ballots | Audit Result) = P(Audit Result | Proportion of fake ballots) * P(Proportion of fake ballots) / P(Audit Result)

What we were calculating before is P(Audit Result | Proportion of fake ballots), which we can plug into the formula, together with the other terms, to get what we want: P(Proportion of fake ballots | Audit Result). The other terms are

P(Audit Result) =

The unconditional probability that a certain audit result occurs. It can be calculated by summing over all possible proportions, like this version of Bayes theorem shows:

As seen in the bottom term. Because the bottom term is common to all values of Ai, it can be interpreted as a normalizing constant that ensures that probabilities sum to 1.

P(Proportion of fake ballots) =

The prior probability that some proportion of fake ballots occurs. This ingredient is crucial for Bayesian inference and the Bayesian approach in general. It is an estimate of  the quantity we want to calculate that is prior to any of the evidence we obtain. It can be used to encode prior knowledge about what we are calculating. If we have no knowledge, we can try to encode that in a “neutral prior”. This last point is a very deep problem in Bayesian inference, as is the general problem of choosing priors. We won’t go into in detail here.

Recap. We want to calculate the proportion of fake ballots in an election based on the results of limited audits. We have seen how the binomial and hypergeometric distributions give probabilities for the results of an audit given an assumption about the proportion of fake ballots. Bayes theorem can be used to calculate the inverse probability that we are after, once we have specified a prior. See it in action in the next post.


[1] There is an important difference, the binomial distribution models sampling with replacement, whereas the hypergeometric models sampling without replacement. In the next post we will consider this difference and its significance for our problem.