Monday, March 2, 2020

Countries grouped by their voting profile in the past years using statistical analysis

This is a post about Eurovision and statistics. Probably not a very original combination, yet it is a great way to practice statistics. As everybody who has watched Eurovision more than once knows, there are several countries that always give high votes to each other. I once heard the word "scandimafia" used to describe the way Nordic countries vote for each other, and I thought looking into voting groups would be a good way to practice multivariate statistics.

First we need voting data, as extensive as possible. I downloaded the 1975-2018 votes from this link, which no longer seems to work, but I am sure there are many other sources. Basically, the data needs to have these fields: "year", "from country", "to country", "votes".

To better detect similarities in voting patterns among countries, I added to each country a vote of 12 points to itself. This way, other countries that vote for this one will appear closer to it in the analysis.
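As a rough illustration of this preparation step, here is a minimal Python sketch that turns rows with the four fields above into one voting profile per country and adds the 12-point self-vote. The sample rows are made up, the function name `build_profiles` is hypothetical, and averaging the points over the years present in the data is an assumption: the post does not say how the years were combined.

```python
# Made-up sample rows in the shape (year, from country, to country, votes).
records = [
    (2017, "Sweden", "Norway", 12),
    (2017, "Norway", "Sweden", 10),
    (2017, "Greece", "Cyprus", 12),
    (2018, "Sweden", "Norway", 8),
    (2018, "Cyprus", "Greece", 12),
]

SELF_VOTE = 12  # each country gets 12 points from itself, as described above


def build_profiles(rows, self_vote=SELF_VOTE):
    """Return (countries, matrix).

    matrix[i][j] is the mean number of points per year that country i
    gave to country j (assumption: a simple average over all years).
    """
    countries = sorted({c for _, giver, receiver, _ in rows for c in (giver, receiver)})
    n_years = len({year for year, _, _, _ in rows})

    # Sum the points for every (giver, receiver) pair.
    totals = {}
    for _, giver, receiver, points in rows:
        totals[(giver, receiver)] = totals.get((giver, receiver), 0) + points

    matrix = [[totals.get((g, r), 0) / n_years for r in countries] for g in countries]

    # Overwrite the diagonal with the artificial self-vote.
    for i in range(len(countries)):
        matrix[i][i] = float(self_vote)
    return countries, matrix


countries, matrix = build_profiles(records)
for name, row in zip(countries, matrix):
    print(name, row)
```

Each row of the resulting matrix is one country's voting profile, which is the kind of table both analyses below take as input.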

Then I tried two different analyses: cluster analysis and Principal Component Analysis.

"Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters)." [Wikipedia]
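To make the definition concrete, here is a small sketch of one common flavor of cluster analysis: agglomerative (hierarchical) clustering with average linkage and Euclidean distance. The choice of method and distance is an assumption, not necessarily what was used for this post, and the voting profiles are invented numbers. In practice a statistics library would do this work; the sketch only uses the standard library so that it stays self-contained.

```python
import math

# Made-up voting profiles: points each country gives to
# [Norway, Sweden, Greece, Cyprus], 12-point self-vote included.
profiles = {
    "Norway": [12, 10, 1, 0],
    "Sweden": [10, 12, 0, 2],
    "Greece": [0, 1, 12, 12],
    "Cyprus": [1, 0, 12, 12],
}


def euclidean(a, b):
    """Straight-line distance between two voting profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(cluster_a, cluster_b, data):
    """Mean pairwise distance between the members of two clusters."""
    pairs = [(p, q) for p in cluster_a for q in cluster_b]
    return sum(euclidean(data[p], data[q]) for p, q in pairs) / len(pairs)


def agglomerate(data, n_clusters):
    """Start with one cluster per country and repeatedly merge the two
    closest clusters until only n_clusters remain."""
    clusters = [[name] for name in data]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], data)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


groups = agglomerate(profiles, n_clusters=2)
print(groups)
```

With these toy numbers, countries that give their points to the same places end up in the same group, which is the effect the self-vote is meant to reinforce.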



"Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components." [Wikipedia]

In plain English, PCA is a technique used to convert several variables (dozens or even hundreds) into a small number of variables while retaining the maximum amount of variability. It is used in fields where large numbers of variables are collected (like ecology), but also for smaller collections of data (the analysis of decathlon results is maybe the most popular example in tutorials).
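The idea can be sketched in a few lines of Python. The example below centers a made-up table of voting profiles, builds its covariance matrix, and extracts the first two principal components by power iteration with deflation. That numerical method is chosen here only to keep the sketch free of dependencies (a statistics library would normally be used), and all names and numbers are illustrative assumptions.

```python
import math

# Made-up voting profiles (rows: voting country; columns: points given to
# Norway, Sweden, Greece, Cyprus). Illustrative numbers only.
names = ["Norway", "Sweden", "Greece", "Cyprus"]
X = [
    [12, 10, 1, 0],
    [10, 12, 0, 2],
    [0, 1, 12, 12],
    [1, 0, 12, 12],
]


def center(rows):
    """Subtract each column's mean so every variable has mean zero."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, p = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(p)]
            for i in range(p)]


def mat_vec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def leading_eigenpair(matrix, iterations=500):
    """Power iteration: dominant eigenvalue and its unit eigenvector."""
    p = len(matrix)
    start = [float(i + 1) for i in range(p)]  # asymmetric start vector
    norm = math.sqrt(sum(v * v for v in start))
    vec = [v / norm for v in start]
    for _ in range(iterations):
        nxt = mat_vec(matrix, vec)
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0:
            break
        vec = [v / norm for v in nxt]
    value = sum(v * w for v, w in zip(vec, mat_vec(matrix, vec)))
    return value, vec


def pca(rows, n_components=2):
    """Return (scores, ratios): each observation's coordinates on the
    components and the share of total variance each component explains."""
    centered = center(rows)
    cov = covariance(centered)
    total_var = sum(cov[i][i] for i in range(len(cov)))
    components, ratios = [], []
    for _ in range(n_components):
        value, vec = leading_eigenpair(cov)
        components.append(vec)
        ratios.append(value / total_var)
        # Deflate: remove the variance already captured by this component.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(len(vec))]
               for i in range(len(vec))]
    scores = [[sum(x * w for x, w in zip(row, comp)) for comp in components]
              for row in centered]
    return scores, ratios


scores, ratios = pca(X)
for name, (pc1, pc2) in zip(names, scores):
    print(f"{name}: PC1={pc1:.2f}, PC2={pc2:.2f}")
print("Share of variance explained:", [round(r, 3) for r in ratios])
```

In this toy table almost all of the variability lies along a single direction that separates the two pairs of countries, so four columns collapse into essentially one informative variable. The sign of each component is arbitrary, so only the relative positions of the countries matter.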

The first two principal components can be plotted in what is called a biplot, where the observations (in our case, countries) are placed according to their scores (their values on these components). To compare with the cluster analysis, the country groups it produced are plotted in the biplot in different colors.




[Originally published in wass.cat]
