6th April 2018
Quick post using Python 3 and the Seaborn statistical visualization package to start trying to understand the UK gender pay gap data released this week. All UK companies with more than 250 employees are required to provide data on how their female and male employees are paid differently. I decided to drill down to look at how, according to the data self-reported by companies, pay varies by gender in the electricity sector.
8th July 2017
So things didn't turn out quite as anyone expected in the snap UK general election...
I wanted to create a visualisation of the results which contrasts the seats won with the % of the popular vote, and came up with this infographic. The nice thing about the two semi-circular charts I generated is that they can be nested within each other.
21st August 2016
Four years on from the London Olympics he's only gone and done it again: the double double 5000m/10000m.
Once again, I tracked the tweets using the Twitter streaming API (search terms #gomo, #motime, @mo_farah, #mofarah) before, during and after the race.
The interesting thing is, well, that the distribution of tweets over time is pretty similar to last time. Even the absolute rates in tweets per second are similar, despite the fact that the race started at 1.37am British Summer Time. You can compare them yourselves by looking at my original post from 2012.
pyspark, python, data science
5th May 2014
pyspark, python, data science
2nd March 2014
Apache Spark is a relatively new data processing engine implemented in Scala and Java that can run on a cluster to process and analyze large amounts of data. Spark performance is particularly good if the cluster has sufficient main memory to hold the data being analyzed. Several sub-projects run on top of Spark and provide graph analysis (GraphX), a Hive-based SQL engine (Shark), machine learning algorithms (MLlib) and realtime streaming (Spark Streaming). Spark has also recently been promoted from incubator status to a top-level project.
In this series of blog posts, we'll look at installing Spark on a cluster and explore using its Python API bindings, PySpark, for a number of practical data science tasks. This first post focuses on installation and getting started.
1st December 2012
This snippet, twitstreamer, is a simple command line tool, written in Python 3, for retrieving tweets via the Twitter streaming API, v1.1. The tweets are written to standard output as CSV or JSON formatted lines.
The tool will read from either of two Twitter streaming APIs.
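To illustrate the "CSV or JSON formatted lines" idea, here is a minimal sketch of turning one tweet into one output line. It is not the twitstreamer source: the function name `format_tweet`, the field names (`id`, `created_at`, `text`) and the sample tweet are all assumptions made for the example.

```python
import csv
import io
import json

def format_tweet(tweet, fmt="json", fields=("id", "created_at", "text")):
    """Render one tweet (a dict) as a single JSON or CSV line.

    The field names are hypothetical; a real tweet object has many more.
    """
    if fmt == "json":
        # Keep only the selected fields, one JSON object per line.
        return json.dumps({k: tweet.get(k) for k in fields}, ensure_ascii=False)
    if fmt == "csv":
        # Let the csv module handle quoting of commas and quote characters.
        buf = io.StringIO()
        csv.writer(buf).writerow([tweet.get(k, "") for k in fields])
        return buf.getvalue().rstrip("\r\n")
    raise ValueError(f"unknown format: {fmt!r}")

# Made-up sample tweet, for illustration only.
sample = {"id": 1, "created_at": "2012-12-01T10:00:00Z", "text": 'Go Mo, "amazing"!'}
csv_line = format_tweet(sample, "csv")
json_line = format_tweet(sample, "json")
print(csv_line)
print(json_line)
```

Writing one record per line keeps the output easy to pipe into other command line tools or append to a file.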
21st November 2012
This snippet, twitfetcher, is a simple command line tool, written in Python 3, for retrieving tweets via the Twitter search API, v1.1. The tweets can be stored in CSV or JSON formatted files.
Twitter only makes a sample of those tweets sent over the previous week searchable, but it is still a very useful free source of data for data science experiments.
python, sna, gephi, nltk
12th October 2012
I started on Coursera's Social Network Analysis course and was looking around for some network data to start analyzing. I'd seen a talk by Matt Biddulph at a Big Data London meetup (blog post) on analyzing Wikipedia data and wondered if something similar could be easily done with news data.
It was fairly easy to grab some newspaper articles using the Guardian Open Platform. I then used the Python-based Natural Language Toolkit to extract named entities (in particular the names of people) from the articles. A network could then be constructed using names as the nodes, and connecting nodes with a link if at least two articles included both names.
The resulting network could then be loaded into Gephi, an excellent tool for visualizing and analyzing networks.
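The network construction step can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the code used for the post: it assumes the names have already been extracted for each article, and the function name `build_cooccurrence_edges` and the sample names are made up.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence_edges(article_names, min_articles=2):
    """Map each pair of names to the number of articles mentioning both,
    keeping only pairs that co-occur in at least `min_articles` articles."""
    pair_counts = Counter()
    for names in article_names:
        # Deduplicate within an article and sort so each pair has one canonical order.
        for pair in combinations(sorted(set(names)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_articles}

# Made-up example: the people named in each of four articles.
articles = [
    ["Alice Smith", "Bob Jones", "Carol White"],
    ["Alice Smith", "Bob Jones"],
    ["Bob Jones", "Carol White"],
    ["Alice Smith"],
]
edges = build_cooccurrence_edges(articles)
for (a, b), n in sorted(edges.items()):
    print(f"{a} -- {b} (articles: {n})")
```

Each key in `edges` is a link between two name nodes, and the count can serve as an edge weight when the network is exported for a tool such as Gephi.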
12th August 2012
Another sports-related post, this time inspired by Mo Farah's amazing double gold medals (in the 5000m and 10000m) over the last couple of weeks at the London Olympics.
I used the gRaphael Charting Library and the Twitter search API to show how the rate of tweets containing the hashtag #gomo varied before, during and just after the 5000m final at the London Olympics. Hover over the chart to display the text for selected tweets.
The main features of the chart are a small peak just before the race starts followed by the huge peak after Mo wins. And I thought it was a long way to jog to the bus stop when running late in the morning!
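The tweet rate behind a chart like this can be computed by counting tweets in fixed-width time bins. The sketch below shows one way to do that. The function name `tweets_per_interval`, the one-minute bin width and the timestamps are assumptions for illustration, not the data or code behind the chart.

```python
from collections import Counter
from datetime import datetime, timezone

def tweets_per_interval(timestamps, interval_seconds=60):
    """Group timezone-aware timestamps into fixed-width bins.

    Returns a time-ordered list of (bin_start, count, tweets_per_second).
    """
    counts = Counter()
    for ts in timestamps:
        epoch = int(ts.timestamp())
        # Round down to the start of the bin this tweet falls in.
        counts[epoch - epoch % interval_seconds] += 1
    return [
        (datetime.fromtimestamp(start, tz=timezone.utc), n, n / interval_seconds)
        for start, n in sorted(counts.items())
    ]

# Made-up timestamps, for illustration only.
stamps = [
    datetime(2012, 8, 11, 18, 30, 5, tzinfo=timezone.utc),
    datetime(2012, 8, 11, 18, 30, 40, tzinfo=timezone.utc),
    datetime(2012, 8, 11, 18, 31, 10, tzinfo=timezone.utc),
    datetime(2012, 8, 11, 18, 31, 15, tzinfo=timezone.utc),
    datetime(2012, 8, 11, 18, 31, 59, tzinfo=timezone.utc),
]
rates = tweets_per_interval(stamps)
for start, count, rate in rates:
    print(start.isoformat(), count, round(rate, 3))
```

Bins with no tweets are simply absent from the result, so a plotting step would need to fill those gaps with zeros to draw a continuous line.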
8th August 2012
The visualization plots matches played (x-axis) against points accumulated (y-axis). Click on the "Add club" button to compare the progress against that of the other clubs playing in the England and Wales FA Championship.