Using data to detect sarcasm in Aussie tweets
Australians are renowned for our sarcastic and witty humour. We like to express ourselves by making fun of our friends, even ourselves, in the name of a good laugh. It is generally done in jest and is a form of friendly endearment rather than an intended insult. Many tourists or international students can find it confusing or sometimes offensive when first exposed to Aussie humour.
However, can we use data to support this well-known stereotype? We turned to Twitter, using natural language processing and machine learning techniques, to investigate. Let’s see if we can teach a computer to detect sarcasm when it can leave even humans scratching their heads.
Australians on Politics
Over the last 20 years, satire has played an increasing role in the way Aussies express their opinions on politics. We reviewed Australian election-related tweets from the 2019 election and fed them through a sentiment analysis algorithm.
At first glance, our results showed us that Aussies generally felt quite positive towards the election.
Let’s take a look at some of these “positive” tweets:
I love how Mr Bill Shorten used “sophistication” many times #auspol
Nah, it’s cool, we don’t need a planet anyway #auspol #ElectionResults2019
Good on ya, Queensland. #ImBeingVerySarcasticRightNow #Election2019Results #AUSVote19 #auspol
It seems our basic sentiment algorithm has not been able to detect sarcasm; what a surprise.
Machine learning & neural network techniques
A computer reads sentences a bit differently from a human: it breaks the sentence down into a series of numbers and decides sentiment based on those numbers. The problem is that the numbers associated with phrases such as “it’s cool” and “Good on ya, Queensland” signal positivity, while the surrounding context is sarcastic. In the case of politics, sarcasm is generally reserved for a more negative tone.
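To make the problem concrete, here is a minimal sketch of a word-score sentiment algorithm. It is not the algorithm we used; the word list and scores in `LEXICON` are invented for illustration, and the function names are hypothetical. It shows how a sarcastic tweet can come out as “positive” when each word is scored in isolation.

```python
import re

# Hypothetical word scores, invented for illustration only.
LEXICON = {
    "love": 2.0, "lovely": 2.0, "cool": 1.5, "good": 1.5,
    "hate": -2.0, "awful": -2.0, "terrible": -2.0,
}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def sentiment_score(text):
    """Sum the scores of every known word; unknown words count as 0."""
    return sum(LEXICON.get(token, 0.0) for token in tokenize(text))

def sentiment_label(text):
    """Turn the numeric score into a positive / negative / neutral label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# The word "cool" carries a positive score, so the sarcasm is missed.
tweet = "Nah, it's cool, we don't need a planet anyway #auspol"
print(sentiment_label(tweet))
```

Because the score only adds up individual words, nothing in this approach can notice that “we don’t need a planet anyway” flips the meaning of “cool”.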
We used a number of techniques to detect sarcasm in these tweets, including:
Logistic regression - which gave us interpretable and somewhat robust results (we’ll explain further)
Linear SVM (support vector machine) - a more sophisticated machine learning method which ended up yielding similar results to logistic regression
Neural Network (LSTM) - a complex black box that takes into account the sequencing of words within a sentence. Although a popular method for text analysis, we found this technique to be prone to initial overfitting
For the purposes of this blog, we decided to continue with logistic regression because it is the simplest model to interpret and explain.
At a high level, this model identified keywords or phrases which were most prevalent in sarcastic text. These “phrases” were then given a “weighting” or score; the more of these “phrases” present in the tweet, the more likely it was to be tagged as sarcastic.
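The idea of weighted phrases can be sketched in a few lines of Python. This is an illustration of how a logistic regression scores a tweet once it has been trained, not our actual model: the phrases, the values in `WEIGHTS` and the `BIAS` term are all made up, whereas a real model learns them from labelled tweets.

```python
import math
import re

# Hypothetical phrase weights and bias, invented for illustration.
# A trained model would learn these values from labelled tweets.
WEIGHTS = {
    "just": 0.8,
    "just love": 1.6,
    "just lovely": 1.6,
    "cool cool": 1.2,
}
BIAS = -2.0

def features(text):
    """Return the single words and two-word phrases found in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

def sarcasm_probability(text):
    """Add up the weights of the phrases present, then squash to 0..1."""
    z = BIAS + sum(WEIGHTS.get(feature, 0.0) for feature in features(text))
    return 1.0 / (1.0 + math.exp(-z))

sarcastic = "Also, just lovely to wake up to hear ScoMo won the election. Cool cool cool fine fine fine."
plain = "The election results were announced today"
print(round(sarcasm_probability(sarcastic), 2))
print(round(sarcasm_probability(plain), 2))
```

Each matching phrase pushes the score up, so a tweet containing several of them ends up with a higher probability of being tagged as sarcastic. A tweet with none of them stays near the baseline set by the bias.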
Here is an example of the top 10 “phrases” or features our model identified
Let’s take a look at the example below. Our machine learning model picked up the word “just”, which is generally present in sarcasm. This word increases in importance if it is used before a positive word such as “love” or “lovely” in a negative context, e.g. “I just love..” or “isn’t it just lovely..”
Also, just lovely to wake up to hear ScoMo won the election. So much for ending climate change and improving the lives of everyone under 40. Cool cool cool fine fine fine. #auspol
This is a basic example of how machine learning begins to learn sarcasm. However, our model is not perfect and did not detect sarcasm in every tweet. Consider the tweet below:
Amaze Balls, The Power Of Bill Short-On Details Is Mind Blowing! #Not #AusPol #AusVotes2019
A person with more context and understanding around Australian politics could probably have picked this up. However, for a machine it can be tricky. When spoken, sarcasm can be identified by the tone and facial expression of the person; in written form, it can be more ambiguous.
How about this one?
Our model has picked this up as sarcastic. Honestly, I can’t even tell if this person is being completely serious or not. What do you think?
We used our sarcasm model to adjust the initial sentiment analysis results
There is definitely sarcasm detected in the Twitter data. Our model has reclassified 13% of the positive tweets to neutral or negative (note: negative is reserved for tweets that are more explicitly negative).
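The adjustment step can be sketched roughly as follows. This is a simplified illustration under our own assumptions rather than the exact rule we applied: the field names, the `explicit` flag (standing in for some separate check that a tweet is explicitly negative) and the sample rows are all hypothetical.

```python
def adjust_sentiment(sentiment, is_sarcastic, explicitly_negative=False):
    """Return the adjusted label for one tweet.

    Only positive tweets flagged as sarcastic are changed. They drop to
    "neutral" by default, and to "negative" only when a separate
    (hypothetical) check marks them as explicitly negative.
    """
    if sentiment == "positive" and is_sarcastic:
        return "negative" if explicitly_negative else "neutral"
    return sentiment

def reclassified_share(tweets):
    """Fraction of originally positive tweets whose label changed."""
    positives = [t for t in tweets if t["sentiment"] == "positive"]
    if not positives:
        return 0.0
    changed = sum(
        1
        for t in positives
        if adjust_sentiment(t["sentiment"], t["sarcastic"], t.get("explicit", False)) != "positive"
    )
    return changed / len(positives)

# Made-up rows for illustration; these are not our real results.
sample = [
    {"sentiment": "positive", "sarcastic": True, "explicit": False},
    {"sentiment": "positive", "sarcastic": False},
    {"sentiment": "positive", "sarcastic": True, "explicit": True},
    {"sentiment": "positive", "sarcastic": False},
    {"sentiment": "negative", "sarcastic": True},
]
print(reclassified_share(sample))
```

Tweets that were already neutral or negative are left alone, so the share is measured against the originally positive tweets only.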
Our natural next question: where do the most sarcastic Aussies live?
Sarcastic folks: where the bloody hell are you?
We’ve used our model and assigned tweets to Australian SA4 (Statistical Area Level 4) regions.
Colour scale: less sarcastic (light purple) to more sarcastic (dark purple)
Looks like South Australia really stands out from the rest, along with parts of Brisbane, Sydney and Perth. We looked at capital cities below to pull out some stats.
This plot shows the proportion of sarcastic tweets compared to total tweets in each city, with the dotted line representing the national average.
Nationally, about 1 in 4 tweets are sarcastic.
That’s a pretty high statistic. It is important to note, though, that Twitter does not perfectly reflect election sentiment in reality. For this case study, we’ll conclude that there is ample evidence in our data to support the idea that Aussies are pretty sarcastic.
In terms of cities specifically, Darwin, you guys are definitely taking the lead here with 1 in 3 tweets being sarcastic! Canberra… can’t say I’m too surprised there.
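The calculation behind a plot like this is simple to sketch. The snippet below assumes each tweet has already been tagged with a city and a sarcastic / not sarcastic flag; the city names and rows are placeholders we made up, not our data, and the function name is hypothetical.

```python
from collections import defaultdict

def sarcasm_rates(tweets):
    """Return ({city: share of sarcastic tweets}, overall share).

    `tweets` is an iterable of (city, is_sarcastic) pairs.
    """
    totals = defaultdict(int)
    sarcastic = defaultdict(int)
    for city, is_sarcastic in tweets:
        totals[city] += 1
        if is_sarcastic:
            sarcastic[city] += 1
    by_city = {city: sarcastic[city] / totals[city] for city in totals}
    overall = sum(sarcastic.values()) / sum(totals.values()) if totals else 0.0
    return by_city, overall

# Placeholder rows, invented for illustration only.
sample = [
    ("CityA", True), ("CityA", False), ("CityA", False),
    ("CityB", True), ("CityB", False), ("CityB", False),
    ("CityB", False), ("CityB", False),
]
by_city, overall = sarcasm_rates(sample)
for city, rate in sorted(by_city.items()):
    position = "above" if rate > overall else "at or below"
    print(f"{city}: {rate:.0%} sarcastic ({position} the overall rate of {overall:.0%})")
```

Note that the overall rate here is computed across all tweets rather than as an average of the city rates, so cities with more tweets carry more weight.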
Natural language processing is a very complex task, yet many companies use it to explore the gold mine of free-form writing and better understand human sentiment and emotions. The complication lies not only in the context of the text, e.g. politics, food or animals, but also in the medium or channel in which the text is written.
In this instance, one such application could be in Australian tourism. Based on our analysis, we found that about 1 in 4 tweets nationally showed a sarcastic tone during the Australian election. Whilst the data is certainly skewed, could we still boldly conclude that sarcasm is a large part of Australian culture? With tourism and education being such a big part of the Australian economy, imagine how many jokes out there are being misinterpreted by our well-mannered visitors!
What if we could more accurately translate and convey the intended messages in language or translation apps? Perhaps this is what we should look into next.
Want to check out our maps and draw your own conclusions? Explore our interactive map and insights here. This has been built using a combination of R and Python. Unfortunately, it has not been optimised for mobile devices!
EdgeRed is a boutique data and analytics consultancy specialising in delivering high quality outcomes for our clients. Drop us a note and we'll be happy to have a chat regarding your data and analytics needs.