Data analysis often starts with structured data that’s already stored as numbers, dates, categories, etc. However, unstructured data can yield crucial insights if you use appropriate techniques. In this tutorial, we’ll run sentiment analysis on a text dataset to score headlines as positive or negative, and turn the results into an interactive report.
Running sentiment analysis
Let’s imagine we’re a data scientist working for a news company and we’re trying to figure out how ‘positive’ our news headlines are in comparison to the industry. We’ll start with the UCI News Aggregator dataset (CC0: Public Domain) [1], which is a collection of news headlines from different publications in 2014. This is a fun dataset because it has articles from a wide range of publishers and contains useful metadata.
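The loading-and-cleanup code isn’t reproduced inline here, so the sketch below reconstructs the likely steps from the dataset’s documented schema (ID, TITLE, URL, PUBLISHER, CATEGORY, STORY, HOSTNAME, TIMESTAMP, with millisecond Unix timestamps and one-letter category codes). The inline rows stand in for the downloaded CSV:

```python
import pandas as pd

# Two inline rows standing in for the downloaded uci-news-aggregator.csv;
# the real file has roughly 400k rows with this same 8-column schema.
df = pd.DataFrame({
    "ID": [1, 2],
    "TITLE": ["Fed official says weak data caused by weather",
              "Stocks fall on jobs report"],
    "URL": ["http://example.com/a", "http://example.com/b"],
    "PUBLISHER": ["Los Angeles Times", "Livemint"],
    "CATEGORY": ["b", "m"],          # one-letter codes in the raw data
    "STORY": ["dPhGU51DNMkDmd", "dPhGU51DNMkDmd"],
    "HOSTNAME": ["example.com", "example.com"],
    "TIMESTAMP": [1394470370698, 1394470371207],  # Unix epoch, milliseconds
})

# Convert the millisecond timestamps to proper datetimes...
df["TIMESTAMP"] = pd.to_datetime(df["TIMESTAMP"], unit="ms")

# ...and expand the one-letter category codes into readable labels.
category_names = {"b": "business", "t": "science and technology",
                  "e": "entertainment", "m": "health"}
df["CATEGORY"] = df["CATEGORY"].map(category_names)

print(df[["TITLE", "PUBLISHER", "CATEGORY", "TIMESTAMP"]])
```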
After downloading and cleaning up the data, we get the following result:

We have 8 columns and about 400k rows. We’ll use the ‘Title’ for the actual sentiment analysis, and group the results by ‘Publisher’, ‘Category’ and ‘Timestamp’.
Classifying the headlines
Through the magic of open source, we can use someone else’s hard-earned knowledge in our analysis — in this case a pretrained model called VADER (Valence Aware Dictionary and sEntiment Reasoner), a sentiment intensity analyzer from the popular NLTK library.
To build the model, the authors gathered a list of common words and then asked a panel of human testers to rate each one on valence (i.e. positive or negative) and intensity (i.e. how strong the sentiment is). As the original paper says:
[After stripping out irrelevant words] this left us with just over 7,500 lexical features with validated valence scores that indicated both the sentiment polarity (positive/negative), and the sentiment intensity on a scale from –4 to +4. For example, the word “okay” has a positive valence of 0.9, “good” is 1.9, and “great” is 3.1, whereas “horrible” is –2.5, the frowning emoticon “:(” is –2.2, and “sucks” and “sux” are both –1.5.
To classify a piece of text, the model calculates the valence score for each word, applies some grammatical rules e.g. distinguishing between ‘great’ and ‘not great’, and then sums up the result.
Interestingly, this simple lexicon-based approach matches or beats the accuracy of machine-learning approaches (at least on the benchmarks reported in the original paper), and is much faster. Let’s see how it works!
In this code we import the library, classify each title in our dataset, then append the results to our original dataframe, adding 4 new columns:
- pos: positive score component
- neu: neutral score component
- neg: negative score component
- compound: a single normalized score for the whole text, ranging from -1 (most negative) to +1 (most positive); note this is not simply the sum of the other three components
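The compound score is worth a closer look: VADER sums the word valences (after its rule-based adjustments) and then squashes that sum into the (-1, 1) range using x / sqrt(x^2 + alpha), with alpha = 15. A quick sketch of that normalization (the function name is ours):

```python
import math

def vader_normalize(valence_sum: float, alpha: float = 15.0) -> float:
    """Map an unbounded sum of word valences into (-1, 1), as VADER does."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)

# A neutral sum maps to 0; stronger sums approach +/-1 but never reach them.
print(vader_normalize(0))     # 0.0
print(vader_normalize(4))     # roughly 0.72
print(vader_normalize(-4))    # roughly -0.72
```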
As a sanity check, let’s take a look at the most positive, neutral and negative headlines in the dataset using pandas idxmax:
```python
negative = df.iloc[df.neg.idxmax()]
neutral = df.iloc[df.neu.idxmax()]
positive = df.iloc[df.pos.idxmax()]

print(f'Most negative: {negative.TITLE} ({negative.PUBLISHER})')
print(f'Most neutral: {neutral.TITLE} ({neutral.PUBLISHER})')
print(f'Most positive: {positive.TITLE} ({positive.PUBLISHER})')
```
Running that code gives us the following result:
```
Most negative: I hate cancer (Las Vegas Review-Journal (blog))
Most neutral: Fed's Charles Plosser sees high bar for change in pace of tapering (Livemint)
Most positive: THANK HEAVENS (Daily Beast)
```
Fair enough — ‘THANK HEAVENS’ is a lot more positive than ‘I hate cancer’!
Visualizing the results
What does the distribution of our scores look like? Let’s visualize this in a couple of ways using the interactive plotting library Altair:
Here we’re showing both a histogram for the overall distribution, as well as a 100% stacked bar chart grouped by category. Running that code, we get the following result:

It seems most headlines are neutral, and the health category has more negative articles overall than the other categories.
To give more insight into how our model is classifying the articles, we can create two more plots, one showing a sample of how the model classifies particular headlines, and another showing the average sentiment score for our largest publishers over time:
This is where the declarative syntax of Altair really shines: we just change a few of the keywords, e.g. ‘mark_bar’ to ‘mark_point’, and we get a completely different yet still meaningful result.
By creating interactive visualizations, you enable viewers to explore the data directly. They’ll be much more likely to trust your overall conclusions if they can drill down to the original datapoints.
Looking at the publishers chart, it seems that HuffPost is consistently more negative and RTT more positive. Hmmm, seems like they have different editorial strategies…
Creating a report
The final step is to package the results into an interactive report. As data scientists, we often forget to communicate our results effectively. I’ve often made the mistake of spending hours analysing data to answer somebody’s question, then sending over a screenshot of a chart with a one-line explanation. The viewer then doesn’t understand the results and may not use them to actually make decisions.
Rule #1: always assume someone looking at your work has zero background and needs to understand it from scratch. It’s always worth spending extra time and effort to write up the context and implications of your work.
For this tutorial we’ll use a library called Datapane to create a shareable report from these interactive visualizations. To do this, we’ll first need to create an account on Datapane, wrap our charts inside Datapane blocks and then upload the report.
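As with the earlier gists, the exact report code isn’t shown here. This sketch assumes the classic Datapane interface (a dp.Report built from Text, Plot and DataTable blocks, with an upload call); the report name and Markdown text are ours, the chart variables are the Altair charts built earlier, and uploading requires logging in with your Datapane API token first:

```python
import datapane as dp

# Assumes `hist`, `stacked`, `points`, `lines` are the Altair charts built
# earlier, and `df` is the scored dataframe. Requires a prior
# `dp.login(token=...)` with your Datapane API token.
report = dp.Report(
    dp.Text("# Sentiment of news headlines\n"
            "VADER compound scores for ~400k headlines from 2014, "
            "grouped by category and publisher."),
    dp.Plot(hist & stacked),
    dp.Plot(points & lines),
    dp.DataTable(df),
)

report.upload(name="news-headline-sentiment")
```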
We write chunks of text in Markdown format to give context, interspersed with our actual plots and data. You can explore an embedded, fully interactive version of the report in the original article.
This is a minimal report that you could share with stakeholders (your boss, your mom, etc.) to show overall sentiment across the media landscape. From here you can explore comparisons across time, industry and publishers, with the goal of making recommendations for how your organization can improve.
Conclusion
We can summarize what we’ve learned in this tutorial in three points:
- Use sentiment analysis to extract value from unstructured text
- Build charts in interactive libraries like Altair so your viewers can explore the data themselves, reinforcing your overall conclusions
- Spend extra time and effort writing up your results into a report with context