
Building a Sentiment Analysis report using NLTK and Altair

by ITECHNEWS
November 19, 2021
in Data Science, Leading Stories

Data analysis often starts with structured data that’s already stored as numbers, dates, categories etc. However, unstructured data can yield crucial insights if you use appropriate techniques. In this tutorial, we’ll run sentiment analysis on a textual dataset to calculate positive/negative sentiment, and turn the results into an interactive report.

Running sentiment analysis

Let’s imagine we’re data scientists working for a news company, trying to figure out how ‘positive’ our news headlines are in comparison to the rest of the industry. We’ll start with the UCI News Aggregator dataset (CC0: Public Domain) [1], a collection of news headlines from different publications in 2014. This is a fun dataset because it has articles from a wide range of publishers and contains useful metadata.

After downloading and cleaning up the data, we get the following result:

We have 8 columns and about 400k rows. We’ll use the ‘Title’ for the actual sentiment analysis, and group the results by ‘Publisher’, ‘Category’ and ‘Timestamp’.

Classifying the headlines

Through the magic of open-source, we can use someone else’s hard-earned knowledge in our analysis: in this case VADER (Valence Aware Dictionary and sEntiment Reasoner), a ready-made lexicon- and rule-based model that the popular NLTK library exposes as the SentimentIntensityAnalyzer class.

To build the model, the authors gathered a list of common words and then asked a panel of human raters to score each one on valence (positive or negative) and intensity (how strong the sentiment is). As the original paper says:

[After stripping out irrelevant words] this left us with just over 7,500 lexical features with validated valence scores that indicated both the sentiment polarity (positive/negative), and the sentiment intensity on a scale from –4 to +4. For example, the word “okay” has a positive valence of 0.9, “good” is 1.9, and “great” is 3.1, whereas “horrible” is –2.5, the frowning emoticon “:(” is –2.2, and “sucks” and “sux” are both –1.5.

To classify a piece of text, the model calculates the valence score for each word, applies some grammatical rules e.g. distinguishing between ‘great’ and ‘not great’, and then sums up the result.
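To make the idea concrete, here is a toy sketch of lexicon-based scoring. It is not the NLTK implementation: the lexicon holds only the handful of words quoted above, the negation rule only looks at the immediately preceding word, and the constants `NEGATION_SCALAR` and `ALPHA` are assumptions chosen to mirror what the reference VADER code appears to use.

```python
import math

# Valence scores taken from the paper excerpt quoted above.
# The real lexicon has roughly 7,500 entries; this is a toy subset.
TOY_LEXICON = {
    "okay": 0.9,
    "good": 1.9,
    "great": 3.1,
    "horrible": -2.5,
    "sucks": -1.5,
    "sux": -1.5,
}

NEGATIONS = {"not", "never", "no"}  # hypothetical, deliberately tiny list
NEGATION_SCALAR = -0.74  # assumption: flips and dampens a negated word
ALPHA = 15               # assumption: normalisation constant


def toy_compound(text):
    """Sum word valences, apply a one-word negation rule, squash to (-1, 1)."""
    tokens = [token.strip(".,!?") for token in text.lower().split()]
    total = 0.0
    for i, token in enumerate(tokens):
        valence = TOY_LEXICON.get(token, 0.0)
        # Grammatical rule: 'not great' should not score like 'great'.
        if valence and i > 0 and tokens[i - 1] in NEGATIONS:
            valence *= NEGATION_SCALAR
        total += valence
    # Normalise the raw sum so long and short texts are comparable.
    return total / math.sqrt(total * total + ALPHA)


for phrase in ["great", "not great", "horrible", "the weather"]:
    print(phrase, round(toy_compound(phrase), 4))
```

The real model layers more rules on top (capitalisation, punctuation, intensifiers such as ‘very’), but the core loop is the same: look up, adjust, sum, normalise.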

Interestingly, the authors report that this simple lexicon-based approach matches or beats several machine-learning baselines on their benchmarks, and it is much faster. Let’s see how it works! (Code: https://gist.githubusercontent.com/johnmicahreid/d4de6c0303c8b0724f7be93f5deb3c87/raw/c6e33bfb9ca8d10a1e4977083b0ace371ed6fe03/vader_model.py)

In this code we import the library, classify each title in our dataset then append the results to our original dataframe. We have added 4 new columns:

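  • pos: the proportion of the text that is rated positive
  • neu: the proportion of the text that is rated neutral
  • neg: the proportion of the text that is rated negative
  • compound: the sum of the word-level valence scores, normalized to a range from –1 (most negative) to +1 (most positive)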
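The scoring step can be sketched roughly as follows. This is an approximation rather than the original gist: the helper names `add_sentiment_columns` and `load_vader` are hypothetical, and the column name `TITLE` is assumed from the sanity-check code in this article. The NLTK import sits inside `load_vader` so that the pandas part can be tried without downloading the lexicon.

```python
import pandas as pd


def add_sentiment_columns(df, analyzer, text_column="TITLE"):
    """Return a copy of df with neg/neu/pos/compound columns appended.

    `analyzer` is any object with a polarity_scores(text) method that
    returns a dict, such as NLTK's SentimentIntensityAnalyzer.
    """
    scores = df[text_column].astype(str).apply(analyzer.polarity_scores)
    # One dict per headline becomes one row of score columns.
    score_frame = pd.DataFrame(scores.tolist(), index=df.index)
    return df.join(score_frame)


def load_vader():
    """Download the VADER lexicon if needed and build the analyzer."""
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)
    return SentimentIntensityAnalyzer()
```

With the news data loaded into a dataframe `df`, usage would be `df = add_sentiment_columns(df, load_vader())`.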

As a sanity check, let’s take a look at the most positive, most neutral and most negative headlines in the dataset by using pandas idxmax:

# idxmax returns an index label, so look the row up with .loc
negative = df.loc[df.neg.idxmax()]
neutral = df.loc[df.neu.idxmax()]
positive = df.loc[df.pos.idxmax()]

print(f'Most negative: {negative.TITLE} ({negative.PUBLISHER})')
print(f'Most neutral: {neutral.TITLE} ({neutral.PUBLISHER})')
print(f'Most positive: {positive.TITLE} ({positive.PUBLISHER})')

Running that code gives us the following result:

Most negative: I hate cancer (Las Vegas Review-Journal (blog))
Most neutral: Fed's Charles Plosser sees high bar for change in pace of tapering (Livemint)
Most positive: THANK HEAVENS (Daily Beast)

Fair enough: ‘THANK HEAVENS’ is a lot more positive than ‘I hate cancer’!

Visualizing the results

What does the distribution of our scores look like? Let’s visualize this in a couple of ways using the interactive plotting library Altair:

Here we’re showing both a histogram for the overall distribution, as well as a 100% stacked bar chart grouped by category. Running that code, we get the following result:

It seems that most headlines are neutral, and that health has more negative headlines overall than the other categories.

To give more insight into how our model is classifying the articles, we can create two more plots, one showing a sample of how the model classifies particular headlines, and another showing the average sentiment score for our largest publishers over time:

This is where the declarative syntax of Altair really shines: we just change a few of the keywords, e.g. mark_bar to mark_point, and we get a completely different yet still meaningful result.

By creating interactive visualizations, you enable viewers to explore the data directly. They’ll be much more likely to trust your overall conclusions if they can drill down to the original datapoints.

Looking at the publishers chart, it seems that HuffPost is consistently more negative and RTT more positive. Hmmm, seems like they have different editorial strategies…

Creating a report

The final step is to package the results into an interactive report. As data scientists, we often forget to communicate our results effectively. I’ve often made the mistake of spending hours analysing data to answer somebody’s question then sending over a screenshot of a chart with a one-line explanation. My viewer then doesn’t understand the results and may not use them to actually make decisions.

Rule #1: always assume that someone looking at your work has zero background and needs to understand it from scratch. It’s always worth spending extra time and effort to write up the context and implications of your work.

For this tutorial we’ll use a library called Datapane to create a shareable report from these interactive visualizations. To do this, we’ll first need to create an account on Datapane, then wrap our charts inside Datapane blocks and upload the report.

We write chunks of text in Markdown format to give context, interspersed with our actual plots and data. You can see an embedded version of the chart with interactive visualizations here:

This is a minimal report that you could share with stakeholders (your boss, your mom etc) to show overall sentiment across the media landscape. From here you can explore comparisons across time, industry and publishers, with the goal of making recommendations for how your organization can improve.

Conclusion

We can summarize what we’ve learned in this tutorial through three points:

  1. Use sentiment analysis to extract value from unstructured text
  2. Build charts in interactive libraries like Altair so your viewers can explore the data themselves, reinforcing your overall conclusions
  3. Spend extra time and effort writing up your results into a report with context