09 Mar 2011

Building a Social Media Listening Platform from scratch



The brief from management was to build a system capable of collecting brand mentions from all over the web, organizing and analyzing them, and then displaying them on an interface that we could distribute to our clients and account managers.  Mcgarrybowen needed its very own Social Media Listening Platform.

With some further questioning, more requirements were established.  The system should be a peace-of-mind application for distribution to our clients, one that monitors their real-time brand reputation on the web.  It should be optimized for quick-glance reviews, with broad but shallow content, but with the ability to drill down from top-level reports into granular metrics reporting.  It should be capable of tracking the reputation not just of the brand itself, but of its key competition as well. It would need to be accessible from the web or the iPad, so HTML5 was a must.

With a mammoth assignment like this, our first step was to break it down into manageable, workable chunks.

Finding the data sources:

Our first challenge: Where on the Internet is the brand being mentioned, and how do we collect that data?

We divided our focus into three buckets:

  1. News & Headlines – What is the mainstream news media saying about the brand?  What press releases have been published that reference the brand?
  2. Online Authorities and In-Market sources – What sites are consumers likely to visit when they’re searching for information about the brand or the industry?  What ratings and reviews are they likely to view that mention the brand?
  3. Buzz – What are consumers likely to come across on the web when they are not actively in-market?  What are bloggers, tweeters, and Diggers saying that might affect a consumer’s opinion of the brand?

We knew that a robust solution would eventually take us down the route of building screen scrapers that could collect and organize data from any site we pointed them at, but for the initial prototypes, we decided to focus strictly on sources with well-established APIs.  We picked 30 sources, checked documentation and ran test queries, organized the returned data, and did a gap analysis to figure out how we would reconcile the data across multiple sources.

This is where we ran into our first set of issues:

  1. How often are we querying? Some sources might only need to update once a day, but others, like Twitter, would require near-constant monitoring to keep up with the sheer volume of results.
  2. What format are the returns in? We realized that our pulls would need to be capable of parsing XML, JSON, or CSV depending on the source.
  3. Are we violating anyone’s Terms of Service? We knew we wanted to store everything in a centralized database, but several sources had specific prohibitions against storing their data in external frameworks.
  4. But the biggest question turned out to be:  What, exactly, are we searching for? We quickly realized that our pulls were going to have to be keyword-driven, submitting a given term to the API and logging what the returns were.
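As a sketch of that multi-format problem: a per-source normalizer can flatten whatever a source returns into a common shape before it hits the database.  Everything here is hypothetical (the `item`/`text` field names, the sample payloads); each real source has its own schema and its own docs.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

def parse_payload(raw, fmt):
    """Normalize a raw API response into a list of mention dicts.

    `fmt` is whatever the source's documentation says it returns;
    the field names below are placeholders, not any real API's schema.
    """
    if fmt == "json":
        return [{"text": item["text"]} for item in json.loads(raw)]
    if fmt == "xml":
        root = ET.fromstring(raw)
        return [{"text": node.findtext("text")} for node in root.findall("item")]
    if fmt == "csv":
        return [{"text": row["text"]} for row in csv.DictReader(io.StringIO(raw))]
    raise ValueError(f"unknown format: {fmt}")

# Sample payloads standing in for live API responses.
samples = [
    ('[{"text": "Loved my stay at the Marriott"}]', "json"),
    ('<results><item><text>Marriot lobby was packed</text></item></results>', "xml"),
    ("text\nGreat rates at Courtyard this week", "csv"),
]

mentions = []
for raw, fmt in samples:
    mentions.extend(parse_payload(raw, fmt))
```

Once every source funnels through a normalizer like this, the storage and analysis layers only ever see one shape of record.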

A keyword search strategy would need to be established.  We started out by searching only with the brand’s name, but realized that even subtle misspellings would be lost in this search, so we created the “Brand-words” category.   One of our test cases was Marriott, which meant also including “Marriot”, “Mariott”, and “Mariot”.  Sub-brand terms like “Courtyard”, “Renaissance”, and “Residence Inn” filled out this category.

Our second grouping was “Competitor-words”, which included search terms with the names of the top competition within the brand’s industry (in the case of Marriott, we used “Hilton”, “Intercontinental”, and “Four Seasons”).  The final grouping was “Industry-words”, hoping to capture conversations about more general topics within the industry.
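The three groupings amount to a simple keyword plan that the pullers can iterate over.  A minimal sketch, using the brand and competitor terms from our Marriott test case (the industry terms here are invented for illustration):

```python
# Hypothetical keyword plan, grouped the same three ways described above.
keywords = {
    "brand": ["Marriott", "Marriot", "Mariott", "Mariot",
              "Courtyard", "Renaissance", "Residence Inn"],
    "competitor": ["Hilton", "Intercontinental", "Four Seasons"],
    "industry": ["hotel deals", "business travel"],
}

def queries(plan):
    """Flatten the plan into (category, term) pairs for the API pullers."""
    for category, terms in plan.items():
        for term in terms:
            yield category, term

pairs = list(queries(keywords))
```

Tagging each pull with its category up front makes it trivial later to report on brand, competitor, and industry conversations separately.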

Once we’d worked through all of these issues, we had a clean and stable database, pulling from each source regularly, with an API for getting the data back out.  We could now move on to the next most pressing concern:

How to analyze the data:

Since reputation management is all about keeping people’s opinions more positive than negative about your brand, our first priority for the prototype was to set up a sentiment analysis engine capable of reading through each of our database items and appending an evaluation of how positive or negative they were.  Even the smallest amount of research revealed this to be a huge task, but we were up for the challenge.

We looked at a number of different approaches, both custom and off-the-shelf, and determined that we’d get the most value out of building and training our own Naïve Bayesian classifier, a well-documented method of extracting sentiment from unstructured text.  Given a number of sample text snippets, each with a manually-supplied categorization, the system should, in time, be able to recognize which of the categories any new text snippet should belong in.  Anything that’s noticed as mis-scored can be resubmitted to the system with a correction, gradually increasing the tool’s accuracy over time.
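For the curious, a classifier of this kind is small enough to sketch in full.  This is a bare-bones multinomial Naive Bayes with add-one smoothing, not the production engine; a real deployment would add proper tokenization, feature selection, and the resubmit-and-retrain correction loop described above.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Bare-bones multinomial Naive Bayes over word counts."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.label_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total = sum(self.label_counts.values())
        best, best_score = None, float("-inf")
        for label in self.label_counts:
            # log prior + sum of log likelihoods, with add-one smoothing
            score = math.log(self.label_counts[label] / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

clf = NaiveBayes()
clf.train("great stay friendly staff", "positive")
clf.train("terrible service dirty room", "negative")
label = clf.classify("friendly staff great room")
```

Correcting a mis-scored item is just another call to `train()` with the right label, which is exactly why the accuracy climbs as the training set grows.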

We knew that our system should be capable of recognizing “Positive” vs. “Negative” vs. “Neutral”, but after looking at the data we were accumulating, we were again surprised by how much of it wasn’t right for the system.  We included a “NSFW” category to weed out the more colorful entries, a “Non-Applicable” category (we hadn’t realized that most “Hilton” searches would surface the latest Paris Hilton scandals), and a “Spam” category to filter out the surprising volume of Tweet-spam featuring bogus vacation offers.

A handful of lucky interns were tasked with poring over 20,000 of our database entries and manually scoring each one against the six categories.  Every entry was scored twice, by two different people, as confirmation.  If the two scores differed, the entry was flagged for further administrator review.  After amassing this much training data, testing confirmed that new items were being scored with up to 72% accuracy.
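The double-scoring pass boils down to a simple reconciliation step: keep what the two scorers agree on, flag the rest.  A sketch (the entry IDs and labels here are made-up examples):

```python
def reconcile(first, second):
    """Keep entries where the two manual scores agree; flag the rest."""
    agreed, flagged = {}, []
    for entry_id, label_a in first.items():
        label_b = second[entry_id]
        if label_a == label_b:
            agreed[entry_id] = label_a
        else:
            flagged.append(entry_id)  # goes to an administrator for review
    return agreed, flagged

first_pass = {1: "Positive", 2: "Spam", 3: "Neutral"}
second_pass = {1: "Positive", 2: "NSFW", 3: "Neutral"}
agreed, flagged = reconcile(first_pass, second_pass)
```

Only the agreed-upon entries feed the classifier, which keeps one scorer's bad day from polluting the training set.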

Presenting the Data:

With our back-end in order, our attention shifted to the interface: How are our clients going to view the data?  Our only hard constraint was that it be viewable on the iPad, so a pure HTML5 Canvas application was needed.

A designer was brought in to prepare the visualization scheme, based around a radial “health” metric that compared the total number of brand mentions to how many of those were positive or negative.  We ended up with three data views:

  • Up-to-the-minute, live-streaming brand mentions, monitoring a single day’s brand health as they are picked up by the system
  • An at-a-glance historical review of brand performance for recent pre-set time periods
  • A deep analytics toolkit to monitor the brand’s sentiment over time, keyword-by-keyword and for any selected date range

The combination of these three techniques satisfied all our original requirements and created a platform for future data viz designers to pick up where we left off.

What we learned:

Building a Social Media Listening Platform is not easy, but once the fundamentals are in place, the pieces start to click together like a puzzle.  Broken down by challenge:

  1. Intake: for each source you plan to monitor, you will need custom scripts to call for new data, clean it up, and store it in your database.  Make sure you know each API’s Terms of Service and whether it limits how often and how much data you can fetch.  Your storage needs will grow as you add new keywords, and will depend on how general or specific your terms are.
  2. Processing: find a sentiment-analysis algorithm you like and train the heck out of it.  The smarter you can make your system, the more accurate and valuable it’ll be.  Unfortunately, this cannot (in most cases) be automated, so account for the training time in your planning.  Bribe your interns to do it with pizza and iTunes gift cards and you’ll have great success.
  3. Display: compelling data visualization is at least as important as the data itself.  Get a designer who knows what they’re doing, and it’ll be a simple matter of hooking the right data pipes to the right display outputs.

Get these three areas right and it’ll be a piece of cake.  Or just ask nicely and I can help you make one; I should be pretty good at it by now.
