Social Media
Hurricane Sandy Data

I'm still hard at work on my social media analysis and monitoring tool, Social Harvest, but in the meantime I thought I'd share some data with the world (at the bottom of this post). I'm thinking about releasing bits of data for free or possibly selling various data sets. This would be separate from the service itself. If you're interested, please comment below.
This little sample of data includes over 24,000 mentions, from Twitter and Facebook, for hurricane Sandy with a second file containing over 16,000 URLs that were shared regarding Sandy. All of this is in JSON format (an export from MongoDB). You would need to clean it up a tiny bit if you wanted to load it all as a single JSON object.
What can you do with this data? In short: anything you want. If you are going to publish something online, or otherwise, I do ask that you credit/cite Social Harvest as the source of the data.
What is in this data? There is no message text, or personal information, in accordance with the terms of service from Twitter and Facebook. However, message ids are in there so you can go back and track down original content if desired. Here is a list of the fields and what they contain:
- s_type: The source, "tweet" or "facebookMessage" is all that you will see in this case.
- _id: The id for the item, internal from Social Harvest (I left it in the export in case you need a unique id for each item).
- s_id: The source id, this will let you go back to the original item using Twitter or Facebook's API.
- d: The date ($date is a MongoDB thing, but the values are timestamps).
- g: The gender, m = male, f = female, and u = unknown (Twitter does not give or ask for gender, but I have some algorithms that can work it out in some cases, though not all).
- geo: Geo-location data if available.
- l: Locale information.
- h: List of hashtags for the mention (Twitter only).
- pop: Popularity info such as number of likes or retweets, etc.
- s: Sentiment, -1 is negative, 0 is neutral, and 1 is positive.
For the shared links report, you get some similar fields plus a new field, "u", which is the URL that was shared.
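If you want to poke at the files from Python, here is a minimal sketch of the "clean it up a tiny bit" step. It assumes the export has one JSON document per line (the usual shape of a MongoDB export) and that the "d" field holds either milliseconds since the epoch or an ISO string under "$date". The two sample records are made up for illustration and are not taken from the real files.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def load_mentions(lines):
    """Parse an export with one JSON document per line into a list of dicts."""
    docs = []
    for line in lines:
        # Tolerate blank lines, array brackets, and trailing commas.
        line = line.strip().rstrip(",")
        if not line or line in ("[", "]"):
            continue
        docs.append(json.loads(line))
    return docs


def parse_date(value):
    """Turn a {"$date": ...} value into a timezone-aware UTC datetime."""
    raw = value["$date"] if isinstance(value, dict) else value
    if isinstance(raw, (int, float)):
        # Assumption: numeric values are milliseconds since the Unix epoch.
        return datetime.fromtimestamp(raw / 1000, tz=timezone.utc)
    return datetime.fromisoformat(str(raw).replace("Z", "+00:00"))


# Made-up sample lines that only mimic the field names listed above.
sample = [
    '{"_id": "a1", "s_type": "tweet", "d": {"$date": 1351346400000}, "g": "u", "s": 0}',
    '{"_id": "a2", "s_type": "facebookMessage", "d": {"$date": 1351350000000}, "g": "f", "s": -1}',
]

docs = load_mentions(sample)
by_source = Counter(doc["s_type"] for doc in docs)
first_seen = min(parse_date(doc["d"]) for doc in docs)
print(by_source, first_seen.isoformat())
```

For the real files you would pass an open file handle to load_mentions instead of the sample list.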
The data runs from Saturday, October 27th at 10am EST, right before the storm, until 9:30am EST the following Sunday, November 4th. So it covers a little bit of time right before the storm and then a few days after it, about a week in total. (If I update the files I will update this post. I know that people are still without power and are still talking about Sandy, and Social Harvest is still harvesting.) With this data, you can really tell a lot about what people are saying about hurricane Sandy. If you plot what you can on a map, you'll see what you'd expect: a lot of mentions from the Northeast United States (pictured above).
Another interesting thing about this data is that you can easily see the most popular, or viral, media. The most viral photo about superstorm Sandy was this photo on Facebook, which, not surprisingly, did indeed catch the attention of a few people and made headlines on some news sites. So there's some very interesting stuff in here and I'm glad to see it all validated.
What else do I have? Other major US events such as the U.S. presidential elections. Depending on the response from all this, I may release that data next. I'm really excited about the data that Social Harvest is turning up and I will post updates on that service here on my blog in the future.
Also, please comment here if you do use this data for anything visually exciting...I would love to see it! As would many people.
You can find the JSON exports for hurricane Sandy here and here. Enjoy!
Data Mining: Spotting Questions
Another feature that will exist in Social Harvest is question detection. I want to be able to determine and extract questions (perhaps even group and rank the most popular questions asked - one day) to present to a user. More and more companies are using social media (and blogs) for customer service these days. However, if you don't have someone watching like a hawk with a tool like TweetDeck, then questions will fall on deaf ears. Actually, TweetDeck should build in question detection. It could even be built into the ActionScript code within their AIR client.
It's not all that hard actually. Of course, I say that without any disclaimer about accuracy. Let me rephrase: it's not all that difficult to detect a good number of questions, but it is not foolproof. Why? Well, people don't always type with perfect grammar and punctuation for starters. Especially not on the internet and most definitely not on Twitter, where one is limited on characters.
We can rely upon question marks naturally. There are very few cases where we see a question mark that doesn't indicate some sort of question (even if rhetorical). This single, simple regex rule will get you more than half the questions out there - easy. I don't have exact numbers; maybe after I gather questions I can start to report on how many questions I've found that don't use the question mark. In my most basic of tests, I've found that about 33% don't. However, that's totally inaccurate, so do not assume anything.
Then we get into the more complex methods. Could you use a Naive Bayes classifier? Eh, yea...You'd be hitting on word frequencies for words such as who, what, when, where, why, and so on. The 5 W's are the next best way to determine if a piece of text is a question or not. However, I don't think you need a Naive Bayes classifier to come to your conclusion. A simpler tree would do ya.
I came across an interesting research paper from the ACL that tries various methods. They do use patterns and do take into consideration all the obvious things we've gone over here: the 5 W's, question marks, etc. One thing that surprised me with their testing is the accuracy and the performance of the various methods seen in table 3. Again, they are not saying that 94% of the questions out there always have question marks, but they are saying that the accuracy rate was 94%. That's interesting. What was that 6% in their sample data that was either missed or contained question marks that weren't for questions? I know it came from Yahoo! but I don't know what exactly...But I was also fascinated by their sequential and syntactic pattern matching. The pattern sets were large, 1,314 and 580, but I don't think that's something that would take up a lot of disk space or take very long to loop through. Again, table 3 impresses me with the accuracy of things. Still not quite as good as relying upon a question mark, but very close. The most important part of this is that you can get nearly the same accuracy, with very fast performance, without relying upon a question mark.
Why is this interesting? Well, back to the main problem at hand. We are now in a world with extremely poor grammar. We are limited by the length of a Tweet and as such you may not see a question mark where a question is implied. By combining both approaches I think you will end up with a pretty comprehensive question detection system.
So I'm working on it. So far I've had some good success, but I have yet to really test things out. I have no control for my tests so I can't determine accuracy. I'll get to it eventually. For now, I wanted to point out three examples that I've detected as questions.
- I just read an article about who has the best chance to beat Obama. Is anyone surprised that it is Ron Paul? #RonPaul2012
- Are you looking forward to a 20 year recession, or are you finally ready for http://t.co/xLmgNA29
- Why bother with a budget when you can always print more money?? We really lack leadership!!! http://t.co/xLmgNA29
These were all detected by my system to be questions. Can I fault the system? No. I think it did a good job...But you can probably very quickly determine which one would go unanswered by someone. #3 is pretty rhetorical, right? I suppose someone could comment back on it, but it's clearly rhetorical and figurative, there to illustrate this person's point.
Interestingly enough the system also caught #2. This is a prime example of when a question is implied but you see no question mark. I'm not even sure this person ran out of space for one...There just wasn't one. Yes, it's slightly rhetorical; the desire is that the viewer clicks the link to find out the answer. However, look at what's been keyed in on here. The "are you" parts. Certain combinations of words explicitly mean a question. If no question mark follows certain phrases like that, then it would be poor grammar or missing punctuation. There are rules in the English language (even if there are exceptions at times).
Then of course without question, no pun intended, #1 is a question. Relying on the question mark easily caught that one. However, "is anyone" would also suffice. It's unlikely to have "is anyone" as a statement. It might be an answer to a question, right? Who can read this blog? Answering, "That is anyone." ...But that's incorrect grammar. "Anyone can" ... or "That would be anyone." However, you can definitely run into problems with always relying upon these rules.
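To make the two rules discussed above concrete, here is a rough sketch in Python. It is not the Social Harvest code and not the pattern sets from the paper. The opener list is a short, hypothetical one picked for illustration, and the function names are my own. It checks for a question mark first (after stripping URLs, since a query string can contain one) and then falls back to looking at how each sentence starts.

```python
import re

# Illustrative phrases that usually open a question (a hypothetical list).
QUESTION_OPENERS = (
    "who", "what", "when", "where", "why", "how",
    "are you", "is anyone", "is there", "do you", "does anyone",
    "can you", "can anyone",
)

URL_RE = re.compile(r"https?://\S+")
SENTENCE_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")


def detect_question(text):
    """Return "mark", "opener", or None depending on which rule fired."""
    # Drop URLs first so a "?" inside a link does not count.
    cleaned = URL_RE.sub("", text)
    if "?" in cleaned:
        return "mark"
    # No question mark: see whether any sentence starts like a question.
    for sentence in SENTENCE_SPLIT_RE.split(cleaned):
        start = sentence.strip().lower()
        if any(start == o or start.startswith(o + " ") for o in QUESTION_OPENERS):
            return "opener"
    return None


examples = [
    "I just read an article about who has the best chance to beat Obama. "
    "Is anyone surprised that it is Ron Paul? #RonPaul2012",
    "Are you looking forward to a 20 year recession, or are you finally "
    "ready for http://t.co/xLmgNA29",
    "Why bother with a budget when you can always print more money?? "
    "We really lack leadership!!! http://t.co/xLmgNA29",
]
for tweet in examples:
    print(detect_question(tweet), "<-", tweet[:40])
```

A list this naive will produce false positives (a statement like "What a day." starts with an opener), which is exactly the kind of miss that has to be measured against a control set before quoting any accuracy number.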
I think accuracy levels of over 75% are acceptable. I think over a large data set the number of misses will be small and, if presented to a user in the right way, easily ignored. Don't forget that with good UI we can hide away mistakes and inaccuracies from the system or at least prepare the user to deal with them in a simple way. If you give the user the ability to, say, delete the possible question from their view, then it takes all of a click and a second to remove an item that doesn't even appear that often. Let's put it this way, if you are presented with 10 questions from a collection of 1,000 tweets and 1 of those is really not a question...Would you spend more time clicking a button to remove the one error? Or would you spend more time going through 1,000 tweets manually to find all the questions?
So that's question detection - without any code examples. Simple, fun, very powerful and helpful. I wouldn't be surprised if you saw more tools provide these features in the future as "internet noise" grows.
Data Mining: Determining Gender
I recently had a talk with a person about collecting demographics from the web. He brought up some very good points and it was a nice conversation. One of the things that I keep going back to in my head days later is how he felt that a 33% accuracy rate on determining gender was terrible. Perhaps I wasn't clear enough about the fact that it was strictly from Twitter. I think I did mention that though.
Here's the deal folks. Getting demographic information from data mining the internet is a very weak game. Don't expect to be able to add up your findings for male/female (or even by location, etc.) and get 100%. That's plain silly and impossible. This guy talked as if 70% was common. On the internet? Hardly. I actually do not know a good number to shoot for, but as I continue my research and build Social Harvest I will know what that is.
I believe 33% for Twitter to be very good. Of course Facebook is going to allow a much greater degree of accuracy. I'm assuming 100%, since Facebook actually asks for gender upon registration and displays it with the basic info for each user. On the other hand, Twitter doesn't ask for gender. Additionally, you can set your name to whatever you like.
So the problem is that many times you'll have company accounts with no name. That's just the most basic example of why you can never have 100% or even 70% accuracy. If you want to expand it beyond Twitter and look at comments on blogs...You can quickly see that your success rate is going to go down the tubes real quick.
Now, my question is this: Why do all the social media monitoring services such as Radian6, Sysomos, etc. give you results that add up to 100% for gender? They're flat out lying to you. A friend I talked to a while ago didn't realize this at first. He said, "I don't know, but somehow they just know. Maybe they pay extra for that." ... No, sorry. So I quickly went into the Sysomos demo and pointed out, on the very first page of results, a tweet from a user marked as male who was actually female. That or they had gender re-assignment and that tool really did know something we didn't!
Why do they do this? My theory is they are afraid to tell customers that they don't know. There are gray areas out there and it's my belief that we should be aware of them...Because by randomly choosing male or female, we're actually skewing the results. It's far better to say, "of the 300 we know about, this many are male..." than it is to simply lie about it. People are trying to target ads based on this data and it's horrible to knowingly be inaccurate. There's always a margin of error and that's a different story.
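Here is a small sketch of what that kind of honest reporting could look like, using the m/f/u codes from the "g" field described in the Sandy export. The function name and the numbers in the example are hypothetical. The point is only that unknowns are counted as unknowns and the male/female shares are computed over the known subset.

```python
from collections import Counter


def gender_breakdown(codes):
    """Summarize gender codes without forcing unknowns into m or f."""
    # Anything that is not a clear "m" or "f" is treated as unknown.
    counts = Counter(c if c in ("m", "f") else "u" for c in codes)
    total = sum(counts.values())
    known = counts["m"] + counts["f"]
    return {
        "total": total,
        "known": known,
        "unknown": counts["u"],
        # Share of all items where a gender could be determined at all.
        "coverage": known / total if total else 0.0,
        # Shares computed only over the items we actually know about.
        "male_share_of_known": counts["m"] / known if known else None,
        "female_share_of_known": counts["f"] / known if known else None,
    }


# Made-up example: 900 mentions, gender known for 300 of them.
codes = ["m"] * 180 + ["f"] * 120 + ["u"] * 600
report = gender_breakdown(codes)
print(
    f"Of the {report['known']} we know about, "
    f"{report['male_share_of_known']:.0%} are male "
    f"({report['unknown']} unknown)."
)
```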
So how do we determine gender? Well, I can't exactly spill all the beans...But there's actually some very, very good ways to do so. I'll give you a hint. There's some free databases available to you out there from big brother. When I say big brother, I mean the US government. That said, here's the obvious challenge. People named "Pat" and "Sam" are going to also be gray areas just as much as people on Twitter who do not give a first name. You have to put them in the uncertain category as well. It's unfortunate, but you have to.
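Without guessing at the actual Social Harvest method, a first-name lookup along these lines might be sketched as follows. Everything here is an assumption: the counts are invented, the 95% threshold is arbitrary, and a real table would have to be built from a public source such as government baby-name statistics. Note how "Pat" and "Sam" land in the unknown bucket, along with accounts that have no recognizable first name.

```python
def guess_gender(display_name, name_counts, min_share=0.95):
    """Return "m", "f", or "u" from the first word of a display name.

    name_counts maps a lowercase first name to (male_count, female_count).
    """
    parts = display_name.strip().split()
    if not parts:
        return "u"
    counts = name_counts.get(parts[0].lower())
    if not counts:
        return "u"
    male, female = counts
    total = male + female
    if total == 0:
        return "u"
    # Only commit when one gender clearly dominates for this name.
    if male / total >= min_share:
        return "m"
    if female / total >= min_share:
        return "f"
    return "u"


# Invented counts for illustration only; not real statistics.
NAME_COUNTS = {
    "james": (9900, 100),
    "mary": (50, 9950),
    "pat": (4000, 6000),
    "sam": (8000, 2000),
}

for name in ["James Smith", "Mary Jones", "Pat Lee", "Sam Park", "Acme Corp", ""]:
    print(repr(name), "->", guess_gender(name, NAME_COUNTS))
```

Raising min_share trades coverage for accuracy: fewer names get a gender, but the ones that do are less likely to be wrong.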
What about advanced methods? Well, sure there are a few. You can actually analyze the text that people post and determine if it's male or female by their writing style. You can also try to grab page colors to factor into your probability and even use something like Face API to try to analyze profile pictures. I find photo recognition a very interesting thing. However, all of these clever attempts also are subject to a hefty margin of error. The profile photos are very small for Twitter and are typically poorly lit, etc. Additionally, many people don't even use a photo of themselves. You also can't rely on someone's writing style or interests. You may have a user screaming "I love Transformers and Star Trek!" all over, but you really can't count on them being male. Additionally, do you realize the task that's now put before you? All this work just to determine the gender of a single user who posted a single status update on Twitter. Think about doing that thousands or even millions of times over. You want those results sometime this decade, right?
Even if money were no object and you had several computers do this processing to offset the time it took...Even if you also went off and searched Google for people's names to see if you could find additional supporting images... You are still subject to a margin of error. The time and effort...The sheer cost is not worth it.
So I say embrace the gray areas of data. Understand them and know why they exist. In the case of gender, it's simply the nature of the internet. No one requires you to register and expose your identity on the internet. That's the beauty of it. If you're trying to gather demographics on the net, please keep that in mind. If you can't accept and understand that, then you probably also don't understand the internet well enough to be advertising or working with it in a professional manner for a job of some sort.
I will continue my research and hope to find ways to improve things beyond 33%. I have decades of data and clever algorithms to help me do that, but for now...I'm quite happy to have the most accurate system for determining gender from Twitter...That I've ever seen at least.

