Hurricane Sandy Data
I'm still hard at work on my social media analysis and monitoring tool, Social Harvest, but in the meantime I thought I'd share some data with the world (at the bottom of this post). I'm thinking about releasing bits of data for free or possibly selling various data sets. This would be separate from the service itself. If you're interested, please comment below.
This little sample includes over 24,000 mentions of Hurricane Sandy from Twitter and Facebook, plus a second file containing over 16,000 URLs that were shared regarding Sandy. All of this is in JSON format (an export from MongoDB). You would need to clean it up a tiny bit if you wanted to load it all as a single JSON object.
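As a rough idea of what that cleanup might look like, here is a minimal Python sketch. It assumes the export holds one JSON document per line (the usual MongoDB export layout, but check your copy of the files), and the function name and the file name in the usage comment are made up for illustration.

```python
import json


def load_mongo_export(lines):
    """Parse an iterable of text lines, assuming one JSON document per line.

    Blank lines are skipped. The result is a plain list of dicts, which can
    be written back out as a single JSON array with json.dump().
    """
    docs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        docs.append(json.loads(line))
    return docs


# Usage (hypothetical file name, adjust to the file you downloaded):
#   with open("sandy_mentions.json", encoding="utf-8") as f:
#       mentions = load_mongo_export(f)
#   with open("sandy_mentions_array.json", "w", encoding="utf-8") as out:
#       json.dump(mentions, out)
```

If a line fails to parse, json.loads raises an error that points at the offending line, which is usually enough to spot whatever else needs cleaning.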
What can you do with this data? In short: anything you want. If you are going to publish something online, or otherwise, I do ask that you credit/cite Social Harvest as the source of the data.
What is in this data? There is no message text, or personal information, in accordance with the terms of service from Twitter and Facebook. However, message ids are in there so you can go back and track down original content if desired. Here is a list of the fields and what they contain:
- s_type: The source. In this case you will only see "tweet" or "facebookMessage".
- _id: The id for the item, internal from Social Harvest (I left it in the export in case you need a unique id for each item).
- s_id: The source id, this will let you go back to the original item using Twitter or Facebook's API.
- d: The date ($date is a MongoDB thing, but the values are timestamps).
- g: The gender, m = male, f = female, and u = unknown (Twitter does not give or ask for gender, but I have some algorithms that can work it out in some cases, though not all).
- geo: Geo-location data if available.
- l: Locale information.
- h: List of hashtags for the mention (Twitter only).
- pop: Popularity info such as number of likes or retweets, etc.
- s: Sentiment, -1 is negative, 0 is neutral, and 1 is positive.
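To give a feel for how these fields can be used, here is a minimal Python sketch that tallies sentiment per source. It assumes the mentions have already been loaded as a list of dicts and that "s" holds the plain numbers -1, 0, or 1 as described above. The function name is invented for this example.

```python
from collections import Counter

# Labels for the documented sentiment values.
SENTIMENT_LABELS = {-1: "negative", 0: "neutral", 1: "positive"}


def sentiment_by_source(docs):
    """Count mentions per (s_type, sentiment label) pair.

    Records with a missing or unexpected "s" value are counted as "unknown"
    rather than dropped, so the totals still add up to len(docs).
    """
    counts = Counter()
    for doc in docs:
        label = SENTIMENT_LABELS.get(doc.get("s"), "unknown")
        source = doc.get("s_type", "unknown")
        counts[(source, label)] += 1
    return counts


# Usage with a few made-up records:
example = [
    {"s_type": "tweet", "s": 1},
    {"s_type": "tweet", "s": -1},
    {"s_type": "facebookMessage", "s": 0},
]
for (source, label), n in sorted(sentiment_by_source(example).items()):
    print(source, label, n)
```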
For the shared links report, you get some similar fields plus a new field, "u", which is the URL that was shared.
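One simple thing to try with the links file is ranking URLs by how often they appear. The Python sketch below assumes each record represents one share, so a URL shared many times shows up in many records. That is a guess about the file, so check for duplicates before trusting the counts. The function name is hypothetical.

```python
from collections import Counter


def top_shared_urls(link_docs, n=10):
    """Return the n most frequent values of the "u" field as (url, count) pairs.

    Records without a "u" value are ignored.
    """
    counts = Counter(doc["u"] for doc in link_docs if doc.get("u"))
    return counts.most_common(n)


# Usage with made-up records:
example_links = [
    {"u": "http://example.com/a"},
    {"u": "http://example.com/b"},
    {"u": "http://example.com/a"},
]
print(top_shared_urls(example_links, n=2))
```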
The data runs from Saturday, October 27th at 10am EST, right before the storm, until 9:30am EST on Sunday, November 4th. So it covers a little bit of time right before the storm and then a few days after it, about a week in total. (If I update the files I will update this post. I know that people are still without power and are still talking about Sandy, and Social Harvest is still harvesting.) With this data, you can really tell a lot about what people are saying about Hurricane Sandy. If you plot what you can on a map, you'll see what you'd expect: a lot of mentions from the Northeast United States (pictured above).
Another interesting thing about this data is that you can easily see the most popular, or viral, media. The most viral photo about superstorm Sandy was this photo on Facebook, which, not surprisingly, did indeed catch the attention of a few people and made headlines on some news sites. So there is some very interesting stuff in here, and I'm glad to see it all validated.
What else do I have? Other major US events such as the U.S. presidential elections. Depending on the response from all this, I may release that data next. I'm really excited about the data that Social Harvest is turning up and I will post with updates on that service on my blog here in the future.
Also, please comment here if you do use this data for anything visually exciting...I would love to see it! As would many people.