Data Mining: Spotting Questions
Another feature that will exist in Social Harvest is question detection. I want to be able to detect and extract questions (and perhaps even group and rank the most popular questions asked, one day) to present to a user. More and more companies are using social media (and blogs) for customer service these days. However, if you don't have someone watching like a hawk with a tool like TweetDeck, those questions will fall on deaf ears. Actually, TweetDeck should build in question detection. It could even be built into the ActionScript code within their AIR client.
It's not all that hard, actually. Of course, I say that without any disclaimer about accuracy. Let me rephrase: it's not all that difficult to detect a good number of questions, but it is not foolproof. Why? Well, for starters, people don't always type with perfect grammar and punctuation. Especially not on the internet, and most definitely not on Twitter, where one is limited by character count.
We can naturally rely upon question marks. There are very few cases where a question mark doesn't indicate some sort of question (even if rhetorical). This single, simple regex rule will easily get you more than half the questions out there. I don't have exact numbers; maybe after I gather questions I can report on how many I've found that don't use a question mark. In my most basic tests, I've seen about 33%, but that figure is far from reliable, so don't assume anything from it.
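The question-mark rule really is a one-liner. Here's a minimal sketch (the function names are my own, not from Social Harvest):

```python
import re

# A single, simple regex rule: any '?' in the text flags it as a
# possible question. Crude, but it catches the majority of questions.
QUESTION_MARK = re.compile(r"\?")

def has_question_mark(text):
    """Return True if the text contains a question mark anywhere."""
    return bool(QUESTION_MARK.search(text))

print(has_question_mark("Is anyone surprised that it is Ron Paul?"))  # True
print(has_question_mark("We really lack leadership!!!"))              # False
```

That's the easy half; the interesting work is catching the questions that *don't* carry a question mark.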
Then we get into the more complex methods. Could you use a Naive Bayes classifier? Eh, yeah... You'd be keying on frequencies for words such as who, what, when, where, why, and so on. The 5 W's are the next best way to determine whether a piece of text is a question. However, I don't think you need a Naive Bayes classifier to come to that conclusion. A simpler decision tree would do ya.
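A hedged sketch of that "simple tree" idea: first check for a question mark, then check whether the sentence leads with one of the 5 W's or a common auxiliary verb. The word lists below are my own guesses, not a definitive set:

```python
# The 5 W's (plus "how") and a handful of auxiliary verbs that commonly
# open questions. Illustrative lists only -- a real system would tune these.
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how"}
AUX_VERBS = {"is", "are", "do", "does", "did", "can", "could", "will", "would", "should"}

def looks_like_question(text):
    # Rule 1: an explicit question mark settles it.
    if "?" in text:
        return True
    words = text.lower().strip("?!. ").split()
    if not words:
        return False
    # Rule 2: the sentence leads with a question word or auxiliary verb.
    return words[0] in QUESTION_WORDS or words[0] in AUX_VERBS

print(looks_like_question("Why bother with a budget"))  # True
print(looks_like_question("I just read an article"))    # False
```

This is exactly the kind of two-branch tree I mean: cheap checks, no training data, and it already covers a surprising amount of ground.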
I came across an interesting research paper in the ACL anthology that tries various methods. They use patterns and take into consideration all the obvious things we've gone over here: the 5 W's, question marks, etc. One thing that surprised me in their testing was the accuracy and performance of the various methods shown in table 3. To be clear, they're not saying that 94% of questions always have question marks; they're saying the accuracy rate of that rule was 94%. That's interesting: what was the 6% in their sample data that was either missed or contained question marks that weren't for questions? I know the data came from Yahoo!, but I don't know what exactly... I was also fascinated by their sequential and syntactic pattern matching. The pattern sets were large (1,314 and 580 patterns), but not something that would take up much disk space or take very long to loop through. Again, table 3 impresses me with its accuracy figures. Still not quite as good as relying upon a question mark, but very close. The most important part is that you can get nearly the same accuracy with much faster performance, without relying upon a question mark at all.
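To give a feel for sequential pattern matching: a pattern is an ordered list of words that must appear in order (though not necessarily adjacent) in the text. The patterns below are made up for illustration; they are not the 1,314 patterns from the paper:

```python
# Toy sequential patterns -- ordered word sequences that signal a question.
# A real system would mine these from labeled data, as the paper does.
PATTERNS = [
    ["does", "anyone", "know"],
    ["can", "someone", "tell"],
    ["what", "is", "the"],
]

def matches_pattern(words, pattern):
    """Check whether the pattern's words occur in order within the text."""
    it = iter(words)
    # Each 'word in it' consumes the iterator up to (and including) a match,
    # so subsequent pattern words must appear later in the text.
    return all(word in it for word in pattern)

def is_question_by_pattern(text):
    words = text.lower().split()
    return any(matches_pattern(words, p) for p in PATTERNS)

print(is_question_by_pattern("does anyone here know a good plugin"))  # True
print(is_question_by_pattern("i like turtles"))                       # False
```

Even with a thousand-odd patterns, this is a cheap linear scan per pattern, which is why the disk space and loop time don't worry me.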
Why is this interesting? Well, back to the main problem at hand. We now live in a world of extremely poor grammar. We are limited by the length of a tweet, and as such you may not see a question mark where a question is implied. By combining both approaches, I think you end up with a pretty comprehensive question detection system.
So I'm working on it. So far I've had some good success, but I have yet to really test things out. I have no control group for my tests, so I can't determine accuracy. I'll get to it eventually. For now, I wanted to point out three examples that my system has detected as questions.
- I just read an article about who has the best chance to beat Obama. Is anyone surprised that it is Ron Paul? #RonPaul2012
- Are you looking forward to a 20 year recession, or are you finally ready for http://t.co/xLmgNA29
- Why bother with a budget when you can always print more money?? We really lack leadership!!! http://t.co/xLmgNA29
These were all detected by my system as questions. Can I fault the system? No, I think it did a good job... But you can probably determine very quickly which one would go unanswered. #3 is pretty rhetorical, right? I suppose someone could comment back on it, but it's clearly rhetorical and figurative, used to illustrate this person's point.
Interestingly enough, the system also caught #2. This is a prime example of a question being implied without a question mark. I'm not even sure this person ran out of space for one... There just wasn't one. Yes, it's slightly rhetorical; the desire is that the viewer clicks the link to find the answer. However, look at what's been keyed in on here: the "are you" part. Certain combinations of words explicitly signal a question. If no question mark follows a phrase like that, it's simply poor grammar or missing punctuation. There are rules in the English language (even if there are exceptions at times).
Then of course, without question (no pun intended), #1 is a question. Relying on the question mark easily caught that one. However, "is anyone" would also suffice. It's unlikely for "is anyone" to appear in a statement. It might be an answer to a question, right? "Who can read this blog?" Answering, "That is anyone." ...But that's incorrect grammar; you'd say "Anyone can" or "That would be anyone." Still, you can definitely run into problems always relying upon these rules.
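Putting the pieces together, here's a hedged sketch of a combined detector: question mark first, then phrase cues like "are you" and "is anyone." The phrase list is illustrative only; a production system would use a much larger mined set:

```python
# Phrase cues that explicitly signal a question, even with no '?'.
# This short list is my own illustration, not an exhaustive set.
QUESTION_PHRASES = ("are you", "is anyone", "do you", "can you", "what do")

def detect_question(text):
    lowered = text.lower()
    # Rule 1: explicit question mark.
    if "?" in lowered:
        return True
    # Rule 2: a known question phrase appears anywhere in the text.
    return any(phrase in lowered for phrase in QUESTION_PHRASES)

# Example #2 from above: implied question, no question mark.
print(detect_question("Are you looking forward to a 20 year recession"))  # True
print(detect_question("We really lack leadership"))                       # False
```

Substring matching is deliberately loose here ("...what are you doing..." also triggers), which trades some false positives for recall; that trade-off is exactly what the UI discussion below is about.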
I think accuracy levels of over 75% are acceptable. Over a large data set the number of misses will be small and, if presented to the user in the right way, easily ignored. Don't forget that with good UI we can hide the system's mistakes and inaccuracies, or at least prepare the user to deal with them in a simple way. If you give the user the ability to, say, delete a possible question from their view, then it takes all of a click and a second to remove an item that doesn't even appear that often. Let's put it this way: if you are presented with 10 questions from a collection of 1,000 tweets and 1 of them is really not a question, would you spend more time clicking a button to remove the one error, or going through 1,000 tweets manually to find all the questions?
So that's question detection. Simple, fun, very powerful and helpful. I wouldn't be surprised to see more tools provide these features in the future as "internet noise" grows.