Data Mining: Determining Gender
I recently had a talk with a person about collecting demographics from the web. He brought up some very good points and it was a nice conversation. One of the things that I keep going back to in my head days later is how he felt that a 33% accuracy rate on determining gender was terrible. Perhaps I wasn't clear enough about the fact that it was strictly from Twitter. I think I did mention that though.
Here's the deal, folks. Getting demographic information from data mining the internet is a very weak game. Don't expect to be able to add up your findings for male/female (or even by location, etc.) and get 100%. That's plain silly and impossible. This guy talked as if 70% was common. On the internet? Hardly. I actually do not know a good number to shoot for, but as I continue my research and build Social Harvest I will know what that is.
I believe 33% for Twitter to be very good. Of course Facebook is going to allow a greater deal of accuracy. I'm assuming 100%, since Facebook actually asks for gender upon registration and displays it with the basic info for each user. On the other hand, Twitter doesn't ask for gender. Additionally, you can set your name to whatever you like.
So the problem is that many times you'll have company accounts with no name. That's just the most basic example of why you can never have 100% or even 70% accuracy. If you want to expand it beyond Twitter and look at comments on blogs...You can quickly see that your success rate is going to go down the tubes real quick.
Now, my question is this: Why do all the social media monitoring services such as Radian6, Sysomos, etc. give you results that add up to 100% for gender? They're flat out lying to you. When I talked to a friend about this a while ago, he didn't realize it at first. He said, "I don't know, but somehow they just know. Maybe they pay extra for that." ... No, sorry. So I quickly went into the Sysomos demo and pointed out, on the very first page of results, a tweet from a user who was marked as male but was actually female. That or they had gender re-assignment and that tool really did know something we didn't!
Why do they do this? My theory is they are afraid to tell customers that they don't know. There are gray areas out there and it's my belief that we should be aware of them...Because by randomly choosing male or female, we're actually skewing the results. It's far better to say, "of the 300 we know about, this many are male..." than it is to simply lie about it. People are trying to target ads based on this data and it's horrible to knowingly be inaccurate. There's always a margin of error, but that's a different story.
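To make the skew concrete, here's a rough sketch in Python. The numbers are made up for illustration (1,000 accounts, of which only 300 have a known gender), and the function names are my own, not anything from Radian6 or Sysomos. I'm also assuming the "fill in the blanks" step is a coin flip, which is a guess about how a tool might force things to 100%.

```python
import random
from collections import Counter


def report_known_only(labels):
    """Summarise only the accounts whose gender was actually determined."""
    counts = Counter(labels)
    known = counts["male"] + counts["female"]
    return {
        "total": len(labels),
        "known": known,
        "unknown": counts["unknown"],
        # Shares are computed over the known accounts only.
        "male_share_of_known": counts["male"] / known if known else None,
        "female_share_of_known": counts["female"] / known if known else None,
    }


def force_to_100_percent(labels, seed=0):
    """Assign every unknown account a gender by coin flip (the thing NOT to do)."""
    rng = random.Random(seed)
    forced = [
        label if label != "unknown" else rng.choice(["male", "female"])
        for label in labels
    ]
    counts = Counter(forced)
    return {gender: counts[gender] / len(forced) for gender in ("male", "female")}


# Hypothetical sample: 300 known accounts (210 male, 90 female), 700 unknown.
labels = ["male"] * 210 + ["female"] * 90 + ["unknown"] * 700

honest = report_known_only(labels)
forced = force_to_100_percent(labels)

print(honest)  # reports 300 known, 700 unknown, and shares of the known 300
print(forced)  # adds up to 100%, but the male share gets pulled toward 50%
```

In this made-up sample, 70% of the known accounts are male. Once the 700 unknowns get a coin flip, the reported male share drifts toward 50%, and nobody looking at the final chart can tell that most of it was guessed.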
So how do we determine gender? Well, I can't exactly spill all the beans...But there are actually some very, very good ways to do so. I'll give you a hint. There are some free databases available to you out there from big brother. When I say big brother, I mean the US government. That said, here's the obvious challenge. People named "Pat" and "Sam" are going to be gray areas too, just as much as people on Twitter who do not give a first name. You have to put them in the uncertain category as well. It's unfortunate, but you have to.
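I'm not going to show my actual approach, but the general idea of a name lookup with an uncertain bucket can be sketched like this. Everything here is hypothetical: the `NAME_COUNTS` table, the counts in it (invented for illustration, not real statistics), the `guess_gender` function, and the 95% threshold. A real table would come from some public dataset of first names with counts per sex.

```python
# Made-up counts for illustration only. A real table would be loaded from a
# public first-name dataset that records how many males and females have
# each name.
NAME_COUNTS = {
    "james": {"male": 9900, "female": 100},
    "mary": {"male": 50, "female": 9950},
    "pat": {"male": 4800, "female": 5200},
    "sam": {"male": 7000, "female": 3000},
}


def guess_gender(display_name, threshold=0.95):
    """Return 'male', 'female', or 'unknown' for a profile display name.

    The first word of the name is looked up in NAME_COUNTS. A gender is
    returned only when its share for that name meets the threshold;
    everything else stays in the uncertain bucket.
    """
    parts = display_name.strip().split()
    if not parts:
        return "unknown"  # no name given at all

    counts = NAME_COUNTS.get(parts[0].lower())
    if counts is None:
        return "unknown"  # company accounts, nicknames, unlisted names

    total = counts["male"] + counts["female"]
    if total == 0:
        return "unknown"

    male_share = counts["male"] / total
    if male_share >= threshold:
        return "male"
    if 1 - male_share >= threshold:
        return "female"
    return "unknown"  # ambiguous names such as "Pat" or "Sam"


for name in ["James Smith", "Mary Jones", "Pat Lee", "Sam", "AcmeCorp", ""]:
    print(repr(name), "->", guess_gender(name))
```

The threshold is the knob that matters. Lower it and you label more accounts but get more of them wrong; raise it and more names land in "unknown". Either way, the unknowns get counted as unknowns instead of being guessed.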
What about advanced methods? Well, sure there are a few. You can actually analyze the text that people post and try to determine if it's male or female by their writing style. You can also try to grab page colors to factor into your probability and even use something like Face API to try to analyze profile pictures. I find photo recognition a very interesting thing. However, all of these clever attempts are also subject to a hefty margin of error. The profile photos are very small for Twitter and are typically poorly lit, etc. Additionally, many people don't even use a photo of themselves. You also can't rely on someone's writing style or interests. You may have a user screaming "I love Transformers and Star Trek!" all over, but you really can't count on them being male. Additionally, do you realize the task that's now put before you? All this work just to determine the gender of a single user who posted a single status update on Twitter. Think about doing that thousands or even millions of times over. You want those results sometime this decade, right?
Even if money were no object and you had several computers do this processing to offset the time it took...Even if you also went off and searched Google for people's names to see if you could find additional supporting images...You are still subject to a margin of error. The time and effort...The sheer cost is not worth it.
So I say embrace the gray areas of data. Understand them and know why they exist. In the case of gender, it's simply the nature of the internet. No one requires you to register and expose your identity on the internet. That's the beauty of it. If you're trying to gather demographics on the net, please keep that in mind. If you can't accept and understand that, then you probably also don't understand the internet well enough to be advertising or working with it in a professional manner for a job of some sort.
I will continue my research and hope to find ways to improve things beyond 33%. I have decades of data and clever algorithms to help me do that, but for now...I'm quite happy to have the most accurate system for determining gender from Twitter...That I've ever seen at least.