Shift8 Creative Graphic Design and Website Development


Saving Lots of Data With MongoDB

Posted by Tom on Thu, Jul 26 2012 10:14:00

While building Social Harvest, I ran into plenty of challenges with data...But one of the larger issues at hand was how to store all of the data harvested from multiple APIs. There's a lot of it!

One very clever trick that I saw at a MongoDB conference was to take the data feed (this was for Twitter specifically) right from the command line and do a direct import. It's pretty amazing that you can literally take the Twitter hose and plug it into your MongoDB instance. However, this really wasn't going to work for Social Harvest. For starters, we're not trying to clone all of the data. In fact, it's against many social networks' terms of service to do so. Instead, we need to store as little data as possible (for efficiency and to stay within the ToS) and then we also need to use this data to form assumptions and calculations. Social Harvest analyzes data and runs it through various algorithms. The end result is a far cry from what we start with.

So to start with, just at a baseline, I got the entire process figured out for harvesting, processing, and storing the data. The big issue at hand was the amount. It's simply not possible to hold all of this data in memory (for great lengths of time) while waiting for it all to be collected, processed, and stored.

So instead, I wrote the data out to JSON files. This also allows me to pull in data from services that don't supply a JSON feed (like Twitter does) and normalize all of the data. These files were saved on disk and, using PHP, simply appended to. This was fast enough to handle the thousands of bits of data coming in...And remember, for each harvesting process, we can have another file being written simultaneously. So what we're doing is splitting the entire process up into manageable chunks. This also left us with a sort of journaling system. All data is written to disk first before being inserted into MongoDB. That also means that if, for any reason, the harvesting process were to stop...We could continue where we left off, perhaps with some minor cleanup of the JSON file, perhaps with none at all.
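Roughly, the append step looks something like this little sketch (the file path, field names, and helper function are placeholders, not the actual Social Harvest code). One JSON document per line keeps the files easy to append to and easy to import later:

// Append one harvested item to a JSON-lines file on disk.
function appendHarvestedItem($filePath, array $item) {
    $line = json_encode($item) . "\n";

    // FILE_APPEND avoids re-reading/rewriting the file; LOCK_EX guards
    // against two harvester processes writing to the same file at once.
    return file_put_contents($filePath, $line, FILE_APPEND | LOCK_EX) !== false;
}

// Example usage during a harvest run:
appendHarvestedItem('/tmp/harvest_twitter.json', array(
    'network' => 'twitter',
    'message' => 'Some status update...',
    'user_id' => '12345'
));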

Harvesting is a multi-stage process. So far we've only taken the data from the stream or API and stored it down to a JSON file on disk. This is indeed less efficient than cleverly streaming it directly into MongoDB. However, we're not done with it and this is part of why it's being stored to disk.

Next. The data has references to users and those users need to be analyzed as well. We want to find out where they are from, how many followers they have, etc. However, we also don't want to do this for every single status update, because often we will see multiple updates from the same user. Duplication is bad and extra unnecessary processing is not nice for our servers...But more importantly, we want to be respectful of Twitter, Facebook, YouTube, etc. and obey rate limits, reducing the number of requests that we have to make to them. So, when we could, we cached the data that we harvested mid-harvest.

So for example, let's say there are 1,000 status updates and 250 of them are from one very chatty user. Rather than retrieve that user's information via an API 250 times, we only need to retrieve it once. In our case, this was stored in memcached. This data is gone after a day or less; often, a few hours. This keeps us in line with the various services' terms of service. We aren't trying to take any personal information or hang onto anything. However, we do need to hang on to it for just a tiny bit of time in order to make fewer requests to the service. This, of course, also helps us with speed.
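In pseudo-PHP, it boils down to a check-then-fetch pattern like the one below. The cache key prefix, the expiry, and the lookupUserFromApi() helper are all made up for the sake of the example:

// Cache a user lookup for a few hours so 250 mentions of the same
// user only cost us one API request.
function getUserProfile(Memcached $cache, $network, $userId) {
    $key = 'harvest_user_' . $network . '_' . $userId;

    $user = $cache->get($key);
    if ($user !== false) {
        return $user; // cache hit, no API request needed
    }

    // Cache miss: hit the service's API (hypothetical helper).
    $user = lookupUserFromApi($network, $userId);

    // Keep it briefly (a few hours at most) to stay within the ToS.
    $cache->set($key, $user, 60 * 60 * 3);

    return $user;
}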

After we've harvested all of the status updates and then all of the user data for each update, we have a few JSON files on disk. These can now be directly imported to MongoDB.

Running a mongoimport via the command line is extremely fast. It's significantly faster than having the application do insert after insert. We remove the bottleneck of whatever programming language we're using (and its driver) and gain the raw power of MongoDB from the command line, much like that clever streaming import example I mentioned above.

Of course once this import is complete (it takes literally seconds), we remove the JSON files. 
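Put together, the import-and-cleanup step can be as simple as shelling out to mongoimport once a file is finished. This is just a sketch; the database, collection, and file path are placeholders:

// Import a finished JSON file straight into MongoDB, then clean it up.
// mongoimport reads one JSON document per line by default, which is
// exactly how the harvest files were written.
function importHarvestFile($file, $db, $collection) {
    $cmd = 'mongoimport --db ' . escapeshellarg($db)
         . ' --collection ' . escapeshellarg($collection)
         . ' --file ' . escapeshellarg($file);

    exec($cmd, $output, $exitCode);

    if ($exitCode === 0) {
        unlink($file); // the JSON file has served its purpose
        return true;
    }
    return false; // leave the file on disk so it can be retried later
}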

Now, the harvesting is done through various APIs, and the other reason why we can't exactly stream directly into MongoDB is that we're not using the Twitter hose. The hose is great, but overrated. There are a few issues with it. First, we still need more data from Twitter. Second, we're not Twitter exclusive! There are other services to consider. Third, we don't want a direct clone of the data. Going back and working with that data and performing updates later is a hassle and an altogether new challenge that we can simply avoid. Last, it doesn't go back in history at all. The harvest process that Social Harvest uses is more than just real-time. It's also historical. This increases the amount of data being harvested; especially when there is a good backlog of data.

Additionally, we've standardized all of the various services into a JSON format native to MongoDB and we've set ourselves up for parallel processing. We can split the files up even further if we need to and have multiple servers doing this work. It doesn't need to be just one. The caching is done in memcached, but we could have also used MongoDB, and that cache is accessible across servers as well.

So when you really have data coming in from various sources (remote, local cache, etc.) and then need to manipulate or trim that data down, you can't use the direct stream import. This is how I solved this challenge and I hope it's given you some ideas for challenges that you may face as well.

MongoDB: Dealing With Data That Gets Confused as Sub-Objects

Posted by Tom on Wed, Apr 04 2012 09:00:00

So I came across an interesting challenge the other night. I want to store in MongoDB a bunch of URLs and how often they are accessed. Call it simple metrics for a web site. The structure for my JSON response that I want is like so:

{
'http://www.site.com/whatever/page.html' : 23,
'http://www.site.com/another.html' : 4
}

The problem is that saving this into a MongoDB field isn't possible in this structure. If you don't know why, it's because MongoDB sees periods as an indicator for sub-objects. So if I go to save that under a "urls" field in my collection, I'll end up with { http://www { site { com/whatever and so on.

encodeURIComponent? Sure, but periods don't get encoded to their ASCII equivalent of %2E and don't need to be according to the RFC. Even though we could probably replace them, what if we don't want to end up with % symbols everywhere? What if we have, instead of URLs, geo coordinates? Lat/lon pairs that are not strings would not pass through encodeURIComponent. We would need to cast them as strings and send them through, etc.

Why not add slashes or other characters? Because, if you are dealing with URLs, those characters could be mistaken for valid ones. There are limited options for replacing a period.

So encodeURIComponent plus some replace magic is one possible solution (it didn't really do it for me though). I prefer to base64 encode the values. Of course base64 isn't native to JavaScript, but thanks to PHP.js we have some pretty sweet functions ready to use, including a base64_decode() equivalent to PHP's. Both the encode() and decode() functions for base64 also require UTF-8 encode/decode functions. So there are four functions in all that you'll need.
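On the PHP side, that just means encoding the key before it ever goes into the document. A rough sketch with the old driver (the "urls" field and the hit-counting logic are just for illustration):

// Track a hit for a URL, base64 encoding the key so the periods in
// the URL aren't treated as sub-object separators by MongoDB.
// base64 output never contains '.' or '$', so it's key safe.
function trackUrlHit(MongoCollection $collection, $docId, $url) {
    $key = base64_encode($url);

    $collection->update(
        array('_id' => $docId),
        array('$inc' => array('urls.' . $key => 1)),
        array('upsert' => true)
    );
}

// Reading it back out in PHP is just: $url = base64_decode($encodedKey);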

Here's what the values look like stored in MongoDB now (for a hypothetical "urls" field):

'urls' : {
'aHR0cDovL3d3dy5zaXRlLmNvbS93aGF0ZXZlci9wYWdlLmh0bWw=' : 23,
'aHR0cDovL3d3dy5zaXRlLmNvbS9hbm90aGVyLmh0bWw=' : 4
}

So how do you use this in MongoDB? Simple: you can place the code in any map/reduce or finalize MongoCode. You probably don't want to keep doing that over and over in each of your files though, and if you work from the command line, it'll be a nightmare. So you can also save stored JavaScript in MongoDB! Then you can simply call the function as if it were native.
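Saving stored JavaScript comes down to writing a document into the special system.js collection, something along these lines with the PHP driver (the function name and body here are only an example; you'd store the PHP.js base64/UTF-8 helpers the same way):

// Stored functions live in system.js as { _id: <name>, value: <function> }.
$mongo = new Mongo();
$db = $mongo->selectDB('my_database'); // hypothetical database name

$db->selectCollection('system.js')->save(array(
    '_id'   => 'addOne',
    'value' => new MongoCode('function (x) { return x + 1; }')
));

// From now on, any map/reduce, group() or finalize code run on the
// server can call addOne(...) as if it were built in.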

There are a few sites out there with further reading on stored JavaScript that are worth a look.

Now when you run an aggregation, say a group() query, you can decode the values back. You could also decode the values in any other language that has base64 decode capability, like PHP. You could keep the PHP.js functions on the front-end and let someone's browser do the work as well.

How much time does all this take? What's the overhead? Well, I haven't benchmarked it extensively. I only benchmarked my aggregation process, but I can say that it didn't really take any more time. We're talking fractions of a second. Admittedly the job only took 2 or 3 seconds anyway, but regardless of whether I was running the encode function or not, it was the same time. I imagine a much larger job would see a noticeable difference, but also keep in mind that a larger job is taking time anyway. So if you're concerned about using this function for a query that you want to happen without a page load timing out...Don't be.

I'm also interested in seeing if there are compression functions that could be used to save space. LZW compression has been implemented in JavaScript and the patent has apparently run out on it, so it's kosher to use. Keep in mind that base64 requires about 33% more space for the data. If you're trying to keep an efficient document size, it may not always be the answer. However, the values are definitely key safe, and I imagine any other kind of encoding to avoid periods also adds size.

MongoDB in the Cloud

Posted by Tom on Sun, Sep 18 2011 11:58:00

So I'm growing tired of configuring servers. I simply don't have the time...Between designing sites/apps and then actually coding them, on top of all sorts of project management, dealing with clients, getting paid, etc. There's just no time with all the projects I work on. So I'm looking more and more toward various PaaS (platform as a service) solutions. I've been looking at RedHat's solution as well as VMware's CloudFoundry and Orchestra.io. Only Orchestra currently has support for PHP (aside from RedHat's service) as well as support for MongoDB with PHP. I'm sure I'll have a comparison/review/my two cents on those services later.

However, today I'm going to talk about two hosted MongoDB solutions that I've come across. When I was at the MongoSV conference the other year I met some of the vendors there and MongoHQ really stuck in my head. There was also another company there, Mongo Machine. They go about their pricing differently. Mongo Machine costs more up front, so I haven't tried it, to be honest. I'm not sure I will either. If I'm at the point of putting in that much money, then I'm going to just host my own database on my own servers.

A side note: hosting your own MongoDB (single server or cluster) will likely yield better performance, especially in terms of your cost/performance ratio. Plus you gain control over what your server setup is like. So there are two reasons why I'd suggest one of these services. First, it's amazingly simple to set up and manage and you don't need to worry about scaling. So convenience is number one. The second reason is more of a scenario. When you don't need super performance because your traffic isn't that high, or maybe you're not doing anything intensive (lots of map/reduces, etc.), it's probably a very good idea to use one of these services. Think about your own personal website. It's likely that you can have a free database solution for your own blog or something. Pretty cool!

Back on point. I've signed up for and created databases on both MongoHQ and MongoLab. MongoLab either was not at the MongoSV 2010 conference or I didn't see them...But I like them. So I'm going to compare these two services because they are extremely similar in pricing models, interface, and everything else.

Getting Started
It took all of 5 minutes to set up both services. For MongoHQ I had to enter a credit card number; for MongoLab I did not.

Configuration
I'm going to talk about Lithium here since that's what I use for a PHP framework (and you probably should too, if I can be an advocate for a few seconds). Setting up Lithium to use MongoDB on MongoHQ and MongoLab was easy...Once you know what your config array should look like, and once you realize that you have to raise the default timeout from 100ms to something higher, like a few seconds. The port number has to be in the host key value. The login key has to be set, as does the password key. This process was identical for both services.
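For reference, the connection config ends up looking something like this; the host, port, database, and credentials are obviously placeholders for whatever your MongoHQ/MongoLab control panel hands you:

// app/config/bootstrap/connections.php
use lithium\data\Connections;

Connections::add('default', array(
    'type'     => 'MongoDb',
    'host'     => 'something.mongohq.com:10042', // the port goes in the host value
    'database' => 'my_database',
    'login'    => 'my_user',
    'password' => 'my_password',
    'timeout'  => 2000 // bump the default 100ms connect timeout way up
));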

However...Here's an interesting difference. MongoLab has several options when it comes to where you're hosting your database(s). You can choose to use Amazon EC2, Rackspace Cloud, or Joyent. This is a major point for MongoLab over MongoHQ. The reason is that one of my gripes is performance, mainly due to hostname lookups and such. If you happen to be using Rackspace for your hosting (again, I'm going to be a fan of them for a second) you can use the private IP that's within their network! This should help with the timeout setting that you just had to increase in your configuration (I haven't tried it because I'm on Slicehost and have yet to move my server). I imagine the same goes for Joyent and Amazon EC2 when choosing those locations for your database as well.

MongoHQ uses Amazon EC2 exclusively and you do not get to choose. I'm not sure which region or how it's setup, but I imagine if you also use Amazon EC2, you may get a performance bump when using MongoHQ when it comes to connecting to the database.

Pricing
The pricing models are the same. They both offer a free tier, but what you get for what you pay is quite different between the two. Not incredibly different...Except when it comes to the free tier. You can likely run an entire blog/personal web site off MongoLab for free because they give you 240MB for free, while MongoHQ gives you 16MB for free. Both then have similar plans, but not identical ones. You're talking about $5/mo differences here and there depending on which tier you fall into between the two services. Nothing to worry about.

It's only when you get into needing replication that things start to diverge. MongoLab gives you replication on their paid plans. MongoHQ has it available, but you have to pay $300/mo to get it. With MongoLab you get it even on their $10/mo plan.

Both have backups, which is cool. Both seem to be monitored, etc. What I'm not sure about is whether MongoLab puts you on a dedicated instance. MongoHQ lets you know that you are when you hit the mid range to higher end plans. The only thing MongoLab has is their dedicated plan, which has a variable cost; I imagine you'd need to get in touch with them to price that.

So I'm not sure what that all means, but it could definitely affect performance. Maybe MongoLab can shed some light on that...Or maybe, because MongoLab has it all replicated, a dedicated instance per account isn't as important, because they are scaling with MongoDB's own features.

I have to say neither wins the pricing category. They are comparable, but if I had to choose...I'd say MongoLab because their free tier is better.

Features
I have to say that I think MongoLab is going to have better features here. 

Update: Anyone who previously read this section would have seen info about how MongoLab offers replication. Which it does, but I have been informed it is not replica sets, meaning the failover is not automatic. Replica sets are offered on their higher tier plan. This is more consistent with MongoHQ. Not knowing much about the internals of both services, I can't say if one is better than the other when it comes to scaling.

Here's another important note. MongoHQ seems to be running an old version of MongoDB depending on which pricing plan you choose. You could be on 1.6.x or you could be on 1.8.x. 1.6.x is a bit old considering 2.0 is now out. MongoLab uses 1.8.x for everything. The "micro" instance on MongoHQ runs a 32-bit instance of MongoDB where the rest are 64-bit. MongoLab appears to use 64-bit for everything, but I could be wrong; they don't explicitly state that anywhere. Do you "need" 64-bit? No, not for 16MB or 240MB of storage.

Both services have a nice interface for browsing and even editing documents in your database. I love both. They have import/export features and it's everything you'd want. I don't think either service is better than the other when it comes to your database browser. MongoLab has the whole dark theme thing going while MongoHQ has a light design...If that matters to you. The important thing to note here is that MongoHQ allows you to hook up any database to their GUI. You can't do that with MongoLab. It's minor, but a cool feature.

Both services also have a REST API. This is neat for mobile apps, etc.; situations where you don't have access to a MongoDB driver...Which I think is pretty rare, but you never know. However, what MongoLab's API doesn't seem to do (and I wish it did) is allow you to deploy new databases. Being able to set up new databases via an API could make for a very nice solution for certain applications. For example, you may wish to create a service where every user who signs up pays you, and each of those users gets their own database for security reasons and also so you can track their usage and, in turn, charge them enough to recover your hosting costs. Automating that would be great. MongoHQ appears to let you do that with their API.

Both services show you your database stats as well. I think MongoLab presents them with a little more helpful info which is good if you're new to MongoDB. MongoLab allows you to profile things as well whereas MongoHQ doesn't have anything like that built into their GUI. I imagine you could code your own profiling tools within your app though.

Overages
There's "soft" limits or overages that you can run into with both services. They both seem to be fair with this, but you need to be on top of your databases and if you are moving out of one tier, you need to switch your plan to the next tier. However, it's important to note that MongoHQ only has a soft limit on their high end plan. This means your database will not accept any more writes until you upgrade. I can only imagine for (just) the free plan on MongoLab the same is true. Given that you don't need to enter a credit card, I'm not sure how they would allow you to just keep using more and more. You get charged overages with the other tiers until you can switch plans. 

Nothing to really worry about with either service, but I think MongoLab handles things a little more nicely in case you're one of those people who doesn't really pay attention to their database and how it may be growing in size.

Reliability/Performance
This is something I can't speak to. I haven't used either service long enough to know. It's entirely possible that MongoHQ performs better and is more reliable than MongoLab. That would ultimately be the deciding factor for me to use MongoHQ over MongoLab, despite the pricing and features. It's simply more important that a database stays up.

My Choice
Both are very comparable services and are great. The differences between the two are minor and unless you're familiar with MongoDB, you likely won't really know what the differences are. You may not even care.

I personally will continue to play with both services since they have free tiers, but I am leaning toward MongoLab. I think they are definitely a service to follow and use. If you have a personal site that you want to use MongoDB with, then I'd suggest trying one of these services. If you're on shared hosting, you likely need to use one of these services in order to use MongoDB at all.

I think they also currently have the edge due to four reasons -- listed in order of what's most important to me.
#1 Choice of EC2, Rackspace, or Joyent with private IPs for (hopefully) better performance
#2 Replication
#3 Their free tier gives you more storage space
#4 Version 1.8.x of MongoDB

How About Just Reduce in Lithium?

Posted by Tom on Tue, May 24 2011 22:32:00

So after realizing my big problem with schema design in my little metrics example, I've changed the design a bit. Obviously, if you store all of this visitor data on one document, you're going to exceed the 4MB document size limit. And this was not just an example for me, this was to be actual code! Yikes! I'm glad I caught it a few hours later. So now imagine a schema where your "metrics" collection has a bunch of documents all recording the same information, illustrated like so:

/* 0 */
{
  "_id": {
    "$oid": "4ddc78d27f8b9ada44000002"
  },
  "browser": "Firefox",
  "browser_ver": 4,
  "os": "Win7",
  "mobile": false,
  "lang": "en-us",
  "project_url": "example",
  "_key": "192.168.126.2@example"
}

/* 1 */
{
  "_id": {
    "$oid": "4ddc78d97f8b9ade44000004"
  },
  "browser": "Chrome",
  "browser_ver": 11,
  "os": "Win7",
  "mobile": false,
  "lang": "en-us",
  "project_url": "example",
  "_key": "192.168.126.3@example"
}

/* 2 */
{
  "_id": {
    "$oid": "4ddc78dd7f8b9adc44000001"
  },
  "browser": "Chrome",
  "browser_ver": 11,
  "os": "Win7",
  "mobile": false,
  "lang": "en-us",
  "project_url": "example",
  "_key": "192.168.126.1@example"
}

Again, this example is specific to some of my needs. I still want unique metrics on a per project basis in the system. So the _key field has a unique index on it. That simply takes care of it: whereas before I was using the $set operator, now I just save without worry. The slight difference is that this time there won't be any updating of information if the user comes back with a different browser later on. Like before, the metrics aren't skewed, because each "unique" visitor only has one entry per page/project tracked. Also like before, the downside is that this isn't truly unique...It's more of a "per household" or "per office" situation.
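The only real setup for that is the unique index itself, which is a one-liner with the raw driver (the database name here is a placeholder; the collection matches the project_metrics one used below):

// One-time setup: a unique index on _key so "ip@project_url" can only
// ever exist once, no matter how many times we save.
$mongo = new Mongo(); // or reuse the connection Lithium already has
$collection = $mongo->selectDB('my_database')->selectCollection('project_metrics');
$collection->ensureIndex(array('_key' => 1), array('unique' => true));

// After that, each pageview is just a save; a repeat visitor is simply
// rejected by the index instead of creating a duplicate document.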

I now just have to make a very simple query to get all documents where the project_url is "example" in this case. This will eventually return a huge number of documents; far too many to just pull up and loop over in PHP each time I want an aggregate. So we don't need a map/reduce now. We could use one, but it's overkill. We just want "half" of that...The reduce part.

So we use the group() command and it kinda all works out to be similar in nature to a map/reduce. Again with Lithium, we want to get the connection using $db = ProjectMetric::connection(); and then we can run $db->connection->command() calls. Also don't forget the MongoCode class. We'll need that again for the reduce and finalize parts of this.

Here's the juicy stuff.

$db = ProjectMetric::connection();

// We just need a reduce, not a map reduce in this case, it's actually a group() command
// We're also going to use a finalizer to format the data afterward.
$reduce = new \MongoCode("function(doc, prev) { ".
    // count browsers
    // first if the browser doesn't exist, set a key name for it with a value of 0
    "if(typeof(prev.browsers[doc.browser]) == 'undefined') {".
       "prev.browsers[doc.browser] = 0;".
    "}".
    // then increment
    "prev.browsers[doc.browser] += 1;".
    
    // count operating systems
    "if(typeof(prev.operating_systems[doc.os]) == 'undefined') {".
       "prev.operating_systems[doc.os] = 0;".
    "}".
    "prev.operating_systems[doc.os] += 1;".
    
    // count mobile devices... this is easier it's just one value
    "if(doc.mobile === true) {".
        "prev.mobile += 1;".
    "}".
    
    // count languages
    "if(typeof(prev.languages[doc.lang]) == 'undefined') {".
       "prev.languages[doc.lang] = 0;".
    "}".
    "prev.languages[doc.lang] += 1;".
"}");

// Make the output pretty, we don't want any spaces in key names for example...
// So we're taking those key names and setting them as values of a new "name" key.
// What if extract() was used or something? Those spaces would be an issue.
$finalize = new \MongoCode("function(prev) { ".
    // browsers
    "for (var x in prev.browsers) {".
        "prev.browsers.push({ name: x, count: prev.browsers[x] });".
        "delete prev.browsers[x];".
    "}".
    
    // operating systems
    "for (var x in prev.operating_systems) {".
        "prev.operating_systems.push({ name: x, count: prev.operating_systems[x] });".
        "delete prev.operating_systems[x];".
    "}".
    
    // languages
    "for (var x in prev.languages) {".
        "prev.languages.push({ name: x, count: prev.languages[x] });".
        "delete prev.languages[x];".
    "}".
    
    // mobile is different and is fine as is
    
    // return the data
    "return prev;".
"}");

// Run the command
$metrics = $db->connection->command(array(
    'group' => array(
        'ns' => 'project_metrics',
        'key' => array('project_url' => true), // the key to group on has to be given as a document
        'initial' => array('browsers' => array(), 'operating_systems' => array(), 'languages' => array(), 'mobile' => 0),
        '$reduce' => $reduce,
        'condition' => array('project_url' => $url),
        'finalize' => $finalize
    )
));

// Set the results, they'll be formatted how we want them to be
$results = $metrics['retval'][0];

// We can get unique visitors basically by the total number of documents pulled back
// because there is a unique index on the collection, It's on the "_key" field which
// takes the visitor ip and combines it with the "project_url" value. So there can't
// be duplicate entries so unique visitors is simply the total count.
$results['unique_visitors'] = $metrics['count'];

I commented it pretty well (honestly for my own benefit, haha), but I'll briefly explain the $db->connection->command(array('group'... part. We need to define "ns", which is the collection name, and we also need to specify the "key", which in this case is the project URL. This key is what the documents get grouped on. So all documents with a "project_url" key of a given value are going to be grouped together. If we wanted to retrieve aggregate data for multiple URLs then this would come in handy. In this example, the "condition" key is going to kinda ruin the whole grouping thing because we're only going to return documents with the value we're grouping by. So it's one group, always; not several groups of data per URL.

The magic parts are the $reduce, initial, and finalize keys. The reduce is actually another reduce function using the MongoCode class, so that's very similar to map/reduce. Same goes for the finalize function. The initial key is new. That essentially sets some default values for us to use. These persist, so as we loop we can easily keep adding onto them. In the most basic case we have this "mobile" key set to 0. In the reduce function we're just incrementing it by one every time the mobile value from the document (also named "mobile") is true. Simple. The others are also just counters.

The final results returned have array key names that have spaces in them. Ew. We don't want that in PHP. So I just made a simple finalize function that goes through and moves each value (the count) into an array of items, each with a "count" key and a "name" key. The value for the "name" key can of course have spaces in it and there won't be any problems for PHP. It is possible to have spaces in key names, but if we wanted to run extract() or something, there'd be issues.

That's about it. Another example. I'm not sure exactly how practical these examples are, but they do show you how you can make these operations in PHP when using Lithium. Once I wrap my head around things a bit more, I may try to give back a little and take a stab at making methods within the MongoDb adapter for these things.

On a final note, the group() command is designed to be executed in "real time" whereas map/reduce is more of a background task. So, back to practicality: this may still not serve my needs as the collection grows, because the query could end up taking a long time. At which point I guess I'll have another map/reduce example to show for it.

Map/Reduce in Lithium for Visitor Metrics

Posted by Tom on Tue, May 24 2011 10:01:00

Update: After I got a little further along with this very example in a real life project, I realized that while it makes for a very simplistic illustration of map/reduce (one that I personally found helpful when learning how to perform a map/reduce), it is not a very good real life example. The reason being...the document size limit in MongoDB. Doh! You couldn't store metrics like this. However, ignoring that, you can still continue reading to see how to perform a map/reduce within Lithium.

Original Blog Entry

I'll start off by saying I love MongoDB and map/reduce after putting it off for some time. I dreaded learning the map/reduce functions big time. It turns out, it's not that bad. A friend asked me to explain it in 10 words or less. So I did. It's not really all encompassing of the features, but it's a really good summary of what map/reduce can do for you.

Use JavaScript to identify/"map" data to loop it to aggregate/"reduce."

Ok, so that's technically 12 words; I cheated by adding slashes and combining two words. It's also really poor grammar. Anyway, that's the idea. In this example, I wanted to collect information about visitors to a web app. Obviously I'm not a masochist; I'd use Google Analytics if I could...Sadly, I could not. So what to do? Well, we can use MongoDB to record all this data and then use map/reduce to get some totals.

I may eventually turn this into a Lithium library (especially because there's a good browscap and language detection class that I'm not illustrating here), but for now I'm going over things at a high level and focusing on the actual map/reduce process.

That said, imagine a data set like this:

"metrics": {
    "pageviews": 63,
    "visitors": {
      "192-168-126-1": {
        "ip_address": "192.168.126.1",
        "browser": "Chrome",
        "browser_major_version": 11,
        "operating_system": "Win7",
        "mobile_device": false,
        "primary_language": "en-us"
      },
      "192-168-126-2": {
        "ip_address": "192.168.126.2",
        "browser": "Firefox",
        "browser_major_version": 4,
        "operating_system": "Win7",
        "mobile_device": false,
        "primary_language": "en-us"
      },
      "192-168-126-3": {
        "ip_address": "192.168.126.3",
        "browser": "Chrome",
        "browser_major_version": 11,
        "operating_system": "Win7",
        "mobile_device": false,
        "primary_language": "en-us"
      }
    }
}

Now, we have this "metrics" field where ever you like, but in my case on a document that contains some other information. Why not a separate "metrics" collection? We could and then we could also put in things like page URLs that were hit on the site to start getting analytic information about the pages on our site. In my case, I just wanted to get a sense for some high level information about my visitors. For now.

So the first thing here that you'll notice (and I've written about the $set operator before) is that each IP address is the key for each entry. The dots have been replaced with dashes so that it works as a key. Otherwise, I'd have a pretty deep object on my hands.

So each time a page is loaded the pageviews count goes up and the visitor's browser information is captured using $set so that if the user from the same IP address came back again with a different browser, it would update. My metrics would not be skewed. Yes, it's sad that we don't realize when/if the user actually uses two different browsers...More sad that we're likely counting entire office buildings as one user, but that's just how the cookie crumbles in this case.
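In Lithium terms, each pageview works out to one update call, roughly like the sketch below. The browser detection is hand-waved into placeholder variables ($browser, $os, and friends), and the Project model and field names follow the data set above:

// One update per pageview: bump the counter and (re)record this
// visitor, keyed by their dashed IP address.
$ipKey = str_replace('.', '-', $_SERVER['REMOTE_ADDR']);

$query = array(
    '$inc' => array('metrics.pageviews' => 1),
    '$set' => array('metrics.visitors.' . $ipKey => array(
        'ip_address'            => $_SERVER['REMOTE_ADDR'],
        'browser'               => $browser,
        'browser_major_version' => $browserVersion,
        'operating_system'      => $os,
        'mobile_device'         => $isMobile,
        'primary_language'      => $language
    ))
);

Project::update($query, array('url' => $url), array('atomic' => false));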

Ok, so we have that data and we have some controller action in our Lithium project that's going to return to us an array that we'll pass to the view template to make some pretty pie charts. Why not pie charts? I love pie charts, they give everyone a sense of satisfaction that looking at numbers is really fun! ...Or something like that.

We'll dive right in. Here's the entire action I'm using with the map/reduce code. Note that Lithium's MongoDb adapter does not have any options for map/reduce in find() or any other method. I may write something for that myself in the future if I end up doing enough of these (and I likely will). However, we can make straight up command() calls from it.

 public function metrics($url=null) {
        if(empty($url)) {
            return false;
        }
        
        $db = Project::connection();
        
        // construct map and reduce functions
        $map = new \MongoCode("function() { ".
            "emit(this.metrics.visitors, this.metrics.visitors);".
        "}");
        
        $reduce = new \MongoCode("function(k, vals) { ".
            "var visitors = vals[0];".
            "var unique_visitors = 0;".
            "var b_counts = new Array();".
            "var browsers = new Array();".
            "var os_counts = new Array();".
            "var operating_systems = new Array();".
            "var mobile_devices = 0;".
            "var ln_counts = new Array();".
            "var languages = new Array();".
            
            // loop all the emitted visitor metrics to aggregate some data
            "for (var i in visitors) {".
                // count browsers
                "if(typeof(b_counts[visitors[i].browser]) == 'undefined') {".
                    "b_counts[visitors[i].browser] = 0;".
                "}".
                "b_counts[visitors[i].browser] += 1;".
                
                // count operating systems
                "if(typeof(os_counts[visitors[i].operating_system]) == 'undefined') {".
                    "os_counts[visitors[i].operating_system] = 0;".
                "}".
                "os_counts[visitors[i].operating_system] += 1;".
                
                // count the primary languages
                "if(typeof(ln_counts[visitors[i].primary_language]) == 'undefined') {".
                    "ln_counts[visitors[i].primary_language] = 0;".
                "}".
                "ln_counts[visitors[i].primary_language] += 1;".
                
                // count the number of mobile devices
                "if(visitors[i].mobile_device == true) {".
                    "mobile_devices += 1;".
                "}".
                
                // count the number of unique visitors
                "unique_visitors += 1;".
            "}".
            
            // loop browsers counted and set for output
            "for (var x in b_counts) {".
                "browsers.push({ name: x, count: b_counts[x] });".
            "}".
            
            // loop operating systems counted and set for output
            "for (var x in os_counts) {".
                "operating_systems.push({ name: x, count: os_counts[x] });".
            "}".
            
            // loop languages counted and set for output
            "for (var x in ln_counts) {".
                "languages.push({ name: x, count: ln_counts[x] });".
            "}".
            
            // return the output
            "return { 'browsers': browsers, 'operating_systems': operating_systems, 'languages': languages, 'mobile_devices' : mobile_devices, 'unique_visitors': unique_visitors }; }");
        
        $metrics = $db->connection->command(array(
            'mapreduce' => 'projects', 
            'map' => $map,
            'reduce' => $reduce,
            'out' => array('merge' => 'mapReduceMetrics')
        ));
        
        $cursor = $db->connection->selectCollection($metrics['result'])->find()->limit(1);
        foreach ($cursor as $doc) {
            $results = $doc['value'];
        }
        
        // Get the total page views for this project
        $pageviews = Project::find('first', array('fields' => array('metrics.pageviews'), 'conditions' => array('url' => $url)));
        $results['pageviews'] = $pageviews->data('metrics.pageviews');
        
        return $results;
    }

Yeah, it's not the prettiest to look at. It's my first run through and it's literally based off an example from php.net, so that's why there are all those lines concatenated together like that. I wouldn't normally do that. Nor would I use heredoc...But something a little nicer, at least single quotes instead of double. Anyway, with that you will be returned a nice array (in $results) that will show all the counts for browsers and such. Note that I did not take into account the browser major versions in this example. Also note that I separately stored a pageview count on the document, which does not require a map/reduce to retrieve.

Now let's look at it a little deeper. There are a lot of good articles on map/reduce out there; if you spend time with them, things should become pretty clear. The MongoDB Cookbook site's examples and php.net's example are both worth a look. You'll see that you can use map/reduce for many things. Let's go over how I'm using it.

First, the map function. Pretty simple. In fact, you likely wouldn't do what I'm doing here. The idea is basically to grab keys and values from a given collection. Those keys should be unique. So in my case metrics.visitors are unique keys. They are also the values that I need. What this does is return the values to a reduce function.

The reduce function. More complex, but it's all nice friendly JavaScript. Here you're just looping over the values that are passed in and simply counting some of them. As a disclaimer, my example could probably have been written a lot better and cleaner. I only loop once, which is what I was mainly concerned about. The rest can be refactored later.

At the end of whatever you decide to do with all that data, you'll return your values. I'm returning an object here with all the counts. Here's what PHP gets back in $results:

array
  'browsers' => 
    array
      0 => 
        array
          'name' => string 'Chrome' (length=6)
          'count' => float 2
      1 => 
        array
          'name' => string 'Firefox' (length=7)
          'count' => float 1
  'operating_systems' => 
    array
      0 => 
        array
          'name' => string 'Win7' (length=4)
          'count' => float 3
  'languages' => 
    array
      0 => 
        array
          'name' => string 'en-us' (length=5)
          'count' => float 3
  'mobile_devices' => float 0
  'unique_visitors' => float 3
  'pageviews' => int 256

...And there ya have it. What I would do next is actually cache this data so that each time I called the action, it didn't have to run the map/reduce, which could get quite expensive over time with a lot of data.
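A rough idea of what that caching could look like, assuming a 'default' cache configuration is set up in Lithium (the key, expiry, and the _runMetricsMapReduce() helper are all hypothetical):

use lithium\storage\Cache;

$cacheKey = 'project_metrics_' . md5($url);
$results  = Cache::read('default', $cacheKey);

if (!$results) {
    // hypothetical refactor of the action above into a helper method
    $results = $this->_runMetricsMapReduce($url);
    Cache::write('default', $cacheKey, $results, '+1 hour');
}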

Cool note: in this example you see the $metrics = $db->connection->command(...) part? Run a var_dump() on $metrics. It will have some handy information for you. It can tell you about an error when it comes to parsing your functions (though I'm not sure how to actually debug things, sorry). It will also tell you if everything was ok and ran successfully. You may wish to check this before returning data; it's on my to do list myself. Also, it will show you how long the operation took, which is very handy. You might need/want to index some fields and cache results based on how long things are taking.

Another note: with map/reduce you're actually outputting to a collection. So you're going to pick up your results with another query to that (temporary or not so temporary) collection. This changed in MongoDB version 1.8.0; you now have to specify that 'out' key in the command() call. The MongoDB docs have more information on that.

Hopefully these snippets will be of some help to people. I didn't want to go too far in depth with explaining everything, I think there's other really good articles on that out there. My hope is that seeing an example, as it works within the Lithium framework, will be helpful.

Family Spoon and MongoDB

Posted by Tom on Fri, Mar 25 2011 07:15:00

I recently soft-launched my personal project, Family Spoon. You can go to it and use it; it works. It's a recipe sharing website for you, your family, and friends. Basically, it allows you to create recipes and share them (or keep them private). There's a bit more to it and the site will continue to grow over time. However, even with just the sharing, there were some considerable hurdles to overcome.

Database Schema
The juicy schema details? Look at the bottom for an example document structure; for readability I'm not going to paste it here. I'll briefly go over the architecture a little bit. The site was developed in PHP using the Lithium framework. The use of MongoDB was critical to the way in which the site was built, especially for speed reasons. Each "recipe" on the site actually sits in a "page" collection in MongoDB. It's saved there by a "Page" model. Nearly all pages (except static pages and other form pages) are treated this way on the site. Not active, but already built, is a "blog" section for the site and that also uses this Page model and collection. Obviously, a recipe and a blog entry contain very different information. It's due to MongoDB's schemaless design that these two "ideas," which are really the same thing (they are web pages with content), can exist in the same space. Both have titles, but one has ingredients and the other some body copy (and yes, some more fields). So right there that saved a ton of time. You need to be aware of your database schema, but you don't need to meticulously design (and overdesign) it and ensure it's properly normalized, etc. I can now, worry free, add a new model that extends the Page model to add a completely new section to the site, and I know it's going to be stored in the same place in the same manner. Done.

In the past? With something like MySQL I would have many, many tables to ensure I properly optimized and normalized things. Then when I wanted to add a new section, it would mean either adjusting the schema or, if I did a really, really good job, adding at least one more table for the new section. Now, what happens if (and I will) I need to add more than just prep time, cook time, etc. to the recipe document in the future? No problem! I can easily stick in another field and not even think twice. I can also remove fields. Again, with another database, I'd have to watch out for and worry about schema changes. It would take much longer to not only build a site, but also change it. MongoDB makes maintaining a database for a site easy.

Furthermore, and leading into the next section, is data types. Along with database schema, we would traditionally have to think about exactly what kind of data was stored in these tables. Reserving too much is a bad thing and we're sworn off things like BLOB, so you have to play this delicate planning game. Am I really going to have a user with a recipe title of more than 255 characters? Nah, no way. Really? Oh, well, they can't. They just can't. The user will deal with it. The user will live. Eh...ok...But not great. We had to draw the line in the past, but no more! (Except for maybe that 4MB document limit, for now.)

The other big thing here with schema and data types is tagging. We love to tag these days: photos, articles, people, etc. Now, recipes. Family Spoon allows you to tag your recipes on the site in order to be able to find them more easily in the future. These are custom user-entered phrases that get stored as an array in the database. Before, we would immediately create a "tags" table to ensure the same word wasn't repeated, God forbid. We would join a million times and a week later we'd have our results. Ok, with memcached back seat driving we'd get there faster. With MongoDB, we just store those tags in the natural manner we would expect. Again, done and done.

Searching & Filtering
I'll focus on search (and filtering) and then permissions next. First, search. The site features a search box at the top of each page and depending on what area you're in, it will search for different things (the search button copy changes to indicate this, too). These are regex searches against the database. It was the use of MongoDB that allowed for this, whereas FULLTEXT searches with MySQL wouldn't really work so well. Then again, with MySQL we'd also have problems searching tags and all those JOINs. No good.

So MongoDB is allowing Family Spoon to run a regex search on recipe titles and tags. It's extremely easy to adjust this to search for other things...Say, ingredients? Then perhaps filter by ingredient? So let's take a detour. Say we want to add a feature to the site that shows a user only recipes that do not contain shrimp, because they have an allergic reaction to that food. Done. It's extremely simple to build this query and have it working on the site (there's a sketch of the idea below). A query for MySQL would work with WHERE ... IN() and NOT and all that good jazz. Sure, but it's not just about the query. It's about being able to handle that tall order. If I were using MySQL for Family Spoon and wanted to show people recipes that did not contain shrimp...I would have to put all sorts of caching strategies in place and, if it became a popular enough feature, would perhaps need additional hardware to support the increased load. In fact, I wouldn't do it. I'd use a search engine like Solr. Ok, so now there's something else to set up and configure. Yes, we know how to do it, but time, time, time.
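Boiled down, it's one ordinary find() with a MongoRegex in the conditions. Here's a sketch; the field names mirror the example document at the bottom of this post, and the shrimp filter is the hypothetical one from above:

// Case-insensitive regex search across recipe titles and tags,
// plus the "no shrimp" filter, all in one query.
$regex = new \MongoRegex('/' . preg_quote($term, '/') . '/i');

$recipes = Page::find('all', array(
    'conditions' => array(
        'page_type' => 'recipe',
        'public'    => true,
        '$or' => array(
            array('title' => $regex),
            array('tags'  => $regex)
        ),
        'ingredients.ingredient' => array('$ne' => 'shrimp')
    )
));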

Back to regex searches. Again, you simply wouldn't do it with MySQL. You'd set up a search engine. I may eventually need one for Family Spoon (for things like weighting), but right now, no. There's no additional search engine. It's just queries running against MongoDB. We're also filtering, and we're going to roll right into permissions with that.

Permissions & Access
So, we have filtering in addition to search because we can show "your" recipes, "public" recipes, or "your family's" recipes. This is determined by the access that each recipe gets assigned by its creator. So in addition to searching, we're searching specific recipes. When searching public recipes, we only return recipes that are flagged as being shared with the public. When searching your own, it's obviously where your owner id is set. When searching or accessing your family's recipes...This is a little different. We now have to determine exactly which recipes you have permission to see.

Hey, remember that ACL thing in MySQL? Haha, yeah, you know, the one where you have all those tables and numbers and JOINs? Kiss that one bye bye. The access rules are extremely simple for Family Spoon and without some awkward tree table to screw up, they're also more reliable. It's a faster query too. In fact, most of the time, just loading the recipe page document gives you all the information you need to determine access. So why would we want to jump through a bunch of hoops to determine access? Be efficient...And should access be allowed, you also now have the data to display. One query. The user's data is cached in the session. So...What more can I say? Yes, there's a bit more to it and not all situations allow for that. Family Spoon also uses Facebook and you can share recipes with your Facebook friends. So that's another call to somewhere else, but for the most part, we aren't really talking about a lot of strain on the database.

Family Spoon can get a bit more specific, too. You can share recipes with specific individuals. Again, an array of ids is stored and a simple $in takes care of it. Or, returning the entire field and using PHP's in_array() function takes care of it. How many people can a recipe be shared with? How many ids could be in there? A lot fewer than the number of rows that would be in ARO and ACO tables. So we're efficient, too, when using MongoDB. Not just fast, but efficient.
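The access check against that list of ids is about as small as a query gets. A sketch, with field names borrowed from the example document at the bottom:

// Can this user see the recipe? Either ask MongoDB directly...
$recipe = Page::find('first', array(
    'conditions' => array(
        'url'       => $url,
        'owner_ids' => array('$in' => array($userId))
    )
));

// ...or, if the document is already loaded, just check the array in PHP.
$canView = in_array($userId, (array) $recipe->data('owner_ids'));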

The really "rapid" part that MongoDB helps us with is when it comes back to the code. Building forms and saving the data down to the collection is far easier than it would otherwise be with an ACL system inside a relational database. There's less being written to the database and less code to do it. Sure, your framework could give you a wonderful API for working with permissions, but the minute you need to deviate from that, you could be opening a can of worms.

Ratings & Voting
Along the same lines is something like ratings. On recipes we can have a star rating; anything that we need to tally, aggregate, or even just increment. MongoDB makes this easy. Ratings alone aren't a show stopper for any database or system, but when you then say, "Ah yes, but I only want each person, user, or IP to vote one time only," you have to put your thinking cap on. You're going to end up with some more JOINs...Also, more code. You're going to have to make the query and check to ensure that some person isn't trying to vote twice. With MongoDB, you can use the $set call and simply keep adding these ratings to a field, making the key equal to the user id or IP and the value their rating. If they vote again, they aren't adding a new rating, they are simply updating their own. Ta da. Now we've reduced the number of tables, the data, and the code required to set up a rating system. I've actually posted another blog entry about this below.

Conclusion
So, for many reasons, using MongoDB made building Family Spoon a much quicker process. Trust me, I actually did it both ways. Family Spoon has been completely re-built. While the other version never really appeared out in the wild, it did exist (and was online). Previously, Family Spoon was built using the CakePHP framework with MySQL for the database. So I can most definitely tell you the differences in the amount of code and planning time between the two versions. I rebuilt the site much faster not just because I knew what I was going to do the second time around, but also because I had less to think about when it came to the database schema. Yes, you need to be aware of your "schema" and you can't go hog wild, but you also get more forgiveness, and MongoDB works with you to solve your problems. It's very flexible. It's not something that you need to work around, it's something that you get to work with. Any time you have a situation like that as a developer, your days are going to be much happier and more productive.

Do I hate MySQL? No. Definitely not; years and years of use and a long relationship aren't easily erased. It's comfortable, it's familiar, and it's just fine. In fact, it's still preferable under certain circumstances. Just not mine. Let's make this clear: I am the only person developing Family Spoon. Just one developer, and without MongoDB and the Lithium framework, it would have taken me a lot longer just to get the site online. It also would have taken longer to provide some of the advanced features that people will be looking for (filtering, searching, etc.). Both MongoDB and Lithium not only served me in rapidly getting the project up and working (with all the core functionality that I needed), but these technologies will also serve me into the future in rapidly being able to grow the site and add new features.

Last, so it didn't interrupt your reading pleasure, here is an example document from the database. It's just an example and not complete, but I wanted to highlight the schema and how things like tagging and ingredients work. Each ingredient is broken out, and that's going to go a long way for filtering. Filtering without JOINs.

{
  "_id": "4d6564cb9bae6c1066000000",
  "created": "Wed, 23 Feb 2011 11:49:31 GMT -08:00",
  "directions": "Place the chicken in the crockpot. ...",
  "ingredients": [
    {
      "ingredient": "boneless chicken breasts",
      "quantity": "3",
      "measurement": "lbs"
    },
    {
      "ingredient": "milk",
      "quantity": "1",
      "measurement": "cup"
    },
    {
      "ingredient": "salt"
    }
  ],
  "modified": "Fri, 11 Mar 2011 09:46:31 GMT -08:00",
  "owner_id": "xxxxx",
  "owner_ids": [
    "xxxxx",
    "123456"
  ],
  "page_type": "recipe",
  "public": true,
  "public_rating": {
    "127-0-0-1": "4",
    "255-255-255-0": "5"
  },
  "published": true,
  "serves": "",
  "share_with_friends": false,
  "tags": [
    "chicken",
    "crockpot",
    "broccoli"
  ],
  "time": {
    "prep": {
      "amount": "15",
      "unit": "minutes"
    },
    "cook": {
      "amount": "4-6",
      "unit": "hours"
    }
  },
  "title": "Crockpot Broccoli Chicken",
  "url": "crockpot-broccoli-chicken"
}

MongoDB Local Admin GUI

Posted by Tom on Wed, Feb 23 2011 09:39:00

I'm very excited to see a few more admin GUIs for working with MongoDB out there. I'm on Windows, so I decided to try MongoVUE. It works great! I really haven't used it for a lengthy period of time, but my first reaction is that it's really good. For the OS X users out there, I think MongoHub might be worth taking a look at.

So how do you connect? Well, if you're working locally, it should be pretty trivial...But what if you're trying to connect to a firewalled server? Tunnel time. However, MongoVUE (I don't know about MongoHub) doesn't have built-in tunnel support. That's probably the only thing I don't like about it so far. Not to worry though, you can simply make an SSH tunnel using PuTTY (or KiTTY, which I like better). There are really good articles on how to set up a tunnel, and it takes no time at all. After you're done, connect to localhost on the port you defined for the tunnel and you should be set!

MongoVUE gives you several ways to view the data in your collection. You can use the "text" view, which shows you the JSON string you're probably used to seeing in the mongo console, but you can also choose a tree or table view, which is probably what you're hoping for from an admin GUI for any database. The only other thing I might like to see is an actual console within the tool so you can manually enter commands. While I don't see that (if it's there somewhere), I do see that it shows you the commands used while browsing and performing actions with the GUI. This is nice because you can then copy and paste those commands into an actual terminal window or something. I'm definitely looking forward to seeing feature enhancements for this tool.

Creating a Simple and Efficient Rating System with MongoDB

Posted by Tom on Fri, Feb 18 2011 13:51:00

I wrote a previous article on using MongoDB's non-locking updates and its $inc operator in order to allow for a simple counter that creates minimal impact on your database. Related to that, let's suppose we want to make a rating system. Say it's a star rating; however, it could be any kind of voting or polling system really.

So here's the trick. You're storing this star rating on a piece of content (a document) and you've already queried that so you can display the data on the page. We're in NoSQL land, so we don't use JOINs and we really don't want to query another "rating" collection of some sort. There's no point. So we keep a "ratings" array under this document. Now, when you load the document, you can simply count and divide to get your average with the language of your choice. You could also make another count() query to MongoDB, but again, we want one query.

So how do you store ratings? You want to do a non-locking update to the database and, much like incrementing, we want the impact on the database to be low. It's not that MongoDB isn't an amazing database that is more than capable of handling these updates, even if you need to make a read first, but we want to be efficient. Let's assume we're making millions of queries and this is all adding up to save us more than a few pennies in hosting costs.

Ok, so let's take a look at the $set operator. It's a simple one, nothing magical. It's just some cleverness with what we are setting. Under this "ratings" array we're going to set each user's IP address (substitute a user id if you have some sort of authentication system you'd like to rely on instead) as a key with a value.

$set will set the value or update it. So you don't have to worry about repeat voters! They simply change their vote and don't skew your results. If you don't want users to be able to vote twice, then you can simply disable the link to vote on the page by looking for the IP in the array of ratings. Which, yes, is a simple deterrent, and the user could technically lift the URL to hit in order to vote again...But again, your results aren't skewed, and if you need further protection you can come up with something else.

Let's apply this to my favorite framework here, Lithium. Here's what your query might look like:

$query = array(
  '$set'=> array('rating.' . str_replace('.', '-', $_SERVER['REMOTE_ADDR']) => $rating)
);

$conditions = array('_id' => $some_id);

$result = Page::update($query, $conditions, array('atomic' => false));

Do you see the gotcha? It's the IP address. You can't have a key name with dots in it. When passed to MongoDB it's going to translate into an object hierarchy. So you'd end up with something like: "rating" : { "192" : { "168" : { "126" : { "1" : "3" } } } } ... No good. So if we replace the dots with dashes, underscores, or hash the address, etc., then you'll end up with something like this instead: "rating" : { "192-168-126-1" : "5", "192-168-126-2" : "3" }

Now if the same IP address votes again, it simply changes the value in the "rating" array on the document. This way a user can't skew the rating by constantly clicking on a star rating widget, submitting a form, etc. Of course you can further limit the impact on your database and server by setting a cookie.

Now it's just simple math to get the average. If you're using PHP, you'll run a count() on the ratings array and then add up the values and divide. Simple!
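For completeness, here's the PHP side of that math, assuming the ratings come back shaped like the "rating" example above:

// $ratings looks like: array('192-168-126-1' => '5', '192-168-126-2' => '3')
$count   = count($ratings);
$average = $count ? array_sum($ratings) / $count : 0;

echo round($average, 1); // e.g. 4.0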

What would you have to do alternatively? Well, you might store another field with all the IPs or user ids that voted, in addition to a rating field...But now you've disassociated users from their votes. You could also store a different field for each option and then use $addToSet to add the IP to each option. Great, but then you allow a user to vote once for each option. Your results won't be completely skewed, but they won't be as accurate either. What about MySQL and relational database land? Well, you probably don't need to hear about all the ways you can use JOIN, how many rows you'll be scanning through to get your results, and what kind of indexes you'll need to make sure you build.

Hope you found this little schema example useful in a schemaless world.

Lithium Quick Tip: Incrementing

Posted by Tom on Thu, Jan 20 2011 13:13:00

As was pointed out to me today in the wonderful #li3 channel, MongoDB has an $inc operator. Awesome! One I didn't see before (there are oh so many handy ones). So how do you write this in Lithium land? Well, because we like being efficient and all, let's also look at how to do this without first reading the record (i.e. with an update() call).

Model::update(
// query
array(
  '$inc' => array(
    'field_name' => 1
  )
), 
// conditions
array(
  '_id' => $id
), 
// last but not least
array(
  'atomic' => false
)
);

There you have it. To decrement, you'd put -1. Now also note that in my case (and it's worth mentioning) the value I had to increment was actually nested within an array. To get at that field you can simply use dot syntax to jump down into it. So the query would look more like:

array(
  '$inc' => array(
    'field_name.value_within' => 1
  )
)

Putting it all together, you can, in pretty much one line, one call, very nicely increment or decrement values. Since MongoDB is so wonderful, this gives you a very easy way to keep, say, a real-time hit count or something for content within your Lithium application.
