Shift8 Creative Graphic Design and Website Development

MongoDB: Dealing With Data That Gets Confused as Sub-Objects

Posted by Tom on Wed, Apr 04 2012 09:00:00

So I came across an interesting challenge the other night. I want to store a bunch of URLs in MongoDB along with how often they are accessed. Call it simple metrics for a web site. The JSON structure I want looks like this:

{
    'http://www.site.com/whatever/page.html' : 23,
    'http://www.site.com/another.html' : 4
}

The problem is that saving this into a MongoDB field isn't possible with this structure. If you don't know why, it's because MongoDB treats periods in key names as dot notation, an indicator for sub-objects. So if I go to save that under a "urls" field in my collection, I'll end up with { http://www { site { com/whatever and so on.
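To picture what that dot-notation reading does to a URL, here is a rough sketch. It is not MongoDB's own code; expandDottedKey is a made-up helper that simply splits a key on periods and nests the pieces, which is roughly the shape you would end up with.

```javascript
// Illustration only: expand a dotted key the way dot notation reads it.
// This is NOT MongoDB's code; expandDottedKey is a hypothetical helper.
function expandDottedKey(key, value) {
  var parts = key.split('.');
  var root = {};
  var node = root;
  // Every segment except the last becomes a nested sub-object.
  for (var i = 0; i < parts.length - 1; i++) {
    node[parts[i]] = {};
    node = node[parts[i]];
  }
  // The last segment holds the actual value.
  node[parts[parts.length - 1]] = value;
  return root;
}

var nested = expandDottedKey('http://www.site.com/another.html', 4);
console.log(JSON.stringify(nested));
// {"http://www":{"site":{"com/another":{"html":4}}}}
```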

encodeURIComponent? Sure, but periods don't get encoded to their percent-encoded equivalent, %2E, and they don't need to be according to the URI RFC (RFC 3986 treats the period as an unreserved character). Even though we could probably replace them ourselves, what if we don't want to end up with % symbols everywhere? What if we have, instead of URLs, geo coordinates? Lat/lon pairs are numbers, not strings. They would have to be converted to strings first, and their decimal points would still come out the other side as periods.
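Here is a quick sketch of what that looks like in practice. The helper names encodeKeyComponent and decodeKeyComponent are my own, not anything built in; they just wrap encodeURIComponent with the extra period replacement.

```javascript
var url = 'http://www.site.com/another.html';

// Periods are unreserved characters, so encodeURIComponent leaves them alone.
var encoded = encodeURIComponent(url);
console.log(encoded); // http%3A%2F%2Fwww.site.com%2Fanother.html

// Hypothetical helper: swap the remaining periods for %2E by hand.
// String() makes sure numbers such as coordinates are handled too.
function encodeKeyComponent(value) {
  return encodeURIComponent(String(value)).replace(/\./g, '%2E');
}

// decodeURIComponent already turns %2E back into a period.
function decodeKeyComponent(key) {
  return decodeURIComponent(key);
}

console.log(encodeKeyComponent(url));     // http%3A%2F%2Fwww%2Esite%2Ecom%2Fanother%2Ehtml
console.log(encodeKeyComponent(43.6532)); // 43%2E6532
```

It works, but you can see the % symbols piling up.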

Why not add slashes or other characters? Because, if you are dealing with URLs, those characters could be mistaken for valid parts of the URL. There are going to be limited options for replacing a period.

So encodeURIComponent plus some replace magic is one possible solution (it didn't really do it for me, though). I prefer to base64 encode the values. Of course, base64 isn't native to JavaScript, but thanks to PHP.js we have some pretty sweet functions ready to use. See here for a base64_decode() equivalent to PHP's. Of course, both the encode and decode functions for base64 also require a UTF-8 encode/decode function. So there are four functions in all that you'll need.

Here's what the values look like stored in MongoDB now (for a hypothetical "urls" field):

'urls' : {
    'aHR0cDovL3d3dy5zaXRlLmNvbS93aGF0ZXZlci9wYWdlLmh0bWw=' : 23,
    'aHR0cDovL3d3dy5zaXRlLmNvbS9hbm90aGVyLmh0bWw=' : 4
}
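For anyone who wants to reproduce those keys outside of MongoDB, here is a minimal sketch. It assumes Node.js, where Buffer handles the UTF-8 and base64 steps in one go (so it stands in for the PHP.js functions rather than showing them), and encodeKeys is just a name I picked.

```javascript
// Assumes Node.js: Buffer does the UTF-8 and base64 conversion together.
// encodeKeys is a hypothetical helper name, not part of any library.
function encodeKeys(counts) {
  var encoded = {};
  Object.keys(counts).forEach(function (url) {
    var key = Buffer.from(url, 'utf8').toString('base64');
    encoded[key] = counts[url];
  });
  return encoded;
}

var doc = {
  urls: encodeKeys({
    'http://www.site.com/whatever/page.html': 23,
    'http://www.site.com/another.html': 4
  })
};
console.log(doc.urls);
```

The base64 alphabet is letters, digits, +, / and = for padding, so there is no period left to trip over.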

So how do you use this in MongoDB? Simple: you can place the code in any map/reduce or finalize MongoCode. You probably don't want to keep repeating it in each of your files, though, and if you work from the command line, it'll be a nightmare. So you can also save stored JavaScript in MongoDB! Then you can simply call the function as if it were native.
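As a rough sketch of the stored JavaScript idea: the decoder below is my own compact pure-JavaScript version, not the PHP.js code, and the names decodeBase64Key and registerKeyDecoder are made up. The registration call assumes the classic mongo shell, where stored functions are saved with db.system.js.save(); newer shells may want a different write method on the system.js collection.

```javascript
// A compact base64 decoder that needs no browser or Node APIs.
// Hypothetical stand-in for the PHP.js base64_decode + utf8_decode pair.
function decodeBase64Key(input) {
  var alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';
  var buffer = 0;
  var bits = 0;
  var escaped = '';
  for (var i = 0; i < input.length; i++) {
    var value = alphabet.indexOf(input.charAt(i));
    if (value < 0) continue; // skips '=' padding and stray characters
    buffer = ((buffer << 6) | value) & 0xfff; // keep only the bits still needed
    bits += 6;
    if (bits >= 8) {
      bits -= 8;
      var byte = (buffer >> bits) & 0xff;
      // Percent-encode each byte so decodeURIComponent can read the UTF-8.
      escaped += '%' + ('0' + byte.toString(16)).slice(-2);
    }
  }
  return decodeURIComponent(escaped);
}

// Hypothetical helper: `db` is the database handle the mongo shell gives you.
// Assumes the legacy shell's db.system.js.save() for stored JavaScript.
function registerKeyDecoder(db) {
  db.system.js.save({ _id: 'decodeBase64Key', value: decodeBase64Key });
}

console.log(decodeBase64Key('aHR0cDovL3d3dy5zaXRlLmNvbS9hbm90aGVyLmh0bWw='));
// http://www.site.com/another.html
```

In the shell you would call registerKeyDecoder(db) once, and after that decodeBase64Key should be available to server-side code such as a finalize function.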

Here's a few sites with further reading on stored JavaScript:

Now when you run an aggregation, say a group() query, you can decode the values back. You could also decode the values in any other language that has base64 decode capability, like PHP. You could keep the PHP.js functions on the front-end and let someone's browser do the work as well.
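Decoding outside the database might look something like this. Again, this is only a sketch that assumes Node.js and its Buffer class; decodeKeys is a hypothetical helper, and the sample object is just the "urls" field from earlier rather than real query output.

```javascript
// Assumes Node.js. decodeKeys is a hypothetical helper name.
function decodeKeys(encodedCounts) {
  var decoded = {};
  Object.keys(encodedCounts).forEach(function (key) {
    // Turn each base64 key back into the original UTF-8 string.
    var url = Buffer.from(key, 'base64').toString('utf8');
    decoded[url] = encodedCounts[key];
  });
  return decoded;
}

var stored = {
  'aHR0cDovL3d3dy5zaXRlLmNvbS93aGF0ZXZlci9wYWdlLmh0bWw=': 23,
  'aHR0cDovL3d3dy5zaXRlLmNvbS9hbm90aGVyLmh0bWw=': 4
};
var readable = decodeKeys(stored);
console.log(readable);
```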

How much time does it take for all this? What's the overhead? Well, I haven't benchmarked it extensively. I only benchmarked my aggregation process, but I can say that it didn't really take any more time. We're talking fractions of a second. Admittedly the job only took 2 or 3 seconds anyway, but regardless of whether I was running the encode function or not, it took the same time. I imagine a much larger job would see a noticeable difference, but also keep in mind that a larger job is taking time anyway. So if you're concerned about using this function for a query that you want to happen without a page load timing out... Don't.

I'm also interested in seeing if there are compression functions that can be used to save space. LZW compression has been implemented in JavaScript and the patent has apparently run out on that, so it's kosher to use. Keep in mind that base64 requires about 33% more space for the data. If you're trying to keep an efficient document size, it may not always be the answer. However, the values are definitely key safe, and I imagine any other kind of encoding to avoid periods also adds size.
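The 33% figure comes from base64 turning every 3 bytes into 4 characters, with padding rounding up to the next group of 4. A small sketch of the arithmetic (base64Length is a made-up helper, and it assumes plain ASCII URLs so that characters and bytes are the same count):

```javascript
// Padded base64 output length for a given number of input bytes.
// base64Length is a hypothetical helper for illustration.
function base64Length(byteCount) {
  return 4 * Math.ceil(byteCount / 3);
}

// ASCII only, so string length equals byte count here.
var longUrl = 'http://www.site.com/whatever/page.html';
console.log(longUrl.length);               // 38
console.log(base64Length(longUrl.length)); // 52
```

So short keys can end up a little worse than 33% because of the padding.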

