Saving Lots of Data With MongoDB
While building Social Harvest, there were many challenges with data...But one of the larger issues at hand was how to store all of the data harvested from multiple APIs. There's a lot!
One very clever trick that I saw at a MongoDB conference was to take a data feed (this was for Twitter specifically) and pipe it from the command line straight into a direct import. It's great, and honestly amazing, that you can literally take the Twitter hose and plug it into your MongoDB instance. However, this really wasn't going to work for Social Harvest. For starters, we're not trying to clone all of the data. In fact, it's against many social networks' terms of service to do so. Instead, we need to store as little data as possible (for efficiency and to fall within ToS), and we also need to use this data to form assumptions and calculations. Social Harvest analyzes data and runs it through various algorithms. The end result is a far cry from what we start with.
So to start, at a baseline, I figured out the entire process for harvesting, processing, and storing the data. The big issue at hand was the sheer amount. It's simply not possible to hold all of this data in memory (for any great length of time) while waiting for it all to be collected, processed, and stored.
So instead, I wrote data out to JSON files. This also lets me pull in data from services that don't supply a JSON feed (like Twitter does) and normalize all of the data. These files were saved on disk and, using PHP, were simply appended to. This was fast enough to handle the thousands of bits of data that came in...And remember, for each harvesting process, another file can be written simultaneously. So what we're doing is splitting the entire process into manageable chunks. This also left us with a sort of journaling system. All data is written to disk first before being inserted into MongoDB. This also means that if for any reason the harvesting process were to stop...We could continue where we left off, perhaps with some minor cleaning of the JSON file, perhaps without.
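As a rough sketch of that journaling idea (the file path, document shape, and helper name here are my own placeholders, not Social Harvest's actual code), each harvested item gets encoded as one JSON document per line and appended to a file on disk:

```php
<?php
// Append one harvested item to a newline-delimited JSON file.
// One JSON document per line is exactly what mongoimport expects
// when run without --jsonArray.
function appendToJournal(string $filePath, array $item): void
{
    $line = json_encode($item) . "\n";
    // FILE_APPEND keeps the write fast; LOCK_EX guards against
    // interleaved writes if two processes ever share a file.
    file_put_contents($filePath, $line, FILE_APPEND | LOCK_EX);
}

// Hypothetical example: one status update from a harvest run.
appendToJournal('/tmp/harvest-twitter-001.json', [
    'service'    => 'twitter',
    'message_id' => '12345',
    'user_id'    => '67890',
    'harvested'  => date('c'),
]);
```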
Harvesting is a multi-stage process. So far we've only taken the data from the stream or API and stored it in a JSON file on disk. This is indeed less efficient than cleverly streaming it directly into MongoDB. However, we're not done with it yet, and that's part of why it's being stored to disk.
Next, the data has references to users, and those users need to be analyzed as well. We want to find out where they are from, how many followers they have, etc. However, we don't want to do this for every single status update, because we will often see multiple updates from the same user. Duplication is bad, and extra unnecessary processing is not nice for our servers...But more importantly, we want to be respectful of Twitter, Facebook, YouTube, etc. and obey rate limits, reducing the number of requests that we must make to them. So, when we could, we cached the data mid-harvest.
So for example, let's say there are 1,000 status updates and 250 of them are from one very chatty user. Rather than retrieve that user's information via an API 250 times, we only need to retrieve it once. In our case, this was stored in memcached. That data is gone after a day or less; often, a few hours. This keeps us in line with the various services' terms of service. We aren't trying to take any personal information or hang onto anything. We just need to hold it for a tiny bit of time in order to make fewer requests to the service. This, of course, also helps us with speed.
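Here's a minimal sketch of that lookup cache, assuming the PHP Memcached extension (the key scheme, the TTL, and the fetchUserFromApi() helper are placeholders of mine, not the actual Social Harvest code):

```php
<?php
// Look up a user's profile, hitting the remote API only on a cache miss.
function getUserInfo(Memcached $cache, string $service, string $userId): array
{
    $key = "user:{$service}:{$userId}";

    $user = $cache->get($key);
    if ($user !== false) {
        return $user; // cache hit: no API request needed
    }

    // Hypothetical helper that calls the service's API.
    $user = fetchUserFromApi($service, $userId);

    // Expire after a few hours so we never hang onto profile data.
    $cache->set($key, $user, 3 * 3600);

    return $user;
}

$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);
$user = getUserInfo($cache, 'twitter', '67890');
```

That 250-update chatty user now costs one API request instead of 250, and the short expiry means nothing lingers beyond the harvest window.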
After we've harvested all of the status updates and then all of the user data for each update, we have a few JSON files on disk. These can now be directly imported to MongoDB.
Running a mongoimport via the command line is extremely fast. It's significantly faster than having the application do insert after insert. We remove the bottleneck of whatever programming language we're using (and its driver) and gain the raw power of MongoDB from the command line. Much like that clever streaming import example mentioned above.
Of course once this import is complete (it takes literally seconds), we remove the JSON files.
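In practice that step is essentially a shell call like the one below (the database, collection, and file names are illustrative placeholders; mongoimport without --jsonArray expects exactly the one-document-per-line files we wrote earlier):

```php
<?php
// Bulk import a newline-delimited JSON file, then remove it.
$file = '/tmp/harvest-twitter-001.json';

$cmd = sprintf(
    'mongoimport --db %s --collection %s --file %s',
    escapeshellarg('social_harvest'),
    escapeshellarg('messages'),
    escapeshellarg($file)
);

exec($cmd, $output, $exitCode);

// Only delete the journal file once the import succeeded, so a
// failed run can still be resumed from the file on disk.
if ($exitCode === 0) {
    unlink($file);
}
```

Checking the exit code before unlinking preserves the journaling benefit: if the import fails, the file is still there to retry.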
Now, the harvesting is done through various APIs, and the other reason why we can't exactly stream directly into MongoDB is that we're not using the Twitter hose. The hose is great, but overrated. There are a few issues with it. First, we still need more data from Twitter. Second, we're not Twitter-exclusive! There are other services to consider. Third, we don't want a direct clone of the data. Going back to work with that data and performing updates later is a hassle and an altogether new challenge that we can simply avoid. Last, it doesn't go back in history at all. The harvest process that Social Harvest uses is more than just real-time. It's also historical. This increases the amount of data being harvested, especially when there is a good backlog of data.
Additionally, we've standardized data from all of the various services into a JSON format native to MongoDB, and we've set ourselves up for parallel processing. We can split the files up even further if we need to and have multiple servers doing this work. It doesn't need to be just one. The caching is done in memcached, but we could have also used MongoDB, and that cache is also accessible across servers.
So when you really have data coming in from various sources (remote APIs, local cache, etc.) and then need to manipulate or trim that data down, you can't use the direct stream import. This is how I solved this challenge, and I hope it's given you some ideas for challenges that you may face as well.