CodeBetter.Com

Ben Reichelt's Weblog


Implementing a web based aggregator



Comments

Ross said:

Bloglines has an API; how about just writing a front end for it? (I *hate* the current Bloglines UI.)

As for the scalability, I can't believe the FeedLounge guys had trouble fetching and processing 74 feeds per user. Were they trying to do this in real time? Was there no overlap between the user subscriptions? This makes absolutely no sense (to me). Surely it makes more sense to have a central list of feeds that are fetched, and then show each user's view of that central store of items. Maybe they did do it that way and I am missing something, but building a scalable back end for this sort of thing is something I'd *love* to do. Maybe I'll get around to it.
# January 23, 2006 12:58 PM

breichelt said:

Ross, that's the way that I implemented my little aggregator: if people have feeds in common, you can use that to your advantage and only grab each feed once, just like you describe. I'm sure that the FeedLounge guys have done that too.
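
A minimal sketch of the shared-feed idea, not the actual code from either aggregator. The names `build_fetch_plan` and `subscriptions`, and the example URLs, are made up for illustration. It collapses per-user subscription lists into one set of unique feeds, so each feed is fetched once no matter how many users subscribe to it.

```python
from collections import defaultdict


def build_fetch_plan(subscriptions):
    """Map each unique feed URL to the set of users subscribed to it.

    `subscriptions` is a hypothetical dict of user name -> list of feed URLs.
    """
    subscribers_by_feed = defaultdict(set)
    for user, feed_urls in subscriptions.items():
        for url in feed_urls:
            subscribers_by_feed[url].add(user)
    return dict(subscribers_by_feed)


# Two users with one feed in common (illustrative URLs only).
subscriptions = {
    "ross": ["http://example.com/a.xml", "http://example.com/b.xml"],
    "ben": ["http://example.com/b.xml", "http://example.com/c.xml"],
}
plan = build_fetch_plan(subscriptions)
print(len(plan))  # 3 unique feeds to fetch, instead of 4 subscriptions
```

Each user's view is then just the items belonging to the feeds in that user's own list.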

But there must be some number of unique feeds per user, and if there are 100 users, each with 74 feeds, then in the worst case (no overlap at all) there's a grand total of 7,400 unique feeds. So if you want to update each subscriber's feeds every hour, that amounts to 7,400 web requests each hour, and for each web request there is the cost of indexing the new or updated items in that feed. If we assume one second per feed to get the feed over the web and to index it, that's 7,400 seconds. Since there are only 3,600 seconds in an hour, you can see where you might run into problems :)
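
The back-of-the-envelope numbers above, worked through as a small sketch. The helper name `workers_needed` is made up, and the one-second-per-feed cost is the assumption from the comment, not a measurement.

```python
import math


def workers_needed(unique_feeds, seconds_per_feed, window_seconds):
    """Smallest number of parallel workers that finish one pass within the window."""
    total_seconds = unique_feeds * seconds_per_feed
    return math.ceil(total_seconds / window_seconds)


users = 100
feeds_per_user = 74
unique_feeds = users * feeds_per_user  # worst case: no two users share a feed

print(unique_feeds)                             # 7400
print(workers_needed(unique_feeds, 1.0, 3600))  # 3
```

So a single worker cannot finish an hourly pass under these assumptions, but a handful running in parallel can.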

Granted, this example assumes one machine with one processor doing the feed updates. If you throw multiple boxes at the problem, it gets better (obviously), but that's where the expense comes into play.

I'm not sure how Bloglines or NewsGator do their feed updating. They could just have a massive infrastructure, but I don't see a way around requesting those feeds on some sort of timed cycle.

(One thing I know FeedLounge did was to be smart about which feeds they updated: feeds that were rarely updated were checked less frequently than feeds that were updated more often.)
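
One way that kind of scheduling could look, as a rough sketch only. How FeedLounge actually did it isn't described here, so the heuristic (poll at half the average gap between observed updates), the function name `next_poll_interval`, and all the default values are assumptions.

```python
def next_poll_interval(update_times, default=3600, min_interval=900, max_interval=86400):
    """Suggest how many seconds to wait before checking a feed again.

    `update_times` is a sorted list of timestamps (in seconds) at which the
    feed was seen to change. Busy feeds get short intervals and quiet feeds
    get long ones, clamped to [min_interval, max_interval].
    """
    if len(update_times) < 2:
        return default  # not enough history yet, fall back to hourly
    average_gap = (update_times[-1] - update_times[0]) / (len(update_times) - 1)
    return max(min_interval, min(max_interval, average_gap / 2))


hourly_feed = [0, 3600, 7200]  # changes about once an hour
weekly_feed = [0, 604800]      # changes about once a week
print(next_poll_interval(hourly_feed))  # 1800.0
print(next_poll_interval(weekly_feed))  # 86400
```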
# January 23, 2006 1:13 PM

Ross said:

Ben,

7,400 feeds an hour is probably doable. I'd posit that a large number of requests would bail out at the If-Modified-Since stage because the feed hasn't changed, and there is no reason you would need to do them one at a time; two or even three threads sharing even a smallish connection should be enough. Assuming, of course, minuscule latency :)
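
A sketch of what that could look like with Python's standard library, under a few assumptions: the function names are invented, the worker count of three is just the number floated above, and real code would also want ETag handling and retries. A server that honours `If-Modified-Since` replies with 304 and no body, which `urllib` surfaces as an `HTTPError`.

```python
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def build_request(url, last_modified=None):
    """Build a GET request, adding If-Modified-Since when a prior value is known."""
    headers = {}
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return urllib.request.Request(url, headers=headers)


def fetch_if_modified(url, last_modified=None, timeout=10):
    """Return (body, new_last_modified), or (None, last_modified) on a 304."""
    request = build_request(url, last_modified)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read(), response.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: nothing to download or index
            return None, last_modified
        raise


def fetch_all(feeds, workers=3, fetch=fetch_if_modified):
    """Fetch every feed using a small thread pool.

    `feeds` maps url -> last seen Last-Modified value (or None).
    Returns url -> (body or None, last_modified).
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda item: fetch(*item), feeds.items())
        return dict(zip(feeds.keys(), results))
```

Calling `fetch_all({"http://example.com/feed.xml": None})` would do one unconditional fetch; passing back the stored `Last-Modified` value on the next pass lets unchanged feeds drop out cheaply.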

How you spread the load on the database would be an interesting problem. Most queries would be relatively simple (and hopefully lightweight), at least using the design I came up with last time I thought about this, but getting all of the feeds in might be the problem. I'm going to go and get some more coffee and think about it some more.
# January 23, 2006 2:28 PM

breichelt said:

Yep, you're right, you could use a couple more threads, and the If-Modified-Since header would also filter out some more work. But that's only a stopgap solution, because once you get enough users and enough unique feeds, the problem will crop up again.

The database queries were pretty trivial and the database schema as a whole was pretty simple. Since there aren't that many writes occurring, mainly to mark items as read, you could optimize it pretty well. I'll have to check, but I'm pretty sure I stored the body of the posts in the db tables. Another option would be to save the body to the file system and use the db as just an index for that.
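
A minimal sketch of the "db as index, bodies on disk" option, using SQLite purely for illustration. The schema, the table name `items`, and the helper names are all assumptions and not the original aggregator's design.

```python
import hashlib
import sqlite3
from pathlib import Path


def open_index(db_path):
    """Open (or create) the index database; it stores metadata only, not bodies."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        " guid TEXT PRIMARY KEY, feed_url TEXT, title TEXT, body_path TEXT)"
    )
    return conn


def save_item(conn, body_dir, feed_url, guid, title, body):
    """Write the post body to a file and record its path in the index."""
    file_name = hashlib.sha1(guid.encode("utf-8")).hexdigest() + ".html"
    path = Path(body_dir) / file_name
    path.write_text(body, encoding="utf-8")
    conn.execute(
        "INSERT OR REPLACE INTO items VALUES (?, ?, ?, ?)",
        (guid, feed_url, title, str(path)),
    )
    conn.commit()
    return path


def load_body(conn, guid):
    """Look up the path in the index, then read the body from disk."""
    row = conn.execute(
        "SELECT body_path FROM items WHERE guid = ?", (guid,)
    ).fetchone()
    return Path(row[0]).read_text(encoding="utf-8") if row else None
```

The trade-off is smaller, faster tables against one extra file read per item displayed.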
# January 23, 2006 3:05 PM

Ross said:

I'm playing with the idea of individual databases, almost on a per-user basis, to handle the read items. I wonder how much of a strain SQLite would put on the system, although I suspect the disk would very quickly become the bottleneck.
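
Roughly what the per-user idea could look like, as a sketch only. The one-file-per-user layout, the `read_items` table, and the function names are assumptions, and this says nothing about how well it would hold up under real disk load.

```python
import sqlite3
from pathlib import Path


def open_user_db(data_dir, user_id):
    """Open (or create) one small SQLite file holding a single user's read state."""
    path = Path(data_dir) / f"user_{int(user_id)}.sqlite"
    conn = sqlite3.connect(str(path))
    conn.execute("CREATE TABLE IF NOT EXISTS read_items (guid TEXT PRIMARY KEY)")
    return conn


def mark_read(conn, guid):
    """Record that this user has read the item; repeated calls are harmless."""
    conn.execute("INSERT OR IGNORE INTO read_items (guid) VALUES (?)", (guid,))
    conn.commit()


def unread(conn, candidate_guids):
    """Filter item ids from the shared store down to those this user hasn't read."""
    read = {row[0] for row in conn.execute("SELECT guid FROM read_items")}
    return [guid for guid in candidate_guids if guid not in read]
```

The shared store of fetched items stays central; only the per-user "read" marks live in these small files.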

I guess the problem is really: how do you scale transparently, without taking your system offline or your users noticing a degradation of performance while you scale up? It's an issue that has been discussed elsewhere on blogs, with people basically saying don't bother designing for scalability too early, which I think is just wrong.
# January 24, 2006 6:33 AM

breichelt said:

I agree with you. It would be pretty hard to scale the service using software alone; you would need more hardware, and the problem then becomes the transparency, as you've mentioned. When Bloglines moved datacenters, for instance, the service was down for a few hours (which is actually pretty damn good, I think).
# January 25, 2006 1:26 PM