
Challenge: Breaking Your Brain About Storage

The other day I was talking with Kyle Baley on Twitter and gave an example of how it is possible to store “highly critical” data only in memory. I mentioned a similar problem to him; here it is.

 

The basics of the problem are:

we need to support 10,000 concurrent queries; any beyond that can be rejected immediately

we are given a latent but extremely high-bandwidth network that guarantees packet ordering

we are only allowed to store 10 of the 100,000 objects in the dataset in memory per instance

 

Feel free to email me your potential solutions to this problem (or put them in the comments).

 

Some questions:

 

What did I not specify that we need for the solution to work? Is there anything that would make it easier?

What is the latency of a query in your solution?

How do we handle fail-over situations?

What is the total general purpose memory needed on our machine?

How would we scale to supporting 10,000,000 concurrent queries?

What happens if we need to support 1,000,000 objects?

What happens when we can change data not just query it?

Can we implement a unique constraint on, say, the id of our data? If so, how?

Have fun guys :)

This entry was posted in challenge, storage.

7 Responses to Challenge: Breaking Your Brain About Storage

  1. Would you do *network* storage?

    1. You send the state of the head object to yourself through the network.
    2. You receive a new object from the network and put it on the tail.

    Your numbers actually make it possible.
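A minimal sketch of the network-storage idea above (my own illustration, not the commenter's code): a local socket pair stands in for the latent, order-preserving network, and the data "lives" in flight between write and read. Record sizes, names, and the scaled-down dataset are all assumptions for the demo.

```python
import socket

OBJ_SIZE = 16      # fixed-size records keep framing trivial (assumed)
DATASET = 100      # scaled down from the challenge's 100,000

# The socket pair stands in for the network loop: anything we have
# written but not yet read back is stored "on the wire".
a, b = socket.socketpair()

def recv_exact(sock, n):
    """Read exactly n bytes from a stream socket."""
    buf = b""
    while len(buf) < n:
        buf += sock.recv(n - len(buf))
    return buf

# Seed the loop with the whole dataset; after this, no record lives in
# process memory except the one currently passing through our hands.
for i in range(DATASET):
    a.sendall((b"obj-%05d" % i).ljust(OBJ_SIZE, b"\x00"))

def query(target):
    """One full revolution of the loop, re-emitting every record."""
    found = None
    for _ in range(DATASET):
        rec = recv_exact(b, OBJ_SIZE)   # take a record off the wire...
        if rec.rstrip(b"\x00") == target:
            found = rec
        a.sendall(rec)                  # ...and put it straight back
    return found

hit = query(b"obj-00042")
print(hit.rstrip(b"\x00"))  # → b'obj-00042'
```

Query latency is one full revolution of the loop, which is why the challenge's latency question is interesting: the data is never at rest.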

  2. Adam D. says:

    Richard’s answer sounds like what Greg is looking for.

  3. I would use an async event loop… stream directly from disk to socket with large buffers to fill up the pipe. It requires hardly any memory except to track 10,000 socket states/stream positions/buffers. Use the 10 allowed entries as an MRU cache.
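The MRU cache this commenter mentions could look like the following (my own sketch, not their code): unlike an LRU cache, on a miss when full it evicts the *most* recently used entry, which avoids thrashing under the sequential scans a disk-streaming design implies.

```python
from collections import OrderedDict

class MRUCache:
    """A tiny MRU cache; capacity 10 matches the challenge's limit."""

    def __init__(self, capacity=10):
        self.capacity = capacity
        self.data = OrderedDict()       # insertion order tracks recency

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)      # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        elif len(self.data) >= self.capacity:
            self.data.popitem(last=True)  # evict the MRU entry
        self.data[key] = value
```

The design choice is the `last=True` in `popitem`: flipping it to `last=False` would turn this into a conventional LRU cache.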

  4. J. Letrouit says:

    How many machines can we have? If we can store 10 objects per machine, we can cache everything on 10,000 machines. I don’t know whether that fits on a single network, though.

    Then query resolution could happen through a map / reduce.

    As for the failover, each machine could have a slave besides it.

    However, scaling to 10,000,000 would probably not be easy, because we would need a farm of 1,000,000 machines, which does not fit on a single network.

    Changing data would require us to think seriously about how to shard so that cached objects are evenly distributed in bags of ten or fewer per machine. More machines would help here.
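A hypothetical sketch of the layout this commenter describes (the names and the modulo scheme are my own assumptions): with 100,000 objects and 10 slots per machine, a stable hash of the object id picks the one machine that caches it, so a query can be routed in a single hop rather than a full map/reduce scan.

```python
N_MACHINES = 10_000
SLOTS_PER_MACHINE = 10

def machine_for(object_id: int) -> int:
    # Stable modulo shard; a real deployment would likely prefer
    # consistent hashing so that adding machines moves few objects.
    return object_id % N_MACHINES

# Machine i ends up holding ids i, i + 10,000, i + 20,000, ... —
# exactly ten objects, within the challenge's per-instance limit.
assignments = {}
for oid in range(100_000):
    assignments.setdefault(machine_for(oid), []).append(oid)

print(max(len(v) for v in assignments.values()))  # → 10
```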

  5. Greg says:

    @marjan there is a fairly unique solution to the problem above; it may not be the “best” solution to many problems, but it will have many people thinking about data in a different way :)

    Not that there are many problems that need this solution … it’s just an interesting problem to look at…

  6. G. Bisesi says:

    I don’t know wtf you’re even talking about. :)

    That’s something that would make life easier!

  7. Marjan Venema says:

    Not a real answer, just some food for thought.

    >>What happens if we need to support 1,000,000 objects?<<
    Not a lot. Though a million objects may sound like a lot to you, it sounds rather like "just a few" to me. But I am probably jaded by the solution we develop at my employer: a fast extraction and query solution for SAP, augmented with a lot of operational logic. We load the entire downloaded model into memory – 80GB, or 250 million orders plus all the surrounding data, is not an exception. It takes some doing and more than a few hacks as we are still on Win32 (we are desperate for Delphi 64-bit), but a million objects in memory shouldn't be too hard.

    >>store 10 of the 100,000 objects of the dataset in memory per instance<<
    Why? Just make sure that you do not replicate your data buffers in your object instances. One thing we do to support the large volume of model data is to make sure that every piece of data is loaded into memory (including MMF, AWE, etc.) only once. No caching of data in the classes' instances (except for calculated values where on-the-fly calculation is prohibitively slow).

    >>What happens when we can change data not just query it?<<
    The whole ball park changes then.
    With readonly data you can take shortcuts. For example our indices are very small, because we no longer use hash tables and such. We needed another solution when we were faced with a doubling of the volume when we migrated to Unicode. That solution currently depends on the data being readonly. According to a colleague it can be tweaked to support writeable data/indices, but that’ll take some doing.
