Quick Thoughts on Eventual Consistency

Very often, people attempting to introduce eventual consistency into a system run into problems from the business side. A large part of the reason is that they use the word “consistent” or “consistency” when talking with domain experts and business stakeholders. A quick look-up of the word “consistent” shows where the confusion comes in.

consistency (n.): logical coherence and accordance with the facts; “a rambling argument that lacked any consistency”
consistency (n.): (logic) an attribute of a logical system that is so constituted that none of the propositions deducible from the axioms contradict one another

Business users hear “consistency” and tend to think it means the data will be wrong: that the data will be incoherent and contradictory. This is not actually the case. Instead, try using the word “stale” or “old”; in discussions where the word “stale” is used, business people tend to realize that it just means someone could have changed the data, and that they may not have the latest copy of it.

If you can get this point across, the discussion about introducing eventual consistency becomes a fairly simple one.

You can quantify the “cost” of eventual consistency mathematically; it can generally be defined by how many more concurrency problems are experienced. If no concurrency problem is experienced, then the end user’s view of the data is essentially identical for most use cases. It is important to note, though, that while this is one way of thinking about cost, there are other aspects as well, including the added complexity for the development team.

Unless you are using pessimistic locking, all data is stale and optimistic concurrency failures are possible. There is some period of time it takes to build the DTOs, put them on the wire, and for the client to receive them and draw them on the screen. There is also a period of time for a change to travel from the client back up to the server. In any of these periods the data could change, causing an optimistic concurrency failure. Let’s go with some numbers.

Get data from database – 10 ms

Build DTOs – 1 ms

Get data to client – 100 ms

Show on screen – 50 ms

Send back to server – 100 ms

Server validation of request – 1 ms


So we can quickly add these together and see that any request the server processes is operating on data that is at least 262 ms stale. Of course, we have left out the largest factor: the user! The human brain has roughly a 190 ms reaction time to visual stimulus, and that is just to realize the data has been shown on the screen; presumably the user is actually changing something as well. Do you measure the amount of time users spend on various screens? Are you thinking it might be a good idea? Let’s go with a relatively quick time for the sake of discussion: a mean time of 60 seconds on a given screen. This gets added in as well, so the total is now 60.262 s.
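
For readers who like to see the arithmetic spelled out, here is a minimal sketch of that tally; the timings are the illustrative figures above, not measurements from a real system.

```python
# Illustrative staleness tally using the example figures above.
pipeline_ms = {
    "get data from database": 10,
    "build DTOs": 1,
    "get data to client": 100,
    "show on screen": 50,
    "send back to server": 100,
    "server validation of request": 1,
}

system_staleness_ms = sum(pipeline_ms.values())        # 262 ms
mean_time_on_screen_s = 60.0                           # assumed mean time a user spends on the screen
total_staleness_s = mean_time_on_screen_s + system_staleness_ms / 1000.0

print(f"system staleness: {system_staleness_ms} ms")   # 262 ms
print(f"total staleness:  {total_staleness_s} s")      # 60.262 s
```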


Let’s imagine that we also tracked the number of optimistic concurrency failures (hint: this is another value you should be tracking). We could relatively easily define an equation representing the probability of a concurrency failure, P(t), given the period of time t. Most data sets will follow a normal distribution; let’s assume that ours does (an example of where it may not would be if we had a periodic update every 62 seconds, in which case P(t) would jump toward 1 near that point rather than following a smooth curve).

If we were to add in 5 seconds of eventual consistency, assuming a normal distribution of changes, we would end up with 65.262 seconds of staleness.

So we would have increased probability = P(65.262) – P(60.262).
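
As a rough sketch of that calculation, assuming (as above) that the time until someone else changes the data follows a normal distribution; the mean and standard deviation below are invented for illustration and would come from your own tracked metrics in practice.

```python
import math

def p_change_before(t_seconds: float, mean: float, stddev: float) -> float:
    """P(someone else changes the data within t seconds), modelled as a normal CDF."""
    return 0.5 * (1.0 + math.erf((t_seconds - mean) / (stddev * math.sqrt(2.0))))

# Hypothetical distribution parameters -- fit these from your tracked
# optimistic-concurrency failures in a real system.
mean_seconds_until_change = 120.0
stddev_seconds = 45.0

p_baseline = p_change_before(60.262, mean_seconds_until_change, stddev_seconds)
p_with_ec  = p_change_before(65.262, mean_seconds_until_change, stddev_seconds)

increased_probability = p_with_ec - p_baseline
print(f"increased probability of a concurrency failure: {increased_probability:.4f}")
```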


Now for the last step: let’s estimate the cost of an optimistic concurrency failure. It’s a user having to redo something because their request failed, so we can come up with a rough estimate of that cost. At this point the cost to the business of eventual consistency can be estimated. It’s important to note that for some transactions you may say “the value is high, so we will never give a consistency error”; for orders over $1000, say, it is profitable to accept the order no matter what and handle any problem later, even manually. This is actually a very valuable insight to reach. You know how often the use case is run over a period of time, you have estimated the cost of a failure, and you know the increased probability of a failure due to n seconds of eventual consistency.

Estimated Cost = Number of Times * Increased Probability * Cost per Failure
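
A sketch of that final estimate, with made-up inputs standing in for the numbers you would track and estimate yourself:

```python
# Hypothetical inputs -- replace with your own tracked and estimated values.
times_per_month = 50_000          # how often this use case runs in the period of interest
increased_probability = 0.02      # P(65.262) - P(60.262), e.g. from the sketch above
cost_per_failure = 2.50           # estimated cost (redone work, support, etc.) per failure

estimated_cost = times_per_month * increased_probability * cost_per_failure
print(f"estimated cost of 5 s of eventual consistency: ${estimated_cost:,.2f} per month")
```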


This is a simple and effective way to help make decisions about eventual consistency. What is the cost in terms of user productivity and experience, and what will you gain technically by introducing it? How will it affect your availability and partition tolerance?

I hope also that people will see the value in tracking metrics like how long users stay on screens and the number of consistency errors reported … These metrics can help improve user experience drastically.


12 Responses to Quick Thoughts on Eventual Consistency

  1. Scott says:

    Greg,
    I read Hendry’s argument and had exactly the same thoughts as you.

    If we look at it logically, if the user’s process is to create the user and then add the photo (and other data), we can even go one step further and do them all in one service call to the server. I mean, why have more than one screen?

    I used to think this was a serious problem as well (users not seeing things immediately), but we just educated them on what to expect… e.g. “If you add this user to this role it will take 60 minutes to appear on all servers” — WHAT, 60 minutes?… haha… not one of the business users has complained – it’s all about expectations.

  2. Greg says:

    Hendry

    Now let’s get around that problem. You said a split second; if it’s a split second, why return them to the list? 9/10 problems can be gotten around this way.

    OK, it’s usability: trick the user. Show it like it’s there (you just sent the command and know it puts it in the list; you even created the id, so do it in your view model).

    The second option has a cost associated with it; the first does not. There are further options, such as making things synchronous (which is, btw, my default architecture for 99% of systems). We should also be looking at where eventual consistency is needed and how we deal with that data. Editing CRUD-style data is not a good place to put in eventual consistency; doing tasks tends to be much better.

    Cheers,

    Greg

  3. Hendry Luk says:

    Hi Greg,
    Thanks for the write-up.
    I have to echo what Jorn has said: a split second of inconsistency is a big problem when there is an apparent order to how things happen from the user’s viewpoint.
    E.g. when a user adds a new entry “Dave” into a phone-book, he would expect the immediately following screen to show Dave in the list, so he can attach a photo or IM address to it.
    Instead, thanks to split-second staleness, the next screen gives him back a list without Dave in it. The user is puzzled, scrolling up and down, unable to find his beloved Dave. Being an inexperienced user, he tries to create Dave’s contact again, saves, and eventually sees one Dave on the next screen (without knowing that he has actually created 2 Daves in the address-book).
    He picks the only Dave he can see, and attaches a photo, IM, email, etc. Hits save, and now he sees two Daves in the following list, and none of them shows the photo he has just attached. This story ends with him frantically beating the screen with his head.
    This user just had a massively confusing and chaotic experience. The data was only split-second stale, but it does not take an incredibly fast brain response to perceive this inconsistency, because the user “knows for a fact” that the query is definitely made *after* the command (albeit only by a split second), and was expecting a cause-and-effect experience.
    You can educate the user that he will not see recent changes immediately after he creates/changes a contact, and that he should hit refresh momentarily. So back there, instead of recreating Dave’s contact, he should have clicked the refresh button a few seconds after he created it the first time.
    You can probably educate internal corporate users, but it is much harder to tell your client to educate their public world-wide customers how to use their website.
    And the fact that it needs user education for such a simple task only highlights the clumsiness of its usability in the first place. I think user education might work in a utopian world, but in real practice, I don’t think we can just use architectural arguments to justify clunky usability to our customers; they just won’t buy it.
    We can’t just use user-education as the answer to our own technical issues. I know there are times that user education is important, but it should always be the last resort. Users want intuition, not education.

  4. Alois Kraus says:

    Interesting read. How does that play together with your proposed event sourcing pattern of feeding a state machine with events to improve reliability? In your model you only deal with data access, but a UI has some logic (a state machine), e.g. to update some rows in a database only if the read data is in this or that state. If state changes are only made by the user there is no problem, but one user event is almost always translated into several technical events, which can become problematic since race conditions can now happen. I mean, if the application logic depends on some handshake protocol which involves several events to reach a consistent state, we would not have one event in 262 ms but perhaps several events within 10 ms. How should one deal with that eventual consistency?
    It is valid to store data and read it back immediately to verify that it has been written correctly. That limits scalability for sure, but it can make sense in environments where we need to flag success or failure immediately, in a near real-time fashion.

    Yours,
    Alois Kraus

  5. Jørn Wildt says:

    Good points. Thanks.

  6. Stu Cam says:

    OK, perhaps synonymous was a poor choice of wording.

    What strikes me about your calculations is that, given the human latency in parsing and completing a task, the system latency for becoming consistent can grow quite high before it becomes a real problem.

    If 10 seconds is an eternity for a computer then 60 seconds is bordering on the end of time :)

  7. Greg says:

    @Stu

    CQRS is not synonymous with eventual consistency. Eventual consistency is an option in CQRS. CQRS has many benefits without eventual consistency.

    What he is saying is correct: telling users, and educating them in general, is the best road. Beyond that, keeping the SLA low helps a lot (10 seconds is an eternity for a computer but nothing for a human).

    Greg

  8. Stu Cam says:

    @Jorn

    I had seen a presentation by Udi Dahan in which he says that the way to handle user interaction in a CQRS system (synonymous with eventual consistency) is by positive reinforcement.

    Let’s say a user wants to update an order. We present the screen for them to change the order, and on submit show them a message which says “Changes accepted, your order will be updated shortly”. This simple message is enough in most cases, and users will accept the delay before the write propagates to the read side.

    This does of course require a fair amount of checking in your command handlers to ensure that you catch any problems with the ChangeOrder message up-front. Ideally you only want 100% valid messages making it to your ESB.

    As Greg points out if the “Changes accepted” feedback is not sufficient you can keep local edits in the user session and simply present from that until the change filters through to the read system.

  9. Greg says:

    Jorn,

    It can be done, but in general education is a much better way of handling things. Combine education with a short SLA between the read and write models and you can generally get to a reasonable point. When it’s absolutely necessary to trick the user with session locality, it can be done, but it requires duplication of logic.

    Greg

  10. Jørn Wildt says:

    The use cases I have seen have never had much trouble with stale data … that is, stale data from other users. It’s easy to accept that someone else might have changed your data while you are looking at it.

    My biggest issue with eventual consistency is how to handle the fact that the user who made the change may not see his own change until some time later. This makes people nervous (very understandably), and they may even try to submit the request again if they don’t see their changes get through immediately.

    Do you have any tips for handling “session local” consistency?

  11. Greg says:

    Stu, I would guess not many, though the metrics required to do it are relatively straightforward to track and have numerous other benefits (why do users spend so long on this screen anyway?).

  12. Stu Cam says:

    I like the idea of being able to calculate the cost of eventual consistency with regards to the human aspect – something which I feel has been largely ignored when discussing timings.

    I wonder how many companies are able to reach the level of sophistication in their core architecture to dynamically adjust their approach to consistency based on cost. Probably not many!
