Reliable Messaging (in the cloud era)

At Flock today, someone mentioned to me that they have been getting requests to support “persistence” in fedmsg. I spent many years working in Financial Services which, as you might imagine, has some pretty strong requirements around “only once” and “definitely once” messages, particularly in trading applications. As a result, I instantly went to “reliable messaging” as the problem to be solved (which isn’t necessarily correct, but is definitely part of the story). However, it has been a while since I was really deeply involved in FS, so I did a little Googling to try to discover the “current state of reliable messaging.” I found some interesting, but rather dated, articles. Specifically, check out this and this. Googling for anything in the last year just gave me “ratified” standards around WS-ReliableMessaging which, I am sure, is good stuff, but I was more interested in the “why” than the “how” and, unfortunately, didn’t see much (but my searching may not be awesome 😉 ).

OK, on to the point. After reading the two articles above, I was fairly convinced that in the average “trading application” (read: any single application that uses messages to communicate), “reliable messaging,” in the sense described by the standards, the articles, and the general world, probably doesn’t require a protocol-level solution. However, and this is why I wrote this post, fedmsg, like many other “environments,” is in a somewhat different position. Basically, the application sending the message has no interest in guaranteeing that the message it sent was received by fedmsg, because the application has no “dependency” on the processing done by fedmsg (this is probably not strictly true in all cases, but it is illustrative for my point). All of the methods described in the articles above, and “reliable messaging” in general, have a preconceived notion that the client of the messaging infrastructure actually cares that the server gets the message. By extension, since fedmsg is a broker, when it acts as a client to the servers that signed up to receive messages, those servers have no “interest” in telling fedmsg that they got the message, because the business logic lives within their own applications.
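To make the contrast concrete, here is a minimal sketch of what application-level reliability can look like when both sides do care: the sender resends until it sees an acknowledgement, and the receiver ignores message IDs it has already processed. Everything here (`Receiver`, `LossyLink`, `send_reliably`, the `drops` parameter) is a hypothetical name invented for illustration; none of it comes from fedmsg or from the articles above.

```python
import uuid


class Receiver:
    """Processes each message ID at most once, even if it is delivered repeatedly."""

    def __init__(self):
        self.seen = set()
        self.processed = []

    def deliver(self, msg_id, body):
        if msg_id not in self.seen:
            self.seen.add(msg_id)
            self.processed.append(body)
        return True  # acknowledge, including for duplicates


class LossyLink:
    """Hypothetical transport that loses the ack for the first `drops` sends."""

    def __init__(self, receiver, drops):
        self.receiver = receiver
        self.drops = drops

    def send(self, msg_id, body):
        ack = self.receiver.deliver(msg_id, body)
        if self.drops > 0:
            self.drops -= 1
            return False  # the message arrived, but the ack was lost
        return ack


def send_reliably(link, body, max_attempts=5):
    """Resend the same message ID until acknowledged; return the attempt count."""
    msg_id = str(uuid.uuid4())
    for attempt in range(1, max_attempts + 1):
        if link.send(msg_id, body):
            return attempt
    raise RuntimeError("no acknowledgement after %d attempts" % max_attempts)
```

The point of the sketch is that both ends have to do extra work: the sender must keep the message around and retry, and the receiver must remember what it has seen. That only happens when both ends have a stake in the outcome, which is exactly what is missing in the broker case.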

So, dilemma. Fedmsg wants to ensure that it does its job, but no other application in the environment has any way to know or “care” that fedmsg is doing its job :). Now, do we need reliable messaging? I am not sure. One nice aspect (more or less regardless of the specific implementation) is that it forces the applications on both sides of the broker to “care,” because they now have to do extra work to send and receive messages at all. However, the tradeoff is that it is “harder” for applications to use the broker, which may drive down participation, thereby decreasing the set of interesting things that can happen in the “environment” by essentially removing applications from it. Unfortunately, I am not sure I know the answer. However, I can point to a few things that may have similar problems and may offer some insight. Specifically, SMTP is reliable (as in guaranteed) even though no party really has any interest in ensuring that reliability. TCP/IP is also semi-reliable (I don’t recall if it is actually guaranteed), in that it normally “just works,” with lots of interesting mechanisms to ensure that it does.

Now, let’s also deal with another potential meaning of the term “persistence”: fedmsg also wants to be able to provide audit and metric information about the transactions it is brokering. Some of that audit/metric information is about performance (quality, including, but not limited to, speed), but it can also generate other useful information about the environment itself, as opposed to the activities of the end points. For example, part of the genesis of this conversation was a discussion about how fedmsg messages trigger badging in the openbadges system recently implemented by Fedora. Now, perhaps obviously, the badging system should really register for the messages it cares about (which it does). However, applications have bugs, and something like badging has an inherent need for auditability. Even so, I still think that fedmsg shouldn’t actually implement this kind of persistence. I think that fedmsg should treat the gathering of metrics and audit data as just another application that registers for events. The “audit and metrics consumer” should then be responsible for persisting the data and for the toolchain that feeds consumers of that data. Does this require reliable messaging? Well, arguably, I think this makes fedmsg fall into the same “application type” that the authors above were referencing. In other words, fedmsg and the “magical/mystical” audit and metrics application have a shared interest in the reliability of the messages between the two systems. As a result, I think, based on the arguments above, they don’t need reliable messaging at the protocol level.
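As a rough illustration of the “audit and metrics consumer as just another subscriber” idea, here is a minimal sketch. The `Broker` class is a hypothetical in-process stand-in and is not the fedmsg API; `AuditConsumer`, the `audit` table, and the topic names in the usage example are likewise assumptions made up for this example. Only the standard-library `sqlite3` and `json` modules are real.

```python
import json
import sqlite3


class Broker:
    """Hypothetical stand-in for a message broker (not the fedmsg API)."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, topic, body):
        # The broker only fans messages out; it persists nothing itself.
        for callback in self.subscribers:
            callback(topic, body)


class AuditConsumer:
    """An ordinary subscriber that owns the persistence of what it receives."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS audit ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, topic TEXT, body TEXT)"
        )

    def __call__(self, topic, body):
        # Each message is written in its own committed transaction.
        with self.db:
            self.db.execute(
                "INSERT INTO audit (topic, body) VALUES (?, ?)",
                (topic, json.dumps(body)),
            )

    def count(self, topic):
        row = self.db.execute(
            "SELECT COUNT(*) FROM audit WHERE topic = ?", (topic,)
        ).fetchone()
        return row[0]
```

A usage example, again with made-up topics:

```python
broker = Broker()
audit = AuditConsumer()
broker.subscribe(audit)
broker.publish("badge.award", {"user": "alice"})
print(audit.count("badge.award"))
```

The broker stays stateless here, and anything that wants history (say, to check whether a badge should have been awarded) asks the audit consumer rather than the broker.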

All in all, this was a very interesting subject for me because, when I was in FS, the be-all and end-all problem was how to guarantee that transactions got delivered through a multitude of systems exactly once. And, as with so many things in the new era of stateless software development, maybe we never needed to jump through all those hoops. 🙂