Distributed Performance Tracing

I stumbled on this paper from Google describing Dapper, a distributed tracing system (in effect, a profiler) for their entire infrastructure. For any distributed system, even a simple client/server system, it is useful to measure performance across services and tiers. For example, in a distributed system service A may call B and C, B calls D and E, and C calls F. Together these calls form a call tree (more generally, an acyclic graph). Most calls return a value, though some may be asynchronous and some may be one-way calls (no return value).

The basic idea is quite simple. You pass a structure along with each call that contains a Guid, which correlates a particular path through the system. In addition, to recreate the call hierarchy, each call carries a parent ID (identifying the caller) and the callee’s own ID. The IDs work like pointers for reconstructing the call graph. The root service has no parent ID. Each service writes all this info to its own local trace log. A trace line should look something like this:

Guid, Service name, Method name, Parent ID, My ID, Timestamp, other optional data
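
To make the structure and the trace line concrete, here is a minimal Python sketch. It is not how Dapper implements it; the names TraceContext, new_root, child and format_trace_line are made up for illustration, and the field order simply follows the trace line above.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TraceContext:
    """Hypothetical structure passed along with every call."""
    trace_id: str              # the Guid shared by every call on one path
    span_id: str               # "My ID": identifies this particular call
    parent_id: Optional[str]   # the caller's ID; None for the root service

    @staticmethod
    def new_root() -> "TraceContext":
        # The root service starts a new trace and has no parent.
        return TraceContext(uuid.uuid4().hex, uuid.uuid4().hex[:16], None)

    def child(self) -> "TraceContext":
        # The callee keeps the Guid and records the caller's ID as its parent.
        return TraceContext(self.trace_id, uuid.uuid4().hex[:16], self.span_id)


def format_trace_line(ctx, service, method, timestamp=None, extra=""):
    """Build one line: Guid, Service, Method, Parent ID, My ID, Timestamp, extra."""
    ts = time.time() if timestamp is None else timestamp
    parent = ctx.parent_id or ""  # empty field for the root service
    return ", ".join(
        [ctx.trace_id, service, method, parent, ctx.span_id, f"{ts:.6f}", extra]
    )
```

Service A would call TraceContext.new_root(), log its own line, and hand ctx.child() to B and to C, so each callee's line points back at A.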

Another monitoring process should gather these trace logs from all machines and load them into a database. Now you can query by Guid, reconstruct the call graph, and explore the performance of your entire distributed system. You should be able to gather the information in near real-time, depending on the load of your system. And if you run a high-traffic site, trace just 1 out of every 100 or 1000 calls (sampling) to keep the overhead down.
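
As a rough sketch of the query side, the following Python rebuilds the call tree for one Guid from collected records and includes a simple sampling check. The record layout (dicts with trace_id, service, span_id, parent_id and timestamp keys) and the function names are assumptions for this example, and an in-memory list stands in for the database.

```python
import random
from collections import defaultdict


def build_call_tree(records, trace_id):
    """Return (roots, children) for one trace, ordered by timestamp."""
    children = defaultdict(list)  # parent span ID -> list of child records
    roots = []
    spans = [r for r in records if r["trace_id"] == trace_id]
    for r in sorted(spans, key=lambda r: r["timestamp"]):
        if r["parent_id"] is None:
            roots.append(r)  # the root service has no parent
        else:
            children[r["parent_id"]].append(r)
    return roots, children


def render(nodes, children, depth=0):
    """Flatten the tree into indented service names, one per line."""
    lines = []
    for r in nodes:
        lines.append("  " * depth + r["service"])
        lines.extend(render(children.get(r["span_id"], []), children, depth + 1))
    return lines


def should_trace(sample_rate, rng=random):
    """Sampling decision, e.g. sample_rate=0.01 traces about 1 call in 100."""
    return rng.random() < sample_rate
```

One design note: it probably makes sense to take the sampling decision once, at the root service, and pass it along with the Guid. If every service decided independently, most traces would come out incomplete.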

Ideally, a tool like this would be integrated into your runtime so you don’t need to manage the Guids and trace info by hand. It looks like dynaTrace does exactly this. Otherwise, you could integrate this into WCF by attaching a header to every SOAP message containing the Guid and parent/child IDs. I think it might be possible to use Sessions in WCF to add this information, though it may impose additional overhead. Nevertheless, this is very useful stuff and I wouldn’t want to run a production system without these profile traces.
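
The header idea is not specific to WCF or SOAP. Here is a minimal, transport-agnostic sketch in Python that uses a plain dict as the message headers. The header names X-Trace-Id and X-Parent-Id and both function names are invented for this example and do not come from any standard or from WCF.

```python
import uuid

# Hypothetical header names, chosen for this sketch only.
TRACE_HEADER = "X-Trace-Id"
PARENT_HEADER = "X-Parent-Id"


def new_id():
    """Short random ID for one call."""
    return uuid.uuid4().hex[:16]


def outgoing_headers(trace_id, my_id):
    """Headers a service attaches when it calls another service."""
    # The caller's own ID becomes the callee's parent ID.
    return {TRACE_HEADER: trace_id, PARENT_HEADER: my_id}


def accept_call(headers):
    """Read incoming headers and return (trace_id, parent_id, my_id).

    A request without a trace header is treated as the root of a new trace.
    """
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    parent_id = headers.get(PARENT_HEADER)  # None for the root service
    return trace_id, parent_id, new_id()
```

A service would call accept_call on every incoming message, write its trace line, and pass outgoing_headers(trace_id, my_id) with every call it makes downstream.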

