Remoting, sockets, threads and robustness

The bad news

Unclosed listener sockets can hang a crashed application or service and prevent it from exiting. Listener sockets are usually started on their own threads. This makes the code a lot more straightforward, because with the listener on its own thread you can use blocking reads, but careful design is required to guarantee that an exception on that thread can be trapped and the socket closed.
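Here is a minimal sketch of that careful design, in Python purely for illustration. The names `serve` and `start_listener` and the 0.2 second accept timeout are my assumptions, not anything from a real framework. The point is that every exception on the listener thread is trapped and the socket is always closed on the way out.

```python
import socket
import threading


def serve(listener, stop, handle, errors):
    """Accept loop meant to run on its own thread."""
    listener.settimeout(0.2)  # wake up regularly so the stop flag is noticed
    try:
        while not stop.is_set():
            try:
                conn, _addr = listener.accept()
            except socket.timeout:
                continue  # nobody connected; check the stop flag again
            with conn:
                conn.settimeout(2.0)  # never wait forever on a silent client
                try:
                    handle(conn.recv(1024))
                except Exception as exc:  # one bad client must not end the loop
                    errors.append(exc)
    except Exception as exc:  # last line of defence for this thread
        errors.append(exc)
    finally:
        listener.close()  # always release the port, even after a failure


def start_listener(handle, errors):
    """Bind to an ephemeral loopback port and start the accept loop."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen()
    port = listener.getsockname()[1]
    stop = threading.Event()
    thread = threading.Thread(
        target=serve, args=(listener, stop, handle, errors), daemon=True
    )
    thread.start()
    return port, stop, thread
```

To shut down, set the `stop` event and join the thread. The short accept timeout is a trade-off: the call is still a simple blocking accept, but the thread never gets stuck in it indefinitely.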

Remoting is fragile as hell. Remoting fans won't like that as a statement. They'll go on about how clever it all is. I know it's clever, but the fact is that a connection from anything other than a remoting client will cause serialization errors. So will a connection from a remoting client with out-of-date classes. These exceptions occur in framework code. To catch them you need a global exception handler, otherwise they knock out your application. Global exception handling doesn't work properly in a Windows Service, which makes remoting basically useless outside of a rigorously controlled environment.

Remoting also has problems with firewalls, not to mention routing. None of these problems is insurmountable, but the solutions involve extensive reconfiguration of the environment. There is an underpinning expectation that the mountain will come to Mohammed, which is not reasonable. Remoting is also very sensitive to intermittent connections. The remoting framework doesn't throw an exception; it goes catatonic. I think this is because the framework waits patiently for a request that never comes, like a jilted lover sitting by the phone in some horrible country and western ballad. So I conclude that remoting is unacceptably sensitive to its environment.

UDP discovery seems like a good idea but works poorly or not at all in an internetworked world, due to firewalls and routing issues.

The good news

No-config may not be practical but low-config certainly is. Only one application setting is required: the name of the configuration host. From there you can bootstrap everything else.
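A rough sketch of that bootstrap, in Python for illustration only. Everything specific here is an assumption of mine: the `/config.json` path, the JSON format, and the function names. The only thing taken from the idea above is that the host name is the single local setting.

```python
import json
import urllib.request


def fetch_over_http(url):
    """Default transport: a plain HTTP GET with a timeout."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


def bootstrap(config_host, fetch=fetch_over_http):
    """Turn the one local setting (the host name) into the full settings dict."""
    url = f"http://{config_host}/config.json"  # hypothetical path on the config host
    return json.loads(fetch(url))
```

In use, the host name would come from wherever the application keeps its one setting, for example `settings = bootstrap(os.environ["CONFIG_HOST"])`, where `CONFIG_HOST` is again a made-up name. The `fetch` parameter exists so the transport can be swapped out or faked.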

Low-config won't protect you from the fragility of remoting, but the solution here is equally simple: don't use remoting. Use web services instead. Web services have the following advantages over remoting:

  • They work through most firewalls.
  • They are very tolerant of networks that suffer frequent brief dropouts.
  • They are tolerant of runtime changes in protocol configuration, such as may happen when a notebook is hibernated at work and taken home, and wakes up on a different network.
  • Consuming a web service requires only sockets and XML parsing, so web services can be used to bridge platforms.
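To show how little is needed for that last point, here is a sketch that calls an XML-returning service using nothing but a socket and an XML parser. It is Python for illustration, the function name `call_xml_service` is hypothetical, and it deliberately ignores the status line, redirects and chunked responses that a real client would have to handle.

```python
import socket
import xml.etree.ElementTree as ET


def call_xml_service(host, port, path):
    """Send a bare HTTP/1.0 GET and parse the response body as XML."""
    request = (
        f"GET {path} HTTP/1.0\r\n"
        f"Host: {host}\r\n"
        "\r\n"
    ).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(request)
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # HTTP/1.0: the server closes when it is done
                break
            chunks.append(chunk)
    raw = b"".join(chunks)
    _headers, _separator, body = raw.partition(b"\r\n\r\n")
    return ET.fromstring(body)
```

Any platform with a TCP stack and an XML parser can do the same, which is the whole argument for using web services as a bridge.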

Threads are another thing that can catch the unwary, but it's easy to make them safer: just mark them as background threads. When you do this, the process no longer waits for them. Once the main thread and any other foreground threads have finished, all background threads are forced to exit.
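The same idea in Python, for illustration: its `daemon` flag plays the role of the background-thread marker. The helper names are mine. Without `daemon=True`, the endless loop below would keep the process alive after the main thread finished.

```python
import threading
import time


def spin_forever():
    """A worker that never returns on its own."""
    while True:
        time.sleep(0.1)


def start_background_worker(target):
    """Start target on a thread that will not keep the process alive."""
    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    return worker
```

One caveat: in Python a daemon thread is stopped abruptly at exit, so it gets no chance to clean up. Anything that must be flushed or closed still needs an orderly shutdown path.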

Some people frown on the use of thread pool threads on the basis that you might run out of them because the thread pool is finite. This is pretty silly if you ask me. There's a limit to the number of threads the underpinning operating system can support. Attempting to create a thread can fail. Asking the pool to run something never fails. Sometimes the work waits in a queue until a thread becomes available, but the request never fails.
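A small demonstration of that behaviour, using Python's `ThreadPoolExecutor` as a stand-in for the thread pool discussed here. The function name and the task itself are made up for the example. It submits far more tasks than the pool has threads and records how many ever ran at once.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_on_small_pool(task_count, pool_size):
    """Submit task_count jobs to a pool of pool_size threads."""
    lock = threading.Lock()
    active = 0
    peak = 0

    def task(n):
        nonlocal active, peak
        with lock:
            active += 1
            peak = max(peak, active)  # highest concurrency seen so far
        time.sleep(0.01)  # pretend to do some work
        with lock:
            active -= 1
        return n * n

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # submit() queues the job and returns at once; it does not fail
        # just because every pool thread is busy.
        futures = [pool.submit(task, n) for n in range(task_count)]
        results = [future.result() for future in futures]
    return results, peak
```

Calling `run_on_small_pool(20, 2)` completes all twenty jobs while never running more than two at a time. The excess simply waits its turn.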

It occurs to me that remoting over HTTP will certainly go through firewalls and may not suffer so much from erratic connectivity, since HTTP is intrinsically stateless. But this still won't protect you from (de)serialization problems.

The way remoting is used at work, the objects are there mainly to provide context for function calls. This context is all client-generated and might just as easily be maintained by the client and passed to the server as a parameter. Ultimately an object method invocation is just a function call with a hidden extra parameter that points at a struct. If you make the struct explicit and keep it at the client, you can use web services to get all the benefits of remoting without all the infrastructure.
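Here is what making the struct explicit might look like, sketched in Python. The shopping-cart domain, `CartContext`, and both function names are invented for the example, and the direct call to `add_item` stands in for an HTTP request to a web service. The server keeps nothing between calls.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CartContext:
    """The 'hidden struct' made explicit; it lives at the client."""
    customer: str
    items: list


def add_item(request_body):
    """Stateless server-side handler: all the state arrives in the request."""
    request = json.loads(request_body)
    context = CartContext(**request["context"])
    context.items.append(request["item"])
    return json.dumps({"context": asdict(context), "count": len(context.items)})


def call_add_item(context, item):
    """Client side: send the context along and keep whatever comes back."""
    body = json.dumps({"context": asdict(context), "item": item})
    reply = json.loads(add_item(body))  # stands in for an HTTP POST
    return CartContext(**reply["context"]), reply["count"]
```

The client holds the context between calls, for example `ctx, count = call_add_item(ctx, "widget")`, so the server has no object lifetimes to manage and any server instance can answer any request.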

For that matter, you are also making better use of the network as a distributed computer, since you have distributed lifecycle management and transient storage among the workstations. In high volume scenarios this would take considerable load off the server.

Published 10-26-2007 13:00 by peterw