An engineer working on Google Talk gave a talk describing the lessons learned from building a very large, scalable system (incidentally, implemented in Java):
- Measure the right thing: the hard part of scaling IM is presence (who’s online now), not messaging, because every status change has to be pushed to each of that user’s contacts.
- Real-life load tests: when they added GTalk to Gmail and Orkut, they kept it hidden from users for a few weeks. During that time, they simulated IM connections to test the system against huge loads.
- Dynamic resharding: be prepared to add or remove machines in your data center and to rebalance data across the machines that remain.
- Add abstractions to hide system complexity: make GTalk a “service”; hide all complexity from other systems like Gmail.
- Understand the semantics of lower-level libraries: choose the low-level library whose behavior matches the characteristics of your application.
- Protect against operational problems: everything breaks, so prepare for the inevitable failures and make sure you can recover from them.
- Any scalable system is a distributed system: it must be fault tolerant, collect metrics, trace transactions, etc.
- Software development strategies: binaries are backward compatible; features can be rolled out incrementally for experimentation; engineers work on production machines.
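
To make the dynamic resharding point concrete, here is a minimal sketch of one common way to rebalance data when machines come and go: a consistent-hash ring with virtual nodes. The talk summary does not say which technique Google Talk used, so treat this as an illustration and not as their implementation. The names `HashRing`, `addNode`, `removeNode`, `nodeFor`, and `virtualNodes` are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical consistent-hash ring; an assumption for illustration, not from the talk. */
final class HashRing {
    // Ring position -> machine name, kept sorted so we can find the next point clockwise.
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    HashRing(int virtualNodes) {
        if (virtualNodes <= 0) {
            throw new IllegalArgumentException("virtualNodes must be positive");
        }
        this.virtualNodes = virtualNodes;
    }

    /** Places the machine at several points on the ring to even out its share of keys. */
    void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    /** Removes only the points owned by this machine; its keys fall to the next point. */
    void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(node + "#" + i), node);
        }
    }

    /** Returns the machine owning the first ring point at or after the key's hash. */
    String nodeFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no machines in the ring");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        // Wrap around to the first point when the key hashes past the last one.
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    /** Uses the first 8 bytes of an MD5 digest as a well-spread 64-bit ring position. */
    static long hash(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

With a ring like this, adding a machine only moves the keys that now belong to the new machine (on average about 1/N of them for N machines), and removing it hands those keys back to their previous owners. A naive `hash(key) % machineCount` scheme, by contrast, reshuffles most keys whenever the machine count changes.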