Monday, March 12, 2012

The CloudShare Agile Development Process

Agile, agile, agile. As a software-as-a-service company, our customers expect us to be agile. Agile in responding to customer support, but also agile in responding to feature requests, and agile in adding functionality to the base product.

Customers today expect monthly enhancements to their services on the web. They expect that they will be getting more for their money than they did the month before. And they expect that the service today will be faster than it was a month ago.

This blog post (and others that will follow) explains how CloudShare delivers the incremental functionality that customers expect to get, on a month by month basis, without impacting the core stability of the service.

The development process in CloudShare rests on three legs:
  • Staggered version releases every two weeks
  • Automated branch merging
  • Aggressive unit testing
  • Final Testing
(well, OK, four legs.)

Staggered Version Releases

So how does the development team release a version every two weeks? By doing staggered versions. And what are staggered versions? Simple – they are versions that are worked on in parallel: the development team always works on three versions at once.
(I can hear you saying – “and that’s simple?”. Well, bear with me.)

Let’s say we’re a new company, where the service was never deployed to the Internet. No customers whatsoever. The development team is developing the first version, which we’ll call A.
After two weeks of development, the developers pass it on to final testing, and start developing version B.

So now we have version A in final testing and version B in development. Invariably, final testing finds bugs (not many, mind you, but we’ll get to that when we discuss aggressive unit testing). So, while the developers are developing version B, they are also fixing bugs in version A.

Finally, after two weeks of testing and development, version A is ready to be deployed on the Internet, version B is ready to be tested, and a new version, C, is starting to be developed.

So in the next two weeks, development has to contend with three versions: C, being developed; B, being tested; and A, in production. All three of these may need coding. Version C, obviously, is being coded and developed. Version B is being changed to fix the bugs found by testing. And version A may need a patch to the production code if some horrible bug is found (don’t worry – it’s rare that this happens).

So there you have it – three versions that are being worked on in parallel – the dev version, the test version, and the prod (production) version.

Now comes the interesting part – when the developers start working on version D. This is also when we start testing version C, and when version B is being deployed on the Internet.

You can work out the next step by yourself (answers at the bottom of this post). After another two weeks, the developers start working on version ___ (1), we start testing version ___ (2), and version ___ (3) is deployed on the Internet.

To summarize – we are always working on three versions – most of the work is on the dev version, a bit of work is on the test version, and in rare circumstances, we patch the prod version.
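The rotation described above can be sketched in a few lines of Python. This is only an illustration of the schedule, not our actual tooling: the function name `stages_at` and the idea of numbering the two-week cycles from 0 are assumptions made for the example.

```python
from string import ascii_uppercase


def stages_at(cycle: int) -> dict:
    """Return which version letter sits in each stage during a given cycle.

    Cycle 0 is the very first two-week cycle, when only version A exists.
    The numbering is hypothetical and used only for this sketch.
    """
    if not 0 <= cycle < len(ascii_uppercase):
        raise ValueError("this sketch only covers versions A through Z")
    return {
        # The newest version is always the one in development.
        "dev": ascii_uppercase[cycle],
        # The version from the previous cycle is in final testing.
        "test": ascii_uppercase[cycle - 1] if cycle >= 1 else None,
        # The version from two cycles ago is in production.
        "prod": ascii_uppercase[cycle - 2] if cycle >= 2 else None,
    }


for cycle in range(4):
    print(cycle, stages_at(cycle))
```

For cycle 3 this gives D in dev, C in test, and B in prod, which matches the walk-through above. Every two weeks each version simply moves one stage to the right.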

Automated branch merging

“Ha ha!”, you say, “I got you now: this procedure looks really good in theory, but in practice, each change you make in the test version, you also need to make in the dev version!” And it gets worse! If we patch the prod version, we need to apply the same patch to the test version and the dev version.
As they say: “Inconceivable!”

Yes, inconceivable, unless you have automated branch merging! Using this technique, we use our source control tools to automatically merge the changes we did in one branch (for example, the change we did in the qa branch that fixes a bug that QA found), to another branch (the dev branch). All this happens automatically, without the developer needing to do anything.

The beauty of this approach is that if the versions were months apart, this would never have worked, but since the versions are at most two weeks of changes apart, most of the time the merge will succeed without any problem. And, in the rare case where the automatic merging fails, the developers get an email asking them to do the merge manually, which they do.
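To make the flow concrete, here is a minimal sketch of how a fix could be carried forward from one branch to the next. It is not our actual merge tool. The branch names, the step-by-step cascade (prod to test, then test to dev), and the `try_merge` and `notify` callables are all assumptions made for the example; in a real setup `try_merge` would wrap whatever the source control tool offers, and `notify` would send the email.

```python
from typing import Callable, List

# Branches ordered from most stable to least stable (names are assumptions).
BRANCHES = ["prod", "test", "dev"]


def downstream_of(branch: str) -> List[str]:
    """Return the branches that must also receive a change made in `branch`."""
    return BRANCHES[BRANCHES.index(branch) + 1:]


def propagate(
    branch: str,
    try_merge: Callable[[str, str], bool],
    notify: Callable[[str, str], None],
) -> List[str]:
    """Merge a change forward one branch at a time.

    `try_merge(source, target)` is a hypothetical hook that returns True when
    the automatic merge succeeds. On the first failure we call
    `notify(source, target)` and stop, leaving the rest to a manual merge.
    Returns the list of branches that were merged automatically.
    """
    merged = []
    source = branch
    for target in downstream_of(branch):
        if not try_merge(source, target):
            notify(source, target)
            break
        merged.append(target)
        source = target
    return merged


# Example run with stand-in hooks: every merge succeeds.
print(propagate("prod", lambda s, t: True, lambda s, t: None))
```

With these stand-in hooks, a patch made in prod reaches test and then dev without anyone touching it. Swap in a `try_merge` that returns False and the sketch stops at the first conflict and calls `notify` instead.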

But how do we know that the merge worked, especially when it happened automatically? This brings us to the third leg of our agile development process…

Aggressive unit testing

How do developers know that the code they wrote is working? The obvious answer – test it – works. But how would we know that the code we wrote did not break other stuff? The same answer – test everything – won’t work. There is just too much to test, and too little time. And throwing it at the testing team to check is also not going to work. Any bug found by final testing costs a lot of time for the team. Time better spent on coding.

So what should a developer do? Well, we do what any professional developer does – automate. This is a recurring theme in the CloudShare development team (and in any agile team) – if you want to be agile, you need to automate.

How can we automate testing? The answer today is simple – automated unit testing. I could discuss CloudShare unit testing for hours, and I will dedicate a blog post to this incredibly important subject, but suffice it to say for now that we have tests for every conceivable customer scenario (and lots of inconceivable customer scenarios, just in case the inconceivable becomes conceivable).
Unit tests, in CloudShare, are a safety net – they make sure that even if a developer does something wrong, the net will catch the mistake. And it is not only the developer’s responsibility to run all the tests – the tests are run every day automatically on the three branches (dev, test, prod). We have tens of machines just waiting to run these tests in parallel, so that feedback about the change the developer did gets to the developer as quickly as possible.

These unit tests are fast – a developer can get feedback about a change they made in about 15 minutes. But they are fast for a reason – they test things programmatically, and not through the browser UI. Many of them also run without touching the file system or the cloud infrastructure.
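As a small illustration of what "without using the file system" can look like, here is a sketch of a unit test that feeds the code an in-memory stream instead of a real file. The function `count_enabled_users` and its line format are invented for this example and are not CloudShare code; the test uses Python's standard `unittest` and `io` modules.

```python
import io
import unittest


def count_enabled_users(stream) -> int:
    """Count lines of the form 'name,enabled' in any file-like object.

    Hypothetical function: because it accepts a stream rather than a path,
    a test can hand it an in-memory buffer and never touch the disk.
    """
    return sum(1 for line in stream if line.strip().endswith(",enabled"))


class CountEnabledUsersTest(unittest.TestCase):
    def test_counts_only_enabled(self):
        # io.StringIO behaves like an open text file but lives in memory.
        fake_file = io.StringIO("alice,enabled\nbob,disabled\ncarol,enabled\n")
        self.assertEqual(count_enabled_users(fake_file), 2)

    def test_empty_input(self):
        self.assertEqual(count_enabled_users(io.StringIO("")), 0)
```

Saved to a file, this runs with `python -m unittest`. Tests written this way need no setup or cleanup on disk, which is one reason a large suite can still give feedback quickly.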
So what happens when a bug is introduced in the UI, or in the way we use the cloud infrastructure? Or the file system? This is where we reach our last leg – and the most important one:

Final Testing

Our Quality Assurance team – the testing team – gets a well-debugged version. It has passed all the unit tests, so the team is pretty confident that most of the functionality will work. But still we test – for the reason mentioned above, and also to verify that nothing has slipped through the cracks.

Our Quality Assurance engineers test each of the new features that were implemented. But they also test all the other functionality. Given that we have two weeks to test a version – new functionality and old functionality – we also have to resort to automation here. The testing team has a comprehensive system of tests that run the system just like the customer does – through the browser, and using the same infrastructure as in production.

Final Words

Agility is not just getting new functionality out the door quickly. It is building a software development process, and a company culture, that supports agility. A company culture that understands that you have to work on multiple versions at once, build tools to enable this, and above all – aggressively test, test, test.

1 E
2 D
3 C