2012 Dec 10

I’ve been working full-time for startups for almost 6 years, never taking more than a week off. After almost 4 years at IMVU I vowed to take a month off between jobs, and then I didn’t take a day off between IMVU and Canvas (oops). Encouraged by my girlfriend Amanda Wixted’s success in iOS contracting, I finally took the plunge after I left Canvas.

When I was in a startup it was straightforward to evaluate what to work on: Will it help us become profitable? Will it help the business survive? As a programmer with more than a couple of years of experience, it’s very straightforward to be profitable and survive as an independent contractor. For better and worse the equation becomes much more complicated: Is this project going to be fun? Will it preclude the lifestyle I want? …do I know what lifestyle I want?

For years now I’ve been envious of the lifestyle that Colin and Sarah Northway have lived. They’re both independent game developers who move every 3 months. They maintain this lifestyle by doing short-term furnished rentals (think Airbnb/VRBO). It’s not a vacation: they develop their games while they travel.

It’s taken me about 6 months to get to the point where this lifestyle is doable, but I’m finally there. I have a fairly steady stream of longer-term contracting projects that allow me to work remotely. They pay rates competitive with SF/NYC, which means I can afford the overhead of short-term furnished rentals. My girlfriend is in exactly the same position. We’re selling, donating or trashing all of our large possessions. We’re kicking it off by spending 4 months in Tahoe, CA enjoying the ski season. The best part, and also the scariest part, is that we have no idea where we’ll go after that.

Scaling Up Continuous Deployment

2012 Dec 03

“Continuous Deployment is for small startups who don’t care about quality.” If I could sum up every misconception of Continuous Deployment in one sentence, that would be it. Today I hope to dispel the “It doesn’t scale” myth.

Your build pipeline must scale up as two major factors increase: number of developers and number of people. When I say “scale up”, I very specifically mean maintaining the availability and performance of the system and keeping developers happy. All three of these factors are important and should be measured.

Firstly, set an SLA for your build pipeline, for instance “I should be able to commit/deploy 99% of the time.” Secondly, set a fixed goal for the total commit-build-deploy process. At IMVU that goal was roughly 10 minutes; at Canvas it was 5 minutes. If you set the bar high to start, you’ll maintain the benefits of fast deploys as you scale. Finally, and this deserves its own post: happiness. Track happiness explicitly via regular anonymous surveys, and make sure you fund the “little things” that frustrate everyone. It makes a huge difference, and ensures that down the road your system won’t have the usual “death by 1000 annoyances” that makes internal systems so painful to deal with.
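As a rough sketch of what measuring the first two targets could look like: the record fields, the function name and the thresholds below are illustrative assumptions, and the fraction of accepted builds is used as a simple stand-in for “able to commit/deploy 99% of the time”.

```python
from dataclasses import dataclass


@dataclass
class BuildRecord:
    succeeded: bool          # did the pipeline accept and deploy the commit?
    duration_minutes: float  # commit-to-deployed wall-clock time


def pipeline_report(builds, sla_target=0.99, duration_goal_minutes=10.0):
    """Summarize availability and speed for a window of builds."""
    if not builds:
        raise ValueError("need at least one build record")
    availability = sum(b.succeeded for b in builds) / len(builds)
    slowest = max(b.duration_minutes for b in builds)
    return {
        "availability": availability,
        "meets_sla": availability >= sla_target,
        "slowest_minutes": slowest,
        "meets_duration_goal": slowest <= duration_goal_minutes,
    }
```

Reporting the slowest build rather than the average is a deliberate choice here: a fixed goal is only useful if the worst case stays under it.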

Once you have your targets well defined, then it’s time to lay down the fundamentals. Scaling up Continuous Deployment starts with making your tests crazy fast. Step one in fast tests is write the right type of tests in the first place.

I think of it as a food pyramid of testing:

Testing Pyramid

The important idea here is that most functionality shouldn’t be tested via end-to-end GUI regression tests. You must be sparing, or your tests will not just end up slow but brittle by virtue of the fact that they integrate with too much and do too much. For web apps this means be sparing with the use of Selenium.

Next up are integration tests. If your test hits the database, it’s an integration test. If your test makes an HTTP request, it’s an integration test. If your test talks to other processes, it’s an integration test. The simple fact is that most web testing frameworks only do integration tests, and most web developers end up writing way too many of these.

Finally, unit tests. These should be your bread and butter. I may be bastardizing the term “Unit Test” since I don’t care about how many classes they touch; I just mean they don’t touch external state and they don’t rely on a specific environment. That means they don’t use randomness or time either. If you follow those rules it’s actually hard to write a slow test. You should be able to run thousands in a few seconds.
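One common way to keep time and randomness out of a test is to pass them in as arguments. This is a minimal sketch; the function names and the 14-day trial are made up for illustration.

```python
import random
from datetime import datetime, timedelta


def is_trial_expired(signup_time, now, trial_days=14):
    """Pure function: the caller passes in 'now' instead of reading the clock."""
    return now - signup_time >= timedelta(days=trial_days)


def pick_greeting(rng):
    """Randomness comes from an injected generator, so tests can seed it."""
    return rng.choice(["hello", "hi", "hey"])


def test_trial_expiry():
    signup = datetime(2012, 12, 1)
    assert not is_trial_expired(signup, now=datetime(2012, 12, 10))
    assert is_trial_expired(signup, now=datetime(2012, 12, 15))


def test_greeting_is_repeatable():
    # Two generators with the same seed must make the same choice.
    assert pick_greeting(random.Random(42)) == pick_greeting(random.Random(42))
```

Neither test touches the system clock, the database or the network, so both give the same answer on any machine at any time.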

For every 1 GUI test, you want a couple of integration tests and dozens of unit tests. The only way to maintain a ratio like this is to be conscious at all times of writing the right type of test for the functionality you’re testing.

Now that you’re making sure your individual tests are fast, you need to ensure that you’re running the test as fast as possible. Luckily running tests in parallel is usually straightforward. For unit tests you can just run multiple processes, or maybe your test runner has a multithreaded option. For integration/regression tests things get more complicated, and it usually makes sense to run them across multiple machines to ensure isolation. It’s fairly straightforward to use Buildbot or Jenkins to orchestrate multi-machine builds.
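For the simple multi-process case, a sketch might look like the following. It assumes your tests are importable `unittest` modules; the helper names are hypothetical, and a real setup would more likely lean on the test runner’s or CI server’s own parallelism.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def shard(test_modules, shard_count):
    """Deal modules round-robin into shard_count groups."""
    ordered = sorted(test_modules)
    return [ordered[i::shard_count] for i in range(shard_count)]


def run_shard(modules):
    """Run one shard in its own interpreter; return its exit code."""
    if not modules:
        return 0
    return subprocess.call([sys.executable, "-m", "unittest", *modules])


def run_parallel(test_modules, shard_count=4):
    """Run all shards concurrently; succeed only if every shard passes."""
    with ThreadPoolExecutor(max_workers=shard_count) as pool:
        exit_codes = list(pool.map(run_shard, shard(test_modules, shard_count)))
    return all(code == 0 for code in exit_codes)
```

Threads are only used to wait on the child processes; each shard runs in a separate interpreter, so tests in different shards cannot share in-process state.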

So now you’ve got the right kind of tests (fast ones) and they’re running in parallel across numerous machines. Just throwing hardware at the problem will work for a while, but eventually you’ll run into the hardest problem of scaling continuous deployment: The number of tests for a project is a function of the total man-months of engineering. If you’re constantly growing the team over time you’ll end up with a geometrically increasing number of tests. Tests that all need to be run in a fixed time window. Except, to keep your SLA you need to keep the build from breaking. But the build is broken more often the more engineers (and thus commits) you push through the system. So now you actually need to shrink the build time so that bad builds can be reverted faster.

From my experience, if you’re writing enough tests you’ll want to start running them in parallel fairly early (2-4 developers) and after that you can roughly throw hardware at the problem until the 16-32 developer range. At that point, you’ll need to start changing the process in a more fundamental way.

Developers implicitly understand which commits are more likely to break the build. Usually there’s not much that can be done with that information, other than being more nervous when committing. A “try” pipeline is a system which builds patches or branches instead of the mainline trunk. Try pipelines can be used to test riskier builds without interfering with the main build pipeline at all. They cost a little: developers have to think to use them, and then that specific commit will take twice as long to ship. In exchange you can dramatically drop the number of broken builds. Buildbot directly supports this concept, and the only significant cost is the extra hardware to do parallel builds.

Along the same lines, you can have a bot watch for broken builds and revert to the last successful build. The upside is that the system can bring itself back to good (able to accept good commits and deploy them) faster than a human could. The downside is that it’s fairly rude: anytime multiple authors get bundled in a broken build they’ll all suffer for one bad commit. Alternatively, you can commit to branches that your build system then merges for you once the branch individually has passed tests. This means the trunk build is never red, and no commit is ever rejected unfairly. Unfortunately it also means you’ll need significantly more hardware to run your tests.
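The decision at the heart of a revert bot is small. In this sketch the shape of the build history (a list of revision/passed pairs) is an assumption; wiring it to a real CI server and version control system is left out.

```python
def revision_to_revert_to(build_history):
    """build_history: list of (revision, passed) tuples, oldest first.

    Return the last green revision if the newest build is red, else None.
    """
    if not build_history or build_history[-1][1]:
        return None  # nothing built yet, or trunk is already green
    for revision, passed in reversed(build_history):
        if passed:
            return revision
    return None  # no green build on record; a human has to step in
```

Everything committed after the returned revision gets reverted together, which is exactly the “rude” part: good commits bundled with a bad one are thrown out too.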

Finally, you’ve written fast tests, have a try server or something more advanced but things are slowing down and getting frustrating. What do you do? The most common thing I’ve seen (explicitly or implicitly) is simply committing less frequently. This can take the form of switching to feature branches (and thus trunk is only merges of larger batches of commits), or team branches. Either way, you’re losing some of the advantages of Continuous Deployment. Maybe that’s the right trade-off for your team, maybe it’s not.

Conversely, the gold standard for scaling up continuous deployment is to simply make sure each team gets their own independent deploy pipeline. This effectively requires service oriented architecture for web applications, and so it may be a big pill to swallow depending on your existing architecture. It works. It can scale to hundreds or thousands of developers, and each developer is still able to have an incredibly fast deploy pipeline.

tl;dr: write less tests. write fast tests. go parallel. when that fails, go SOA.

Getting Started with Continuous Deployment

2012 Nov 25

One of the most common questions I’m asked about continuous deployment is “Continuous deployment sounds awesome! I have this existing project/team/corporation. How do we get started?” I absolutely do not recommend halting your company and spending months building out a “perfect” continuous deployment pipeline. There are quite a few ways to incrementally build and use continuous deployment systems, which I’ve bucketed into four options.

As a quick refresher, let me recap the basic components of a continuous deployment pipeline. If every developer isn’t committing early and often, then head over to Wikipedia’s article on continuous integration and start there. Once your entire team is practicing continuous integration, you’re ready to pick a continuous integration (CI) server. The CI server is the cornerstone of all continuous deployment automation. I recommend Buildbot or Jenkins.

Your CI server has a few jobs. On every commit your CI server should run all of your automated test coverage. This is your safety net. If your automated test coverage completes successfully, your CI server will then run your automated deployment scripts. Now that your code is being deployed, you’ll want application and system level production monitoring and alerting. If anything goes wrong (user signup breaks, master database CPU spikes, etc) you want to be alerted immediately. Finally, you need a formal root cause analysis process for learning from production failures and investing in future prevention. Of course there are many more advanced practices that can be added to the continuous deployment pipeline, like a cluster immune system, but those all come after you’ve built and learned from the basic pipeline.

In a nutshell: Commit → Tests → Deploy → Monitor
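To make the flow concrete, here is a minimal sketch of one commit moving through those stages. The stage callables and the automatic rollback on a failed health check are assumptions for illustration; in practice the CI server and your monitoring system play these roles.

```python
def run_pipeline(commit, run_tests, deploy, healthy, rollback):
    """Push one commit through tests, deploy and monitoring.

    Returns the name of the state the commit ended up in.
    """
    if not run_tests(commit):
        return "rejected"      # the safety net caught it before deploy
    deploy(commit)
    if not healthy():
        rollback(commit)       # monitoring flagged a problem after deploy
        return "rolled_back"
    return "deployed"
```

A commit that ends in "rolled_back" is the one that should feed the root cause analysis process described above.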

So how do you incrementally build out such a complicated pipeline? Here are the four options I recommend:

Just ship it.

Automate your deployment. Deploy to production on commit. Worry about tests when your regressions become costly. Use root cause analysis to drive further investment in your pipeline.

  • Pro: Quickest path to getting value out of CD
  • Pro: Easy to do when project is starting
  • Con: Expect heightened regression rate until automated testing/monitoring catches up with deploy rate.

Best for: startups, brand new pre-customer software, software where regressions are relatively cheap.

Remember: If you have no users, regressions are free!

Ship to staging first.

Automate your deployment. Deploy to staging servers instead of production. Treat staging failures as if they were production failures. Slowly build up your automated test coverage, and application monitoring/alerting. When comfortable with the regression rate in staging, start continuously deploying to production.

  • Pro: Easiest to sell to conservative stakeholders
  • Pro: No risk of continuous deployment causing regressions in production.
  • Con: Regressions often won’t be caught in staging, giving a false understanding of regression rate. This is a big deal. Your staging servers won’t have real users hitting real edge cases, and they probably won’t have the traffic, data size, scalability issues or reliability measurement of your actual production servers. You will regress in all of these areas without knowing it. This all leads to a false confidence in the abilities of your continuous deployment pipeline.
  • Con: The cost of building out the whole continuous deployment pipeline, without the return of the feedback of deploying to production.

Best for: proving methodology to conservative stakeholders, projects with existing staging deployment that is regularly tested, and dipping toes in the continuous deployment waters.

Ship a low risk area first.

Automate deployment of a low-risk area. As infrastructure is built out / regression rate drops, roll out automated deploy to other areas.

  • Pro: Real learnings, real continuous deployment
  • Pro: Minimal risk of costly regressions
  • Con: Low-risk areas are often able to have regressions without Q/A or customers noticing, leading to slow feedback cycle
  • Con: Requires that you have or build a well isolated area of your codebase.

Best for: proving methodology to less conservative stakeholders, projects with isolated low-risk code areas, and utilizing continuous deployment infrastructure as it’s built.

Ship your highest-risk area first.

Automate deployment of your highest-risk area (e.g. billing). Build out the full continuous deployment pipeline for your riskiest area, including a large automated test suite. This means significant up front investment, but if continuous deployment works for your company’s highest risk area then you’ve proven it will work everywhere else. While this approach sounds crazy, if your company decides continuous deployment is critical to its success then nailing continuous deployment for your highest-risk area makes the rest of adoption a downhill battle instead of an uphill one.

  • Pro: Real learnings, real continuous deployment
  • Pro: Once successful, rolling out to the rest of the company should be an easy sell
  • Pro: Easiest to measure before/after regression rates accurately
  • Con: Must build significant automated testing, alerting, monitoring and deploy/rollback infrastructure up front
  • Con: Regressions will be more expensive

Best for: top-down mandate to build and invest in continuous deployment, proving methodology beyond doubt in a large organization, organizations large enough to devote a team to implementing continuous deployment, and moving a large organization to continuous deployment as fast as possible.

Which of these options is right for you comes down to some combination of risk aversion and organization size. While the overall cost of building out the full pipeline might be significant, you should be able to choose a method from above and invest in your pipeline while you continue to successfully ship your product. The important part is gaining proportional return as you invest in your infrastructure, allowing you to reap benefits from continuous deployment as soon as possible.

So what are you waiting for? Get started already!

Continuous Deployment for Downloadable Client Software

2009 Mar 09

Continuous Deployment is the practice of shipping your code as frequently as possible. While relatively straightforward when applied to a production deploy as is common for websites and services, when applied to traditional client side applications there are three big problems to solve: the software update user experience, the collection and interpretation of quality metrics, and surviving the chaos of the desktop environment.

The first problem with Continuous Deployment for downloadable client software? It’s a download! Classically, the upgrade process is: The user decides to update, finds the software’s website again, downloads the newer version and runs the installer. This requires the user to remember that the software can be upgraded, find a need for it to be upgraded and determine it’s worth the effort and risk of breaking their install. When OmniFocus was in beta the developers were releasing constantly, many times per day! While the upgrade was manual and you had to remember to do it, the whole process worked well because the selected users were software-hip and often software developers themselves. I have nothing but praise for the way The Omni Group rapidly developed and deployed; plus they published a bunch of statistics! Still, there are clearly better ways to handle that stream of upgrades.

Software Update Experience

For successful Continuous Deployment, you need as many users as possible on your most recent deploy. There are a few models for increasing upgrade adoption, and I’ll list them in order of effectiveness.

Check for updates on application startup

When you run the software, it reaches out to your download servers and checks for a new version. If one is available, it provides an upgrade prompt. These dialogs are most useful when they can sell the user on the upgrade; then it feels like a natural process: “I’m upgrading because I want to use new feature Y”. These prompts can become extremely annoying, depending on where in the user’s story your application starts.

This is the process IMVU uses today, with all of its pros and cons. The best case user story is: The user remembers she wants to hang out on IMVU and launches the client. She notices we’ve added a cool new feature and decides to upgrade. The process is fast and relatively painless.

The worst case user story is: The user is browsing our website and clicks a link to hang out in a room full of people he finds interesting. On his way to the room, he gets a dialog box with a bunch of text on it. He doesn’t bother reading because it doesn’t look related at all to what he’s trying to do. He clicks yes because that appears to be the most obvious way to continue into his room. He’s now forced to wait through a painfully slow download, a painfully slow install process and far too many dialogs with questions he doesn’t care about. By the time he makes it into the room no one is there. The update process has completely failed him. Let’s just say there is definitely room for improvement.

Bundle an Update Manager

This is the approach taken by Microsoft, Apple and Adobe, to name a few. Upgrades are automatically downloaded in the background by an always-running background process and then, at the user’s pace, they’re optionally installed. While this could theoretically be a painless process, the three vendors I’ve named have all decided it’s important to prompt you until you install the upgrades. This nagging becomes so frustrating that it drives users away from the products themselves (personally, I use Foxit Reader just to avoid the Adobe download manager).

Download in the background, upgrade on the next run

This is the Firefox approach: downloads happen in the background while you run the application. When they’re finished you’re casually prompted once, and only once, whether you’d like to restart the app now to apply the upgrade. If you don’t, the next time you run Firefox you’re forced through the prompt-less update process. It’s a huge improvement over constant nags and installers filled with useless prompts. Updating Firefox isn’t something I think about anymore; it just happens. I would call this the gold standard of current update practices. We know it works, and it works really well.

Download in the background, upgrade invisibly

This is the Google Chrome model. When updates are available they’re automatically downloaded in the background and, as far as I can tell, applied invisibly as soon as the browser is restarted. I’ve never seen an update progress bar and I’ve never been asked if I wanted to upgrade. Their process is so seamless that I have to research and guess at most of the details. This has huge benefits for Continuous Deployment, as you’ll have large numbers of users on new versions very quickly. Unfortunately this also means users are surprised when UI elements change, and are often frustrated.

Download in the background, upgrade the running process

Can you do better than Google Chrome? I think you can. Imagine if your client downloaded and installed updates automatically in the background, and then spawned a second client process. This process would have its state synced up with the running process and then at some point it would be swapped in. This swap would transfer over operating system resources (files and sockets, maybe even windows and other resources depending on operating system). Under this system you could realistically expect most of your users to be running your most recent version within minutes of releasing it, meeting or exceeding the deploy-feedback cycle of our website deploy process.

I’m guessing Chrome is actually close to this model. A lot of the state is currently stored in a sqlite database, making the sync-up part relatively easy. The top level window and other core resources are owned by a small pseudo-kernel. You could easily imagine a scenario where deploys of non-pseudo-kernel changes could instantly update while pseudo-kernel changes would happen on next update. For all I know Chrome is doing that today! This doesn’t address, and in fact exacerbates, the friction caused by UI and functionality changes.
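For the “download in the background, upgrade on the next run” model, the launcher’s startup decision can be sketched roughly as below. The function names and the dotted version format are assumptions; this is not how Firefox, Chrome or IMVU actually implement it.

```python
def parse_version(text):
    """Turn '2.10.3' into (2, 10, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def startup_action(running_version, staged_version=None):
    """Decide what the launcher does before showing any UI.

    staged_version is the update a previous run finished downloading, if any.
    """
    if staged_version and parse_version(staged_version) > parse_version(running_version):
        return "apply_staged_update"
    return "launch"
```

Comparing parsed tuples rather than raw strings matters: as strings, "2.9.0" sorts after "2.10.0", which would leave users stuck on the older build.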

Success Metrics

Unlike a production environment, you don’t control any of the environmental variables. You’ll face broken hardware, out of memory conditions, foreign language operating systems, random dlls, other processes inserting their code into yours, drivers fighting for first-to-act in the event of crashes and other progressively more esoteric and unpredictable integration issues. Anyone who writes widely-run client software quickly models the user’s computer as an aggressively hostile environment. The examples I gave are all issues IMVU has had to understand and solve.

As with all hard problems, the first step is to create the proper feedback loop: you need to create a crash reporting framework. While IMVU rolled its own, since then Google has open sourced their own. Note that users are asked before crash reports are submitted, and we allow a user to view their own report. The goal is to get a high signal to noise chunk of information from the client’s crashed state. I’ve posted a sample crash report, though it was synthetically created by a crash-test. I hope no one notices my desktop is a 1.86 GHz processor… Of note, we collect stacks that unwind through both C++ and Python through some reporting magic that Chad Austin, one of my prolific coworkers, wrote and is detailing in a series of posts. In addition to crash reporting, you’ll need extensive crash metrics and preferably user behaviour metrics. Every release should be A/B tested against the previous release, allowing you to prevent unknown regressions in business metrics. These metrics are a game changer, but those details will have to wait for another post.

A screenshot of our aggregate bug report data

If your application requires a network connection you’ve been gifted the two best possible metrics: logins and pings. Login metrics let you notice crashes on startup or regressions in the adoption path. These are more common than you think when they can be caused or exacerbated by 3rd party software or Windows updates. Ping metrics let you measure session length and look for when a client stopped pinging without giving a reason to the server. These metrics will tell you when your crash reporting is broken, or when you’ve regressed in a way that breaks the utility of the application without breaking the application itself. A common example of this is a deadlock, or more generically a stall: the application hasn’t crashed but for some reason isn’t progressing. Once you’ve found a regression case like that you can implement logic to look for the failure condition and alert on it, to fail fast in the event of future regressions. For deadlocks we wrote a watcher thread that polls the stack of the main thread; if it hasn’t changed for a few seconds then we report back with the state of all of the current threads. In aggregate that means graphs that trend closely with our users’ frustration.

Deadlocks or stalls, measured in millistalls (thanks nonsensical Cacti defaults)
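The watcher thread idea can be sketched in pure Python as follows. This is not IMVU’s implementation (which also unwinds C++ stacks); the names, the polling interval and the `report` callback are assumptions, and comparing sampled stacks is only a heuristic, since a busy thread can occasionally be caught at the same spot twice.

```python
import sys
import threading
import traceback


def snapshot(thread_id):
    """Return the formatted stack of one thread, or None if it is gone."""
    frame = sys._current_frames().get(thread_id)
    return "".join(traceback.format_stack(frame)) if frame else None


def watch_for_stalls(thread_id, report, poll_seconds=1.0, stall_polls=3, stop=None):
    """Call report(stacks) once the watched thread's stack has looked identical
    for stall_polls consecutive polls. stacks maps thread id -> stack text."""
    if stop is None:
        stop = threading.Event()
    last, unchanged = None, 0
    while not stop.wait(poll_seconds):
        current = snapshot(thread_id)
        if current is None:
            return  # the watched thread has exited; nothing left to watch
        unchanged = unchanged + 1 if current == last else 0
        last = current
        if unchanged >= stall_polls:
            # Capture every thread, not just the stalled one.
            report({tid: "".join(traceback.format_stack(frame))
                    for tid, frame in sys._current_frames().items()})
            unchanged = 0


def start_watchdog(report, **options):
    """Watch the main thread from a daemon thread."""
    main_id = threading.main_thread().ident
    watcher = threading.Thread(target=watch_for_stalls, args=(main_id, report),
                               kwargs=options, daemon=True)
    watcher.start()
    return watcher
```

In a real client, `report` would send the stacks to the same place crash reports go, so stalls can be counted and graphed alongside crashes.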

Once you have great metrics, you have to strike a balance between asking customers to endure an update and gaining the feedback from your crash reporting and business metrics. For IMVU’s website deployment process we had a 2-phase roll out. Similarly, for client development we have a “release track” and a “pre-release track”, where releases are version X.0 and pre-releases are subsequent dot releases. We ship a pre-release per day, and a full release every two weeks. Existing users are free to opt in to and out of the pre-release track. Newly registered users are sometimes given a pre-release as part of an A/B experiment against the prior full release, but are then offered the next full release and do not stay in the pre-release track. Google Chrome is another example of this model. By default you’re given the stable channel, which is a quarterly update in addition to security updates. You can opt in to the beta channel for monthly updates or the dev channel for weekly updates.
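A sketch of how a download server might pick a track for each user follows. The user fields, the experiment name and the 10% fraction are all hypothetical; the hashing trick is just one common way to get a stable A/B split without storing assignments.

```python
import hashlib


def in_prerelease_experiment(user_id, experiment, fraction):
    """Deterministically place a stable fraction of users in an experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    # The first 8 hex digits give a number in [0, 1) once scaled.
    return int(digest[:8], 16) / 0x100000000 < fraction


def client_version_for(user, full_release, prerelease,
                       experiment="prerelease-ab", fraction=0.1):
    """Pick which build to offer: opted-in users and a slice of new users
    get the pre-release, everyone else gets the full release."""
    if user.get("opted_in_prerelease"):
        return prerelease
    if user.get("is_new") and in_prerelease_experiment(user["id"], experiment, fraction):
        return prerelease
    return full_release
```

Because the bucket depends only on the experiment name and user id, a given new user sees the same build every time they download during that experiment.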

The harsh reality of the desktop environment

Once you’re measuring your success rates in the wild and deploying regularly, you’ll get the real picture of how harsh the desktop landscape is. Continuous Deployment changes your mindset around these harsh realities: code has to survive in the wild, but you also must engineer automated testing and production measurement to ensure that changes won’t regress when run in a hostile environment.

Hostile Hardware

To start, software you write and deploy will have to survive on effectively hostile hardware and drivers. For a 3d application, that most commonly means crashes on startup, crashes when a specific 3d setting is used or jarring visual glitches. Drivers and other software on Windows have a far-too-common practice of dynamically linking their own code into your process. Apart from being rude, this can lead to crashes in your process in code you didn’t write or call and can’t reproduce without the same set of hardware and drivers. Needless to say, crash reports contain an enumeration of hardware and drivers.

Running in an unknown environment means dealing with the long-tail of odd configurations: systems with completely hosed registries, corporate firewalls that allow only HTTP and only port 80, antivirus software being nearly malicious, virus software being overtly malicious and motherboards that degrade when they heat up just to name a few. These problems scale up with your user-base, and if you choose to ignore “incorrectly” configured computers you’ll end up ignoring a surprisingly large percentage of your would-have-been customers.

Go to the source

Dealing with these issues is compounded by the fact that you have minimal knowledge about the computers that are actually running your software. Sometimes the best metrics in the world aren’t enough. For IMVU that meant we were forced to go as far as buying one of our users’ laptops. She was a power user who heavily used our software and ran into its limitations regularly. The combination of her laptop and her account would run into bugs we couldn’t reproduce on the hardware we had in house. We purchased her laptop instead of just buying the same hardware configuration because she was gracious enough to not wipe the machine; we were testing with all of the software she commonly ran in parallel. This level of testing takes a lot of customer trust, and we’re truly indebted to her for allowing us the privilege of that kind of access.

We also looked at our client hardware metrics and the Unity hardware survey. We cobbled together our 15th percentile computer. This is a prototypical machine which is better than 1/8th of our users’ hardware: 384 MB of RAM, a 2 GHz Pentium 4 and no hardware graphics acceleration. These machines commonly reproduced issues that our business class Dell boxes never would. Many of our users have Intel graphics “hardware”, which is so inefficient at 3d that it’s a better experience to render our graphics in pure software. Ideally we’d run automated tests on these machines as part of our deploy process, but we’re not there yet. Our current test infrastructure assumes that you can compile our source code in reasonable time on the testing machine.

Before I end this post I’d like to add a few words of caution. If you’re deploying client software constantly then you’re relying on a small set of code to be nearly perfect: your roll back loop. In a worst case scenario, a client installer ships that somehow breaks the user’s ability to downgrade the client. In the absolute worst case, that means breaking the machine completely; let’s hope you won’t have to create a step by step tutorial of how to recreate your boot.ini. Every IMVU client release is smoke tested by a human before being released.

It’s a much rougher environment for Continuous Deployment on client software. There are non-obvious deploy semantics, rough metrics tracking, and a hostile environment all standing in the way of shipping faster. Despite the challenges, Continuous Deployment for client software is possible, and has the same return as Continuous Deployment elsewhere: better feedback, faster iteration times and the ability to make a product better, faster.

What is Agile, really?

2009 Feb 17

I'm tired of reading misinformation about Agile. I'm tired of reading statements like these, that are just outright wrong:

  • Agile means writing software without writing documentation.
  • Agile means not caring about the long term.
  • Agile means engineers get to decide the project’s features.
  • Agile means not having strict practices.

And worse still are the half truths where Agile is confused for a specific practice of some Agile developers, statements like:

  • Agile means pairing.
  • Agile means Test Driven Development.
  • Agile means scrum.

So what is agile really?

Agile is writing software in teams that regularly reflect on how to become more effective, and trusting that team to adjust its behavior accordingly.

This is the core of agile, synthesized from Principles behind the Agile Manifesto. It’s about people. It’s about trust. It’s about continual improvement. This is where most implementations of Agile falter: they fail to trust the team. If you can’t build a team you trust to improve themselves, fire yourself. Replace yourself with someone that can.

That’s it. That is all you need to know about Agile. With this core, the team will re-evolve the major practices of Agile, but in the team’s context. Take “ship early, ship often” for example. This principle would quickly get re-derived, as the team’s regular reflections would be blocked on the same problem: they don’t know if they’re doing well or not. A quick root cause analysis would show that they don’t know how they’re doing until they’ve shipped real value to real customers. The rest of the principles can be re-derived in a similar manner.

8 books to kickstart your adoption of Lean Software Development

2009 Feb 17


Lean Software Development

This is the book you must absolutely read. It covers succinctly the basic principles of Lean Software, and directly how to implement them. It identifies and covers in depth seven fundamental lean principles: Eliminate Waste, Amplify Learning, Decide as Late as Possible, Deliver as Fast as Possible, Empower the Team, Build Integrity In and Seeing The Whole.

"Lean Software Development helps you refocus development on value, flow and people- so you can achieve breakthrough quality, savings, speed and business alignment."

Implementing Lean Software Development

Effectively Lean Software Development Volume 2. This book picks up where the last one left off. It adds depth, clarity and a breadth of examples.

"You'll discover the right questions to ask, the key issues to focus on, and techniques proven to work."

Bonus Points: The author, Mary Poppendieck, gave a Google Tech Talk: Competing on the Basis of Speed

"One hour of solid gold."

Lean Thinking

Lean Thinking : Banish Waste and Create Wealth in Your Corporation

While not being explicitly about software, this is the book where I finally grokked Lean Software. The analogies to Software are obvious, and this book taught me more about making the transition to lean processes than any other book on this list. It's full of case studies about the exact steps companies used to transition from batch (waterfall) processes into Lean processes.

"In contrast with the crash and burn performance of companies trumpeted by business gurus in the 1990s, the firms profiled in Lean Thinking - from tiny Lantech to mid-sized Wiremold to niche producer Porsche to gigantic Pratt & Whitney - have prospered, largely unnoticed, along a steady upward path through the market turbulence of the nineties. Meanwhile Toyota has set its sights on the leadership of the global motor industry."

Toyota Production System

Toyota Production System: Beyond Large-Scale Production

This was the book that introduced the world to Lean, before it was called Lean. Not only does it introduce the theory in abstract, it also gives detailed explanations for some of the practices in Lean Software. The best example of this is the Five Whys root cause analysis: you simply state the failure or defect, and then ask why five times over. This ensures that you don't solve a low level symptom of a higher level problem when you could solve the higher level problem itself, thus dissolving the low level symptom.

"The most important objective of the Toyota system has been to increase production efficiency by consistently and thoroughly eliminating waste."

The Goal: A Process of Ongoing Improvement

A truly unique narrative about a fictional manufacturing plant. This book explains the Theory of Constraints, a theory which fits into the Lean mindset perfectly. A fascinating example of how to use a book to teach via the Socratic method. A quick read, so it's high ROI!

"It's about people trying to understand what makes their world tick so that they can make it better."

Working Effectively with Legacy Code

Forget the title. PLEASE. Terrible title. Commonly referred to as just "Working Effectively." The first pages of the book redefine "Legacy" to mean "Code without tests." This book is about taking existing code that has no tests, and incrementally adding test coverage, all the while delivering value to customers. This is how you solve the "This code is crap" problem without going through the full rewrite song/dance/failure.

"Is your code easy to change? Can you get nearly instantaneous feedback when you do change it? Do you understand it? If the answer to any of these questions is no, you have legacy code, and it is draining time and money away from your development efforts."

Test-Driven Development

Test Driven Development

TDD is practically a requirement for keeping up development velocity, a fundamental requirement of Lean Software. This is the classic TDD book. Unfortunately it's not the best introduction, because it appears deceptively obvious until you actually get Test Driven Development. If you're already practicing and have bought into TDD then read this book to really hone your skills and intuition.

"By driving development with automated tests and then eliminating duplication, any developer can write reliable, bug-free code no matter what its level of complexity. Moreover, TDD encourages programmers to learn quickly, communicate more clearly, and seek out constructive feedback."

Extreme Programming Explained

XP Explained

This book lays out the fundamentals of Agile Software Development, which is effectively a subset of Lean Software Development. This is walk-before-you-run territory: if agile is still a foreign concept, start here instead of Lean Software.

"Every team can improve. Every team can begin improving today. Improvement is possible- beyond what we can currently imagine. Extreme Programming Explained, Second Edition, offers ideas to fuel your improvement for years to come."

Cloud Elasticity

2009 Feb 14


It’s time to take advantage of the cloud’s free parallelism. Most existing use cases merely map existing techniques to the cloud. Elasticity is a critical measurement: the time it takes to start a node up, and your minimum time commitment per node. Short-lived but massively parallel tasks that were once impossible thrive in a highly elastic world. Big prediction: Clouds are going to get more elastic indefinitely; they’ll trend with Moore’s law.

It’s time to take advantage of the cloud’s free parallelism. Lots of infrastructure has moved to the cloud, but it’s being done naively. The stories seem to fall into three categories: “I’m serving my WordPress on an EC2 instance,” “We used a few machines to OCR a bunch of PDFs,” and “When it came time to add capacity, we just clicked a button.” These use cases are all things you would do if you bought hardware, but are easier because of cloud computing.

Let’s try a scenario where we’re going to push public clouds to their breaking point. We want to run our test suite, which takes four hours, as fast as possible. How fast is that today? For this example, let’s assume EC2 nodes always take exactly one minute to start (a fair estimate of reality). Let’s also assume test setup is another minute, flat. We can spawn four nodes and have all of the tests done in one hour and two minutes. We can spawn forty nodes and have the tests done in eight (4 * 60 / 40 + 2) minutes. We can spawn four hundred nodes and have the tests done in 2:36. That’s not just fast, that’s fast enough to change the rules around when you run tests. That’s fast enough to change the nature of software development.
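The arithmetic in that scenario is easy to play with. Here’s a minimal sketch, assuming the suite splits perfectly evenly across nodes and that startup and setup each cost a flat minute, as above. The function name and parameters are made up for illustration.

```python
def wall_clock_minutes(suite_minutes, nodes, startup_minutes=1.0, setup_minutes=1.0):
    """Minutes until the whole suite finishes when split evenly across nodes.

    Assumes perfect sharding: every node gets an equal slice of the work.
    """
    if nodes < 1:
        raise ValueError("need at least one node")
    return suite_minutes / nodes + startup_minutes + setup_minutes


SUITE_MINUTES = 4 * 60  # the four hour test suite from the scenario

for nodes in (4, 40, 400):
    print(nodes, "nodes ->", wall_clock_minutes(SUITE_MINUTES, nodes), "minutes")
```

Note that the fixed two minutes of overhead dominates quickly: past a few hundred nodes, adding more machines barely moves the total.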

Sadly this scenario isn’t realistic today, because there are two important measures of cloud elasticity: spin-up elasticity and spin-down elasticity. Spin-up elasticity is the time between requesting compute power and receiving it. Spin-down elasticity is the time between no longer requiring compute power and no longer paying for it. In the case of EC2 these numbers aren’t balanced: it’s a minute to spin up and up to an hour to spin down. EC2’s true elasticity is an hour!
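To see why spin-down elasticity matters, here’s a rough sketch of what the four hour test suite would be billed for. It assumes a simplified billing model where each node’s usage is rounded up to a whole billing increment; the function and its parameters are illustrative, not any provider’s actual pricing API.

```python
import math


def billed_node_hours(suite_minutes, nodes, overhead_minutes=2.0,
                      billing_increment_minutes=60.0):
    """Node-hours paid for when usage is rounded up per node.

    Assumption: each node is billed in whole increments (an hour by default).
    """
    per_node_minutes = suite_minutes / nodes + overhead_minutes
    increments = math.ceil(per_node_minutes / billing_increment_minutes)
    return nodes * increments * billing_increment_minutes / 60.0


for nodes in (4, 40, 400):
    hourly = billed_node_hours(240, nodes)
    per_minute = billed_node_hours(240, nodes, billing_increment_minutes=1.0)
    print(nodes, "nodes:", hourly, "node-hours hourly,", per_minute, "per-minute")
```

Under these assumptions the 400 node run pays for 400 node-hours to do four hours of actual work. With a one minute increment the same run would pay for 20.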

Which is a shame, because what really interests me are the services pushing the edges of elasticity. Services that can only exist in a world where computing power blinks in and out of existence. Services that offer to spread your workload as widely as technologically possible. Services that take your long running tasks and give you results immediately, at minimal extra cost.

So here’s my big prediction:

Thanks to computing power’s exponential growth, cloud computing’s elasticity times will decay exponentially; we’ll see spin-up and spin-down times cut in half every year and a half.
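Taken literally, that prediction is a simple halving curve. A small sketch, assuming we start from an hour of elasticity and that the 18 month halving holds exactly (both are just the numbers from this post, not measurements):

```python
import math


def elasticity_seconds(initial_seconds, years, halving_period_years=1.5):
    """Elasticity after `years` if it halves every `halving_period_years`."""
    return initial_seconds * 0.5 ** (years / halving_period_years)


def years_until(initial_seconds, target_seconds, halving_period_years=1.5):
    """Years needed for elasticity to shrink from initial to target."""
    return halving_period_years * math.log2(initial_seconds / target_seconds)


ONE_HOUR = 3600.0
for target in (5.0, 1.0, 0.015):  # the thresholds discussed in this post
    print(target, "seconds after roughly", round(years_until(ONE_HOUR, target), 1), "years")
```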

For a while this elasticity will go towards making offline tasks faster. Tasks like compressing large amounts of video will at most take about as long as the elasticity times. Cool stuff, but not very novel uses. It’s when the elasticity starts to approach the “off line” vs “on line” threshold that things get crazy. What if it’s only a second to spin a machine up or down? We can start to have machine per web request, or machine per social interaction (IM, tweet or hug).

What happens when we have 5 second elasticity? (About as long as a user will wait for a UI interaction to complete without multitasking)

What happens when we have 15 millisecond elasticity? (About as fast as your eyes can refresh)

I don’t pretend to know what the next big revolution in computing is going to be, but I’ll sure as hell be watching the services pushing cloud elasticity to its edges. If there is a revolution to be had in cloud computing, that’s where it’ll start.

Emergent Properties of Continual Automation

2009 Feb 13

Once a task has been automated to take dramatically less time, a threshold is crossed, at which point you can exploit emergent properties. Or in other words, crazy-ass improvements.

While the most dramatic examples I have relate to our test and deploy infrastructure, I’ll skip rehashing those here.

Most companies manage to end up with weeks of effort required to create local development environments. I’ve heard stories of major Bay Area companies with internal package management hell that meant there were only two or three people in the entire (thousands of employees) company who could actually create working development setups.

How automated can this process get? At IMVU the entire setup process for our website is an SVN checkout and running a script. At that point a local instance of apache and memcache are running. Port 80 is serving up a local development copy of the website. **There is no work left.**

What does this level of automation give you? Here’s where things start to take off. Now I can easily install a sandbox on my home machine, on my laptop or even on an EC2 instance, all without effort. I can start to parallelize my work effort: run regression tests on one machine, poke at code to understand it on the other. We can one-off install sandboxes on machines to run data crunching. This is how we’ve generated the numerous incarnations of rendered 3D art in our registration flow.

When we spin up new engineers we target a working sandbox by lunch and a typo fix live in production by end of day. **On the first day of their employment.** This is one of our best ways to demonstrate our cultural differences from most development shops.

We can give sandboxes to marketing, and have them develop promotion materials against instantly up to date code. Marketing can run tests (which they occasionally break!) and even commit and push material live to the website. This isn’t just saving engineering time, it’s allowing marketing to be dramatically more effective. No more games of telephone or bouncing mockups back and forth over e-mail.

Instead of resorting to complicated test-cluster setups, our testing pipeline is just forty separate sandboxes operating with full parallelism. Should a machine break, or some stray electron corrupt an install, it’s a tiny amount of effort to restore the machine to a fresh state.

More importantly, our sandbox install and update procedures are the same thing. It’s incredibly easy to experiment with new features, software packages or other setup changes and have every other developer running with them at their next code update. We’ve experimented with different versions of PHP, including switches to toggle between PHP4 and PHP5. We added Privoxy to flag accidental 3rd-party dependencies in Selenium, solving a large class of accidental test dependencies.

All of these benefits are amazing, but what was more incredible is that we didn’t really anticipate any of them. This is emergence at its finest. In dramatically lowering the cost of sandbox creation we dramatically lowered the costs of numerous dependent activities, and in doing so we changed the very shape of our development practices. This is the application of Systems Theory at its finest.

We didn’t get here overnight. We didn’t get here in a few weeks. We didn’t get here by funding a project. We got here by a culture. That culture overridingly said: if you did it twice, it’s time to start automating. Every time you repeat a task, make progress on automating it. It doesn’t have to be big, flashy, bold or fanciful. It just has to be progress. You’ll quickly find this culture reinforces itself: automating common tasks makes other engineers want to automate them even more.

Sometime today you will come across a task that you’ve done before. You’ll notice the commands are coming from muscle memory or the steps are fully documented. Automate it.

If not now, then when?


2009 Feb 12


The world doesn’t need another arbitrary binary protocol. Just use HTTP. Your life will be simpler. Originally this came up when scaling a gaggle of MySQL machines. I would have killed for a reliable proxy. It’s with this in mind that I’ve come up with my list of things that HTTP has that an arbitrary protocol will have to rebuild. Anytime you choose to use a service based on a non-HTTP protocol, look over this list and think carefully about what you’re giving up.

1. Servers in any language.

2. Clients in any language.

These two are obvious. Moving right along.

3. Proxies

There are rock solid drop-in software solutions for proxying traffic from one machine to another. These proxies can do all types of request or response rewriting.

4. Load balancers

Need to scale past one machine? Need higher reliability? Drop a load balancer in front of multiple machines and you have a transparent barrier around the complexity of scaling up a service.

5. Debugging tools

There are no problems that have not yet been encountered. In fact, there are probably tools for diagnosing every malady you will ever encounter.

6. Web browsers

You already have a client, you’re using it right now. You can use it to poke at APIs.

7. People

Everyone knows HTTP. Quite a few people know more about it than you ever will. You can always reach out for help, or get contractors to solve problems.

8. Guaranteed web access

Corporate proxies and weird ISPs cause all kinds of havoc for things that aren’t HTTP. Being HTTP means you sidestep those problems.

9. Extensive hardware

If you’re high traffic or need extremely high uptime, you’re going to outgrow most software solutions. When you step up to the big time, hardware vendors will be there to support you.

10. Known scalability paths

Not only are there software solutions to allow easy migrations to more scalable architectures, but there are also easy patterns for designing a backend to scale up servicing HTTP’s stateless request-responses.

11. Prior knowledge

You already know HTTP. Your coworkers already know HTTP. You can start working on the harder problems at hand.

12. Extensibility

Between HTTP verbs and headers you have quite a bit of freedom to extend your original schemes. Need an extra piece of data? Add a header. Have pieces of information but want to be able to remove them? Use HTTP DELETE. Run into a really hairy problem that really wants a piece of it to be solved in a different protocol? Use HTTP protocol switching.

13. URLs

Using HTTP allows you to use a standard way of referencing resources. Parsers already exist for every language and their semantics are well understood.

14. Security

HTTPS gives you baked-in, easy-to-use security. It has its limitations, but if you’re really paranoid you can always use SSH and a SOCKS proxy. Once again, HTTP has your back. (Forgot to include this, thanks Daren Thomas for pointing it out!)
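To make the extensibility point (item 12) concrete, here’s a small sketch using Python’s standard urllib. The URL and the X-Request-Source header are invented for illustration, and nothing is actually sent over the network; the requests are only constructed.

```python
import urllib.request


def build_request(url, method="GET", extra_headers=None, body=None):
    """Build (but do not send) an HTTP request with optional extra headers."""
    request = urllib.request.Request(url, data=body, method=method)
    for name, value in (extra_headers or {}).items():
        request.add_header(name, value)
    return request


# Need an extra piece of data? Add a header (hypothetical header name).
tagged = build_request(
    "http://api.example.com/items/42",
    extra_headers={"X-Request-Source": "batch-import"},
)

# Want to remove a resource? Same URL, different verb.
delete = build_request("http://api.example.com/items/42", method="DELETE")

print(tagged.get_method(), tagged.header_items())
print(delete.get_method(), delete.full_url)
```

The existing clients, proxies and load balancers in between don’t need to know anything about the new header or care which verb is in use.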

In the end the rules are simple. Is it possible to do over HTTP? Then do it over HTTP.

I’m not exactly defending an unpopular position, but there are still surprising transgressions of this rule, XMPP being the most obvious one. It’s quite a bit more complex than HTTP and is missing most of the above qualities. It’s usually cited as an example of a protocol that solves a problem HTTP can’t: asynchronous bidirectional messaging, allowing the server and the client to send messages with minimal lag. The truth is HTTP can do this just fine: with long-polling and HTTP keep-alive you can keep a persistent bidirectional connection open.
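For anyone who hasn’t seen long-polling, here’s a toy sketch of the idea using only Python’s standard library. The server holds each GET open until it has something to say, so the client hears about events almost as soon as they happen. The 25 second hold time, the /events path and the message text are arbitrary choices for the demo, and it only shows the long-polling half (no keep-alive tuning).

```python
import http.client
import queue
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

messages = queue.Queue()  # events waiting to be pushed to a client


class LongPollHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            # Hold the request open until an event arrives or the wait expires.
            body = messages.get(timeout=25).encode("utf-8")
            status = 200
        except queue.Empty:
            body, status = b"", 204  # nothing happened; the client just asks again
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def poll_once(host, port, timeout=30):
    """One long-poll round trip: blocks until the server has an event."""
    connection = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        connection.request("GET", "/events")
        return connection.getresponse().read().decode("utf-8")
    finally:
        connection.close()


def demo():
    server = ThreadingHTTPServer(("127.0.0.1", 0), LongPollHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # The "server side" event shows up a moment after the client starts waiting.
    threading.Timer(0.2, messages.put, args=["hello from the server"]).start()
    try:
        return poll_once("127.0.0.1", server.server_address[1])
    finally:
        server.shutdown()
        server.server_close()
```

A real client would call poll_once in a loop, reconnecting immediately after each response.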

There is an ever-slimming number of commonly used protocols that aren’t HTTP: instant messaging, e-mail, IRC and FTP come to mind.

Move a service to HTTP, and it becomes a team player in our ecosystem. Let’s revolutionize the last of our dinosaur protocols and move on.

In the Lair of the Cycle-Eaters

2009 Feb 11

Programmers are losing serious amounts of productivity to hidden work every day.

It’s time to stop that, but wait, I’m getting ahead of myself.

In a different day and age, x86 assembler genius Michael Abrash coined the phrase Cycle Eater to describe how x86 assembly would have non-obvious slowdowns. For instance an addition that should’ve executed in 2 cycles actually ran for an extra 6 cycles spent fetching the operands from memory. You’d assume you had optimal assembly when in fact your assembly was actually kinda slow. Had you known about your cycle eater, you could’ve re-ordered prior operations to optimize for memory fetching, regaining the optimal 2-cycle performance.

The phrase Cycle Eater perfectly describes much higher level problems that plague software development. Cycle Eaters are everywhere. Cycle Eaters can be as simple as requiring an engineer to manually switch marketing promotions on and off. They can also be as complex as the time, knowledge and effort it takes to set up a new local sandbox or build environment.

The fundamental problem with Cycle Eaters is that you don't realize how often they're wasting your time.

Ever joined a new company only to spend a week getting your build environment up and get a build that actually runs? That cost has to be paid for every engineer, for every machine and for every reformat, for every reinstall. Not only does that cost add up, but the drive to avoid setting up a build environment causes cascading cycle eaters. Countless times I’ve seen engineers sitting there waiting while tests run or a build compiles while their laptop goes unused for development. They’re avoiding the pain and headache caused by the build setup Cycle Eater.

Luckily Cycle Eaters are surprisingly easy to deal with. When I started contributing features to TIGdb, the commit and deploy process was entirely manual. It was annoying manual work that was easy to screw up. My first deploy was by hand. I then resolved to never do that again. I automated away some of the work. My second deploy was by running a couple commands, and then a shell script to do the final deploy. Again I automated away some of the work. My third deploy was SSHing into the server and running a shell script. My fourth deploy was running a local shell script. My fifth deploy automated database migrations.

My example incrementally removed the Cycle Eater, and that’s critically important. I’m not advocating that you go out and try to start a mammoth project to automate away everything that’s slowing you down. That would be a severe violation of the you-aren’t-going-to-need-it principle. Process automation is an interesting thing, because once you automate away a Cycle Eater, you may find your behavior dramatically changing. If it’s free to deploy, you’ll deploy more often. If it’s free to set up sandboxes, everyone in marketing gets one!

Here's where something magical happens.

When you fix a Cycle Eater, you don’t just get back the time you were losing to the Cycle Eater. There are often unpredictable emergent properties from this type of waste reduction. When you have free sandboxes, marketing starts using the same development tools that engineering uses. Marketing suddenly doesn’t need to pull an engineer out of flow to get promotional material deployed.

Client software build and release processes are often extremely manual, often involving “that one guy who builds the installer.” Once fully automated, releases can be cut daily with minimal cost. Daily releases result in dramatically better feedback, such as specifically which revisions caused regressions or improvements. That knowledge feeds back into the process, causing progressively higher quality client releases.

All this from simple incremental automation.

This isn’t just my theory. It’s an IMVU culture of removing Cycle Eaters. It’s allowed an extremely aggressive policy for new hires: on the day that they start working, they will commit a fix to the website. It’ll probably be a typo, but it will be a real fix pushed into production and live for every customer. On their first day. All thanks to slaying Cycle Eaters.

So start today. The next time you notice that your time is being eaten up by one of those little things you wouldn’t normally fix, think about it. Just think about the solution to the problem, and then implement a step in that direction. It doesn’t have to be a big step and you don’t have to know how to completely fix the Cycle Eater. Just make a single incremental improvement.

That first step will be the hardest. You’ll have to force yourself to overcome your natural tendency to ignore the Cycle Eater, but the results… I can’t just tell you what it’s like to push a button and have a full deploy just happen. It’s a rush. There is something fundamentally pleasing about automating away wasteful work; you must experience it for yourself.

Go, and slay your Cycle Eaters.

Continuous Deployment at IMVU: Doing the impossible fifty times a day.

2009 Feb 10

I recently wrote a post on Continuous Deployment: deploying code changes to production as rapidly as possible. The response on news.ycombinator was, well…

“Maybe this is just viable for a single developer … your site will be down. A lot.” - akronim

“It seems like the author either has no customers or very understanding customers … I somehow doubt the author really believes what he’s writing there.” - moe

…not exactly what I was expecting. Quite the contrast to the reactions of my coworkers who read the post and thought “yeah? what’s the big deal?” Surprising how quickly you can forget the problems of yesterday, even if you invested most of yourself into solving them.

Continuous Deployment isn’t just an abstract theory. At IMVU it’s a core part of our culture to ship. It’s also not a new technique here: we’ve been practicing continuous deployment for years, far longer than I’ve been a member of this startup.

It’s important to note that the system I’m about to explain evolved organically in response to new demands on the system and in response to post-mortems of failures. Nobody gets here overnight, but every step along the way has made us better developers.

The high level of our process is dead simple: Continuously integrate (commit early and often). On commit, automatically run all tests. If the tests pass, deploy to the cluster. If the deploy succeeds, repeat.

Our test suite takes nine minutes to run (distributed across 30-40 machines). Our code pushes take another six minutes. Since these two steps are pipelined that means at peak we’re pushing a new revision of the code to the website every nine minutes. That’s 6 deploys an hour. Even at that pace we’re often batching multiple commits into a single test/push cycle. On average we deploy new code fifty times a day.
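If the pipelining math isn’t obvious, here’s a tiny sketch. It assumes the stages overlap fully, so the slowest stage sets the pace; the function name is just for illustration.

```python
def pipeline_stats(stage_minutes):
    """Return (latency, interval, whole deploys per hour) for pipelined stages."""
    latency = sum(stage_minutes)    # time for one commit to clear every stage
    interval = max(stage_minutes)   # the slowest stage sets the pace once stages overlap
    return latency, interval, int(60 // interval)


latency, interval, per_hour = pipeline_stats([9, 6])  # nine minutes of tests, six of push
print(latency, "minutes from commit to fully pushed")
print("a new revision every", interval, "minutes, so", per_hour, "complete deploys an hour")
```

Any single commit still takes fifteen minutes end to end; pipelining only shortens the gap between consecutive deploys.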

So what magic happens in our test suite that allows us to skip having a manual Quality Assurance step in our deploy process? The magic is in the scope, scale and thoroughness. It’s a thousand test files and counting. 4.4 machine hours of automated tests to be exact. Over an hour of these tests are instances of Internet Explorer automatically clicking through use cases and asserting on behaviour, thanks to Selenium. The rest of the time is spent running unit tests that poke at classes and functions and running functional tests that make web requests and assert on results.

Buildbot running our tests sharded across 36 machines.

Great test coverage is not enough. Continuous Deployment requires much more than that. Continuous Deployment means running all your tests, all the time. That means tests must be reliable. We’ve made a science out of debugging and fixing intermittently failing tests. When I say reliable, I don’t mean “they can fail once in a thousand test runs.” I mean “they must not fail more often than once in a million test runs.” We have around 15k test cases, and they’re run around 70 times a day. That’s a million test cases a day. Even with a literally one in a million chance of an intermittent failure per test case we would still expect to see an intermittent test failure every day. It may be hard to imagine writing rock solid one-in-a-million-or-better tests that drive Internet Explorer to click ajax frontend buttons executing backend apache, php, memcache, mysql, java and solr. I am writing this blog post to tell you that not only is it possible, it’s just one part of my day job.
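The numbers behind that claim, as a quick sketch. It assumes every test execution fails independently with the same probability, which is a simplification; real flaky tests tend to cluster.

```python
def expected_daily_failures(test_cases, runs_per_day, failure_rate):
    """Average number of intermittent failures per day."""
    return test_cases * runs_per_day * failure_rate


def chance_of_any_failure(test_cases, runs_per_day, failure_rate):
    """Probability that at least one execution fails on a given day."""
    executions = test_cases * runs_per_day
    return 1.0 - (1.0 - failure_rate) ** executions


# 15k test cases, 70 runs a day, one-in-a-million flakiness per execution.
print(expected_daily_failures(15_000, 70, 1e-6))
print(chance_of_any_failure(15_000, 70, 1e-6))

# The same suite at one-in-a-thousand flakiness would never go green.
print(expected_daily_failures(15_000, 70, 1e-3))
```

At one in a million that works out to about one spurious failure a day on average. At one in a thousand it would be about a thousand.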

Back to the deploy process: nine minutes have elapsed and a commit has been greenlit for the website. The programmer runs the imvu_push script. The code is rsync’d out to the hundreds of machines in our cluster. Load average, cpu usage, php errors and dies and more are sampled by the push script, as a baseline. A symlink is switched on a small subset of the machines, throwing the code live to its first few customers. A minute later the push script again samples data across the cluster and if there has been a statistically significant regression then the revision is automatically rolled back. If not, then it gets pushed to 100% of the cluster and monitored in the same way for another five minutes. The code is now live and fully pushed. This whole process is simple enough that it’s implemented by a handful of shell scripts.
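The shape of that push logic can be sketched in a few lines. This is not the actual imvu_push script (that’s shell); it’s an illustrative Python outline where every function name and parameter is hypothetical, and the regression check is a crude threshold standing in for a real statistical test.

```python
import time


def push_revision(revision, hosts, sample, activate, rollback, regressed,
                  canary_count=2, wait=time.sleep):
    """Canary-style push: try a few hosts, compare to a baseline, then go wide."""
    baseline = sample()               # metrics before any host runs the new code
    canaries = hosts[:canary_count]
    activate(revision, canaries)      # e.g. flip a symlink on a small subset
    wait(60)
    if regressed(baseline, sample()):
        rollback(canaries)
        return "rolled back"
    activate(revision, hosts)         # push to 100% of the cluster
    wait(5 * 60)
    if regressed(baseline, sample()):
        rollback(hosts)
        return "rolled back"
    return "live"


def simple_regressed(baseline, current, tolerance=1.5):
    """Stand-in check: flag any metric that grew past tolerance x its baseline."""
    return any(current[name] > tolerance * max(value, 1e-9)
               for name, value in baseline.items())


# Demo with in-memory fakes: errors quadruple on the canaries, so we roll back.
live_on = {}
readings = iter([{"php_errors": 10.0}, {"php_errors": 40.0}])
result = push_revision(
    "r1234", ["web1", "web2", "web3"],
    sample=lambda: next(readings),
    activate=lambda rev, hs: live_on.update({h: rev for h in hs}),
    rollback=lambda hs: [live_on.pop(h, None) for h in hs],
    regressed=simple_regressed,
    wait=lambda seconds: None,        # skip the real waiting in this demo
)
print(result, live_on)
```

Passing the sampling, activation and rollback steps in as functions keeps the control flow testable without a cluster.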

The point is that Continuous Deployment is real. It works and it scales up to large clusters, large development teams and extremely agile environments.

And if you’re still wondering if we are a company that “has no customers”, I’d like to refer you to our million dollar a month revenue mohawks.

What webhooks are and why you should care

2009 Feb 09

Webhooks are user-defined HTTP callbacks. Here’s a common example: You go to GitHub. There’s a textbox for their code post webhook. You drop in a URL. Now when you post your code to GitHub, GitHub will HTTP POST to your chosen URL with details about the code post. There is no simpler way to allow open ended integration with arbitrary web services.

This tiny interface is used in obvious ways: bug tracking integration, SMS messaging, IRC and Twitter.

The same tiny interface is also used in non-obvious ways, like Run Code Run which offers to build and run your project’s tests for you. All by just plugging a URL into GitHub.

Webhooks today offer a lot of value as an instant notification mechanism. Have events your users care about? Give them a webhook for those events and you’ve given them the power and flexibility to integrate that event stream into their life.

For all of that power, webhooks are impressively simple to implement. It’s a one liner in almost every language.
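As a rough illustration in Python, firing a webhook is little more than a single urlopen call. This sketch assumes a JSON payload and a user-supplied URL; the function names and the example event are made up, and real services each pick their own payload format.

```python
import json
import urllib.request


def build_webhook_request(url, event):
    """Package an event as an HTTP POST aimed at the user's chosen URL."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def fire_webhook(url, event, timeout=5):
    """Send the POST and return the HTTP status code of the receiver."""
    with urllib.request.urlopen(build_webhook_request(url, event), timeout=timeout) as response:
        return response.status


# Building the request is safe to try offline; fire_webhook actually sends it.
request = build_webhook_request("http://example.com/hooks/commits", {"event": "code_post"})
print(request.get_method(), request.full_url, request.data)
```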


While there’s a lot of value in webhooks today, it’s the future that really interests me. Webhooks are composable. You’ll point a webhook at a site that will call other webhooks. It might process the data, record it, fork it off to multiple webhooks or something stranger still. Yahoo Pipes tried to do this, but ultimately you were limited to what Yahoo Pipes was designed to do. Webhooks can be integrated and implemented everywhere. They piggyback on the fundamental decentralized nature of the web.

I imagine a future where Twitter feed updates instantly call a webhook. I’ve pointed that webhook at a service that does Bayesian filtering. The filtering has been set up to determine if the tweet looks time-sensitive (“Anyone interested in getting dinner tonight?”) or time-insensitive (“Webhooks are cool.”). Time-sensitive posts call another webhook, this time set to SMS my phone. Note that nowhere in this future am I writing any code. I don’t have to.

It’s important that we get to this level of customization for the masses. It’s also important for adoption that we use the web’s native verbs. We understand HTTP on a fundamental level. It’s simple, scales and makes sense.

You should care because webhooks will be ubiquitous. You should care because they’re going to reshape the internet. You should care because webhooks are the next step in the evolution of communication on the internet and nothing will be left untouched.

Continuous Deployment

2009 Feb 08

Alex has just written a refactoring of some website backend code. Since it was a small task, it’s committed and Alex moves on to the next feature.

When the code is deployed in production two weeks later it causes the entire site to go down. A one-character typo which was missed by automated tests caused a failure cascade reminiscent of the bad-old-days at Twitter. It takes eight hours of downtime to isolate the problem, produce a one-character fix, deploy it and bring production back up.

Alex curses luck, blames human fallibility and the inevitable cost of software engineering, and moves on to the next task.

This story is the day-to-day of most startups I know. It sucks. Alex has a problem and she doesn’t even know it. Her development practices are unsustainable. “Stupid mistakes” like the one she made happen more frequently as the product grows more complex and as the team gets larger. Alex needs to switch to a scalable solution.

Before I get to the solution, let me tell you about some common non-solutions. While these are solutions to real problems, they aren’t the solution to Alex’s problem.

  1. More manual testing. This obviously doesn’t scale with complexity. This also literally can’t catch every problem, because your test sandboxes or test clusters will never be exactly like the production system.

  2. More up-front planning. Up-front planning is like spices in a cooking recipe. I can’t tell you how much is too little and I can’t tell you how much is too much. But I will tell you not to have too little or too much, because those definitely ruin the food or product. The natural tendency of over planning is to concentrate on non-real issues. Now you’ll be making more stupid mistakes, but they’ll be for requirements that won’t ever matter.

  3. More automated testing. Automated testing is great. More automated testing is even better. No amount of automated testing ensures that a feature given to real humans will survive, because no automated tests are as brutal, random, malicious, ignorant or aggressive as the sum of all your users will be.

  4. Code reviews and pairing. Great practices. They’ll increase code quality, prevent defects and educate your developers. While they can go a long way to mitigating defects, ultimately they’re limited by the fact that while two humans are better than one, they’re still both human. These techniques only catch the failures your organization as a whole already was capable of discovering.

  5. Ship more infrequently. While this may decrease downtime (things break and you roll back), the cost on development time from work and rework will be large, and mistakes will continue to slip through. The natural tendency will be to ship even more infrequently, until you aren’t shipping at all. Then you’re forced to do a total rewrite. Which will also be doomed.

So what should Alex do? Continuously deploy. Every commit should be instantly deployed to production. Let’s walk through her story again, assuming she had such an ideal implementation of Continuous Deployment. Alex commits. Minutes later warnings go off that the cluster is no longer healthy. The failure is easily correlated to Alex’s change and her change is reverted. Alex spends minimal time debugging, finding the now obvious typo with ease. Her changes still caused a failure cascade, but the downtime was minimal.

This is a software release process implementation of the classic Fail Fast pattern. The closer a failure is to the point where it was introduced, the more data you have to correct for that failure. In code Fail Fast means raising an exception on invalid input, instead of waiting for it to break somewhere later. In a software release process Fail Fast means releasing undeployed code as fast as possible, instead of waiting for a weekly release to break.
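For the code-level version of Fail Fast, here’s a small illustrative example. The function is invented for this post; the point is only that bad input is rejected at the boundary instead of surfacing as a confusing failure somewhere downstream.

```python
def parse_port(value):
    """Validate a port number up front instead of failing later at connect time."""
    port = int(value)  # raises ValueError immediately on non-numeric input
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


print(parse_port("8080"))

try:
    parse_port("80800")  # a typo caught here, right where it was introduced
except ValueError as error:
    print("rejected:", error)
```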

Continuous Deployment is simple: just ship your code to customers as often as possible. Maybe today that’s weekly instead of monthly, but over time you’ll approach the ideal and you’ll see the incremental benefits along the way.

Blood Pact Blogging

2009 Feb 07

I always think to myself “That guy is wrong, and I could’ve written a much better post on that topic” …and yet it never happens. So I’m stating it here for everyone to see: I’m going to write a useful, hopefully witty and interesting blog post every day for a whole month.

Consider it reverse cold turkey quitting.

It’d be stupid if I were doing it alone, but I’ve managed to convince 9 other people to make the same pact with me. I’ve even gone ahead and picked an overly dramatic name for the event, “Blood Pact Blogging”. I mean if ten people agree it must be smart, right?

I’m not sure what’s actually going to happen. Will we make it? How many will fall? Will it suck?

For posterity, here’s what I hope we each accomplish:

  • Get better at writing.
  • Write more in general.
  • Writing comes more easily.
  • Publishing comes more easily.
  • Generate some traffic, readers, and discussion.
  • Learn something.
  • Have fun.
  • Write under pressure.
  • Write about something we didn't think we'd write about.

Should I only experience a couple of these bullet points, this will have been worth it. And along the way, a whole bunch of my friends will have written a whole bunch of neat things!

Wish us luck, and if you’re interested in following our progress go to