Healthy Webapps through Continuous Introspection
by Erik van Zijst (speaking)
Objective
This talk explores the challenges of identifying and solving a whole series of typical performance problems affecting Python-based webapps.
I'll be looking specifically at those issues that only manifest themselves under the load, high levels of concurrency, runtime dynamics and configuration of your actual production environment, but elude QA and testing.
I'll be using plenty of real examples from Bitbucket, which itself is written in Python.
Description
Every application has its hotspots -- small portions of code that consume considerably more resources than all of the other code combined.
Python webapps are no different. Some pages, invoked with just the right, or wrong, input, can bring a server to its knees, hogging the CPU and taking many seconds, or in extreme cases even minutes, to render. With workers tied up, the whole system can then become slow to respond, or collapse altogether.
Many webservers have a crude built-in failsafe to prevent this: they automatically kill workers that fail to complete their requests in time. As a result, you may not fully appreciate, or indeed realize at all, that you are routinely serving 500 pages, denying users access to your service, or leaving uncommitted database transactions -- possibly even slowly corrupting data. Workers killed by force leave virtually no forensic traces, so even when issues are suspected, it's hard to pin them down.
The cause behind these hotspots can be poorly generated SQL queries from an ORM, an algorithm with non-linear complexity, excessive disk or network IO, or lock contention in the database -- to name just a few.
Oftentimes these problems escape a developer's attention, as dev and test environments simply don't have the dataset, level of concurrency or sheer size of the real thing.
In this talk we'll address the challenges of tuning your webapp with continuous automatic runtime inspection tools, including the homegrown Dogslow. We'll uncover the pages that consume disproportionate amounts of time and cycles to complete, as well as the pages that get killed altogether.
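As a rough illustration of the passive side of this, the sketch below wraps a WSGI application and records any request that takes longer than a threshold. It is not taken from the talk or from Dogslow: `TimingMiddleware`, its `threshold` and `clock` parameters and the `slow` list are hypothetical names chosen for this example.

```python
import logging
import time

logger = logging.getLogger("slow_requests")


class TimingMiddleware:
    """WSGI middleware that records requests slower than a threshold.

    Hypothetical sketch: it times only the call into the wrapped app,
    not the iteration of a streamed response body.
    """

    def __init__(self, app, threshold=1.0, clock=time.monotonic):
        self.app = app
        self.threshold = threshold  # seconds; an arbitrary default
        self.clock = clock          # injectable so the timing can be tested
        self.slow = []              # (path, seconds) pairs; unbounded, for illustration only

    def __call__(self, environ, start_response):
        started = self.clock()
        try:
            return self.app(environ, start_response)
        finally:
            elapsed = self.clock() - started
            if elapsed >= self.threshold:
                path = environ.get("PATH_INFO", "?")
                self.slow.append((path, elapsed))
                logger.warning("slow request: %s took %.2fs", path, elapsed)
```

A middleware like this only sees requests that actually finish. A worker killed by the webserver never reaches the `finally` block, which is why passive timing alone cannot account for the pages that get killed.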
We'll discuss several ways to help you identify and eliminate the hotspots, both passively, through monitoring alone, and actively, by selectively interrupting workers before they get killed. We'll also examine how to effectively interpret the automatically collected forensic evidence.
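One way to collect evidence from a worker before it is killed is a watchdog timer that snapshots the worker's stack once a deadline passes. The following is a minimal sketch of that general technique, assuming a threaded worker. It is not Dogslow's actual implementation, and `Watchdog`, `timeout` and `report` are hypothetical names.

```python
import sys
import threading
import traceback


class Watchdog:
    """Report the calling thread's stack if the block runs past a deadline."""

    def __init__(self, timeout, report):
        self.timeout = timeout  # seconds before the snapshot is taken
        self.report = report    # callable that receives the formatted stack
        self._timer = None

    def __enter__(self):
        target_id = threading.get_ident()  # the thread handling the request
        self._timer = threading.Timer(
            self.timeout, self._snapshot, args=(target_id,)
        )
        self._timer.daemon = True
        self._timer.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        # The request finished (or failed) in time: no report is needed.
        self._timer.cancel()
        return False

    def _snapshot(self, target_id):
        # Runs in the timer thread; looks up the still-running request thread.
        frame = sys._current_frames().get(target_id)
        if frame is not None:
            self.report("".join(traceback.format_stack(frame)))
```

Wrapping request handling in `with Watchdog(25, logger.error): ...`, with the timeout set somewhat below the webserver's own kill timeout, would leave a stack trace showing where the request was stuck. This version observes the worker without stopping it.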
Speaker bio
Erik has been a passionate software professional for nearly 15 years, subscribing to the idea that software is a craft, not an exact science. Launching his career in his native Amsterdam in 1999, he served as chief architect for a financial market data startup until it was acquired in 2006, designing and coding both a real-time Europe-wide IP distribution platform for stock quotes and a stock and options trading platform.
In 2007 he co-founded a Palo Alto, CA-based online real-time video distribution startup and acquired his first patent. In a deliberate move back to more hands-on coding, Erik joined Atlassian in 2008 and relocated to its headquarters in Sydney, Australia. Initially working on FishEye and Crucible, he later worked on various cross-product integration projects and in 2010 joined the newly formed team to run Bitbucket after its acquisition.
Currently based out of San Francisco, CA, Erik continues to work on Bitbucket as one of its senior devs, with a special interest in large-scale scalability and performance challenges. His efforts in these areas have resulted in several successful open source projects, relied on by Bitbucket as well as many other sites out there.