Rackspace Open Sources Dreadnot, a Continuous Deployment tool

Today we open sourced Dreadnot, our take on a Continuous Deployment tool. Details are posted over on the Rackspace Cloud Blog

Source is up on github.com/racker/dreadnot.

Posted in Uncategorized | Leave a comment

2011 Timecards

Work Project:


Hobby Project:


2012 Goal

Finish the hobby project.

Created using Dustin’s git-timecard.

Posted in Uncategorized | Leave a comment

Write Logs for Machines, use JSON

Logging for Humans

A printf style format string is the de facto method of logging for almost all software written in the last 20 years. This style of logging crosses almost all programing language boundaries. Many libraries build upon this, adding log levels and various transports, but they are still centered around a formated string.

I believe the widespread use of format strings in logging is based on two presumptions:

  1. The first level consumer of a log message is a human.
  2. The programer knows what information is needed to debug an issue.

I believe these presumptions are no longer correct in server side software.

An example of the problem

An example is this classic error message inside the Apache HTTP Server. The following code is called any time a client hits a URL that doesn’t exist on the file system:

ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r,
  "File does not exist: %s", r->filename);

This would generate a log message like the following in your error.log:

[Mon Dec 26 09:14:46 2011] [info] [client 50.57.61.4]
  File does not exist: /var/www/no-such-file

This is fine for human consumption, and for decades people have been writing Perl scripts to munge it into fields for a computer to understand too. However, the first time you add a field, for example the HTTP User-Agent header, it would break most of those perl scripts. This is one example of where building a log format that is optimized for computer consumption starts to make sense.

Another problem is when you are writing these format string log messages, you don’t always know what information people will need to debug the issue. Since you are targeting them for human consumption you try to reduce the information overload, and you make a few guesses, like the path to the file, or the source IP address, but this process is error prone. From my experience in the Apache HTTP server this would mean opening GDB to trace what is happening. Once you figure out what information is relevant, you modify the log message to improve the output for future users with the relevant information.

What if we logged everything into JSON?

If we produced a JSON object which contained the same message, it might look something like this:

{
    "timestamp": 1324830675.076,
    "status": "404",
    "short_message": "File does not exist: /var/www/no-such-file",
    "host": "ord1.product.api0",
    "facility": "httpd",
    "errno": "ENOENT",
    "remote_host": "50.57.61.4",
    "remote_port": "40100",
    "path": "/var/www/no-such-file",
    "uri": "/no-such-file",
    "level": 4,
    "headers": {
        "user-agent": "BadAgent/1.0",
        "connection": "close",
        "accept": "*/*"
    },
    "method": "GET",
    "unique_id": ".rh-g2Tm.h-ord1.product.api0.r-axAIO3bO.c-9210.ts-1324830675.v-24e946e"
}

This example gives a much richer picture of information about the error. We now have data like the User-Agent in an easily consumable form, we could much more easily figure out that BadAgent/1.0 is the cause of our 404s. Other information like the source server and a mod_unique_id hash can be used to correlate multiple log entries across the lifetime of an request.

This information is also expandable. As the knowledge of what our product needs to log increases, it is easy to add more data, and we can safely do this without breaking our System Admins precious Perl scripts.

Why now?

This idea is not new, it has just never been so easily accessible. Windows has had “Event Logs” for a decade, but in the more recent versions it uses XML. The emergence of JSON as a relatively compact serialization format that can be generated and parsed from almost any programming languages means it makes a great light weight interchange format.

Paralleling the big data explosion, is a growth in machine and infrastructure size. This means logging and the ability to spot errors in a distributed system has become even more valuable.

Logging objects instead of a format string enables you to more easily index and trace operations across hundreds of different machines and different software systems. With traditional format strings it is too fail deadly for the programmer to not log all the necessary information for a later operator to trace an operation.

Generating JSON with Log Magic

Log Magic is a small and fast logging library for Node.js that I wrote early on for our needs at Rackspace. It only has a few features, and it is only about 300 lines of code.

Log Magic has the concept of a local logger instance, which is used by a single module for logging. A logger instance automatically populates information like the the facility in a log entry. Here is an example of creating a logger instance for a module named 'myapp.api.handler and using it:

var log = require('logmagic').local('myapp.api.handler');

exports.badApiHandler = function(req, res) {
  log.dbg("Something is wrong", {request: req});
  res.end();
};

The second feature that Log Magic provides is what I call a “Log Rewriter”. This enables the programmer to just consistently pass in the request object, and we will take care of picking out the fields we really want to log. In this example, we ensure the logged object always has an accountId and txnId fields set:

var logmagic = require('logmagic');
logmagic.addRewriter(function(modulename, level, msg, extra) {
  if (extra.request) {
    if (extra.request.account) {
      extra.accountId = extra.request.account.getKey();
    }
    else {
      /* unauthenticated user */
      extra.accountId = null;
    }
    extra.txnId = extra.request.txnId;
    delete extra.request;
  }
  return extra;
});

The final feature of Log Magic is dynamic routes and sinks. For the purposes of this article, we are mostly interested in the graylog2-stderr, which outputs a GELF JSON format message to stderr:

var logmagic = require('logmagic');
logmagic.route('__root__', logmagic['DEBUG'], "graylog2-stderr");

With this configuration, if we ran that log.dbg example from above, we would get a message like the following:

{
    "version": "1.0",
    "host": "product-api0",
    "timestamp": 1324936418.221,
    "short_message": "Something is wrong",
    "full_message": null,
    "level": 7,
    "facility": "myapp.api.handler",
    "_accountId": "ac42",
    "_txnId": ".rh-3dT5.h-product-api0.r-pVDF7IRM.c-0.ts-1324936588828.v-062c3d0"
}

Other implementations

There are many other libraries that are starting to emerge that can output logs in a JSON or GELF format:

  • winston: (Node.js) A more complete (or complex?) logging module compared to Log Magic, but the prolific crew at Nodejitsu have done a great job.
  • graypy: (Python) A graylog2 logger that interacts with the standard Python logging module.
  • gelf4j (Java) We use a modified version of this library that logs to stderr instead of using UDP.

The Transaction Id

One field we added very early on to our system was what we called the “Transaction Id” or txnId for short. In retrospect, we could of picked a better name, but this is essentially a unique identifier that follows a request across all our of services. When a User hits our API we generate a new txnId and attach it to our request object. Any requests to a backend service also include the txnId. This means you can clearly see how a web request is tied to multiple backend service requests, or what frontend request caused a specific Cassandra query.

We also send the txnId to our user’s in our 500 error messages and the X-Response-Id header, so if a user reports an issue, we can quickly see all of the related log entries.

While we treat the txnId as an opaque string, we do encode a few pieces of information into it. By putting the current time and the origin machine into the txnId, even if we can’t figure out what went wrong from searching for the txnId, we have a place to start deeper debugging.

Transporting Logs

Since our product spans multiple data centers, and we don’t trust our LAN networking, our primary goal is that all log entries hit disk on their origin machine first. Some people have been using UDP or HTTP for their first level logging, and I believe this is a mistake. I believe having a disk default that consistently works is critical in a logging system. Once our messages have been logged locally, we stream them to an aggregator which then back hauls the log entries to various collection and aggregation tools.

Since all of our services run under runit, our programs simply log their JSON to stderr, and svlogd takes care of getting the data into a local file. Then we use a custom tool written in Node.js that is like running a tail -F on the log file, sending this data to a local Scribe instance. The Scribe instance is then responsible for transporting the logs to our log analyzing services.

For locally examining the log files generated by svlogd, we also made a tool called gelf-chainsaw. Since JSON strings cannot contain a newline, the log format is easy to parse, you just split up the file by \n, and try to JSON.parse each line. This is useful for our systems engineers when they are on a single machine, trying to debug an issue.

Collecting, Indexing, Searching

Once the logs crossing machines, there are many options to process those logs. Some examples that can all accept JSON as their input format:

For Rackspace Cloud Monitoring we are currently using Graylog2 with a patch to support Scribe as a transport written by @wirehead.

Bellow is an example of searching for specific txnId in our system in Graylog2:

While this example is simple, we have some situations where a single txnId spans multiple services, and the ability to trace all of them transparently is critical in a distributed system.

Conclusion

Write your logs for machines to process. Build tooling around those logs to transform them into something that is consumable by a human. Humans cannot process information in the massive flows that are created by concurrent and distributed systems. This means you should store the data from these systems in a format that enables innovative and creative ways for it to be processed. Right now, the best way to do that is to log in JSON. Stop logging with format strings.

Posted in Uncategorized | 18 Comments

The Switch: Python to Node.js

In my previous post, I glossed over our team switching from Python to Node.js. I kept it brief because the switch wasn’t the focus of the post, but since I believe I am being misunderstood, I will explain it in depth:

Cloudkick was primarily written in Python. Most backend services were written in Twisted Python. The API endpoints and web server were written in Django, and used mod_wsgi. We felt that while we greatly value the asynchronous abilities of Twisted Python, and they matched many of our needs well, we were unhappy with our ability to maintain Twisted Python based services. Specifically, the deferred programming model is difficult for developers to quickly grasp and debug. It tended to be ‘fail’ deadly, in that if a developer didn’t fully understand Twisted Python, they would make many innocent mistakes. Django was mostly successful for our needs as an API endpoint, however we were unhappy with our use of the Django ORM. It created many dependencies between components that were difficult to unwind later. Cloud Monitoring is primarily written in Node.js. Our team still loves Python, and much of our secondary tooling in Cloud Monitoring uses Python.

This attracted a few tweets, accusing various things about our developers, but I want to explore the topic in depth, and 140 characters just isn’t going to cut it.

Just how much Python did Cloudkick have?

We had about 140,000 lines of Python in Cloudkick. We had 40 Twisted Plugins. Each Plugin roughly corresponds to a backend service. About 10 of them are random DevOps tools like IRC bots and the like, leaving about 30 backend services that dealt with things in production. We built most of this code in a 2.5 year experience, growing the team from the 3 founders to about a dozen different developers. I know there are larger Twisted Python code bases out there, but I do believe we had a large corpus of experiences to build our beliefs upon.

This wasn’t just a weekend hack project and a blog post about how I don’t like deferreds, this was 2.5 years of building real systems.

It worked.

We were acquired.

Our Python code got the job done. We built a product amazingly quickly, built our users up, and were able to iterate quickly. I meant it when I said our team still still loves Python.

What I didn’t mention in the original post, is that after the acquisition, the Cloudkick team was split into two major projects — Cloud Monitoring, which the previous post was about, and another unannounced product team. This other product is being built in Django and Twisted Python. Cloud Monitoring has very different requirements moving forward — our goals are to survive and keep working after a truck drives into our data centers, and this is very different from how the original Cloudkick product was built.

What happened to Python then?

Simply put, our requirements changed. These new requirements for Cloud Monitoring included:

  • Multi-Region availability / durability
  • Multiple order of magnitude increases in servers monitored
  • Scalable system, that can still be used 5 year from now. (Remember Rackspace Cloud grew 89% year over year right now)

Cloudkick was built as a startup. We took shortcuts. It scaled pretty damn well, but even if we changed nothing in our technology stack, it was clear we needed to refresh our architecture and how we modeled data.

The mixing of both blocking-world Django, and Twisted Python also created complications. We would have utility code that could be called from both environments. This meant extensive use of deferToThread in order to not block Twisted’s reactor thread. This created an overhead for every programmer to understand both how Twisted worked, and how Django worked, even if your project in theory only involved the web application layer. Later on, we did build enough tooling with function decorators to reduce the impact of these multiple environments, but the damage was done.

I believe our single biggest mistake from a technical side was not reigning in our use Django ORM earlier in our applications life. We had Twisted services running huge Django ORM operations inside of the Twisted thread pool. It was very easy to get going, but as our services grew, not only was this not very performant, and it was extremely hard to debug. We had a series of memory leaks, places where we would reference a QuerySet, and hold on to it forever. The Django ORM also tended to have us accumulate large amounts of business logic on the model objects, which made building strong service contracts even harder.

These were our problems. We dug our own grave. We should’ve used SQLAlchemy. We should’ve built stronger service separations. But we didn’t. Blame us, blame Twisted, blame Django, blame whatever you like, but thats where we were.

We knew by April 2011 that the combination of new requirements and a legacy code base meant we needed to make some changes, but we also didn’t want to fall into a “Version 2.0″ syndrome and over engineering every component.

Picking the Platform.

We wanted some science behind this kind of decision, but unfortunately this decision is about programming languages, and everyone had their own opinions.

We wanted to avoid “just playing with new things”, because at the time half our team was enamored with Go Lang. We were also very interested in Python Gevent, since OpenStack Nova had recently switched to it from Twisted Python.

We decided to make a spreadsheet of the possible environments we would consider using for our next generation product. The inputs were:

  • Community
  • Velocity
  • Correctness (aka, static typing-like things)
  • Debuggability/Tooling
  • Downtime/Compile Time
  • Libraries (Standard/External)
  • Testability
  • Team Experience
  • Performance
  • Production

We setup the spreadsheet so we could change the weight of each category. This let us play with our feelings, what if we only cared about developer velocity? What if we only cared about testability?

Our conclusion was, that it came down to was a choice between the JVM platform and Node.js. It is obvious that the JVM platform is one of the best ways to build large distributed systems right now. Look at everything Twitter, LinkedIn and others are doing. I personally have serious reservations about investing on top of the JVM, and Oracles recent behavior (here, here) isn’t encouraging.

After much humming and hawing, we picked Node.js.

After picking Node.js, other choices like using Apache Cassandra for all data storage were side effects — there was nothing like SQL Alchemy for Node.js at the time, so we were on our own either way, and Cassandra gave us definite improvements in operational overhead of compared to running a large number of MySQL servers in a master/slave configuration.

Node.js? It has nested callbacks everywhere, thats ugly!

I think this is one of the first complaints people lob at Node.js when they just start. It makes a regular occurrence on the users mailing list — people think they want coroutines, generators or fibers.

I believe they are wrong.

The zen of Node.js is its minimalist core, both in size and in features. You can read the core lib Javascript in a day, and one more day for the C++. Don’t venture into v8 itself, that is a rabbit hole, but you can pretty quickly understand how Node.js itself works.

Our experience was that we just needed to pick one good tool to contain callback flows, and use it everywhere.

We use @Caolan’s excellent Async library. Our code is not 5 level deep nested callbacks.

We currently have about 45,000 lines of Javascript in our main repository. In this code base, we have used the async library as our only flow control library. Our current use of the library in our code base:

  • async.waterfall: 74
  • async.forEach: 55
  • async.forEachSeries: 21
  • async.series: 8
  • async.parallel: 4
  • async.queue: 3

I highly suggest, that if you are unsure about Node.js and are going to do an experiment project, make sure you use Async, Step, or one of the other flow control modules for your experiment. It will help you better understand how most larger Node.js applications are built.

Closing

In the end, we had new requirements. We re-evaluated what platforms made sense for us to build a next generation product on. Node.js came out on top. We all have our biases, and our preferences, but I do believe we made a reasonable choice. Our goal in the end is still to move our product forward, and improve our business. Everything else is just a distraction, so pick your platform, and get real work done.

PS: If you haven’t already read it, read SubStack’s great the node.js aesthetic post.

Posted in Uncategorized | 31 Comments

Technology behind Rackspace Cloud Monitoring

Earlier this week we announced a new product: Rackspace Cloud Monitoring. It is just starting as a (free) private beta, so if you want to try it out, be sure to sign up via the survey here.

Transition from Cloudkick Technology

Rackspace Cloud Monitoring is based on technology built originally for the Cloudkick product. Some core concepts and parts of the architecture originated from Cloudkick, but many changes were made to enable Rackspace’s scalability needs, improve operational support, and focus the Cloud Monitoring product as an API driven Monitoring as a Service, rather than all of Cloudkick’s Management and Cloud Server specific features.

For this purpose, Cloudkick’s product was successful in vetting many parts of the basic architecture, and serving as a basis on which to make a reasonable second generation system. We tried to make specific changes in technology and architecture that would get us to our goals, but without falling into an overengineering trap.

Cloudkick was primarily written in Python. Most backend services were written in Twisted Python. The API endpoints and web server were written in Django, and used mod_wsgi. We felt that while we greatly value the asynchronous abilities of Twisted Python, and they matched many of our needs well, we were unhappy with our ability to maintain Twisted Python based services. Specifically, the deferred programming model is difficult for developers to quickly grasp and debug. It tended to be ‘fail’ deadly, in that if a developer didn’t fully understand Twisted Python, they would make many innocent mistakes. Django was mostly successful for our needs as an API endpoint, however we were unhappy with our use of the Django ORM. It created many dependencies between components that were difficult to unwind later. Cloud Monitoring is primarily written in Node.js. Our team still loves Python, and much of our secondary tooling in Cloud Monitoring uses Python. [EDIT: See standalone post: The Switch: Python to Node.js]

Cloudkick was reliant upon a MySQL master and slaves for most of its configuration storage. This severely limited both scalability, performance and multi-region durability. These issues aren’t necessarily a property of MySQL, but Cloudkick’s use of the Django ORM made it very difficult to use MySQL radically differently. The use of MySQL was not continued in Cloud Monitoring, where metadata is stored in Apache Cassandra.

Cloudkick used Apache Cassandra primarily for metrics storage. This was a key element in keeping up with metrics processing, and providing a high quality user experience, with fast loading graphs. Cassandra’s role was expanded in Cloud Monitoring to include both configuration data and metrics storage.

Cloudkick used the ESPER engine and a small set of EPL queries for its Complex Event Processing. These were used to trigger alerts on a monitoring state change. ESPER’s use and scope was expanded in Cloud Monitoring.

Cloudkick used the Reconnoiter noitd program for its poller. We have contributed patches to the open source project as needed. Cloudkick borrowed some other parts of Reconnoiter early on, but over time replaced most of the Event Processing and data storage systems with customized solutions. Reconnoiter’s noitd poller is used by Cloud Monitoring.

Cloudkick used RabbitMQ extensively for inter-service communication and for parts of our Event Processing system. We have had mixed experiences with RabbitMQ. RabbitMQ has improved greatly in the last few years, but when it breaks we are at a severe debugging disadvantage, since it is written in Erlang. RabbitMQ itself also does not provide many primitives we felt we needed when going to a fully multi-region system, and we felt we would need to invest significantly in building systems and new services on top of RabbitMQ to fill this gap. RabbitMQ is not used by Cloud Monitoring. Its use cases are being filled by a combination of Apache Zookeeper, point to point REST or Thrift APIs, state storage in Cassandra and changes in architecture.

Cloudkick used an internal fork of Facebook’s Scribe for transporting certain types of high volume messages and data. Scribe’s simple configuration model and API made it easy to extend for our bulk messaging needs. Cloudkick extended Scribe to include a write ahead journal and other features to improve durability. Cloud Monitoring continues to use Scribe for some of our event processing flows.

Cloudkick used Apache Thrift for some RPC and cross-process serialization. Later in Cloudkick, we started using more JSON. Cloud Monitoring continues to use Thrift when we need strong contracts between services, or are crossing a programing language boundary. We use JSON however for many data types that are only used within Node.js based systems.

Node.js ecosystem

We have been very happy with our choice of using Node.js. When we started this project, I considered it one of our biggest risks to being successful — what if 6 months in we are just mired in a new language and platform, and regretting sticking with the known evil of Twisted Python. The exact opposite happened. Node.js has been an awesome platform to build our product on. This is in no small part to the many modules the community has produced.

Here it is, the following is the list of NPM modules we have used in Cloud Monitoring, straight from our package.json:

Now that our product is announced, I’m hoping to find a little more time for writing. I will try to do more posts about how we are using Node.js, and the internals of Rackspace Cloud Monitoring’s architecture.

PS: as always, we are hiring at our sweet new office in San Francisco, if you are interested, drop me a line.

Posted in Uncategorized | 6 Comments

Node.js interview @ VentureBeat

Node JS from Venturebeat on Vimeo.

I was interviewed a few weeks ago by @JolieOdell from VentureBeat about Node.js. Video is here.

Posted in Uncategorized | Leave a comment

Technology enables better Support Experiences

My MacBook Air is my favorite laptop ever. The combination of size and enough power for most development tasks, makes it perfect for my job right now. For the last week though, whenever I’d peg the CPU, by doing things like watching HD Flash Video, it would start dropping video frames and becoming unresponsive. I also noticed a ‘clicking’ sound coming from it. I think thats a bad sign, probably the only moving part in the Air, a single lonely Fan, was dead, and the machine is overheating.

Last night I was trying to watch Netflix, and it would play for about 15 seconds, then overheat, and become unwatchable. My procrastination wasn’t going to fix the fan. I went to Apple’s San Francisco Store site and to make a reservation or the Genius Bar. Wooo! A single spot left at 11:30 this morning, perfect, and my reservation was made in under a minute.

I took my Air into the store, and took a seat to await my turn. Of course, a store employee found my appointment on a custom app on his iPhone, and told me it will be about 10 minutes before they will be able to see me. No problem, I grab a seat and keep reading Children of the Sky on my Kindle.app.

Anthony finishes with the previous customer and it’s my turn. I explain how my Air is behaving (avoiding saying the conclusion I’d already come to). He says yep, probably a dead fan or the sensor, and my Air is overheating. He then procedes to grab a USB ethernet adaptor and a blue ethernet cable that goes behind the counter. He reboots the Air, holding down the option key. He is NetBooting my Air, into a custom Apple diagnostics program. It all happens in under two minutes.

The conclusion is in, they will need to replace the fan, but luckily it is under warranty. It’s gonna be five days, which is too much as I fly to Texas on Monday. Anthony toggles a few buttons on his iPad, and prints me out a case report. Now I am back at the office finishing a Time Machine backup, so I can restore to a spare MacBook Pro. Once I’m up and running on the Pro, I’ll drop off my Air at the Store tomorrow so it can get repaired.

Now, this whole little story isn’t that exciting. It probably happens hundreds of times per day across Apple’s stores and business.

I was amazed when he started NetBooting my Air — they had really thought about this problem. Do you really want to download a disk image onto a customer’s computer and then try to debug things from there? No, it has a ton of problems, customers don’t want random files downloaded on to their disk, and there might be problems in the OS install itself — the solution of NetBooting is an elegant application of good technology to enable the Apple employees at the Genius Bar to get their job done quickly and easily.

Looking at the whole experience, it was driven by technology deeply integrated and enhancing a support experience. The technology got me into the store easily, the employee doing in-store triage already knew about my appointment, easy diagnostics that weren’t invasive to my computer, and integrated case management. Someone at Apple thought deeply about each step in my experience, and put technology where it made sense, and people where it didn’t.

What I found more interesting however, is how this experience aligns with Rackspace’s own beliefs. Cloudkick was acquired almost a year ago, and we tried to provide the best support to our customers possible. This largely meant the engineers who wrote the code responded to the support issues. In the last year, it has been interesting to see how Rackspace has to scale this process to thousands of employees — it isn’t a trivial problem. Delivering a great support experience requires both great technology, and great people.

In the end, this is all about building technology and software to produce a better outcome for the customer and the business. Building technology for technology’s sake can sure be fun, but it is a million times more fulfilling when that technology enables a better experience for a person.

It’s also what I find interesting about my job at Rackspace. We aren’t trying to just build software so that a single Racker can handle 100 customers instead of 50. We are building software so that our support Rackers can create a genuinely better experience for our customers.

Posted in Uncategorized | 7 Comments

OpenSSL memory use in Node.js

Last Thursday I went to the Joyent office for Node Office Hours — I wanted to talk to Isaac about running a private NPM registry.

Isaac answered my questions about private NPM registries, but Matt Ranney explained a more interesting problem. He was dialed into a conference call line for Node Office hours (he is currently living in Hawaii.) Matt explained that he is using the new TLS module in Node v0.4, and it was using 1 megabyte of memory per connection! Using 1mb per connection made us think there must be something wrong in Node.js itself, so we wrote up a simple test case client and server.

The TLS server says “welcome” to a newly connected client, and then echos anything it receives back to the client:

var tls = require('tls');
var fs = require('fs');

var options = {
  key: fs.readFileSync('test_key.pem'),
  cert: fs.readFileSync('test_cert.pem'),
};

tls.createServer(options, function (s) {
  s.write("welcome!\n");
  console.log('got client');
  s.pipe(s);
}).listen(8000);
console.log('127.0.0.1:8000');

The client spawns 200 connections, and writes ‘hello’ to the server every 5 seconds for each connection:

var tls = require('tls');

function go() {
  var s = null;
  s = tls.connect(8000, '127.0.0.1', function() {
    setInterval(function() {
      s.write('hello\n');
    }, 5000);
  });
};

var i;
for (i = 0; i<200; i++) {
  go();
}

Running the client and server on my laptop, the server used almost 200 megabytes of memory. This meant the high memory usage per-connection is not just a problem with Matt's application, but something deeper in Node.js.

At this point the Linux users are trembling, unsure how to debug the issue. Luckily, I still use Steve's operating system, and I fired up Instruments.app to start taking a look at the problem.

Those Pesky Certificate Authority Certificates

Using the builtin Allocations instrument, I was looking for how memory was being used. I expected to just see a large blob of allocation being done inside v8, since Instruments and DTrace that power it do not have visibility inside the VMs internals. Unexpectedly, it quickly became apparent our main use of memory was the node::crypto::SecureContext::AddRootCerts function. After going back to the Javascript, we could see that for every new TLS connection being made, Node was re-parsing the list of root-certificate authorities from their string forms, into the X509_STORE object used by OpenSSL:

Just by commenting out one line of Javascript, we were able to reduce memory usage by 20%, and increased the performance of the HTTPS server from 70 requests/second to 700 requests/second.

Ryan changed the Node crypto code to use a single global CA store for the default root certificates in 5c35dff41. The current fix is a hack, the correct long term fix is to better use SSL_new with a shared SSL_CTX*, but that will require a larger refactoring of node_crypto.cc.

Zlib: Big Buffers "R" us

We thought after fixing the obvious bug in parsing the certificates, we might be done. We went back to the profiling in Instruments.app, and discovered memory usage still was over 500kb per connection:

We dove into the OpenSSL codebase, and found that Zlib Compression is enabled by default, and there isn't an easy way to turn it off. The documentation for SSL_COMP_add_compression_method says:

Unlike the mechanisms to set a cipher list, there is no method available to restrict the list of compression method on a per connection basis.

Recent versions of OpenSSL (>1.0.0) have added SSL_OP_NO_COMPRESSION that can be disable compression per-connection via SSL_CTX_set_options. However, most Linux distributions are still using variations of OpenSSL 0.9.8, meaning this option isn't viable for most users.

As an alternative, it is possible to zero out the global list of all OpenSSL compression methods after initializing OpenSSL:

void disable_openssl_compression() {
  STACK_OF(SSL_COMP)* comp_methods = SSL_COMP_get_compression_methods();
  sk_SSL_COMP_zero(comp_methods);
}

This is a terrible hack, but it enabled us to test the effect of disabling compression on the server side, and after seeing the results, I think it is worth it.

With this change, we ran the test again, and see an amazing thing, we are only using about 15 megabytes of memory, and we can actually see the v8 garbage collector doing work now:

Ship it to production

Going back to Matt's original problem, he is using Node.js as part of the server side infrastructure for Voxer, a communication application for mobile devices. He applied the patches, and provided this graph from Cacti showing the massive improvement in free memory:

What about Twisted Python?

At Rackspace/Cloudkick we use Twisted Python for many production services, including several with many thousands of open TLS connections. We had always just assumed the high memory usage was an issue in Twisted Python itself, not something in OpenSSL. We had tried looking at the memory usage using Dowser, but we never got far, as it always looked like Python wasn't misbehaving. This now makes sense, because the OpenSSL objects used in pyOpenSSL wouldn't be counted by Dowser.

pyOpenSSL is a limited binding of the most useful functions in OpenSSL, so it doesn't provide bindings to the functions that can manipulate the OpenSSL Compression settings. I decided to use the ctypes module to avoid needing to distribute a C based extension to disable OpenSSL compression:

def disableSSLCompression(self):
  try:
    import ctypes
    import glob
    openssl = ctypes.CDLL(None, ctypes.RTLD_GLOBAL)
    try:
      f = openssl.SSL_COMP_get_compression_methods
    except AttributeError:
      ssllib = sorted(glob.glob("/usr/lib/libssl.so.*"))[0]
      openssl = ctypes.CDLL(ssllib, ctypes.RTLD_GLOBAL)

    openssl.SSL_COMP_get_compression_methods.restype = ctypes.c_void_p
    openssl.sk_zero.argtypes = [ctypes.c_void_p]
    openssl.sk_zero(openssl.SSL_COMP_get_compression_methods())
  except Exception, e:
    log.msg('disableSSLCompression: Failed:')
    log.msg(e)

We deployed this to one Twisted Python Service as a test, and you can see the impact on memory use, dropping form 1.15 gigabytes to around 300 megabytes:

Conclusion

While compression in TLS can be helpful in some situations with bulk data transfers, the method by which OpenSSL has done it is disappointing. Enabling a feature by default that has such a large impact on memory usage, and not having a reasonable way to disable it is bad library design.

If you have specific applications using TLS that don't benefit from compression or have many thousands of mostly idle connections, consider disabling OpenSSL's compression methods to save yourself about 500kb of memory per connection.

PS: agl on HN mentioned that OpenSSL &gt 1.0.0 also has SSL_MODE_RELEASE_BUFFERS which greatly reduces memory usage of a connection -- this is already done in Node.js if available, but if you are updating older code be sure to add this too.

Posted in Uncategorized | 13 Comments

Rackspace acquires Cloudkick

Cloudkick has been acquired by Rackspace.

I am extremely excited.

The official post over on Cloudkick has all the important information.

It’s been a wild and crazy 17 months since I joined Cloudkick. I have learned a million things from the experience so far, and the adoption of Clouckidk by big customers really has been astounding.

Today I start a new chapter as a Racker.

Coverage elsewhere:

Posted in Uncategorized | 5 Comments

UDP in Node.js

At Node.js Camp yesterday on UDP in Node.js, and the video got recorded on ustream.

Slides for the talk are posted here.

All of the example code is posted up on github here.

Happy to answer any questions about node and udp here too.

PS, thanks to everyone at Joyent for making the Node.js Camp happen, it was a great little conference!

Posted in Uncategorized | 2 Comments

Enabling JNA in Apache Cassandra 0.6

What is it?

Java Native Access is a library to provide access to native function calls on top of the JVM — from its website:

JNA provides Java programs easy access to native shared libraries (DLLs on Windows) without writing anything but Java code—no JNI or native code is required. This functionality is comparable to Windows’ Platform/Invoke and Python’s ctypes. Access is dynamic at runtime without code generation

So, it lets Java easily call functions in a C library, without using JNI or other generated code, just like Python’s CTypes.

What does JNA do for Apache Cassandra?

Since Apache Cassandra is written in Java, it doesn’t have access to many operating system level optimizations, that are helpful when you are building essentially an on disk data storage system. JNA lets Cassandra access functions that otherwise wouldn’t be available in a pure JVM API.

Specific features enabled by JNA in Cassandra:

  • Since 0.6.2: JNA for mlockall. This prevents Linux from swapping out parts of the JVM that aren’t accessed frequently. Chris Goffinet reported a 13% performance improvement in his tests from this change. [CASSANDRA-1214]
  • Since 0.6.6: JNA for hard links, improving snapshots. Previously Cassandra would use the /bin/ln binary to create hard links to SSTables when creating a snapshot. On nodes with thousands of SSTables, this would take a very long time, because it had to fork+exec the JVM to run the new process. With JNA, Cassandra uses the link call directly. [CASSANDRA-1371]

Installing JNA

Debian & Ubuntu

Debian and Ubuntu have JNA in their apt repositories, so in theory it is easy to just install:

sudo apt-get install libjna-java

However, currently in Ubuntu, it only has version 3.1, while Cassandra wants 3.2.7. There is a github here with an updated deb packaging for the most release of JNA, from which you can build your own JNA deb package. If building JNA from source is too much of a pain (and it is, trust me), you can easily grab a binary for it from the Riptano Debian repository too.

The deb package will install the JNA jar file to /usr/share/java/jna.jar, but Cassandra only loads it if its in the class path. The easy way to do this is just create a symlink into your Cassandra lib directory:

ln -s /usr/share/java/jna.jar /ck/cassandra/lib/

Other Platforms

The JNA site has pre-built binaries that you can download. On OSX, I just downloaded the .jar file, and then copied it into my cassandra/lib directory for testing.

Non-Root Operation

If you are not running Cassandra as root, you will need to adjust a ulimit. The mlockall system call locks pages of memory into RAM which could adversely affect other processes on a machine. For this reason by default there is a 64kb limit on locked memory for most Linux machines.

If you run Cassandra with JNA, and see a log message like this:

Unable to lock JVM memory (ENOMEM).
This can result in part of the JVM being swapped out, especially with
mmapped I/O enabled. Increase RLIMIT_MEMLOCK or run Cassandra
as root.

Then you need to edit the limit.

If Cassandra is the only service on a machine, I suggest just setting the value to unlimited:

ulimit -l unlimited

Depending on how you start Cassandra, you might need to add this to your init scripts, so that it is set unlimited before dropping root — or setup limits.conf for the Cassandra user.

Verifying

If JNA is working, you will not see a log message about it. It currently only logs a message if it fails to load or encounters an error.


A special thanks to Brandon Williams on #cassandra for answering all my questions about JNA this afternoon!

Posted in Uncategorized | 4 Comments

Disappointed.

I will write a separate post about ApacheCon 2010 in Atlanta.

I write mostly about technical things that I have an opinion about. I enjoy writing those blog posts, and learning more from the commentary and feedback. I believe they can generate value

I generally avoid posting about politics, because they just make me unhappy. Writing more blog posts about politics won’t fix anything. I don’t believe they generate value.

I avoid unhappy topics in general. I don’t like mulling on unhappy things. I think that is how most humans feel too.

At ApahceCon this week, a member of the Apache Software Foundation’s Board, Noirin, was assaulted by another member of the community. Noirin has a post covering it on her blog, entitled “A hell of a time”.

I consider Noirin is a friend, we are both involved in Apache HTTP Server, both part of the ‘younger’ generation at Apache, and she was a colleague at Joost for a time.

I was at the bar in question. I didn’t see anything, as I was having a good time, hanging out with many friends around the table. I have met most of the people around the table in random locations all around the world. It wasn’t a seedy dive bar. It was adjacent to the Westin Hotel. We were all having a good night.

When I heard about this event on Friday, I was first in shock, then anger. Then it turned into disgust. Now I am just resolved in disappointment.

Disappointed that this could happen here. Happen within our community. Our community has plenty of flaws, but sexual assault of any kind if just unacceptable. This wasn’t an episode of Mad Men, set in the 1960s. This is the Apache community. Our community.

I know I made my mother angry a few times in my life. Not things I am proud of. I think the times I felt the worse is when she said she was disappointed in me. The feeling that I have failed. Anger is passing, disappointment is a judgement.

And that is how I feel now. Disappointed. This behavior is unacceptable in our community.

I don’t know how to fix the problem. I just know one thing, I am resolved to stamp it out of Apache. If you don’t like that, leave. You can stay home.

The Apache community talks much about “Community over Code“. We value community and the people far more than most open source and technology groups. All people must feel welcome and safe at Apache. We must make sure this never ever happens again, and the knowledge that this behavior is never acceptable is spread.

Posted in Uncategorized | 5 Comments

Interview about Libcloud at ApacheCon

I sat down yesterday with Cote from Redmonk here in Atlanta during ApacheCon. He recorded a “make all” podcast all about Libcloud, the Cloud, and Cloudkick:

I’d gotten quick briefing from him yesterday and was interested in talking more about the libcloud framework which is trying to wrap one API around all the different clouds out there. We also talk about the kinds of applications Paul has been seeing people put on the cloud and also the monitoring and management services that CloudKick providers.

Check it out!

Posted in Uncategorized | 2 Comments

The case for Gossip

More of a stream than a full blog post.

Lets say your business is a website, and this means downtime is unacceptable.

Lets also you don’t have ops people in 3 spread out timezones. (Hey, you aren’t Google)

This means at some point, your ops people go to sleep.

Ops people hate getting woken up at night.

Gossip can reduce the times ops teams are woken up. (If it doesn’t cause your website to be down for hours)

Gossip doesn’t mean you automatically determine the role of a machine.

Gossip is not autoscalling. Not Autogrowing. Not Auto-lots of things.

Gossip is discovering the current state of your machines.

It does mean you have a faster convergence time on failure, which reduces the cost of a failure, and reduces the pain inflicted on your ops people.

I believe you still want a human involved in many auto things. Use humans to add new machines, to change their roles, to do capacity planning.

Automatically determining where resources are within your network, not allocating them. Allocating and destroying automatically is a different topic, and should be addressed, differently.

Finding available resources is different than just configuration management. Availability is not a concept something like puppet or chef have.

They have variables, ways to acquire them, and high latency ways to propagate them.

The time to failure detection in a properly written gossip system can be sub-second, depending on your application’s needs.

That is order of magnitudes faster than a human, or a configuration management system, or a human changing a configuration management environment.

There are some alternatives you could use instead of gossip. They include things like load balancers, which can have their own host of issues.

You can use things like Zookeeper, but underneath they are just gossiping underneath.

You can use things like DNS, but now you are using DNS (insert dns rant here).

You can write your application to be more robust, and re-try different machines on failure, but then you generally are adding latency to an end user request. Making your application more robust, I believe is relevant, but a divergent topic. You’ll still want to wake up an ops person though, because adding latency to even a subset of requests is not acceptable.

Gossip isn’t perfect. Gossip doesn’t solve everything. Gossip, like most things, can be abused and overused. I believe Gossip however, used for a limited goal of reducing the time to detection and the impact of failures, is overall a positive mechanism for most infrastructures.

Posted in Uncategorized | 2 Comments

three things

Three things I would like to hear more about:

  • DIRT: Data Intensive Real Time, coined by Bryan. I’ve been thinking about how we use orbited to do this, and I think a real revolution in web application infrastructure is just around the corner as more people get involved with Node.js type projects.
  • Django non-rel: An odd choice I think, but its these kind of projects that are taking a hard look at the ORMs most programmers have come to expect, and how to make their nice features feasible with very creative backends. Django non-rel won’t be the last ORM on top of a non-relational datastore, and I think it will be years before its all settled out, but I know most people will never want to use Cassandra’s raw Thrift API.
  • Upcoming Web Frameworks: Hinted at by my first two, I think current web frameworks don’t fit the evolving model of a web application. Something that really combines a non-relational backend, with a process and communication model that supports DIRT, in addition to traditional HTML. I don’t think there is a leader here yet, but I think within a year or two, there will be serious disruption in the framework space — a Rails or Django of the 2010s. The key will be finding the right places of abstraction, because you still need to make it easy to use, even with all the crazy complicated things going on underneath, but if the wrong abstractions are selected, you’ll just end up with leaks everywhere.

Three things I am tired of hearing about:

  • Benchmarks of HTTP servers: If the HTTP server is your applications bottleneck,  Congratulations! But it probably means you are serving static files, and not using a database or a dynamic language. HTTP is hard, but Ryan has made it easier. HTTP though, isn’t the bottleneck on your Rails/PHP/Django/node.js application, some backend datastore will be.
  • NoSQL: I’m tired of some non-durable non-multi-master NoSQL projects.  We had that before, its called BDB in the 90s. Oh, and again with MySQL.  There are innovative NoSQL projects, stop polluting the name. Anyways, the name in general is terrible, but innovation is not building a new BDB, it is building a multi-master multi-datacenter data storage system.
  • Meg Whitman vs Jerry Brown: Only makes sense if you are a California resident I guess, but I don’t like either candidate at this point. Voting by mail, and just wish the election would be done, so I can stop seeing their terrible attack advertisements on each other.
Posted in Uncategorized | 4 Comments