Apache HTTP Server development for this week

April 8th, 2008

The last week has seen many new features and improvements made to httpd. Many of them have been accelerated by people at the ApacheCon EU Hackathon this week.

  • mod_session

    On Friday Graham Leggett introduced a series of modules to support generation of sessions from HTTPD. Included is mod_session_crypto, which encrypts the data using AES. This is the first time ‘form based’ authentication has had real support in the Apache Core.
    [docs: mod_session]
    [thread: Apache support for form authentication]

  • mod_socache

    On Tuesday Joe Orton committed the new Small Object Cache modules, which have been under discussion for a couple of months now. The mod_ssl session cache has been changed to use this. Currently supported cache backends are DBM, memcached, and Shared Memory. I expect many other modules will be changed to use this cache API as time goes on.
    [svn: ap_socache.h]
    [thread: [PATCH] ap_socache.h & mod_socache_*]

  • If/Else blocks added

    Nick Kew ported the expression parser from mod_include, and has used this to add If and Else blocks to the core. This provides a viable alternative to mod_rewrite and RewriteCond, and lets you set any module’s configuration values.
    [docs: if]
    [thread: Dynamic configuration for the hackathon?]
    [commit: r644253]

  • Turkish Documentation

    Nilgün Belma Bugüner contributed a complete translation of the Apache HTTP Server documentation in Turkish.
    [docs: Turkish]
    [thread: New Turkish Documents]
    [commit: r645667]

  • Serf Bucket Discussions

    Discussion at the Hackathon covered how Serf Buckets use a “pull” method for both input and output, unlike the current filter stack in httpd, which is pull for input filters but push for output filters. There was general agreement that the experiment of mod_serf should be expanded up the filter stack.
    [svn: mod_serf.c]

  • Simple MPM created

    Paul Querna started work on a new MPM at the Hackathon. The MPM hopes to run on both Unix and Win32 platforms, and keep the same behaviors on both.
    [svn: SIMPLE.README]

ApacheCon EU 2008

March 27th, 2008

I will be at ApacheCon EU 2008, in Amsterdam in a week or so.  (April 6-11)

Not giving any talks this year.

I will also be at Joost’s Leiden office the week following.  (April 12-19) 

traveling.

March 27th, 2008

I spent the last week or so in New York City, at the Joost office there.  I kinda forgot to post anything.  This was the first time I had been to NYC, and it was a fun trip.  Not sure I would ever want to live in NYC.

Returning to San Jose was not so fun.

Yesterday, American Airlines cancelled a couple hundred flights due to problems with their MD-80s.

I was originally going LGA -> ORD -> SJC.

The ORD -> SJC leg got cancelled.

They re-routed me LGA -> DFW -> SJC.

When I landed in DFW, the flight to SJC had been cancelled.

There were no more flights to SJC.

The flights to SFO and OAK were all fully booked.

I luckily got to the top of the standby list to OAK, and got on that flight.

My checked bag, however, is somewhere between New York and California, and American Airlines doesn’t know where it is yet. Sigh.

March Madness on Joost

March 20th, 2008

Watch all of the March Madness games LIVE on Joost!

Joining Joost

January 18th, 2008

Joost Logo

Monday will be my first day at Joost.

Today is my last day at Bloglines aka Ask.com aka IAC Search and Media aka IAC/Interactive.

It’s been a fun couple years here, and I am very grateful for the great team I helped build at Bloglines, but it is time for me to move on.

in reply to “bloglines sucks”

December 27th, 2007

In reply to Scoble’s post today, “Bloglines Sucks”…

I will first try to outline the “issue”.

At the bottom of every post on a wordpress.com blog, is a tracker image used for statistics. It includes a rand parameter, which changes every time the feed is fetched over HTTP. The image URL is something like this:

http://stats.wordpress.com/b.gif?host=scobleizer.com&rand=2045631674&blog=3428&post=3957&subd=scobleizer&ref=&feed=1

Because this rand value changes every time we read the feed, we considered the item ‘Updated’.

The last 40 posts were shown as updated every time a new post was added because of our use of the HTTP ETag and Last-Modified features. Since Wordpress.com returns a 304 Not Modified for most of our crawls, we would only ‘reparse’ the entire feed when a new post was added.
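
One way an aggregator could avoid this kind of false ‘Updated’ flag is to ignore volatile query parameters when comparing item content. The following is only a sketch of that idea, not what Bloglines does: the function names, the choice of SHA-1, and the assumption that `rand` is the only volatile parameter are all hypothetical, and it assumes the item HTML has already been entity-decoded.

```python
import hashlib
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: query parameters whose values change on every fetch.
VOLATILE_PARAMS = {"rand"}

# Rough URL matcher, good enough for URLs inside HTML attributes.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def strip_volatile_params(url):
    """Return the URL with volatile query parameters removed."""
    parts = urlsplit(url)
    kept = [(k, v)
            for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in VOLATILE_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def item_fingerprint(item_html):
    """Hash the item content after normalizing every URL found in it."""
    normalized = URL_RE.sub(lambda m: strip_volatile_params(m.group(0)),
                            item_html)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()
```

With a fingerprint like this, two fetches of the same post that differ only in the tracker image’s rand value compare as equal, while a real edit to the post text still changes the hash.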

Now, the reason users do not see this problem in Google Reader is that Google Reader has no concept of an “Updated” item. When a writer edits a blog post later, users in Google Reader would never see the changes. In Bloglines, we have always considered this a feature, showing you, the user, when a blog post is edited.

In Bloglines you can disable this feature, on a per-feed basis:

In Bloglines Beta, click on the feed, then select Edit. Change the “Updated Items:” to “Ignore”.

In Bloglines Classic, click on the feed, then select edit subscription. Change the “Updated Items:” to “Ignore”.

As far as I can tell, the use of a rand parameter in the Wordpress.com statistics image is a new change, also introduced at the same time the inline comment images were added to feeds.

FeedBurner includes similar statistics, tracking images and comment images, but they do not include a constantly changing image URL. This works correctly in Bloglines.

In regards to placing blame, Dana Epp says “Bloglines says it’s not them”. I don’t know who Dana has talked to inside Bloglines. When these types of issues are reported, we generally try to get in touch and investigate with the publisher, and hopefully figure out what is going on together, rather than outright saying it’s not our fault. It is a bad experience for our users, and we always want to be involved and help fix it.

I first heard about this issue from Matt via email on Friday, December 21st (also my birthday). I forwarded that email on to our internal Bloglines Engineering mailing list, but frankly, I didn’t expect anyone to work on the issue on the Friday before Christmas. IAC Search and Media, the parent company of Bloglines and Ask.com, also has a mandatory Holiday Shutdown this week for all employees. No one will be in the office officially until January 2nd, 2008.

Luckily or unluckily, depending on your perspective, I took some time this afternoon away from my family to read my feeds. For now the bug^H^H^Hfeature in Bloglines of showing edited posts has been fixed. I have simply turned it off for all users.

I hope you had a Merry Christmas, and have a Happy New Year.

22->23

December 21st, 2007

Getting older

Really getting tired this year. Since ApacheCon in Atlanta on November 11th, I haven’t been home in San Jose for more than 6 days straight.

Thankfully, I have the next 2 weeks in Spokane at my parents’ house to chill for a bit.

on shedding

November 29th, 2007

Brian McCallister has a new post on a service location technique dubbed “Shredding”. This post started out as a comment on Brian’s site, but it got a little long…

  • Don’t underestimate using load balancers where they make sense. You don’t need to spend tons of money on a commercial one. 2x 1u pizza boxes with modern CPUs + 1/10GigE running {Free,Open}BSD + CARP + pfsync.
  • For ‘dumb clients’: Just proxy it. Perlbal does this for LiveJournal in front of their MogileFS boxes. Or look at Dynamo for another example: the ‘dumb’ clients can connect to any node, and that node proxies to the correct one. Reducing the number of request/response cycles is important to keep client latency down. It’s not so much about the persistent TCP connection as the send/reply of the data just to find something.
  • For ‘smart clients’: I personally prefer a daemon running on each local machine, which uses a multicast/gossip communication with other nodes to keep a local ‘cache’ of where services are located and their status. Every couple of seconds, based on the current state of the cluster, it would write it out to a blob file on disk. Clients just slurp up this file to find anything. (You can also do the same thing based on a Unix daemon socket, but it’s generally slower.)
  • There is some discussion about RFC issues with 302s and sending a POST to the redirected URL. The larger issue is that almost no HTTP Client Libraries will do this correctly out of the box.

All that said, for the Bloglines FS, we proxy writes to the data storage nodes, but that is mostly to ensure redundancy of data. For reads, we send back a sorted list of the data nodes that have a chunk to the client. The client then connects directly, and will try the other entries on the list if the first one fails.
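
The read path described above (a sorted list of candidate nodes, with the client falling through to the next entry on failure) can be sketched roughly as follows. This is an illustration only, not the Bloglines FS client: `read_chunk`, the `fetch` callback, and the use of `OSError` as the failure signal are all assumptions made for the example.

```python
def read_chunk(nodes, fetch):
    """Try each node in the order given and return the first successful read.

    nodes: node identifiers, already sorted by preference (assumed to come
           from the lookup service).
    fetch: callable that takes a node and returns the chunk bytes, raising
           OSError (which includes ConnectionError and timeouts) on failure.
    """
    errors = []
    for node in nodes:
        try:
            return fetch(node)
        except OSError as exc:
            # Remember why this node failed, then fall through to the next.
            errors.append((node, exc))
    raise OSError(f"all {len(nodes)} nodes failed: {errors!r}")
```

The client only pays for extra connections when a node is actually down. In the common case it makes one lookup and one direct read.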

mod_serf in trunk

November 13th, 2007

Now in httpd trunk: mod_serf.  A reverse proxy module that uses Serf for its HTTP Client. Woot.

ipod warning

November 11th, 2007

ipod warning

Don’t steal music. Thank You Apple for the reminder.

I wonder if new iPhones will include wrappers saying ‘Don’t jail break’.

Well, of course they won’t. This is Apple we are talking about, so it would be more like:

Don’t jail break

No encarcele la rotura

Setzen Sie nicht Bruch gefangen

壊れ目を拘留してはいけない 

new software: mod_timer

October 25th, 2007

Do you have a custom logging module?

Ever wondered how long it took to actually finish logging?

At $work I was helping with some problems with Apache, and we wanted to know how long until an Apache Worker Process actually goes back into the Accept Queue.

So mod_timer is born. It hooks into Apache when the connection is accepted, before we start reading any data, and the timer ends when the connection memory pool is destroyed. It also performs the same measurements on requests inside the connection.

It produces a log file like this:

r:127.0.0.1:51886:1193364411078414:1568
r:127.0.0.1:51886:1193364411077069:3034
r:127.0.0.1:51886:1193364411080117:21150
r:127.0.0.1:51886:1193364411101293:99477
r:127.0.0.1:51886:1193364411200792:1856762
r:127.0.0.1:51886:1193364413057577:7000364
c:127.0.0.1:51886:1193364411077016:8980955
r:127.0.0.1:51887:1193364427909070:96608
r:127.0.0.1:51887:1193364428006034:2031335
r:127.0.0.1:51887:1193364430037392:1086699
r:127.0.0.1:51887:1193364431124508:916482
r:127.0.0.1:51887:1193364432041014:5315190
c:127.0.0.1:51887:1193364427909020:9447211

Log Fields:

  • ‘r’ or ‘c’ indicates whether a request or a connection is being logged.
  • Remote IP Address
  • Remote Port
  • Start time, in apr_time_t (64bit int time since 1970 in microseconds)
  • Run time in microseconds

Using this, it becomes easier to look for ‘evil’ clients that are doing things like sending one byte of a GET request a second.
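
As a rough illustration of that kind of analysis, here is a small parser for the log format above. It is a sketch, not part of mod_timer: the names `TimerEntry`, `parse_line` and `slow_entries` are made up for the example, and the threshold is an arbitrary choice.

```python
from collections import namedtuple

# One parsed log line; the field order follows the "Log Fields" list above.
TimerEntry = namedtuple("TimerEntry", "kind ip port start_us run_us")


def parse_line(line):
    """Parse one 'kind:ip:port:start:runtime' line into a TimerEntry."""
    kind, rest = line.strip().split(":", 1)
    if kind not in ("r", "c"):
        raise ValueError(f"unknown record type: {kind!r}")
    # Split from the right so an address that itself contains ':' still parses.
    ip, port, start_us, run_us = rest.rsplit(":", 3)
    return TimerEntry(kind, ip, int(port), int(start_us), int(run_us))


def slow_entries(lines, kind, threshold_us):
    """Return entries of the given kind that ran for at least threshold_us."""
    entries = (parse_line(line) for line in lines if line.strip())
    return [e for e in entries if e.kind == kind and e.run_us >= threshold_us]
```

For example, `slow_entries(open("timer.log"), "c", 5_000_000)` would list every connection that was held open for five seconds or more, where `timer.log` is a placeholder file name.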

gltail

October 8th, 2007

We thought gltail sounded pretty cool. So we hooked it into bloglines.com: [gltail screenshot]

(Screenshot is clipped to protect user data)

It worked okay for 1 webserver. But when we hooked it up to the entire cluster, it was just a little bit too slow, drawing a new frame once every 8 seconds. Time to port it to C :-)

geeksessions presentation

October 4th, 2007

Slides from my GeekSessions 1.2 Presentation

Video from the presentation and panel are supposed to show up here at some point soon.

goodbye RDBMs

October 4th, 2007

The End of an Architectural Era (It’s Time for a Complete Rewrite)
[Via Wesley Felter]

With CouchDB gaining traction, and the recent paper on Dynamo, it feels like people everywhere are dropping their relational databases.  Are the database vendors going to figure this out, and change their products in time to matter?

Dynamo

October 3rd, 2007

If you care about distributed systems, you need to read the paper about Amazon’s Dynamo.

Comments:

  • Making node joining/leaving an administrative command is not something most academics consider, but it significantly reduces complexity.  We made a similar decision with the PodServer system for Bloglines.  I believe this is the right decision, since a node changing membership on the long term is a rare event.  Even with our growing blog index, we only add new nodes once every 6 months or so. (Plan ahead :-) )
  • Shout out to BerkeleyDB.  Glad to see other people pushing it hard. Combined with the older white-paper about Google using BerkeleyDB for their Google Accounts system, it just validates my positive feelings on continuing to use it as a core part of the Bloglines architecture.
  • The configurability of N/R/W is a great idea.  Most systems make N configurable, but skimp out on giving full flexibility to the people using the system.
  • I’m convinced I need to read more about Vector Clocks.  For the Bloglines PodServer, we are blessed with only having a single writer per record due to how our crawlers work, so we just ‘cheat’ on versioning, but this has caused us pain a few times.
  • I wish Amazon would Open Source Dynamo. I can understand the difficulties in doing that, but it’s a nice thing to dream about.
  • I think I will propose an Apache Labs project to start something like Dynamo.  For a basic key/value storage system on a consistent hashing ring, without all of the High Availability concerns, you could get something working pretty quickly.   Adding all of the high end features could take time of course…
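
To give a feel for how small the basic version can be, here is a minimal sketch of key placement on a consistent hashing ring. It is not Dynamo’s implementation and has no replication or failure handling. The class name, the use of MD5 as the placement hash, and the default of 64 virtual nodes per machine are all assumptions picked for illustration.

```python
import bisect
import hashlib


def _hash(key):
    # Any stable hash works; MD5 is used only to spread keys, not for security.
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Maps keys to nodes; each node owns several points on the ring."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._points = []  # sorted hash positions on the ring
        self._owner = {}   # hash position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key):
        """Return the node owning the first point clockwise from the key."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]
```

The useful property is that adding a node only moves the keys that land on the new node’s points. Every other key stays where it was, which is what makes membership changes cheap.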

More discussion over at Sam Ruby’s blog: Key + Data.

This all ties in nicely with the GeekSessions 1.2 topic of “Designing beyond the database”, where I presented last night.