At work, every project has an .htaccess file containing at least some mod_rewrite rules.  This way, all I need to do to run a project is check it out of version control.  I don’t need to modify my local Apache configuration.

But allowing .htaccess files may be a performance hit. More specifically, enabling the AllowOverride directive in Apache is a performance hit. The Apache docs sum up the problem best:

“Wherever in your URL-space you allow overrides (typically .htaccess files) Apache will attempt to open .htaccess for each filename component. For example,

DocumentRoot /www/htdocs
<Directory />
   AllowOverride all
</Directory>

and a request is made for the URI /index.html. Then Apache will attempt to open /.htaccess, /www/.htaccess, and /www/htdocs/.htaccess.”

So I disabled all .htaccess files in production and moved each file’s mod_rewrite rules into the main Apache config file. After a quick Apache Bench run, one project looked around 3% faster. Note that the same page of the Apache docs lists a few other useful optimizations.
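To make the cost concrete, here is a minimal sketch of the lookup the Apache docs describe. It is an illustrative model only, not Apache's actual code, and the function name htaccessCandidates() is made up for this example. It assumes overrides are allowed from the file system root downward, as in the quoted configuration.

```php
<?php
// Illustrative model only: list the .htaccess paths Apache would try to open
// for a file when AllowOverride is enabled from "/" downward.
function htaccessCandidates(string $filePath): array
{
    // Directory components leading to the file, e.g. ['www', 'htdocs'].
    $dirs = array_filter(explode('/', dirname($filePath)), 'strlen');

    $candidates = ['/.htaccess'];
    $current = '';
    foreach ($dirs as $dir) {
        $current .= '/' . $dir;
        $candidates[] = $current . '/.htaccess';
    }
    return $candidates;
}

// DocumentRoot /www/htdocs plus the URI /index.html maps to this file:
print_r(htaccessCandidates('/www/htdocs/index.html'));
```

Every extra directory level adds one more file lookup per request, which is why moving the rules into the main config removes the overhead.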

What Is Software?

Long ago, I concluded that software is in fact, among art, science, and engineering, closest to art.  But I had never really considered software as craft.  The Manifesto for Software Craftsmanship has rekindled a debate on software as craft.  That may yet be the best classification of our trade, but I’m sure it won’t be the last.

Let’s Start From The Beginning

Let’s back up a minute.  What is software?  Is it art?  Science?  Engineering?  Craft?  When in doubt, the dictionary is a good place to start.

engineering -noun the art or science of making practical application of the knowledge of pure sciences, as physics or chemistry, as in the construction of engines, bridges, buildings, mines, ships, and chemical plants.

Software has never felt much like engineering to me, maybe because there’s no calculus involved (usually).  There are of course parallels in resource trade-offs, input/output, etc.  But to me there is something inherently different between building a bridge, or even a stereo, and writing a program. 

Software development is still more a craft than an engineering discipline. This is primarily because of a lack of rigor in the critical processes of validating and improving a design.

- Jack Reeves, What Is Software Design?

According to Reeves, if we adjust our definition of things slightly, and see that coding is really equivalent to the design that takes place before engineering, it can start to look like other engineering disciplines.  But tools (i.e. languages) and coding and testing processes need to evolve much more before that becomes true.

art -noun the quality, production, expression, or realm, according to aesthetic principles, of what is beautiful, appealing, or of more than ordinary significance.

The idea of software, which is so technical, as art has always appealed to me in an esoteric sort of way.  They say that artists will destroy the first 100 pieces that they do. In the same way, many programmers fight the urge to rewrite something they’ve written almost immediately after it’s completed.  I know I do.  But fundamentally, the problem with this analogy is that while art can exist for art’s sake, software is usually more practical.  That is, even if I write something for fun, I still want it to at least do something.

science -noun a branch of knowledge or study dealing with a body of facts or truths systematically arranged and showing the operation of general laws.

If I’m not convinced of software as an applied science like engineering, I don’t see how software as science fares any better.  In that sense, it falls to the same problem as art: science may exist for science’s sake, to discover new knowledge, but for software, at the end of the day, there always seems to be a customer waiting.

craft -noun an art, trade, or occupation requiring special skill, esp. manual skill.

Oh, but I like this one.  It allows us to impose the special skills that are the craft.  It leaves room for the creativity that is so essential for success, and so rewarding.  It speaks to the practical requirements that usually pay the bills at the end of the day and (hopefully) make life easier for someone.

If we consider software as craft, we cannot avoid talking about masters and apprentices.  People often forget that there are apprentices around today.  It used to be that if you wanted to become an electrician, plumber, or home inspector, you first had to train as an apprentice.  Only after many (usually unpaid) months of apprenticeship under a master could you get a license to work.  These days, there are ways around that requirement, and apprenticeship is often found only in union crews.  (And in unions it seems that politics has overtaken the original goal: to further the craft.)  But I digress.

The main problem with this comparison is the reason a market for this kind of qualification exists at all: it is the law.  You technically cannot pull a permit on a home’s electrical (except your own) unless you are licensed.  You cannot get your license until you apprentice or take a test.  So would it be better or worse if there were a law requiring all software developers to get licensed before practicing their craft?  I’m certain quality would go up.  But the profession would suddenly become less accessible.

A Rose By Any Other Name…

While we’re on the subject, what do software professionals like to be called?  Software engineers?  Developers?  Programmers?  Artisans?  Architects?  I hate using the title “software engineer” (see above).  “Software artisan” is closer, but does not sound any less pretentious.  Personally, I prefer “developer” if I’m in the company of those who know what that means, and for everyone else, “programmer” will do.

I am reminded of “The human programmer” that compares programmers’ quest for a title to musicians’ and writers’.

Maybe programmers are just like the 1950s musicians that lacked … confidence, snatching desperately for public nobility. Thus far our attempts at title theft have been less successful than theirs, though not for lack of trying (e.g. software developer / architect / engineer). Like musicians, our work requires not only talent but years of practice, and we see ourselves as “different.” The average person cannot walk up to a piano, or a computer keyboard, and produce anything of value. Writers? Throw them in here too, certainly. Are reporters not a little too serious about being called “journalists”? Of course, this line of thought would annoy both of those professions, them being old, established, and respected compared to programming.

Software As…?

Of course, none of this really matters.  We still go about our jobs every day doing what we do and enjoying it.  But we forget sometimes that software as a profession is only decades old!  Compared to other professions in art, science, or engineering, software is in its infancy!  Someday I think a classification will be agreed upon.  In the meantime, I’ll throw one more definition out there.

alchemy -noun any magical power or process of transmuting a common substance, usually of little value, into a substance of great value.

Software alchemist.  I’m ordering new business cards.

Though I have heard good things about Parallels and VirtualBox, I have always been a user of VMware.  In particular, VMware Workstation.  Workstation is great for firing up multiple Linux instances and testing out load-balancing or proxying scenarios.  I haven’t really figured out any use for Windows VM’s other than testing IE6. 

While there are a few Virtual PC hard disk images (.vhd) for Windows XP around, VMware cannot directly import .vhd files.  It needs the actual Virtual PC virtual machine file (.vmc).  After again losing my Windows XP virtual machine that I use for IE6 testing, I thought I’d document the process of running Windows XP in VMware so I don’t have to figure it out again the next time it happens. 

Note: though these instructions are for VMware Workstation, some of this may apply to the free VMware Player.

  1. Download the IE6 Virtual PC Virtual Hard Disk (.vhd) image from Microsoft.
  2. Download and install Virtual PC from Microsoft, if you don’t have it already.
  3. Start Virtual PC.  If you have no virtual machines, you will get the New Virtual Machine Wizard.  Click Next.
  4. Select “Use default settings to create a new virtual machine”. Click Next.
  5. Pick a location to save your Virtual PC virtual machine.  This should be the location where you will create the VMware virtual machine.  I keep all my VMs in the same directory with meaningful names.
  6. Click Finish to create the new virtual machine.
  7. If you selected “When I click Finish, open Settings,” in the previous step, you will see the settings dialog.  If you did not, select the new VM and click Settings.  Select “Virtual hard disk file:” and find the .vhd file you downloaded in step 1.  After finding it, click OK.
  8. You should see your VM in the Virtual PC Console.
  9. Select your VM and click Start.  Your Windows XP virtual machine should boot in its own window.
  10. Shut down the virtual machine using the Start button.  Then exit out of Virtual PC.  Start VMware Workstation.  Once it’s started, select “Import or Export…” from the “File” menu.  You should see the Conversion Wizard.  Click Next.
  11. You are at Step 1 of the conversion. Click Next to select a Source Type. Under “Select the type of source you want to use:”, select Other.  Click Next.
  12. Under “Source VM or image:”, find the Virtual PC (.vmc) file you created earlier.  Click Next.
  13. Select “Convert all disks and maintain size.” Click Next.
  14. You are at Step 2 of the conversion. Click Next to select a destination type.  Under “Select the destination type,” select “Other Virtual Machine.”  Click Next.
  15. Under “Virtual machine name,” fill in a meaningful name.  Under “Location:”, find the place you want to store your virtual machine.  Click Next.
  16. The wizard tells you that the source files are in Microsoft virtual disk (.vhd) format.  Under “How do you want to convert them?”, select “Import and convert (full-clone).”  Under “Disk Allocation,” select “Allow virtual disk files to expand.”  Click Next.
  17. The next step allows you to configure your VM networking.  You should probably stick to the default of 1 NIC, bridged, that connects at power on.  Click Next.
  18. Step 3 allows for some VMware customisation.  You definitely want to install the VMware Tools.  Click Next.
  19. Your Virtual PC image is ready to be converted to VMware.  Click Finish to begin the conversion!
  20. Get up from your desk and take a walk around.  Go get a cup of coffee.
  21. After the conversion is completed, you should see your new Windows XP virtual machine in VMware Workstation.
  22. Click on “Power on this virtual machine” and your Windows XP VM should boot inside of VMware Workstation.  You can uninstall Virtual PC at this point, if you want (which is likely, since you’re running VMware).

Autoload Magic

So to use a class in PHP, usually, we first have to include the file that contains the definition for that class.

require('myclass.php');
 
$class = new MyClass;

But after you start instantiating a few classes here and there, problems arise.

  • If you use a lot of classes, the number of calls to require() becomes big.
  • When you start including files in other files, you are not sure whether or not you’ve already included a certain file, so you use require_once(), which is inefficient.
  • Worst of all, included files become disorganized, and you may accidentally remove a require() that is needed, not in the current file, but in a file downstream, leading to a Fatal Error (if you use require() and not include()).

The solution: __autoload(). The PHP autoload feature is one of the coolest features in the whole language. Basically, if PHP tries to load a class and cannot find the class definition, it will call the __autoload() function that you provide, giving it the name of the class it can’t find. At that point, you’re on your own. But how do you find the file on disk? There are four strategies for finding the right class file.

  1. Keep all class files in one directory. Not a very attractive method.
  2. Maintain a global array of class names to class definition files. In this case, the name of the class is the key, and the location of the file on disk is the value. This global hash could be created in memory at server start time, or built up as the application executes. This seems like a heavy solution to me.
  3. Use a special container class that all other classes inherit from. This method is suitable mostly when attempting to unserialize objects (when unserializing objects, PHP must have the class definition to recover the object). This special container will automatically know the class file for the contained object. Then, the only class file that needs to be located is the container’s class file. (Autoloading in the context of object un/serialization is a special case of class loading.)
  4. Use a naming convention. The idea here is to name your classes in a way that corresponds to your file system. For example, Zend names their classes so that underscores represent directories in the project. So the Zend_Auth_Storage_Session class is defined in the file Zend/Auth/Storage/Session.php. Autoload simply needs to replace the underscores with the system directory separator character and pass the result to require().
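For comparison, strategy #2 (a class map) might look roughly like the sketch below. The class names, file paths, and the findClassFile() helper are all hypothetical. The loader is registered with spl_autoload_register(), which plays the same role as __autoload() and is the mechanism current PHP versions provide.

```php
<?php
// Hypothetical map of class name => file that defines it.
$classMap = [
    'MyClass'     => __DIR__ . '/lib/myclass.php',
    'EventFinder' => __DIR__ . '/lib/events/finder.php',
];

// Look up a class file in the map; returns null when the class is unknown.
function findClassFile(array $classMap, string $class): ?string
{
    return $classMap[$class] ?? null;
}

// PHP calls this closure with the class name whenever a class is missing.
spl_autoload_register(function (string $class) use ($classMap): void {
    $file = findClassFile($classMap, $class);
    if ($file !== null && is_file($file)) {
        require $file;
    }
});
```

The map has to be kept in sync with the code base by hand or by a build step, which is part of what makes this approach feel heavy.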

At work, we use #4, naming conventions. Why? Well, why not? The other methods are heavier or more complicated, and I don’t see any gain. By using a simple naming convention, never again will we need to call require() or include() in our app. As a bonus side-effect, the code is organized in a consistent manner. Provided we follow the naming convention that Zend (and PEAR) use, our __autoload() function looks like:

// by using PEAR-type naming conventions, autoload will always know where to
// find class definitions
function __autoload($class)
{
   require(str_replace('_', DIRECTORY_SEPARATOR, $class) . '.php');
}
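On newer PHP versions the same idea needs a different hook: __autoload() was deprecated in PHP 7.2 and removed in PHP 8.0 in favor of spl_autoload_register(). A minimal sketch of the equivalent loader follows. The classToPath() helper is a name invented for this example, and the code assumes the class directories are on the include path.

```php
<?php
// Translate a PEAR-style class name into a relative file path.
function classToPath(string $class): string
{
    return str_replace('_', DIRECTORY_SEPARATOR, $class) . '.php';
}

// Register the loader; PHP calls it with the name of any undefined class.
spl_autoload_register(function (string $class): void {
    // Resolve against the include path so a missing file is not a fatal error
    // here and other registered loaders still get a chance.
    $path = stream_resolve_include_path(classToPath($class));
    if ($path !== false) {
        require $path;
    }
});

// Prints Zend/Auth/Storage/Session.php on Unix-like systems.
echo classToPath('Zend_Auth_Storage_Session'), PHP_EOL;
```

Unlike the single __autoload() function, several loaders can be registered side by side, so this plays well with libraries that bring their own.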

If we are to do unit testing, then it would be nice to have a simple way to run all the tests at once.  And seeing the results in a command line is fine, but it would be nice to be able to generate some pretty reports I could view in a browser.  And running coverage tools from the command line to see exactly what we’re testing is fine, but it would be nice to be able to generate coverage reports I could view in a browser. It would be nice to be able to do all these things with one simple command.

The command: phing.

If you’re familiar with Ant for Java, phing will feel very familiar: it is modeled on Ant.  With a very simple build script, I can run all the unit tests, and generate test result reports and coverage reports in HTML automatically. Then I can fire up my browser and see if any tests failed, or whether we need new tests.

The following are examples of test results reports and coverage reports.

Unit test results

Code coverage report

Phing currently requires PHPUnit for unit testing and Xdebug for coverage. Someone would have to write a task to use other frameworks such as SimpleTest.

The phing build file is a first step towards automating builds with a build server. A build server can periodically check the project for changes. If new code exists, it will run tests and generate reports for tests, coverage, lint, anything. It can put all this on the web for easy inspection. It can keep versions of the whole project that can be readily deployed.  So if we have a staging and production server, and we stick to deploying ONLY bundles from the build server, it can make deployment and maintenance easier.

The typical argument for database abstraction is database portability. By abstracting the database, you are free to switch from one RDBMS to another effortlessly. Popular database abstraction layers are PEAR MDB2, ADOdb, and the built-in PHP PDO library (not quite a database abstraction layer, but we’ll throw it in there anyway).

But there are three problems with that logic.

  1. How often are you going to switch databases in the life of your application anyway? I don’t know if I’ve ever switched a database out from under an application. Ever.
  2. To achieve true database independence, you’ll need to avoid using all the little syntax nuances that vary from DB to DB (e.g. the MySQL LIMIT clause) and avoid any feature in that database that makes it interesting (e.g. say goodbye to ON DUPLICATE KEY UPDATE). Chances are, you haven’t done these things. So to switch databases, your biggest problem will not be having to change all the mysql_* calls to ora_* or whatever.
  3. As with any layer, when you add another layer, there will always be some performance impact.

In light of these reasons, database dependence can be a good thing. You can take advantage of features in the RDBMS. And if you can make native calls directly to the extension, you’re saving a lot of code from being executed.

But still, something feels wrong about having all those native function calls throughout the codebase. The solution is database access abstraction, which does not attempt to abstract away the entire database, but attempts to abstract away access to the database.

Practically, this means building a wrapper class around your database code.  Then, your application can use this wrapper class for its database needs. This is somewhat the best of both worlds. If you do need to switch DBs, all your native function calls will at least be in one file. You can also insert any system-wide logic pertaining to DBs in this one class. For example, if you move to a replicated MySQL environment, you’ll need to direct READ queries to one of multiple slave servers, and direct WRITE queries to the master server.  This seems like an obvious thing to do, but a lot of people assume using a DBAL is enough abstraction already.
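As a rough sketch of such a wrapper, assuming the mysqli extension: the class name Database, its methods, and the host names and credentials are all made up for illustration, and this is not the wrapper we actually run. Reads are sent to a randomly chosen replica (slave) and everything else to the master.

```php
<?php
// Sketch of a database access wrapper: every native mysqli call lives here.
// Host names and credentials are placeholders.
class Database
{
    private array $masterConfig;
    private array $replicaConfigs;
    private array $connections = [];

    public function __construct(array $masterConfig, array $replicaConfigs)
    {
        $this->masterConfig = $masterConfig;
        $this->replicaConfigs = $replicaConfigs;
    }

    // Treat SELECT/SHOW statements as reads; everything else is a write.
    public static function isRead(string $sql): bool
    {
        return preg_match('/^\s*(SELECT|SHOW)\b/i', $sql) === 1;
    }

    // Reads go to a randomly chosen replica, writes go to the master.
    public function pickServer(string $sql): array
    {
        if (self::isRead($sql) && $this->replicaConfigs !== []) {
            return $this->replicaConfigs[array_rand($this->replicaConfigs)];
        }
        return $this->masterConfig;
    }

    // Connect lazily, then run the query with the native extension.
    public function query(string $sql)
    {
        $server = $this->pickServer($sql);
        $key = $server['host'];
        if (!isset($this->connections[$key])) {
            $this->connections[$key] = new mysqli(
                $server['host'], $server['user'], $server['pass'], $server['name']
            );
        }
        return $this->connections[$key]->query($sql);
    }
}

$db = new Database(
    ['host' => 'db-master.example', 'user' => 'app', 'pass' => 'secret', 'name' => 'app'],
    [['host' => 'db-replica1.example', 'user' => 'app', 'pass' => 'secret', 'name' => 'app']]
);
echo $db->pickServer('SELECT event_name FROM events WHERE event_id = 8')['host'], PHP_EOL;
echo $db->pickServer('UPDATE events SET views = views + 1 WHERE event_id = 8')['host'], PHP_EOL;
```

The routing rule here is deliberately naive. Reads that must see the latest write, or that run inside a transaction, would need to go to the master, and that policy also belongs in this one class.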

At work, my biggest motivation was performance. Running tests on our current DBAL, ADOdb, against using the mysqli_* functions in PHP revealed significant performance gains in going without the DBAL, which makes sense.

This blog repeats much of the thinking here, but takes a more comprehensive look at the topic (though the language is confusing at times).

I’ve mentioned Apache Bench before. Httperf serves the same purpose as ab, but has a few more features, and has one very nice value-add.

While ab cannot really simulate a user visiting a website and performing multiple requests, httperf can. You can feed it a number of URLs to visit, and specify how many requests to send within one session. You can also space requests out over time, either at a constant interval or randomly according to a uniform or Poisson distribution.

But the big value-add is autobench. Autobench is a Perl wrapper around httperf for automating the process of load testing a web server. Autobench runs httperf a specified number of times against a URI, increasing the number of requests per second (which I equate to -c in ab) so that the response rate or the response time can be graphed vs. requests per second. (So response rate or response time on the vertical, and requests per second on the horizontal.)

With this, you can generate pretty graphs like this:

Requests per sec

Response time

From the graphs above, you could determine the approximate capacity of your website. In the first graph, the number of responses received was equal to the number of requests sent until 16 req/sec. At 16 req/sec., the number of responses starts going down as requests begin to error out.  In the second graph, the response time stays level at about 500ms (a reflection of your code and database) until 15 req/sec.  At 16 req/sec. the time goes up to nearly 1s, and at 17 req/sec. the response time is over a second.  You would conclude that the capacity of this website is around 15 requests per second.
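The same reading of the first graph can be done in code. This is only a sketch: the sample numbers are invented to resemble the graph, and estimateCapacity() and its 98% tolerance are assumptions made for this example, not anything autobench provides.

```php
<?php
// Hypothetical autobench-style samples: requested rate => achieved reply rate.
$samples = [
    13 => 13.0,
    14 => 14.0,
    15 => 15.0,
    16 => 14.2,
    17 => 12.9,
];

// Walk the rates upward and return the last one at which the server still
// answered at least $tolerance of the requests; null if even the first fails.
function estimateCapacity(array $samples, float $tolerance = 0.98): ?int
{
    ksort($samples);
    $capacity = null;
    foreach ($samples as $requested => $achieved) {
        if ($achieved < $requested * $tolerance) {
            break;
        }
        $capacity = $requested;
    }
    return $capacity;
}

echo estimateCapacity($samples), PHP_EOL; // 15
```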

The people who provide autobench also offer an excellent HOWTO on benchmarking web servers in general.

Apache Bench

Apache Bench is either the first or second most useful PHP tool (with Xdebug being the other). I described the basic theory of Apache Bench in an earlier post. That’s a short post, so I won’t repeat it. This will be another short post, with a small note on how I use it day-to-day. If you are changing something in the system for performance reasons (a piece of code, a database setting, an OS setting, anything!) and you want to see if it makes any difference, use Apache Bench. Fire up a quick test before the change and another after it. ab runs very quickly (on the order of a few minutes on a slow machine), so you can run 1000 requests and not have to worry about your sample size. I even run it on my laptop. Even though my laptop introduces a lot of noise, it still gives relative results. I usually run it two ways before the change, and two ways after.

% ab -n 1000 -c 1 http://www.whatever.com/

That usually gives me a good idea of whether raw performance has improved. (Note the trailing slash: ab rejects a URL that has no path.)

% ab -c 100 -t 60 http://www.whatever.com/

That usually gives me a good idea of how well it scales under load.

UPDATE: There have been reports that Apache Bench is not reliable.

Ward Cunningham On Technical Debt

Ward Cunningham reflects on the history, motivation and common misunderstanding of the “debt metaphor” as motivation for refactoring.

The MySQL Query Cache is not very hard to understand. It is at its most basic a giant hash where the literal query strings are the keys and the arrays of result records are the values. So this query:

SELECT event_name FROM events WHERE event_id = 8;

is different from this query:

SELECT event_name FROM events WHERE event_id = 10;

Important note!  This means that even though your parameterized queries may look the same without the parameters, to the query cache, they are not!

As with all caches, the query cache is concerned about freshness of data. It takes perhaps the simplest approach possible to this problem by keeping track of any tables involved in your cached query. If any of these tables changes, it invalidates the query and removes it from the cache. This means that if your query returns frequently-changing data in its results, the query cache will invalidate the query frequently, leading to thrashing. For example, if you had a query that returned a view count of an event:

SELECT event_name, views FROM events WHERE event_id = 8;

Every time that event is viewed, the cached query will be invalidated. What’s the solution?

In general, write queries so that their result sets do not change often. Specifically, mixing static attributes with frequently updated fields in a single table leads to thrashing, so separate out things like view counts and analytics into their own tables. The frequently updated data can be read with a separate query, or perhaps cached in your application in a data structure that periodically flushes to the DB.

This vertical partitioning of a single table’s columns into multiple tables helps immensely with the query cache. What’s more, the table with the unchanging data can be further optimized for READS, and the frequently updated table can be optimized for UPDATES.
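The behaviour described in this post can be modelled in a few lines. This is a toy model, not MySQL’s implementation: the ToyQueryCache class, the event_stats table, and the sample rows are all hypothetical. It keys results by the literal query text and evicts every cached query that touched a table when that table changes.

```php
<?php
// Toy model of the behaviour described above; not MySQL's actual implementation.
class ToyQueryCache
{
    private array $results = [];   // literal query text => cached rows
    private array $byTable = [];   // table name => [query text => true]

    public function store(string $sql, array $tables, array $rows): void
    {
        $this->results[$sql] = $rows;
        foreach ($tables as $table) {
            $this->byTable[$table][$sql] = true;
        }
    }

    // A hit requires the exact same query text, parameters included.
    public function fetch(string $sql): ?array
    {
        return $this->results[$sql] ?? null;
    }

    // Any write to a table evicts every cached query that read from it.
    public function tableChanged(string $table): void
    {
        foreach (array_keys($this->byTable[$table] ?? []) as $sql) {
            unset($this->results[$sql]);
        }
        unset($this->byTable[$table]);
    }
}

$cache = new ToyQueryCache();
$nameQuery  = 'SELECT event_name FROM events WHERE event_id = 8';
$viewsQuery = 'SELECT views FROM event_stats WHERE event_id = 8'; // hypothetical table

$cache->store($nameQuery, ['events'], [['event_name' => 'Launch']]);
$cache->store($viewsQuery, ['event_stats'], [['views' => 41]]);

// A page view updates only event_stats, so only the views query is evicted.
$cache->tableChanged('event_stats');
var_dump($cache->fetch($nameQuery) !== null);  // bool(true)
var_dump($cache->fetch($viewsQuery) !== null); // bool(false)

// Same query shape, different parameter: a different key, so a miss.
var_dump($cache->fetch('SELECT event_name FROM events WHERE event_id = 10') !== null); // bool(false)
```

With the view count split into its own table, the static event data stays cached no matter how often the counter is updated.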