't Bijstere spoor

A blog about Web development

PHP serializer 0.2

I updated the serializer to use spl_object_hash() where that function is available. This makes serializing objects a lot faster on PHP 5.2 (a suggestion from Sebastian Bergmann).
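A rough illustration of why this helps: a hash turns "have I seen this object before?" into a single array lookup, instead of a comparison against every object encountered so far. This is only a sketch, not the serializer's actual code; the SeenObjects class and its indexOf() method are made up for the example. Keep in mind that a hash can be reused once an object has been destroyed, so the objects have to stay alive while they are being tracked.

```php
<?php
// Hypothetical helper, not part of the serializer itself.
class SeenObjects
{
    private $indexByHash = array();

    // Returns the index assigned to $object the first time it was seen,
    // or the same index again if the object was seen before.
    public function indexOf($object)
    {
        $hash = spl_object_hash($object);
        if (!isset($this->indexByHash[$hash])) {
            $this->indexByHash[$hash] = count($this->indexByHash);
        }
        return $this->indexByHash[$hash];
    }
}

$seen = new SeenObjects();
$a = new stdClass();
$b = new stdClass();

echo $seen->indexOf($a), "\n"; // 0
echo $seen->indexOf($b), "\n"; // 1
echo $seen->indexOf($a), "\n"; // 0 again: same object, same hash
```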


The curse of major versions

For some reason, a lot of the open source software I use for web application development seems to be haunted by some kind of 'major version syndrome'. I've been following the PHP-DEV list closely over the last few years (mostly lurking), and reading the endless discussions and personal attacks surrounding the development of PHP6 got me a bit worried about the future of PHP.

Today I had to work on converting a PHP4 application to PHP5 (no changes were needed, the only cost was in testing), and I have been exchanging emails about getting a host to support PHP5, which prompted me to write this post.

Python

The heat on the php-internals list also made me think about not putting all my eggs in one basket, so I've been browsing around for other languages. I considered a few, and the one I'm most interested in is Python. I haven't had time to actually get started with it, but I figured it was a good idea to subscribe to Planet Python. I learned that Python 3 (or 3000, or py3k) is in the making, and it seems like they are struggling with similar problems.

Perl

This discovery made me think about Perl. From what I've heard about Perl (I don't hear a lot, as it seems to be a declining language..), Perl 6 is a completely new language which at some point (7 years ago) had Perl 5 as its starting point. I have to admit I don't know the community or what exactly is going on, but I ran across a lot of articles about the decline of the language and rumors about a split in the community.

Apache

Apache also seems to have problems with adoption rates. Most people still run 1.3 rather than 2.0. The linked article is from 2005, so the information is a bit outdated by now (Apache 2 was released in 2002), but as far as I know Apache 1.3 is still a lot more popular than Apache 2.0.

Problems

I want to emphasize that this article is mostly based on a feeling I'm getting from looking at the various projects and browsing around, so I can't really put down hard facts or unbiased sources. Then again, drawing conclusions from incomplete data is probably daily life for every programmer, so I might as well give it a shot.

I think the problem is two-fold. First, there's the adoption problem by the community. Main reasons seem to be:

  • Backwards compatibility breaking.
  • Maturity (instability) of the new version.
  • Don't try to fix something that's not broken.

It can be debated whether any of these reasons are actually true or merely FUD, but the main point here is that the surrounding communities seem to draw these conclusions.

The second problem is the internal development team. Open source development is often compared to democracy. When projects grow, their development teams grow as well, adding more opinions and eventually more bureaucracy and frustration.

When the subject of a new major version comes up, a can of worms seems to open. Everybody working on the project has ideas about its direction. There are generally a few years between major releases, so everybody feels strongly about getting their favorite feature in.

Solutions

If you look at the statistics for the various PHP versions, people seem to have no problem upgrading between minor versions. The most popular version of PHP is 4.4.4. That's still a bit behind the current 4.4.7 release, but people are visibly more comfortable upgrading to a version starting with 4.

The conclusion I draw: keep your major version number and add new features gradually. There's too much pressure behind a whole new integer value for the major version number, both for the users and for the internal dev team. Implementing new features gradually will keep your users happy and keep your developers' heads on their shoulders.

Then again, I'm just a bystander and maybe I have no idea of the complex social structures involved with developing open source products.

Our development and live environments run PHP 5.2.3, MySQL 5.0.41 and Apache 2.2.3. We recently migrated all our servers to Debian 4.0.


PHP serializer in userland code

I did a bit of work on an alternative for serialize(), written in PHP.

I wanted to build this as a helper class for a draft PHP-RPC server. I needed a custom one because I wanted to make sure I would be able to spit out PHP4-compatible serialized data and, in the future, when it's ported to PHP6, also PHP5-compatible data.

Some of my findings:

  • It's dead slow compared to the built-in version (as expected). What PHP's built-in serializer could do in 0.00366 seconds took mine 0.0948 seconds.
  • Even though it's CPU-expensive, it needs less memory for big structures, because it uses echo and can stream the output straight to the client if needed.
  • When a property is private or protected, it really is. There's no way to grab the value. I was hoping Reflection would allow me to cheat.
  • There's no proper way to find out if two variables reference the same data. The only way is to change one of them and see if the second changed as well.
  • I was hoping SplObjectStorage would be able to give me back an index for the stored object. Instead I'm looping through all the objects I've got and using === to see if they are the same.
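The "change one and see if the other changed" trick for references could look roughly like this. It is a sketch only; the function name isSameReference() is made up, and it assumes both arguments are plain variables that can be passed by reference.

```php
<?php
// Hypothetical helper: reports whether $a and $b are references to the
// same variable by temporarily overwriting $a with a unique marker.
function isSameReference(&$a, &$b)
{
    $backup = $a;             // plain assignment copies the value
    $marker = new stdClass(); // a value no other variable holds yet
    $a = $marker;
    $same = ($b === $marker); // true only if $b changed along with $a
    $a = $backup;             // restore the original value
    return $same;
}

$x = 1;
$y = &$x; // reference to $x
$z = 1;   // independent variable with an equal value

var_dump(isSameReference($x, $y)); // bool(true)
var_dump(isSameReference($x, $z)); // bool(false)
```

A fresh object is used as the marker because === on objects only matches the very same instance, so an equal-looking value in $b can never cause a false positive.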

I'm starting to wonder if it's a better idea to just use serialize() and make the needed fixes with regexes and such, but that's an experiment for another night.

For the people who might find it useful, here's the download and source code..

List of differences with PHP's serialize:

  1. It only checks references for objects.
  2. It converts Serializable objects to strings when the target version is PHP4.
  3. It has a setting that allows you to automatically convert any object to an array or stdClass.
  4. It ignores all private and protected variables.

ext3: too many links!

Apparently when using ext2 or ext3 there's a limit on the number of subdirectories you can create within a directory. This is a hardcoded number and seems to be set to about 2^15 ≈ 32k.

This is related to the maximum number of hard links that can be created, as every subdirectory needs a link back to its parent (..). If you run into a "too many links" error with really large folders, you'll know why.. The only way to change this is by changing the number and recompiling, which is not worth doing IMHO.

I thought I read somewhere the number of files in one directory is by default around 64k, so I just figured it would be the same for subdirectories.. Guess I'll need to re-organize a bit =)
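To see how close a directory is to the limit, you can look at the link count the filesystem reports for it. On ext2/ext3 this is normally 2 plus the number of subdirectories, because every subdirectory's ".." entry counts as one more link. This is a small sketch with a made-up function name; other filesystems (and Windows) may report the count differently, so treat the number as an indication only.

```php
<?php
// Hypothetical helper: returns the hard link count reported for a directory.
function directoryLinkCount($dir)
{
    $info = stat($dir);
    if ($info === false) {
        throw new RuntimeException("Cannot stat $dir");
    }
    return $info['nlink'];
}

$dir = sys_get_temp_dir();
echo $dir, ' has a link count of ', directoryLinkCount($dir), "\n";
```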


CSS "projection" media type

OperaShow

I tend to use Firefox pretty much all day, mainly because it's such a great tool for development. It's hard to go without tools like Firebug when you're working on complex JavaScript applications.

Recently I have been more and more attracted to the feature-set of Opera. The latest gem I found was "OperaShow", which has apparently been around since the days of Opera 5 (they are on 9.2 at time of writing).

CSS2 defines the "projection" media-type, intended for use in presentations. Most of the same rules apply as with the "print" media-type. (CSS2 spec on paged media).

My next presentation will definitely be done using Opera and completely in HTML. I'd like to invite you all to open up or download Opera, go to their tutorial and switch to fullscreen mode (F11 on Windows).


PHP-RPC

Update: I found out the difference between pointer and normal references, so I updated the 'data format' section. Update 2: Got the definition of all data types.

Recently I've seen several proposals and implementations from people trying to leverage PHP's serialize format for RPC (remote procedure calls).

PHP's format is very compact compared to XML-RPC, not to mention SOAP. There's no complex XML parsing involved and it's very fast to parse. Consuming a web service that uses this format can often be done in 2 or 3 lines of code, without any external library.

Additionally, it allows you to send over typed objects. You could, for example, send an object of the 'User' class, and on the other end of the line it would show up with the exact same class name.

Disadvantages:

  • Most of the advantages only apply when it's used with PHP. There are better ways to communicate when other languages are involved.
  • The classmapping will only be effective when both the client and the server have the same class definitions.
  • The serialize structure is not 100% compatible between versions.
  • There is no formal standard for either the structure or the RPC protocol.

I hope that by writing the following document I can fix the last 2 problems in this list. The classmapping issue might also be fixed down the road by adding some kind of negotiation scheme. Another TODO is adding introspection and multiple calls in one request, which most XML-RPC implementations support today.

If there's enough interest in a standard like this, I will turn this document into a more 'official' one and detach it from this blog. If there's not, well, it means I will have a nice set of business requirements for use within our business :). Please note that this is an early draft, so it's subject to change.

The proposal (0.1)

Goals

  • Client should be very easy to implement. Server is allowed to be a bit more complex.
  • No duplication of the HTTP protocol. For example, HTTP already provides encryption, redirecting and authentication.
  • PHP 4/5/6 compatibility.
  • Client and server implementations should be built on the idea 'be strict in what you produce, be liberal in what you accept'.

The request

Requests are made using either GET or POST. Both should be accepted. GET is more appropriate for fetching information, whereas POST is used for posting new data. POST has the advantage that it doesn't have any limit on the size of the request and that an encoding can be supplied. GET has the advantage that information can be fetched using a one-liner.

When there is no encoding specified, UTF-8 is assumed. Data supplied using POST should be encoded as application/x-www-form-urlencoded (this is how a browser submits data by default).

The method that's called should always be supplied as the 'method' variable. The method name can contain periods (.) to separate namespaces, like in XML-RPC. Arguments can be specified in two ways, and the API documentation should specify which way is appropriate. The first way is using named arguments. A GET example would be:

http://www.example.org/services/phprpc?method=getUsers&maxItems=20

The method here is getUsers, the named argument is maxItems and its value is 20.

The second way is using a list of arguments, which might be more appropriate in cases where you want to directly map services and methods from a class on the server to the API. This is also how XML-RPC works.

http://www.example.org/services/phprpc?method=getUsers&arguments[0]=20&arguments[1]=1

The first argument is 20, the second is 1.

Smart clients should autodetect whether the user is trying to use named arguments or a sequence by checking the type of the keys in the array.
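A client-side sketch of that autodetection, building the two GET URLs shown earlier. The function name buildRequestUrl() is made up for this example, and treating "all keys are integers" as "this is a sequence" is one possible reading of the rule, not something the draft prescribes.

```php
<?php
// Hypothetical client helper: builds a PHP-RPC GET URL.
// Integer keys only -> sent as arguments[0], arguments[1], ...
// Any string key    -> sent as named arguments.
function buildRequestUrl($endpoint, $method, array $args)
{
    $isList = true;
    foreach (array_keys($args) as $key) {
        if (!is_int($key)) {
            $isList = false;
            break;
        }
    }

    $query = array('method' => $method);
    if ($isList) {
        $query['arguments'] = $args;
    } else {
        // The + operator keeps 'method' if an argument uses the same name.
        $query = $query + $args;
    }

    return $endpoint . '?' . http_build_query($query, '', '&');
}

$endpoint = 'http://www.example.org/services/phprpc';

echo buildRequestUrl($endpoint, 'getUsers', array('maxItems' => 20)), "\n";
// http://www.example.org/services/phprpc?method=getUsers&maxItems=20

echo buildRequestUrl($endpoint, 'getUsers', array(20, 1)), "\n";
// http://www.example.org/services/phprpc?method=getUsers&arguments%5B0%5D=20&arguments%5B1%5D=1
```

http_build_query() percent-encodes the square brackets; PHP decodes them back into an array on the server side. Fetching the result would then be a matter of passing the URL to file_get_contents() and the response body to unserialize().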

Smart servers should use reflection to automatically map named arguments to the actual arguments in a list.
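One way a server could do this with the Reflection API is sketched below. The names mapNamedArguments() and getUsers() are hypothetical, and falling back to default values for missing arguments is an assumption on my part; the draft doesn't say what should happen there.

```php
<?php
// Hypothetical server helper: turns named arguments into an ordered
// argument list, based on the parameter names of the target function.
function mapNamedArguments($functionName, array $named)
{
    $function = new ReflectionFunction($functionName);
    $ordered = array();

    foreach ($function->getParameters() as $parameter) {
        $name = $parameter->getName();
        if (array_key_exists($name, $named)) {
            $ordered[] = $named[$name];
        } elseif ($parameter->isDefaultValueAvailable()) {
            $ordered[] = $parameter->getDefaultValue();
        } else {
            throw new InvalidArgumentException("Missing argument: $name");
        }
    }

    return $ordered;
}

// Example service method (made up for this sketch).
function getUsers($maxItems, $page = 1)
{
    return "page $page, at most $maxItems users";
}

$arguments = mapNamedArguments('getUsers', array('maxItems' => '20'));
echo call_user_func_array('getUsers', $arguments), "\n";
// page 1, at most 20 users
```

For methods on a class, ReflectionMethod offers the same getParameters() call.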

Clients SHOULD supply the version of PHP they are running. This can be either a complete version number or just the major version (e.g. 4, 5, 6). Clients should supply this as the phpVersion parameter. If the version number is not supplied, the current stable PHP version is assumed, which at the time of writing is 5.

Clients SHOULD also supply the version of the PHP-RPC protocol as the 'version' parameter. Currently this is 0.1.

Clients MAY supply a returnClasses parameter. The value for returnClasses is either 0 or 1 and this can tell the server if the client is aware of typed objects that might be sent from the server.

The server

The server MUST allow both GET and POST requests. The server MUST treat any incoming text without encoding as UTF-8.

The server SHOULD allow both named arguments and indexed arguments for methods where this is possible.

If the client sent phpVersion, the server MUST convert the returned serialized string so it can be read by the client. If the phpVersion is 4 or 5, the server MUST convert all unicode strings (type U) to binary strings (type s). If the phpVersion is 4, the server MUST convert all private and protected properties to public properties.

Servers SHOULD also convert all typed objects to either stdClass objects or arrays when the client-supplied returnClasses is set to 0, if this is appropriate.
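A possible way to do the object-to-array conversion before serializing is sketched here. The function and the User class are made up for the example. Because get_object_vars() is called from outside the class, only public properties survive, and a structure with circular references would make this recurse forever, so a real implementation would need a guard for that.

```php
<?php
// Hypothetical helper: recursively replaces objects by plain arrays.
function convertObjectsToArrays($value)
{
    if (is_object($value)) {
        // Called from outside the class, so only public properties are returned.
        $value = get_object_vars($value);
    }
    if (is_array($value)) {
        foreach ($value as $key => $item) {
            $value[$key] = convertObjectsToArrays($item);
        }
    }
    return $value;
}

// Example class (made up for this sketch).
class User
{
    public $name = 'alice';
    protected $passwordHash = 'secret';
}

echo serialize(convertObjectsToArrays(array(new User()))), "\n";
// a:1:{i:0;a:1:{s:4:"name";s:5:"alice";}}
```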

When the method call was successful, the server should send HTTP code 200. When an error occurred, the server should send an appropriate HTTP error code (for example 400 for missing arguments, 500 for unexpected exceptions, 401 if the user should authenticate first and 403 if the method was not allowed to be called).

The return data is always in PHP's serialize data format. The Content-Type header should always be 'application/x-php-serialized'.

When an error occurred, the server MUST send back an array with at least the 'message' property, which should contain a description of the error. The server MAY supply more information in this array, such as line number, filename, class of the exception, stack trace, etc.
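The response rules could be sketched as follows. The function buildResponse() is hypothetical, and mapping InvalidArgumentException to 400 is my own assumption about how "missing arguments" would be signalled. The function only returns the pieces of the response; a real server would send them with header() and echo.

```php
<?php
// Hypothetical server helper: calls a method and returns the HTTP status,
// content type and serialized body as an array instead of sending them.
function buildResponse($callback, array $arguments)
{
    try {
        $payload = call_user_func_array($callback, $arguments);
        $status = 200;
    } catch (InvalidArgumentException $e) {
        // Assumption: bad or missing arguments are reported this way.
        $status = 400;
        $payload = array('message' => $e->getMessage(), 'class' => get_class($e));
    } catch (Exception $e) {
        $status = 500;
        $payload = array('message' => $e->getMessage(), 'class' => get_class($e));
    }

    return array(
        'status'      => $status,
        'contentType' => 'application/x-php-serialized',
        'body'        => serialize($payload),
    );
}

// Example method that always fails (made up for this sketch).
function failingMethod()
{
    throw new RuntimeException('Database unavailable');
}

$ok = buildResponse('strtoupper', array('moo'));
echo $ok['status'], ' ', $ok['body'], "\n"; // 200 s:3:"MOO";

$error = buildResponse('failingMethod', array());
echo $error['status'], "\n"; // 500
```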

The serialized data format

All data is serialized using PHP's serialize format. This is an unofficial specification. Although the format is human-readable, it is a binary format and should be treated as one.

All items start with a 1-byte type identifier. These are the different types:

a   array
b   boolean
C   object which implements Serializable
d   double
i   integer
N   null
o   seems to be a deprecated way to encode objects
O   object + class
r   reference
R   pointer reference
s   string
S   escaped string. PHP6 uses this, but recent versions of PHP5 can also decode it.
U   Unicode string (PHP6)

A boolean, double and integer all have the format:

type:value;

Where type is either b, i or d and value is a literal number (e.g. 12, 85.12, or 1 for true, 0 for false).

A null is specified as:

N;

A string is specified with the length of the string, and the actual string between double quotes.

s:10:"helloworld";

A Unicode string works the exact same way, but this type is only supported in PHP6. PHP6 differentiates between binary strings and unicode strings, so strings coming from older versions of PHP will always be treated as binary strings in PHP6. Unicode strings are supplied as UTF-16, and the length specifies the number of bytes, not characters.
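The scalar, null and binary string formats described above are easy to check against PHP's own serialize(). The last line shows that the string length is counted in bytes; the "é" is written as its two UTF-8 bytes to make that explicit.

```php
<?php
echo serialize(12), "\n";           // i:12;
echo serialize(0.5), "\n";          // d:0.5;
echo serialize(true), "\n";         // b:1;
echo serialize(false), "\n";        // b:0;
echo serialize(null), "\n";         // N;
echo serialize('helloworld'), "\n"; // s:10:"helloworld";

// 5 characters, but 6 bytes in UTF-8.
echo serialize("h\xC3\xA9llo"), "\n"; // s:6:"héllo";
```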

Arrays wrap their elements in curly braces { }. The contents of an array are simply a list of the other types (or more arrays).

a:lengthofarray:{key1 + value1 + key2 + value2}

Note: the + signs are not literals here. Also note that arrays and objects are the only types that do not end with a semi-colon (;).

Example:

a:2:{i:0;s:3:"moo";i:1;s:4:"unox";}
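To make the rules so far concrete, here is a tiny userland encoder for nulls, booleans, integers, strings and arrays. It is only a sketch: the name encodeValue() is made up, doubles, objects and references are left out on purpose, and it returns a string instead of streaming with echo.

```php
<?php
// Hypothetical minimal encoder for a subset of the serialize format.
function encodeValue($value)
{
    if ($value === null) {
        return 'N;';
    }
    if (is_bool($value)) {
        return 'b:' . ($value ? 1 : 0) . ';';
    }
    if (is_int($value)) {
        return 'i:' . $value . ';';
    }
    if (is_string($value)) {
        // strlen() counts bytes, which is what the format expects.
        return 's:' . strlen($value) . ':"' . $value . '";';
    }
    if (is_array($value)) {
        $out = 'a:' . count($value) . ':{';
        foreach ($value as $key => $item) {
            // Keys are encoded like values: i for integers, s for strings.
            $out .= encodeValue($key) . encodeValue($item);
        }
        return $out . '}'; // no trailing semicolon for arrays
    }
    throw new InvalidArgumentException('Unsupported type: ' . gettype($value));
}

echo encodeValue(array('moo', 'unox')), "\n";
// a:2:{i:0;s:3:"moo";i:1;s:4:"unox";}
```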

Objects work similarly to arrays, but they include the name of the class.

O:classnamelength:"ClassName":propertycount:{key1 + value1 + key2 + value2}

Example:

O:7:"MyClass":2:{s:8:"0x00 + * + 0x00 + prop1";s:6:"value1";s:5:"prop2";s:6:"value2";}

PHP5 introduces private and protected properties. If a property is protected, its name is prefixed with a * that has 0x00 before and 0x00 after it. If a property is private, its name is prefixed with the name of the defining class, again with 0x00 before and 0x00 after the class name. The null bytes count towards the string length but are invisible when you print the serialized data, so a protected property simply looks like it's prefixed with a *.

Written out, that would be:

public    s:9:"property1";s:6:"value1";
protected s:12:"0x00 + * + 0x00 + property2";s:6:"value2";
private   s:20:"0x00 + ClassName + 0x00 + property3";s:6:"value3";   // all whitespace and + signs should be ignored in these lines and in the example above
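The property name prefixes can be made visible by replacing the null bytes in the output of serialize() with a printable \0. The class below is just an example; note how the null bytes are included in the string lengths.

```php
<?php
class MyClass
{
    public $property1 = 'value1';
    protected $property2 = 'value2';
    private $property3 = 'value3';
}

$serialized = serialize(new MyClass());

// Replace each (invisible) null byte by the two characters \0.
echo str_replace("\0", '\0', $serialized), "\n";
// O:7:"MyClass":3:{s:9:"property1";s:6:"value1";s:12:"\0*\0property2";s:6:"value2";s:18:"\0MyClass\0property3";s:6:"value3";}
```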

PHP5 also introduced the Serializable interface, which allows a custom encoding of objects. Serializable objects are encoded as:

C:classnamelength:"ClassName":datalength:{data}

So if the serialize method returned "foo", the result could look like this:

C:9:"TestClass":3:{foo}

Lastly, references. When the structure you're serializing contains a reference to a variable that was already serialized earlier, a reference is written instead of the value. The main reason is that with a circular reference in your structure, the serializer would keep traversing it.. until, well, it would break. It also means that if two variables reference the same data, that link is maintained.

A reference looks like this:

R:19;

There is a second reference type in PHP5. Objects in PHP sort of work like other references, but not completely. To illustrate I will just show an example.

<?php

class MyClass {
  var $myProp = 1;
}

$obj1 = new MyClass(); // new object
$obj2 = &$obj1; // pointer reference
$obj3 = $obj1; // value reference

$obj1->myProp = 2;
echo $obj2->myProp . "\n"; // will display 2
echo $obj3->myProp . "\n"; // will display 2

$obj1 = new MyClass();
$obj1->myProp = 3;
echo $obj2->myProp . "\n"; // will display 3
echo $obj3->myProp . "\n"; // will display 2

?>

PHP4 made a copy of every object when it was assigned to a new variable. To illustrate the difference, the output of this script in PHP4 would be:

2
1
3
1

Value references in PHP are serialized as:

r:19;

If you want to know which variable this refers to, look for the 19th value you decoded so far in your structure (you start counting at 1), excluding other references and property names (so array indexes don't count, array values do).
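Two small cases show both reference types and the counting. In each of them the outer array is value 1 and its first element is value 2, so the second element points at number 2. The variable names are just for illustration.

```php
<?php
// Two array elements that are PHP references to the same variable:
// serialized as a pointer reference (R).
$x = 1;
$values = array(&$x, &$x);
echo serialize($values), "\n";
// a:2:{i:0;i:1;i:1;R:2;}

// The same object stored twice without using &:
// serialized as a value reference (r).
$object = new stdClass();
$objects = array($object, $object);
echo serialize($objects), "\n";
// a:2:{i:0;O:8:"stdClass":0:{}i:1;r:2;}
```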

PHP4 seems to be treating r and R as the same thing (pointer references), so there's no need for conversion for PHP4 clients. I tested this with PHP 4.4.4.