't Bijstere spoor


A blog about Web development

Unicode nearing 50% of the web

According to a recent post on the Google Blog, Unicode is nearing 50% uptake on the web. The graph is rather steep as well:

unicode uptake graph

This is pretty good news. I've had the 'pleasure' of working on a number of integration projects where the third party was still using ISO-8859-1 (aka Latin-1). Usually when this is the case, it's not by choice but because of their software's default settings (browsers, MySQL, etc.). I for one hope non-Unicode charsets will soon be a thing of the past.
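To make the integration pain a bit more concrete, here is a minimal Python sketch of what happens when one side sends UTF-8 and the other side reads it as ISO-8859-1. The sample string and the `decode_with_fallback` helper are my own illustration, not something from the Google post, and the fallback is only a heuristic.

```python
def decode_with_fallback(raw: bytes) -> str:
    """Hypothetical helper: try UTF-8 first, fall back to ISO-8859-1.

    This is a heuristic only. Some ISO-8859-1 byte sequences are also
    valid UTF-8, so it can guess wrong on unusual input.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this never fails.
        return raw.decode("iso-8859-1")


text = "caf\u00e9"  # "café"

utf8_bytes = text.encode("utf-8")          # é becomes two bytes: C3 A9
latin1_bytes = text.encode("iso-8859-1")   # é becomes one byte: E9

# Reading UTF-8 bytes as ISO-8859-1 produces the classic garbled output.
mojibake = utf8_bytes.decode("iso-8859-1")
print(mojibake)

# The fallback helper recovers the text from either encoding.
print(decode_with_fallback(utf8_bytes))
print(decode_with_fallback(latin1_bytes))
```

Guessing like this is a workaround at best; agreeing on UTF-8 at both ends avoids the problem altogether.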

One other note in the post was about ligatures, such as ﬁ and the Dutch ĳ. If this is the first time you've heard about these, you might be surprised to see that you can (likely) only copy-paste ĳ as a whole, and not just the i or j. It's one Unicode character, not two. It just made me wonder: what kind of software would generate these, and more importantly why?
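The "one character, not two" point is easy to check yourself. Below is a small Python sketch using the standard `unicodedata` module; the `split_ligatures` helper name is my own, and it relies on NFKC compatibility normalization, which also rewrites other compatibility characters (such as full-width letters), so treat it as an illustration rather than a general-purpose fix.

```python
import unicodedata

LIGATURE_IJ = "\u0133"  # LATIN SMALL LIGATURE IJ
LIGATURE_FI = "\ufb01"  # LATIN SMALL LIGATURE FI


def split_ligatures(text: str) -> str:
    """Hypothetical helper: expand ligatures via NFKC normalization.

    NFKC replaces compatibility characters with their plain equivalents,
    so it affects more than just ligatures.
    """
    return unicodedata.normalize("NFKC", text)


# Each ligature is a single code point with its own Unicode name.
print(len(LIGATURE_IJ), unicodedata.name(LIGATURE_IJ))
print(len(LIGATURE_FI), unicodedata.name(LIGATURE_FI))

# After normalization, each one becomes two ordinary letters.
print(split_ligatures(LIGATURE_IJ), len(split_ligatures(LIGATURE_IJ)))
print(split_ligatures(LIGATURE_FI), len(split_ligatures(LIGATURE_FI)))
```

Running something like this over incoming text before indexing or comparing it is one way to make a search for "ij" also match the single-character form.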


Comments

Dave said on Friday, 29 January 2010 at 2:30 pm CET

"It just made me wonder: what kind of software would generate these, and more importantly why?"

Well, the answer is right there in the post you referenced: it just looks better in documents intended for printing: "[...] especially generated PDF documents."

Jordan Walker said on Friday, 29 January 2010 at 3:01 pm CET

Let the battle and competition rage.

Evert said on Friday, 29 January 2010 at 7:03 pm CET

@Dave,

Maybe I'm crazy, but shouldn't it be the font's job to make a combination of two characters look better?

Lars Gunther said on Friday, 29 January 2010 at 7:24 pm CET

And of course this means that PHP 6 is becoming more important with each day. But is it in sight?

Jay Pipes said on Friday, 29 January 2010 at 8:53 pm CET

Drizzle got rid of all non-UTF-8 character sets a long time ago. The web is UTF-8 and so should be the data behind it.

One minor thing, though. UTF-8 != Unicode :) UTF-8 is technically just an encoding that maps Unicode code points to sequences of bytes.

I would argue that the web has standardized on UTF-8, not UCS-4, UTF-32, UTF-16 or other Unicode transformation formats...

Cheers!

jay

Nelson Menezes said on Saturday, 30 January 2010 at 12:41 pm CET

As mentioned above, ligatures simply look better in print or at large font sizes on-screen.

If you are getting situations where ligatures are being copied and pasted, then someone screwed up: ligatures are meant to be applied on rendering only, not in the source material. So it would be the job of a browser to introduce ligatures on screen, but still allow copy/paste of individual characters.

BTW, great things are coming... http://hacks.mozilla.org/2009/10/font-control-for-designers/

Joost said on Wednesday, 3 February 2010 at 9:26 am CET

Ligatures like IJ are also important because of capitalization rules. I know Bing Maps only uppercases the first letter, which is wrong in Dutch.

http://www.bing.com/maps/#JnE9eXAuaGV0K2lqJTdlc3N0LjAlN2VwZy4xJmJiPTUzLjAxOTQzMDQyMDYxODIlN2U1LjYzOTk5NTU2MDA1MDAxJTdlNTMuMDAzNzU3NTgxOTI4JTdlNS42MDAwODQyODk5MDg0MQ==

http://maps.google.nl/maps?f=q&source=s_q&hl=nl&geocode=&q=het+ij&sll=52.469397,5.509644&sspn=3.935848,9.876709&ie=UTF8&hq=&hnear=Het+IJ&ll=52.369992,4.997234&spn=0.030814,0.077162&z=14






