Stephen Smith's Blog

Musings on Machine Learning…

Posts Tagged ‘ASCII characters

Moving on to Unicode

with 10 comments

Introduction

My first computer was an Apple II Plus, which didn’t even support lower case characters. Everything was upper case. To do word processing you used special characters to change case. Now we expect our computer to not just handle upper and lower case characters, but accented characters, special symbols, all the Asian language characters, all the Arabic characters and everything else.

In the beginning there was ASCII which allowed computers to encode the alphabet, numbers and the common typewriter characters, all 127 of them. Then we added another 127 characters for accented characters. But there were quite a few different accented characters so we had a standard first 127 characters and then various options for the upper 127 characters. This allowed us to handle most European languages on computers. Then there was the desire to support Chinese characters which number in the tens of thousands. So the idea came along to represent these as two bytes or 16 bits. This worked well, but it still only supported one language at a time and often ran out of characters. In developing this there were quite a few standards and quite a bit of incompatibility of moving files containing these between computers systems. But generally the first 127 characters were the original ASCII characters and then the rest depended on the code page you chose.

unicode

To try to bring some order to this mess and make the whole problem easier, Unicode was invented. The idea here was to have one character set that contained all the characters from all the languages in the world. Sounds like a good idea, but of course Computer Scientists underestimated the problem. They assumed this would be at most 64K characters and that they could use 2 bytes to represent each character. Like the 640K memory barrier, this turned out to be quite a bad idea. In fact there are now about 110,000 Unicode characters and the number is growing.

unicode1

Unicode specifies all the characters, but it allows for different encodings. These days the two most common are UTF8 and UTF16. Both of these have pros and cons. Microsoft chose UTF16 for all their systems. Since I work with Sage 300 and since we are trying to solve this on Windows that is what we will discuss in this article. To convert Sage 300 to Unicode using UTF8 would probably have been easier since UTF8 was designed to give better compatibility with ASCII, but we live in a Windows UTF16 world where we want to interact well with SQL Server and the Windows API.

Microsoft adopted UTF16 because they felt it would be easier, since basically each string became twice as long since every character was represented by 2 bytes. Hence memory doubled everywhere and it was simple to convert. This was fine except that 2 bytes doesn’t hold every Unicode character anymore, so some characters actually take 2 16-byte slots. But generally you can mostly predict the number of characters in a given amount of memory. It also lends itself better to just using array operations rather than having to go through strings with next/previous operations.

Compatibility

Windows took the approach that to maintain compatibility they would offer two APIs, one for ANSI and one for Unicode. So any Windows API call that takes a string as a parameter will have two versions, one ending A (for ANSI) and one ending in W (for Wide). Then in Windows.h if you compile with UNICODE defined then it uses the W version, else it uses the A version. This certainly adds a lot of pollution to the Windows API. But they maintained compatibility with all pre-existing programs. This was all put in place as part of Win32 (since recompiling was necessary).

For Sage 300 we’ve resisted going all in on Unicode, because we don’t want to double the size of our API and maintain that for all time, and if we do release a Unicode version then it will break every third party add-in and customization out there. We have the additional challenge that Unicode doesn’t work very well in VB6.

But with our 64 Bit version, we are not supporting VB6 (which will never be 64Bit) and all third parties have to make changes for 64 Bit anyway, so why not take advantage of this and introduce Unicode at the same time? This would make the move to 64 Bit more work, but hopefully will be worth it.

Why Switch to Unicode

Converting a large C/C++ application to Unicode is a lot of work. Why go to the effort? Sage 300 has had a traditional and simplified Chinese versions for a long time. What benefits does Unicode give us over the current double byte system we support?

One is that in double byte, only one character set can be installed on Windows at a time. This means for our online version we need separate servers to host the Chinese version. With Unicode we can support all languages from one set of servers, we don’t need separate sets of servers for each language group. This makes managing the online server farm much easier and much more uniform for upgrading and such. Besides our online offerings, we have had customers complain that when running Terminal Server they need separate ones for various branch offices in different parts of the world using different languages.

Another advantage that now we can support mixtures of script, so users can enter Thai in one field, Arabic in another and Chinese in another. Perhaps a bit esoteric, but it could have uses for optional fields where there are different ones for different locales.

Another problem we tend to have is with sort orders in all these different incompatible multi-byte character systems. With Unicode this becomes much more uniform (although there are still multiple of these) and much easier to deal with. Right now we avoid the problem by limiting key fields to upper case alphanumeric. But perhaps down the road with Unicode we can relax this.

A bit advantage is ease of setup. Getting the current multi-byte systems working requires some care in setting up the Windows server that often challenges people and causes problems. With Unicode, things are already setup correctly so this is much less of a problem.

Converting Sage 300

SQL Server already supports Unicode. Any UI technology newer than VB6 will also support Unicode. So that leaves our Business Logic layer, database driver and supporting DLLs. These are all written in C/C++ and so have to be converted to Unicode.

We still need to maintain our 32-Bit non-Unicode version and we don’t want two sets of source code, so we want to do this in such a way that we can compile the code either way and it will work correctly.

At the lower levels we have to use Microsoft’s tchar.h file which provides defines that will compile one way when _UNICODE is defined and another when it isn’t. This is similar to how Windows.h works for the Windows runtime, only it does it for the C runtime. For C++ you need a little extra for the string class, but we can handle that in plustype.h.

One annoying thing is that to specify a Unicode string in C, you do l”abc”, and with the macro in tchar.h, you change it to _T(“abc”). Changing all the strings in the system this way is certainly a real pain. Especially since 99.99% of these will never contain a non-ASCII character because they are for debugging or logging. If Microsoft had adopted UTF8 this wouldn’t have been necessary since the ASCII characters are the same, but with UTF16 this, to me is the big downside. But then it’s pretty mechanical work and a lot of it can be automated.

At higher levels of Sage 300, we rely more on the types defined in plustype.h and tend to use routines form a4wapi.dll rather than using the C runtime directly. This is good, since we can change these places to compile for either and hide a lot of the details from the application programmer. The other is that we only need to convert the parts of the system that deal with the database and the parts that deal with string handling (like error messages).

One question that comes up is what will be the length on fields in the database? Right now if it’s 60 characters then its 60 bytes. Under this method of converting the application the field will be 60 UTF16 characters for 120 bytes. (This is true if you don’t use the special characters that require 4 bytes, but most characters are in the standard 64K block).

Summary

Moving to both 64 Bits and Unicode is quite an exciting prospect. It will open up the doors to all sorts of advanced features, and really move our application ahead in a major way. It will revitalize the C/C++ code base and allow some quite powerful capabilities.

As a usual disclaimer, this article is about some research and proof of concept work we are doing and doesn’t represent a commitment as to which future version or edition this will surface in.