0xFF 0xFE -> CVS -> 0xEF 0xBB 0xBF

2005-06-17

My project house-keeping yesterday ended up with a rather strange discovery. I have some test log files that contain Unicode characters and are stored on disk as UTF-16 with the correct 2-byte ‘byte order mark’ (BOM) header of 0xFF 0xFE (the little-endian form). When I discovered that I needed to save some test logs as Unicode I hacked together some code that dealt with the UTF-16 BOM and did the right thing. Yesterday’s mammoth CVS check-in and test was obviously the first time that I’d checked these files out of CVS and run my tests. The tests failed in very strange ways and some of the test logs seemed to have been corrupted; they had 0xEF 0xBB 0xBF bytes at the start…

After some googling I discovered that 0xEF 0xBB 0xBF is the BOM for UTF-8 and that CVS had obviously translated UTF-16 to UTF-8 during the check-in/check-out process. I’d really rather it didn’t do that but I can’t find a way to prevent it (or even a mention of it in the docs).
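
Both byte sequences are the same character, U+FEFF, written in two different encodings. A quick Python check makes the relationship visible; this is only an illustration, and the variable names are made up for the example:

```python
# U+FEFF is the byte order mark; its bytes depend on the encoding used.
BOM = "\ufeff"

utf16_le = BOM.encode("utf-16-le")  # the header the log files started with
utf8 = BOM.encode("utf-8")          # the header found after the CVS round trip

print(utf16_le.hex(" "))  # ff fe
print(utf8.hex(" "))      # ef bb bf
```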

Once I realised what was going on I added some code to deal with UTF-8 files and my tests ran again and all was well. I guess I’m lucky that CVS bothered to add the BOM to the UTF-8 files (it’s optional for UTF-8) as I’m sure it would have taken me far longer to work out what was going on if I couldn’t just google for 0xEF 0xBB 0xBF.
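
For anyone hitting the same thing, the fix amounts to checking for more than one BOM before decoding. The following is a minimal Python sketch of that idea rather than the actual test code; `decode_log` and its `default` parameter are hypothetical names, and it assumes that a file with no BOM is in the default encoding:

```python
import codecs

# BOMs recognised by this sketch, paired with Python codec names.
# UTF-32 is deliberately not handled here.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),          # 0xEF 0xBB 0xBF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # 0xFF 0xFE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # 0xFE 0xFF
)


def decode_log(data: bytes, default: str = "utf-8") -> str:
    """Decode log bytes using the BOM if there is one, else `default`."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # Strip the BOM so it does not appear in the decoded text.
            return data[len(bom):].decode(encoding)
    # No BOM found: fall back to the assumed default encoding.
    return data.decode(default)
```

With this, `decode_log(b"\xff\xfeh\x00i\x00")` and `decode_log(b"\xef\xbb\xbfhi")` both give `"hi"`, so the tests no longer care which form the file comes back in.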

[Updated: 12:18] Thanks for the suggestions to put the file under CVS in binary mode. I tried that (using cvs admin -kb to convert the file) but then the file seemed to have its line endings mucked about with. At that point I gave up on CVS and fixed the code.