Groby Computers Ltd - Leicester

Oracle Database Administration


A word about Unicode and Byte Order Marks




Symptoms:

=========

Text files sent to you from another platform do not execute correctly in sqlplus

Oracle version – Generic

This is a common problem in cross platform dev environments:

Files that have been encoded in one of the Unicode encodings may contain non-printable characters that interfere with sqlplus' interpretation of SQL statements.

Cause:

======

Files that are transferred between a Windows environment and a Unix environment may be affected by a variety of issues.

1. End-of-line characters
2. Byte Order Mark characters

1. End-of-line characters.

In Unix environments the end of a line is marked by a single character: the newline, also known as a line-feed, 0x0a ( control-J ). Terminal emulators know that when they encounter a newline they also need to supply a carriage-return, 0x0d ( control-M ). Windows files, by contrast, explicitly store both the carriage-return and the line-feed at the end of every line. [ For those not familiar with typewriters, a carriage-return is the control character that moves the cursor to the start of the line, and a line-feed moves the cursor down one line. ]

This difference can be very annoying: Unix files opened in Windows can appear to be one continuous line, and Windows files viewed in Unix have ^M at the end of each line, or worse!
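
A quick way to check whether a file has Windows line endings is to count the lines that end in a carriage-return. The following is a minimal sketch using only POSIX shell features; the function name count_cr_lines is made up for this example.

```shell
# count_cr_lines FILE: print how many lines of FILE end in a carriage-return.
# (Hypothetical helper name, not part of any standard tool.)
count_cr_lines() {
  cr=$(printf '\r')          # a literal carriage-return character (0x0d)
  grep -c "${cr}\$" "$1"     # -c prints the number of matching lines
}
```

A result of 0 means the file already uses Unix line endings.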

Solution(s):

A. When transferring files between environments, do so in TEXT mode. Most programs, like ftp ( where it is called ASCII mode ), winscp and similar, have a TEXT mode for this; they add or remove the extra characters so you don't have to.

B. Use a conversion script such as dos2unix
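
If dos2unix is not installed, the same conversion can be approximated with sed. This is only a sketch: the function name strip_cr is hypothetical, and it writes to a second file rather than converting in place.

```shell
# strip_cr SRC DEST: copy SRC to DEST, removing one carriage-return
# from the end of each line. (Hypothetical helper name.)
strip_cr() {
  cr=$(printf '\r')              # a literal carriage-return character
  sed "s/${cr}\$//" "$1" > "$2"  # delete the CR that precedes each newline
}
```

Carriage-returns in the middle of a line are left alone, which is usually what you want for SQL scripts.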

2. Byte Order Marks

Files that have been encoded in one of the Unicode encodings may well start with a few bytes denoting the endianness of the file. This is called the Byte Order Mark, e.g.


        # od -bc fred.txt
        0000000 357 273 277 115 171 040 156 141 155 145 040 151 163 040 106 162
                357 273 277   M   y       n   a   m   e       i   s       F   r
        0000020 145 144 012
                  e   d  \n

  

Notice the three bytes, octal 357, 273 and 277 ( hex EF BB BF ), at the very start of the above file? This is the UTF-8 encoding of the Byte Order Mark. I'll explain ...

Unicode is a coding system that encompasses most of the character sets in the world. It has 17 planes, each holding 65,536 ( 0x10000 ) code-points, making a total of 1,114,112 different code-points! ( from 0 to 0x10FFFF ). Most of the time, it is sufficient to deal only with plane 0, the Basic Multilingual Plane (BMP).

Unicode code-points can be encoded as sequences of 8-bit, 16-bit or 32-bit units. These encodings are called Unicode Transformation Formats and they give us UTF-8, UTF-16 and UTF-32. In UTF-8 a code-point takes between one and four bytes, and plain ASCII characters take exactly one.
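
To see how many bytes UTF-8 spends on different code-points, the bytes can be written with printf octal escapes and counted. This is a small illustration only; utf8_bytes is a made-up helper, and its argument is used directly as a printf format string.

```shell
# utf8_bytes STRING: print the number of bytes printf produces for STRING.
# STRING may contain octal escapes such as '\303\251'. (Hypothetical helper.)
utf8_bytes() {
  printf "$1" | wc -c | tr -d ' '   # tr removes the padding some wc versions add
}

utf8_bytes 'A'                  # U+0041, plain ASCII
utf8_bytes '\303\251'           # U+00E9, e with acute accent
utf8_bytes '\342\202\254'       # U+20AC, euro sign
utf8_bytes '\360\237\230\200'   # U+1F600, a code-point outside the BMP
```

The four calls report 1, 2, 3 and 4 bytes respectively.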

For a multi-byte character set like UTF-16 the bytes of each character will be stored in one of two possible orders - these are known as Big Endian and Little Endian.

When a file is transferred from one platform to another it is important to know which way around the bytes were stored. If there is no other indication, a Byte Order Mark ( BOM ) is added to the start of the file to show this. The BOM is always the code-point U+FEFF ( 0xFEFF ): a reader that assumes the wrong byte order sees 0xFFFE instead, which is not a valid character, so a change of Endianness affects the BOM in a predictable way.
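
The first few bytes of a file are therefore enough to tell which BOM, if any, it carries. Here is a rough sketch built on od; the function name bom_kind is hypothetical, and it makes no attempt to tell UTF-32 apart from UTF-16.

```shell
# bom_kind FILE: report which byte order mark, if any, starts FILE.
# (Hypothetical helper; UTF-32 BOMs are not distinguished from UTF-16.)
bom_kind() {
  # Hex-dump the first three bytes without offsets or spaces, e.g. "efbbbf".
  head3=$(od -An -tx1 -N3 "$1" | tr -d ' \n')
  case $head3 in
    efbbbf) echo "UTF-8 BOM" ;;
    feff*)  echo "UTF-16 big-endian BOM" ;;
    fffe*)  echo "UTF-16 little-endian BOM" ;;
    *)      echo "no BOM" ;;
  esac
}
```

The two UTF-16 cases are the same code-point, U+FEFF, stored with its two bytes in opposite orders.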

This is all well and good, but UTF-8 is read one byte at a time and the order of its bytes is fixed, so there is no Endianness to signal. Nevertheless some Windows programs (notepad is one such program) still write a BOM at the start of UTF-8 files, causing interoperability issues.

Solution:

The simplest is to process the files with a script. I have written one that does a bit more than the possibly familiar dos2unix script; it is called (get ready to groan!) dos3unix ...

#!/bin/ksh
#--------------------------------------------------------------------------------
# File:    dos3unix
# Purpose: Remove the UTF-8 Byte Order Mark and Windows style carriage-returns
# Usage:   dos3unix [-n] file [file ...]
# Notes:   The original file is preserved as file- (i.e. with a hyphen appended)
#--------------------------------------------------------------------------------
 
  alias doit='true'
 
  while getopts n name
  do
    case $name in
      n)  alias doit='false';;
      *)  cat << eof
dos3unix: Usage
# dos3unix [-n] file [file...]
  -n  check and report only, no changes made
eof
          exit 1 ;;
    esac
  done
  shift $(($OPTIND -1))
 
  for f in "$@"
  do
 
    ty=$( file -b "$f" )
    n=$( grep -cP '\r$' "$f" )
    f3=$( sed '/./q' "$f" | cut -c1-3 )
    f32=$( sed '/./q' "$f" | cut -c1-3 | tr -d '\357\273\277' )
 
    if [[ "$f3" != "$f32" && "$n" == "0" ]]
    then
 
      doit && mv "$f" "${f}-"
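      # Strip the BOM from the start of line 1 only: bytes 0xBB and 0xBF elsewhere
      # in the file are legitimate parts of other UTF-8 characters ( \xHH needs GNU sed )
      doit && LC_ALL=C sed '1s/^\xEF\xBB\xBF//' < "${f}-" > "$f"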
      echo "$f: unBOM: $ty"
 
    elif [[ "$f3" != "$f32" ]]
    then
 
      doit && mv "$f" "${f}-"
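      # Strip the BOM from the start of line 1 only, then the trailing carriage-returns
      # ( tr -d would also delete these bytes from other UTF-8 characters; \xHH needs GNU sed )
      doit && LC_ALL=C sed -e '1s/^\xEF\xBB\xBF//' -e 's/\r$//' < "${f}-" > "$f"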
      echo "$f: dos2unix-unBOM: $ty"
 
    elif [[ "$n" == "0" ]]
    then
 
      echo "$f: $ty"
 
    else
 
      doit && mv "$f" "${f}-"
      doit && sed 's/\r$//' < "${f}-" > "$f"
      echo "$f: dos2unix: $ty"
 
    fi
 
  done
 
# End-of-file dos3unix
  

Simply copy and paste the above script into a file called dos3unix, make it executable and then put it somewhere in your PATH, e.g.

  # mkdir -p ~/bin
  # PATH=$PATH:~/bin
  # chmod +x dos3unix
  # mv dos3unix ~/bin
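
To try the script out, a file with both problems can be fabricated with printf. This is only a sketch; the function name make_bom_crlf_file and the path /tmp/bomtest.sql are made up for the example.

```shell
# make_bom_crlf_file FILE: write a two-line SQL file that starts with a
# UTF-8 BOM and uses Windows ( CR LF ) line endings. (Hypothetical helper.)
make_bom_crlf_file() {
  printf '\357\273\277select sysdate\r\nfrom dual;\r\n' > "$1"
}

# Example, assuming dos3unix has been installed as described above:
#   make_bom_crlf_file /tmp/bomtest.sql
#   dos3unix -n /tmp/bomtest.sql    # report only, no changes made
#   dos3unix /tmp/bomtest.sql       # convert, keeping /tmp/bomtest.sql-
```

With a file like this the script should take its dos2unix-unBOM branch, since both the BOM and the carriage-returns are present.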
  
Page Updated Wed Oct 19 21:05:56 BST 2011