Class Unicode


  • public final class Unicode
    extends Object
    Various Unicode manipulation methods that are more efficient than chaining operations: everything is done in the same buffer without creating a bunch of String objects.
    Author:
    Apache Directory Project
    • Method Summary

      Modifier and Type Method Description
      static char bytesToChar​(byte[] bytes)
      Return the Unicode char which is coded in the bytes at position 0.
      static char bytesToChar​(byte[] bytes, int pos)
      Return the Unicode char which is coded in the bytes at the given position.
      static byte[] charToBytes​(char car)
      Return the byte array that encodes the given Unicode char.
      static int countBytes​(char[] chars)
      Count the number of bytes included in the given char[].
      static int countBytesPerChar​(byte[] bytes, int pos)
      Count the number of bytes needed to return a Unicode char.
      static int countChars​(byte[] bytes)
      Count the number of chars included in the given byte[].
      static int countNbBytesPerChar​(char car)
      Return the number of bytes that hold a Unicode char.
      static boolean isUnicodeSubset​(byte b)
      Check if the current byte is in the Unicode subset: all chars except '\0', '(', ')', '*' and '\'.
      static boolean isUnicodeSubset​(char c)
      Check if the current char is in the Unicode subset: all chars except '\0', '(', ')', '*' and '\'.
      static boolean isUnicodeSubset​(String str, int pos)
      Check if the current char is in the Unicode subset: all chars except '\0', '(', ')', '*' and '\'.
      static String readUTF​(ObjectInput objectInput)
      Reads in a string that has been encoded using a modified UTF-8 format.
      static void writeUTF​(ObjectOutput objectOutput, String str)
      Writes four bytes of length information to the output stream, followed by the modified UTF-8 representation of every character in the string str.
    • Method Detail

      • countBytesPerChar

        public static int countBytesPerChar​(byte[] bytes,
                                            int pos)
        Count the number of bytes needed to return a Unicode char. This can be from 1 to 6.
        Parameters:
        bytes - The bytes to read
        pos - Position to start counting. It must be a valid start of an encoded char.
        Returns:
        The number of bytes to create a char, or -1 if the encoding is wrong. TODO: Should stop after the third byte, as a char is only 2 bytes long.
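The 1 to 6 range corresponds to the lead-byte patterns of the original UTF-8 design, in which the high bits of the first byte announce the length of the sequence. The sketch below illustrates that lookup. `LeadByteSketch` and `sequenceLength` are hypothetical names, not part of this class, and returning -1 for a null array or an out-of-range position is an assumption.

```java
// Hypothetical sketch, not the library's source: read the sequence length
// from the high bits of the lead byte.
final class LeadByteSketch {

    /** Returns the length (1 to 6) announced by the byte at pos, or -1. */
    static int sequenceLength(byte[] bytes, int pos) {
        if (bytes == null || pos < 0 || pos >= bytes.length) {
            return -1; // assumption: invalid input is reported as -1
        }
        int b = bytes[pos] & 0xFF;          // treat the byte as unsigned
        if ((b & 0x80) == 0x00) return 1;   // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2;   // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;   // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4;   // 11110xxx
        if ((b & 0xFC) == 0xF8) return 5;   // 111110xx
        if ((b & 0xFE) == 0xFC) return 6;   // 1111110x
        return -1;                          // continuation byte, 0xFE or 0xFF
    }

    public static void main(String[] args) {
        byte[] euro = { (byte) 0xE2, (byte) 0x82, (byte) 0xAC };
        System.out.println(sequenceLength(euro, 0));
        System.out.println(sequenceLength(euro, 1));
    }
}
```

Position 1 in the example points at a continuation byte, which is why pos must be the start of an encoded char.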
      • bytesToChar

        public static char bytesToChar​(byte[] bytes)
        Return the Unicode char which is coded in the bytes at position 0.
        Parameters:
        bytes - The byte[] representation of a Unicode string.
        Returns:
        The first char found.
      • bytesToChar

        public static char bytesToChar​(byte[] bytes,
                                       int pos)
        Return the Unicode char which is coded in the bytes at the given position.
        Parameters:
        bytes - The byte[] representation of a Unicode string.
        pos - The current position to start decoding the char
        Returns:
        The decoded char, or -1 if no char can be decoded. TODO: Should stop after the third byte, as a char is only 2 bytes long.
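As an illustration of the decoding step, here is a minimal sketch that rebuilds a char from a 1-, 2- or 3-byte sequence, which is all a 16-bit char can need. `DecodeSketch` and `decode` are hypothetical names. Unlike the method documented here, the sketch throws on an invalid lead byte rather than returning -1; that choice is an assumption made to keep the example short.

```java
// Hypothetical sketch, not the library's source: decode one char from a
// UTF-8 sequence of at most three bytes.
final class DecodeSketch {

    static char decode(byte[] bytes, int pos) {
        int b0 = bytes[pos] & 0xFF;
        if ((b0 & 0x80) == 0x00) {
            // 0xxxxxxx: the byte is the char
            return (char) b0;
        }
        if ((b0 & 0xE0) == 0xC0) {
            // 110xxxxx 10xxxxxx: 5 + 6 payload bits
            int b1 = bytes[pos + 1] & 0x3F;
            return (char) (((b0 & 0x1F) << 6) | b1);
        }
        if ((b0 & 0xF0) == 0xE0) {
            // 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 payload bits
            int b1 = bytes[pos + 1] & 0x3F;
            int b2 = bytes[pos + 2] & 0x3F;
            return (char) (((b0 & 0x0F) << 12) | (b1 << 6) | b2);
        }
        throw new IllegalArgumentException("Not the start of a 1- to 3-byte sequence");
    }

    public static void main(String[] args) {
        byte[] euro = { (byte) 0xE2, (byte) 0x82, (byte) 0xAC };
        System.out.println(Integer.toHexString(decode(euro, 0)));
    }
}
```

A truncated sequence makes the sketch fail with an ArrayIndexOutOfBoundsException; it does not validate that the trailing bytes are continuation bytes.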
      • countNbBytesPerChar

        public static int countNbBytesPerChar​(char car)
        Return the number of bytes that hold a Unicode char.
        Parameters:
        car - The character to be encoded
        Returns:
        The number of bytes to hold the char. TODO: Should stop after the third byte, as a char is only 2 bytes long.
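The TODO follows from the size of a Java char: 16 bits never need more than three UTF-8 bytes. A minimal sketch of the length rule, using standard UTF-8 thresholds, is shown below. `CharLengthSketch` and `encodedLength` are hypothetical names, and the documented method may handle edge cases differently.

```java
// Hypothetical sketch, not the library's source: number of UTF-8 bytes
// needed for a single 16-bit char.
final class CharLengthSketch {

    static int encodedLength(char c) {
        if (c < 0x80) {
            return 1;   // up to 7 bits fit in one byte
        }
        if (c < 0x800) {
            return 2;   // up to 11 bits fit in two bytes
        }
        return 3;       // the remaining 16-bit values need three bytes
    }

    public static void main(String[] args) {
        System.out.println(encodedLength('A'));
        System.out.println(encodedLength((char) 0x20AC));
    }
}
```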
      • countBytes

        public static int countBytes​(char[] chars)
        Count the number of bytes included in the given char[].
        Parameters:
        chars - The char array to decode
        Returns:
        The number of bytes needed to encode the char array
      • countChars

        public static int countChars​(byte[] bytes)
        Count the number of chars included in the given byte[].
        Parameters:
        bytes - The byte array to decode
        Returns:
        The number of chars in the byte array
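The two counting methods are mirror images of each other. The sketch below shows one way to compute both, assuming standard UTF-8 with at most three bytes per char. `CountSketch` and its method bodies are hypothetical, and treating a null array as 0 is an assumption.

```java
// Hypothetical sketch, not the library's source: count encoded bytes for a
// char[] and encoded chars in a byte[].
final class CountSketch {

    /** Total bytes needed to encode every char, 1 to 3 bytes each. */
    static int countBytes(char[] chars) {
        if (chars == null) {
            return 0; // assumption
        }
        int total = 0;
        for (char c : chars) {
            total += (c < 0x80) ? 1 : (c < 0x800) ? 2 : 3;
        }
        return total;
    }

    /** Every byte that is not a continuation byte (10xxxxxx) starts a char. */
    static int countChars(byte[] bytes) {
        if (bytes == null) {
            return 0; // assumption
        }
        int count = 0;
        for (byte b : bytes) {
            if ((b & 0xC0) != 0x80) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        char[] chars = { 'a', (char) 0x00E9, (char) 0x20AC };
        System.out.println(countBytes(chars));
    }
}
```

The sketch counts each char on its own, so a surrogate pair contributes two chars of three bytes each.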
      • charToBytes

        public static byte[] charToBytes​(char car)
        Return the byte array that encodes the given Unicode char.
        Parameters:
        car - The character to be transformed to an array of bytes
        Returns:
        The byte array representing the char. TODO: Should stop after the third byte, as a char is only 2 bytes long.
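For illustration, here is a minimal sketch of the char-to-bytes direction using standard UTF-8 bit layouts. `EncodeSketch` and `encode` are hypothetical names; whether the documented method produces exactly these bytes for every input (for example '\0') is an assumption.

```java
// Hypothetical sketch, not the library's source: encode one 16-bit char as
// one, two or three UTF-8 bytes.
final class EncodeSketch {

    static byte[] encode(char c) {
        if (c < 0x80) {
            return new byte[] { (byte) c };
        }
        if (c < 0x800) {
            return new byte[] {
                (byte) (0xC0 | (c >> 6)),          // 110xxxxx
                (byte) (0x80 | (c & 0x3F))         // 10xxxxxx
            };
        }
        return new byte[] {
            (byte) (0xE0 | (c >> 12)),             // 1110xxxx
            (byte) (0x80 | ((c >> 6) & 0x3F)),     // 10xxxxxx
            (byte) (0x80 | (c & 0x3F))             // 10xxxxxx
        };
    }

    public static void main(String[] args) {
        for (byte b : encode((char) 0x20AC)) {
            System.out.println(Integer.toHexString(b & 0xFF));
        }
    }
}
```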
      • isUnicodeSubset

        public static boolean isUnicodeSubset​(String str,
                                              int pos)
        Check if the current char is in the Unicode subset: all chars except '\0', '(', ')', '*' and '\'.
        Parameters:
        str - The string to check
        pos - Position of the current char
        Returns:
        True if the current char is in the unicode subset
      • isUnicodeSubset

        public static boolean isUnicodeSubset​(char c)
        Check if the current char is in the Unicode subset: all chars except '\0', '(', ')', '*' and '\'.
        Parameters:
        c - The char to check
        Returns:
        True if the current char is in the unicode subset
      • isUnicodeSubset

        public static boolean isUnicodeSubset​(byte b)
        Check if the current byte is in the Unicode subset: all chars except '\0', '(', ')', '*' and '\'.
        Parameters:
        b - The byte to check
        Returns:
        True if the current byte is in the unicode subset
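The five excluded characters are the ones that need escaping in LDAP search filter values, which is presumably why they are singled out here. A minimal sketch of the check follows. `SubsetSketch` and `isInSubset` are hypothetical names, and returning false for a null string or an out-of-range position is an assumption.

```java
// Hypothetical sketch, not the library's source: accept every char except
// '\0', '(', ')', '*' and '\'.
final class SubsetSketch {

    static boolean isInSubset(char c) {
        return c != '\0' && c != '(' && c != ')' && c != '*' && c != '\\';
    }

    static boolean isInSubset(String str, int pos) {
        // assumption: invalid input is simply "not in the subset"
        if (str == null || pos < 0 || pos >= str.length()) {
            return false;
        }
        return isInSubset(str.charAt(pos));
    }

    public static void main(String[] args) {
        System.out.println(isInSubset("a*b", 0));
        System.out.println(isInSubset("a*b", 1));
    }
}
```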
      • writeUTF

        public static void writeUTF​(ObjectOutput objectOutput,
                                    String str)
                             throws IOException
        Writes four bytes of length information to the output stream, followed by the modified UTF-8 representation of every character in the string str. If str is null, the value null is written with a length of 0 instead of throwing a NullPointerException. Each character in str is converted to a group of one, two, or three bytes, depending on the value of the character. Because the total number of bytes written in one go cannot exceed 65535, the total length is written first as four bytes (writeInt), and the string is then split into smaller parts if necessary and written part by part. As each character may be converted to a group of at most 3 bytes and at most 65535 bytes can be written at once, writing chunks of only 21845 (65535/3) characters at a time is on the safe side. See also DataOutput.writeUTF(String).
        Parameters:
        objectOutput - The objectOutput to write to
        str - The value to write
        Throws:
        IOException - If the value can't be written to the file
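The chunking described above can be sketched as follows. `ChunkedWriteSketch`, `writeChunked` and `toBytes` are hypothetical names, and the sketch writes the character count as the four-byte length, which is an assumption: the description does not say whether the length counts characters or bytes. Only `DataOutput.writeInt` and `DataOutput.writeUTF` are real API calls here.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch, not the library's source: write a length prefix, then
// the string in chunks small enough for DataOutput.writeUTF.
final class ChunkedWriteSketch {

    /** 65535 bytes per writeUTF call, at most 3 bytes per char. */
    static final int CHUNK_CHARS = 65535 / 3; // 21845

    static void writeChunked(DataOutput out, String str) throws IOException {
        if (str == null) {
            out.writeInt(0);            // null is written as length 0
            return;
        }
        out.writeInt(str.length());     // assumption: length in chars
        for (int start = 0; start < str.length(); start += CHUNK_CHARS) {
            int end = Math.min(start + CHUNK_CHARS, str.length());
            out.writeUTF(str.substring(start, end));
        }
    }

    /** Convenience wrapper for experiments: returns the bytes written. */
    static byte[] toBytes(String str) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            writeChunked(out, str);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(toBytes("abc").length);
    }
}
```

Each writeUTF call adds its own two-byte length, so a string of 50000 three-byte characters is written as three chunks of 21845, 21845 and 6310 characters.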
      • readUTF

        public static String readUTF​(ObjectInput objectInput)
                              throws IOException
        Reads in a string that has been encoded using a modified UTF-8 format. The general contract of readUTF is that it reads a representation of a Unicode character string encoded in modified UTF-8 format; this string of characters is then returned as a String. First, four bytes are read and used to construct a 32-bit integer in exactly the manner of the readInt method. This integer value is called the UTF length and specifies the number of additional bytes to be read. These bytes are then converted to characters by considering them in groups. The length of each group is computed from the value of the first byte of the group. The byte following a group, if any, is the first byte of the next group. See also DataInput.readUTF().
        Parameters:
        objectInput - The objectInput to read from
        Returns:
        The read string
        Throws:
        IOException - If the value can't be read
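A reader for a length-prefixed, chunked layout can be sketched as below. `ChunkedReadSketch`, `readChunked` and `fromBytes` are hypothetical names. The sketch assumes that the four-byte length counts characters and that each chunk was written with `DataOutput.writeUTF`; both are assumptions, and the first differs from the byte wording in the description above. Returning null for a length of 0 is also an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch, not the library's source: read a four-byte length,
// then collect writeUTF chunks until that many chars have been read.
final class ChunkedReadSketch {

    static String readChunked(DataInput in) throws IOException {
        int length = in.readInt();          // assumption: length in chars
        if (length < 0) {
            throw new IOException("Negative length: " + length);
        }
        if (length == 0) {
            return null;                    // assumption: 0 stands for null
        }
        StringBuilder result = new StringBuilder(length);
        while (result.length() < length) {
            // each chunk carries its own two-byte length (DataInput.readUTF)
            result.append(in.readUTF());
        }
        return result.toString();
    }

    /** Convenience wrapper for experiments: reads from a byte array. */
    static String fromBytes(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return readChunked(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = { 0, 0, 0, 5, 0, 2, 'a', 'b', 0, 3, 'c', 'd', 'e' };
        System.out.println(fromBytes(data));
    }
}
```

With this layout an empty string and null both have length 0, so the sketch cannot tell them apart when reading.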