DBA Catalog: unicode

Recently we created a database link on user’s request wondered to run ad-hoc query with complex join including demographic data (in Georgian language) residing on the remote database. Db link provided, user issued a query and got the results shown on picture 1. Question marks, caused by wrong character conversion.

At first we check character-set on both databases:

SELECT value$ FROM sys.props$ WHERE name = 'NLS_CHARACTERSET’;

Source db - RU8PC866 (this was set in that database before unicode invented :) )
Remote db - AL32UTF8
So, we face the problem as Unicode data retrieved from non-Unicode database. The same question marks appear all the time and client's NLS_LANG variable does not help.
Let see what those questions marks are:

SELECT dump(convert(name, 'RU8PC866' , 'AL32UTF8' ) , 1016) from remotetable@db_link;

Typ=1 Len=5 CharacterSet=AL32UTF8: 3f,3f,3f,3f,3f
Typ=1 Len=6 CharacterSet=AL32UTF8: 3f,3f,3f,3f,3f,3f
Typ=1 Len=6 CharacterSet=AL32UTF8: 3f,3f,3f,3f,3f,3f

We did explicit conversion where hexadecimal 3F is 63 in decimal and question mark symbol in ASCII. That's it.
dump function returns a varchar2 value that includes the datatype code, the length in bytes, and the internal representation of the expression. In our case AL32UTF8.
convert function converts a character string from one character set to another. In our case from AL32UTF8 into RU8PC866. The same conversion takes place implicitly when we select remote table.

Let’s try another conversion:

SELECT dump(convert(name, 'UTF8' , 'AL32UTF8' ) , 1016) from remotetable@db_link;

Typ=1 Len=15 CharacterSet=AL32UTF8: e1,83,93,e1,83,94,e1,83,9b,e1,83,9c,e1,83,90
Typ=1 Len=18 CharacterSet=AL32UTF8: e1,83,a4,e1,83,a0,e1,83,98,e1,83,93,e1,83,9d,e1,83,9c
Typ=1 Len=18 CharacterSet=AL32UTF8: e1,83,92,e1,83,a0,e1,83,98,e1,83,92,e1,83,9d,e1,83,9a

Here, we see correct codes in UTF-8.
For example: code “e1,83,93” corresponds to Georgian character ”დ” (don). However UTF8 representation has a little problem with our further development. The reason is that we are going to use function UNISTR which takes input argument in UCS-2 ( UTF-16 nowadays) format as two code units, the first from the high-surrogates range (U+D800 to U+DBFF), and the second from the low-surrogates range (U+DC00 to U+DFFF). In UTF-8 representation “e1,83,93” equals to pair 10, D3 in UTF-16.
So, correct conversion retrieved in oracle's AL16UTF16:

SELECT dump(convert(name, AL16UTF16, 'AL32UTF8' ) , 1016) from remotetable@db_link;

Typ=1 Len=10 CharacterSet=AL32UTF8: 10,d3,10,d4,10,db,10,dc,10,d0
Typ=1 Len=12 CharacterSet=AL32UTF8: 10,e4,10,e0,10,d8,10,d3,10,dd,10,dc
Typ=1 Len=12 CharacterSet=AL32UTF8: 10,d2,10,e0,10,d8,10,d2,10,dd,10,da

At this point we simply use UNISTR passing to it a value returned from dump function.
For example: UNISTR ('\10D3') results Georgian character "დ" (don).
Only problem left is to parse dump results in well formatted string of codes understandable for function UNISTR.
For this purpose we created a small function taking a string as an argument, performing transformation and returning formatted code sequence for UNISTR function
pseudo-function:

f(Typ=1 Len=12 CharacterSet=AL32UTF8: 10,d2,10,e0,10,d8,10,d2,10,dd,10,da) = 
"\10d2\10e0\10d8\10d2\10dd\10da".

Here is the real function:

CREATE OR REPLACE function PORTA_MNP_APP.unihex (pString varchar2)
   return varchar2
is
   v_pos1 number;
   v_pos2 number;
   v_res varchar2(2000); 
   v_res1 varchar2(2000); 
   v_char  char(1);
   v_char1  varchar2(2000);
   v_len number; 
 begin

 if  pString is null  then 
    return null;
 end if; 

 SELECT dump(convert(pString, 'AL16UTF16' , 'AL32UTF8' ) , 16) into v_res from dual; 

v_res := v_res || ','; 
v_pos1 :=instr(v_res, ':',1); 
v_len:=length( v_res ); 
v_pos2:=0; 

FOR i IN v_pos1+2 .. v_len  LOOP 
     v_char := substr( v_res, i, 1 ); 
     if v_char =',' then 
      if  length(v_char1)=1 then 
         v_res1:=v_res1 || '0' || v_char1; 
         v_char1:=''; 
          v_pos2:=v_pos2+1;  
              else 
            v_res1:=v_res1 ||  v_char1; 
            v_char1:='';
            end if; 
      else 
        v_char1:=v_char1 || v_char; 
            if  mod(v_pos2, 4)=0 and v_pos2<> v_len then   
                    v_res1:=v_res1 || '\'; 
               end if;  
            v_pos2:=v_pos2+1;  

       end if; 
  END LOOP; 

  return v_res1; 

 exception  
    WHEN OTHERS THEN  
      --DBMS_OUTPUT.Put_Line(pString); 
      return null;
   end;
/

After compiled we created a view in remote database as follows:

CREATE OR REPLACE VIEW remoteview
(
ID,
REQUEST_ID,
CUSTOMER_TYPE,
NAME,
SURNAME,
CUSTOMER_FULLNAME,
ADDRESS,
)
AS
SELECT ID,
REQUEST_ID,
CUSTOMER_TYPE,
unihex (NAME) NAME,
unihex (SURNAME) SURNAME,
unihex (CUSTOMER_FULLNAME) CUSTOMER_FULLNAME,
unihex (ADDRESS) ADDRESS,
FROM np_history_custinfo;

Finally, in source database session we run:

select unistr(name), unistr(surname), unistr(customer_fullname), unistr(address)
FROM remoteview@db_link

Output is pretty fine and formatted well in Georgian language. Unfortunately we can’t post final result here as it has sensitive data. :)
Thanks,

DBA Catalog

Google it ....

Friday, January 25, 2013

Retrieve Unicode data in non-unicode database using db_link (Georgian)