Sunday, 23 August 2015

Phonetic search of Arabic names spelled in Latin characters

The goal of this article is to describe an approach for searching a database of records that contain Arabic first names and surnames written in Latin characters. There is, of course, no problem when the records are written in Arabic script and stored in a charset that supports the language, such as the AR8MSWIN1256 charset recommended for Oracle databases. But when an Arabic name is transcribed phonetically and spelled in Latin characters, the mapping is no longer a bijection: one name can have many spellings.
Let's take a look:

Mohammed (salla Allah Alyhi wa sallam) can be spelled:
   1- Mohamed
   2- Muhamed
   3- Mouhamed
   4- Mouhammed
   5- Med (an abusive abbreviation used in North African countries, e.g. the formerly French-occupied countries)
   6- Mohd (same as 5)

Note that Arabic often has a doubled consonant, and people who do not follow the phonetic transcription may omit the second letter, even if the word then becomes incorrect in some cases,
e.g.:
 a single 'S' between two vowels is read as 'Z'

The Prophet's name is very frequent in Muslim countries, so could we say that it is a special case?
Let's take another look.


I'll now take my own name, and some others.
Wassim can be spelled:
   1- Ouassim
   2- Wessim
   3- Ouessim
   4- Wassime
   5- Ouassime
   6- Ouessime

The same goes for every name containing 'W': we can replace the 'W' with 'OU' and still have the same name.
We can also note that an 'E' at the end of the word often doesn't matter.

Tayeb can be spelled:
   1- Taieb
We can note that 'Y' and 'I' (and also 'EE' in Middle Eastern usage) have the same phonetic effect.
So we can write:
Sami --> Samy (don't forget Sammy, the double consonant mentioned earlier)
Rym --> Rim

And so on...
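
The observations so far can be sketched as a few string rewrites. This is only an illustration in Python, not the algorithm developed later in the article: the function name spelling_key is hypothetical, and it applies just four of the rules noted above ('W' to 'OU', 'Y' to 'I', doubled letters reduced to one, final 'E' dropped). The 'EE' equivalence is not handled here.

```python
import re


def spelling_key(name):
    """Reduce a Latin-spelled name to a rough key (illustrative rules only)."""
    key = name.lower()
    key = key.replace("w", "ou")         # 'W' and 'OU' give the same name
    key = key.replace("y", "i")          # 'Y' and 'I' sound the same
    key = re.sub(r"(.)\1+", r"\1", key)  # doubled letters are reduced to one
    if len(key) > 1 and key.endswith("e"):
        key = key[:-1]                   # a final 'E' often doesn't matter
    return key


for variants in (["Wassim", "Ouassim", "Wassime", "Ouassime"],
                 ["Sami", "Samy", "Sammy"],
                 ["Rym", "Rim"],
                 ["Tayeb", "Taieb"]):
    print(variants, "->", {spelling_key(v) for v in variants})
```

Each group prints a set with a single key, which is the property the search relies on. Variants such as Wessim still differ at this stage, because no vowel rule has been applied yet.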

Let's now come back to the approach we can adopt to solve the problem.

First of all, we must establish our dictionary of names. It is not a real dictionary, because the Arabic language has the vastest dictionary, with 12 000 000 words vs 600 000 words in English **. Instead, we will build a function that transforms an Arabic name into another word, which we can exploit later by simply matching it against the result of the same function applied to the searched-for name.

In other words:
We'll build a function f.
For each name in our database:
 f(name) = name', stored somewhere

Then select from the database the rows that satisfy the condition: f(searchedForName) = name'

Why store f(name) = name' somewhere?
Storing the result is very useful because it is always the same for the records already in our database, so we save time and money by computing it only once. Only when we change the function f
must we re-execute it on the whole population.
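
As a rough sketch of this store-then-match idea, the snippet below keeps f(name) in an indexed column and searches on it. It uses Python with an in-memory SQLite database purely for illustration; the table person, the column name_key and the simplified body of f are all assumptions, not the function built in the rest of this article.

```python
import re
import sqlite3


def f(name):
    """Stand-in for the normalisation function f (hypothetical, simplified rules)."""
    key = name.lower().replace("w", "ou").replace("y", "i")
    key = re.sub(r"(.)\1+", r"\1", key)  # reduce doubled letters
    return key[:-1] if len(key) > 1 and key.endswith("e") else key


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, name_key TEXT)")
conn.execute("CREATE INDEX idx_person_name_key ON person (name_key)")

# f is executed once per record, when the record is stored.
for name in ["Wassim", "Ouassime", "Samy", "Tayeb"]:
    conn.execute("INSERT INTO person VALUES (?, ?)", (name, f(name)))


def search(searched_for_name):
    """Return the stored names whose key equals f(searched_for_name)."""
    rows = conn.execute(
        "SELECT name FROM person WHERE name_key = ? ORDER BY name",
        (f(searched_for_name),),
    )
    return [row[0] for row in rows]


print(search("Wassime"))  # ['Ouassime', 'Wassim']
print(search("Sammy"))    # ['Samy']
```

At search time f runs only on the searched-for name, and the comparison is a plain equality that the index can serve. If f changes, the name_key column has to be recomputed for every row.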

function f(name):
Our parameter name is a String. An intelligent algorithm will treat that string as a sequence of chars; otherwise we would be back in the dictionary case **. We'll transform that char sequence into another one that respects our universal rules. Unlike **, the cardinality of the rule set (on the order of 10) is infinitesimal compared with the number of Arabic words (12 000 000).
For every incoming letter, our algorithm makes a decision and returns a string, which can be empty if needed. This is typical recursive behavior: the normalized name is the concatenation of the results of the recursive calls on the word.

We also note that each recursive call needs some extra information:
in particular the previous char (for the consonant case, such as a double char, and other cases), and sometimes the next char if it exists.


ALGORITHM String getArabicPhoneticEquivalent (String name , Int depth) RETURN String

DECLARE
   String toReturn;
   char nextChar,currentChar;
BEGIN

IF StringLength(name) == depth THEN //Condition that ends the recursion
    return "";
END IF;

toReturn =""; // Must be set to empty string

IF depth == 0 THEN
     name =  StringLowerCase(name);//Converts the input to lowercase
     name =  cleanSpecialChars(name); //Transforms some unsupported letters
     toReturn  = getSpecialNames(name);// Tests for special names
END IF;


IF toReturn != "" THEN // Case we have special name, no need to continue
   return toReturn;
END IF;

currentChar = name[depth];

IF depth + 1 < StringLength(name) THEN //Is there a next char?
     nextChar = name[depth+1];
ELSE
     nextChar = ' ';
END IF;

IF nextChar == currentChar THEN //Double letter reduced to one
    return  getArabicPhoneticEquivalent (name , depth + 1);
END IF;

IF consonant(currentChar) THEN
   
    IF currentChar == 'h' THEN //Skip h after a consonant
         IF depth > 0 AND consonant(name[depth-1]) THEN
              return  getArabicPhoneticEquivalent (name , depth + 1);
         END IF;
     END IF;
     return StringConcat(getConsonantEquivalent(currentChar,nextChar) , getArabicPhoneticEquivalent (name , depth + 1));
END IF;

IF vowel(currentChar) THEN
       return StringConcat(getVowelEquivalent(currentChar,nextChar) , getArabicPhoneticEquivalent (name , depth + 1));
END IF;

//Any other char (space, hyphen...) is kept as is; this fallback is an assumption
return StringConcat(currentChar+"" , getArabicPhoneticEquivalent (name , depth + 1));

END ALGORITHM;    

ALGORITHM cleanSpecialChars(String name) RETURN String
BEGIN
  name = StringReplaceAllSequences(name,"w","ou");
  name = StringReplaceAllSequences(name,"ï","i");
  name = StringReplaceAllSequences(name,"î","i");
  name = StringReplaceAllSequences(name,"ô","o");
  name = StringReplaceAllSequences(name,"é","e");
  name = StringReplaceAllSequences(name,"è","e");
  name = StringReplaceAllSequences(name,"ê","e");
  name = StringReplaceAllSequences(name,"à","a");
  name = StringReplaceAllSequences(name,"ç","c");
  return name;
END ALGORITHM;

ALGORITHM getSpecialNames(String name) RETURN String
BEGIN
   IF name == "med" OR name =="mohd" THEN
       return "mouhamad"; //Already normalized: 'e' becomes 'a', as for any other name
   END IF;

   return "";
END ALGORITHM;

ALGORITHM getConsonantEquivalent(char currentChar, char nextChar) RETURN String
BEGIN
IF currentChar == 'c' THEN
    IF consonant(nextChar) THEN
          return "k";
    ELSEIF vowel(nextChar) THEN
          return "s";
    END IF;
END IF;
return currentChar+"";
END ALGORITHM;

ALGORITHM getVowelEquivalent(char currentChar, char nextChar) RETURN String
BEGIN
IF currentChar == 'y' THEN
     return "i";
END IF;
IF currentChar == 'e' THEN
     IF nextChar == ' ' THEN //An 'e' at the end of the word doesn't matter
          return "";
     END IF;
     return "a";
END IF;
return currentChar+"";
END ALGORITHM;
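
For readers who want something they can run, here is a hedged Python sketch of the pseudocode. The names (phonetic_key, _walk and the helper functions) are mine, not part of the original. It includes the rule noted earlier that a final 'e' doesn't matter, passes special names through the same rules, and keeps any character that is neither a vowel nor a consonant unchanged, which is an assumption.

```python
VOWELS = set("aeiouy")
CONSONANTS = set("bcdfghjklmnpqrstvwxz")
ACCENTS = {"ï": "i", "î": "i", "ô": "o", "é": "e", "è": "e", "ê": "e",
           "à": "a", "ç": "c"}
SPECIAL_NAMES = {"med": "mouhamed", "mohd": "mouhamed"}


def clean_special_chars(name):
    """Rewrite 'w' as 'ou' and strip the accents listed in the pseudocode."""
    name = name.replace("w", "ou")
    for accented, plain in ACCENTS.items():
        name = name.replace(accented, plain)
    return name


def consonant_equivalent(current, nxt):
    if current == "c":
        if nxt in CONSONANTS:
            return "k"
        if nxt in VOWELS:
            return "s"
    return current


def vowel_equivalent(current, nxt):
    if current == "y":
        return "i"
    if current == "e":
        return "" if nxt == " " else "a"  # a final 'e' is dropped
    return current


def _walk(name, depth):
    """Recursive step: decide what the char at `depth` contributes."""
    if depth >= len(name):
        return ""
    current = name[depth]
    nxt = name[depth + 1] if depth + 1 < len(name) else " "
    rest = depth + 1
    if nxt == current:                     # double letter reduced to one
        return _walk(name, rest)
    if current in CONSONANTS:
        if current == "h" and depth > 0 and name[depth - 1] in CONSONANTS:
            return _walk(name, rest)       # skip 'h' after a consonant
        return consonant_equivalent(current, nxt) + _walk(name, rest)
    if current in VOWELS:
        return vowel_equivalent(current, nxt) + _walk(name, rest)
    return current + _walk(name, rest)     # assumption: keep other chars


def phonetic_key(name):
    name = clean_special_chars(name.lower())
    name = SPECIAL_NAMES.get(name, name)   # expand abbreviations first
    return _walk(name, 0)


print(phonetic_key("Wassim"), phonetic_key("Ouessime"))  # ouasim ouasim
print(phonetic_key("Mohammed"), phonetic_key("Med"))     # mohamad mouhamad
```

With these rules alone, Mohammed and Med still produce different keys ('o' versus 'ou'), so more rules are needed before every spelling of a name collapses to one key.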

Note that we can make this algorithm more accurate by adding further checks,
such as:
The abbreviation 'b' for 'ben'
The 'el' prefix, which is equivalent to 'al'

In the PL/SQL code below I've added these checks and some others.
(The code may be more up to date than the algorithm above; don't worry, the spirit is the same.)

 
create or replace 
package body PKG_SEARCH is
  -------------------------------------------------------------
  -- Tests whether a character is a vowel
  -- @return boolean
  -------------------------------------------------------------
  function f_is_vowel(c in char) return boolean is
  begin
     if c in ('a','e','i','o','y','u') then 
        return true;
     else
        return false;
     end if;   
  end;
    -------------------------------------------------------------
  -- Tests whether a character is a consonant
  -- @return boolean
  -------------------------------------------------------------
  function f_is_consonent(c in char) return boolean is
  begin
     if c in ('b','c','d','f','g','h','j','k','l','m','n','p','q','r','s','t','v','w','x','z') then      
        return true;
     else
        return false;
     end if;     
  end;
  -------------------------------------------------------------
  -- Normalizes special characters and common particles (ben, al)
  -- @return varchar
  -------------------------------------------------------------
  function  f_special_chars(a_name in out varchar) return varchar is
  begin    
    a_name := replace(a_name,'ï','i');
    a_name := replace(a_name,'î','i');
    a_name := replace(a_name,'ô','o');
    a_name := replace(a_name,'é','e');
    a_name := replace(a_name,'è','e');
    a_name := replace(a_name,'ê','e');
    a_name := replace(a_name,'à','a');
    a_name := replace(a_name,'ç','c');    
    a_name := replace(a_name,'w','o');    
    a_name := replace(a_name,'u','o');   
    a_name := replace(a_name,'y','i');   
    -- Expand the abbreviation 'b' to 'ben'. Only a standalone ' b ' is expanded:
    -- replacing every ' b' would also rewrite each word that starts with 'b'.
    a_name := replace(a_name,' b ',' ben ');   
    a_name := replace(a_name,' al ',' al');   
    -- Same expansion when the name itself starts with 'b '
    if length(a_name) >1 and substr(a_name,1,2) = 'b ' then
          a_name := 'ben '|| substr(a_name,3);
    end if;
    if length(a_name) >2 and substr(a_name,1,2) = 'el' then
          a_name := 'al'|| substr(a_name,3);
    end if;
    return a_name;  
  end;
  -------------------------------------------------------------
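  -- Handles special names (abbreviations)
  -- @return varchar (NULL when the name is not special)
  -------------------------------------------------------------
  function f_special_names(a_name in varchar) return  varchar is
  begin
     if a_name = 'med' or a_name ='mohd' then
       -- 'mohamad' is the key the rules below produce for Mohamed, Mouhamed, Muhamed...
       return 'mohamad';
     end if;  
     return '';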
  end;
  -------------------------------------------------------------
  -- Returns the string that corresponds to the given consonant
  -- @return varchar
  -------------------------------------------------------------
function f_consonant_equivalent(currentChar in char, nextChar in char) return  varchar is
  begin
  if currentChar = 'c' THEN
      if  f_is_consonent(nextChar)THEN 
            return 'k';
      elsif f_is_vowel(nextChar) THEN
            return 's';
      end if;
  end if;
  if currentChar = 'd' THEN
        if  f_is_consonent(nextChar)THEN
          return '';
        end if;
  end if;  
  return currentChar;
end;
  -------------------------------------------------------------
  -- Returns the string that corresponds to the given vowel
  -- @return varchar
  -------------------------------------------------------------
function f_vowel_equivalent(currentChar in char, nextChar in char) return varchar is
begin  
  if currentChar = 'e' then
      if  (nextChar=' ') then
          return '';
      else    
          return 'a';
      end if;
  end if;    
  return currentChar;
end; 

  -------------------------------------------------------------
  -- Returns the phonetic equivalent
  -- Main function 
  -- @return varchar
  -------------------------------------------------------------
function f_arabic_phonetic_aquivalent (a_name in out varchar, depth in out number) return varchar is
   toReturn varchar(256);
   nextChar char;
   currentChar char;
   nextDepth number;
BEGIN

IF depth = 1 THEN
     a_name :=  lower(a_name);
     a_name :=  f_special_chars(a_name);  
END IF;

toReturn  := f_special_names(a_name);
IF toReturn IS NOT NULL THEN -- Oracle treats '' as NULL, so a test with <> '' is never true
   return toReturn;
END IF;

IF length(a_name) < depth THEN 
    return '';
END IF;

toReturn :=''; 
nextDepth := depth + 1;

currentChar := substr(a_name,depth,1);
IF currentChar = ' ' THEN
    return  ' '||f_arabic_phonetic_aquivalent (a_name , nextDepth);
END IF;
IF length(a_name) > depth THEN
     nextChar := substr(a_name,nextDepth,1);      
ELSE
     nextChar := ' ';
END IF;

IF nextChar = currentChar THEN 
    return  f_arabic_phonetic_aquivalent (a_name , nextDepth);
END IF;

IF f_is_consonent(substr(a_name,depth,1)) THEN    
    IF substr(a_name,depth,1) = 'h' THEN 
         IF depth > 1 AND f_is_consonent(substr(a_name,depth-1,1)) THEN
              return  f_arabic_phonetic_aquivalent (a_name , nextDepth);
         END IF;
     END IF;
     return concat(f_consonant_equivalent(substr(a_name,depth,1),nextChar) , f_arabic_phonetic_aquivalent (a_name ,nextDepth));
END IF;

IF f_is_vowel(substr(a_name,depth,1)) THEN
       return concat(f_vowel_equivalent(substr(a_name,depth,1),nextChar) , f_arabic_phonetic_aquivalent (a_name , nextDepth));
END IF;
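-- Any other character (digit, hyphen...) is skipped; returning NULL here would drop the rest of the name
return f_arabic_phonetic_aquivalent (a_name , nextDepth);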
END ; 
----------------------------------------------------------------
function f_pkg_runner (a_name in out varchar) return varchar is
toReturn varchar2(256);
depth number;
begin
   depth :=1;
   toReturn := f_arabic_phonetic_aquivalent (a_name, depth);
   -- 'mohamad' is the key produced for Mohamed, Mouhamed, Muhamed... by the rules above
   toReturn := replace(toReturn,' mohd ','mohamad');
   toReturn := replace(toReturn,' med ','mohamad');
   if toReturn = 'mohd' or toReturn = 'med' then
      return 'mohamad';
   end if;   
   return toReturn;
end;
----------------------------------------------------------------
function f_pkg_tester return varchar is
sampleSurname varchar(256);

begin
      sampleSurname  := 'wassim';
      sampleSurname   := PKG_SEARCH .f_pkg_runner(sampleSurname);
      dbms_output.put_line(sampleSurname);
      return sampleSurname;
 
end;
end PKG_SEARCH ;
This will output oasim
