The goal of this article is to present an approach for searching records in a Latin-script database that contains Arabic first names and surnames. Of course, there is no problem when the records are written in Arabic letters and encoded in a charset that supports the language, such as the AR8MSWIN1256 charset recommended for Oracle databases. But when we transliterate an Arabic name phonetically and spell it in Latin characters, the mapping is not bijective: one name can end up with many spellings.
Let's take a look:
Mohammed (salla Allah Alyhi wa sallam) can be spelled :
1- Mohamed
2- Muhamed
3- Mouhamed
4- Mouhammed
5- Med (an abusive abbreviation used in North African countries formerly occupied by France)
6- Mohd (same as 5)
Note that Arabic words often contain a doubled consonant, and people who do not follow the phonetic transliteration may omit the second letter, even if the word becomes incorrect in certain cases,
e.g.:
an 'S' between two vowels, which is then read as 'Z'.
The Prophet's name is very frequent in Muslim countries, so can we say that it is a particular case?
Let's take another look.
I'll now take my own name and a few others.
Wassim can be spelled:
1- Ouassim
2- Wessim
3- Ouessim
4- Wassime
5- Ouassime
6- Ouessime
The same holds for every name containing 'W': we can replace the W with 'ou' and still have the same name.
We can also note that an 'E' at the end of the word often doesn't matter.
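To get a feel for how quickly the spellings multiply, here is a minimal Python sketch (not part of the original PL/SQL) that enumerates the variants produced by the observations above. The function name `spelling_variants` is hypothetical, and treating 'a' and 'e' as interchangeable is an assumption drawn from the Wassim / Wessim pair.

```python
from itertools import product

def spelling_variants(name):
    """Enumerate spellings obtained from three interchangeable choices."""
    choices = []
    for ch in name.lower():
        if ch == "w":
            choices.append(("w", "ou"))   # 'W' can be written 'ou'
        elif ch == "a":
            choices.append(("a", "e"))    # assumption: 'a' and 'e' are swapped freely
        else:
            choices.append((ch,))
    choices.append(("", "e"))             # optional trailing 'e'
    return {"".join(parts) for parts in product(*choices)}

variants = spelling_variants("Wassim")
print(sorted(variants))
```

Three binary choices already give eight spellings for a six-letter name, which is why matching on the raw text is hopeless.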
Tayeb can be spelled:
1-Taieb
We can note that 'Y' and 'I' (also 'EE' in Middle Eastern usage) have the same phonetic effect,
so we can spell:
Sami --> Samy (don't forget Sammy, the double consonant mentioned earlier)
Rym --> Rim
and so on.
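These two observations ('Y' reads like 'I', and a doubled letter reads like a single one) can be sketched in a few lines of Python. This is only an illustration under those two assumptions; the name `fold_y_and_doubles` is hypothetical, and the 'EE' case is deliberately left out.

```python
import re

def fold_y_and_doubles(name):
    """Map 'y' to 'i', then collapse any run of a repeated character."""
    folded = name.lower().replace("y", "i")
    return re.sub(r"(.)\1+", r"\1", folded)

for spelling in ("Sami", "Samy", "Sammy", "Rym", "Rim", "Tayeb", "Taieb"):
    print(spelling, "->", fold_y_and_doubles(spelling))
```

Sami, Samy and Sammy all fold to the same string, as do Rym / Rim and Tayeb / Taieb.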
Let's now come back to the approach we can adopt to solve the problem.
First of all, we need something like a dictionary of names. Not a real dictionary, because the Arabic language has the most vast vocabulary, with 12 000 000 words vs 600 000 words in English (**). Instead, we will build a function that transforms an Arabic name into another word, which we can exploit later by simply matching it against the result of the same function applied to the searched-for name.
In other words:
We'll build a function f.
For each name in our database:
f(name) = name', stored somewhere.
Select the database rows that satisfy the condition f(searchedForName) = name'.
Why store f(name) = name' somewhere?
Storing the result is very useful because it is always the same for a given record in our database, so we save time and money by computing it only once. Only when the function f changes must we re-execute the function on our whole population.
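The store-then-match idea can be sketched with Python and an in-memory SQLite database. Everything here is an assumption made for illustration: the table `people`, the column `name_key`, and the simplified stand-in `phonetic_key`, which only applies a handful of the rules discussed so far and is not the full function f.

```python
import re
import sqlite3

def phonetic_key(name):
    """Simplified stand-in for f: w->ou, y->i, collapse doubles, drop final e, e->a."""
    key = name.lower().replace("w", "ou").replace("y", "i")
    key = re.sub(r"(.)\1+", r"\1", key)
    return key.rstrip("e").replace("e", "a")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, name_key TEXT)")
conn.execute("CREATE INDEX idx_people_name_key ON people (name_key)")

# f(name) is computed once, at insert time, and stored next to the name.
names = ["Wassim", "Ouessime", "Sami", "Tayeb"]
conn.executemany(
    "INSERT INTO people (name, name_key) VALUES (?, ?)",
    [(n, phonetic_key(n)) for n in names],
)

def search(connection, wanted):
    """Return stored names whose key equals f(searchedForName)."""
    rows = connection.execute(
        "SELECT name FROM people WHERE name_key = ? ORDER BY id",
        (phonetic_key(wanted),),
    ).fetchall()
    return [row[0] for row in rows]

print(search(conn, "Wessim"))
```

Because the key is stored and indexed, the search is a plain equality lookup; the stored keys only need recomputing when `phonetic_key` itself changes.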
function f(name):
Our parameter name is a String, and an intelligent algorithm will treat that string as a sequence of characters; otherwise we would be back in the dictionary case (**). We'll transform that character sequence into another one that respects our universal rules and, unlike (**), the cardinality of the rule set (on the order of 10) is infinitesimal compared with the number of Arabic words (12 000 000).
For every incoming letter, our algorithm makes a decision and returns a string, which can be empty if needed. This is typical recursive behavior: the normalized name is the concatenation of the results of the recursive calls on the word.
We also note that each recursive call needs some extra information:
especially the previous character (for the consonant case, double letters, and other cases), and sometimes the next character if it exists.
ALGORITHM getArabicPhoneticEquivalent(String name, Int depth) RETURN String
DECLARE
    String toReturn;
    char nextChar, currentChar;
    Int nextDepth;
BEGIN
    IF StringLength(name) == depth THEN // Condition that stops the recursion
        return "";
    END IF;
    toReturn = ""; // Must be set to empty string
    IF depth == 0 THEN
        name = StringLowerCase(name); // Lowercase the input
        name = cleanSpecialChars(name); // Transform some unsupported letters
        toReturn = getSpecialNames(name); // Test for special names
    END IF;
    IF toReturn != "" THEN // We have a special name, no need to continue
        return toReturn;
    END IF;
    currentChar = name[depth];
    nextDepth = depth + 1;
    IF StringLength(name) > nextDepth THEN
        nextChar = name[nextDepth];
    ELSE
        nextChar = ' ';
    END IF;
    IF nextChar == currentChar THEN // Double letter reduced to one
        return getArabicPhoneticEquivalent(name, nextDepth);
    END IF;
    IF consonant(currentChar) THEN
        IF currentChar == 'h' THEN // Skip h after a consonant
            IF depth > 0 AND consonant(name[depth-1]) THEN
                return getArabicPhoneticEquivalent(name, nextDepth);
            END IF;
        END IF;
        return StringConcat(getConsonantEquivalent(currentChar, nextChar), getArabicPhoneticEquivalent(name, nextDepth));
    END IF;
    IF vowel(currentChar) THEN
        return StringConcat(getVowelEquivalent(currentChar, nextChar), getArabicPhoneticEquivalent(name, nextDepth));
    END IF;
    return getArabicPhoneticEquivalent(name, nextDepth); // Any other character is skipped
END ALGORITHM;
ALGORITHM cleanSpecialChars(String name) RETURN String
BEGIN
    name = StringReplaceAllSequences(name, "w", "ou");
    name = StringReplaceAllSequences(name, "ï", "i");
    name = StringReplaceAllSequences(name, "î", "i");
    name = StringReplaceAllSequences(name, "ô", "o");
    name = StringReplaceAllSequences(name, "é", "e");
    name = StringReplaceAllSequences(name, "è", "e");
    name = StringReplaceAllSequences(name, "ê", "e");
    name = StringReplaceAllSequences(name, "à", "a");
    name = StringReplaceAllSequences(name, "ç", "c");
    return name;
END ALGORITHM;
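Listing every accented letter by hand works, but it is easy to forget one. As an alternative technique (not the one used in the PL/SQL code of this article), Unicode decomposition can strip all diacritics at once. The helper name `strip_accents` is hypothetical.

```python
import unicodedata

def strip_accents(name):
    """Decompose each character (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Saïd"))
print(strip_accents("Aïcha"))
```

The 'w' to 'ou' substitution is not an accent and would still need its own replacement step.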
ALGORITHM getSpecialNames(String name) RETURN String
BEGIN
    IF name == "med" OR name == "mohd" THEN
        return "mouhamed";
    END IF;
    return "";
END ALGORITHM;
ALGORITHM getConsonantEquivalent(char currentChar, char nextChar) RETURN String
BEGIN
    IF currentChar == 'c' THEN
        IF consonant(nextChar) THEN
            return "k";
        ELSEIF vowel(nextChar) THEN
            return "s";
        END IF;
    END IF;
    return currentChar+"";
END ALGORITHM;
ALGORITHM getVowelEquivalent(char currentChar, char nextChar) RETURN String
BEGIN
    IF currentChar == 'y' THEN
        return "i";
    END IF;
    IF currentChar == 'e' THEN
        return "a";
    END IF;
    return currentChar+"";
END ALGORITHM;
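For readers who want to experiment outside Oracle, here is a minimal Python sketch of the algorithm above. It is an illustrative port, not the author's code: it uses a loop instead of recursion, inlines the helper algorithms, silently drops any character that is neither a vowel nor a consonant, and the names `phonetic_equivalent`, `ACCENTS` and `SPECIAL_NAMES` are assumptions.

```python
VOWELS = set("aeiouy")
CONSONANTS = set("bcdfghjklmnpqrstvwxz")
ACCENTS = str.maketrans("ïîôéèêàç", "iioeeeac")
SPECIAL_NAMES = {"med": "mouhamed", "mohd": "mouhamed"}

def phonetic_equivalent(name):
    """Normalize a Latin-script spelling of an Arabic name."""
    # Lowercase, replace accented letters, and write 'w' as 'ou'.
    name = name.lower().translate(ACCENTS).replace("w", "ou")
    if name in SPECIAL_NAMES:
        return SPECIAL_NAMES[name]
    out = []
    for i, cur in enumerate(name):
        nxt = name[i + 1] if i + 1 < len(name) else " "
        if nxt == cur:
            continue  # doubled letter: keep only the last occurrence
        if cur in CONSONANTS:
            if cur == "h" and i > 0 and name[i - 1] in CONSONANTS:
                continue  # skip 'h' after a consonant
            if cur == "c" and nxt in CONSONANTS:
                out.append("k")
            elif cur == "c" and nxt in VOWELS:
                out.append("s")
            else:
                out.append(cur)
        elif cur in VOWELS:
            out.append({"y": "i", "e": "a"}.get(cur, cur))
    return "".join(out)

for spelling in ("Wassim", "Ouessim", "Tayeb", "Taieb", "Sammy", "Med"):
    print(spelling, "->", phonetic_equivalent(spelling))
```

With these rules Wassim, Wessim and Ouessim share one key, as do Tayeb and Taieb. A trailing 'e' (Wassime) is not handled at this stage.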
Note that we can make this algorithm more effective by adding additional checks,
such as:
The abbreviation of 'ben' as 'b'
Treating 'el' the same as 'al'
In the PL/SQL code below I've added these checks and some others.
(The code may be more up to date than the algorithm; no worry, the spirit is kept.)
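As an illustration of these two extra checks, here is a small Python sketch. It works word by word, which is an assumption of this sketch (the PL/SQL code operates on the raw string instead), and the name `normalize_particles` is hypothetical.

```python
def normalize_particles(full_name):
    """Expand the 'b' abbreviation to 'ben' and write 'el' as 'al'."""
    normalized = []
    for token in full_name.lower().split():
        if token in ("b", "b."):
            normalized.append("ben")
        elif token == "el":
            normalized.append("al")
        elif token.startswith("el-"):
            normalized.append("al-" + token[3:])
        else:
            normalized.append(token)
    return " ".join(normalized)

print(normalize_particles("Ali B Salah"))
print(normalize_particles("El Amine"))
```

Working on whole words avoids touching names that merely start with 'b' or 'el', such as Bachir or Elyes.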
create or replace package body PKG_SEARCH is

  -------------------------------------------------------------
  -- Tests for vowels
  -- @return boolean
  -------------------------------------------------------------
  function f_is_vowel(c in char) return boolean is
  begin
    if c in ('a','e','i','o','y','u') then
      return true;
    else
      return false;
    end if;
  end;

  -------------------------------------------------------------
  -- Tests for consonants
  -- @return boolean
  -------------------------------------------------------------
  function f_is_consonent(c in char) return boolean is
  begin
    if c in ('b','c','d','f','g','h','j','k','l','m','n','p','q','r','s','t','v','w','x','z') then
      return true;
    else
      return false;
    end if;
  end;

  -------------------------------------------------------------
  -- Replaces special characters and normalizes 'b' / 'el'
  -- @return varchar
  -------------------------------------------------------------
  function f_special_chars(a_name in out varchar) return varchar is
  begin
    a_name := replace(a_name,'ï','i');
    a_name := replace(a_name,'î','i');
    a_name := replace(a_name,'ô','o');
    a_name := replace(a_name,'é','e');
    a_name := replace(a_name,'è','e');
    a_name := replace(a_name,'ê','e');
    a_name := replace(a_name,'à','a');
    a_name := replace(a_name,'ç','c');
    a_name := replace(a_name,'w','o');
    a_name := replace(a_name,'u','o');
    a_name := replace(a_name,'y','i');
    a_name := replace(a_name,' b ','ben');
    a_name := replace(a_name,' b','ben');
    a_name := replace(a_name,' al ',' al');
    if (length(a_name) >1 and substr(a_name,1,2) = 'b ')
       or (length(a_name) >3 and substr(a_name,1,2) = 'ben ') then
      a_name := 'ben '|| substr(a_name,2);
    end if;
    if length(a_name) >2 and substr(a_name,1,2) = 'el' then
      a_name := 'al'|| substr(a_name,3);
    end if;
    return a_name;
  end;

  -------------------------------------------------------------
  -- Handles special names
  -- @return varchar
  -------------------------------------------------------------
  function f_special_names(a_name in varchar) return varchar is
  begin
    if a_name = 'med' or a_name = 'mohd' then
      return 'mouhamed';
    end if;
    return '';
  end;

  -------------------------------------------------------------
  -- Returns the string that corresponds to the given consonant
  -- @return varchar
  -------------------------------------------------------------
  function f_consonant_equivalent(currentChar in char, nextChar in char) return varchar is
  begin
    if currentChar = 'c' THEN
      if f_is_consonent(nextChar) THEN
        return 'k';
      elsif f_is_vowel(nextChar) THEN
        return 's';
      end if;
    end if;
    if currentChar = 'd' THEN
      if f_is_consonent(nextChar) THEN
        return '';
      end if;
    end if;
    return currentChar;
  end;

  -------------------------------------------------------------
  -- Returns the string that corresponds to the given vowel
  -- @return varchar
  -------------------------------------------------------------
  function f_vowel_equivalent(currentChar in char, nextChar in char) return varchar is
  begin
    if currentChar = 'e' then
      if (nextChar=' ') then
        return '';
      else
        return 'a';
      end if;
    end if;
    return currentChar;
  end;

  -------------------------------------------------------------
  -- Returns the phonetic equivalent
  -- Main function
  -- @return varchar
  -------------------------------------------------------------
  function f_arabic_phonetic_aquivalent (a_name in out varchar, depth in out number) return varchar is
    toReturn varchar(256);
    nextChar char;
    currentChar char;
    nextDepth number;
  BEGIN
    IF depth = 1 THEN
      a_name := lower(a_name);
      a_name := f_special_chars(a_name);
    END IF;
    toReturn := f_special_names(a_name);
    -- Oracle treats '' as NULL, so the test must be IS NOT NULL (not <> '')
    IF toReturn IS NOT NULL THEN
      return toReturn;
    END IF;
    IF length(a_name) < depth THEN
      return '';
    END IF;
    toReturn := '';
    nextDepth := depth + 1;
    currentChar := substr(a_name,depth,1);
    IF currentChar = ' ' THEN
      return ' '||f_arabic_phonetic_aquivalent (a_name , nextDepth);
    END IF;
    IF length(a_name) > depth THEN
      nextChar := substr(a_name,nextDepth,1);
    ELSE
      nextChar := ' ';
    END IF;
    IF nextChar = currentChar THEN
      return f_arabic_phonetic_aquivalent (a_name , nextDepth);
    END IF;
    IF f_is_consonent(substr(a_name,depth,1)) THEN
      IF substr(a_name,depth,1) = 'h' THEN
        IF depth > 1 AND f_is_consonent(substr(a_name,depth-1,1)) THEN
          return f_arabic_phonetic_aquivalent (a_name , nextDepth);
        END IF;
      END IF;
      return concat(f_consonant_equivalent(substr(a_name,depth,1),nextChar) , f_arabic_phonetic_aquivalent (a_name ,nextDepth));
    END IF;
    IF f_is_vowel(substr(a_name,depth,1)) THEN
      return concat(f_vowel_equivalent(substr(a_name,depth,1),nextChar) , f_arabic_phonetic_aquivalent (a_name , nextDepth));
    END IF;
    return NULL;
  END;

  ----------------------------------------------------------------
  function f_pkg_runner (a_name in out varchar) return varchar is
    toReturn varchar2(256);
    depth number;
  begin
    depth := 1;
    toReturn := f_arabic_phonetic_aquivalent (a_name, depth);
    toReturn := replace(toReturn,' mohd ','mouhamad');
    toReturn := replace(toReturn,' med ','mouhamad');
    if toReturn = 'mohd' or toReturn = 'med' then
      return 'mouhamad';
    end if;
    return toReturn;
  end;

  ----------------------------------------------------------------
  -- A function without parameters is declared without parentheses
  function f_pkg_tester return varchar is
    sampleSurname varchar(256);
  begin
    sampleSurname := 'wassim';
    sampleSurname := PKG_SEARCH.f_pkg_runner(sampleSurname);
    dbms_output.put_line(sampleSurname);
    return sampleSurname;
  end;

end PKG_SEARCH;

This will output: oasim