The goal of this article is to present an approach for searching a Latin-script database whose records contain Arabic first names and surnames. Of course, there is no problem when the records are written in Arabic letters, encoded in a charset that supports the language, such as the AR8MSWIN1256 charset recommended for Oracle databases. But when we transliterate an Arabic name phonetically and spell it in Latin characters, that's another kind of job, and not a bijective one: a single name ends up with many spellings.
Let's take a look:
Mohammed (salla Allah Alyhi wa sallam) can be spelled:
1- Mohamed
2- Muhamed
3- Mouhamed
4- Mouhammed
5- Med (abusive spelling in North African countries, the former French-occupied countries)
6- Mohd (same as 5)
Note that Arabic often has two successive identical consonants, and those who don't respect the phonetic transliteration may omit the second one, even if the word becomes incorrect in some cases:
an 'S' between two vowels, for example, becomes a 'Z'.
The Prophet's name is very frequent in Muslim countries, so can we say that it is a particular case?
Let's take another look.
I'll now take my own name and some others.
Wassim, can be spelled:
1- Ouassim
2- Wessim
3- Ouessim
4- Wassime
5- Ouassime
6- Ouessime
The same holds for every name containing a 'W': we can replace the 'W' with 'OU' and still have the same name.
We can also note that an 'E' at the end of the word often doesn't matter.
Tayeb can be spelled:
We can note that 'Y' and 'I' (also 'EE' in Middle Eastern usage) have the same phonetic effect.
So we can spell:
Sami --> Samy (don't forget Sammy, the double consonant mentioned earlier)
Rym --> Rim
and so on.
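The observations above can be tried out with a few lines of Python. This is only a rough sketch: the function name `rough_key` is hypothetical, and it applies just the four rules noted so far ('W' as 'OU', 'Y' as 'I', double letters reduced to one, final 'E' ignored). It does not handle 'EE' or the 'A'/'E' variation seen in Wassim/Wessim.

```python
import re


def rough_key(name: str) -> str:
    """Reduce a Latin spelling to a rough comparison key (illustrative only)."""
    key = name.lower()
    key = key.replace("w", "ou")          # 'W' and 'OU' give the same name
    key = key.replace("y", "i")           # 'Y' and 'I' have the same effect
    key = re.sub(r"(.)\1+", r"\1", key)   # a double letter is reduced to one
    if key.endswith("e"):                 # a final 'E' often doesn't matter
        key = key[:-1]
    return key


for spelling in ("Wassim", "Ouassim", "Wassime", "Ouassime"):
    print(spelling, "->", rough_key(spelling))
```

With these four rules, Wassim, Ouassim, Wassime and Ouassime share one key, as do Sami, Samy and Sammy. Wessim still gets a different key, which is why the vowels need rules of their own.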
Let's now come back to the approach that we can adopt to solve the problem.
First of all, we must establish our dictionary of names. Not a real dictionary, because the Arabic language has the most vast dictionary, with 12 000 000 words versus 600 000 words in English **; let's say instead that we will build a function that transforms an Arabic name into another word, which we can exploit later by simply matching it against the result of the same function executed on the searched-for name.
In short:
We'll build a function f.
For each name in our database, f(name) = name' is stored somewhere.
We select the database rows that satisfy the condition f(searchedForName) = name'.
Why store f(name) = name' somewhere?
Storing the result is very useful because it is always the same for the records already in our database, so we save time and money by computing it only once. Only when the function f changes must we re-execute it on the whole population.
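The store-then-match idea can be sketched as follows. Everything here is an assumption made for illustration: the table `person`, the column `name_key`, an in-memory SQLite database standing in for the real one, and a deliberately tiny placeholder `f` that only lowercases and applies the 'W' and 'Y' rules.

```python
import sqlite3


def f(name: str) -> str:
    """Placeholder normaliser (hypothetical): lowercase, 'w' -> 'ou', 'y' -> 'i'."""
    return name.lower().replace("w", "ou").replace("y", "i")


def build_index(conn: sqlite3.Connection, names: list[str]) -> None:
    """Store every name together with f(name), computed once at insert time."""
    conn.execute("CREATE TABLE person (name TEXT, name_key TEXT)")
    conn.executemany(
        "INSERT INTO person (name, name_key) VALUES (?, ?)",
        [(n, f(n)) for n in names],
    )
    # An index on the stored key turns the search into a simple lookup.
    conn.execute("CREATE INDEX idx_person_key ON person (name_key)")


def search(conn: sqlite3.Connection, searched_for_name: str) -> list[str]:
    """Return the rows whose stored key equals f(searchedForName)."""
    rows = conn.execute(
        "SELECT name FROM person WHERE name_key = ? ORDER BY name",
        (f(searched_for_name),),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
build_index(conn, ["Wassim", "Ouassim", "Samy", "Sami", "Rym"])
print(search(conn, "ouassim"))
```

If f changes, the `name_key` column has to be recomputed for every row, which is the re-execution on the whole population mentioned above.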
function f(name):
Consider our parameter name, which is a String. An intelligent algorithm will treat that string as a sequence of characters; otherwise we would be back in the dictionary case **. We'll transform that character sequence into another one that respects our universal rules and, unlike **, the cardinality of the rule set (in the tens) is infinitesimal compared with the number of Arabic words (12 000 000).
For every incoming letter, our algorithm makes a decision and returns a string, which can be empty if needed. This is typical recursive behavior: the normalized name is the concatenation of the results of the recursive calls on the word.
We also remark that the partial strings generated during the recursive calls need some extra information:
especially the previous char (in the double-consonant case and some others), and sometimes the next char, if it exists.
ALGORITHM getArabicPhoneticEquivalent (String name, Int depth) RETURN String
    String toReturn;
    char nextChar, currentChar;
    IF StringLength(name) == depth THEN // Condition that ends the recursion
        return "";
    END IF
    toReturn = ""; // Must be set to empty string
    IF depth == 0 THEN
        name = StringLowerCase(name); // Lowercases the input
        name = cleanSpecialChars(name); // Transforms some unsupported letters
        toReturn = getSpecialNames(name); // Tests for special names
        IF toReturn != "" THEN // We have a special name, no need to continue
            return toReturn;
        END IF
    END IF
    currentChar = name[depth];
    IF StringLength(name) > depth + 1 THEN
        nextChar = name[depth + 1];
    ELSE
        nextChar = ' '; // No next char: end of the name
    END IF
    IF nextChar == currentChar THEN // Double letter reduced to one
        return getArabicPhoneticEquivalent(name, depth + 1);
    END IF
    IF consonant(currentChar) THEN
        IF currentChar == 'h' AND depth > 0 AND consonant(name[depth - 1]) THEN // Skip h after a consonant
            return getArabicPhoneticEquivalent(name, depth + 1);
        END IF
        return StringConcat(getConsonantEquivalent(currentChar, nextChar), getArabicPhoneticEquivalent(name, depth + 1));
    END IF
    IF vowel(currentChar) THEN
        return StringConcat(getVowelEquivalent(currentChar, nextChar), getArabicPhoneticEquivalent(name, depth + 1));
    END IF
    return ""; // Neither consonant nor vowel: the result ends here
ALGORITHM cleanSpecialChars(String name) RETURN String
    name = StringReplaceAllSequences(name, "w", "ou");
    name = StringReplaceAllSequences(name, "ï", "i");
    name = StringReplaceAllSequences(name, "î", "i");
    name = StringReplaceAllSequences(name, "ô", "o");
    name = StringReplaceAllSequences(name, "é", "e");
    name = StringReplaceAllSequences(name, "è", "e");
    name = StringReplaceAllSequences(name, "ê", "e");
    name = StringReplaceAllSequences(name, "à", "a");
    name = StringReplaceAllSequences(name, "ç", "c");
    return name;
ALGORITHM getSpecialNames(String name) RETURN String
    IF name == "med" OR name == "mohd" THEN
        return "mouhamed";
    END IF
    return "";
ALGORITHM getConsonantEquivalent(char currentChar, char nextChar) RETURN String
    IF currentChar == 'c' THEN
        IF consonant(nextChar) THEN
            return "k";
        ELSEIF vowel(nextChar) THEN
            return "s";
        END IF
    END IF
    return currentChar + "";
ALGORITHM getVowelEquivalent(char currentChar, char nextChar) RETURN String
    IF currentChar == 'y' THEN
        return "i";
    END IF
    IF currentChar == 'e' THEN
        return "a";
    END IF
    return currentChar + "";
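For readers who want to run the algorithm, here is one possible Python transcription of the pseudocode, offered as a sketch rather than a reference implementation. The Python function names are my own renderings. Two points are assumptions on my part: the recursion never mutates depth, so currentChar always refers to the letter being examined, and a character that is neither a consonant nor a vowel ends the result, as the PL/SQL version does by returning NULL.

```python
VOWELS = set("aeiouy")
CONSONANTS = set("bcdfghjklmnpqrstvwxz")
ACCENTS = {"ï": "i", "î": "i", "ô": "o", "é": "e", "è": "e",
           "ê": "e", "à": "a", "ç": "c"}


def clean_special_chars(name: str) -> str:
    """Replace 'w' with 'ou' and strip the accents listed in the pseudocode."""
    name = name.replace("w", "ou")
    for accented, plain in ACCENTS.items():
        name = name.replace(accented, plain)
    return name


def get_special_names(name: str) -> str:
    """Return the full spelling for known abbreviations, else an empty string."""
    return "mouhamed" if name in ("med", "mohd") else ""


def consonant_equivalent(current: str, nxt: str) -> str:
    if current == "c":
        if nxt in CONSONANTS:
            return "k"
        if nxt in VOWELS:
            return "s"
    return current


def vowel_equivalent(current: str, nxt: str) -> str:
    if current == "y":
        return "i"
    if current == "e":
        return "a"
    return current


def phonetic_equivalent(name: str, depth: int = 0) -> str:
    if depth == 0:
        name = clean_special_chars(name.lower())
        special = get_special_names(name)
        if special:
            return special
    if depth == len(name):            # end of the recursion
        return ""
    current = name[depth]
    # A blank stands for "no next char" (end of the name).
    nxt = name[depth + 1] if depth + 1 < len(name) else " "
    if nxt == current:                # double letter reduced to one
        return phonetic_equivalent(name, depth + 1)
    if current in CONSONANTS:
        if current == "h" and depth > 0 and name[depth - 1] in CONSONANTS:
            return phonetic_equivalent(name, depth + 1)   # skip 'h' after a consonant
        return consonant_equivalent(current, nxt) + phonetic_equivalent(name, depth + 1)
    if current in VOWELS:
        return vowel_equivalent(current, nxt) + phonetic_equivalent(name, depth + 1)
    return ""                         # assumption: any other character ends the result


for spelling in ("Wassim", "Wessim", "Ouessim", "Mohammed", "Mohamed"):
    print(spelling, "->", phonetic_equivalent(spelling))
```

Under these rules Wassim, Ouassim, Wessim and Ouessim share one key, and so do Mohammed and Mohamed. Muhamed and Mouhamed still get keys of their own, because nothing in the pseudocode unifies 'o', 'u' and 'ou'; the PL/SQL version handles this by replacing both 'w' and 'u' with 'o'.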
Note that we can make this algorithm more effective by adding additional controls,
such as:
The abbreviation of 'ben' as 'b'
The 'el' that is similar to 'al'
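These two controls could look like the sketch below. The function name `normalise_particles` is hypothetical, and working word by word (including the hyphenated 'el-' form) is my own assumption about how to apply the rule, not something the algorithm above prescribes.

```python
def normalise_particles(name: str) -> str:
    """Expand a standalone 'b' to 'ben' and rewrite the particle 'el' as 'al'."""
    words = []
    for word in name.lower().split():
        if word == "b":                 # abbreviation of 'ben'
            word = "ben"
        elif word == "el":              # 'el' is similar to 'al'
            word = "al"
        elif word.startswith("el-"):    # hyphenated form (assumption)
            word = "al-" + word[3:]
        words.append(word)
    return " ".join(words)


print(normalise_particles("Ali B Salah"))
print(normalise_particles("El Hadi"))
```

Splitting on whitespace keeps the rule from touching names that merely start with the letter 'b'.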
In the PL/SQL code below I've added these controls and some others.
The code may be more up to date than the algorithm (no worry, the spirit is kept).
create or replace package body PKG_SEARCH is

  -------------------------------------------------------------
  -- Tests for vowels
  -- @return boolean
  -------------------------------------------------------------
  function f_is_vowel(c in char) return boolean is
  begin
    if c in ('a', 'e', 'i', 'o', 'y', 'u') then
      return true;
    else
      return false;
    end if;
  end;

  -------------------------------------------------------------
  -- Tests for consonants
  -- @return boolean
  -------------------------------------------------------------
  function f_is_consonent(c in char) return boolean is
  begin
    if c in ('b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm',
             'n', 'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'z') then
      return true;
    else
      return false;
    end if;
  end;

  -------------------------------------------------------------
  -- Replaces special chars and normalises 'b'/'ben' and 'el'/'al'
  -- @return the cleaned name
  -------------------------------------------------------------
  function f_special_chars(a_name in out varchar) return varchar is
  begin
    a_name := replace(a_name, 'ï', 'i');
    a_name := replace(a_name, 'î', 'i');
    a_name := replace(a_name, 'ô', 'o');
    a_name := replace(a_name, 'é', 'e');
    a_name := replace(a_name, 'è', 'e');
    a_name := replace(a_name, 'ê', 'e');
    a_name := replace(a_name, 'à', 'a');
    a_name := replace(a_name, 'ç', 'c');
    a_name := replace(a_name, 'w', 'o');
    a_name := replace(a_name, 'u', 'o');
    a_name := replace(a_name, 'y', 'i');
    -- A standalone 'b' inside the name is the abbreviation of 'ben'
    a_name := replace(a_name, ' b ', ' ben ');
    -- Attach a standalone 'al' to the word that follows it
    a_name := replace(a_name, ' al ', ' al');
    -- Same abbreviation at the very beginning of the name
    if length(a_name) > 1 and substr(a_name, 1, 2) = 'b ' then
      a_name := 'ben ' || substr(a_name, 3);
    end if;
    -- A leading 'el' is treated like 'al'
    if length(a_name) > 2 and substr(a_name, 1, 2) = 'el' then
      a_name := 'al' || substr(a_name, 3);
    end if;
    return a_name;
  end;

  -------------------------------------------------------------
  -- Handles special names
  -- @return the full name, or an empty string (NULL in Oracle)
  -------------------------------------------------------------
  function f_special_names(a_name in varchar) return varchar is
  begin
    if a_name = 'med' or a_name = 'mohd' then
      return 'mouhamed';
    end if;
    return '';
  end;

  -------------------------------------------------------------
  -- Returns the string that corresponds to the given consonant
  -- @return varchar
  -------------------------------------------------------------
  function f_consonant_equivalent(currentChar in char, nextChar in char) return varchar is
  begin
    if currentChar = 'c' then
      if f_is_consonent(nextChar) then
        return 'k';
      elsif f_is_vowel(nextChar) then
        return 's';
      end if;
    end if;
    if currentChar = 'd' then
      if f_is_consonent(nextChar) then
        return '';
      end if;
    end if;
    return currentChar;
  end;

  -------------------------------------------------------------
  -- Returns the string that corresponds to the given vowel
  -- @return varchar
  -------------------------------------------------------------
  function f_vowel_equivalent(currentChar in char, nextChar in char) return varchar is
  begin
    if currentChar = 'e' then
      if nextChar = ' ' then
        return '';   -- an 'e' at the end of a word is dropped
      else
        return 'a';
      end if;
    end if;
    return currentChar;
  end;

  -------------------------------------------------------------
  -- Returns the phonetic equivalent
  -- Main function
  -- @return varchar
  -------------------------------------------------------------
  function f_arabic_phonetic_aquivalent(a_name in out varchar, depth in out number) return varchar is
    toReturn    varchar(256);
    nextChar    char;
    currentChar char;
    nextDepth   number;
  begin
    if depth = 1 then
      a_name := lower(a_name);
      a_name := f_special_chars(a_name);
    end if;
    toReturn := f_special_names(a_name);
    -- Oracle treats '' as NULL, so test with IS NOT NULL
    if toReturn is not null then
      return toReturn;
    end if;
    if length(a_name) < depth then
      return '';
    end if;
    nextDepth := depth + 1;
    currentChar := substr(a_name, depth, 1);
    -- A blank separates two words: keep it and go on
    if currentChar = ' ' then
      return ' ' || f_arabic_phonetic_aquivalent(a_name, nextDepth);
    end if;
    if length(a_name) > depth then
      nextChar := substr(a_name, nextDepth, 1);
    else
      nextChar := ' ';
    end if;
    -- Double letter reduced to one
    if nextChar = currentChar then
      return f_arabic_phonetic_aquivalent(a_name, nextDepth);
    end if;
    if f_is_consonent(currentChar) then
      -- Skip 'h' after a consonant
      if currentChar = 'h' then
        if depth > 1 and f_is_consonent(substr(a_name, depth - 1, 1)) then
          return f_arabic_phonetic_aquivalent(a_name, nextDepth);
        end if;
      end if;
      return concat(f_consonant_equivalent(currentChar, nextChar),
                    f_arabic_phonetic_aquivalent(a_name, nextDepth));
    end if;
    if f_is_vowel(currentChar) then
      return concat(f_vowel_equivalent(currentChar, nextChar),
                    f_arabic_phonetic_aquivalent(a_name, nextDepth));
    end if;
    return null;
  end;

  ----------------------------------------------------------------
  function f_pkg_runner(a_name in out varchar) return varchar is
    toReturn varchar2(256);
    depth    number;
  begin
    depth := 1;
    toReturn := f_arabic_phonetic_aquivalent(a_name, depth);
    toReturn := replace(toReturn, ' mohd ', ' mouhamad ');
    toReturn := replace(toReturn, ' med ', ' mouhamad ');
    if toReturn = 'mohd' or toReturn = 'med' then
      return 'mouhamad';
    end if;
    return toReturn;
  end;

  ----------------------------------------------------------------
  function f_pkg_tester return varchar is
    sampleSurname varchar(256);
  begin
    sampleSurname := 'wassim';
    sampleSurname := PKG_SEARCH.f_pkg_runner(sampleSurname);
    dbms_output.put_line(sampleSurname);
    return sampleSurname;
  end;

end PKG_SEARCH;