CY's Take on The Weekly Challenge #186 Task 2

CY's Take on The Weekly Challenge #186 Task 2 ‐ No Lost in Transliteration?

If you want to challenge yourself in programming, especially in Perl and/or Raku, go to https://theweeklychallenge.org, code the latest challenges, and submit your code on time (via GitHub or email).

Do tell me if I am wrong or if you strongly disagree with my statements!

It's time for the challenges in Week #186!

[Image: Venn diagram]

Task 2: Unicode Makeover

Intro: Why Care?

I grew up with the character encodings Big5, Big5-HKSCS and GB (traditional Chinese users can usually read simplified Chinese, and I know some university classmates from mainland China who can read traditional Chinese), and with the unwelcome visitor "chaotic code" (mojibake). Unicode has been a lifesaver!

I am very interested in different aspects of Unicode.

Approach: Perl

Once I saw the task released, I checked out the nice Perl Unicode Cookbook by Tom Christiansen for inspiration. I found the part related to character names. Knowing that Ã is named "LATIN CAPITAL LETTER A WITH TILDE", â is named "LATIN SMALL LETTER A WITH CIRCUMFLEX", Ò is named "LATIN CAPITAL LETTER O WITH GRAVE", etc., I tried to work out my solution:

use v5.30.0;
use charnames ();
use utf8;

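# Map an accented Latin letter to its base letter via its Unicode character name.
sub ch_latin {
    # viacode returns undef for code points without a name, hence the "" default.
    my $name = charnames::viacode(ord($_[0])) // "";
    return $1 if $name =~ /^LATIN CAPITAL LETTER (\w)/;
    return lc($1) if $name =~ /^LATIN SMALL LETTER (\w)/;
    return $_[0];    # not a Latin letter: keep the character unchanged
}

# Apply ch_latin to every character of the input string.
sub makeover {
    return join "", map { ch_latin($_) } split "", $_[0];
}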

use Test::More tests=>5;
ok makeover("ÃÊÍÒÙ") eq "AEIOU";
ok makeover("âÊíÒÙ") eq "aEiOU";
ok makeover("chữ Quốc ngữ") eq "chu Quoc ngu";
ok makeover("Paul Erdős") eq "Paul Erdos";
ok makeover("香港") eq "香港";

Languages, Test Data

So, there is a limitation ‐ my script applies only to Latin letters and their derivatives. I wonder whether other alphabets have similar "normalization" needs. There may be some, but are those alphabets in Unicode? Or are they very rare or obsolete?
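One way to probe the question about other alphabets is sketched below in Java. It is a different technique from the name-matching Perl script above: it decomposes the text with java.text.Normalizer (form NFD) and then deletes every combining mark (the regex class \p{M}). The class name MarkStripper and the method name stripMarks are made up for this sketch, and the Greek and Cyrillic samples are my own picks, not part of the task. The string literals use Unicode escapes only so that the source stays ASCII.

```java
import java.text.Normalizer;

final class MarkStripper {
    // Decompose with NFD, then delete every combining mark (general category M).
    // Characters without a decomposition, such as Han ideographs, pass through unchanged.
    static String stripMarks(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Latin: "Paul Erdos" written with o-double-acute (U+0151)
        System.out.println(stripMarks("Paul Erd\u0151s"));
        // Greek: "Ellada" written with alpha-tonos (U+03AC)
        System.out.println(stripMarks("\u0395\u03bb\u03bb\u03ac\u03b4\u03b1"));
        // Cyrillic: short i (U+0439) decomposes to i (U+0438) plus a breve
        System.out.println(stripMarks("\u0439"));
        // Han: the two characters of "Hong Kong" (U+9999 U+6E2F) are kept as they are
        System.out.println(stripMarks("\u9999\u6e2f"));
    }
}
```

So accented letters outside the Latin alphabet do exist in Unicode and can be decomposed in the same way. Whether stripping the marks is linguistically acceptable is a separate question: the Cyrillic short i, for example, is a letter in its own right rather than an accented variant.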

I don't know when A/E/I/O/U with tilde and A/E/I/O/U with circumflex are used. From my limited language exploration, besides "pinyin" for Chinese and Chinese-related languages, I know that the Vietnamese script uses the Latin alphabet with additional tone marks. One of my test data sets is the name of the Vietnamese script from Wikipedia: chữ Quốc ngữ. I don't speak Vietnamese (the 20th largest language in the world by certain measures); I just learnt a little about its script and tones from this YouTube video: The Vietnamese Language | Langfocus.

Approach: Java

After finishing the Perl script, I explored the case in Java. There is an outdated StackOverflow solution using java.text.Normalizer.

Anyway, this class is the right way to go. I figured out a solution after reading Oracle's official tutorial on the Normalizer API.

import java.text.Normalizer;
// Please also take a look at: java.lang.Character;

public class UnicodeMakeover
{
    public static void main(String[] args) {
        System.out.println(makeover("ÃÊÍÒÙ"));
        System.out.println(makeover("âÊíÒÙ"));
        System.out.println(makeover("chữ Quốc ngữ"));
        System.out.println(makeover("Paul Erdős"));
        System.out.println(makeover("香港")); // prints an empty line: non-ASCII characters are dropped
    }

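    public static String makeover(String text)
    {
        // NFKD splits each accented letter into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFKD);
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < decomposed.length(); i++)
        {
            char c = decomposed.charAt(i);
            // Keep ASCII only: this drops the combining marks and any other non-ASCII text.
            if (c <= 127)
            {
                result.append(c);
            }
        }
        return result.toString();
    }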
}
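The import comment above points to java.lang.Character. As a rough sketch (not part of my submitted solution), the character-name idea from the Perl approach could be ported to Java with Character.getName, which returns names such as "LATIN CAPITAL LETTER A WITH TILDE". The class name NameBasedMakeover and the method chLatin are made up for this illustration. Unlike the Normalizer version, this one leaves non-Latin characters untouched. The string literals use Unicode escapes only so that the source stays ASCII.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NameBasedMakeover {
    // Same pattern idea as the Perl script: take the first character after "LETTER ".
    private static final Pattern LATIN_LETTER =
        Pattern.compile("^LATIN (CAPITAL|SMALL) LETTER (\\w)");

    // Map one code point to its base Latin letter, or return it unchanged.
    static String chLatin(int codePoint) {
        String name = Character.getName(codePoint); // null for unassigned code points
        if (name != null) {
            Matcher m = LATIN_LETTER.matcher(name);
            if (m.find()) {
                String base = m.group(2);
                return m.group(1).equals("CAPITAL") ? base : base.toLowerCase(Locale.ROOT);
            }
        }
        return new String(Character.toChars(codePoint));
    }

    // Apply chLatin to every code point of the input.
    static String makeover(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> out.append(chLatin(cp)));
        return out.toString();
    }

    public static void main(String[] args) {
        // A-tilde, E-circumflex, I-acute, O-grave, U-grave
        System.out.println(makeover("\u00c3\u00ca\u00cd\u00d2\u00d9"));
        // "Paul Erdos" written with o-double-acute (U+0151)
        System.out.println(makeover("Paul Erd\u0151s"));
        // The two Han characters of "Hong Kong" stay as they are
        System.out.println(makeover("\u9999\u6e2f"));
    }
}
```

Like the Perl script, this sketch keeps only the first character of the letter name, so a letter whose name is longer than one character (for example "LATIN CAPITAL LETTER AE") is reduced to its first letter.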

Links

Perl Unicode Cookbook (2012) / Tom Christiansen
Seems like many teammates use this module:
CPAN: Unicode::Normalize
I learnt of the following module from @polettix (here is his blog post):
CPAN: Unicode::UCD
A brief guide to perl character encoding (2022) / David Cantrell ‐ It does not recommend using use utf8;.
I haven't read it in detail; this seems like another nice article.
Wikipedia: Mojibake
A fun read for me.
Wikipedia: (Unicode) Han unification
Another fun read for me.

Stay alert and also care for the world! □

The image of the Venn diagram is from Wikimedia Commons. Image details.

Except for images and code from other people, the content of this blog post is released in a copyleft spirit. You may share the content of this blog post (in full or in part) on other platforms if you share it in the free and open content spirit.

Links to CY's full code: ch-2.pl, UnicodeMakeover.java

Contact on Twitter: @e7_87.

Discuss via GitHub issues: here.

Email: fungcheokyin at gmail.com

Created Date: 16th October, 2022.

The Weekly Challenge ‐ Perl and Raku