Japanese Character Recognition with JavaScript

9 Dec 2016

Distinguishing the different types of Japanese characters is an important task for many applications involving Japanese input. This article discusses how to recognise the different character types using JavaScript and create some simple parsers for processing Japanese text.

Know Your Character Codes

To determine whether a character is Japanese and, if so, whether it is hiragana, katakana or kanji, we need to check its Unicode code point against the ranges belonging to each of these sets. Here is a great reference for Japanese character codes from which the following summary (in hexadecimal) is taken:

Hiragana	3040 - 309f
Katakana	30a0 - 30ff
Kanji (common & uncommon)	4e00 - 9faf
Kanji (rare)	3400 - 4dbf
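As a rough illustration of how these ranges can be used, the sketch below looks up a character's code point with String.prototype.codePointAt and reports which range it falls in. The RANGES object and the characterType function are hypothetical names introduced here for illustration; the boundaries are simply copied from the table above.

```javascript
// Hypothetical lookup table: the ranges are copied from the summary above.
const RANGES = {
    hiragana: [0x3040, 0x309f],
    katakana: [0x30a0, 0x30ff],
    kanji: [0x4e00, 0x9faf],
    rareKanji: [0x3400, 0x4dbf]
};

// Return the name of the first range containing the character's code point,
// or null if the character falls outside every range in the table.
function characterType(ch) {
    const code = ch.codePointAt(0);

    for (const [name, [start, end]] of Object.entries(RANGES)) {
        if (code >= start && code <= end) {
            return name;
        }
    }

    return null;
}

console.log(characterType("あ")); // "hiragana"
console.log(characterType("カ")); // "katakana"
console.log(characterType("水")); // "kanji"
console.log(characterType("a")); // null
```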

A Simple Character Matcher

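Let's make a simple function to check if a character is a kanji using the Unicode character ranges above. JavaScript compares strings by their UTF-16 code units, so for single characters in these ranges the comparison operators effectively compare code points: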

function isKanji(ch) {
    // "一" (4e00) to "龯" (9faf): common & uncommon kanji
    // "㐀" (3400) to "䶿" (4dbf): rare kanji
    return (ch >= "一" && ch <= "龯") ||
        (ch >= "㐀" && ch <= "䶿");
}

console.log(isKanji("a")); // false
console.log(isKanji("あ")); // false
console.log(isKanji("水")); // true

If you wish to check whether a kanji belongs to the ‘common & uncommon’ or ‘rare’ subset, you can split the operands of the OR operator into the return values of two distinct functions.
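One way this split might look is sketched below. The names isCommonKanji and isRareKanji are made up for this example, and the ranges are written as Unicode escapes that correspond to the table at the top of the article.

```javascript
// Hypothetical helper: kanji in the 'common & uncommon' range (4e00 - 9faf).
function isCommonKanji(ch) {
    return ch >= "\u4e00" && ch <= "\u9faf";
}

// Hypothetical helper: kanji in the 'rare' range (3400 - 4dbf).
function isRareKanji(ch) {
    return ch >= "\u3400" && ch <= "\u4dbf";
}

// The combined check is the OR of the two subset checks.
function isKanji(ch) {
    return isCommonKanji(ch) || isRareKanji(ch);
}

console.log(isCommonKanji("水")); // true
console.log(isRareKanji("水")); // false
console.log(isRareKanji("㐀")); // true
console.log(isKanji("あ")); // false
```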

Parsing Strings

To determine whether a string contains a character of a given type we must iterate over the individual characters in the string and call a predicate which returns true if the character matches. If the predicate is satisfied for any character then the result is true; otherwise it is false.

The Array object features the function Array.prototype.some which does exactly this for arrays. Since strings are array-like objects we can leverage Array.prototype.some instead of implementing the iteration ourselves. We do so via Function.prototype.call as follows:

function some(str, callback) {
    return Array.prototype.some.call(str, callback);
}

We can then pass our isKanji() function as a predicate to some() to create a new function which checks if a string contains a kanji:

function hasKanji(str) {
    return some(str, isKanji);
}

console.log(hasKanji("Only English here.")); // false
console.log(hasKanji("これは日本語です。")); // true

What if we want to return an array of all of the kanji in a sentence? Let's define an abstract parser to extract all of the characters from a string which match a predicate and then pass it our isKanji function:

function basicParser(str, condition) {
    let result = [];

    for (let i = 0; i < str.length; ++i) {
        if (condition(str[i])) {
            result.push(str[i]);
        }
    }

    return result;
}

function parseKanji(str) {
    return basicParser(str, isKanji);
}

console.log(parseKanji("これは日本語です。")); // ["日", "本", "語"]
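The same extraction can probably also be expressed with Array.prototype.filter, borrowed via Function.prototype.call in the same way as Array.prototype.some earlier. This is only an alternative sketch: filterChars is a hypothetical name, and isKanji is repeated so that the snippet runs on its own.

```javascript
function isKanji(ch) {
    return (ch >= "\u4e00" && ch <= "\u9faf") ||
        (ch >= "\u3400" && ch <= "\u4dbf");
}

// Hypothetical alternative to basicParser: borrow Array.prototype.filter
// and apply it to the string, which is an array-like object.
function filterChars(str, condition) {
    // Wrap the predicate so it only receives the character, not the
    // index and source arguments that filter also passes.
    return Array.prototype.filter.call(str, ch => condition(ch));
}

console.log(filterChars("これは日本語です。", isKanji)); // ["日", "本", "語"]
console.log(filterChars("Only English here.", isKanji)); // []
```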

We can take this one step further by extracting consecutive kanji characters to get a very crude jukugo parser. In this instance we iterate over the string, adding consecutive kanji to a result string. When we encounter a non-kanji character we add the result string to an array and start building the next one. Once all characters have been processed we return the array of results. This is of course a very crude parser which omits okurigana.

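function accumulativeParser(str, condition) {
    let accumulations = [];
    let accumulator = "";

    for (let i = 0; i < str.length; ++i) {
        let ch = str[i];

        if (condition(ch)) {
            accumulator += ch;
        } else if (accumulator !== "") {
            accumulations.push(accumulator);
            accumulator = "";
        }
    }

    // Keep a final run that reaches the end of the string;
    // without this, "日本語" on its own would return [].
    if (accumulator !== "") {
        accumulations.push(accumulator);
    }

    return accumulations;
}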

function parseKanjiCompounds(str) {
    return accumulativeParser(str, isKanji);
}

console.log(parseKanjiCompounds("私は日本語が好きです。")); // ["私", "日本語", "好"]

Conclusion

Now that we can do some basic analysis of strings containing kanji, we can create similar functions for kana by substituting in the relevant character code ranges. Feel free to check out my simple Japanese language utility library nihongo, which features these functions and is available as a node package.
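As a minimal sketch of that substitution, the kana matchers might look like the following. The function names are illustrative and are not necessarily those used in the nihongo library; the ranges come from the table at the top of the article.

```javascript
// Hiragana range from the table: 3040 - 309f.
function isHiragana(ch) {
    return ch >= "\u3040" && ch <= "\u309f";
}

// Katakana range from the table: 30a0 - 30ff.
function isKatakana(ch) {
    return ch >= "\u30a0" && ch <= "\u30ff";
}

// A character is kana if it is either hiragana or katakana.
function isKana(ch) {
    return isHiragana(ch) || isKatakana(ch);
}

console.log(isHiragana("あ")); // true
console.log(isKatakana("ア")); // true
console.log(isHiragana("ア")); // false
console.log(isKana("水")); // false
```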

Tags: Japanese, Language, JavaScript