I'm testing a site whose audience mostly doesn't have English as their main language. When it comes to non-English languages, I consider there to be 3 groups:
- The target language's alphabet is a subset of the English alphabet
This is easy. Just use something like cewl.
- The target language uses some characters not in the English alphabet
These are languages like German and Spanish. In these cases, I would use cewl, run the output through something like sed to replace the non-English characters with their English equivalents (ä to ae and/or a), and process as usual.
- Languages which don't use any of the English alphabet
Languages like Chinese and Japanese don't use the English alphabet at all. However, they do have romanized versions of their character sets (Romaji for Japanese, Pinyin for Chinese). Considering just Japanese, it would be easy to romanize all the Hiragana & Katakana, but really hard to work with Kanji, since each character can have multiple Romaji representations depending on context, and I don't wanna have to learn a new language mid-project.
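For the second group, the replacement step I have in mind looks roughly like the Python sketch below. It is only an illustration: the `DIGRAPHS` table and the function names are my own assumptions, not part of cewl or sed, and the table covers German only. For each word it emits both styles of replacement (ä to ae and ä to a); characters with no ASCII equivalent, such as Kanji, are left untouched.

```python
import itertools
import unicodedata

# Hypothetical digraph table for illustration (German only); extend per language.
DIGRAPHS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def strip_accents(ch):
    """Decompose a character and drop combining marks (ä -> a, ñ -> n)."""
    decomposed = unicodedata.normalize("NFKD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def ascii_variants(word):
    """Return a sorted list of ASCII spellings of word, covering both styles."""
    options = []
    for ch in word.lower():
        if ch.isascii():
            options.append({ch})
            continue
        # Candidate replacements: accent-stripped form and digraph form.
        choices = {strip_accents(ch), DIGRAPHS.get(ch, "")}
        # Discard empty results and results that are still non-ASCII.
        choices = {c for c in choices if c and c.isascii()}
        # If nothing maps to ASCII, keep the original character as-is.
        options.append(choices or {ch})
    # Cartesian product: one spelling per combination of per-character choices.
    return sorted("".join(combo) for combo in itertools.product(*options))
```

Feeding each line of the cewl output through `ascii_variants` and writing out every returned spelling gives a wordlist with both forms. Note that the number of variants doubles for every character that has two replacements, so words with many umlauts produce several entries.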
What are some other good tools & strategies for generating custom wordlists from languages which don't use just a subset of the English alphabet?