How can MS-DOS and other text mode programs display double-width CJK characters?

9

1

I've seen many text mode BIOS setup screens in Japanese and Chinese. Recently I've even seen Windows XP setup in Japanese. MS-DOS had Japanese versions too. Real DOS mode, not Windows command prompt!

Japanese BIOS setup

Japanese MS-DOS 6.2

A typical text mode screen is 80x25 characters. Since each Japanese character is twice as wide as a normal Latin character, at most about 1000 Japanese characters can be on screen at the same time. So we need up to 2000 code points to display the left and right halves of those characters.

By default, text mode can only display 256 different characters, and the first 128 are used for ASCII, so the usable ones are limited to the high 128 code points. If needed we can expand the set to 512, but that still doesn't provide enough code points for the display. I've always wondered how they managed to display such a large character set with such a limited number of characters.
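The budget described in the question can be checked with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, and it assumes every double-width character needs its own left-half and right-half glyph, with 128 slots kept for ASCII.

```python
# Back-of-the-envelope check of the glyph budget described in the question.
COLS, ROWS = 80, 25          # standard text screen
CELLS = COLS * ROWS          # 2000 character cells


def max_wide_chars_on_screen(cells=CELLS):
    """Each double-width character occupies two adjacent cells."""
    return cells // 2


def max_distinct_wide_chars(font_slots, reserved_ascii=128):
    """Distinct double-width characters that fit in a font when every one
    needs its own left-half and right-half glyph (no sharing of halves)."""
    return (font_slots - reserved_ascii) // 2


print(max_wide_chars_on_screen())      # 1000
print(max_distinct_wide_chars(256))    # 64
print(max_distinct_wide_chars(512))    # 192
```

Even with the 512-glyph font, far fewer distinct wide characters fit than a full screen could show, which is the gap the question is asking about.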

Japanese XP installer

Text mode in Linux seems to use a graphics-mode driver, because it can display Unicode and has a lot more colors. But I can't explain how they do it in MS-DOS and BIOS setup screens.


EDIT: I've even found a Japanese text input for DOS

Japanese IME

There is Korean in text mode too!

Korean

VMWare Korean DOS

phuclv

Posted 2013-09-20T08:38:05.857

Reputation: 14 930

1

Note that the page you likely took the OS/2 installer screenshot from says right next to the screenshot that "the graphical text mode support was initialized almost immediately when booting OS/2". Key word graphical.

– a CVn – 2014-07-18T15:15:56.250

@MichaelKjörling it's not only OS/2 but MS-DOS and BIOS setup programs have this ability in text mode too – phuclv – 2014-07-18T20:23:04.047

You are probably not looking at Japanese "characters", i.e. kanji, but rather hiragana or katakana, which do have Unicode mappings. – sawdust – 2013-09-20T09:51:35.833

@sawdust: look at the picture above and you'll see that it can display not only all kana but also Kanji – phuclv – 2013-09-20T13:09:02.473

Answers

6

The normal "80x25 characters" mode is actually 720x350 pixels on the original monochrome adapter (meaning that each character cell is 9 pixels wide by 14 pixels high); VGA uses 720x400 with 9x16 cells. Double-width character modes ("40x25") can either simply stretch each cell by doubling every pixel column, which cuts the required amount of video content memory in half, or keep an identical amount of video content memory and use additional glyph memory to widen the character cells to 18x14 pixels.

Fairly early on (I think it was done when EGA was introduced), support for user-defined character glyphs was added to the IBM PC's text display mode.

The normal text mode of the IBM PC is simply 4000 sequential bytes of video content RAM at a particular address (80 x 25 cells, two bytes per cell). Each cell is read as one byte describing the character to be displayed, followed by one byte of character attributes (originally blinking, bold, underline etc.; later reused for foreground and background colors and blinking/highlight, hence the limitation to 16 colors in text mode). The actual glyph to be displayed for each character byte value is stored elsewhere.
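As a rough illustration of that layout, here is a sketch that models the 4000-byte buffer as a Python bytearray. The helper names are invented for this example; the layout assumed is the usual color text mode one, with the character code at the even offset of each cell and the attribute at the odd offset.

```python
COLS, ROWS = 80, 25
BYTES_PER_CELL = 2                      # character code + attribute


def new_screen():
    """Return a 4000-byte buffer filled with spaces, light grey on black."""
    buf = bytearray(COLS * ROWS * BYTES_PER_CELL)
    for i in range(0, len(buf), BYTES_PER_CELL):
        buf[i] = 0x20                   # character code: space
        buf[i + 1] = 0x07               # attribute: grey on black
    return buf


def make_attr(fg, bg, blink=False):
    """Pack a color attribute: bit 7 blink, bits 6-4 background, bits 3-0 foreground."""
    return (0x80 if blink else 0) | ((bg & 0x07) << 4) | (fg & 0x0F)


def put_char(buf, row, col, code, attr):
    """Store one cell; the adapter, not this code, looks up the glyph."""
    offset = (row * COLS + col) * BYTES_PER_CELL
    buf[offset] = code
    buf[offset + 1] = attr


screen = new_screen()
put_char(screen, 1, 2, 0x41, make_attr(fg=15, bg=1))   # bright white 'A' on blue
print(len(screen))                                      # 4000
```

Note that the buffer only ever stores 8-bit character codes; which shape appears for code 0x41 depends entirely on the glyph table the adapter is using.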

This means that as long as you can make do with 256 distinct glyphs on the screen at any one time, and each glyph can be represented as a 9x14 one-bit bitmap, you can simply replace the glyphs in memory to make the characters appear differently. In part, this was one portion of what mode con codepage select did on DOS. This is relatively trivial.

If you need more than 256 distinct glyphs but can live with the reduced number of glyphs on screen, you can go with a 40x25 scheme with double-width (18 pixels wide) glyphs. Assuming that the total amount of video content RAM is fixed and assuming that you can increase the glyph bitmap memory, you can move to using two bytes out of every four bytes to represent one on-screen glyph, giving you access to 2^16 = 65,536 different glyphs (including the blank glyph). If you feel daring, you could even skip the second attribute byte which gives you access to 2^24 ~ 16.7M different glyphs. Both of these approaches rely on special software support, but the hardware and firmware portion should be pretty easy to do. 65,536 glyphs at 18x14 one-bit pixels works out to about 2 MiB, a sizeable but not insurmountable amount of memory at the time. 256 glyphs at 18x14 one-bit pixels is about 8 KiB, which was absolutely reasonable even in the first half of the 1980s when EGA was developed and introduced.
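The memory figures in that paragraph can be reproduced with a short calculation. This is just a sketch of the arithmetic, assuming one bit per pixel and no padding between glyphs; the function name is illustrative.

```python
def glyph_table_bytes(glyph_count, width, height):
    """Bytes needed for a one-bit-per-pixel glyph table with no padding."""
    return glyph_count * width * height // 8


big = glyph_table_bytes(65536, 18, 14)
small = glyph_table_bytes(256, 18, 14)

print(big, big / 2**20)      # 2064384 bytes, about 1.97 MiB
print(small, small / 2**10)  # 8064 bytes, about 7.9 KiB
```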

Basic US English needs at least 62 dedicated glyphs (numbers 0-9, letters A-Z in upper and lower case) so you have something like 180-190 glyphs to play with if you also want to be able to display US English text at the same time and go with 8 bits per glyph. If you can live without simultaneous US English support, which you might choose to do in a resource-constrained environment such as the early IBM PC architecture, you have access to the full number of glyphs.

With some trickery you could probably mix and match the two schemes, too.

I don't know how it was actually done, but these are two viable schemes I can come up with, just sitting in front of Stack Exchange for a moment, for getting "fancy" alphabets with a limited character count onto a plain IBM PC screen in text mode. It's perfectly possible that there are additional graphics modes that make this easier in practice.

Also, keep in mind the distinction between text mode and graphical mode displaying text. If you are in graphical mode, perhaps through VESA which is pretty universally supported, you're on your own as far as drawing character glyphs go but you also have a lot more freedom in how to draw them. For example, I'm pretty sure the text-based parts of Windows NT (which is the product family Windows XP belongs to) use a graphical mode to display text, including the Windows NT 4.0 boot screen and BSODs.

a CVn

Posted 2013-09-20T08:38:05.857

Reputation: 26 553

You may see that there are normal width Latin characters beside double width Japanese/Korean ones so it can't be 40x25 double width mode. Therefore you can't combine 2 bytes of every 4 bytes to represent the glyph. Using bit 3 of the foreground color you can represent 512 glyphs at the same time but still not enough if the characters fill most of the screen https://en.wikipedia.org/wiki/VGA-compatible_text_mode#Fonts

– phuclv – 2013-09-20T14:20:12.680

@LưuVĩnhPhúc You could repossess the high bit, or use any number of possible other tricks to mix multibyte-requiring characters with singlebyte ones. I still think the answer is to recognize the statement made in the opening paragraph: even when showing characters, at some level you are still dealing with pixels, and those pixels can be worked with even though perhaps not directly. – a CVn – 2013-09-20T14:30:54.213

I know all about text mode versus graphical mode displaying text; I'm just confused about how they have enough code points for multibyte characters, as the left and right parts require 2 code points. But from what you said I've thought of another way of doing it. I think your answer is acceptable – phuclv – 2013-09-21T00:33:32.233

1

I found something in "VGA-compatible text mode" page in Wikipedia and also in some VGA programming books:

Both EGA and VGA text modes allow 512 glyphs on screen simultaneously, as 2 banks of 256 glyphs each. Attribute bit 3 (Foreground Color Intensity) can also select between bank A and bank B. By default both the A and B Font Registers point to the same address, giving you only 256 glyphs, so for this to work you have to set the Font Registers to the correct addresses.

Each bank has 8192 bytes, and each of the 256 glyphs in the bank occupies 32 bytes (8 pixels wide and up to 32 pixels tall). You can set the Scanline Count register to specify the actual height of your characters. VGA cards display 400 scanlines while EGA cards display 350, so to get 25 character rows they set the character height to 16 and 14 scanlines respectively. Also, on VGA each glyph can be 8 or 9 dots wide, but the 9th column is either blank or a repetition of the 8th column. All the glyphs in both banks can be user-defined.
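To make the bank layout concrete, here is a small sketch that models one bank as a bytearray and renders a glyph as ASCII art. The helper names are hypothetical, and it assumes the layout described above: 32 bytes reserved per glyph, one byte per scanline, most significant bit as the leftmost pixel.

```python
BANK_SIZE = 8192
BYTES_PER_GLYPH = 32          # 32 scanlines reserved, 8 pixels (1 byte) each


def glyph_offset(code):
    """Byte offset of a glyph inside its bank."""
    return code * BYTES_PER_GLYPH


def set_glyph(bank, code, rows):
    """Copy scanline bytes (at most 32) into the slot for `code`."""
    if len(rows) > BYTES_PER_GLYPH:
        raise ValueError("a glyph slot holds at most 32 scanlines")
    off = glyph_offset(code)
    bank[off:off + len(rows)] = bytes(rows)


def render_glyph(bank, code, height=16):
    """Return the first `height` scanlines as strings of '#' and '.'."""
    off = glyph_offset(code)
    return ["".join("#" if byte & (0x80 >> x) else "." for x in range(8))
            for byte in bank[off:off + height]]


bank = bytearray(BANK_SIZE)
set_glyph(bank, 0x80, [0xFF, 0x81] + [0x00] * 14)   # a made-up 16-line shape
for line in render_glyph(bank, 0x80)[:2]:
    print(line)
```

With a 16-scanline character height only the first 16 bytes of each 32-byte slot are displayed; the rest of the slot is unused.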

How can you get more than 256 different characters onscreen in some languages? In the examples above, each special foreign character is made of two glyphs (left and right), or more. You could set the first, say, 128 glyphs from bank A apart for ASCII text, and you still would have 128 glyphs from bank A + 256 glyphs from bank B = 384 glyphs for you to customize.

Also, you can combine different left- and right sides to make a huge character set! Let's say, for example, that from the 384 user-defined glyphs, you want to reserve 184 for left-sides and 200 for right-sides: you can have 184*200 = 36800 different characters! (sure, most of them would probably be invalid characters for that language, but still you can get a good number of valid combinations).
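The combination count is easy to verify, and the same few lines show which split of the 384 free glyphs gives the most combinations. This is only the arithmetic; as the answer notes, most combinations would not be valid characters. The function names are invented for this sketch.

```python
def combinations(left_glyphs, right_glyphs):
    """Number of left/right pairings that can be formed."""
    return left_glyphs * right_glyphs


def best_split(total):
    """Split `total` glyphs into (left, right) maximizing the pairings."""
    return max(((left, total - left) for left in range(total + 1)),
               key=lambda pair: pair[0] * pair[1])


print(combinations(184, 200))   # 36800
print(best_split(384))          # (192, 192), i.e. 36864 pairings
```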

In the Japanese language example above, the "ha" and "ba" characters share the left-side glyph. The same goes for the "si" and "zi" characters. The right sides of "ko" and "ni" are so similar that they could share the same right-side glyph. The same could be said about the "ru" and "ro" characters. With good design you could expand your character set very well. The right-side glyph of the "le" character appears in the top left of the screen (in gray), and in the vertical scrollbar the up and down buttons were also changed, meaning that at least part of bank A was also used to accommodate the new glyphs.

In conclusion, the BIOS string functions of the early PC era were not Unicode-aware, but they don't have to be. All you had to do was customize your 512 glyphs and set the correct EGA or VGA registers. For example, you could customize the "!@" "#$" "%^" "&*" "çé" "ñÑ" glyphs to your foreign characters (in bank A or B), and then make the BIOS print the "!@#$%^&*çéñÑ" string at once. The BIOS would not check the glyphs. You could also bypass the BIOS functions entirely, since you can write directly to video memory. To use a glyph from bank B, just set the character's Foreground Color attribute to a value between 8 and 15 (a bright color).
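A sketch of how a program might map a 9-bit glyph number onto a character byte plus attribute byte follows. It uses this answer's naming, where attribute bit 3 set means bank B; hardware documentation may label the two maps differently, and the helper names are hypothetical.

```python
BANK_SELECT_BIT = 0x08      # attribute bit 3, normally foreground intensity


def cell_for_glyph(glyph_index, color=0x07):
    """Map a glyph number 0..511 to (character byte, attribute byte).

    Glyphs 0..255 are taken from bank A (bit 3 clear) and glyphs 256..511
    from bank B (bit 3 set), following the naming used in this answer.
    """
    if not 0 <= glyph_index < 512:
        raise ValueError("only 512 glyphs are addressable")
    char_code = glyph_index & 0xFF
    attr = color & 0xFF & ~BANK_SELECT_BIT
    if glyph_index >= 256:
        attr |= BANK_SELECT_BIT
    return char_code, attr


def glyph_for_cell(char_code, attr):
    """Inverse mapping: which of the 512 glyphs a cell displays."""
    return char_code + (256 if attr & BANK_SELECT_BIT else 0)


def wide_char_cells(left_glyph, right_glyph, color=0x07):
    """A double-width character is just two adjacent cells."""
    return [cell_for_glyph(left_glyph, color),
            cell_for_glyph(right_glyph, color)]


print(cell_for_glyph(300))          # (44, 15)
print(wide_char_cells(130, 300))    # [(130, 7), (44, 15)]
```

One consequence of this scheme is that bit 3 can no longer be chosen freely as a color bit, so a program typically has to adjust the palette if both banks should appear in the same color.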

(sorry my bad english)

Fabiano Freitas

Posted 2013-09-20T08:38:05.857

Reputation: 11

I know that we can have 512 characters, as mentioned in the question. However, the thing is that the programs above are displaying real Kanji characters, not Kana, which significantly increases the number of distinct glyphs displayed at the same time. In systems with limited encoding, half-width Katakana will be used, which has separate maru and tenten, so the same code point can be used for both し and じ, or は and ば, with no need to share the left and right parts – phuclv – 2018-12-16T13:21:30.387

1

This is simplifying what @Michael Kjörling is saying.

In text mode, you have "screen memory" that has 1 byte per on-screen character, telling the adapter which character appears in each screen position. (There are also "attribute" bytes that tell the adapter the color and things like underline, blink, etc.)

The adapter uses this byte to index into another "character table" that has the small 8x12 or whatever bitmap of the character. DOS calls this character table a code page.

Starting with CGA, you can tell the adapter to get the character table at a specific place in the adapter's RAM. Each adapter has a character ROM that has the default "font" for that card (which is the standard IBM font), but you can tell the adapter to switch to a location in RAM and put your own images there.

As long as the software knows what's going on, the codes in screen memory that point to the images in the character table do not have to line up with any ASCII codes, though it's easier if they do. You'll notice there are screen memory codes (and character table shapes) for 1-31, which are unprintable ASCII characters, but by writing to screen memory directly (fond memories of DEF SEG = &HB800 : POKE 0,1 in GW-BASIC to change the uppermost character to a smiley come to mind) you can still display them.
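For readers who never used GW-BASIC, the POKE trick amounts to real-mode segment:offset arithmetic. The sketch below only simulates it with a Python dictionary standing in for memory; the function names are made up, and it assumes the color text buffer at segment B800h.

```python
def physical_address(segment, offset):
    """Real-mode address: segment * 16 + offset, wrapped to 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF


memory = {}                     # stand-in for the PC's address space


def poke(segment, offset, value):
    """Simulate DEF SEG = segment : POKE offset, value."""
    memory[physical_address(segment, offset)] = value & 0xFF


poke(0xB800, 0, 1)              # character code 1 in the top-left cell
print(hex(physical_address(0xB800, 0)))   # 0xb8000
```

Offset 0 is the character byte of the top-left cell, so storing 1 there shows whatever glyph the character table holds for code 1, which is the smiley in the standard IBM font.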

So displaying other languages is fine, if you can put the right images into the adapter's RAM and have the necessary software support.

LawrenceC

Posted 2013-09-20T08:38:05.857

Reputation: 63 487

Was it as early as CGA? I must be getting old. (To my defense, I did write that answer largely from memory, and haven't actually used those techniques even for fun in like forever.) – a CVn – 2013-09-20T17:17:56.347

I think you're right after looking into it, it was EGA. – LawrenceC – 2013-09-20T18:14:47.370

I know we can change the text font by changing the pointer; I learnt how to do it years ago. I just don't know how they can represent a double-byte character set, as 256 or 512 code points can't even hold the maximum number of different characters on screen, let alone the whole complex charset – phuclv – 2013-09-21T00:38:28.683

0

I did some research and, as I anticipated, you have to use graphics mode or need special hardware support, because there's no way to use more than 512 characters in VGA text mode:

Well, DOS itself cannot print in charsets beyond 1-byte-per-char, because it uses the BIOS functions which in turn use the VGA hardware which cannot have more than 2 x 256 chars sized fonts. So this again sounds like a job for a DRIVER, one which uses graphics mode to render extensive fonts. We already have support for Unicode fonts in a few graphical DOS text editors and similar (thanks :-)) and whether DBCS or UTF-8 is used, both share the "size of character can be one or more bytes" handling "anomaly".

Will there ever be any official support for the Japanese language in FreeDOS?

The Japanese version of DOS (DOS/V) uses the first approach and simulates text mode by rendering the characters in graphics mode using a special driver. The driver follows the IBM V-Text standard, which is a mechanism for extending DOS's text display capabilities. You can choose between various 16/24/32/48-dot fonts, as shown here:

DOS/V font

Some other text mode systems also use the same technique. In FreeDOS you can load a special driver for Japanese support:

FreeDOS Japanese driver

The renderer will intercept int 10h and int 21h calls and draw the text manually, so it'll work even for normal English programs. But it won't work for programs that write to VGA memory directly. For printing Japanese characters int 5h and int 17h are also hooked.
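A driver that draws text itself has to decide, byte by byte, whether it is looking at a single-width character or the first half of a double-byte one. The sketch below shows that decision only, assuming Shift-JIS-style lead byte ranges (0x81-0x9F and 0xE0-0xFC); it is not taken from any actual DOS/V driver, and the function names are hypothetical.

```python
def is_lead_byte(b):
    """True if `b` starts a double-byte character (Shift-JIS-style ranges)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def split_cells(data):
    """Split a byte string into (bytes, width_in_cells) pairs.

    Double-byte characters take two cells; everything else, including
    half-width katakana (0xA1-0xDF), takes one.
    """
    out = []
    i = 0
    while i < len(data):
        if is_lead_byte(data[i]) and i + 1 < len(data):
            out.append((data[i:i + 2], 2))
            i += 2
        else:
            out.append((data[i:i + 1], 1))
            i += 1
    return out


# 'A', then a double-byte hiragana (0x82 0xA0), then a half-width katakana (0xB1)
print(split_cells(b"A\x82\xa0\xb1"))
```

A renderer built on this would look up a 16-dot-wide bitmap for each two-cell item and an 8-dot-wide bitmap for each one-cell item, then draw them into graphics memory at the matching text position.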

According to the DOS/V manual, later IBM BIOSes also added support for V-Text through int 15h with the following 4 new functions:

5010H Video extension information acquisition
5011H Video extension function registration
5012H Video extension driver release
5013H Video extension driver lock setting

I suppose this is also the reason I saw Japanese support in my old PCs' BIOSes.

Nevertheless, the slowness of graphics mode may introduce glitches while scrolling, which needs special handling.

DOS/V was actually the first software solution for Japanese text mode:

Meanwhile, serious research had been going on at IBM Japan since the early 1980s to produce a software solution to the problem of displaying Japanese characters. With the advent of high-resolution VGA monitors, faster processors, and larger memories and hard drives, designers at IBM's Fujisawa and Yamato research laboratories realized that information about the shape and size of kanji characters could be stored on disk, loaded into extended memory, and displayed through graphics-mode VRAM. (The "V" in DOS/V, by the way, comes from the VGA monitor necessary to display the Japanese characters via software.)

DOS/V: The Soft(ware) Solution to Hard(ware) Problems

According to the same article, before the invention of DOS/V all other systems needed a Kanji ROM in hardware:

All of the brands of computers used hardware solutions to handle the display of Japanese characters, storing the data for all of the characters on special chips known as kanji ROMs. This method required the double-byte code for each character of keyboard input to be sent to the CPU, which in turn fetched the corresponding character from the kanji ROM and sent it to the screen via text-mode VRAM. The use of kanji ROM meant that the shape of each character was fixed, while the use of text-mode VRAM set a standard 16x16 dot size for each character.
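To get a feel for the size of such a kanji ROM, here is a rough sketch. It assumes 16x16 one-bit bitmaps and a ROM that simply stores the 94x94 grid of JIS X 0208 in row/cell order; that layout is hypothetical and chosen only for illustration, since real kanji ROMs were organized in vendor-specific ways.

```python
GLYPH_BYTES = 16 * 16 // 8      # 32 bytes per 16x16 one-bit bitmap
GRID = 94                       # JIS X 0208 is a 94 x 94 grid of positions


def rom_size_bytes():
    """ROM size if every grid position gets a bitmap (hypothetical layout)."""
    return GRID * GRID * GLYPH_BYTES


def rom_offset(row, cell):
    """Offset of a character given its 1-based row and cell numbers."""
    if not (1 <= row <= GRID and 1 <= cell <= GRID):
        raise ValueError("row and cell must be in 1..94")
    return ((row - 1) * GRID + (cell - 1)) * GLYPH_BYTES


print(rom_size_bytes())                  # 282752 bytes, about 276 KiB
print(rom_offset(1, 1), rom_offset(94, 94))
```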

For example, the IBM Personal System/55 uses a special graphics adapter with a Japanese font, so it gets real text mode:

In early 1980s, IBM Japan released two x86-based personal computer lines for Asian-pacific region, IBM 5550 and IBM JX. The 5550 read Kanji fonts from the disk, and drew text as graphic characters on 1024 x 768 high resolution monitor.

https://en.wikipedia.org/wiki/DOS/V#History

Similar to IBM 5550, the text mode was 1040x725 pixels (12x24 and 24x24 pixel font, 80x25 characters) in 8 colors, can display Japanese characters read from font ROM

The AX architecture uses a special JEGA adapter instead of the standard EGA

AX (Architecture eXtended) was a Japanese computing initiative starting in around 1986 to allow PCs to handle double-byte (DBCS) Japanese text via special hardware chips, whilst allowing compatibility with software written for foreign IBM PCs.

...

To display Kanji characters with sufficient clarity, AX machines had JEGA (ja) screens with a resolution of 640x480 rather than the 640x350 standard EGA resolution prevalent elsewhere at the time. Users could typically switch between Japanese and English modes by typing 'JP' and 'US', which would also invoke the AX-BIOS and an IME enabling the input of Japanese characters.

Later versions also added AX-VGA/H, a hardware implementation, and AX-VGA/S, a software emulation on VGA:

However, soon after the release of the AX, IBM released the VGA standard with which AX was obviously not compatible (they were not the only one promoting non-standard "super EGA" extensions). Consequently, the AX consortium had to design a compatible AX-VGA (ja). AX-VGA/H was a hardware implementation with AX-BIOS, whereas AX-VGA/S was a software emulation.

Due to less available software and other problems, AX failed and was not able to break the PC-9801 dominance in Japan. In 1990, IBM Japan unveiled DOS/V which enabled IBM PC/AT and its clones to display Japanese text without any additional hardware using a standard VGA card. Soon after, AX disappeared and the decline of NEC PC-9801 began.

The NEC PC-98 series also has a character ROM in the display controller:

A standard PC-98 has two µPD7220 display controllers (a master and a slave) with 12 KB main memory and 256 KB of video RAM respectively. The master display controller handles font ROM, displaying JIS X 0201 (7x13 pixels) and JIS X 0208 (15x16 pixels) characters

I don't know the situation for Chinese and Korean but I think the same techniques are used. I'm not sure if there are any other ways to achieve that or not

phuclv

Posted 2013-09-20T08:38:05.857

Reputation: 14 930

-1

You need a graphics mode instead of a hard-coded text mode so that Unicode text glyphs can be displayed. Then you set MS-DOS to use a Unicode font and change the language mapping to use that.

http://www.mobilefish.com/tutorials/windows/windows_quickguide_dos_unicode.html

headkase

Posted 2013-09-20T08:38:05.857

Reputation: 1 690

The title in the article is completely wrong and misleading. cmd.exe is not DOS despite having a terminal interface resembling DOS and a few similar commands. Are the Command Prompt and MS-DOS the same thing?

– phuclv – 2019-10-25T03:09:54.673

No, look at the images I posted, it's real DOS mode, not the Command Prompt in Windows – phuclv – 2013-09-20T13:26:54.090