It's all a bit more complicated than the simple answers given so far.
There are two aspects: the machine, and mass storage.
On the Machine:
It depends on the hardware architecture.
On a PC, addressing is by byte, and you can access a byte (8 bits), a word (16 bits), a double word (32 bits), and a quadword (64 bits).
On other architectures you might only have access to some other sized "blob" for the machine data type. For example, on the TMS320C40 you can access 32 bit words, and 8 bit bytes are packed into these words. You can pack the bytes in and out, but it's quite a slow process requiring several machine instructions.
So on that TMS320C40 the C compiler has a native char type that is 32 bits!
(when programming in C, never ASSUME that a char is 8 bits. Read your compiler manual, especially if doing embedded programming).
Things get even more complicated when endian-ness comes into play. There are two common arrangements, little endian and big endian, which describe how bytes are arranged to fit into a larger quantity (normally that machine's native word size). So for example, on a 32 bit machine you might find the bytes arranged like this:
Address X: Byte 0, Byte 1, Byte 2, Byte 3
Address X+4: Byte 4, Byte 5, Byte 6, Byte 7
OR
Address X: Byte 3, Byte 2, Byte 1, Byte 0
Address X+4: Byte 7, Byte 6, Byte 5, Byte 4
(And it gets even more complex because the bits in a byte have endian-ness as well.)
MOSTLY this kind of thing only comes up as a worry for the hardware designers. But if you have to write device drivers and other things that talk to hardware through memory mapped registers, it becomes a big deal.
A simple example can suffice:
Dumping a block of memory at address X might present a stream of bytes:
01 02 03 04 05 06 07 08
BUT dumping that same block from the same address and presenting as 16 bit (hex) integers might present as:
0201 0403 0605 0807
And dumping again from the same address as 32 bit integers in hex might present as:
04030201 08070605
This causes vast amounts of confusion to the uninitiated, because it all depends on the endian-ness, and the method (byte order) used to make bigger quantities out of smaller ones.
Generally high level languages hide this level of gruesomeness, but it can be important for things like overlay data structures, and, again, memory mapped device control registers.
Mass Storage.
Fortunately here, life gets easier.
Just think of your mass storage as a great big bunch of bytes that can be accessed, and the machine will magically take care of it all. The common approach is to think of files as a "stream", where you start at the start and the stream comes rolling in. (This conveniently ignores random access.) The smallest part you can break the file's stream into is a byte.
If a machine wants to store larger quantities (16 bit words, etc), then it may or may not do some level of transform to get that into the bytes that go to the storage.
Caveats.
All of the above is in relation to underlying low level stuff - bytes, words, and so on.
Programs make use of this in all kinds of ways. So for example you will get CHARACTERS represented by bytes if they fit happily into plain ASCII (or even EBCDIC for those with long memories). The modern Unicode character systems may use Wide Characters (generally these are 16 bits), but there are many encoding systems for Unicode. The Wikipedia page on Unicode is pretty instructive.
The convention in C of assuming CHARACTER = BYTE is, these days, misleading and misguided. It's best to think of "char" as a synonym for "byte" - unless your machine / compiler tells you otherwise (see above). GOOD C programs generally define a set of preferred types such as "UINT8" - unsigned 8 bit integer, "SINT8" - signed 8 bit integer, and so on, so that the program becomes as independent as is sensibly possible from the peculiarities of the specific compiler and underlying hardware.
To the specific question: How are characters stored? The answer is: it depends. Frequently, ASCII characters that fit in a byte are stored as a byte. Wide characters are frequently stored as 16 bit words. But Unicode might implement wide characters or any number of other coding systems, in which case characters might occupy anywhere from 1 to about 4 bytes, depending on the character.
This does not answer your question, but based on some of the stuff you are asking you (and several of the people who have provided you answers) need to read this: http://www.joelonsoftware.com/articles/Unicode.html
– ubiquibacon – 2010-10-21T07:24:56.873