9

Let's say an MLC SSD has lasted a very long time, and its first cell has hit its last erase cycle and refuses to erase.

What happens after that? Does the controller detect that as a bad block, move on to the next one, and try to erase that instead? Would the total capacity of the drive just slowly decrease over time?

EDIT

And of course we can forget about wear leveling. Yes, it extends the life of a drive, but I am not talking about that. Eventually some cell will hit its last erase cycle.

Pyrolistical

3 Answers

8

The NAND flash chips have some built-in mechanisms to detect failures on write and erase operations, and will alert the controller if one fails. In that case, the controller can either try again, or treat the block as bad and map it out of its wear-leveling algorithm.

Each page in the NAND device also has a spare area alongside the main data area, intended for metadata such as ECC and other forms of fault detection and tolerance. The controller can decide on its own fault-tolerance scheme using the spare area: Hamming codes are one common choice, though there are several others, including simple parity bits and Reed-Solomon codes. If things don't match up on a read operation, again, the controller is free to do as it pleases. Ideally, it would also map these blocks out of the wear-leveling algorithm.

So you would just lose capacity little by little until "too many" blocks fail, where "too many" depends on the algorithms and hardware structure sizes within the controller. Many first-cut controller designs simply declare an error to the operating system at that point.
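To make the ECC side of this concrete, here is a toy (7,4) Hamming code in Python of the sort that could live in a page's spare area. Real controllers use much stronger codes (e.g. BCH or LDPC over entire pages), so treat this purely as an illustration of detect-and-correct on read:

    def hamming74_encode(d):
        """Encode 4 data bits d1..d4 (each 0 or 1) into a 7-bit codeword."""
        d1, d2, d3, d4 = d
        p1 = d1 ^ d2 ^ d4                # parity over positions 1,3,5,7
        p2 = d1 ^ d3 ^ d4                # parity over positions 2,3,6,7
        p3 = d2 ^ d3 ^ d4                # parity over positions 4,5,6,7
        return [p1, p2, d1, p3, d2, d3, d4]

    def hamming74_decode(code):
        """Correct up to one flipped bit; return (data, flipped_position)."""
        c = list(code)
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
        syndrome = s1 + 2 * s2 + 4 * s3  # 0 = clean, else 1-based error position
        if syndrome:
            c[syndrome - 1] ^= 1         # repair the single-bit error
        return [c[2], c[4], c[5], c[6]], syndrome

    # Simulate one worn cell flipping on read:
    data = [1, 0, 1, 1]
    stored = hamming74_encode(data)
    stored[5] ^= 1                       # bit at position 6 flips
    recovered, pos = hamming74_decode(stored)
    assert recovered == data and pos == 6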

Note that this is not an MLC-specific issue. Though MLC cells may be more prone to read errors, since there is necessarily a smaller margin for error, SLC cells fail through mostly the same mechanisms and can be handled by the controller in the same way.
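And the retire-and-remap path itself amounts to something like the following sketch (all names hypothetical; no real controller firmware is this simple): retry the erase once, then map a spare block in.

    class EraseError(Exception):
        """Raised by the (simulated) NAND chip when an erase fails to verify."""

    class FlashController:
        def __init__(self, nand, spare_blocks):
            self.nand = nand                  # exposes erase(physical_block)
            self.spares = list(spare_blocks)  # physical blocks held in reserve
            self.bad_blocks = set()
            self.remap = {}                   # logical block -> physical block

        def erase(self, logical_block):
            phys = self.remap.get(logical_block, logical_block)
            for _ in range(2):                # one retry before giving up
                try:
                    self.nand.erase(phys)
                    return
                except EraseError:
                    continue
            # Block is worn out: retire it and map a spare into its place.
            self.bad_blocks.add(phys)
            if not self.spares:
                # No reserves left -- the error finally surfaces to the host,
                # which is the point at which the drive is effectively dead.
                raise EraseError(f"block {phys} failed with no spares left")
            replacement = self.spares.pop()
            self.remap[logical_block] = replacement
            self.nand.erase(replacement)

The capacity loss the question asks about corresponds to that spares pool draining; many real drives over-provision precisely so the user-visible capacity stays constant until the reserve is exhausted.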

Matt J
2

Just like with hard disks, it's up to the implementation in the operating system. The controller would simply report that the write failed (an erase is actually a kind of write operation), and it's up to the device driver in the operating system kernel to decide what to do. From what I've seen so far, the Microsoft and Linux implementations simply return the error code to the calling application, so it produces an I/O error.

In short: You simply get a "broken" device at some point.
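For what that looks like from the application side: on Linux, such a failure typically surfaces as EIO. A minimal sketch (the device path is a placeholder, and the exact behavior depends on the OS and driver):

    import errno

    try:
        with open("/dev/sdX", "r+b") as dev:  # placeholder device node
            dev.write(b"\x00" * 4096)
            dev.flush()
    except OSError as exc:
        if exc.errno == errno.EIO:
            print("device reported an I/O error (EIO)")
        else:
            raise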

Milan Babuškov
  • Well, that sucks. Not a very good abstraction then... – Pyrolistical Jun 09 '09 at 18:31
  • 1
    And wrong. Primarily this is handled in the SSD itself - not the device driver. Because this is normal operations. Wear leveling will record the sector as failed and remap the sector. – TomTom Jan 08 '14 at 08:05
1

SSDs use something called "wear leveling": the drive keeps statistics on sector usage, and at some point, or when it detects problems, it moves the data to a reserve sector, much like the bad-sector remapping that happens in regular hard drives.
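A toy illustration of that bookkeeping (invented names, not any vendor's actual algorithm): writes are steered to the least-erased free block, and a block whose erase count crosses a threshold is retired.

    import heapq

    class WearLeveler:
        def __init__(self, num_blocks, retire_threshold=3000):
            self.erase_counts = [0] * num_blocks
            self.retire_threshold = retire_threshold
            # min-heap of (erase_count, block): least-worn block on top
            self.free = [(0, b) for b in range(num_blocks)]
            heapq.heapify(self.free)
            self.retired = set()

        def allocate(self):
            """Hand out the free block with the fewest erases so far."""
            while self.free:
                count, block = heapq.heappop(self.free)
                if block not in self.retired:
                    return block
            raise RuntimeError("no usable blocks left")

        def release(self, block):
            """Erase a block and return it to the pool, or retire it."""
            self.erase_counts[block] += 1
            if self.erase_counts[block] >= self.retire_threshold:
                self.retired.add(block)  # worn out: a reserve swap would go here
            else:
                heapq.heappush(self.free, (self.erase_counts[block], block))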

Sven