Badblocks reports "weird value (4294967295) in do_write" when writing patterns


This is the second time I've received this error running badblocks, roughly two years after the first occurrence. Since then, the vast majority of factors, from hardware (cables, etc.) to software (including the operating system installation itself), have changed; the only relevant common factors are Cygwin and the badblocks program itself, making it highly likely that the issue is between those.


When running badblocks in destructive mode (i.e. with the -w switch), I get the error:

Weird value (4294967295) in do_write

...at each stage of writing the patterns to the drive. (In the transcripts below the message overwrites badblocks' progress display, which is why it appears as "do_writerrors)".)

As far as I can tell, I only seem to get this error when the command is given a last-block argument equal to the total sector count reported by fdisk -l:

$ fdisk -l /dev/sda
Disk /dev/sda: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

$ badblocks -b 512 -vws /dev/sda 1953525168 1953525168
Checking for bad blocks in read-write mode
From block 1953525168 to 1953525168
Testing with pattern 0xaa: Weird value (4294967295) in do_writerrors)
done
Reading and comparing: 1953525168ne, 0:00 elapsed. (0/0/0 errors)
done
Testing with pattern 0x55: Weird value (4294967295) in do_writerrors)
done
Reading and comparing: done
Testing with pattern 0xff: Weird value (4294967295) in do_writerrors)
done
Reading and comparing: done
Testing with pattern 0x00: Weird value (4294967295) in do_writerrors)
done
Reading and comparing: done
Pass completed, 1 bad blocks found. (1/0/0 errors)

$ badblocks -b 512 -vws /dev/sda 1953525168 1950000000
Checking for bad blocks in read-write mode
From block 1950000000 to 1953525168
Testing with pattern 0xaa: Weird value (4294967295) in do_writerrors)
done
Reading and comparing: 1953525168ne, 0:49 elapsed. (0/0/0 errors)
done
Testing with pattern 0x55: Weird value (4294967295) in do_writerrors)
done
Reading and comparing: done
Testing with pattern 0xff: Weird value (4294967295) in do_writerrors)
done
Reading and comparing: done
Testing with pattern 0x00: Weird value (4294967295) in do_writerrors)
done
Reading and comparing: done
Pass completed, 1 bad blocks found. (1/0/0 errors)

As can be seen, this also results in a false positive: badblocks reports one bad block, yet no such bad sector shows up in CrystalDiskInfo:

[screenshot: CrystalDiskInfo SMART report for the drive]

At this point the drive has been zeroed multiple times and has had badblocks write to its last few blocks tens of times, so there's been plenty of opportunity for the SMART values to pick up a bad sector at block 1953525168 if one existed.

What do these errors actually mean, and what could be causing them?

Hashim

Posted 2018-04-03T01:18:41.273

Reputation: 6 967

I believe 4294967295 is the partition size limitation (FF FF FF FF) on Windows. On Windows there is a 2 TB limit, expressed as a number of sectors; see https://blogs.technet.microsoft.com/askcore/2010/02/18/understanding-the-2-tb-limit-in-windows-storage/ Given these values, you could try writing 0s to the disk (the current Windows formatting system does this) and remapping, but I recommend backing it up before you try anything. C6 indicates there is a problem with the disk surface, or perhaps a mechanical issue. This disk does not seem recoverable.

– Epoxy – 2018-04-03T01:46:39.070
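For illustration, the arithmetic behind the limit Epoxy refers to, assuming a 32-bit sector count (FF FF FF FF) and 512-byte sectors, works out to just under 2 TiB:

#include <stdio.h>

int main(void) {
    /* FF FF FF FF: the largest sector count a 32-bit field can hold */
    unsigned long long max_sectors = 0xFFFFFFFFULL;
    unsigned long long bytes = max_sectors * 512;   /* 512-byte sectors */

    /* prints: 4294967295 sectors * 512 = 2199023255040 bytes */
    printf("%llu sectors * 512 = %llu bytes\n", max_sectors, bytes);
    return 0;
}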

When looking at the power on hours, it does not seem old (probably less than 2 years) and in that case, you may still have the warranty as well. – Epoxy – 2018-04-03T01:46:41.800

@Epoxy Doesn't that partition size limitation apply only to NTFS? It's not NTFS, but a proprietary filesystem, and I've had no previous issues with using it or using the machine I'm currently using to run badblocks on it. – Hashim – 2018-04-03T01:52:57.277

@Epoxy The drive itself is a DiamondMax 10, released in 2005, the low power on hours are due to the fact that the machine it's in simply isn't used that often. – Hashim – 2018-04-03T01:54:14.873

@Epoxy Also, what do you mean by "remap" here? – Hashim – 2018-04-03T01:58:29.603

If the diagnostic utility is compatible with the file system, it will tell you something different. If there is a firmware bug, and/or if the software is unable to determine the file system, there is a chance that it reports false statistics. I have seen it happen sometimes (rarely). The proprietary file system may also have certain disk standards and limitations, similar to the above, that I am not aware of. However, with these stats it is still difficult to determine what caused it. – Epoxy – 2018-04-03T02:23:52.383

If a read error occurs, it might mark the sector as bad, but other things can cause read errors as well: it could be the moving heads or circuitry issues. You could test further with something like Maxtor PowerMax, which will help you get more valid results. (Remap: mapping workable sectors and marking the bad blocks to prevent further use.) – Epoxy – 2018-04-03T02:24:48.293

Recently ran into this error again on a completely different drive, this time a SATA one. Circumstances are completely different except that I'm using badblocks over Cygwin again, so I'm pretty sure this is a problem with badblocks at this point - maybe the Cygwin repo version of it since it seems to be such a rare error. Adding a bounty to see if someone can figure it out. – Hashim – 2019-11-28T22:38:25.770

I suggest you change the cable between the board and the disk before you keep testing the disk. (or test the disk on another computer) – Eduardo Trápani – 2019-11-29T12:10:04.897

I thought this was clear enough from my last comment and from the note on the bounty itself, but I have now drastically overhauled the question to make it clear that this is not an issue with any one particular drive, nor is it very likely to be a hardware issue. I am only looking to award the bounty to answers that actually address the larger question here: what the error is, what causes it, and how it can be fixed. – Hashim – 2019-12-01T22:12:18.793

Have you tried to boot a live Linux distro and run badblocks on it? – dirdi – 2019-12-01T23:23:41.867

I'll try that as soon as I can turn off my machine, but I'm pretty confident I won't be able to reproduce it, as this problem seems limited to the Cygwin repo's version of badblocks based on what I've seen of others running the same commands on their machines. Will report back once I have. – Hashim – 2019-12-01T23:27:14.477

For what it's worth, you should be testing with 1953525167. 1953525168 is "last block + 1" in your case. – Tom Yan – 2019-12-02T10:47:20.650

Answers

Although harrymc may have given you the core of my answer (that 4294967295 is -1 as an unsigned int), he didn't explain further why badblocks doesn't simply "recognize" it as -1 (i.e. why you get the "weird value" error with a Cygwin build of it on Windows).

I took a look into the code of badblocks and Cygwin:

https://github.com/tytso/e2fsprogs/blob/v1.45.4/misc/badblocks.c#L463

https://github.com/cygwin/cygwin/tree/01c253a4c58b6c1da01615431bdc4c88fcba48ea/newlib/libc/syscalls/syswrite.c

https://github.com/cygwin/cygwin/tree/01c253a4c58b6c1da01615431bdc4c88fcba48ea/newlib/libc/reent/writer.c

And I came up with this:

[tom@archlinux ~]$ cat test.c 
#include <stdio.h>

/* A stand-in for a function that stores -1 in an unsigned return value,
   the way Cygwin's _write() apparently does. */
unsigned int eh() {
  return -1;
}

int main() {
  long got;

  got = eh();           /* widened as unsigned: 4294967295 */
  printf("%ld\n", got);

  got = (long) eh();    /* cast to the wider type: still 4294967295 */
  printf("%ld\n", got);

  got = (int) eh();     /* cast to a signed type of its own size: -1 */
  printf("%ld\n", got);
}
[tom@archlinux ~]$ cc test.c 
[tom@archlinux ~]$ ./a.out 
4294967295
4294967295
-1
[tom@archlinux ~]$ 

Basically this is saying: if you want to interpret an unsigned variable (one that may intentionally be used to store a signed value) as signed, you should interpret it at its own size, not at the size of another variable that you are about to put its value into.

I am not exactly familiar with programming, but as you can see, the (_ssize_t) type cast in reent/writer.c is probably wrong. If we assume _write() returns an int (or any signed type), the cast is redundant; if we assume _write() returns an unsigned int, then the cast it needs is (int). (For the record, a cast is needed only because we are "expanding" the value into a _ssize_t (i.e. ret); a comparison like (an_unsigned_int == -1) would work just fine, AFAIK — see the sketch below.)
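A minimal sketch of that point (my own example, not Cygwin code): comparing the unsigned value against -1 directly works, because the -1 is converted to the same unsigned type; it is the widening into a larger signed type that loses the -1:

#include <stdio.h>

int main(void) {
    unsigned int ret = -1;   /* 4294967295, as a failing _write() might return */

    /* the -1 here is converted to unsigned int, so this compares equal */
    if (ret == -1)
        printf("direct comparison sees -1\n");

    long widened = ret;      /* zero-extended on a 64-bit platform */
    printf("widened first:   %ld\n", widened);   /* 4294967295 */

    long fixed = (int) ret;  /* cast to a signed type of its own size first */
    printf("with (int) cast: %ld\n", fixed);     /* -1 */
    return 0;
}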

Though I have to say this is merely my guess, as I don't really know about the _write() Cygwin uses (like, whether it has anything to do with this, and if so, whether the documentation is just crap). But I think it is a valid case for a bug report, which might help you find out more.

Update: This could be the commit that introduced the "regression": as you can see, _ssize_t would be based on __SIZE_TYPE__ (which is essentially size_t, according to the commit message), and would likely end up being unsigned long when Cygwin is 64-bit, based on this and this. So I'm betting you won't be able to reproduce the problem with 32-bit Cygwin (even on 64-bit Windows, that is). It might be worth mentioning that an even earlier commit probably once "fixed" it; that's why I call it a "regression".
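A sketch of why the width of _ssize_t matters (fixed-width stand-ins for the 32-bit and 64-bit Cygwin cases, not the actual typedefs):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t raw = (uint32_t) -1;    /* the all-ones value _write() hands back */

    int32_t narrow = (int32_t) raw;  /* 32-bit _ssize_t: all-ones reads as -1 */
    int64_t wide   = raw;            /* 64-bit _ssize_t: zero-extended instead */

    printf("32-bit _ssize_t sees: %d\n",   (int) narrow);      /* -1 */
    printf("64-bit _ssize_t sees: %lld\n", (long long) wide);  /* 4294967295 */
    return 0;
}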

Update 2: and yes, I'm right: [screenshot: the error does not reproduce under 32-bit Cygwin]. Perhaps now I should get Visual Studio and check _write() (and maybe write()) for a bit...

P.S. You shouldn't bump into the "weird value" error if you are doing a read-only test on "last block + 1" as _read() would return 0, unlike _write() which would return -1 and set errno to ENOSPC, when it "tries to read at end of file" (the drive).
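That asymmetry is easy to demonstrate on Linux without touching a real drive; a sketch using /dev/null and /dev/full as stand-ins for "reading at the end" and "writing with no space left":

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    char buf[512] = {0};

    /* read() with nothing left to read returns 0 */
    int rfd = open("/dev/null", O_RDONLY);
    if (rfd < 0) { perror("open /dev/null"); return 1; }
    printf("read returned %zd\n", read(rfd, buf, sizeof buf));  /* 0 */
    close(rfd);

    /* write() with no space left returns -1 and sets errno to ENOSPC */
    int wfd = open("/dev/full", O_WRONLY);
    if (wfd < 0) { perror("open /dev/full"); return 1; }
    ssize_t w = write(wfd, buf, sizeof buf);
    printf("write returned %zd (%s)\n", w, strerror(errno));    /* -1, ENOSPC */
    close(wfd);
    return 0;
}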

Tom Yan

Posted 2018-04-03T01:18:41.273

Reputation: 4 744

It looks like the _write() used in (64-bit) Cygwin can actually return different "-1"s. It returns 4294967295 when I am writing "at the end of the drive", but 18446744073709551615 when I am writing to a non-existent drive (file descriptor -1). I wonder if Windows is the one to blame for that. Fortunately, switching to an (int) cast will still sort things out either way. – Tom Yan – 2019-12-02T18:43:38.100
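Those two numbers are just the all-ones bit pattern at two different widths; a two-line check:

#include <stdio.h>

int main(void) {
    printf("%u\n",   (unsigned int) -1);        /* 4294967295 */
    printf("%llu\n", (unsigned long long) -1);  /* 18446744073709551615 */
    return 0;
}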

This is exactly the kind of answer I was looking for, thank you very much for your time and work on this. You should really report the bug over to the people at Cygwin yourself, you would be in a much better position than me to further expand on the issue and provide any confirmation they need. – Hashim – 2019-12-02T19:28:04.667

The decimal value 4294967295, in hex FFFFFFFF, is simply -1 depicted as an unsigned 32-bit integer. This is a common API error code and has no other meaning. The badblocks utility is very basic, written decades ago by Linus Torvalds, and only writes out data and reads it back.
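That claim is easy to verify (a minimal check, nothing specific to badblocks):

#include <stdio.h>

int main(void) {
    unsigned int v = -1;          /* -1 stored in 32 unsigned bits */
    printf("%u = 0x%X\n", v, v);  /* 4294967295 = 0xFFFFFFFF */
    return 0;
}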

Uncorrectable Sector Count denotes the number of bad sectors that the disk firmware has detected but has not been able to relocate to good sectors, because these sectors could not be read; the firmware has given up on trying to relocate them.

So there are 459 uncorrectable sectors that the firmware has detected but is not able to remap.

The disk is undoubtedly in a terminal phase.

If you wish to salvage the disk and don't care about its contents, you could try a deep format, to rewrite and renew all the good sectors while marking as bad the sectors that the firmware cannot touch. A utility from the manufacturer is preferable here. Cygwin is to be avoided, as its Linux utilities are not guaranteed to integrate well with Windows.

The DiamondMax Support page suggests the fairly recent disk utility DiscWizard (version 23.0.17160), which may be able to do the deep format. This is a Windows utility.

If the disk in question is the Windows system disk, you might need to execute the utility from a Windows PE boot disk or a rescue disk such as Bob.Omb's Modified Win10PEx64. You might also use a bootable Windows PE-based recovery disc such as Hiren's BootCD PE. In a pinch, you could try to format the disk from a Linux live boot.


(Addition for the rewritten post)

The above answer was apparently accepted by the poster two years before it was written, and that disk was replaced. This part is about the new disk.

The new disk is in perfect shape and with zero defects, yet badblocks gives one error message.

Badblocks is an ancient utility, written by Linus Torvalds perhaps even before Linux existed. All it does is create a temporary file, write to it until end-of-space is encountered, then re-read the data. As a disk test it is abysmal, since it only "tests" the free space on the disk.

In addition, it is being run under Cygwin rather than natively on Windows, so its understanding of the error codes Windows returns is extremely doubtful. It cannot even report the real error code, instead always reporting -1. There is no way to know the result of Cygwin trying to translate a Windows API error code into what it imagines is the equivalent Linux error code.

Quite frankly, I would ignore this one spurious error as meaningless; it probably just comes from the "no-more-space" return code being misunderstood by either badblocks or Cygwin. The data returned by the SMART firmware is much more to the point.

In the post Equivalent of badblocks on Windows or DOS, several suggestions were offered, all of them much better than badblocks, as they test the entire disk and not only the free space.

One good alternative is chkdsk /r, which uses the Windows utility chkdsk to locate bad sectors and recover readable information, analyzing physical disk errors on the entire disk.

harrymc

Posted 2018-04-03T01:18:41.273

Reputation: 306 093

I mostly agree with your answer, but you are wrong about the pending sector count in saying "there are 253 uncoverable sectors that the firmware has detected but is not able to remap". The Current and Worst values are relative, typically on a scale of 0-100 or 0-255 where higher is better; they can go either up or down. Generally the Raw value, converted from hex to decimal, gives the actual value, in this case 1. The value is low because it has already reallocated sectors. The Uncorrectable Sector Count attribute is the real problem here (459). – James P – 2019-11-29T10:35:17.817

@JamesP: Absolutely right, I was using the wrong number. Corrected and thanks. – harrymc – 2019-11-29T10:41:32.273

The first paragraph of this answer starts off answering the question, but the rest is limited to the particular hardware I was using two years ago. I had thought this was clear enough from my description for the bounty and from my comment below the question when I set the bounty, but I've now drastically edited the question to make it as clear as I can. I am looking for an answer that explains what this persistent error is, what causes it, and how it can be fixed. – Hashim – 2019-12-01T22:42:24.677