2

This might be a bit esoteric, but I've got an odd sort of a problem with stat().

Specifically - on a Linux system, with Isilon storage NFS mounted. I have a process that is getting time differences when running fstat on the same file.

I can reproduce this with a bit of C, which does nothing more complicated than open-write-stat-close; open-stat-close

And these two numbers will - occasionally - differ by a few milliseconds - a file might appear 'older' than it actually is.

I'm unclear if the 'stat' is reading the time from the host, or from the server. And I'm also not clear if the problem is with the Isilon (although it doesn't seem to happen on NetApp storage).

Can anyone give me a bit of insight into what might be going on?

As an example:

time mismatch: 1468936451.862865611 != 1468936451.860183107

The first is the time read when the file handle is open. The second is the time read after closing (and fsync).

As you can see - the file is a bit older 'on the server' than on the client.

The C to test this is:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>

int main ()
{
  while (1)
  {
    struct stat sb[2];

    /*
     * Open file
     */
    int fd = open ("test.dat", O_CREAT | O_RDWR | O_APPEND, 0644);
    if (fd < 0)
    {
      perror ("failed to open file (1)");
      return EXIT_FAILURE;
    }

    /*
     * Write to it
     */
    write (fd, "", 1);

    /*
     * Ensure it's sync'd to disk
     */
    if (fsync (fd) < 0)
    {
      perror ("failed to fsync file");
      return EXIT_FAILURE;
    }

    /*
     * Stat the file
     */
    if (fstat (fd, &(sb[0])) < 0)
    {
      perror ("failed to fstat file (1)");
      return EXIT_FAILURE;
    }

    /*
     * Close it
     */
    close (fd);

    /*
     * Open file
     */
    fd = open ("test.dat", O_RDONLY);
    if (fd < 0)
    {
      perror ("failed to open file (2)");
      return EXIT_FAILURE;
    }

    /*
     * Stat it again
     */
    if (fstat (fd, &(sb[1])) < 0)
    {
      perror ("failed to fstat file (2)");
      return EXIT_FAILURE;
    }

    /*
     * Close it
     */
    close (fd);

    /*
     * Check the times match
     */
    if (sb[0].st_mtime != sb[1].st_mtime ||
        sb[0].st_mtim.tv_nsec !=sb[1].st_mtim.tv_nsec)
    {
      printf ("time mismatch: %d.%09ld != %d.%09ld\n",
              (int)(sb[0].st_mtime), sb[0].st_mtim.tv_nsec,
              (int)(sb[1].st_mtime), sb[1].st_mtim.tv_nsec);
    }
  }
}

And you can trigger it by running this in an NFS mount, then calling 'stat' (shell command is fine) from another system. E.g. on the isilon itself:

while true
do 
    stat test.dat
done

This gets me errors like:

time mismatch: 1468936613.313045037 != 1468936613.310174576
time mismatch: 1468936613.547768543 != 1468936613.544878047
time mismatch: 1468936615.228495345 != 1468936615.225862947
time mismatch: 1468936619.036053897 != 1468936619.033362349
time mismatch: 1468936619.988151772 != 1468936619.985529620
time mismatch: 1468936664.541226541 != 1468936664.538549678
time mismatch: 1468936666.165674866 != 1468936666.163171366
time mismatch: 1468936666.787528027 != 1468936666.784797051
time mismatch: 1468936711.334376729 != 1468936711.331868927

And ... I don't think it should.

Edit:

The plot thickens - when running the above and capturing packets, the response from the NFS GETATTR call from the server is inconsistent. I get two different timestamps.

Mount options don't seem to make a difference - but we have tried setting noac and actimeo=0, sync lookupcache=none etc.

Our isilon nodes are all ntp-synced to the same sources. As usual, there's a small amount of drift between them.

We have found that we can apply a workaround in the form of isi export modify --zone <zoneid> <exportnumber> --time-delta 1

This sets the resolution of the timestamp to 1s rather than 1ns. However whilst it diminishes the frequency of the problem occuring, it will still happen - just less frequently, because it only occurs at a '1s' boundary.

Sobrique
  • 3,697
  • 2
  • 14
  • 34
  • What are your NFS mount options? And IIRC Isilon is a clustered file system. Are all the cluster nodes synced to the same time? – Andrew Henle Jul 21 '16 at 13:40
  • Edited - we've tried a variety of mount options (e.g. `noac`, `nocto`, `lookupcache=none`, `actimeo=0`, `sync` etc.). The nodes are using the same NTP servers as each other. As always though, there'll be a little jitter. – Sobrique Jul 21 '16 at 14:01
  • Have you talked to EMC/Isilon support? – ewwhite Jul 21 '16 at 14:03
  • Case open at the moment. I _suspect_ there's an Isilon code bug, but I'm I'm hoping to run other possible sources of problems in case that's not so. – Sobrique Jul 21 '16 at 14:04
  • I'm not 100% sure this is an Isilon fault - but it does occur on EL7 and EL6.5, and the packet capture would seem to indicate the Isilon is reporting time changes. – Sobrique Jul 21 '16 at 14:05

0 Answers0