
When doing an ls inside an Amazon EFS mount point, it just hangs.

The troubleshooting section of the AWS EFS documentation mentions the following:

Mount Does Not Respond

An Amazon EFS mount appears unresponsive. For example, commands like ls hang.

Action to Take

This error can occur if another application is writing large amounts of data to the file system. Access to the files that are being written might be blocked until the operation is complete. In general, any commands or applications that attempt to access files that are being written to might appear to hang. For example, the ls command might hang when it gets to the file that is being written. This is because some Linux distributions alias the ls command so that it retrieves file attributes in addition to listing the directory contents.

To resolve this issue, verify that another application is writing files to the Amazon EFS mount, and that it is in the Uninterruptible sleep (D) state, as in the following example:

$ ps aux | grep large_io.py

root 33253 0.5 0.0 126652 5020 pts/3 D+ 18:22 0:00 python large_io.py /efs/large_file

After you've verified that this is the case, you can address the issue by waiting for the other write operation to complete, or by implementing a workaround. In the example of ls, you can use the /bin/ls command directly, instead of an alias, which will allow the command to proceed without hanging on the file being written. In general, if the application writing the data can force a data flush periodically, perhaps by using fsync(2), this might help improve the responsiveness of your file system for other applications. However, this improvement might be at the expense of performance when the application writes data.
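As a rough illustration of the /bin/ls workaround quoted above, the sketch below lists only entry names and bypasses any ls alias. The function name list_names_only is hypothetical, and whether this avoids the hang depends on what is actually causing it.

```shell
# Hypothetical helper: list entry names without alias-added options
# such as --color=auto, which can make ls look up attributes for each file.
list_names_only() {
    # "command" skips shell aliases and functions named ls;
    # -1 prints one name per line.
    command ls -1 -- "${1:-.}"
}
```

For example, `list_names_only /efs/` would behave like `/bin/ls -1 /efs/` on a system where ls lives in /bin.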

So I checked whether anything was writing to it, but the only processes that showed up were:

root 43556 0.0 0.0 124356 756 pts/6 D+ 19:15 0:00 ls --color=auto /efs/

root 43558 0.0 0.0 112664 972 pts/3 S+ 19:16 0:00 grep --color=auto efs

So nothing is being written to EFS as far as I know. Are there any other things I can look into as causes of this?

To verify, I also mounted the EFS file system on a separate machine, and on another machine in a different AZ using the mount target in that AZ, and saw the same behavior.

update:

lsof shows:

nfsv4.1-s 113422 root cwd DIR 202,1 4096 128 /

nfsv4.1-s 113422 root rtd DIR 202,1 4096 128 /

nfsv4.1-s 113422 txt cwd unknown /proc/113422/exe

This disappears when unmounted, and reappears after mounting.

John Doe
  • Check the output of 'lsof' to see if anything is accessing the efs volume. – Jason Martin Nov 15 '17 at 23:40
  • Are there a lot of files in the directory? How much total data is in the filesystem? – jordanm Nov 16 '17 at 05:39
  • @JasonMartin Added the output to the question. – John Doe Dec 04 '17 at 15:41
  • @jordanm The EFS mount has a lot of small files, about 5TB in total – John Doe Dec 04 '17 at 15:42
  • 5TB in small files..... How many directories is this spread across / how many files do you have per directory? – USD Matt Dec 04 '17 at 15:45
  • @USDMatt I don't have an exact number right now, but I'd say that it's between 500 and 800 directories, files per directory I'm not sure at the moment. – John Doe Dec 04 '17 at 15:53
  • I'm not sure how EFS handles file metadata but most traditional file systems really do not like large numbers of files in a single directory. The metadata lookups required for a listing can kill performance, especially in distributed systems. Personally I would aim for ~1000 files with a maximum of 10,000 in a directory, but those are pretty arbitrary numbers. – USD Matt Dec 04 '17 at 16:14
  • @USDMatt I'll keep that in mind. At this point though I'm not sure how else to proceed with this unresponsive EFS. I expected EFS to be able to work with what we're throwing at it since Amazon advertises it as such https://aws.amazon.com/efs/details/#performance – John Doe Dec 04 '17 at 16:33
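
To put a number on the files-per-directory question raised in the comments, a minimal sketch like the following could count the direct children of one directory. The function name count_entries is hypothetical, and on a slow or unresponsive mount this command may itself take a long time.

```shell
# Hypothetical helper: count the direct children of a directory.
# Names containing newlines would be counted more than once.
count_entries() {
    # -mindepth 1 excludes the directory itself; -maxdepth 1 stops recursion.
    echo $(( $(find "${1:-.}" -mindepth 1 -maxdepth 1 | wc -l) ))
}
```

Running it against one of the larger directories under /efs/ would show whether the count is anywhere near the thresholds mentioned above.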

1 Answer


Given all the previous information, it is difficult to say exactly what is going on. However, you need that Amazon EFS mount to work, so:

Your lsof results show what is likely a pseudo-file in the /proc filesystem. At some point that process lost its executable, and I suspect it is trying to keep running. The entry disappears when you unmount because lsof can no longer see the volume, and it reappears when you re-mount. This may be the process that is chewing up resources. When you run ps, do you see process 113422? Since you did not report another application running, you can try killing this process.

First, I would run ps aux to list all processes, including background ones, and look for process 113422. If it is there, what is it running (or what does it think it is running)? If you are comfortable stopping that process, run kill -9 113422 to stop it entirely.
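
To make the blocked processes easier to spot in that listing, here is a small sketch that keeps only rows in the uninterruptible sleep (D) state. It assumes the ps aux column layout shown in the question, where STAT is the eighth field; the function name filter_d_state is hypothetical.

```shell
# Hypothetical helper: read `ps aux` output on stdin and print only
# the rows whose STAT column (field 8) starts with D.
filter_d_state() {
    awk '$8 ~ /^D/'
}
```

It would be used as `ps aux | filter_d_state`. With the two rows from the question as input, only the `ls --color=auto /efs/` row is printed.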

Re-try your ls command, and it should run normally. You can also use the /bin/ls command directly. In fact, since you have so many small files, I’d recommend using this method only, so the system won’t hang waiting on a file.

As for performance, from your comment it sounds like you chose EFS due to the unrestricted filesystem size, so likely EBS wasn’t an option although it can provide better performance. Each type has its pros and cons. However, if you keep experiencing issues, perhaps re-visiting the filesystem decision will help.

Pang
Mika Wolf