119

Background: physical server, about two years old, 7200-RPM SATA drives connected to a 3Ware RAID card, ext3 FS mounted noatime and data=ordered, not under crazy load, kernel 2.6.18-92.1.22.el5, uptime 545 days. Directory doesn't contain any subdirectories, just millions of small (~100 byte) files, with some larger (a few KB) ones.

We have a server that has gone a bit cuckoo over the course of the last few months, but we only noticed it the other day when it started being unable to write to a directory due to it containing too many files. Specifically, it started throwing this error in /var/log/messages:

ext3_dx_add_entry: Directory index full!

The disk in question has plenty of inodes remaining:

Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/sda3            60719104 3465660 57253444    6% /

So I'm guessing that means we hit the limit of how many entries can be in the directory file itself. No idea how many files that would be, but it can't be more, as you can see, than three million or so. Not that that's good, mind you! But that's part one of my question: exactly what is that upper limit? Is it tunable? Before I get yelled at—I want to tune it down; this enormous directory caused all sorts of issues.

Anyway, we tracked down the issue in the code that was generating all of those files, and we've corrected it. Now I'm stuck with deleting the directory.

A few options here:

  1. rm -rf (dir)

    I tried this first. I gave up and killed it after it had run for a day and a half without any discernible impact.

  2. unlink(2) on the directory: Definitely worth consideration, but the question is whether it'd be faster to delete the files inside the directory via fsck than to delete via unlink(2). That is, one way or another, I've got to mark those inodes as unused. This assumes, of course, that I can tell fsck not to drop entries to the files in /lost+found; otherwise, I've just moved my problem. In addition to all the other concerns, after reading about this a bit more, it turns out I'd probably have to call some internal FS functions, as none of the unlink(2) variants I can find would allow me to just blithely delete a directory with entries in it. Pooh.
  3. while [ true ]; do ls -Uf | head -n 10000 | xargs rm -f 2>/dev/null; done

    This is actually the shortened version; the real one I'm running, which just adds some progress-reporting and a clean stop when we run out of files to delete, is:

    export i=0;
    time ( while [ true ]; do
      ls -Uf | head -n 3 | grep -qF '.png' || break;
      ls -Uf | head -n 10000 | xargs rm -f 2>/dev/null;
      export i=$(($i+10000));
      echo "$i...";
    done )

    This seems to be working rather well. As I write this, it has deleted 260,000 files in the past thirty minutes or so.

Now, for the questions:
  1. As mentioned above, is the per-directory entry limit tunable?
  2. Why did it take "real 7m9.561s / user 0m0.001s / sys 0m0.001s" to delete a single file which was the first one in the list returned by ls -U, and it took perhaps ten minutes to delete the first 10,000 entries with the command in #3, but now it's hauling along quite happily? For that matter, it deleted 260,000 in about thirty minutes, but it's now taken another fifteen minutes to delete 60,000 more. Why the huge swings in speed?
  3. Is there a better way to do this sort of thing? Not store millions of files in a directory; I know that's silly, and it wouldn't have happened on my watch. Googling the problem and looking through SF and SO offers a lot of variations on find that are not going to be significantly faster than my approach for several self-evident reasons. But does the delete-via-fsck idea have any legs? Or something else entirely? I'm eager to hear out-of-the-box (or inside-the-not-well-known-box) thinking.
Thanks for reading the small novel; feel free to ask questions and I'll be sure to respond. I'll also update the question with the final number of files and how long the delete script ran once I have that.

Final script output!:

2970000...
2980000...
2990000...
3000000...
3010000...

real    253m59.331s
user    0m6.061s
sys     5m4.019s

So, three million files deleted in a bit over four hours.

BMDan
  • 7,129
  • 2
  • 22
  • 34
  • 1
    rm (GNU coreutils) 8.4 has this option: *"-v, --verbose explain what is being done"*. It will display all the files that are being deleted. – Cristian Ciupitu Sep 27 '10 at 18:34
  • 2
    Actually, that'd be a neat way to do a progress bar: since each file would be thirty-seven characters long (36 + a '\n'), I could easily write a parser for that, and since printf() is cheap and the rm command already has the name of the file loaded, there's no especial performance penalty. Seems like a non-starter for doing the whole shebang, since I could never get "rm" to do anything like that, anyway. But it could work quite well as an intra-10,000 progress bar; perhaps a "." for every hundred files? – BMDan Sep 27 '10 at 21:40
  • 8
    `rm -rfv | pv -l >/dev/null`. pv should be available in the [EPEL](https://fedoraproject.org/wiki/EPEL/FAQ#howtouse) repository. – Cristian Ciupitu Sep 27 '10 at 23:04
  • 5
    pv is overwhelmingly awesome. I leave a trail of pv installations in my wake. – BMDan Sep 28 '10 at 01:54
  • I had this exact same issue recently. Thankyou! – richo Dec 23 '10 at 23:26
  • Just the `ls -U` tip save my world. Thanks! – Rafael Kassner Jul 07 '13 at 16:36
  • @CristianCiupitu your comment about using pipe viewer deserves its own answer -- it's a great option that works and anything to help pv get visibility to admins that don't know about it would be great – Bane Oct 09 '14 at 14:26
  • @Bane, there's already an [answer](http://serverfault.com/a/184838/4276) mentioning it, but if you think it's not enough feel free to expand it or write your own. – Cristian Ciupitu Oct 09 '14 at 15:09

24 Answers

99

Update August 2021

This answer continues to attract a lot of attention, and I feel it's so woefully out of date that it's largely redundant now.

Doing a find ... -delete is most likely going to produce acceptable results in terms of performance.

The one area I felt might result in a higher performance is tackling the 'removing' part of the problem instead of the 'listing' part.

I tried it and it didn't work. But I felt it was useful to explain what I did and why.

In today's newer kernels, through the use of the IO uring subsystem in the kernel (see man 2 io_uring_setup), it is actually possible to perform unlinks asynchronously -- meaning we can submit unlink requests without waiting or blocking to see the result.

This program basically reads a directory, submits hundreds of unlinks without waiting for the result, then reaps the results later once the system is done handling the request.

It tries to do what dentls did but uses IO uring. Can be compiled with gcc -o dentls2 dentls2.c -luring.

#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <err.h>
#include <errno.h>
#include <sched.h>

#include <sys/stat.h>
#include <sys/types.h>
#include <dirent.h>

#include <linux/io_uring.h>
#include <liburing.h>

/* Try to keep the queue size to under two pages as internally its stored in
 * the kernel as contiguously ordered pages. Basically the bigger you make it
 * the higher order it becomes and the less likely you'll have the contiguous
 * pages to support it, despite not hitting any user limits.
 * This reduces an ENOMEM here by keeping the queue size as order 1
 * Ring size internally is roughly 24 bytes per entry plus overheads I haven't
 * accounted for.
 */
#define QUEUE_SIZE 256

/* Globals to manage the queue */
static volatile int pending = 0;
static volatile int total_files = 0;

/* Probes kernel uring implementation and checks if action is 
 * supported inside the kernel */
static void probe_uring(
    struct io_uring *ring)
{
  struct io_uring_probe *pb = NULL;

  pb = io_uring_get_probe_ring(ring);

  /* Can we perform IO uring unlink in this kernel ? */
  if (!io_uring_opcode_supported(pb, IORING_OP_UNLINKAT)) {
    free(pb);
    errno = ENOTSUP;
    err(EXIT_FAILURE, "Unable to configure uring");
  }

  free(pb);
}


/* Place a unlink call for the specified file/directory on the ring */
static int submit_unlink_request(
    int dfd,
    const char *fname,
    struct io_uring *ring)
{
  char *fname_cpy = strdup(fname);
  struct io_uring_sqe *sqe = NULL;

  /* Fetch a free submission entry off the ring */
  sqe = io_uring_get_sqe(ring);
  if (!sqe)
    /* Submission queue full */
    return 0;

  pending++;
  /* Format the unlink call for submission */
  io_uring_prep_rw(IORING_OP_UNLINKAT, sqe, dfd, fname_cpy, 0, 0);
  sqe->unlink_flags = 0;

  /* Set the data to just be the filename. Useful for debugging
   * at a later point */
  io_uring_sqe_set_data(sqe, fname_cpy);

  return 1;
}


/* Submit the pending queue, then reap the queue
 * clearing up room on the completion queue */
static void consume_queue(
    struct io_uring *ring)
{
  char *fn;
  int i = 0, bad = 0;
  int rc;
  struct io_uring_cqe **cqes = NULL;

  if (pending < 0)
    abort();

  cqes = calloc(pending, sizeof(struct io_uring_cqe *));
  if (!cqes)
    err(EXIT_FAILURE, "Cannot find memory for CQE pointers");

  /* Notify about submitted entries from the queue (this is a async call) */
  io_uring_submit(ring);

  /* We can immediately take a peek to see if we've anything completed */
  rc = io_uring_peek_batch_cqe(ring, cqes, pending);

  /* Iterate the list of completed entries. Check nothing crazy happened */
  for (i=0; i < rc; i++) {
    /* This returns the filename we set earlier */
    fn = io_uring_cqe_get_data(cqes[i]);

    /* Check the error code of the unlink calls */
    if (cqes[i]->res < 0) {
      errno = -cqes[i]->res;
      warn("Unlinking entry %s failed", fn);
      bad++;
    }

    /* Clear up our CQE */
    free(fn);
    io_uring_cqe_seen(ring, cqes[i]);
  }

  pending -= rc + bad;
  total_files += rc - bad;
  free(cqes);
}



/* Main start */
int main(
    const int argc,
    const char **argv)
{
  struct io_uring ring = {0};
  struct stat st = {0};
  DIR *target = NULL;
  int dfd;
  struct dirent *fn;

  /* Check initial arguments passed make sense */
  if (argc < 2)
    errx(EXIT_FAILURE, "Must pass a directory to remove files from.");

  /* Check path validity */
  if (lstat(argv[1], &st) < 0)
    err(EXIT_FAILURE, "Cannot access target directory");

  if (!S_ISDIR(st.st_mode)) 
    errx(EXIT_FAILURE, "Path specified must be a directory");

  /* Open the directory */
  target = opendir(argv[1]);
  if (!target)
    err(EXIT_FAILURE, "Opening the directory failed");
  dfd = dirfd(target);

  /* Create the initial uring for handling the file removals */
  if (io_uring_queue_init(QUEUE_SIZE, &ring, 0) < 0)
    err(EXIT_FAILURE, "Cannot initialize URING");

  /* Check the unlink action is supported */
  probe_uring(&ring);

  /* So as of writing this code, GETDENTS doesn't have URING support.
   * but checking the kernel mailing list indicates its in progress.
   * For now, we'll just do laymans readdir(). These days theres no 
   * actual difference between it and making the getdents() call ourselves.
   */
  while (fn = readdir(target)) {
    if (fn->d_type != DT_REG)
      /* Pay no attention to non-files */
      continue;

    /* Add to the queue until its full, try to consume it
     * once its full. 
     */
    while (!submit_unlink_request(dfd, fn->d_name, &ring)) {
      /* When the queue becomes full, consume queued entries */
      consume_queue(&ring);
      /* This yield is here to give the uring a chance to 
       * complete pending requests */
      sched_yield();
      continue;
    }
  }

  /* Out of files in directory to list. Just clear the queue */
  while (pending) {
    consume_queue(&ring);
    sched_yield();
  }

  printf("Total files: %d\n", total_files);

  io_uring_queue_exit(&ring);
  closedir(target);
  exit(0);
}

The results were ironically the opposite of what I suspected, but why?

TMPFS with 4 million files

$ time ./dentls2 /tmp/many
Total files: 4000000

real    0m6.459s
user    0m0.360s
sys 0m24.224s

Using find:

$ time find /tmp/many -type f -delete

real    0m9.978s
user    0m1.872s
sys 0m6.617s

BTRFS with 10 million files

$ time ./dentls2 ./many
Total files: 10000000

real    10m25.749s
user    0m2.214s
sys 16m30.865s

Using find:

time find ./many -type f -delete

real    7m1.328s
user    0m9.209s
sys 4m42.000s

So it looks as if batched syscalls don't make an improvement in real time. The new dentls2 spends much more time working (four times as much) only to produce worse performance: a net loss in overall efficiency and worse latency.

The cause is that io_uring spawns kernel dispatcher threads to do the unlink work internally, but the directory inode being worked on can only be modified by a single writer at a time.

Basically using the uring we're creating lots of little threads but only one thread is allowed to delete from the directory. We've just created a bunch of contention and eliminated the advantage of doing batched IO.

Using eBPF you can measure the unlink frequencies and watch what causes the delays.

In the case of BTRFS, it's the kernel function btrfs_commit_inode_delayed_inode that acquires the lock when unlink is called.

With dentls2

# /usr/share/bcc/tools/funclatency btrfs_commit_inode_delayed_inode
    Tracing 1 functions for "btrfs_commit_inode_delayed_inode"... Hit Ctrl-C to end.

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 18       |                                        |
       512 -> 1023       : 120      |                                        |
      1024 -> 2047       : 50982    |                                        |
      2048 -> 4095       : 2569467  |********************                    |
      4096 -> 8191       : 4936402  |****************************************|
      8192 -> 16383      : 1662380  |*************                           |
     16384 -> 32767      : 656883   |*****                                   |
     32768 -> 65535      : 85409    |                                        |
     65536 -> 131071     : 21715    |                                        |
    131072 -> 262143     : 9719     |                                        |
    262144 -> 524287     : 5981     |                                        |
    524288 -> 1048575    : 857      |                                        |
   1048576 -> 2097151    : 293      |                                        |
   2097152 -> 4194303    : 220      |                                        |
   4194304 -> 8388607    : 255      |                                        |
   8388608 -> 16777215   : 153      |                                        |
  16777216 -> 33554431   : 56       |                                        |
  33554432 -> 67108863   : 6        |                                        |
  67108864 -> 134217727  : 1        |                                        |

avg = 8533 nsecs, total: 85345432173 nsecs, count: 10000918

Using find ... -delete:

# /usr/share/bcc/tools/funclatency btrfs_commit_inode_delayed_inode
Tracing 1 functions for "btrfs_commit_inode_delayed_inode"... Hit Ctrl-C to end.
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 34       |                                        |
       512 -> 1023       : 95       |                                        |
      1024 -> 2047       : 1005784  |****                                    |
      2048 -> 4095       : 8110338  |****************************************|
      4096 -> 8191       : 672119   |***                                     |
      8192 -> 16383      : 158329   |                                        |
     16384 -> 32767      : 42338    |                                        |
     32768 -> 65535      : 4667     |                                        |
     65536 -> 131071     : 3597     |                                        |
    131072 -> 262143     : 2860     |                                        |
    262144 -> 524287     : 216      |                                        |
    524288 -> 1048575    : 22       |                                        |
   1048576 -> 2097151    : 6        |                                        |
   2097152 -> 4194303    : 3        |                                        |
   4194304 -> 8388607    : 5        |                                        |
   8388608 -> 16777215   : 3        |                                        |

avg = 3258 nsecs, total: 32585481993 nsecs, count: 10000416

You can see from the histogram that find spends 3258 nanoseconds on average in btrfs_commit_inode_delayed_inode but dentls2 spends 8533 nanoseconds in the function.

The histogram also shows that the io_uring threads spend at least twice as long waiting on the lock, with the majority of calls taking 4096-8191 nanoseconds versus the majority in find taking 2048-4095 nanoseconds.

find is single-threaded and isn't contending for the lock, whereas dentls2 is effectively multi-threaded (due to the uring), which produces lock contention; the delays that result are reflected in the analysis.

Conclusion

All in all, on modern systems (as of writing this) there is less and less you can do in software to make this go faster than it already does.

It used to be that by reading a large buffer from the disk you could collapse an expensive set of IO calls into one large sequential read, instead of the seeky IO that small getdents() buffers typically ended up producing.

Also, due to other improvements, the overhead of simply invoking system calls is smaller, and major improvements in sequential/random IO access times have eliminated the big IO bottlenecks we used to experience.

On my systems, this problem has become memory/CPU bound. There's a single-accessor problem on (at least) BTRFS which limits the speed to a single CPU/program's worth of unlinks per directory at a time. Trying to batch the IOs yields at best minor improvements, even in the ideal circumstances of tmpfs, and is typically worse on a real-world filesystem.

To top it off, we really don't have this problem anymore -- gone are the days of 10 million files taking 4 hours to remove.

Just do something simple like find ... -delete. No amount of optimization I tried seemed to yield major performance improvements worth the coding (or analysis) effort over a simple default setup.
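
For this question's case, a minimal sketch of that simple approach (the path is a placeholder; -maxdepth 1 is optional and just keeps find from descending into any subdirectories that might appear):

find /path/to/huge/dir -maxdepth 1 -type f -delete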


Original Answer

Whilst a major cause of this problem is ext3 performance with millions of files, the actual root cause is different.

When a directory needs to be listed, readdir() is called on it, which yields a list of files. readdir is a POSIX call, but the real Linux system call being used here is getdents. getdents lists directory entries by filling a buffer with entries.

The problem is mainly down to the fact that readdir() uses a fixed buffer size of 32Kb to fetch files. As a directory gets larger and larger (the size increases as files are added), ext3 gets slower and slower at fetching entries, and readdir's 32Kb buffer is only large enough to hold a fraction of the entries in the directory. This causes readdir to loop over and over, invoking the expensive system call again and again.

For example, on a test directory I created with over 2.6 million files inside, running "ls -1 | wc -l" under strace shows a large number of getdents system calls.

$ strace ls -1 | wc -l
brk(0x4949000)                          = 0x4949000
getdents(3, /* 1025 entries */, 32768)  = 32752
getdents(3, /* 1024 entries */, 32768)  = 32752
getdents(3, /* 1025 entries */, 32768)  = 32760
getdents(3, /* 1025 entries */, 32768)  = 32768
brk(0)                                  = 0x4949000
brk(0x496a000)                          = 0x496a000
getdents(3, /* 1024 entries */, 32768)  = 32752
getdents(3, /* 1026 entries */, 32768)  = 32760
...

Additionally the time spent in this directory was significant.

$ time ls -1 | wc -l
2616044

real    0m20.609s
user    0m16.241s
sys 0m3.639s

The method to make this a more efficient process is to call getdents manually with a much larger buffer. This improves performance significantly.

Now, you're not supposed to call getdents directly, so no interface exists to use it normally (check the man page for getdents to see!). However, you can call it manually and make your system call invocation far more efficient.

This drastically reduces the time it takes to fetch these files. I wrote a program that does this.

/* I can be compiled with the command "gcc -o dentls dentls.c" */

#define _GNU_SOURCE

#include <dirent.h>     /* Defines DT_* constants */
#include <err.h>
#include <fcntl.h>
#include <getopt.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

struct linux_dirent {
        long           d_ino;
        off_t          d_off;
        unsigned short d_reclen;
        char           d_name[256];
        char           d_type;
};

static int delete = 0;
char *path = NULL;

static void parse_config(
        int argc,
        char **argv)
{
    int option_idx = 0;
    static struct option loptions[] = {
      { "delete", no_argument, &delete, 1 },
      { "help", no_argument, NULL, 'h' },
      { 0, 0, 0, 0 }
    };

    while (1) {
        int c = getopt_long(argc, argv, "h", loptions, &option_idx);
        if (c < 0)
            break;

        switch(c) {
          case 0: {
              break;
          }
 
          case 'h': {
              printf("Usage: %s [--delete] DIRECTORY\n"
                     "List/Delete files in DIRECTORY.\n"
                     "Example %s --delete /var/spool/postfix/deferred\n",
                     argv[0], argv[0]);
              exit(0);                      
              break;
          }

          default:
          break;
        }
    }

    if (optind >= argc)
      errx(EXIT_FAILURE, "Must supply a valid directory\n");

    path = argv[optind];
}

int main(
    int argc,
    char** argv)
{

    parse_config(argc, argv);

    int totalfiles = 0;
    int dirfd = -1;
    int offset = 0;
    int bufcount = 0;
    void *buffer = NULL;
    char *d_type;
    struct linux_dirent *dent = NULL;
    struct stat dstat;

    /* Standard sanity checking stuff */
    if (access(path, R_OK) < 0) 
        err(EXIT_FAILURE, "Could not access directory");

    if (lstat(path, &dstat) < 0) 
        err(EXIT_FAILURE, "Unable to lstat path");

    if (!S_ISDIR(dstat.st_mode))
        errx(EXIT_FAILURE, "The path %s is not a directory.\n", path);

    /* Allocate a buffer of equal size to the directory to store dents */
    if ((buffer = calloc(dstat.st_size*3, 1)) == NULL)
        err(EXIT_FAILURE, "Buffer allocation failure");

    /* Open the directory */
    if ((dirfd = open(path, O_RDONLY)) < 0) 
        err(EXIT_FAILURE, "Open error");

    /* Switch directories */
    fchdir(dirfd);

    if (delete) {
        printf("Deleting files in ");
        for (int i=5; i > 0; i--) {
            printf("%u. . . ", i);
            fflush(stdout);
            sleep(1);
        }
        printf("\n");
    }

    while (bufcount = syscall(SYS_getdents, dirfd, buffer, dstat.st_size*3)) {
        offset = 0;
        dent = buffer;
        while (offset < bufcount) {
            /* Don't print thisdir and parent dir */
            if (!((strcmp(".",dent->d_name) == 0) || (strcmp("..",dent->d_name) == 0))) {
                d_type = (char *)dent + dent->d_reclen-1;
                /* Only print files */
                if (*d_type == DT_REG) {
                    printf ("%s\n", dent->d_name);
                    if (delete) {
                        if (unlink(dent->d_name) < 0)
                            warn("Cannot delete file \"%s\"", dent->d_name);
                    }
                    totalfiles++;
                }
            }
            offset += dent->d_reclen;
            dent = buffer + offset;
        }
    }
    fprintf(stderr, "Total files: %d\n", totalfiles);
    close(dirfd);
    free(buffer);

    exit(0);
}

Whilst this does not combat the underlying fundamental problem (lots of files, in a filesystem that performs poorly at it), it's likely to be much, much faster than many of the alternatives being posted.

As a final note, you should remove the affected directory and recreate it afterwards. Directories only ever increase in size and can remain poorly performing even with a few files inside, due to the size of the directory itself.

Edit: I've cleaned this up quite a bit. I added an option that lets you delete on the command line at runtime and removed a bunch of the treewalk stuff which, honestly, looking back, was questionable at best. It was also shown to produce memory corruption.

You can now do dentls --delete /my/path

New results. Based off of a directory with 1.82 million files.

## Ideal ls Uncached
$ time ls -u1 data >/dev/null

real    0m44.948s
user    0m1.737s
sys 0m22.000s

## Ideal ls Cached
$ time ls -u1 data >/dev/null

real    0m46.012s
user    0m1.746s
sys 0m21.805s


### dentls uncached
$ time ./dentls data >/dev/null
Total files: 1819292

real    0m1.608s
user    0m0.059s
sys 0m0.791s

## dentls cached
$ time ./dentls data >/dev/null
Total files: 1819292

real    0m0.771s
user    0m0.057s
sys 0m0.711s

Was kind of surprised this still works so well!

Matthew Ife
  • 22,927
  • 2
  • 54
  • 71
  • 1
    Two minor concerns: one, `[256]` should probably be `[FILENAME_MAX]`, and two, my Linux (2.6.18==CentOS 5.x) doesn't seem to include a d_type entry in dirent (at least according to getdents(2)). – BMDan Nov 07 '11 at 05:01
  • indeed, Noted on point one. The struct can actually have d_type removed as according to the man page the struct it utilizes doesnt physically represent the struct described. Basically d_type is reclen-1, and thats been in the kernel since 2.6.4 – Matthew Ife Nov 07 '11 at 07:51
  • 1
    Could you please elaborate a bit on btree rebalancing and why deletion in order helps preventing it? I tried Googling for it, unfortunately to no avail. – ovgolovin Jul 28 '13 at 23:28
  • 1
    Because now it seems to me if we are deleting in-order, we force rebalancing, as we remove leaves on one side and leave on the other: https://en.wikipedia.org/wiki/B-tree#Rebalancing_after_deletion – ovgolovin Jul 28 '13 at 23:37
  • 1
    I hope I don't bother you with this matters. But still I started a question about deleting files in-order http://stackoverflow.com/q/17955459/862380, which seems not to receive an answer which will explain the issue with the example, which will be understandable for ordinary programmers. If you have time and feel like so, could you look into it? Maybe you could write a better explanation. – ovgolovin Aug 08 '13 at 18:35
  • `sys 0m1.995s` is not ten times more efficient than `sys 0m3.639s`. You can't compare `stuff | wc -l` with `stuff > foo.txt`. – MattBianco Apr 25 '14 at 09:27
  • I am referring to `real` time not `sys` time. As for using a pipe versus a redirect. I can compare that since both are in-memory operations, one is a kernel memory buffer and the other is dirtying pages in memory. If you think `wc -l` is blocking the pipe as it cannot consume fast enough I would appreciate evidence for that. – Matthew Ife Apr 25 '14 at 11:23
  • @BMDan, your man page was old. The current man page says `d_type` was introduced in Linux 2.6.4. (But unless you're writing one-off code for personal use, you MUST have a fallback to `lstat` or `stat` for filesystems that leave `d_type = DT_UNKNOWN`. And to compile on non GNU or BSD.) – Peter Cordes Mar 17 '15 at 09:47
  • The answer states, the program just makes a list of file, but actually it *is* deleting stuff? – Alex Nov 04 '15 at 10:09
  • @alex You are correct. I edited this comment to correct that (this program was updated a year ago and now deletes -- apologies) – Matthew Ife Nov 04 '15 at 11:36
  • @MatthewIfe This script is pretty hard to kill once it is running :-( And is it possible to directly start the deleting in case there are myriads of files in the same level directory? – Alex Nov 04 '15 at 18:29
  • 1
    The two `printf("%s\n", *(char **)node));` lines (32 and 39) have an extra ")" at the end that must be removed in order to compile. Giving it a go on a server that cannot even finish an `ls` on a directory. Thanks :-) – Jason Feb 12 '16 at 14:02
  • 4
    This is an amazing piece of code. It was the only tool I could find able to list and delete some 11,000,000 (eleven million) session files that had built up in a directory, probably over some years. The Plesk process that was supposed to keep them under control using find and other tricks in other answers here, was unable to complete a run, so the files just kept building up. It is a tribute to the binary tree that the filesystem uses to store the directory, that the sessions were able to work at all - you could create a file and retrieve it with no delay. Just listings were unusable. – Jason Feb 12 '16 at 15:10
  • This is not compiling due to the error mentioned in the above comment by @Jason. Request the OP to correct it. – Anmol Singh Jaggi Apr 30 '16 at 18:11
  • *"Getdents list directory entries by filling a buffer is entries."* Huh? I *guess* that should be `s/is/with/;s/list/lists/`, but I'm not sure. – Wildcard Nov 05 '16 at 17:31
  • 1
    This program seems to have some memory corruption for me. On a dir with contents created by `seq 1 100000 | xargs touch` it prints entries like `4���x` and `1000�R`. – nh2 Jan 01 '18 at 17:57
  • Ah, I see where the memory corruption comes from. You're using `st_size` as the size of the buffer passed to `getdents()`. That can't work, for lots of files it can be too small. See this answer stating [you cannot deduce anything from the value in st_size](https://unix.stackexchange.com/a/252311/100270). For example in my case on ext4, `st_size` is 69 MB but `strace` shows `getdents(...) = 102320056` == 100 MB. – nh2 Jan 01 '18 at 21:48
  • @nh2 It comes from the treewalk. I ditched the twalk and made a bunch of syntatics improvements. I dont think the treewalk really offered a huge benefit on reflection. Just added more cogs unnecessarily. – Matthew Ife Jan 01 '18 at 22:06
  • @MatthewIfe I've put a fixed version of the tree-based program [on Github](https://github.com/nh2/dentls/blob/8ac4588ef9d53a2f8f395041e8a47165334f43a9/dentls.c) that fixes the corruption. The issue was passing `st_size` into `getdents`, as I mentioned. That resulted in the `while` loop around `getdents()` to be run more than once, thus invalidating the pointers in the tree from the loop iteration before (they now can point into arbitrary offsets in `buffer`). I fixed it by never reusing buffers passed to `getdents()`. Will also check out your new version. – nh2 Jan 01 '18 at 22:54
  • @MatthewIfe I don't get what the point of your new versin without the treewalk is. Now you're just doing the same thing as `find mydir -delete`, with slightly bigger `getdents()` buffers. What did you mean with `I dont think the treewalk really offered a huge benefit on reflection`? Did you not see an improvement when using the treewalk? If not, that would invalidate the entire reasoning you presented about deletion order being important due to how the underlying FS works. And this simple equivalent to `find mydir -delete` is O(n²) when benchmarked on ext4/xfs/zfs across increasing dir sizes. – nh2 Jan 01 '18 at 22:59
  • @nh2 Time in sys() is about hafl a second slower using the treewalk. I think thats potentially other environmental factors and the performance is almost the same as a linear buffer deletion. The whole point of the program is to do 'find -delete' with bigger buffers (to alleviate the otherwise large syscall overhead). – Matthew Ife Jan 01 '18 at 23:10
  • For reference to what I said about deletion being O(n²): [Report I just filed about `rm -r` taking quadratic wall time](https://debbugs.gnu.org/cgi/bugreport.cgi?bug=29921); I measured the same with `find mydir -delete`, `rsync` and now also your tree-based version. – nh2 Jan 01 '18 at 23:12
  • 1
    @MatthewIfe `The whole point of the program is to do 'find -delete' with bigger buffers` - ah. I had assumed the main goal was `to avoid btree rebalancing` at the FS level as you had written [here](https://serverfault.com/revisions/328305/8), which I had hoped would make the deletion linear (but it turned out not to help unfortunately). – nh2 Jan 01 '18 at 23:14
  • 1
    After reading @adamf's answer, I also noticed the following: Did you intend to use `ls -u1`, or did you rather mean `ls -U1`? `-u` is the opposite of `-U`, and it seems you may have wanted `-U`. – nh2 May 26 '20 at 22:36
  • @nh2 good spot! At this point I cant remember. This is a very VERY old answer and I'd be kind of reluctant to say it holds up to modern kernels, filesystems and way faster block devices. The whole thing should be re-tested in a more modern environment. – Matthew Ife May 27 '20 at 15:12
  • This doesn't work for me. I just get `# ./dentls --delete backups Deleting files in 5. . . 4. . . 3. . . 2. . . 1. . . Total files: 0` – Turkeyphant Jul 28 '21 at 11:54
  • I should link here: I came here originally via Python bug [shutil.rmtree can have O(n^2) performance on large dirs](https://bugs.python.org/issue32453#msg309303). Then it developed into coreutils discussion [#29921 O(n^2) performance of rm -r](https://debbugs.gnu.org/cgi/bugreport.cgi?bug=29921) and finally a `linux-fsdevel` discussion [O(n^2) deletion performance](https://lore.kernel.org/linux-fsdevel/5ca3808d-4eea-afec-75a6-2cc41f44b868@nh2.me/t/#u). Dave Chinner suggests that quadratic behaviour only occurs until the problem is seek-bound. We should try to confirm it by even larger tests. – nh2 Oct 30 '21 at 12:27
36

The data=writeback mount option deserves to be tried, in order to prevent journaling of the file system. This should be done only for the duration of the deletion; there is a risk, however, if the server is shut down or rebooted during the delete operation.

According to this page,

Some applications show very significant speed improvement when it is used. For example, speed improvements can be seen (...) when applications create and delete large volumes of small files.

The option is set either in fstab or during the mount operation, replacing data=ordered with data=writeback. The file system containing the files to be deleted has to be remounted.
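
A rough sketch of that workflow (device and mount point are placeholders; older ext3 kernels may refuse to change the data mode on a live remount, in which case a full unmount/mount cycle is needed):

umount /mountpoint
mount -o noatime,data=writeback /dev/sdXN /mountpoint
# ... perform the mass deletion ...
umount /mountpoint
mount -o noatime,data=ordered /dev/sdXN /mountpoint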

Déjà vu
  • 5,408
  • 9
  • 32
  • 52
  • 1
    He could also increase the time from the `commit` [option](http://kernel.org/doc/Documentation/filesystems/ext3.txt): "This default value (or any low value) will hurt performance, but it's good for data-safety. Setting it to 0 will have the same effect as leaving it at the default (5 seconds). Setting it to very large values will improve performance". – Cristian Ciupitu Sep 26 '10 at 19:14
  • 1
    Writeback looks stellar, except the documentation I was looking at (http://www.gentoo.org/doc/en/articles/l-afig-p8.xml#doc_chap4) explicitly mentions that it still journals metadata, which I presume includes all the data I'm changing (I'm certainly not changing any data in the files themselves). Is my understanding of the option incorrect? – BMDan Sep 27 '10 at 01:20
  • Lastly, FYI, not mentioned in that link is that fact that data=writeback can be a huge security hole, since data pointed to by a given entry may not have the data that was written there by the app, meaning that a crash could result in the old, possibly-sensitive/private data being exposed. Not a concern here, since we're only turning it on temporarily, but I wanted to alert everyone to that caveat in case either you or others who run across that suggestion weren't aware. – BMDan Sep 27 '10 at 01:23
  • commit: that's pretty slick! Thanks for the pointer. – BMDan Sep 27 '10 at 01:26
  • The `data=writeback` doesn't perform any kind of journaling. This is the reason why a reboot shouldn't happen during the delete operation. I didn't have the occasion to use it on a million files FS, but it should be faster. – Déjà vu Sep 27 '10 at 22:29
  • ring0: I believe that's incorrect; see http://batleth.sapienti-sat.org/projects/FAQs/ext3-faq.html (or just google "ext3 writeback"). Writeback mode does still journal metadata changes. – BMDan Sep 29 '10 at 02:45
  • That's correct - the *definition* is mentioning `metadata` journaling. But according to the link I provided, there is an impact on *deleting large volumes of small files* - which is, I think, the problem of interest? – Déjà vu Sep 29 '10 at 03:16
  • Touché. This is therefore my favorite solution at the moment. Though I'm eager to understand the difference between deleting large numbers of files with writeback versus ordered; it still *seems* to me like it shouldn't make a whit of difference. – BMDan Sep 30 '10 at 04:22
  • I think you are the only person able to perform a test (coulé ? :-) – Déjà vu Sep 30 '10 at 12:58
  • I can't run a test against the original problem; it was nontrivial to keep it around for testing purposes, so once I ran the script, above--the one that ran for 253 minutes--the files were gone. – BMDan Sep 30 '10 at 14:42
  • Another solution would be to read the code of `ext3` to appreciate the difference with or without `writeback`... interesting challenge :-) – Déjà vu Sep 30 '10 at 16:48
  • 2
    `data=writeback` still journals metadata before writing it into the main filesystem. As I understand it, it just doesn't enforce ordering between things like writing an extent map and writing data into those extents. Maybe there are other ordering constraints it relaxes, too, if you saw a perf gain from this. Of course, mounting without the journal at all could be even higher performance. (It might let the metadata changes just happen in RAM, without needing to have anything on disk before the unlink op completes). – Peter Cordes Mar 17 '15 at 11:30
32

Would it be possible to backup all of the other files from this file system to a temporary storage location, reformat the partition, and then restore the files?
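
A hedged outline of what that might look like (paths and device are placeholders; for a root filesystem this would have to be done from rescue media):

rsync -a --exclude='/path/to/hugedir/' /mountpoint/ /temporary/storage/
umount /mountpoint
mkfs.ext3 /dev/sdXN
mount /dev/sdXN /mountpoint
rsync -a /temporary/storage/ /mountpoint/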

jftuga
  • 5,572
  • 4
  • 39
  • 50
  • 3
    I really like this answer, actually. As a practical matter, in this case, no, but it's not one I would have thought of. Bravo! – BMDan Sep 23 '10 at 10:23
  • Exactly what I was thinking too. This is an answer for question 3. Ideal if you ask me :) – Joshua Oct 01 '10 at 15:31
12

There is no per-directory file limit in ext3, just the filesystem inode limit (I think there is a limit on the number of subdirectories, though).

You may still have problems after removing the files.

When a directory has millions of files, the directory entry itself becomes very large. The directory entry has to be scanned for every remove operation, and that takes a varying amount of time for each file, depending on where its entry is located. Unfortunately, even after all the files have been removed, the directory entry retains its size. So further operations that require scanning the directory entry will still take a long time, even if the directory is now empty. The only way to solve that problem is to rename the directory, create a new one with the old name, transfer any remaining files to the new one, and then delete the renamed one.
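
That rename-and-recreate dance is roughly the following (directory name is a placeholder; restore ownership and permissions on the new directory as needed):

mv hugedir hugedir.old
mkdir hugedir
# only if a manageable number of files need to be kept:
mv hugedir.old/* hugedir/
rm -rf hugedir.old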

  • Indeed, I noticed just this behavior after deleting everything. Luckily, we had already mv'd the directory out of the "line of fire", as it were, so I could just rmdir it. – BMDan Sep 23 '10 at 10:24
  • 2
    That said, if there's no per-directory file limit, why did I get "ext3_dx_add_entry: Directory index full!" when there were still inodes available on that partition? There were no subdirectories inside this directory. – BMDan Sep 23 '10 at 10:26
  • 3
    hmm i did a little more research and it seems there's a limit of number of blocks a directory can take up. The exact number of files is dependent on a few things eg filename length. This http://www.gossamer-threads.com/lists/linux/kernel/921942 seems to indicate that with 4k blocks you should be able to have more than 8 million files in a directory. Were they particularly long filenames? – Alex J. Roberts Sep 23 '10 at 12:04
  • Each filename was exactly 36 characters long. – BMDan Sep 23 '10 at 15:34
  • well that's me out of ideas :) – Alex J. Roberts Sep 24 '10 at 00:13
8

I haven't benchmarked it, but this guy did:

rsync -a --delete ./emptyDirectory/ ./hugeDirectory/
Qtax
  • 53
  • 5
Alix Axel
  • 2,653
  • 6
  • 28
  • 28
7

TLDR: use rsync -a --delete emptyfolder/ x.

This question has 50k views, and quite a few answers, but nobody seems to have benchmarked all the different replies. There's one link to an external benchmark, but that one's over 7 years old and didn't look at the program provided in this answer: https://serverfault.com/a/328305/565293

Part of the difficulty here is that the time it takes to remove a file depends heavily on the disks in use and the file system. In my case, I tested both with a consumer SSD running BTRFS on Arch Linux (updated as of 2020-03), but I got the same ordering of the results on a different distribution (Ubuntu 18.04), filesystem (ZFS), and drive type (HDD in a RAID10 configuration).

Test setup was identical for each run:

# setup
mkdir test && cd test && mkdir empty
# create 800000 files in a folder called x
mkdir x && cd x
seq 800000 | xargs touch
cd ..

Test results:

rm -rf x: 30.43s

find x/ -type f -delete: 29.79s

perl -e 'for(<*>){((stat)[9]<(unlink))}': 37.97s

rsync -a --delete empty/ x: 25.11s

(The following is the program from this answer, but modified to not print anything or wait before it deletes files.)

./dentls --delete x: 29.74s

The rsync version proved to be the winner every time I repeated the test, although by a pretty low margin. The perl command was slower than any other option on my systems.

Somewhat shockingly, the program from the top answer to this question proved to be no faster on my systems than a simple rm -rf. Let's dig into why that is.

First of all, the answer claims that the problem is that rm uses readdir with a fixed 32Kb buffer for getdents. This proved not to be the case on my Ubuntu 18.04 system, which used a buffer four times larger. On the Arch Linux system, it was using getdents64.

In addition, the answer misleadingly provides statistics giving its speed at listing the files in a large directory, but not removing them (which is what the question was about). It compares dentls to ls -u1, but a simple strace reveals that getdents is not the reason why ls -u1 is slow, at least not on my system (Ubuntu 18.04 with 1000000 files in a directory):

strace -c ls -u1 x >/dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 94.00    7.177356           7   1000000           lstat
  5.96    0.454913        1857       245           getdents
[snip]

This ls command makes a million calls to lstat, which slows the program way down. The getdents calls only add up to 0.455 seconds. How long do the getdents calls take in dentls on the same folder?

strace -c ./dentls x >/dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.91    0.489895       40825        12           getdents
[snip]

That's right! Even though dentls only makes 12 calls instead of 245, it actually takes the system longer to run these calls. So the explanation given in that answer is actually incorrect - at least for the two systems I've been able to test this on.

The same applies to rm and dentls --delete. Whereas rm takes 0.42s calling getdents, dentls takes 0.53s. In either case, the vast majority of the time is spent calling unlink!

So in short, don't expect to see massive speedups running dentls, unless your system is like the author's and has a lot of overhead on individual getdents. Maybe the glibc folks have considerably sped it up in the years since the answer was written, and it now takes a linear amount of time to respond for different buffer sizes. Or maybe the response time of getdents depends on the system architecture in some way that isn't obvious.

adamf
  • 171
  • 1
  • 1
  • 1
    I noticed the use of `-u`, and I suspect that should be `-U` (which is the opposite, `-U` means to not sort on per-file properties, so that `lstat` can be avoided). I've [asked on the other answer](https://serverfault.com/questions/183821/rm-on-a-directory-with-millions-of-files/328305#comment1322881_328305) just now whether @Matthew Ife used `-u` accidentally. – nh2 May 26 '20 at 22:37
  • Could you `strace` the rsync approach and check which syscalls it makes, to figure out why it's faster than `dentls` here? – nh2 May 26 '20 at 22:39
  • Never saw this but quite honestly, really good work. My original answer is nearly 10 years old. Thats an EON in computing terms. At the VERY LEAST ext4 and extents really REALLY speeds up unlink() for large files and I'm sure stuff like block preallocation helps unlink run faster on larger directories. As for the getdents(), there is a overhead but on modern day block devices its negligable. In fact I'm not even sure on modern filesystem the problem of 'millions of files in a directory' has the same impact as it used to have. – Matthew Ife May 27 '20 at 15:19
  • 1
    As @MatthewIfe said, the circumstances have changed quite a bit from when I originally posed this question. I also had the opportunity to spend some time with a coreutils maintainer, who had some very cogent points about ways in which they had dramatically improved the performance of `rm -r` in the time since the release of the downright Paleolithic version included with CentOS 5.x. Moral of the story is: if you're on a modern system, you can probably just run `rm -rf`. Otherwise, feel free to run the updated version of dentls on Github, but @nh2 should totally accept my PR on it. :D – BMDan Nov 13 '20 at 22:26
4

find simply did not work for me, even after changing the ext3 fs's parameters as suggested by the users above. It consumed way too much memory. This PHP script did the trick - fast, with insignificant CPU and memory usage:

<?php
$dir = '/directory/in/question';
if ($dh = opendir($dir)) {
    while (($file = readdir($dh)) !== false) {
        if ($file === '.' || $file === '..') {
            continue;  // skip the . and .. entries
        }
        unlink($dir . '/' . $file);
    }
    closedir($dh);
}
?>

I posted a bug report regarding this trouble with find: http://savannah.gnu.org/bugs/?31961

Alexandre
  • 151
  • 4
3

Make sure you do:

mount -o remount,rw,noatime,nodiratime /mountpoint

which should speed things up a bit as well.

Cristian Ciupitu
  • 6,226
  • 2
  • 41
  • 55
karmawhore
  • 3,865
  • 17
  • 9
  • 4
    Good call, but it's already mounted noatime, as I mentioned in the header to the question. And nodiratime is redundant; see http://lwn.net/Articles/245002/ . – BMDan Sep 27 '10 at 21:19
  • 1
    ppl repeat this mantra "noatime,nodiratime,nodevatime,noreadingdocsatime" – poige Jun 12 '16 at 06:11
3

I recently faced a similar issue and was unable to get ring0's data=writeback suggestion to work (possibly due to the fact that the files are on my main partition). While researching workarounds I stumbled upon this:

tune2fs -O ^has_journal <device>

This will turn off journaling completely, regardless of the data option given to mount. I combined this with noatime, the volume had dir_index set, and it seemed to work pretty well. The delete actually finished without me needing to kill it, my system remained responsive, and it's now back up and running (with journaling back on) with no issues.
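
For reference, the full round trip looks roughly like this (device and mount point are placeholders; tune2fs will only remove the journal from an unmounted or read-only filesystem, and the journal should be re-created and the filesystem checked afterwards):

umount /mountpoint
tune2fs -O ^has_journal /dev/sdXN
mount -o noatime /dev/sdXN /mountpoint
# ... perform the mass deletion ...
umount /mountpoint
tune2fs -j /dev/sdXN
e2fsck -f /dev/sdXN
mount /dev/sdXN /mountpoint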

Matthew Read
  • 132
  • 1
  • 9
  • I was going to suggest mounting it as ext2 instead of ext3, to avoid journaling the metadata ops. This should do the same. – Peter Cordes Mar 17 '15 at 09:26
2

ls is a very slow command. Try:

find /dir_to_delete ! -iname "*.png" -type f -delete
bindbn
  • 5,153
  • 2
  • 26
  • 23
  • rm -rf ran for a day and a half, and I finally killed it, without ever knowing if it had actually accomplished anything. I needed a progress bar. – BMDan Sep 23 '10 at 10:18
  • 4
    As to rm being very slow, "time find . -delete" on 30k files: 0m0.357s / 0m0.019s / 0m0.337s real/user/sys. "time ( ls -1U | xargs rm -f )" on those same files: 0m0.366s / 0m0.025s / 0m0.340s. Which is basically margin-of-error territory. – BMDan Sep 23 '10 at 10:20
  • 1
    You could have just run `strace -r -p ` to attach to the already-running rm process. Then you can see how fast `unlink` system calls are scrolling past. (`-r` puts the time since the previous system call at the start of every line.) – Peter Cordes Mar 17 '15 at 09:14
2

Obviously not apples to apples here, but I set up a little test and did the following:

Created 100,000 512-byte files in a directory (dd and /dev/urandom in a loop); I forgot to time it, but it took roughly 15 minutes to create those files.
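
Something along these lines, as a sketch of the creation step (not the exact loop used):

mkdir testdir && cd testdir
for i in $(seq 1 100000); do
    dd if=/dev/urandom of=file.$i bs=512 count=1 2>/dev/null
done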

Ran the following to delete said files:

ls -1 | wc -l && time find . -type f -delete

100000

real    0m4.208s
user    0m0.270s
sys     0m3.930s 

This is a Pentium 4 2.8GHz box (couple hundred GB IDE 7200 RPM I think; EXT3). Kernel 2.6.27.

gravyface
  • 13,947
  • 16
  • 65
  • 100
  • Interesting, so perhaps the fact that the files were being created over a long period of time is relevant? But that shouldn't matter; the block cache should have all of the pertinent metadata blocks in RAM. Maybe it's because unlink(2) is transactional? In your estimation, would turning off journaling for the duration of the rm be a potential (albeit admittedly somewhat dangerous) solution? It doesn't look like you can just turn off journaling entirely on a mounted filesystem without a tune2fs/fsck/reboot, which somewhat defeats the purpose. – BMDan Sep 27 '10 at 01:11
  • I can't comment on that, but anecdotally (in various NIX discussions over the years), I've always heard that `rm` is horribly slow on a large number of files, hence the `find -delete` option. With a wildcard on the shell, it'll expand each filename matched, and I'm assuming there's a limited memory buffer for that, so you could see how that would get inefficient. – gravyface Sep 28 '10 at 12:58
  • 1
    rm would be slow because it's looking for a file by name, which means iterating through the directory entries one by one until it finds it. In this case, however, since each entry it's being handed is (at that point) the first one in the list (ls -U/ls -f), it should be *nearly* as fast. That said, rm -rf , which should have run like a champ, was slow as could be. Perhaps it's time to write a patch to coreutils to speed massive deletes? Maybe it's secretly globbing/sorting in some recursive way in order to implement rm -rf? Uncertainties like this are why I asked the question. ;) – BMDan Sep 30 '10 at 03:11
  • 1
    Reboot the machine after you run the creation step. You should get a noticeably slower delete. – hookenz Apr 23 '12 at 23:09
2

Is dir_index set for the filesystem? (tune2fs -l | grep dir_index) If not, enable it. It's usually on for new RHEL.

Cristian Ciupitu
  • 6,226
  • 2
  • 41
  • 55
sam
  • 21
  • 1
2

A couple of years back, I found a directory with 16 million XML files in the / filesystem. Due to the criticality of the server, we used the following command, which took about 30 hours to finish:

perl -e 'for(<*>){((stat)[9]<(unlink))}'

It was an old 7200 rpm hdd, and despite the IO bottleneck and CPU spikes, the old webserver continued its service.

a_foaley
  • 31
  • 4
1

My preferred option is the newfs approach, already suggested. The basic problem is, again as already noted, that the linear scan used to handle deletion is problematic.

rm -rf should be near optimal for a local filesystem (NFS would be different). But at millions of files, 36 bytes per filename and 4 per inode (a guess, not having checked the value for ext3), that's 40 bytes times millions to be kept in RAM just for the directory.

At a guess, you're thrashing the filesystem metadata cache memory in Linux, so that blocks for one page of the directory file are being expunged while you're still using another part, only to hit that page of the cache again when the next file is deleted. Linux performance tuning isn't my area, but /proc/sys/{vm,fs}/ probably contain something relevant.
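
If you want to poke at that, these are the sorts of knobs involved (the value written below is purely illustrative, not a recommendation):

cat /proc/sys/fs/dentry-state               # dentry cache statistics
cat /proc/sys/vm/vfs_cache_pressure         # how aggressively dentries/inodes are reclaimed
echo 50 > /proc/sys/vm/vfs_cache_pressure   # illustrative value only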

If you can afford downtime, you might consider turning on the dir_index feature. It switches the directory index from linear to something far more optimal for deletion in large directories (hashed b-trees). tune2fs -O dir_index ... followed by e2fsck -D would work. However, while I'm confident this would help before there are problems, I don't know how the conversion (e2fsck with the -D) performs when dealing with an existing v.large directory. Backups + suck-it-and-see.
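
A sketch of that conversion (device and mount point are placeholders; the e2fsck -D pass must run against an unmounted filesystem):

tune2fs -O dir_index /dev/sdXN     # set the feature flag
umount /mountpoint
e2fsck -fD /dev/sdXN               # -D rebuilds/optimizes existing directory indexes
mount /dev/sdXN /mountpoint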

Phil P
  • 3,040
  • 1
  • 15
  • 19
  • 2
    http://www.pubbs.net/201008/squid/10731-squid-users-linux-ext3-optimizations-for-squid.html suggests that `/proc/sys/fs/vfs_cache_pressure` might be the value to use, but I don't know whether the directory itself counts towards page cache (because that's what it is) or the inode cache (because, despite not being an inode, it's FS metadata and bundled into there for that reason). As I say, Linux VM tuning is not my area. Play and see what helps. – Phil P Sep 26 '10 at 12:10
1

Sometimes Perl can work wonders in cases like this. Have you already tried whether a small script such as this could outperform bash and the basic shell commands?

#!/usr/bin/perl
my $dir = "/path/to/your/directory";
opendir(ANNOYINGDIR, $dir) or die "Cannot open $dir: $!";
my @files = grep { /\.png$/ } readdir(ANNOYINGDIR);
closedir(ANNOYINGDIR);

for (@files) {
    printf "Deleting %s\n", $_;
    unlink "$dir/$_";
}

Or another, perhaps even faster, Perl approach:

#!/usr/bin/perl
unlink(glob("/path/to/your/directory/*.png")) or die("Could not delete files, this happened: $!");

EDIT: I just gave my Perl scripts a try. The more verbose one does something right. In my case I tried this with a virtual server with 256 MB RAM and half a million files.

time find /test/directory | xargs rm results:

real    2m27.631s
user    0m1.088s
sys     0m13.229s

compared to

time perl -e 'opendir(FOO,"./"); @files = readdir(FOO); closedir(FOO); for (@files) { unlink $_; }'

real    0m59.042s
user    0m0.888s
sys     0m18.737s
Janne Pikkarainen
  • 31,454
  • 4
  • 56
  • 78
  • I hesitate to imagine what that glob() call would do; I assume it does a scandir(). If so, that's going to take FOREVER to return. A modification of the first suggestion that doesn't pre-read all the dir entries might have some legs; however, in its current form, it, too, would use an unholy amount of CPU on just reading all the directory entries at once. Part of the goal here is to divide-and-conquer; this code isn't fundamentally different from 'rm -f \*.png', notwithstanding issues with shell expansion. If it helps, there's nothing in the directory that I *didn't* want to delete. – BMDan Sep 27 '10 at 21:30
  • I gotta try more as soon as I get to work. I just tried to create 100 000 files in a single directory and find + xargs + rm combination took 7.3 seconds, Perl + unlink(glob)... combination finished in 2.7 seconds. Tried that couple of times, the result was always the same. At work I'll try this with more files. – Janne Pikkarainen Sep 28 '10 at 05:57
  • I learnt something new while testing this. At least with ext3 and ext4 the directory entry itself remains huge even after deleting all the files from there. After couple of tests my /tmp/test directory was taking 15 MB of disk space. Is there other way to clean that up other than removing the directory and recreating it? – Janne Pikkarainen Sep 28 '10 at 12:53
  • 2
    No, you need to recreate it. I hit this when dealing with a mail system using a folder per recipient, and cleanups after significant issues: there's no way around it, other than creating a new directory and shuffling the directories about, then nuking the old one (a rough sketch follows after these comments). So you can reduce the time window when there's no directory, but not eliminate it. – Phil P Sep 28 '10 at 23:49
  • Note that glob() will sort the results, much as shell globbing normally does, so because you only have 100k files, everything fits easily and the sort is fast. With a much larger directory, you'd want to opendir()/readdir()/closedir() just to avoid the sort. [I say *normally* for shell, since zsh has a glob modifier to make the sort order unsorted, which is useful when dealing with large numbers of files; `*(oN)` ] – Phil P Sep 28 '10 at 23:50
  • @Janne: Is the difference because of unlink? Had BMDan used unlink, would he have got the same performance? – Software Mechanic May 15 '11 at 09:51
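For the record, the shuffle Phil P describes above would look roughly like this (a sketch only; spool is a made-up directory name, and the window with no directory is just the moment between the two mv calls):

mkdir spool.new                            # empty replacement, with a fresh small index
mv spool spool.old && mv spool.new spool   # brief window with no "spool" directory
rm -rf spool.old &                         # nuke the bloated copy in the background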
1

From what I remember, the deletion of inodes in ext filesystems is O(n^2), so the more files you delete, the faster the rest will go.

There was one time I was faced with a similar problem (though my estimates put the deletion time at ~7h); in the end I went the route jftuga suggested in the first comment.

Hubert Kario
0

I would probably have whipped out a C compiler and done the moral equivalent of your script. That is, use opendir(3) to get a directory handle, then use readdir(3) to get the names of files, then tally up files as I unlink them and once in a while print "%d files deleted" (and possibly elapsed time or the current timestamp).

I don't expect it to be noticeably faster than the shell-script version; it's just that I'm used to having to rip out the compiler now and again, either because there's no clean way of doing what I want from the shell or because, while doable in shell, it's unproductively slow that way.

Vatine
0

You are likely running into rewrite issues with the directory. Try deleting the newest files first. Look at mount options that will defer writeback to disk.

For a progress bar try running something like rm -rv /mystuff 2>&1 | pv -brtl > /dev/null
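To give one hedged example of the kind of mount options meant here (these are standard ext3 options, but the values are only illustrative; a longer commit interval batches journal flushes, and data=writeback weakens ordering guarantees, so don't treat this as a recommendation):

mount -o remount,commit=60 /mountpoint    # lengthen the journal commit interval from the default 5 s
# data=writeback usually can't be changed on a remount; it has to be set at mount time:
# umount /mountpoint && mount -o data=writeback /mountpoint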

BillThor
  • In terms of deleting the newest files first: ls -Ur? I'm pretty sure that'd load the dir entries, then reverse them; I don't believe ls is smart enough to start at the end of the dir entry list and rewind its way back to the beginning. "ls -1" also probably isn't a great idea, since it would probably take 50+ MB of core and several minutes to run; you'd want "ls -U" or "ls -f". – BMDan Sep 27 '10 at 01:28
  • It is likely only practical if the file names increase in a predictable pattern. However, you might try ls -1 piped to reverse, then piped to xargs (see the sketch below). Use files instead of pipes if you want to see your intermediate results. You haven't provided any information on file naming. You would generate the pattern in reverse and delete the files using the pattern. You may need to handle missing file entries. Given your comment on the memory required, you have an idea of the I/O required to rewrite the directory. – BillThor Sep 28 '10 at 04:12
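A rough sketch of that pipeline, assuming plain newline-free file names (tac stands in for the "reverse" step, and the batch size is arbitrary):

cd /path/to/bigdir
ls -1 | tac | xargs -n 1000 rm -f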
0

Well, this is not a real answer, but...

Would it be possible to convert the filesystem to ext4 and see if things change?
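From memory, the in-place conversion usually looks something like the sketch below (device and mount point are placeholders, the filesystem must be unmounted, and a backup first is strongly advised; check the ext4 documentation before trying this):

umount /mountpoint
tune2fs -O extents,uninit_bg,dir_index /dev/sdXN   # enable the main ext4 on-disk features
e2fsck -fD /dev/sdXN                               # required after changing features; -D also rebuilds directories
# then change the fstab entry's filesystem type from ext3 to ext4 and remount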

marcoc
  • It appears that doing this "live" requires an fsck on a mounted filesystem, which is... alarming. Got a better way? – BMDan Sep 27 '10 at 21:25
  • The filesystem has to be unmounted before the conversion, i.e. before the necessary tune2fs commands. – marcoc Sep 28 '10 at 07:00
0

Alright, this has been covered in various ways in the rest of the thread, but I thought I would throw in my two cents. The performance culprit in your case is probably readdir. You are getting back a list of files that are not necessarily sequential on disk in any way, which causes disk access all over the place when you unlink. The files are small enough that the unlink operation probably doesn't jump around too much zeroing out the space. If you readdir and then sort by ascending inode number, you would probably get better performance. So readdir into RAM (sort by inode) -> unlink -> profit.

Inode order is a rough approximation here, I think, but based on your use case it might be fairly accurate.
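For illustration, a rough shell sketch of that idea, assuming plain whitespace-free file names and that everything in the directory is fair game (ls -f skips sorting, -i prefixes each name with its inode number, and sort -n puts them in inode order):

cd /path/to/bigdir
ls -fi | sort -n | awk '$2 != "." && $2 != ".." { print $2 }' | xargs -n 1000 rm -f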

MattyB
  • 1
    Correct me if I'm wrong, but unlink(2) doesn't zero the inode, it just removes the reference to it from the directory. I like the chutzpah of this approach, though. Care to run some time-trials and see if it holds true? – BMDan Sep 30 '10 at 03:07
0

Here is how I delete the millions of trace files that can sometimes gather on a large Oracle database server:

for i in /u*/app/*/diag/*/*/*/trace/*.tr? ; do rm $i; echo -n . ;  done

I find that this results in a fairly slow deletion that has low impact on server performance, usually something along the lines of an hour per million files on a "typical" 10,000 IOPS setup.

It will often take several minutes before the directories have been scanned, the initial file list generated and the first file deleted. From then on, a . is echoed for every file deleted.

The delay caused by echoing to the terminal has proven enough to prevent any significant load while deletion is progressing.

Roy
  • You're being eaten alive by globbing. How about something more like: `find /u* -maxdepth 3 -mindepth 3 -type d -path '*/app/*' -name diag -print0 | xargs -0I = find = -mindepth 4 -maxdepth 4 -type d -name 'trace' -print0 | xargs -0I = find = -mindepth 1 -maxdepth 1 -name '*.tr'` ? Add `-delete` to the last one to actually delete things; as written, it just lists what it would delete. Note that this is optimized for circumstances where you have a lot of uninteresting things in nearby directories; if that's not the case, you can simplify the logic a great deal. – BMDan Oct 10 '14 at 14:21
  • find -delete tends to cause too much I/O and easily impacts production performance. Perhaps with ionice. – Roy Oct 19 '14 at 08:15
  • It's causing all that I/O simply by being more efficient, though! The globbing is all front-loaded for your example (that is, the full list of files is generated before the first `rm` happens), so you have relatively efficient I/O at startup from that, followed by painful, out-of-order `rm`s that probably don't cause much I/O, but involve `scandir` walking the directory repeatedly (not causing I/O because that's already been loaded into block cache; see also `vfs_cache_pressure`). If you'd like to slow things down, `ionice` is an option, but I'd probably use fractional-second `sleep`s. – BMDan Oct 28 '14 at 17:59
  • `find /u*/app/*/diag -path '*/trace/*.tr' -execdir rm {} +` would run one `rm` per directory, so you'd have less CPU overhead. As long as you have tons of CPU time to spare, throttling disk IO by forking a whole `rm` process for every `unlink` works, I guess, but it's ugly. perl with a sleep per unlink would be nicer if sleeping between `rm` of whole directories at a time is too bursty. (`-execdir sh -c ...` maybe) – Peter Cordes Mar 17 '15 at 09:42
-1

You could use 'xargs' parallelization features:

ls -1|xargs -P nb_concurrent_jobs -n nb_files_by_job rm -rf
Jeremy
  • 1
    This won't help. The bottleneck is the poor random I/O on the drive. Doing parallel deletes could make it even worse and just increase CPU load. – Wim Kerkhoff Aug 20 '11 at 21:16
-2
ls|cut -c -4|sort|uniq|awk '{ print "rm -rf " $1 }' | sh -x
karmawhore
  • 1
    Wow. I guess that falls pretty firmly in the "more than one way to skin a cat" camp. Seriously, though, with the sort and the uniq? "ls" sorts by default, anyway, and I'd sure hope that filenames are unique. :/ – BMDan Sep 30 '10 at 14:39
-2

Actually, this one is a little better if the shell you use does command-line expansion:

ls|cut -c -4|sort|uniq|awk '{ print "echo " $1 ";rm -rf " $1 "*"}' |sh