Getting all the filenames (not content) recursively from an http directory

9

5

A large biological research project has chosen to make its archive available via https here:

https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/

Unfortunately, there appears to be no manifest of the contents of these directories, so I simply want to build one. I'd like to grab the filenames for the entire directory tree. Are there any suggestions for how to do this? I can write something in Perl/Python/R/etc. to scrape the index.html files recursively, but I suspect there is some wget incantation that returns just the filenames, and I have not found it yet.

seandavi

Posted 2013-02-01T23:14:57.030

Reputation: 193

Answers

5

I had exactly the same problem. Neither of the other solutions worked for me. However, this did:

Install lftp, then run the following (the second line is entered at the lftp prompt, not in your local shell):

lftp https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/
du -a > manifest.txt

and that'll give you all the directories and file names.
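If only the names are wanted, the size column can be dropped afterwards. The following is a minimal Python sketch, assuming manifest.txt holds du-style lines (a size, whitespace, then a path). That output format is an assumption about lftp, and paths_from_du is a hypothetical helper name.

```python
def paths_from_du(lines):
    """Yield the path column from du-style lines such as "4<TAB>./acc/file"."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split once on the first run of whitespace so that
        # paths containing spaces are kept in one piece.
        parts = line.split(None, 1)
        if len(parts) == 2:
            yield parts[1]


# Made-up sample lines standing in for the contents of manifest.txt.
sample = ["4\t./acc/README.txt\n", "8\t./acc\n", "12\t.\n"]
for path in paths_from_du(sample):
    print(path)
```

For a real run, pass an open file instead of the sample list, for example `paths_from_du(open("manifest.txt"))`. Note that du -a lists directories as well as files, and this sketch does not tell them apart.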

Dandan

Posted 2013-02-01T23:14:57.030

Reputation: 66

4

Unlike FTP, HTTP has no concept of a directory listing. wget can therefore only look for links in the pages it fetches and follow them according to rules the user defines.

That being said, if you absolutely want it, you can abuse wget's debug mode to gather a list of the links it encounters while analyzing the HTML pages. It sure ain't no beauty, but here goes:

wget -d -r -np -N --spider -e robots=off --no-check-certificate \
  https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/ \
  2>&1 | grep " -> " | grep -Ev "\/\?C=" | sed "s/.* -> //"

Some sidenotes:

  • This will produce a list which still contains duplicates (of directories), so redirect the output to a file and run sort -u on it for a pruned list; uniq on its own only removes adjacent duplicate lines.
  • --spider causes wget not to download anything, but it will still send an HTTP HEAD request for each file it enqueues. This generates far more traffic than is actually needed and makes the whole thing quite slow.
  • -e robots=off is needed to ignore a robots.txt file which may cause wget to not start searching (which is the case for the server you gave in your question).
  • If you have wget 1.14 or newer, you can use --reject-regex="\?C=" to reduce the number of needless requests (for those "sort-by" links already mentioned by @slm). This also eliminates the need for the grep -Ev "\/\?C=" step afterwards.
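The filtering and de-duplication steps can also be done in one pass after saving wget's output to a file. The following Python sketch mirrors the grep/grep/sed pipeline above; the function name and the sample lines are made up for illustration and assume only that the interesting lines have the shape "anything -> URL".

```python
def links_from_wget_debug(lines):
    """Return the unique URLs found after ' -> ', in first-seen order."""
    seen = set()
    urls = []
    for line in lines:
        line = line.rstrip("\n")
        if " -> " not in line:      # same as: grep " -> "
            continue
        if "/?C=" in line:          # same as: grep -Ev "\/\?C="
            continue
        # Same as: sed "s/.* -> //" (keep what follows the last arrow).
        url = line.rsplit(" -> ", 1)[1]
        if url not in seen:         # drop repeats, adjacent or not
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical log lines; real wget debug output carries much more noise.
sample = [
    "x -> https://example.org/tumor/acc/",
    "some unrelated debug line",
    "x -> https://example.org/tumor/?C=N;O=D",
    "x -> https://example.org/tumor/acc/",
]
for url in links_from_wget_debug(sample):
    print(url)
```

Unlike sort -u, this keeps the order in which wget first reported each link, which roughly follows the crawl order.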

zb226

Posted 2013-02-01T23:14:57.030

Reputation: 493

2

I thought there would be a way to do this easily with wget/curl too but couldn't get anything to work either. You can use this Ruby gem, anemone, to do it fairly easily though.

Installing anemone gem

% gem install anemone
Fetching: robotex-1.0.0.gem (100%)
Fetching: anemone-0.7.2.gem (100%)
Successfully installed robotex-1.0.0
Successfully installed anemone-0.7.2
2 gems installed
Installing ri documentation for robotex-1.0.0...
Installing ri documentation for anemone-0.7.2...
Installing RDoc documentation for robotex-1.0.0...
Installing RDoc documentation for anemone-0.7.2...

Sample anemone script

#! /usr/bin/env ruby
require 'anemone'

Anemone.crawl("https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/") do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end

Example run

% ./anemone.rb | grep -v '?C='
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/README_BCR.txt
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/README_MAF.txt
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/acc/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/brca/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/blca/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/cesc/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/cntl/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/dlbc/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/coad/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/esca/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/gbm/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/hnsc/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/kich/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/kirc/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/kirp/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/lcll/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/laml/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/lcml/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/lihc/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/lgg/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/lnnh/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/lost+found/
...
...

NOTE: The grep -v '?C=' step filters out the column-sort links that Apache adds to each listing when fancy indexing is enabled through its IndexOptions directive, e.g.:

IndexOptions FancyIndexing VersionSort NameWidth=* HTMLTable

    [Screenshot: Apache-rendered column sorter]

These links let you sort the listing by its different columns (Name, Last modified, etc.). The crawler sees them as separate pages, so I'm just filtering them out of the output.
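For comparison, the same idea (follow directory links, skip the sort links) fits in a short standard-library Python script. This is a sketch under assumptions, not a tested replacement for the gem: it assumes Apache-style index pages where subdirectory links end in "/", and the names LinkParser, fetch and walk are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def fetch(url):
    """Download a page as text (no retries, no certificate tuning)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def walk(root, fetch_page=fetch):
    """Yield file URLs beneath root; root must end with a slash."""
    seen = {root}
    stack = [root]
    while stack:
        directory = stack.pop()
        parser = LinkParser()
        parser.feed(fetch_page(directory))
        for href in parser.hrefs:
            url = urljoin(directory, href)
            if urlsplit(url).query:        # "?C=N;O=D" style sort links
                continue
            if not url.startswith(root):   # parent directory, other sites
                continue
            if url in seen:
                continue
            seen.add(url)
            if url.endswith("/"):          # assumed to be a subdirectory
                stack.append(url)
            else:
                yield url


# Offline demo with made-up pages; omit the second argument to use fetch().
pages = {
    "https://example.org/t/": '<a href="?C=N;O=D">Name</a>'
                              '<a href="/">Parent</a>'
                              '<a href="README.txt">r</a>'
                              '<a href="acc/">acc/</a>',
    "https://example.org/t/acc/": '<a href="/t/">Parent</a>'
                                  '<a href="data.tar.gz">d</a>',
}
for file_url in walk("https://example.org/t/", pages.__getitem__):
    print(file_url)
```

Because only the index pages are fetched, and never the files themselves, this avoids the per-file requests that a spidering tool may issue.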

slm

Posted 2013-02-01T23:14:57.030

Reputation: 7 449