There are several questions in here. You are asking how to avoid detection, how to avoid attribution, and how to avoid exploitation. Though you did elaborate on your goals, I still don't know your specific threat model. I can guess a few likely possibilities, and my answer is based on the best understanding I have of what you want to accomplish. I will edit my answer in response to question updates.
Your goals
avoid attracting the attention of target website admin, others.
Whether or not this occurs depends on how the target website is configured. Various spiders can be fingerprinted, so even if they use a common user agent, they still display behavior unique to them, for example the order in which client HTTP headers are sent, and even their capitalization. There is no way to prevent a website administrator from knowing that you are using wget rather than a regular web browser if they are determined to find out or have software designed for such detection. Your techniques are probably sufficient to avoid tripping a typical IDS, though.
be untraceable to my actual IP.
Since you said you were using torsocks, I should add some information on how it works. torsocks provides a Tor connection by using LD_PRELOAD to hook network-related functions. When these functions are called, the versions from the torsocks library are executed instead, and they redirect the connections to a SOCKS5 proxy. This is useful for applications which do not support the SOCKS protocol, but it can easily be bypassed, either accidentally or maliciously. If an application uses raw assembly to invoke a syscall directly, it will bypass torsocks. As the latest version of wget uses libc networking functions rather than invoking syscalls directly, this should not be a problem for it. Hypothetically, though, a compromised wget could easily bypass torsocks. The solution is to run it under a user whose non-Tor traffic is all denied. This is possible by running a system instance of Tor under its own user (which is usually the default) and using iptables to block all outgoing connections from UIDs other than that of the Tor process.
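A rough sketch of such an iptables policy, assuming the system Tor daemon runs as the debian-tor user (the Debian/Ubuntu default; substitute your distribution's Tor user) and adapting it to whatever other system traffic you need to allow:
# allow loopback, so other users can still reach Tor's SocksPort on 127.0.0.1
iptables -A OUTPUT -o lo -j ACCEPT
# allow outgoing traffic only for the Tor daemon's own UID
iptables -A OUTPUT -m owner --uid-owner debian-tor -j ACCEPT
# reject everything else, so a bypassed torsocks cannot leak traffic directly
iptables -A OUTPUT -j REJECT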
I assume you are also aware of traffic analysis attacks which affect Tor and any other low-latency anonymity network. Judging by your goals, this is probably not an issue as a very large AS-level adversary is required to pull this off with any accuracy.
avoid leaving traces that would enable a web admin or whomever to detect that different jobs are executed by the same person (me). For example, I might mirror a website roughly once a month, but with some variation; I would be displeased if despite my efforts to change headers and coming out of a different Tor exit node, it was clear to the other side that it was the same person. This one is less important than general traceability.
Chances are, anyone who looks at the logs will be able to tell it is the same person. The chances that anyone else is using Tor, changing headers (which is not natural behavior), doing this roughly once a month, and presenting the fingerprint of a spider are extremely low. While this does not allow the target to know who you are, they may still be able to tell that the activity is coming from the same person. Quite honestly, using regular old wget with no changes (or only the bare minimum needed to avoid triggering flood detection and the like) may be better. People and bots use wget all the time, even with Tor, so randomizing your headers means you won't even blend in with the (already few) people who are using wget and Tor on that site.
don't make myself vulnerable to exploits that a malicious actor without a high level of technical skill could pull off.
There have been multiple remote exploits against wget in the past, ranging from the fairly sophisticated, like buffer overflows, to the much simpler, like a 301 redirect to an FTP link that overwrites a local file. To mitigate this, you can either run wget as an unprivileged, isolated user, or use mandatory access controls like AppArmor to confine it to only the directories it needs.
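As a minimal sketch of the unprivileged-user approach (the user name and URL are placeholders; combine it with the iptables rules above so that user also has no non-Tor network access):
# one-time setup: a dedicated system user with no login shell for scraping jobs
sudo useradd --system --create-home --shell /usr/sbin/nologin scraper
# run each job as that user, so a compromised wget only reaches an empty home directory
sudo -u scraper torsocks wget --mirror --convert-links "http://example.com/"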
Your precautions
Some comments on a few of your precautions:
provide randomly selected HTTP headers for each job
HTTP headers are interpreted the same regardless of their order or their case. Because of this, each client using the protocol may send a different order of headers or different capitalization, not just different headers. For example, wget sends the User-Agent header before the Host header, whereas curl does it the other way around. Even when using identical header settings, the two can still be distinguished.
For wget:
GET / HTTP/1.1
User-Agent: Wget/1.19.1 (linux-gnu)
Accept: */*
Accept-Encoding: identity
Host: example.com
For curl:
GET / HTTP/1.1
Host: example.com
User-Agent: curl/7.57.0
Accept: */*
For Firefox:
GET / HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:52.0) Gecko/20100101 Firefox/52.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Upgrade-Insecure-Requests: 1
So what happens if you set wget to use a Firefox user agent? Some IDSes can be configured specifically to detect discrepancies between the reported user agent and the behavior of any given connection. A discrepancy may allow the IDS to know what software is actually being used, or it might just alert it to the fact that the client is intentionally lying about who they are, resulting in the IDS loudly alerting the sysadmin. Take the following wget command, downloading a single page from a website while spoofing the user agent:
wget -U "Mozilla/5.0 (Windows NT 6.1; rv:52.0) Gecko/20100101 Firefox/52.0" "http://example.com/secretpage.html"
You would think that this would be indistinguishable from a Firefox user connecting directly to example.com/secretpage.html, right? An IDS would be able to quickly notice that it is really wget and not Firefox, because it would see the following being sent from the client:
GET /secretpage.html HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:52.0) Gecko/20100101 Firefox/52.0
Accept: */*
Accept-Encoding: identity
Host: example.com
Now compare this with the earlier example of Firefox headers. This is clearly not genuine Firefox, despite what the user agent claims. It is far more likely to raise an IDS alert than simply retaining the original wget user agent (after all, the use of command line tools to retrieve webpages usually isn't a big deal to sysadmins).
Additionally, the pattern in which an application accesses resources can be used to identify it. Wget has highly distinctive behavior when used as a web spider, in the order and speed with which it accesses resources, as well as in which resources it ignores. Curl does not support recursive retrieval, so it has no spidering behavior to compare. Firefox has some very complex behaviors involving the order in which resources are loaded and whether or not a given resource is pinged or preloaded. As you can see, it will generally always be possible to tell that you are using wget if any in-depth analysis is done, and because most wget users do not change their headers, doing so makes you more unique, not less.
random wait between 0 and 600 seconds
This should only be done if it is necessary to bypass automatic detection or to avoid flooding the website. While the delay is random, an administrator looking at the logs will still see that each connection waits between 0 and 600 seconds, which is itself a distinctive pattern. It should not be done to try to act less "spider-like".
Making an automated spider behave like a genuine internet user is exceptionally hard. Many research papers have been written about doing it, and many more have been written about detecting it. Given that spammers are heavily invested in making their bots behave like humans, and anti-spam vendors are heavily invested in distinguishing such bots from humans, ad hoc tricks like random delays will not come close to keeping up with that constant arms race. This is like trying to pitch to a major league batter: any "clever" trick you can think up for throwing the ball will be thoroughly ineffective against the ever-escalating techniques used by major league pitchers and batters. Don't try to make your spider act like a human. You won't win that game. The only winning move is not to play.
all links converted to local references
This only matters if you are going to be browsing the site offline. I would not rely on it if you suspect that the website is malicious, because there could be many ways to embed a link which wget does not detect and convert, but which a standard browser does detect and access. If you fear the offline mirror attempting to phone home, you should only view it from a user which does not have direct network access. It appears you are already doing this, according to #8.
Threat modeling
Though you did add more details, you should still think about your threat model a bit more. What exactly is it you are trying to achieve by preventing them from realizing each month's scraping activity is related or that it is not natural traffic? I can think of only a few reasons this might be desirable:
- You need the website contents for reconnaissance for later exploitation.
- You don't want the website to notice and block Tor traffic or introduce captchas or delays.
- You don't want the website to serve you with custom (malicious or dummy) content.
- You are scraping an accidentally-exposed private area of the website, and bringing any attention to the existence of your traffic would result in the unintended access being closed.
- The knowledge that someone is scraping it is enough for the administrator to realize who is likely behind it (e.g. if you are scraping a friend's personal site or a forum which you are active on).
Depending on which (if any) of these apply to your situation, you may not need to expend so much effort on avoiding attribution. Most website access logs are not manually analyzed in detail unless it becomes necessary for incident response, and most are logged with low enough resolution that things like specific headers are not saved. You can avoid most forms of throttling and blockage simply by using a private proxy (with Tor, if you need anonymity) and by setting all your headers to those of a popular web spider which uses wget. Throttle and ratelimit your own connections to avoid harming the server and forcing them to take defensive action. Remember Aaron Swartz, the man who was arrested and later committed suicide after being caught downloading a large number of scientific journal articles at MIT? He used wget, and was only caught because he generated so much traffic, and evaded blocking attempts so persistently, that JSTOR ended up banning the entire MIT address range and complaining to MIT about the abuse. If he had used ratelimiting, he would never have been caught, would never have died, and Sci-Hub would be a whole lot bigger.
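A rough sketch of such a self-throttled job (the user agent string, limits, and URL are placeholders; pick values that match the spider identity you choose to present and the capacity of the target):
# mirror the site over Tor with conservative pacing and a bandwidth cap,
# so the job never looks like a flood or forces defensive action
torsocks wget --mirror --convert-links \
    --wait=5 --random-wait \
    --limit-rate=100k \
    --user-agent="ExampleCrawler/1.0 (+http://crawler.example/bot.html)" \
    "http://example.com/"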
If the website is not operated by someone with at least a moderate level of offensive security knowledge and the motivation to "hack back", exploitation of wget should not be your main concern. While it is certainly possible, sometimes more easily than others, it is not a likely response from a website administrator; I personally have never seen it happen in the wild, at least. This becomes a bigger risk if you are, for example, accessing an accidentally exposed backend of a sophisticated security contractor. If you're trying to download Raytheon SI's internal wiki and all you are doing is using plain wget with torsocks, you are doing it wrong and should stop.
Without at least a little more information on exactly what you are trying to achieve, it is hard to give you a single, satisfactory answer. The most likely complete solution? Use a VPS. Purchase the VPS anonymously (if that is necessary for your threat model), and connect to it using Tor. Configure wget with some basic throttling and ratelimiting to avoid being blocked. This will not only avoid raising red flags over Tor usage (the target sees only the VPS address, not an exit node), it will also isolate wget from your own machine in case it is compromised.
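A rough sketch of that workflow (the hostname and user are placeholders):
# reach the VPS only through Tor, so neither the VPS provider nor anyone
# watching the VPS learns your real IP; the throttled wget job shown earlier
# then runs on the VPS itself, and the target sees only the VPS address
torsocks ssh scraper@vps.example.org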