
This is an Ubuntu server running PHP 7 on Apache, with a website which enforces TLS (using the standard port). To my understanding, https://example.com and https://example.com:443 are exactly equivalent (and, indeed, in my browser the port number disappears from the address bar when I type it in). And yet HTTP_HOST usually contains just the domain name, but sometimes also contains the port number. This could be down to bot visitors (I haven't analysed the logs), but even so I don't see how it happens. Is there any actual difference?

(This is causing some problems, as some of our logs and work queues and server-side cache are separated out by HTTP_HOST, so having the same site report on different hosts is confusing.)

TRiG
  • IIRC HTTP_HOST is taken from the Host request header the client sends. In the absence of a port number in the request, most libraries use port 80 for http and port 443 for https requests, the default ports for those protocols. (A user-friendly browser will strip those default ports from the URL and will only display port numbers when they are not the default.) An alternative in Apache is to use the name you define as the ServerName in a VirtualHost rather than rely on what a client sends – HBruijn Oct 04 '19 at 11:17

1 Answer


The PHP documentation states that $_SERVER['HTTP_HOST'] is:

Contents of the Host: header from the current request, if there is one.

Indeed, every variable in this associative array whose key begins with the string HTTP_ is a copy of the corresponding HTTP request header sent by the user agent.

So, why does it sometimes contain the hostname, and sometimes contain both the hostname and port number?

It turns out that both syntaxes are legal and equivalent. The port number is required if the server uses a non-default port, but is optional otherwise.

In what circumstances would a user agent send the port number even when it is the default?

RFC 7230 section 5.4 explains that the Host: header's value is an exact copy of the authority component of the URI.

If the target URI includes an authority component, then a client MUST send a field-value for Host that is identical to that authority component, excluding any userinfo subcomponent and its "@" delimiter . . .

What is the authority component?

This comes from the definition of a URI in RFC 3986 section 3.2, which explains that it is the user information (username and password), host and port. It explains that the port SHOULD be omitted if it is the default port, but SHOULD does not equal MUST. (See RFC 2119.)

So, to put this all together, a user agent is expected to send the port number in the Host: header if it also appears in the URI. Thus, if the user agent has the URL https://example.com:443/robots.txt then it will have a header Host: example.com:443. How the user agent got such a URL, there's no real way to tell. It might have been sent by your application, or it might have been constructed by the user agent.

RFC 7230 section 2.7.3 explains URL normalization which, for this case, indicates that a URL containing no port number and a URL containing the default port number are equivalent.
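As a rough illustration of that equivalence, the sketch below compares two URLs after filling in the scheme's default port when none is given. The names `DEFAULT_PORTS`, `authority_parts()` and `same_authority()` are made up for this example and are not part of PHP; only `parse_url()` is a built-in. It covers only the port part of normalization, not every rule in the RFC.

```php
<?php
// Default ports for the two schemes discussed here (hypothetical constant).
const DEFAULT_PORTS = ['http' => 80, 'https' => 443];

/**
 * Return [scheme, host, port] for an absolute URL, with the scheme's
 * default port filled in when the URL does not carry one.
 */
function authority_parts(string $url): array
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException("Not an absolute URL: $url");
    }
    // Scheme and host are case-insensitive, so compare them in lower case.
    $scheme = strtolower($parts['scheme']);
    $host   = strtolower($parts['host']);
    $port   = $parts['port'] ?? (DEFAULT_PORTS[$scheme] ?? null);

    return [$scheme, $host, $port];
}

/** True when both URLs point at the same scheme, host and effective port. */
function same_authority(string $a, string $b): bool
{
    return authority_parts($a) === authority_parts($b);
}
```

With this, `same_authority('https://example.com/robots.txt', 'https://example.com:443/robots.txt')` is true, while swapping in `:8443` on one side makes it false.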


TL;DR: Your application must expect that a port number may appear in this header and deal with it in some way appropriate to the context in which it is used.
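One way to "deal with it" when the header is used as a key for logs, queues or caches is to strip the port whenever it is the default for the scheme. This is only a minimal sketch: `normalize_host()` is a hypothetical helper, and it assumes the caller already knows whether the request arrived over HTTPS.

```php
<?php
/**
 * Normalize a Host header value (hypothetical helper, not a PHP built-in).
 * Lower-cases the host and drops the port when it is empty or equal to the
 * default port for the scheme; any other port is kept.
 */
function normalize_host(string $hostHeader, bool $isHttps): string
{
    $defaultPort = $isHttps ? '443' : '80';
    $value = strtolower(trim($hostHeader));

    // The host is either a bracketed IPv6 literal or a run of non-colon
    // characters, optionally followed by ":port".
    if (!preg_match('/^(\[[^\]]+\]|[^:]+)(?::(\d*))?$/', $value, $m)) {
        return $value; // Unrecognised shape: leave it alone rather than guess.
    }

    $host = $m[1];
    $port = $m[2] ?? '';

    if ($port === '' || $port === $defaultPort) {
        return $host;
    }
    return $host . ':' . $port;
}
```

In a request handler it might be called as `normalize_host($_SERVER['HTTP_HOST'] ?? '', !empty($_SERVER['HTTPS']) && $_SERVER['HTTPS'] !== 'off')`, so that example.com and example.com:443 end up under the same key. The header is still client-supplied, so it should not be trusted for anything security-sensitive.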

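You may consider instead using $_SERVER['SERVER_NAME'], which contains the value of the ServerName directive in the Apache <VirtualHost> which processed the request (or, for nginx, the first server_name in the server block). Note that, per the PHP documentation, under Apache 2 this holds only if UseCanonicalName is On and ServerName is set; otherwise the value reflects the hostname supplied by the client.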

Michael Hampton
  • Thanks so much for the detailed explanation and for the possible solution of using `$_SERVER['SERVER_NAME']` instead. We will begin to use `$_SERVER['SERVER_NAME']` in our code. – Leo Galleguillos Nov 09 '20 at 21:32