The most important point to make, IMHO, is that you should know what a variable (or database field) contains. You must know whether it's text (and what charset/encoding it is, in that case), or it's HTML (or an HTML attribute, which is yet another type of data), or SQL, etc.
Then, you need to apply the proper conversions when you move from one to the other.
The big issue is that in many cases, the representation of a piece of text (probably the most common type of data you can manipulate) is the same whether it's text, HTML, SQL, etc. (the text "abc" is the same as the HTML abc or the SQL 'abc') and for this reason people tend to concatenate bits together without any conversion.
But that will break as soon as you encounter any characters that have a special meaning in one of the contexts. This not only leads to security issues (both XSS and SQL injections), but also to formatting issues (we have all seen sites which start showing HTML entities such as &lt; when they should be displaying <), as people either forget the conversion, or do it multiple times.
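To make the "forget it, or do it multiple times" problem concrete, here is a small sketch in Python. It assumes the standard library's html.escape as the text-to-HTML conversion; the variable names are made up for illustration.

```python
import html

user_text = "1 < 2"  # plain text, not HTML

# Forgetting the conversion: the text is pasted in as if it were HTML,
# so the browser sees a stray "<" (or, with hostile input, a real tag).
naive_html = "<p>" + user_text + "</p>"

# Converting exactly once, at the point where text becomes HTML.
escaped_once = html.escape(user_text)
correct_html = "<p>" + escaped_once + "</p>"

# Converting twice: the "&" of the entity is itself escaped, and the
# browser ends up displaying the entity instead of the character.
escaped_twice = html.escape(escaped_once)

print(naive_html)
print(correct_html)
print(escaped_twice)
```

Decoding escaped_twice once, as a browser would, gives back the entity text rather than the original character, which is exactly the symptom described above.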
It is quite rare that you actually need to allow input of actual HTML. In most cases, you want text. Just keep the text as it is, manipulate it as it is. But once you want to display it (on an HTML page), convert it to HTML (using standard and tested libraries/frameworks, not your improvised regex-based search-and-replace).
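As a rough illustration of "manipulate it as text, convert only when displaying", here is a sketch in Python. The render_comment helper and its max_len parameter are hypothetical; html.escape from the standard library stands in for whatever tested library your framework provides.

```python
import html


def render_comment(text: str, max_len: int = 20) -> str:
    """Hypothetical helper: shorten plain text, then convert it to HTML once."""
    # Work on the text while it is still text. Truncating after escaping
    # could cut an entity such as "&lt;" in half.
    if len(text) > max_len:
        shortened = text[: max_len - 3] + "..."
    else:
        shortened = text

    # Convert at the last moment. quote=True also escapes quote characters,
    # which matters because the full text goes into a quoted attribute value.
    title_attr = html.escape(text, quote=True)
    body = html.escape(shortened)
    return '<p title="{}">{}</p>'.format(title_attr, body)


print(render_comment("a < b < c", max_len=6))
```

The same string is converted separately for the attribute and for the element body, since those are two different destinations even though the escaping function happens to be the same here.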
Likewise, you convert it when you want to build an SQL query (using parameterised queries, preferably). But you still store it exactly as it is.
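A minimal sketch of the SQL side, assuming Python's built-in sqlite3 module and a made-up comments table. The point is only that the text goes in as a parameter, comes back unchanged, and is converted to HTML when (and only when) it is displayed.

```python
import html
import sqlite3

# In-memory database with a hypothetical table, just for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)")

text = "Robert'); DROP TABLE comments;-- <b>hi</b>"

# Parameterised query: the driver handles the text-to-SQL conversion,
# so the quote in the text cannot terminate the SQL string literal.
conn.execute("INSERT INTO comments (body) VALUES (?)", (text,))
conn.commit()

# What is stored is the original text, not an escaped version of it.
stored = conn.execute("SELECT body FROM comments").fetchone()[0]

# Conversion to HTML happens at display time.
page_fragment = "<li>" + html.escape(stored) + "</li>"

print(stored)
print(page_fragment)
```

Storing the raw text means the same row can later be sent to a different destination (JSON, e-mail, CSV) with the conversion appropriate to that destination.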
Many frameworks will add abstraction layers that will "hide" all of this if you actually use them. But we all know that even with the best tools, you'll always end up with someone trying to build a bit of HTML themselves, so they need to know what needs to be done if they do so.
If you want/need to manipulate actual HTML, then you enter a completely different dimension in terms of XSS issues. Not sure that can be covered in an hour...