15

CentOS

Is there an easy way to convert HTML special entities from a data stream? I'm passing data to a bash script and sometimes that data includes special entities. For example:

&quot;test&quot; &amp; test $test ! test @ # $ % ^ &amp; *

I'm not sure why some characters show up fine and others don't, but unfortunately I don't have control over the incoming data.

I'm thinking I might be able to use sed here, but that seems cumbersome and possibly prone to false positives. Is there a Linux command I could pipe to that specializes in decoding this type of data?

Mike B

6 Answers

14

Perl is (as always) your friend. I think this will do it:

perl -n -mHTML::Entities -e ' ; print HTML::Entities::decode_entities($_) ;'

E.g.:

echo '&quot;test&quot; &amp; test $test ! test @ # $ % ^ &amp; *' | perl -n -mHTML::Entities -e ' ; print HTML::Entities::decode_entities($_) ;'

With output:

[someguy@somehost ~]$ echo '&quot;test&quot; &amp; test $test ! test @ # $ % ^ &amp; *' | perl -n -mHTML::Entities -e ' ; print HTML::Entities::decode_entities($_) ;'
"test" & test $test ! test @ # $ % ^ & *
Jason Tan
  • This works on my OSX10.8 laptop and a RHEL5.something host. – Jason Tan Oct 22 '12 at 05:52
  • 1
    In order to output the file in UTF-8, use binmode: echo "«" | perl -n -mHTML::Entities -mutf8 -e 'binmode(STDOUT, ":utf8"); print HTML::Entities::decode_entities($_) ;' – falstaff Nov 23 '13 at 22:46
10

PHP is well suited to this. This example requires PHP 5:

cat file.html | php -R 'echo html_entity_decode($argn);'
Michael Hampton
10

recode appears to be available in the default package repositories of the major GNU/Linux distributions. For example, to decode HTML entities into UTF-8:

…|recode html..utf8
5

With Python 3:

python3 -c 'import html,sys; print(html.unescape(sys.stdin.read()), end="")' < file.html
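
As a rough illustration of what `html.unescape` covers, the sketch below decodes named, decimal, and hexadecimal references in one pass. The sample string is made up for this example and is not the asker's actual data.

```python
import html

# Made-up sample mixing named (&quot;, &amp;), decimal (&#36;)
# and hexadecimal (&#x40;) character references.
sample = "&quot;test&quot; &amp; &#36;test &#x40; &laquo;"

decoded = html.unescape(sample)
print(decoded)
```

Text without any entities passes through unchanged, so the call should be safe to apply to every line of a mixed stream.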
Andrew Schulman
ariddell
0

I use this script. Save it as html2utf.py, make it executable, and use it like this: echo "$some_html" | html2utf.py.

#!/usr/bin/env python3
"""
An alternative to `perl -Mopen=locale -MHTML::Entities -pe '$_ = decode_entities($_)'` (which requires the module, installable with `cpanm HTML::Entities`) and to `recode html..`.
"""

import fileinput
import html

for line in fileinput.input():
    print(html.unescape(line.rstrip('\n')))
HappyFace
0

Takes text file from stdin:

#!/bin/bash
#
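while IFS= read -r lin; do   # -r and IFS= keep backslashes and leading spaces intact
  newl=${lin//&gt;/>}
  newl=${newl//&lt;/<}
  # decode &amp; last so "&amp;lt;" becomes "&lt;", not "<";
  # the backslash keeps & literal in bash >= 5.2
  newl=${newl//&amp;/\&}
  # ...other entities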
  echo "$newl"
done

It probably needs bash >= version 4
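
A hand-written table like the one in the loop above only decodes the entities it lists, and it relies on `&amp;` being handled last. The sketch below mirrors that approach in Python and compares it with the standard library's `html.unescape`. The helper name `decode_basic`, the entity table, and the sample string are all hypothetical and exist only for this illustration.

```python
import html

# Hypothetical fixed table, analogous to the bash substitutions.
# "&amp;" comes last so that "&amp;lt;" decodes to "&lt;" and not "<".
BASIC_ENTITIES = [("&gt;", ">"), ("&lt;", "<"), ("&quot;", '"'), ("&amp;", "&")]

def decode_basic(text):
    """Replace only the entities listed in BASIC_ENTITIES, in order."""
    for entity, char in BASIC_ENTITIES:
        text = text.replace(entity, char)
    return text

sample = "&quot;a&quot; &amp;lt; b &#64; c"
table_result = decode_basic(sample)     # numeric reference &#64; is left as-is
stdlib_result = html.unescape(sample)   # numeric reference is decoded too
print(table_result)
print(stdlib_result)
```

The table-based version leaves numeric references such as `&#64;` untouched, which is one reason a dedicated decoder tends to be the safer choice for data you do not control.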

Eric Leschinski
bobom