Printing graphemes given UTF-8 code units in shell

I have a file with a list of smileys written as UTF-8 code units, and I need another one with their 'grapheme representation' in order to feed those to grep. If I echo the code units right in the shell, they get converted just fine:

echo $'\xF0\x9F\x98\x81'

But if I run a bulk echo, it prints the code units literally, as if they were just a string:

head emos_regex_utf8.lst | xargs -d '\n' -n1 echo
$'\xF0\x9F\x98\x81'
$'\xF0\x9F\x98\x82'
$'\xF0\x9F\x98\x83'
$'\xF0\x9F\x98\x84'
$'\xF0\x9F\x98\x85'

My goal is to run grep on a batch of files, using the file with the graphemes as a list of regexes. Strictly speaking, I'm trying to run, e.g.:

cat ./20160711/* | jq '. | {text}' | grep -hEi -f graphemes.lst

./20160711/ contains a bunch of JSON files with a bunch of tweets. Some of them have smileys in their text, and some have smileys in other fields (e.g. "quoted") that I want to disregard. So I need to extract the 'text' field with jq, which turns my JavaScript codepoints (from the .json files) into their grapheme representation. That's why I need the bytecodes.

I've written a test file, testreg.lst, which has one grapheme and one bytecode; they're distinct smileys and I'm sure I have a match for each in my tweet files:

testreg.lst:    
⛄
$'\xF0\x9F\x98\x81'

Running my code with this file returns the tweet containing the first entry, exactly what I want.

$ cat ./20160711/* | jq '. | {text}' | grep -hEi -f testreg.lst
  "text": "RT @C5N: ⛄ #MiAmanecer: Nieve en Caviahue, provincia de Neuquén | Temperatura máxima: 3°"

So my problem is turning the code units file into a graphemes file. I expected to achieve that with cat emos_regex_utf8.lst | xargs -d '\n' -n1 echo > graphemes.lst, given that echo $'\xF0\x9F\x98\x81' printed just what I needed, but it seems I'm doing something wrong. By the way, I'm running this from Cygwin (uname: CYGWIN_NT-6.1).

Thanks! :)

Vladimir

Posted 2016-08-18T21:47:36.940

Reputation: 3

Answers

The $'...' syntax isn't some magical "UTF-8 code unit" notation, and it's not echo that interprets it. It's your shell that does – that's the bash syntax for performing generic C-like backslash expansion, and what the system actually runs is echo 😁.

Being a shell feature, it only applies to command lines that the shell itself parses. xargs executes echo directly, without going through a shell, so each line of the file is passed along literally. For the same reason it's very unlikely that grep or other tools would automagically understand $'...' strings in a text file.
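To make the difference concrete, here is a minimal sketch, assuming bash. The count_bytes helper and the variable names are made up for illustration. The same text is written once as shell source and once read as plain data, which is roughly what xargs does with each line of the file:

```shell
#!/usr/bin/env bash
# count_bytes is a hypothetical helper: print the byte length of its argument.
count_bytes() {
  printf '%s' "$1" | wc -c | tr -d '[:space:]'
}

# Written as shell source: bash expands $'...' before anything else sees it.
typed=$'\xF0\x9F\x98\x81'

# Read as data (like a line of emos_regex_utf8.lst): no expansion happens,
# the dollar sign, quotes and backslashes all stay in the string.
IFS= read -r from_file <<'EOF'
$'\xF0\x9F\x98\x81'
EOF

echo "typed on the command line: $(count_bytes "$typed") bytes"
echo "read from a file:          $(count_bytes "$from_file") bytes"
```

Under bash the first string should come out as 4 bytes (the encoded smiley) and the second as 19 bytes (the literal text of the notation).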

There are various ways to translate them manually, though. If you have Perl 5.14 or later (needed for the /r modifier), try piping the file through:

| perl -pe 's/\$\x27(.*?)\x27/$1 =~ s@\\x([0-9A-F]{2})@chr hex $1@ger/ge'

or for older Perls:

| perl -pe 's/\$\x27(.*?)\x27/($tmp = $1) =~ s@\\x([0-9A-F]{2})@chr hex $1@ge; $tmp/ge'
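If Perl isn't at hand, roughly the same conversion can be sketched in bash itself. This is only a sketch under a few assumptions: it needs bash (its printf builtin expands \xHH escapes under %b, which a plain POSIX sh may not), every line is either a $'...' literal or plain text without backslashes, and there are no trailing carriage returns. The names decode_line and decode_stream are made up for illustration.

```shell
#!/usr/bin/env bash
# decode_line (hypothetical helper): strip the $'...' wrapper from one line
# and let bash's printf expand the \xHH escapes inside it.
decode_line() {
  local s=$1 open="\$'" close="'"
  s=${s#"$open"}    # drop a leading $' if present
  s=${s%"$close"}   # drop a trailing ' if present
  # %b expands backslash escapes in the argument. Caveat: it expands any
  # escape, not only \xHH, so plain lines must not contain backslashes.
  printf '%b\n' "$s"
}

# decode_stream (hypothetical helper): decode every line read from stdin,
# e.g.  decode_stream < emos_regex_utf8.lst > graphemes.lst
decode_stream() {
  local line
  while IFS= read -r line || [ -n "$line" ]; do
    decode_line "$line"
  done
}
```

Lines that are already graphemes, such as the ⛄ entry in testreg.lst, pass through unchanged, so a mixed file should still come out usable for grep -f.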

That said, you could just tell jq to not quote output in the first place, first by selecting only the value rather than building a useless dict, then by using the -r (raw output) option:

cat tweets.json | jq -r .text

(If it's an array of tweets, use .[].text)

user1686

Posted 2016-08-18T21:47:36.940

Reputation: 283 655

Those first two paragraphs clarified the whole thing. I worked it up with a python script – Vladimir – 2016-09-01T15:22:07.683