Printing graphemes given UTF-8 code units in shell

I have a file with a list of smiley code units, and I need another one with the 'grapheme representation' so I can feed those to grep. If I echo a code unit sequence directly in the shell, it gets converted just fine:

echo $'\xF0\x9F\x98\x81'

But if I run a bulk echo, it prints the code units as if they were just a string:

head emos_regex_utf8.lst | xargs -d '\n' -n1 echo
$'\xF0\x9F\x98\x81'
$'\xF0\x9F\x98\x82'
$'\xF0\x9F\x98\x83'
$'\xF0\x9F\x98\x84'
$'\xF0\x9F\x98\x85'
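
A minimal sketch of the difference between the two cases, assuming bash: the `$'...'` form is quoting syntax that the shell parser expands before echo runs, whereas text handed over as data (as xargs does) is never parsed by the shell. The variable names `typed` and `literal` are only illustrative.

```shell
#!/usr/bin/env bash
# Typed in the shell: the parser expands $'...' into the 4 UTF-8 bytes.
typed=$'\xF0\x9F\x98\x81'
printf '%s' "$typed" | wc -c      # 4 bytes

# Arriving as data (what xargs passes to echo): the same 19 characters
# $ ' \ x F 0 ... ' stay exactly as written, because no shell parses them.
literal="\$'\\xF0\\x9F\\x98\\x81'"
printf '%s' "$literal" | wc -c    # 19 bytes
```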

My goal is to run grep on a batch of files, using the file with the graphemes as a list of regexes. Specifically, I'm trying to run, e.g.:

cat ./20160711/* | jq '. | {text}' | grep -hEi -f graphemes.lst

./20160711/ contains a bunch of JSON files with a bunch of tweets, some of which have smileys in their text, and some of which have smileys in other fields (e.g. "quoted") that I want to disregard; so I need to extract the 'text' field with jq, which turns my JavaScript codepoints (from the .json files) into their grapheme representation. That's why I need the bytecodes converted to graphemes.

I've written a test file, testreg.lst, which has one grapheme and one bytecode; they're distinct smileys, and I'm sure I have a match for each in my tweet files:

testreg.lst:
⛄
$'\xF0\x9F\x98\x81'

Running my code with this file returns the tweet containing the first entry, exactly what I want.

$ cat ./20160711/* | jq '. | {text}' | grep -hEi -f testreg.lst
  "text": "RT @C5N: ⛄ #MiAmanecer: Nieve en Caviahue, provincia de Neuquén | Temperatura máxima: 3°"

So my problem is turning the code units file into a graphemes file, which I expected to achieve with cat emos_regex_utf8.lst | xargs -d '\n' -n1 echo > graphemes.lst, given that echo $'\xF0\x9F\x98\x81' printed just what I needed, but it seems I'm doing something wrong. By the way, I'm running this from Cygwin (uname: CYGWIN_NT-6.1).
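
One possible way to do the conversion, sketched under the assumption that the shell is bash (whose builtin printf interprets `\xHH` escapes for `%b`) and that every line of the list has the form `$'\x..'` or is already a plain grapheme. The helper names `decode_line` and `decode_list` are hypothetical, not part of any tool.

```shell
#!/usr/bin/env bash
# decode_line (hypothetical helper): strip the $'...' wrapper, if present,
# and let bash's printf %b turn the \xHH escapes into raw bytes.
decode_line() {
    local s=$1
    s=${s#"\$'"}    # drop a leading $'
    s=${s%"'"}      # drop a trailing '
    printf '%b\n' "$s"
}

# decode_list (hypothetical helper): decode every line read from stdin.
decode_list() {
    local line
    while IFS= read -r line || [ -n "$line" ]; do
        decode_line "$line"
    done
}
```

With these defined, something like `decode_list < emos_regex_utf8.lst > graphemes.lst` should produce the file that grep -f expects. Lines that contain no backslash escapes, such as a literal ⛄, pass through unchanged.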

Thanks! :)

Vladimir

Posted 2016-08-18T21:47:36.940

Answers