I have a file with a list of smiley code units, and I need another file with their 'grapheme representation' so I can feed those to grep. If I echo a code unit sequence directly in the shell, it gets converted just fine:
echo $'\xF0\x9F\x98\x81'
But if I run a bulk echo, it prints the code units as if they were plain strings:
head emos_regex_utf8.lst | xargs -d '\n' -n1 echo
$'\xF0\x9F\x98\x81'
$'\xF0\x9F\x98\x82'
$'\xF0\x9F\x98\x83'
$'\xF0\x9F\x98\x84'
$'\xF0\x9F\x98\x85'
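A possible explanation, sketched under the assumption that the shell is bash: $'...' is quoting syntax that bash itself expands while parsing the command line, so echo receives the already decoded bytes. When xargs hands a line to echo, no shell parses that line, so the text stays literal. The variable names below are only for illustration:

```shell
# Typed directly: bash expands $'...' to raw bytes before echo runs.
from_shell=$(echo $'\xF0\x9F\x98\x81')

# The same characters as literal text, the way they are stored in the list file.
literal="\$'\\xF0\\x9F\\x98\\x81'"

# Via xargs (GNU xargs assumed for -d): echo receives the text as an ordinary
# argument, and nothing ever interprets the $'...' wrapper.
from_xargs=$(printf '%s\n' "$literal" | xargs -d '\n' -n1 echo)

printf 'shell: %s\n' "$from_shell"
printf 'xargs: %s\n' "$from_xargs"
```

So the difference would come from who parses the text, not from echo itself.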
My goal is to run grep over a batch of files, using the file with the graphemes as the list of regexes. Specifically, I'm trying to run, e.g.:
cat ./20160711/* | jq '. | {text}' | grep -hEi -f graphemes.lst
./20160711/ contains a bunch of JSON files with a bunch of tweets. Some have smileys in their text, and some have smileys in other fields (e.g. "quoted") that I want to disregard, so I need to extract the 'text' field with jq, which turns the JavaScript code points (from the .json files) into their grapheme representation. That's why I need the bytecodes.
I've written a test file, testreg.lst, which has one grapheme and one bytecode; they're distinct smileys, and I'm sure I have a match for each in my tweet files:
testreg.lst:
⛄
$'\xF0\x9F\x98\x81'
Running my command with this file returns the tweet containing the first entry (the grapheme), which is exactly what I want:
$ cat ./20160711/* | jq '. | {text}' | grep -hEi -f testreg.lst
"text": "RT @C5N: ⛄ #MiAmanecer: Nieve en Caviahue, provincia de Neuquén | Temperatura máxima: 3°"
So my problem is turning the code units file into a graphemes file. Given that echo $'\xF0\x9F\x98\x81' printed just what I needed, I expected this to do it:
cat emos_regex_utf8.lst | xargs -d '\n' -n1 echo > graphemes.lst
But it seems I'm doing something wrong.
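One way this conversion might be done, as a minimal sketch rather than a definitive fix: strip the $'...' wrapper from each line and let printf's %b format expand the \xHH escapes. It assumes bash's builtin printf and that every line has exactly the form shown above. emos_sample.lst and graphemes_sample.lst are hypothetical stand-ins for the real files:

```shell
# Hypothetical sample input in the same format as emos_regex_utf8.lst.
# The quoted 'EOF' keeps the heredoc contents literal.
cat > emos_sample.lst <<'EOF'
$'\xF0\x9F\x98\x81'
$'\xF0\x9F\x98\x82'
EOF

prefix="\$'"   # the leading $' of each entry
suffix="'"     # the trailing ' of each entry

while IFS= read -r line; do
    escaped=${line#"$prefix"}      # drop the leading $'
    escaped=${escaped%"$suffix"}   # drop the trailing '
    printf '%b\n' "$escaped"       # %b expands \xF0 etc. into raw bytes
done < emos_sample.lst > graphemes_sample.lst
```

The resulting file could then be passed to grep with -f in place of graphemes.lst.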
btw, I'm running from Cygwin (uname: CYGWIN_NT-6.1)
Thanks! :)
Those first two paragraphs clarified the whole thing. I worked it out with a Python script – Vladimir – 2016-09-01T15:22:07.683