Tokenize a Stack-Based language

15

I've been working on another stack-based golfing language called Stackgoat. In this challenge you'll be writing a tokenizer for Stackgoat (or really any general stack-based language).

Examples

"PPCG"23+
["PPCG", '23', '+']

'a "bc" +
['"a"', '"bc"', '+']

12 34+-"abc\"de'fg\\"
['12', '34', '+', '-', '"abc\"de'fg\\"']

"foo
['"foo"']

(empty input)
[]

' ""
['" "', '""']

Specification

The three types you'll need to handle are:

  • Strings, anything within ""
  • Numbers, any sequence of digits
  • Operators, any other single character besides whitespace

Whitespace is essentially ignored unless it is within a string or separates two numbers.

String / char spec:

  • Strings are delimited by a ", and when a \ is encountered, the next character should be escaped.
  • Chars are prefixed by a ', and the character after the ' should be converted into a string literal: 'a -> "a"
  • ' will always have a character after it
  • Closing quotes should be auto-inserted
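As a rough illustration of the rules above, here is a minimal, ungolfed Python sketch of a character-by-character tokenizer. It is not a reference implementation from the challenge author; the function name `tokenize` is made up for this example, and it assumes string tokens keep their quotes and escape sequences exactly as written in the input (as in the third example), with only the closing quote auto-inserted.

```python
def tokenize(src):
    """Split Stackgoat-style source into string, number, and operator tokens."""
    tokens = []
    i, n = 0, len(src)
    while i < n:
        c = src[i]
        if c == '"':
            # String: scan to the next unescaped quote (or end of input).
            j = i + 1
            while j < n and src[j] != '"':
                # A backslash escapes the next character, so skip both.
                j += 2 if src[j] == '\\' else 1
            # Slicing stops at the end of input, so an unclosed string
            # simply gets its closing quote auto-inserted here.
            tokens.append('"' + src[i + 1:j] + '"')
            i = j + 1
        elif c == "'":
            # Char: wrap the single following character in quotes.
            tokens.append('"' + src[i + 1:i + 2] + '"')
            i += 2
        elif c in "0123456789":
            # Number: the longest run of digits.
            j = i
            while j < n and src[j] in "0123456789":
                j += 1
            tokens.append(src[i:j])
            i = j
        elif c.isspace():
            # Whitespace outside strings only separates tokens.
            i += 1
        else:
            # Operator: any other single character.
            tokens.append(c)
            i += 1
    return tokens


if __name__ == "__main__":
    print(tokenize("'a \"bc\" +"))
```

Under these assumptions, -15 comes out as two tokens ('-' and '15'), because a digit run never absorbs a preceding operator.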

Rules:

  • No form of eval is allowed

Input / Output:

  • Input can be taken through STDIN, function parameters, or your language's equivalent.
  • Output should be an array or your language's closest equivalent.

Downgoat

Posted 2016-01-09T23:21:31.493

Reputation: 27 116

@Doorknob, seriously? – LegionMammal978 – 2016-01-09T23:23:48.047

@LegionMammal978 Yes, seriously. – Alex A. – 2016-01-09T23:44:03.073

Can output be to STDOUT? – Doorknob – 2016-01-10T00:04:06.757

@Doorknob yes, of course – Downgoat – 2016-01-10T00:04:41.733

Is -15 a) ['-', '15'] or b) ['-15']? – Martin Ender – 2016-01-10T01:05:17.500

@MartinBüttner for the simplicity of this challenge, -15 is ['-', '15'] – Downgoat – 2016-01-10T01:05:45.350

Can the input contain linefeeds? If so, what other whitespace can it contain? – Martin Ender – 2016-01-10T01:06:43.807

An empty string in the test cases would also be good. – Martin Ender – 2016-01-10T01:07:59.850

@MartinBüttner yes, the input may contain linefeeds. It may also include carriage returns, line feeds, tabs, and form feeds – Downgoat – 2016-01-10T01:08:38.773

@Doᴡɴɢᴏᴀᴛ I meant an empty string in the code, but an empty input is good too. Also an empty, unclosed string at the end would be a good test case. – Martin Ender – 2016-01-10T01:10:05.687

Also ' followed by whitespace. – Martin Ender – 2016-01-10T01:10:46.800

What is 1.5? ['1', '.', '5']? – orlp – 2016-01-10T23:56:39.237

Could you fix the syntax of example 3? Something is mismatched or not escaped correctly. – Zach Gates – 2016-01-13T03:56:11.040

@ZachGates Everything is escaped properly :) It might be the \ or the " that's throwing off your program – Downgoat – 2016-01-13T04:15:49.967

Entering '"abc\"de'fg\\"' directly, fails for me. The single quote between de and fg should be escaped. @Doᴡɴɢᴏᴀᴛ – Zach Gates – 2016-01-13T04:17:11.643

@ZachGates Well yes, most languages do handle \ as an escape character too, so yes, you will need to escape that if your language needs it, obviously. – Downgoat – 2016-01-13T04:19:20.977

Here's my submission – cat – 2016-01-13T04:26:00.600

Are hexadecimal digits allowed in the numbers? – Fund Monica's Lawsuit – 2016-01-13T17:19:05.987

Also, in the first example, should the first element of the result be '"PPCG"' instead of just "PPCG"? – Fund Monica's Lawsuit – 2016-01-13T17:24:24.790

Lastly, can the output use double-quotes, and escape any double-quotes inside, or does it have to be wrapped in single-quotes? – Fund Monica's Lawsuit – 2016-01-13T17:43:46.980

what can be escaped? just "? or do we have to support \n etc. – Conor O'Brien – 2016-05-09T00:27:50.567

Answers

8

Retina, 68 64 63 bytes

M!s`"(\\.|[^"])*"?|'.|\d+|\S
ms`^'(.)|^"(([^\\"]|\\.)*$)
"$1$2"

or

s`\s*((")(\\.|[^"])*(?<-2>")?|'.|\d+|.)\s*
$1$2¶
\ms`^'(.)
"$1"

I think this covers all the funky edge cases, even those not covered by the test cases in the challenge.

Try it online!
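For readers who don't know Retina, the first program's two stages (match every token, then fix up chars and unclosed strings) can be sketched in Python. This is an illustrative port rather than part of the original answer: the names `TOKEN`, `UNCLOSED`, and `tokenize_re` are invented here, and the patterns are adapted from the ones above rather than copied verbatim.

```python
import re

# Stage 1: a string (closing quote optional), a char, a digit run,
# or any single non-whitespace character. re.S lets "." match newlines.
TOKEN = re.compile(r'''"(?:\\.|[^"])*"?|'.|\d+|\S''', re.S)

# Stage 2 helper: a string token whose body never reaches an
# unescaped closing quote, i.e. one that still needs closing.
UNCLOSED = re.compile(r'"(?:[^\\"]|\\.)*', re.S)


def tokenize_re(src):
    out = []
    for tok in TOKEN.findall(src):
        if tok[0] == "'":
            # 'a -> "a"
            tok = '"' + tok[1:] + '"'
        elif UNCLOSED.fullmatch(tok):
            # Auto-insert the missing closing quote.
            tok += '"'
        out.append(tok)
    return out


if __name__ == "__main__":
    print(tokenize_re('"foo'))
```

The alternation order matters: the escape branch is tried before the "any non-quote" branch, so a backslash always consumes the character that follows it when there is one.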

Martin Ender

Posted 2016-01-09T23:21:31.493

Reputation: 184 808

Dang, this is short. Nicely done! – Fund Monica's Lawsuit – 2016-01-13T19:37:15.843

I was able to translate this into a 95 byte ES6 function. It would have been 80 except that the regexps don't work the other way around (too many edge cases). – Neil – 2016-01-18T13:29:57.823

2

Ruby, 234 bytes

puts"[#{$stdin.read.scan(/("(?:(?<!\\)\\"|[^"])+(?:"|$))|'(.)|(\d+)|(.)/).map{|m|(m[0]?(m[0].end_with?('"')?m[0]: m[0]+'"'): m[1]?"\"#{m[1]}\"": m.compact[0]).strip}.reject(&:empty?).map{|i|"'#{/\d+|./=~i ?i: i.inspect}'"}.join', '}]"

I tried using the find(&:itself) trick that I saw... somewhere, but apparently .itself isn't actually a method. Also, I'm working on golfing the regex down, but it's already unreadable.

If we don't have to output in any fancy way (i.e. strings don't have to be quoted in the array) I can save a whole lotta bytes:

Still Ruby, 194 bytes:

p$stdin.read.scan(/("(?:(?<!\\)\\"|[^"])+(?:"|$))|'(.)|(\d+)|(.)/).map{|m|(m[0]?(m[0].end_with?('"')?m[0]: m[0]+'"').gsub(/\\(.)/,'\1'): m[1]?"\"#{m[1]}\"": m.compact[0]).strip}.reject(&:empty?)

I'm sure I can golf it more, but I'm not quite sure how.


Ungolfed version coming soon. I started fiddling with the golfed version directly at some point, so I'll have to tease it back out.

Fund Monica's Lawsuit

Posted 2016-01-09T23:21:31.493

Reputation: 564

0

Python 3, 228 bytes

import re;L=list
print(L(map(lambda i:i+'"'if i[0]=='"'and not i[-1]=='"'else i,map(lambda i:'"%s"'%i[1]if i[0]=="'"else i,filter(None,sum([L(i)for i in re.findall('(\'.)|(".*")|(\d+)|([^\w\"\'\s\\\])|(".*"?)',input())],[]))))))

Here's a nice, long, two-liner.


Test it out in Python 3. Here are some examples:

$ python3 test.py
"PPCG"23+
['"PPCG"', '23', '+']

$ python3 test.py
'a "bc" +
['"a"', '"bc"', '+']

$ python3 test.py
12 34+-"abc"de'fg\"
['12', '34', '+', '-', '"abc"de\'fg\\"']

$ python3 test.py
"foo
['"foo"']

$ python3 test.py

[]

$ python3 test.py
' ""
['" "', '""']

Zach Gates

Posted 2016-01-09T23:21:31.493

Reputation: 6 152