Remove single line and multiline comments from string

18

4

Goal

Using the programming language of your choice, write the shortest program to eliminate comments from a string representing a C program.


Input

The string may be taken through any form of input, or it may be supplied in a variable.


Instructions

Two different kinds of comments are to be removed:

  • multiline comments, starting with /* and ending with */
  • single line comments, starting with // and ending with Linux-style line breaks (LF, \n)

Comments within strings are not to be deleted. For the purpose of this challenge, you only need to consider "-delimited strings. In particular, you can ignore the possibility of '-delimited character literals. You may also ignore trigraphs and line continuations (/\<LF>*...).


Examples

Input:

#include <stdio.h>

int main(int argc, char** argv)
{
    // this comment will be removed
    if (argc > 1) {
        printf("Too many arguments.\n");   // this too will be removed
        return 1;
    }
    printf("Please vist http://this.will.not.be.removed.com\n");
    printf("/* This will stay */\n");
    printf("\"/* This will stay too */\"\n");
    printf("//and so will this\\");
    // but not this
    printf("just \"ano//ther\" test.");
    return 0;
}

Output:

#include <stdio.h>

int main(int argc, char** argv)
{

    if (argc > 1) {
        printf("Too many arguments.\n");   
        return 1;
    }
    printf("Please vist http://this.will.not.be.removed.com\n");
    printf("/* This will stay */\n");
    printf("\"/* This will stay too */\"\n");
    printf("//and so will this\\");

    printf("just \"ano//ther\" test.");
    return 0;
}

Input:

/*
    this shall disappear
*/
#include <string>
int main(int argc, char** argv)
{
    string foo = ""/*remove that!**/;
    // Remove /* this
    int butNotThis = 42;
    // But do */ remove this
    int bar = 4 /*remove this*/* 3; // but don't remove that 3. */
    return 0;//just a comment
}/*end of the file has been reached.*/

Output:

#include <string>
int main(int argc, char** argv)
{
    string foo = "";

    int butNotThis = 42;

    int bar = 4 * 3; 
    return 0;
}

Mathieu Rodic

Posted 2015-04-01T10:51:10.063

Reputation: 1 170

Where did printf("\"/* This will stay too */\"\n"); in the expected output come from? – manatwork – 2015-04-01T11:00:05.707

Oops, sorry... it was just a typo. Thanks for noticing! – Mathieu Rodic – 2015-04-01T11:05:41.160

Do whitespaces count? There are 4 spaces in front of // this comment will be removed which just disappeared. Any rule for that? – manatwork – 2015-04-01T11:08:04.043

I don't know any of the listed languages that well, so some kind of a self-contained spec would be nice, together with more examples. – Zgarb – 2015-04-01T13:03:02.400

@manatwork: whitespace removal is not mandatory – Mathieu Rodic – 2015-04-01T13:03:34.200

@MartinBüttner & Zgarb: the comments to be removed are the ones described in the instruction section. – Mathieu Rodic – 2015-04-01T13:03:51.823

@MartinBüttner: trigraph management is not necessary – Mathieu Rodic – 2015-04-01T13:04:34.313

Not to mention issues around things like JavaScript's regex literals. – Peter Taylor – 2015-04-01T13:13:13.773

Are we allowed to assume that the file ends in a newline? – Martin Ender – 2015-04-01T14:16:34.057

How about line continuation? It is going to mess up many of the answers here. – n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳ – 2015-04-02T05:01:03.287

Answers

11

Retina, 35 + 1 + 2 = 38 bytes

This program consists of two files, hence I've included a 1-byte penalty for the second file.

//.*|/\*[\s\S]*?\*/|("(\\.|[^"])*")
$1

This is a simple regex replacement, using the .NET flavour (although this would work the same in most other flavours).

The idea is to match both comments and strings, but only write the match back if it was a string. By matching the strings explicitly, they are skipped when searching for comments.
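As a rough illustration of that idea outside Retina, here is a minimal Python sketch. It is not part of the original answer: the function name strip_comments is hypothetical, and the pattern is simply the one above handed to Python's re module, with a callback that writes group 1 back only when the match was a string.

```python
import re

# Same alternation as the Retina pattern: a line comment, a block comment,
# or a double-quoted string (the only alternative captured in group 1).
PATTERN = re.compile(r'//.*|/\*[\s\S]*?\*/|("(\\.|[^"])*")')

def strip_comments(source):
    # For comment matches group 1 is None, so the match is replaced by ''.
    # For string matches group 1 is the whole string, which is written back.
    return PATTERN.sub(lambda m: m.group(1) or "", source)

sample = 'int x = 1; // gone\nputs("// kept"); /* gone */\n'
print(strip_comments(sample))
```

Because the string alternative consumes the whole literal, any // or /* inside it is never seen by the comment alternatives.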

Martin Ender

Posted 2015-04-01T10:51:10.063

Reputation: 184 808

1

This works surprisingly well in PHP: https://regex101.com/r/kB5kA4/1

– Ismael Miguel – 2015-04-01T14:47:26.490

@IsmaelMiguel Yes, I didn't use any flavour-specific features. The only reason I picked .NET is that Retina lets me write regex-only programs without the overhead of calling something like preg_replace. – Martin Ender – 2015-04-01T15:01:36.387

I'm aware of that. You've used it quite a lot before. If I'm correct, it was created by you. It was for the curious. And also, you now have a test-suite where you can test whatever changes come into this question (I predict many) – Ismael Miguel – 2015-04-01T15:10:38.043

Nice! This regular expression even works with other programming languages (when slashes are escaped). – Mathieu Rodic – 2015-04-02T11:46:19.777

I used your regex technique to improve a third party library I work with: Dojo Toolkit

– mbomb007 – 2018-08-09T13:57:51.220

This is also working in Java, thank you – Sukumaar – 2019-04-09T07:33:55.727

15

Shell + coreutils + gcc compiler collection, 31 bytes

This answer may seem a bit loopholey, but I didn't see anything specifically banning it in the question.

Rather than using clumsy regular expressions, why not use the tool that was built for the job? It should have no problem giving correct results:

cpp -fpreprocessed -o- -|sed 1d

Takes input from STDIN and writes to STDOUT. Normally cpp does all preprocessing (header files, macro expansion, comment removal, etc.), but with the -fpreprocessed option it skips most of those steps while still removing comments. In addition, cpp adds a line like # 1 "<stdin>" to the beginning of the output, so the sed is there to delete it.

Digital Trauma

Posted 2015-04-01T10:51:10.063

Reputation: 64 644

"-fpreprocessed is implicit if the input file has one of the extensions .i, .ii or .mi". Might you be able to save some bytes by saving the file as something like a.i instead of using the flag? – Martin Ender – 2015-04-01T20:20:28.303

@MartinBüttner Yes, I noticed that in the manual too. So I would expect something like cat>i.i;cpp -o- i.i|sed 1d to be equivalent. But full preprocessing ensues (e.g. full contents of stdio.h are inserted). Possible gcc bug??? Well perhaps I'll check the cpp source when I get a mo'. – Digital Trauma – 2015-04-01T20:41:13.027

You can remove the |sed 1d if you add the -P option. Note that (as allowed by the question), as it expects pre-processed code, it won't handle trigraphs or line continuations properly. – sch – 2016-07-22T06:48:09.563

3

Java 365

String a(String s){String o="";int m=1;for(int i=0;i<s.length();i++){String u=s.substring(i,Math.min(i+2,s.length()));char c=s.charAt(i);switch(m){case 1:m=u.equals("/*")?5:u.equals("//")?4:c=='"'?3:1;break;case 3:m=c=='"'?1:c=='\\'?2:3;break;case 2:m=3;break;case 4:m=c=='\n'?1:4;continue;case 5:m=u.equals("*/")?1:5;i+=m==1?1:0;continue;}o+=m<4?c:"";}return o;}}

Ungolfed

public static final int DEFAULT = 1;
public static final int ESCAPE = 2;
public static final int STRING = 3;
public static final int ONE_LINE_COMMENT = 4;
public static final int MULTI_LINE_COMMENT = 5;

String clear(String s) {
    String out = "";
    int mod = DEFAULT;
    for (int i = 0; i < s.length(); i++) {
        String substring = s.substring(i, Math.min(i + 2 , s.length()));
        char c = s.charAt(i);
        switch (mod) {
            case DEFAULT: // default
                mod = substring.equals("/*") ? MULTI_LINE_COMMENT : substring.equals("//") ? ONE_LINE_COMMENT : c == '"' ? STRING : DEFAULT;
                break;
            case STRING: // string
                mod = c == '"' ? DEFAULT : c == '\\' ? ESCAPE : STRING;
                break;
            case ESCAPE: // character right after a backslash inside a string
                mod = STRING;
                break;
            case ONE_LINE_COMMENT: // one line comment
                mod = c == '\n' ? DEFAULT : ONE_LINE_COMMENT;
                continue;
            case MULTI_LINE_COMMENT: // multi line comment
                mod = substring.equals("*/") ? DEFAULT : MULTI_LINE_COMMENT;
                i += mod == DEFAULT ? 1 : 0;
                continue;
        }
        out += mod < 4 ? c : "";
    }

    return out;
}
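
For readers who prefer to trace the state machine step by step, here is a loose Python port. It is a sketch rather than a translation of the answer: the name clear is reused only for familiarity, and it deviates deliberately in two places. It keeps the newline that ends a // comment (the Java version drops it), and it steps past both characters of /* so that the sequence /*/ is not mistaken for a complete comment.

```python
DEFAULT, ESCAPE, STRING, LINE_COMMENT, BLOCK_COMMENT = range(1, 6)

def clear(s):
    out = []
    mode = DEFAULT
    i = 0
    while i < len(s):
        c, pair = s[i], s[i:i + 2]
        if mode == DEFAULT:
            if pair == "/*":
                mode = BLOCK_COMMENT
                i += 2              # skip the opener so "/*/" stays open
                continue
            if pair == "//":
                mode = LINE_COMMENT
                i += 2
                continue
            if c == '"':
                mode = STRING
            out.append(c)
        elif mode == STRING:
            mode = DEFAULT if c == '"' else ESCAPE if c == "\\" else STRING
            out.append(c)
        elif mode == ESCAPE:
            mode = STRING           # the escaped character is copied verbatim
            out.append(c)
        elif mode == LINE_COMMENT:
            if c == "\n":
                mode = DEFAULT
                out.append(c)       # keep the line break (assumption of this sketch)
        else:                       # BLOCK_COMMENT
            if pair == "*/":
                mode = DEFAULT
                i += 2
                continue
        i += 1
    return "".join(out)

print(clear('int bar = 4 /*remove this*/* 3; // but not that 3\n'))
```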

Ilya Gazman

Posted 2015-04-01T10:51:10.063

Reputation: 569

2

Python2 - 163 134 bytes

import re
def f(s):
 for x in re.findall(r'("[^\n]*"(?!\\))|(//[^\n]*$|/(?!\\)\*[\s\S]*?\*(?!\\)/)',s,8):s=s.replace(x[1],'')
 print s

As you can see here, the regex consists of two alternative capturing groups. The first captures all the quoted strings; the second captures all the comments.

All we need to do is remove everything captured by the second group.

Example:

Python 2.7.9 (default, Dec 11 2014, 04:42:00) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> def f(s):
...  for x in re.findall(r'("[^\n]*"(?!\\))|(//[^\n]*$|/(?!\\)\*[\s\S]*?\*(?!\\)/)',s,8):s=s.replace(x[1],'')
...  print s
... 
>>> code = r'''#include <stdio.h>
... 
... int main(int argc, char** argv)
... {
...     // this comment will be removed
...     if (argc > 1) {
...         printf("Too many arguments.\n");   // this too will be removed
...         return 1;
...     }
...     printf("Please vist http://this.will.not.be.removed.com\n");
...     printf("/* This will stay */\n");
...     printf("\"/* This will stay too */\"\n");
...     printf("//and so will this\\");
...     // but not this
...     printf("just \"ano//ther\" test.");
...     return 0;
... }
... /*
...     this shall disappear
... */
... #include <string>
... int main(int argc, char** argv)
... {
...     string foo = ""/*remove that!**/;
...     // Remove /* this
...     int butNotThis = 42;
...     // But do */ remove this
...     int bar = 4 /*remove this*/* 3; // but don't remove that 3. */
...     return 0;//just a comment
... }/*end of the file has been reached.*/'''
>>> f(code)
#include <stdio.h>

int main(int argc, char** argv)
{

    if (argc > 1) {
        printf("Too many arguments.\n");   
        return 1;
    }
    printf("Please vist http://this.will.not.be.removed.com\n");
    printf("/* This will stay */\n");
    printf("\"/* This will stay too */\"\n");
    printf("//and so will this\\");

    printf("just \"ano//ther\" test.");
    return 0;
}

#include <string>
int main(int argc, char** argv)
{
    string foo = "";

    int butNotThis = 42;

    int bar = 4 * 3; 
    return 0;
}

pepp

Posted 2015-04-01T10:51:10.063

Reputation: 61

1

PHP

A conversion of @Martin Ender's answer to PHP:

// \\\\ in a single-quoted PHP string reaches PCRE as \\, which matches one literal backslash.
$str = preg_replace_callback('/\/\/.*|\/\*[\s\S]*?\*\/|("(\\\\.|[^"])*")/m', 
  function($matches){
     if(\is_array($matches) && (\count($matches) > 1)){
        return $matches[1];
     }else{
        return '';
     }
  }, $str);

Now $str no longer contains single-line or multi-line comments. This is useful for stripping comments from JSON data before feeding it to json_decode().
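
The JSON use case can be sketched in Python as well. This is only an illustration under stated assumptions: loads_with_comments is a hypothetical helper name, the pattern is the same regex as in the Retina answer, and the input is assumed to be ordinary JSON with C-style comments added (standard JSON itself has no comment syntax).

```python
import json
import re

# Comment-or-string alternation; only strings are captured in group 1.
PATTERN = re.compile(r'//.*|/\*[\s\S]*?\*/|("(\\.|[^"])*")')

def loads_with_comments(text):
    # Drop comments, keep string literals untouched, then parse as usual.
    stripped = PATTERN.sub(lambda m: m.group(1) or "", text)
    return json.loads(stripped)

config = '''{
  // listening port
  "port": 8080,
  "url": "http://example.com/*not a comment*/" /* trailing note */
}'''
print(loads_with_comments(config))
```

The URL survives because it sits inside a string literal, which the pattern matches as a whole before any comment alternative gets a chance.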

centurian

Posted 2015-04-01T10:51:10.063

Reputation: 111

Maybe you could reduce the byte count by using a ternary operator? – Mathieu Rodic – 2017-03-22T10:28:57.860

1

Rebol - 151

f: func[t][Q:{"}W: complement charset Q parse t[any[[Q any["\\"|"\"Q | W]Q]|[a:[["//"to[lf | end]]|["/*"thru"*/"]]b:(remove/part a b):a skip]| skip]]t]

Ungolfed + some annotations:

f: func [t] [
    Q: {"}
    W: complement charset Q     ;; any char thats not a double quote

    ; rule to parse t (c program) - it can be ANY of 
    ;     1. string 
    ;     2. OR comment (if so then remove)
    ;     3. OR pass thru

    parse t [
        any [
            ;; 1. String rule
            [Q any ["\\" | "\" Q | W] Q]

            ;; 2. OR comments rule
            | [
                a:  ;; mark beginning of match
                [
                    ;;    // comment    OR  /* comment */
                    ["//" to [lf | end]] | ["/*" thru "*/"]
                ]
                b:  ;; mark end of match 
                (remove/part a b) :a skip   ;; remove comment
            ]

            ;; 3. OR allow thru (so not a String or Comment)
            | skip
        ]
    ]

    t
]

draegtun

Posted 2015-04-01T10:51:10.063

Reputation: 1 592

0

C# (262 chars):

From this very good SO answer:

string a(string i){return Regex.Replace(i, @"/\*(.*?)\*/|//(.*?)\r?\n|""((\\[^\n]|[^""\n])*)""|@(""[^""]*"")+", m => { var v = m.Value; if (v.StartsWith("/*") || v.StartsWith("//")) return v.StartsWith("//") ? "\r\n" : ""; return v; }, RegexOptions.Singleline);}

vrluckyin

Posted 2015-04-01T10:51:10.063

Reputation: 261

-1

JS (ES6), 47 chars (wip)

DEMO: http://codepen.io/anon/pen/dPEMro

a=b=>b.replace(/(\/\*[^]*?\*\/|\/\/.*)\n?/g,"")

Inspired by my codegolfed minifiers: http://xem.github.io/miniMinifier/

It doesn't yet handle comment markers inside strings...

I'm curious to see if it's possible to achieve that in JS regexes.
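
To make the limitation concrete, here is a small Python sketch of the same pattern. It is an illustration, not the answer's code: naive_strip is a hypothetical name, and [\s\S] stands in for the JavaScript-only [^] class.

```python
import re

def naive_strip(source):
    # Removes anything that looks like a comment, plus one trailing newline,
    # without checking whether the marker sits inside a string literal.
    return re.sub(r'(/\*[\s\S]*?\*/|//.*)\n?', '', source)

line = 'printf("Please visit http://example.com");\n'
print(naive_strip(line))   # the URL is cut off at the double slash
```

Everything from the // of the URL to the end of the line is swallowed, which is exactly the case the challenge's first example is designed to catch.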

xem

Posted 2015-04-01T10:51:10.063

Reputation: 5 523

If this answer doesn't meet the requirements, it should either be fixed or deleted. – mbomb007 – 2018-05-23T20:02:35.967