My god, it's full of spaces!


Some people insist on using spaces for tabulation and indentation.

For tabulation, that's indisputably wrong. By definition, tabulators must be used for tabulation.

Even for indentation, tabulators are objectively superior:

  • There's clear consensus in the Stack Exchange community.

  • Using a single space for indentation is visually unpleasant; using more than one is wasteful.

    As all code golfers know, programs should be as short as possible. Not only does it save hard disk space, compilation times are also reduced if fewer bytes have to be processed.

  • By adjusting the tab width¹, the same file looks different on each computer, so everybody can use his favorite indent width without modifying the actual file.

  • All good text editors use tabulators by default (and definition).

  • I say so and I'm always right!

Sadly, not everybody listens to reason. Somebody has sent you a file that is doing it wrong™ and you have to fix it. You could just do it manually, but there will be others.

It's bad enough that spacers are wasting your precious time, so you decide to write the shortest possible program to take care of the problem.

Task

Write a program or a function that does the following:

  1. Read a single string either from STDIN or as a command-line or function argument.

  2. Identify all locations where spaces have been used for tabulation or indentation.

    A run of spaces is indentation if it occurs at the beginning of a line.

    A run of two or more spaces is tabulation if it isn't indentation.

    A single space that is not indentation may or may not have been used for tabulation. As expected when you use the same character for different purposes, there's no easy way to tell. Therefore, we'll say that the space has been used for confusion.

  3. Determine the longest possible tab width¹ for which all spaces used for tabulation or indentation can be replaced with tabulators, without altering the appearance of the file.

    If the input contains neither tabulation, nor indentation, it is impossible to determine the tab width. In this case, skip the next step.

  4. Using the previously determined tab width, replace all spaces used for tabulation or indentation with tabulators.

    Also, whenever possible without altering the appearance of the file, replace all spaces used for confusion with tabulators. (If in doubt, get rid of spaces.)

  5. Return the modified string from your function or print it to STDOUT.
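
The steps above can be sketched in Python as follows. This is an ungolfed illustration, not a submission; the function name spaces_to_tabs is made up here, and the sketch assumes that the widest usable tab width is the greatest common divisor of the columns where indentation and tabulation runs end, since a run can only be swapped for tabulators if it ends on a tab stop.

```python
import re
from functools import reduce
from math import gcd


def spaces_to_tabs(text: str) -> str:
    """Hypothetical reference sketch of steps 1 to 5 (not golfed)."""
    lines = text.split("\n")

    # Step 2: record every run of spaces as (line, start, end, must_convert).
    runs = []
    for i, line in enumerate(lines):
        for m in re.finditer(r" +", line):
            # Indentation starts at column 0; tabulation is 2+ spaces.
            # Everything else is a single "confusion" space.
            must_convert = m.start() == 0 or m.end() - m.start() >= 2
            runs.append((i, m.start(), m.end(), must_convert))

    # Step 3: assumed rule, the tab width is the GCD of the end columns
    # of all indentation and tabulation runs (0 if there are none).
    width = reduce(gcd, (end for _, _, end, must in runs if must), 0)
    if width == 0:
        return text  # neither tabulation nor indentation: skip step 4

    # Step 4: replace right to left so earlier columns stay valid.
    # Confusion spaces are replaced only if they end on a tab stop.
    for i, start, end, _ in reversed(runs):
        if end % width == 0:
            tabs = end // width - start // width
            lines[i] = lines[i][:start] + "\t" * tabs + lines[i][end:]

    # Step 5: hand back the modified string.
    return "\n".join(lines)
```

Under these assumptions, spaces_to_tabs("  x yz w") yields a tab, x, a tab, then yz w with the last space kept.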

Examples

  • All spaces of

    a    bc   def  ghij
    

    are tabulation.

    Each run of spaces pads the preceding string of non-space characters to a width of 5, so the correct tab width is 5 and the correct output² is

    a--->bc-->def->ghij
    
  • The first two spaces of

    ab  cde f
    ghi jk lm
    

    are tabulation, the others confusion.

    The correct tab width is 4, so the correct output² is

    ab->cde>f
    ghi>jk lm
    

    The last space remains untouched, since it would be rendered as two spaces if replaced by a tabulator:

    ab->cde>f
    ghi>jk->lm
    
  • All but one of the spaces of

    int
        main( )
        {
            puts("TABS!");
        }
    

    are indentation, the other is confusion.

    The indentation levels are 0, 4 and 8 spaces, so the correct tab width is 4 and the correct output² is

    int
    --->main( )
    --->{
    --->--->puts("TABS!");
    --->}
    

    The space in ( ) would be rendered as three spaces if replaced by a tabulator, so it remains untouched.

  • The first two spaces of

      x yz w
    

    are indentation, the others confusion.

    The proper tab width is 2 and the correct output² is

    ->x>yz w
    

    The last space would be rendered as two spaces if replaced by a tabulator, so it remains untouched.

  • The first two spaces of

      xy   zw
    

    are indentation, the other three are tabulation.

    Only a tab width of 1 allows all spaces to be eliminated, so the correct output² is

    >>xy>>>zw
    
  • All spaces of

    a b c d
    

    are confusion.

    There is no longest possible tab width, so the correct output² is

    a b c d
    

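One way to check the "without altering the appearance" condition in the examples above is to expand the tabulators again and compare with the input. The sketch below relies on Python's built-in str.expandtabs; the helper name looks_the_same is made up for illustration.

```python
def looks_the_same(original: str, converted: str, tab_width: int) -> bool:
    """True if `converted` renders like `original` at the given tab width."""
    return converted.expandtabs(tab_width) == original


# Second example, tab width 4.
print(looks_the_same("ab  cde f", "ab\tcde\tf", 4))    # True
print(looks_the_same("ghi jk lm", "ghi\tjk lm", 4))    # True
# Replacing the last space as well would widen the gap before "lm".
print(looks_the_same("ghi jk lm", "ghi\tjk\tlm", 4))   # False
```
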
Additional rules

  • The input will consist entirely of printable ASCII characters and linefeeds.

  • You may assume that there are at most 100 lines of text and at most 100 characters per line.

  • If you choose STDOUT for output, you may print a single trailing linefeed.

  • Standard rules apply.


¹ The tab width is defined as the distance in characters between two consecutive tab stops, using a monospaced font.
² The ASCII art arrows represent the tabulators Stack Exchange refuses to render properly, for which I have submitted a bug report. The actual output has to contain actual tabulators.

Dennis

Posted 2015-09-03T17:26:21.820

Reputation: 196 637

+1 for finally putting this nonsensical space/tab issue to rest :D – Geobits – 2015-09-03T17:40:36.440

"programs should be as short as possible" I believe I have found Arthur Whitney's long-lost brother!! – kirbyfan64sos – 2015-09-03T17:45:48.180

@Dennis "That said, only a moron would use tabs to format their code." Clear consensus, eh? – primo – 2015-09-03T19:26:25.783

Tabs are unholy demonspawn that deserve to have their bits ripped apart and their ASCII code disgraced until their incompetent lack-of-a-soul has been thoroughly ground into a pulp. Errr, I mean, +1, nice challenge, even though it reeks of blasphemy. ;) – Doorknob – 2015-09-04T02:42:40.527

I was crying each time a colleague added a tab in my beautiful space-indented code. Then I discovered CTRL+K+F in Visual Studio. I do it each time I open a modified file. My life is better now. – Michael M. – 2015-09-04T09:03:02.777

Let's use 4 spaces for the 1st level and a tab for the 2nd, or 1 space for the 1st level for code golfing. – jimmy23013 – 2015-09-04T09:11:30.953

I don't understand "The last space remains untouched, since it would be rendered as two spaces if replaced by a tabulator:" in your second example. Why is it ghi>jk lm and not ghi>jk>lm or ghi jk lm, when both are confusion spaces? – Fatalize – 2015-09-04T09:15:21.027

@Fatalize Because a tabulator always advances to the next tab stop. With a tab width of 4, a single tabulator can replace 1, 2, 3 or even 4 spaces, depending on where it occurs. – Dennis – 2015-09-04T13:33:23.947

Emacs-lisp, 8 bytes: (tabify) (not really, of course, but this is close). – coredump – 2015-09-04T19:30:52.107

Answers

5

Pyth, 102 103 bytes

=T|u?<1hHiGeHGsKmtu++J+hHhGlhtH+tG]+HJ.b,YN-dk<1u+G?H1+1.)Gd]0]0cR\ .zZ8VKVNp?%eNT*hNd*/+tThNTC9p@N1)pb

Try it Online

Interesting idea, but since tabs in the input break the concept, not very usable.

Edit: Fixed bug. Many thanks @aditsu

Brian Tuck

Posted 2015-09-03T17:26:21.820

Reputation: 296

It crashes on "a b c d" – aditsu quit because SE is EVIL – 2015-09-08T09:18:45.853

@aditsu crap! Thanx for the heads-up. I need better test cases :P – Brian Tuck – 2015-09-08T17:50:52.553

5

PowerShell, 414 409 bytes

function g($a){if($a.length-gt2){g $a[0],(g $a[1..100])}else{if(!$a[1]){$a[0]}else{g $a[1],($a[0]%$a[1])}}}
$b={($n|sls '^ +|(?<!^)  +' -a).Matches}
$n=$input-split"`n"
$s=g(&$b|%{$_.Index+$_.Length})
($n|%{$n=$_
$w=@(&$b)
$c=($n|sls '(?<!^| ) (?! )'-a).Matches
$w+$c|sort index -d|%{$x=$_.Index
$l=$_.Length
if($s-and!(($x+$l)%$s)){$n=$n-replace"(?<=^.{$x}) {$l}",("`t"*(($l/$s),1-ge1)[0])}}
$n})-join"`n"

I went ahead and used newlines instead of ; where possible to make display easier. I'm using unix line endings so it shouldn't affect the byte count.

How To Execute

Copy code into SpaceMadness.ps1 file, then pipe the input into the script. I will assume the file that needs converting is called taboo.txt:

From PowerShell:

cat .\taboo.txt | .\SpaceMadness.ps1

From command prompt:

type .\taboo.txt | powershell.exe -File .\SpaceMadness.ps1

I tested it with PowerShell 5, but it should work on 3 or higher.

Testing

Here's a quick PowerShell script that's useful for testing the above:

[CmdletBinding()]
param(
    [Parameter(
        Mandatory=$true,
        ValueFromPipeline=$true
    )]
    [System.IO.FileInfo[]]
    $File
)

Begin {
    $spaces = Join-Path $PSScriptRoot SpaceMadness.ps1
}

Process {
     $File | ForEach-Object {
        $ex = Join-Path $PSScriptRoot $_.Name 
        Write-Host $ex -ForegroundColor Green
        Write-Host ('='*40) -ForegroundColor Green
        (gc $ex -Raw | & $spaces)-split'\r?\n'|%{[regex]::Escape($_)} | Write-Host -ForegroundColor White -BackgroundColor Black
        Write-Host "`n"
    }
}

Put this in the same directory as SpaceMadness.ps1 (I call this one tester.ps1), then call it like so:

"C:\Source\SomeFileWithSpaces.cpp" | .\tester.ps1
.\tester.ps1 C:\file1.txt,C:\file2.txt
dir C:\Source\*.rb -Recurse | .\tester.ps1

You get the idea. It spits out the contents of each file after conversion, run through [RegEx]::Escape() which happens to escape both spaces and tabs so it's really convenient to see what's actually been changed.

The output looks like this (but with colors):

C:\Scripts\Powershell\Golf\ex3.txt
========================================
int
\tmain\(\ \)
\t\{
\t\tputs\("TABS!"\);
\t}

Explanation

The very first line defines a greatest common factor/divisor function g as succinctly as I could manage, that takes an array (arbitrary number of numbers) and calculates GCD recursively using the Euclidean algorithm.

The purpose of this was to figure out the "longest possible tab width" by taking the index + length of every indentation and tabulation as defined in the question, then feeding it to this function to get the GCD which I think is the best we can do for tab width. A confusion's length will always be 1 so it contributes nothing to this calculation.
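
As a rough illustration of that calculation, here is the same idea in Python: fold the Euclidean algorithm over a list of "index + length" values. The name gcd_all is made up for this sketch, and it is written as a loop, not recursively like g.

```python
def gcd_all(numbers):
    """Greatest common divisor of a list, via the Euclidean algorithm."""
    result = 0  # gcd(0, n) == n, so 0 is a neutral starting value
    for n in numbers:
        while n:  # Euclid: gcd(a, b) == gcd(b, a % b)
            result, n = n, result % n
    return result


# First example from the question: the runs end at columns 5, 10 and 15.
print(gcd_all([5, 10, 15]))  # 5
# Fifth example: the runs end at columns 2 and 7.
print(gcd_all([2, 7]))       # 1
# No indentation or tabulation at all: 0 signals "no tab width".
print(gcd_all([]))           # 0
```
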

$b defines a scriptblock because, annoyingly, I need to call that piece of code twice, so I save some bytes that way. This block takes the string (or array of strings) $n and runs a regex on it (sls or Select-String), returning match objects. I'm actually getting both indentations and tabulations in one pass here, which saved me the extra processing of capturing them separately.

$n is used for different things inside and outside the main loop (really bad, but necessary here so that I can embed it in $b's scriptblock and use that both inside and outside the loop without a lengthy param() declaration and passing arguments).

$s gets assigned the tab width, by calling the $b block on the array of lines in the input file, then summing the index and length of each match, returning the array of the sums as an argument into the GCD function. So $s has the size of our tab stops now.

Then the loop starts. We iterate over each line in the array of input lines $n. The first thing I do in the loop is assign $n (local scope) the value of the current line for the above reason.

$w gets the value of the scriptblock call for the current line only (the indentations and tabulations for the current line).

$c gets a similar value, but instead we find all the confusions.

I add up $w and $c which are arrays, giving me one array with all of the space matches I need, sort it in descending order by index, and begin iterating over each match for the current line.

The sort is important. Early on I found out the hard way that replacing parts of a string based on index values is a bad idea when the replacement string is smaller and changes the length of the string! The other indexes get invalidated. So by starting with the highest indexes on each line, I make sure I only make the string shorter from the end, and move backwards so the indexes always work.
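
A small Python sketch of why the descending order matters (the helper replace_spans and its span tuples are made up for illustration): each replacement shortens the line, so working from the right keeps every remaining index valid.

```python
def replace_spans(line, spans):
    """Replace each (start, length, replacement) span, rightmost first."""
    for start, length, replacement in sorted(spans, reverse=True):
        line = line[:start] + replacement + line[start + length:]
    return line


# "ab  cde f": two spaces at index 2 and one space at index 7.
spans = [(2, 2, "\t"), (7, 1, "\t")]
print(repr(replace_spans("ab  cde f", spans)))  # 'ab\tcde\tf'
```

Going left to right instead would shrink the line by one character after the first replacement, so index 7 would then point at the f, not at the space before it.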

In this loop, $x is the index of the current match and $l is the length of the current match. $s can in fact be 0, which causes a pesky divide-by-zero error, so I check its validity before doing the math.

The !(($x+$l)%$s) bit there is the single point where I check to see if a confusion should be replaced with a tab or not. If the index plus the length divided by the tab width has no remainder, then we're good to go in replacing this match with a tab (that math will always work on the indentations and tabulations, because their size is what determined the tab width to begin with).
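
The same test, written out in Python as a sketch (the function name can_become_tab is hypothetical): a run may be replaced only if a tab width exists and the run ends exactly on a tab stop.

```python
def can_become_tab(index: int, length: int, tab_width: int) -> bool:
    """True if the run of spaces ends on a multiple of the tab width."""
    # Guard against tab_width == 0 before taking the remainder.
    return tab_width != 0 and (index + length) % tab_width == 0


# Second example from the question, tab width 4.
print(can_become_tab(7, 1, 4))  # "ab  cde f": space at index 7 -> True
print(can_become_tab(6, 1, 4))  # "ghi jk lm": space at index 6 -> False
print(can_become_tab(1, 1, 0))  # "a b c d": no tab width -> False
```
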

For the replace, each iteration of the match loop works on the current line of the input, so it's a cumulative set of replaces. The regex just looks for $l spaces that are preceded by $x of any character. We replace it with $l/$s tab characters (or 1 if that number is below one).

This part (($l/$s),1-ge1)[0] is a fancy convoluted way of saying if (($l/$s) -lt 1) { 1 } else { $l/$s } or alternatively [Math]::Max(1,($l/$s)). It makes an array of $l/$s and 1, then uses -ge 1 to return an array containing only the elements that are greater than or equal to one, then takes the first element. It comes in a few bytes shorter than the [Math]::Max version.
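
For readers who don't speak PowerShell, a rough Python analogue of that filter-then-index trick (the name at_least_one is invented for this sketch): build the two candidates, keep those that are at least 1, and take the first survivor.

```python
def at_least_one(value):
    """Mimic (value, 1 -ge 1)[0]: the first candidate that is >= 1."""
    candidates = [value, 1]
    return [c for c in candidates if c >= 1][0]


print(at_least_one(2.5))   # 2.5, same as max(1, 2.5)
print(at_least_one(0.25))  # 1, same as max(1, 0.25)
```

Because the literal 1 always passes the filter, the resulting list is never empty.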

So once all of the replaces are done, the current line is returned from the ForEach-Object (%) iteration, and when all of them are returned (an array of fixed lines), it's -joined with newlines (since we split on newlines in the beginning).

I feel like there's room for improvement here that I'm too burnt out to catch right now, but maybe I'll see something later.

Tabs 4 lyfe

briantist

Posted 2015-09-03T17:26:21.820

Reputation: 3 110

4

PHP - 278 210 bytes

The function works by testing each tab width, starting with a value of 100, the maximal length of a line and therefore the maximal tab width.

For each tab width, we split each line into "blocks" of that length. For each of these blocks:

  • If, by concatenating the last character of the previous block with this block, we find two consecutive spaces before a character, we have an indentation or a tabulation that can't be transformed into tabulators without altering the appearance; we try the next tab width.
  • Otherwise, if the last character is a space, we strip spaces at end of the block, add a tabulator and memorise the whole thing.
  • Otherwise, we just memorise the block.

Once all the blocks of a line have been analysed, we memorise a linefeed. If all the blocks of all the lines were analysed successfully, we return the string we've memorised. Otherwise, if every strictly positive tab width has been tried, there was neither tabulation nor indentation, and we return the original string.

function($s){for($t=101;--$t;){$c='';foreach(split('
',$s)as$l){$e='';foreach(str_split($l,$t)as$b){if(ereg('  [^ ]',$e.$b))continue 3;$c.=($e=substr($b,-1))==' '?rtrim($b).'   ':$b;}$c.='
';}return$c;}return$s;}

Here is the ungolfed version:

function convertSpacesToTabs($string)
{
    for ($tabWidth = 100; $tabWidth > 0; --$tabWidth)
    {
        $convertedString = '';
        foreach (explode("\n", $string) as $line)
        {
            $lastCharacter = '';
            foreach (str_split($line, $tabWidth) as $block)
            {
                if (preg_match('#  [^ ]#', $lastCharacter.$block))
                {
                    continue 3;
                }

                $lastCharacter = substr($block, -1);
                if ($lastCharacter == ' ')
                {
                    $convertedString .= rtrim($block) ."\t";
                }
                else
                {
                    $convertedString .= $block;
                }
            }

            $convertedString .= "\n";
        }

        return $convertedString;
    }

    return $string;
}

Special thanks to DankMemes for saving 2 bytes.

Blackhole

Posted 2015-09-03T17:26:21.820

Reputation: 2 362

You can save 2 bytes by using for($t=101;--$t;) instead of for($t=100;$t;--$t) – DankMemes – 2015-09-05T15:43:43.643

4

CJam, 112

qN/_' ff=:e`{0:X;{_0=X+:X+}%}%_:+{~;\(*},2f=0\+{{_@\%}h;}*:T;\.f{\~\{@;1$({;(T/)9c*}{\;T{T%}&S9c?}?}{1$-@><}?}N*

Try it online

I had to answer this challenge, because I must do my part to help rid the world of this abomination. Tabs are obviously superior, but sadly, some people just can't be reasoned with.

Explanation:

qN/          read input and split into lines
_            duplicate the array (saving one copy for later)
' ff=        replace each character in each line with 0/1 for non-space/space
:e`          RLE-encode each line (obtaining chunks of spaces/non-spaces)
{…}%         transform each line
  0:X;       set X=0
  {…}%       transform each chunk, which is a [length, 0/1] array
    _0=      copy the first element (the length)
    X+:X     increment X by it
    +        and append to the array; this is the end position for the chunk
_            duplicate the array (saving one copy for later)
:+           join the lines (putting all the chunks together in one array)
{…},         filter the array using the block to test each chunk
  ~          dump the chunk (length, 0/1, end) on the stack
  ;          discard the end position
  \(         bring the length to the top and decrement it
  *          multiply the 2 values (0/1 for non-space/space, and length-1)
              the result is non-zero (true) iff it's a chunk of at least 2 spaces
2f=          get all the end positions of the multiple-space chunks
0\+          prepend a 0 to deal with the empty array case
{…}*         fold the array using the block
  {_@\%}h;   calculate gcd of 2 numbers
:T;          save the resulting value (gcd of all numbers) in variable T
\            swap the 2 arrays we saved earlier (input lines and chunks)
.f{…}        for each chunk and its corresponding line
  \~         bring the chunk to the top and dump it on the stack
              (length, 0/1, end position)
  \          swap the end position with the 0/1 space indicator
  {…}        if 1 (space)
    @;       discard the line text
    1$(      copy the chunk length and decrement it
    {…}      if non-zero (multiple spaces)
      ;      discard the end position
      (T/)   divide the length by T, rounding up
      9c*    repeat a tab character that many times
    {…}      else (single space)
      \;     discard the length
      T{…}&  if T != 0
        T%   calculate the end position mod T
      S9c?   if non-zero, use a space, else use a tab
    ?        end if
  {…}        else (non-space)
    1$-      copy the length and subtract it from the end position
              to get the start position of the chunk
    @>       slice the line text beginning at the start position
    <        slice the result ending at the chunk length
              (this is the original chunk text)
  ?          end if
N*           join the processed lines using a newline separator
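
The chunking stage of that explanation (run-length encoding each line, then attaching an end position to every chunk) might look like this in Python. It is only a sketch of the data layout, using itertools.groupby; the name chunks_with_ends is made up.

```python
from itertools import groupby


def chunks_with_ends(line):
    """Run-length encode a line into (length, is_space, end) chunks."""
    chunks, end = [], 0
    for is_space, group in groupby(line, key=lambda ch: ch == " "):
        length = len(list(group))
        end += length  # running total = column where this chunk ends
        chunks.append((length, is_space, end))
    return chunks


print(chunks_with_ends("ab  cde f"))
# [(2, False, 2), (2, True, 4), (3, False, 7), (1, True, 8), (1, False, 9)]
```

The end positions of the multi-space chunks (here just 4) are the values whose gcd becomes T in the CJam program.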

aditsu quit because SE is EVIL

Posted 2015-09-03T17:26:21.820

Reputation: 22 326

1

PowerShell, 165 160 153 152 142 138 137 bytes

param($s)@((0..99|%{$s-split"(
|..{0,$_})"-ne''-replace(' '*!$_*($s[0]-ne32)+' +$'),"`t"-join''})-notmatch'(?m)^ |\t '|sort{$_|% Le*})[0]

Try it online!

Less golfed:

param($spacedString)

$tabed = 0..99|%{
    $spacedString `
        -split "(\n|..{0,$_})" -ne '' `
        -replace (' '*!$_*($spacedString[0]-ne32)+' +$'),"`t" `
        -join ''
}

$validated = $tabed -notmatch '(?m)^ |\t '

$sorted = $validated|sort{$_|% Length}    # sort by a Length property

@($sorted)[0]  # $shortestProgram is an element with minimal length

mazzy

Posted 2015-09-03T17:26:21.820

Reputation: 4 832