Scrutinize all visible or invisible characters of a text file

1

Is there any software that can be used to scrutinize all visible or invisible characters in a text file (characters like BOM, direction mark, line feed ...)?

Showing the Unicode name of characters also is a useful feature.

I want to use such an app to analyze text files before parsing them with a programming language.

PHPst

Posted 2012-11-03T03:59:23.557

Reputation: 3 466

Question was closed 2015-09-14T00:42:04.787

Any good programming editor will have this function. Textpad has both invisible character display and hex display, for example. – Fiasco Labs – 2015-09-06T05:07:56.353

What programming language are you using? – simbabque – 2012-11-03T12:43:54.513

I use PHP; of course, the desired app will be used before programming. – PHPst – 2012-11-03T12:55:32.183

Answers

3

A good hex editor is probably your best bet. Try FrHed (http://frhed.sourceforge.net/en/) if you're on Windows or Bless (http://home.gna.org/bless/) on Linux.
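To illustrate what a hex view reveals, here is a minimal Python sketch of a hex dump. It is not how FrHed or Bless work internally; `hex_dump` and the sample bytes are made up for this example. The sample starts with a UTF-8 BOM and ends with a CRLF, both of which show up as raw bytes (`ef bb bf` and `0d 0a`).

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Return an offset / hex / ASCII dump of raw bytes (hypothetical helper)."""
    pad = width * 3 - 1  # width of the hex column for a full row
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Show printable ASCII as-is and everything else as a dot
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{pad}}  {text_part}")
    return "\n".join(lines)


# Assumed sample: a UTF-8 BOM, the text "Hi", then a CRLF line ending
sample = b"\xef\xbb\xbfHi\r\n"
print(hex_dump(sample))
```

To inspect a real file, read it in binary mode (`open(path, "rb").read()`) so that no decoding or newline translation hides anything.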

adamjreilly

Posted 2012-11-03T03:59:23.557

Reputation: 71

FrHed is unable to open a file if its path or name is not in ASCII. – PHPst – 2015-09-06T05:02:38.033

You see, hex editors are about binary files; my concern is about text files. – PHPst – 2012-11-03T04:50:11.553

Hex editors are about ALL files, including text files. Not that many hex editors have Unicode support, though. Any decent programming language will have libraries for handling Unicode character data. – kreemoweet – 2012-11-03T07:09:54.670

1

The BabelPad editor is great: when you place the cursor after a character, it shows you the Unicode number and the Unicode name. And it has a built-in Unicode information viewer, which shows many Unicode properties for characters. Unfortunately, it processes BOM instead of showing it, and it also interprets line break characters instead of showing them. There might be a way to change this; its documentation is... well, not the best part of it. But it will show invisible controls like LRM and can distinguish between a space and a no-break space etc.

Jukka K. Korpela

Posted 2012-11-03T03:59:23.557

Reputation: 4 475

Thanks. BabelPad has a lot of unique features, but it does not yet have a feature to make invisible characters visible. – PHPst – 2012-11-03T10:54:16.123

1

Maybe this is helpful, though the answer is more fitting to Stack Overflow. I built a small parser in Perl which does what you want. Shame there's no highlighting here.

#!/usr/bin/perl
use strict; use warnings;
use feature qw(say);
use Data::Dumper;
use Unicode::String;
use utf8;

my $line_no = 1;
# Read stuff from the __DATA__ section as if it were a file,
# one line at a time
while (my $line = <DATA>) {
  # Create a Unicode::String object
  my $us = Unicode::String->new($line);

  # Iterate over the length of the string
  for (my $i = 0; $i < $us->length; $i++) {
    # Get the next char
    my $char = $us->substr($i, 1);
    # Output a description, one line per character
    printf "Line %i, column %i, 0x%x '%s' (%s)\n",
      $line_no,         # line number
      $i,               # column number
      $char->ord,       # the ordinal of the char, in hex
      $char->as_string, # the stringified char (as in the input)
      $char->name;      # the glyph's name
  }
  # increment line number
  $line_no++;
}

# Below is the DATA section, which can be used as a file handle
__DATA__
This is some very strange unicode stuff right here:
٩(-̮̮̃-̃)۶ ٩(●̮̮̃•̃)۶ ٩(͡๏̯͡๏)۶ ٩(-̮̮̃•̃).

Let's see what this does:

  • Read from a file handle (the DATA section can be used like that) line by line.
  • Create an object that represents a Unicode string from the line.
  • Iterate the chars in that string
  • Output name, number and stuff about each char

It's really very straightforward. Maybe you can adapt it to PHP, though I don't know if there's a handy library around for the names.
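For comparison, the same steps can be sketched in Python, whose standard-library `unicodedata` module provides the character names. This is not a port of the Perl script above: `describe` and the sample string are made up for this example, and columns are zero-based to match the Perl loop.

```python
import unicodedata


def describe(text: str):
    """Yield (line, column, code point, name) for each character (hypothetical helper)."""
    line_no, col = 1, 0
    for ch in text:
        # Control characters such as LF have no Unicode name, so fall
        # back to a placeholder instead of raising ValueError
        name = unicodedata.name(ch, "<unnamed>")
        yield line_no, col, ord(ch), name
        if ch == "\n":
            # Only LF starts a new line here; other breaks stay on the same line
            line_no, col = line_no + 1, 0
        else:
            col += 1


# Assumed sample: BOM, "a", left-to-right mark, no-break space, "b", LF
sample = "\ufeffa\u200e\u00a0b\n"
for line_no, col, code, name in describe(sample):
    print(f"Line {line_no}, column {col}, U+{code:04X} ({name})")
```

When reading a real file, decoding it with `encoding="utf-8"` rather than `"utf-8-sig"` keeps a leading BOM in the string, so it is reported like any other character.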

Hope it helps.


I lifted the smiley thingies from here: Which Unicode characters do smilies like ٩(•̮̮̃•̃)۶ consist of?

simbabque

Posted 2012-11-03T03:59:23.557

Reputation: 471

1

UltraEdit is a multi-platform text editor with Unicode support and a Hex mode that will show you the hex codes for everything side-by-side with the characters they generate. It even has a Hex find/replace dialog (at least on the Mac version, which is what I'm using at the moment). It's a bit pricy, but it does a lot of other stuff as well.

adv12

Posted 2012-11-03T03:59:23.557

Reputation: 160

How do you go about showing hidden characters in Ultraedit? – newenglander – 2017-10-02T11:24:32.333

1

I'd recommend Notepad++. If you go to View->Show Symbol and select "Show All Characters", it will show any invisible character with its name. For example, it will show newlines as LF, CRLF, or CR depending on the newline format you're using.
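If you want the same newline information before parsing, a small Python sketch can count the line-ending styles in a file's raw bytes. This is unrelated to how Notepad++ does it; `count_line_endings` and the sample bytes are made up for this example.

```python
import re
from collections import Counter

# Labels for the three newline sequences
NAMES = {b"\r\n": "CRLF", b"\r": "CR", b"\n": "LF"}


def count_line_endings(data: bytes) -> Counter:
    """Count CRLF, lone CR and lone LF sequences in raw bytes (hypothetical helper)."""
    # CRLF is listed first so it is not counted as a CR plus an LF
    return Counter(NAMES[m.group()] for m in re.finditer(rb"\r\n|\r|\n", data))


# Assumed sample with mixed line endings
sample = b"one\r\ntwo\nthree\rfour\r\n"
print(dict(count_line_endings(sample)))
```

A file that reports more than one style has mixed line endings, which is worth normalizing before parsing it line by line.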

EnderShadow

Posted 2012-11-03T03:59:23.557

Reputation: 53