
I am trying to strip odd characters from strings using PowerShell. I used the output of the following command to try to learn on my own:

get-help about_regular_expressions

I am trying to take a string that is mostly ASCII, but that has one anomalous character that needs to be removed. (The registered trademark symbol; the R with a circle around it.) I'd like to strip any occurrence of that character out of a string, leaving everything else intact. What is the cleanest expression to accomplish this using PowerShell 2.0?

[EDIT]

I have done a little further digging, and I believe the problem is stemming from the Import-CSV call I'm using.

When I cut-and-paste this symbol from within notepad into the PS prompt, and assign it to a string, I match just fine:

# This code yields 'True'
$string -match "\u00ae"

However, when I use Import-CSV on a CSV file where one of the fields contains the special symbol, I believe somehow the raw bytes are getting converted, because doing something like this doesn't work:

# This code yields 'False'
$source = Import-CSV -path testing.csv
# The following extracts the entry / line containing the special symbol that was
# copy-and-pasted above
$culprit = $source[5].COMMITTEE_NAME
$culprit -match "\u00ae"

However, the following DOES work:

# This yields True
$filedata = get-content testing.csv
$filedata[6] -match "\u00ae"

So I think my followup question to all of this is:

How can I keep the strings intact through the import-csv call so that calls to -match for the individual fields will still work?

Larold

1 Answer


It's important to note that the PowerShell console doesn't display Unicode well. You'll have to use the ISE to "see" what's happening. Have a look at this related SO question for some additional reading. Regardless, you can still use the ® character in PowerShell if you don't need to watch the script in action.

In the ISE:

PS C:\Users\jscott> $string = "This string contains the ® character"
PS C:\Users\jscott> $string
This string contains the ® character

PS C:\Users\jscott> $string.Replace("®","")
This string contains the  character

PS C:\Users\jscott> $string ="This ® string ® contains ® many ® characters ®®®®"
PS C:\Users\jscott> $string
This ® string ® contains ® many ® characters ®®®®

PS C:\Users\jscott> $string.Replace("®","")
This  string  contains  many  characters 

To use character code instead of the literal:

PS C:\Users\jscott> $string.Replace("$([char]0x00AE)","")
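The same removal can also be written with the -replace operator, which takes a regular expression and so accepts the \u00AE escape used in the question. A minimal sketch, with a made-up sample string; the variable names are only for illustration:

```powershell
# U+00AE is the registered trademark sign; build it from its code point
# so the script itself stays pure ASCII.
$reg = [char]0x00AE
$string = "This " + $reg + " string contains " + $reg + $reg + " characters"

# Literal replacement with the .NET String.Replace(string, string) method.
$viaMethod = $string.Replace([string]$reg, "")

# Regex replacement with the -replace operator; \u00AE is a .NET regex escape.
$viaOperator = $string -replace '\u00AE', ''
```

Both calls remove every occurrence and leave the surrounding spaces in place, so the result still contains the doubled spaces seen in the ISE output above.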

Per your question update:

You need to convert the ASCII file to Unicode/UTF8 before running it through Import-Csv -- I didn't realize you were using this. Have a look at this and this for other examples.

You may just want to pipe the initial CSV file through Get-Content or Export-Csv -Encoding Unicode to pre-process the file and make life easier.
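Since the question reports that lines read with Get-Content still match \u00ae, one way to try this is to pipe those lines into ConvertFrom-Csv rather than calling Import-Csv on the file directly. A rough sketch, not tested against the original data: the temporary file, its contents, and the ID column are made up for illustration, and only the COMMITTEE_NAME field comes from the question.

```powershell
# Hypothetical sample file standing in for testing.csv, written as UTF-16
# with a byte-order mark so Get-Content can detect the encoding.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'testing-sketch.csv'
$text = "ID,COMMITTEE_NAME`r`n1,Acme" + [char]0x00AE + " Committee`r`n"
[System.IO.File]::WriteAllText($path, $text, [System.Text.Encoding]::Unicode)

# Read the lines with Get-Content, then parse them as CSV.
$rows = @(Get-Content -Path $path | ConvertFrom-Csv)

# Strip the symbol from the field once the rows are objects.
$clean = $rows[0].COMMITTEE_NAME -replace '\u00AE', ''

# Remove the temporary sample file.
Remove-Item -Path $path
```

If the real file has no byte-order mark, the encoding Get-Content assumes may matter, so it is worth checking that the field still matches \u00ae before relying on this.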

jscott
  • Thanks. I knew about the Replace() method, but I only know how to specify the trademark symbol as U+00AE. I'd like to know how to specify 'U+00AE' as the character to be replaced. I tried looking at http://msdn.microsoft.com/en-us/library/20bw873z.aspx but I didn't see how to specify an individual unicode character in that spec. – Larold Sep 21 '11 at 01:34
  • @Larold Updated my answer. If that's not what you're asking, please let me know. – jscott Sep 21 '11 at 02:00
  • Thanks - I'll give it a shot. I think the problem may be that the raw bits aren't exactly matching what wikipedia is telling me the unicode value is for the symbol. I'm using Unix's od to view the raw character in several different formats to determine what I'm looking at. The octal representation of this character appears to be: 303 275 303 277 , or in hex 0xC3 0xBD 0xC3 0xBF. I'm going to see if I can match a regexp by hex... – Larold Sep 21 '11 at 02:09
  • Ok - I've looked at the raw bits with a Unix program called 'od'. The symbol I need to match on is apparently 4 bytes, so perhaps this is a two-character sequence. The raw octal representation of this character appears to be: 303 275 303 277 , or in hex 0xC3 0xBD 0xC3 0xBF. What's the proper way to match exactly that 4-byte sequence, specified in hex? Thanks! – Larold Sep 21 '11 at 02:17
  • I copied the csv file I'm taking data from, then deleted every character except the trademark symbol. Confirmed it's 0x00ae getting saved. However, your Replace() call above unfortunately does not seem to work. Any suggestions? – Larold Sep 21 '11 at 02:43
  • @Larold Perhaps you can update the question to include a snip of the CSV as well as the code you're processing it with? – jscott Sep 21 '11 at 10:31