Return value in powershell script stored in string not utf8


I'm trying to extract tables from a PDF using tabula and PowerShell. When I enter the command directly in the PowerShell console, the expected result is displayed (in UTF-8, with correct umlauts):

java -jar "./tabula-java/$tabulaVersion" --spreadsheet -a 114,53,180,556 "./table.pdf"

But when I store the output in a string variable and then write it to a file, the umlauts become gibberish:

$text = java -jar "./tabula-1.0.1-jar-with-dependencies.jar" --spreadsheet -a 114,53,180,556 "./table.pdf"   
Set-Content -Path "./file.txt" -Value $text

Even when I just print the variable in the console, the umlauts are not displayed properly:

$text = java -jar "./tabula-1.0.1-jar-with-dependencies.jar" --spreadsheet -a 114,53,180,556 "./table.pdf"   
Write-Output $text  

Is there a way to store the output in a string variable (so that I can manipulate the content) and then write it to a file while keeping the UTF-8 (without BOM) encoding?

Using the approach from https://stackoverflow.com/a/5596984/1786528 does not work for me either

$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
[System.IO.File]::WriteAllLines($filepath, $text, $Utf8NoBomEncoding)

I don't get an error, but no file is created and no line is added.

Update:

[System.IO.File]::WriteAllLines does create a file (in UTF-8 without BOM); I had used a relative path and had not set [System.Environment]::CurrentDirectory = (Get-Location).Path. Nonetheless, the umlauts are still not correct.
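The relative-path pitfall can be avoided without touching [System.Environment]::CurrentDirectory. A minimal sketch, assuming the target is ./file.txt as above; the sample lines are only a stand-in for the tabula output, and this fixes the path problem, not the umlauts:

```powershell
# .NET file APIs resolve relative paths against [Environment]::CurrentDirectory,
# which PowerShell does not update on Set-Location. Resolve the path first.
$relativePath = './file.txt'   # target from the question
$filepath = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($relativePath)

# Stand-in for the captured tabula output, built from code points so the
# sample does not depend on how this script file itself is encoded.
$text = @("Gr$([char]0xF6)$([char]0xDF)e", "$([char]0xDC)bung")

$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $false
[System.IO.File]::WriteAllLines($filepath, [string[]]$text, $Utf8NoBomEncoding)

# Inspect the first bytes: a UTF-8 BOM would be EF BB BF.
$bytes  = [System.IO.File]::ReadAllBytes($filepath)
$hasBom = $bytes.Length -ge 3 -and $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF
"Wrote $filepath (BOM present: $hasBom)"
```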

Additional details

case 1: output directly in console, e.g.

java -jar "./tabula-1.0.1-jar-with-dependencies.jar" --spreadsheet "./table.pdf" 

case 2: output stored in variable, then printed in console, e.g.

$text = java -jar "./tabula-1.0.1-jar-with-dependencies.jar" --spreadsheet "./table.pdf"   
Write-Output $text 

case 3: output stored in variable but with -D"file.encoding=UTF-8", then printed in console, e.g.

$text = java -D"file.encoding=UTF-8" -jar "./tabula-1.0.1-jar-with-dependencies.jar" --spreadsheet "./table.pdf"   
Write-Output $text 

Update:

$OutputEncoding = US-ASCII and [System.Console]::OutputEncoding = OEM United States (IBM437)

case 4: output directly in console (with changing [System.Console]::OutputEncoding beforehand), e.g.

[System.Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding(1252)
java -jar "./tabula-1.0.1-jar-with-dependencies.jar" --spreadsheet "./table.pdf" 

case 5: output stored in variable, then printed in console (with changing [System.Console]::OutputEncoding beforehand), e.g.

[System.Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding(1252)
$text = java -jar "./tabula-1.0.1-jar-with-dependencies.jar" --spreadsheet "./table.pdf"   
Write-Output $text 

The umlauts then come out as follows in each case:

pdf    case 1    case 2     case 3    case 4     case 5
 ä      ä         Σ          ├ñ        „          ä
 ö      ö         ÷          ├╢        ”          ö
 ü      ü         ⁿ          ├╝                  ü
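
The columns for cases 2, 3 and 4 look like what you get when bytes written in one code page are decoded in another. A rough sketch that reproduces them; which encoding java actually used in each case is my assumption, inferred only from the table, and Convert-Mojibake is a made-up helper name:

```powershell
$utf8   = New-Object System.Text.UTF8Encoding $false
$cp437  = [System.Text.Encoding]::GetEncoding(437)    # OEM United States
$cp1252 = [System.Text.Encoding]::GetEncoding(1252)   # Windows Western European

# Hypothetical helper: encode text with one encoding, decode the bytes with another.
function Convert-Mojibake {
    param(
        [string] $Text,
        [System.Text.Encoding] $WrittenAs,
        [System.Text.Encoding] $ReadAs
    )
    $ReadAs.GetString($WrittenAs.GetBytes($Text))
}

# 'äöü' built from code points, independent of the script file's encoding.
$umlauts = -join ([char[]](0xE4, 0xF6, 0xFC))

$case2 = Convert-Mojibake $umlauts $cp1252 $cp437   # assumed: written as 1252, read as 437  -> Σ÷ⁿ
$case3 = Convert-Mojibake $umlauts $utf8   $cp437   # assumed: written as UTF-8, read as 437 -> ├ñ├╢├╝
$case4 = Convert-Mojibake $umlauts $cp437  $cp1252  # assumed: written as 437, read as 1252; starts with „”
                                                    # (the byte for ü, 0x81, is unassigned in Windows-1252)
"case 2: $case2"
"case 3: $case3"
"case 4: $case4"
```

If that reading is right, PowerShell decodes the captured bytes with [System.Console]::OutputEncoding, and the output only looks correct when that matches whatever java wrote (as in case 5).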

jost21

Posted 2018-04-12T23:08:55.973

Reputation: 195

You could try $text = java -jar "./tabula-java/$tabulaVersion" --spreadsheet -a 114,53,180,556 $filepath | out-file -FilePath "./file.txt" -Encoding UTF8. If that does not work, the output is most likely already wrong – SimonS – 2018-04-13T06:38:51.030

Unfortunately, it doesn't. I also think that the culprit is at $text = java ... (see my update). But I do not understand why the response is displayed correctly when I just use java ... and as soon as I store it in a variable not anymore. It also creates a working file if I write the output directly with the tabula command (using -o "./file.txt") – jost21 – 2018-04-13T06:53:45.233

Look at this answer on Stack Overflow; it looks like you can specify UTF-8 encoding in java itself: https://stackoverflow.com/questions/6733029/output-as-utf-8-encoding-in-java Adding -Dfile.encoding=UTF-8 should do the trick

– SimonS – 2018-04-13T06:57:48.987

This sounded promising, but I get this error message: Error: Could not find or load main class .encoding=UTF-8. I tried to find out more about -Dfile, but without much success – jost21 – 2018-04-13T07:31:40.007

hmm i guess you have to wait for a java commandline pro. you should search for java commandline output UTF8 or probably even java tabula UTF8 encoding. Can't help any further since I don't know too much about java (maybe readme.md in github helps) – SimonS – 2018-04-13T07:50:52.187

Maybe this isn't related, but maybe it is: When working on a PS-script that features exiftool (which supports UTF), I had the same problem as you had (same thing with working console, same thing with non-working UTF-encoding). Sadly, the only (non-)solution I could come up with was to add the BOM. So what I want to say is that I wouldn't think that Java is the culprit here. – flolilo – 2018-04-13T09:27:32.070

Please show some example (with more umlauted as well as common letters) of expected result and corresponding gibberish (for every given case). – JosefZ – 2018-04-15T19:47:20.663

Try java -D"file.encoding=UTF-8" .... You can also try setting the environment variable JAVA_TOOL_OPTIONS to -Dfile.encoding=UTF-8. – Bacon Bits – 2018-04-16T12:41:10.743

With java -D"file.encoding=UTF-8", there is no error anymore and I get a different output, but it's still not correct (file is UTF-16 LE). ä becomes ├ñ, without java -D"file.encoding=UTF-8" it was Σ – jost21 – 2018-04-17T13:30:47.360

There is a weird and perplexing difference between case 1 and case 2. What is your $OutputEncoding and [System.Console]::OutputEncoding? What happens with [System.Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding(1252) just before calling java …/$text = java …? – JosefZ – 2018-04-23T07:59:35.193

$OutputEncoding = US-ASCII and [System.Console]::OutputEncoding = OEM United States (IBM437). I added the new cases to my post – jost21 – 2018-04-24T09:42:53.510

Flagrant mojibake occurrence: for instance, try [System.IO.File]::WriteAllLines( $MyPath, 'äöü', [System.Text.UTF8Encoding]($False)). Then, [System.IO.File]::ReadAllLines($MyPath, [System.Text.Encoding]::GetEncoding(437)) covers your case 3 (and case 5 changing 437 to 1252) etc. Unfortunately, I don't know how to configure java output encoding…

– JosefZ – 2018-04-24T17:01:43.247
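
If the mismatch really is between what java writes and what PowerShell expects, one thing to try is making both sides agree on UTF-8 while the output is captured. A sketch only, not a confirmed fix: Invoke-WithConsoleEncoding is a made-up helper name, and it assumes that java honors -D"file.encoding=UTF-8" for its standard output and that the session has a console whose encoding can be changed:

```powershell
# Hypothetical helper: run a script block while [Console]::OutputEncoding is
# temporarily switched, so output captured from external programs is decoded
# with that encoding. The previous encoding is restored afterwards.
function Invoke-WithConsoleEncoding {
    param(
        [Parameter(Mandatory = $true)]
        [scriptblock] $ScriptBlock,

        [System.Text.Encoding] $Encoding = (New-Object System.Text.UTF8Encoding $false)
    )
    $previous = [System.Console]::OutputEncoding
    try {
        [System.Console]::OutputEncoding = $Encoding
        & $ScriptBlock
    }
    finally {
        [System.Console]::OutputEncoding = $previous
    }
}

# Usage with the command from the question (needs java and the tabula jar):
# $text = Invoke-WithConsoleEncoding {
#     java -D"file.encoding=UTF-8" -jar "./tabula-1.0.1-jar-with-dependencies.jar" --spreadsheet "./table.pdf"
# }
# $text can then be manipulated and written with [System.IO.File]::WriteAllLines
# and a UTF8Encoding created with $false (no BOM).
```

The try/finally is there so that a failing java call does not leave the console stuck on the temporary encoding.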

Thank you for the explanation. So the culprit is java, not Powershell? – jost21 – 2018-04-25T07:42:07.377

No answers