I'm trying to extract tables from a PDF using tabula and PowerShell. When I enter the command directly in the PowerShell console, I get the expected result displayed (in UTF-8, with umlauts):
java -jar "./tabula-java/$tabulaVersion" --spreadsheet -a 114,53,180,556 "./table.pdf"
But when I store the output in a string variable and then write it to a file, the umlauts become gibberish:
$text = java -jar "./tabula-1.0.1-jar-with-dependencies.jar" --spreadsheet -a 114,53,180,556 "./table.pdf"
Set-Content -Path "./file.txt" -Value $text
Even if I just print the variable in the console, the umlauts are not displayed properly:
$text = java -jar "./tabula-1.0.1-jar-with-dependencies.jar" --spreadsheet -a 114,53,180,556 "./table.pdf"
Write-Output $text
Is there a way to store the output in a string variable (so that I can manipulate the content) and then write it to a file while keeping the UTF-8 (without BOM) encoding?
Using the approach from https://stackoverflow.com/a/5596984/1786528 does not work for me either:
$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
[System.IO.File]::WriteAllLines($filepath, $text, $Utf8NoBomEncoding)
I don't get an error, but no file is created and no line is added.
Update: [System.IO.File]::WriteAllLines does create a file (in UTF-8 without BOM); I had used a relative path and had not set [System.Environment]::CurrentDirectory = (Get-Location).Path. Nonetheless, the umlauts are still not correct.
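The relative-path pitfall from the update can be sketched as follows. This is a minimal sketch, not part of the original post: the two sample lines are hypothetical stand-ins for the captured tabula output, and resolving the path through $ExecutionContext is just one way to avoid changing [System.Environment]::CurrentDirectory.

```powershell
# Hypothetical sample lines standing in for the captured tabula output.
# The umlauts are built from code points so the script file's encoding does not matter.
$text = @("Gr$([char]0xF6)$([char]0xDF)e,Wert", "1,2")

# .NET methods resolve relative paths against the process working directory,
# which can differ from PowerShell's current location. Resolve the path first.
$filepath = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath('./file.txt')

# $false means: do not emit a byte order mark.
$utf8NoBom = New-Object System.Text.UTF8Encoding $false
[System.IO.File]::WriteAllLines($filepath, [string[]]$text, $utf8NoBom)
```

This only fixes where the file ends up and how it is encoded on disk; it does not repair characters that were already decoded wrongly when the variable was filled.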
Additional details
case 1: output directly in console, e.g.
java -jar "./tabula-1.0.1-jar-with-dependencies.jar" --spreadsheet "./table.pdf"
case 2: output stored in variable, then printed in console, e.g.
$text = java -jar "./tabula-1.0.1-jar-with-dependencies.jar" --spreadsheet "./table.pdf"
Write-Output $text
case 3: output stored in variable but with -D"file.encoding=UTF-8", then printed in console, e.g.
$text = java -D"file.encoding=UTF-8" -jar "./tabula-1.0.1-jar-with-dependencies.jar" --spreadsheet "./table.pdf"
Write-Output $text
Update: $OutputEncoding = US-ASCII and [System.Console]::OutputEncoding = OEM United States (IBM437)
case 4: output directly in console (after changing [System.Console]::OutputEncoding beforehand), e.g.
[System.Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding(1252)
java -jar "./tabula-1.0.1-jar-with-dependencies.jar" --spreadsheet "./table.pdf"
case 5: output stored in variable, then printed in console (after changing [System.Console]::OutputEncoding beforehand), e.g.
[System.Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding(1252)
$text = java -jar "./tabula-1.0.1-jar-with-dependencies.jar" --spreadsheet "./table.pdf"
Write-Output $text
This results in the following for the umlauts:
pdf | case 1 | case 2 | case 3 | case 4 | case 5
ä   | ä      | Σ      | ├ñ     | „      | ä
ö   | ö      | ÷      | ├╢     | ”      | ö
ü   | ü      | ⁿ      | ├╝     |        | ü
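One possible reading of these columns, offered as an assumption rather than a confirmed diagnosis: in each broken case the bytes were written in one encoding and decoded in another. The sketch below reproduces the case 2, case 3 and case 4 columns under the assumption that Java wrote Windows-1252 in case 2, UTF-8 in case 3 and code page 437 in case 4. Convert-Misdecoded is a hypothetical helper name, not an existing cmdlet.

```powershell
# Code pages involved: 437 = OEM console, 1252 = Windows ANSI.
$cp437  = [System.Text.Encoding]::GetEncoding(437)
$cp1252 = [System.Text.Encoding]::GetEncoding(1252)
$utf8   = New-Object System.Text.UTF8Encoding $false

# Hypothetical helper: encode text with one encoding, decode the bytes with another.
function Convert-Misdecoded {
    param(
        [string]$Text,
        [System.Text.Encoding]$WrittenAs,
        [System.Text.Encoding]$ReadAs
    )
    $ReadAs.GetString($WrittenAs.GetBytes($Text))
}

# Build the sample from code points so the script file's own encoding does not matter.
$sample = "$([char]0xE4)$([char]0xF6)$([char]0xFC)"   # ä ö ü

# Assumed case 2: written as 1252, decoded as 437.
$case2 = Convert-Misdecoded -Text $sample -WrittenAs $cp1252 -ReadAs $cp437    # Σ÷ⁿ
# Assumed case 3: written as UTF-8, decoded as 437.
$case3 = Convert-Misdecoded -Text $sample -WrittenAs $utf8 -ReadAs $cp437      # ├ñ├╢├╝
# Assumed case 4: written as 437, decoded as 1252. The byte for ü (0x81) has no
# printable character in 1252, which would explain the empty cell in the table.
$case4 = Convert-Misdecoded -Text $sample -WrittenAs $cp437 -ReadAs $cp1252

$case2, $case3, $case4
```

If this reading is right, the captured text is wrong whenever the encoding Java writes and [System.Console]::OutputEncoding disagree, and correct (case 5) when they happen to match.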
You could try $text = java -jar "./tabula-java/$tabulaVersion" --spreadsheet -a 114,53,180,556 $filepath | out-file -FilePath "./file.txt" -Encoding UTF8. If it does not work, the output is most likely already wrong. – SimonS – 2018-04-13T06:38:51.030
Unfortunately, it doesn't. I also think that the culprit is at $text = java ... (see my update). But I do not understand why the response is displayed correctly when I just use java ... and no longer as soon as I store it in a variable. It also creates a working file if I write the output directly with the tabula command (using -o "./file.txt"). – jost21 – 2018-04-13T06:53:45.233
Look at this answer on Stack Overflow; it looks like you can specify UTF-8 encoding already in java ...: https://stackoverflow.com/questions/6733029/output-as-utf-8-encoding-in-java. Adding -Dfile.encoding=UTF-8 should do the trick. – SimonS – 2018-04-13T06:57:48.987
This sounded promising, but I get this error message: Error: Could not find or load main class .encoding=UTF-8. I tried to find out more about -Dfile, but not very successfully. – jost21 – 2018-04-13T07:31:40.007
Hmm, I guess you have to wait for a Java command-line pro. You should search for "java commandline output UTF8" or probably even "java tabula UTF8 encoding". Can't help any further since I don't know too much about Java (maybe readme.md on GitHub helps). – SimonS – 2018-04-13T07:50:52.187
Maybe this isn't related, but maybe it is: when working on a PS script that features exiftool (which supports UTF), I had the same problem as you (same thing with working console, same thing with non-working UTF encoding). Sadly, the only (non-)solution I could come up with was to add the BOM. So what I want to say is that I wouldn't think that Java is the culprit here. – flolilo – 2018-04-13T09:27:32.070
Please show some example (with more umlauted as well as common letters) of expected result and corresponding gibberish (for every given case). – JosefZ – 2018-04-15T19:47:20.663
Try java -D"file.encoding=UTF-8" ... You can also try setting the environment variable JAVA_TOOL_OPTIONS to -Dfile.encoding=UTF-8. – Bacon Bits – 2018-04-16T12:41:10.743
With java -D"file.encoding=UTF-8", there is no error anymore and I get a different output, but it's still not correct (file is UTF-16 LE). ä becomes ä, without java -D"file.encoding=UTF-8" it was Σ. – jost21 – 2018-04-17T13:30:47.360
There is a weird and perplexing difference between case 1 and case 2. What is your $OutputEncoding and [System.Console]::OutputEncoding? What happens with [System.Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding(1252) just before calling java … / $text = java …? – JosefZ – 2018-04-23T07:59:35.193
$OutputEncoding = US-ASCII and [System.Console]::OutputEncoding = OEM United States (IBM437). I added the new cases to my post. – jost21 – 2018-04-24T09:42:53.510
Flagrant mojibake occurrence: for instance, try [System.IO.File]::WriteAllLines( $MyPath, 'äöü', [System.Text.UTF8Encoding]($False)). Then, [System.IO.File]::ReadAllLines($MyPath, [System.Text.Encoding]::GetEncoding(437)) covers your case 3 (and case 5 changing 437 to 1252) etc. Unfortunately, I don't know how to configure java output encoding… – JosefZ – 2018-04-24T17:01:43.247
Thank you for the explanation. So the culprit is java, not Powershell? – jost21 – 2018-04-25T07:42:07.377
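For what it's worth, the cases above suggest the mismatch sits between the two: PowerShell decodes the captured output of an external program using [System.Console]::OutputEncoding, so the capture is only correct when that setting matches what Java actually writes. A minimal sketch, assuming a Java runtime where file.encoding governs System.out (newer JDKs have a separate stdout.encoding property) and reusing the jar and PDF names from the question:

```powershell
# Remember the current console encoding so it can be restored afterwards.
$previousEncoding = [System.Console]::OutputEncoding
try {
    # Make PowerShell decode the external program's stdout as UTF-8.
    [System.Console]::OutputEncoding = New-Object System.Text.UTF8Encoding $false

    # Ask Java to write UTF-8 (assumption: this flag controls stdout on the JDK in use).
    $text = java -D"file.encoding=UTF-8" -jar "./tabula-1.0.1-jar-with-dependencies.jar" --spreadsheet "./table.pdf"
}
finally {
    [System.Console]::OutputEncoding = $previousEncoding
}

# $text now holds .NET strings and can be manipulated before writing.
# Write UTF-8 without BOM to an absolute path, since .NET ignores PowerShell's location.
$filepath = Join-Path (Get-Location).ProviderPath 'file.txt'
[System.IO.File]::WriteAllLines($filepath, [string[]]@($text), (New-Object System.Text.UTF8Encoding $false))
```

The try/finally is only there to avoid leaving the session's console encoding changed; whether the umlauts then come out right depends on the Java side really emitting UTF-8.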