Exporting UTF-8 text from LibreOffice without byte order mark

4

In LibreOffice, if I save a document as file type "encoded text" and select "Unicode (UTF-8)" as the encoding, it always writes a byte order mark (BOM) at the start of the text. It does this even when exporting text that started out with no such mark (such as imported ISO-8859-8 text). Is there a way to suppress the generation of the BOM?

According to the Unicode docs: "Where UTF-8 is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format that expects specific ASCII characters at the beginning". This is exactly the problem I'm running into, as the text is going to be fed to a program that does not expect an initial BOM.

Ted Hopp

Posted 2011-10-12T16:37:16.800

Reputation: 160
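
As an illustration of the problem described in the question, here is a minimal Python sketch (not anything LibreOffice itself runs) showing what the UTF-8 BOM looks like at the byte level. It relies on Python's standard "utf-8-sig" codec, which writes the BOM, and the plain "utf-8" codec, which does not. The sample string and the helper name has_utf8_bom are assumptions made for the example.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)

# Sample text (a Hebrew word written with escapes); any string would do.
text = "\u05e9\u05dc\u05d5\u05dd"

with_bom = text.encode("utf-8-sig")   # UTF-8 preceded by the three BOM bytes
without_bom = text.encode("utf-8")    # plain UTF-8, no BOM

print(codecs.BOM_UTF8.hex())          # efbbbf
print(has_utf8_bom(with_bom))         # True
print(has_utf8_bom(without_bom))      # False
# Apart from the three leading bytes, the two encodings are identical.
print(with_bom[len(codecs.BOM_UTF8):] == without_bom)  # True
```

A consumer that expects specific ASCII characters at the start of its input sees the bytes EF BB BF first, which is the interference the Unicode quote refers to.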

Answers

1

There is a pending 2018 patch attached to a relevant 2011 bug report:

Martin van Zijl 2018-02-26 18:48:14 UTC

I created a patch for review. With this patch if you do:

1) File --> Save As...
2) Choose Type = "Text (Choose Encoding)"
3) Click "Use Text - ..."
4) In the final dialog will be a checkbox "Include byte-order-mark". If you un-check this, then the BOM will not be included in the output.

Video demo attached.

Review link: https://gerrit.libreoffice.org/#/c/50388/

RedGrittyBrick

Posted 2011-10-12T16:37:16.800

Reputation: 70 632

Good to know that this is finally being addressed. The BOM is useful for UTF-16 encoding, but has no value and should be removed for UTF-8 exports. I noticed this comment in the bug report thread: "I agree with this. But currently we have very few developers. This may take several years. Sorry for such situation." It's going on seven years, so whoever made that comment is apparently a good estimator. :) – Ted Hopp – 2018-03-09T20:49:50.107

-1

When saving the file with Save As, under All Formats select Text Encoded, then Save. When the Confirm File Format dialog comes up, select Use Text Encoded Format. The ASCII Filter Options dialog then comes up. Select Western Europe (ASCII/US) and click OK. If you then examine the resulting file with a hex editor such as Bless, you will see that the BOM is gone.

John F. Healy

Posted 2011-10-12T16:37:16.800

Reputation: 15

1 And then non-Western-European characters, as well as Western European characters above 127, are gone too – phuclv – 2014-12-18T14:44:29.940

Sorry, but this doesn't address the question at all. I need a way to export UTF-8 text from LibreOffice without a BOM. – Ted Hopp – 2015-07-13T21:05:20.483

@TedHopp well, what if you choose the option he says? Run the "file" command on the file, and see what it says. You could try editing the hex of the UTF-8 file and removing the BOM. Did you try any of that? – barlop – 2015-08-27T15:13:27.900

1 @barlop - The option he suggests results in garbage in the file in place of non-ASCII characters (e.g., Cyrillic, Hebrew, Greek). I fail to see how the "file" command would be useful. Editing the hex and removing the BOM might work for a file here or there, but is totally unsuitable for production work. Plus, it doesn't address my question at all: is there a way to suppress the generation of the BOM from LibreOffice? – Ted Hopp – 2015-08-27T17:04:51.820

@TedHopp Ah, I thought what he wrote might not work, because his answer said ASCII and your question said UTF-8, so his "answer" would mess up non-ASCII characters. Now that you've confirmed that, I've given him a -1. On a related note, if LibreOffice has no way to strip the BOM from a file, would a command-line solution that could work on a bunch of files be OK? And what OS is this? Windows? Linux? – barlop – 2015-08-27T18:17:57.717

@barlop - It's Windows. A command-line solution would not work very well, but I could make do with it. What I've done instead is copy and paste text into a simple text editor that knows better how to write UTF-8 files. – Ted Hopp – 2015-08-27T18:21:11.237

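@TedHopp Are you familiar with Cygwin or GnuWin32? I did this in Cygwin: $ xxd -p ass.txt | tr -d '\n' | cut -b 7- | xxd -r -p > ass2.txt (ass.txt has the BOM; ass2.txt is the new file that does not). The tr step joins the hex dump into a single line, so that cut drops only the first three bytes of the file rather than the first three bytes of every dump line. You could then make it a little script with parameters $1 or %1, depending on whether you run it in cmd or Cygwin. – barlop – 2015-08-27T19:33:13.647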

@barlop - Sure, I'm familiar with that. I already have ways of working around the broken export function; that's not the issue. Command line solutions don't work well when the text is exported to a network location. (In Windows, I'd have to mount the location to be able to switch to it in a command window. A total pain when the locations can vary from task to task.) – Ted Hopp – 2015-08-27T19:54:34.427

Let us continue this discussion in chat. – barlop – 2015-08-27T20:19:50.163
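
For readers who settle on the post-processing workaround discussed in the comments above, the following is a minimal Python sketch of removing a leading UTF-8 BOM from an exported file. It is an alternative to the hex-dump pipeline, not a LibreOffice feature. The function name strip_utf8_bom and the idea of passing source and destination paths are assumptions made for the example. Unlike an unconditional byte cut, it leaves files without a BOM unchanged.

```python
import codecs
from pathlib import Path

def strip_utf8_bom(src: Path, dst: Path) -> bool:
    """Copy src to dst, dropping a leading UTF-8 BOM if one is present.

    Returns True if a BOM was found and removed, False otherwise.
    The whole file is read into memory, which is fine for ordinary
    text exports but not intended for very large files.
    """
    data = src.read_bytes()
    had_bom = data.startswith(codecs.BOM_UTF8)
    if had_bom:
        # Skip exactly the three BOM bytes; the rest is copied verbatim.
        data = data[len(codecs.BOM_UTF8):]
    dst.write_bytes(data)
    return had_bom
```

It could be called as strip_utf8_bom(Path("export.txt"), Path("export-nobom.txt")), where both file names are placeholders. Python on Windows generally accepts UNC paths without a mounted drive letter, though that is not verified here for any particular network setup.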

4 Well of course the BOM is gone, but the file is no longer encoded in UTF-8, either! – kreemoweet – 2014-03-10T16:25:52.727
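
The objection raised in these comments can be shown with a short Python sketch: an ASCII encoding cannot represent characters above code point 127, so choosing it trades the BOM for data loss. The sample string is an assumption, and Python's "replace" error handler is only a stand-in; the thread does not say exactly what LibreOffice writes in place of such characters.

```python
# "café" followed by a Hebrew word, written with escapes for portability.
text = "caf\u00e9 \u05e9\u05dc\u05d5\u05dd"

# Strict ASCII encoding refuses any character above code point 127.
try:
    text.encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# A lossy ASCII encoding substitutes "?" for each unrepresentable character.
lossy = text.encode("ascii", errors="replace")

# Plain UTF-8 (no BOM) round-trips the text without loss.
utf8 = text.encode("utf-8")

print(strict_ok)                    # False
print(lossy)                        # b'caf? ????'
print(utf8.decode("utf-8") == text) # True
```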