LHA (file format)
LHA or LZH is a freeware compression utility and associated file format. It was created in 1988 by Haruyasu Yoshizaki (吉崎栄泰, Yoshizaki Haruyasu), and originally named LHarc. A complete rewrite of LHarc, tentatively named LHx, was eventually released as LH. It was then renamed to LHA to avoid conflicting with the then-new MS-DOS 5.0 LH ("load high") command. According to early documentation, LHA is pronounced like La.
LHarc running in MS-DOS | |
---|---|
Filename extension | .lzh, .lha |
Internet media type | application/x-lzh-compressed |
Type code | "LHA␣" (L-H-A-SPACE) |
Uniform Type Identifier (UTI) | public.archive.lha |
Developed by | Haruyasu Yoshizaki (Yoshi) |
Type of format | Data compression |
Website | https://github.com/jca02266/lha |
Although no longer much used in the West, LHA remains popular in Japan. id Software used it to compress installation files for their earlier games, including Doom and Quake. LHA has been ported to many operating systems and is still the main archiving format on the Amiga, although it competed with LZX in the mid-1990s. Its position on the Amiga came from Aminet, the world's largest archive of Amiga-related software and files, standardising on Stefan Boberg's implementation of LHA for the Amiga.
Microsoft released a Windows XP add-on, the Microsoft Compressed (LZH) Folder Add-on, designed for the Japanese version of the operating system.[1] The Japanese version of Windows 7 ships with the LZH folder add-on built in.[2] Users of non-Japanese versions of Windows 7 Enterprise and Ultimate can also obtain the add-on by installing the optional Japanese language pack from Windows Update.
Compression methods
In an LZH archive, the compression method is stored as a five-byte text string, e.g. -lh1-. The string occupies the third through seventh bytes of the file (byte offsets 2 to 6).
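As a minimal sketch of that layout, the Python function below pulls the method ID out of the start of a header. The name `read_method_id` and the sample bytes are hypothetical, and the first two bytes are treated as opaque placeholders, because the text above only fixes the position of the five-byte string.

```python
def read_method_id(header: bytes) -> str:
    """Return the three-character method name (e.g. 'lh5') from a header.

    Assumes only what the text states: the ID is a five-byte string such as
    b'-lh5-' occupying the third through seventh bytes (offsets 2 to 6).
    """
    if len(header) < 7:
        raise ValueError("need at least 7 bytes to reach the method ID")
    field = header[2:7]
    # The ID is framed by hyphens; reject anything that does not look like one.
    if not (field.startswith(b"-") and field.endswith(b"-")):
        raise ValueError(f"no method ID at offsets 2-6: {field!r}")
    return field[1:4].decode("ascii")


# Hypothetical sample: two placeholder bytes, the ID, then arbitrary padding.
sample = b"\x00\x00" + b"-lh5-" + b"\x00" * 8
print(read_method_id(sample))  # lh5
```

Checking the framing hyphens is a cheap way to reject files that are not LZH archives before any further parsing is attempted.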
Canonical LZH
LHarc compresses files using an algorithm from Yoshizaki's earlier LZHUF product. LZHUF was derived from LZARI, developed by Haruhiko Okumura (奥村晴彦, Okumura Haruhiko), but replaces arithmetic coding with Huffman coding. LZARI itself combines Lempel–Ziv–Storer–Szymanski (LZSS) with arithmetic coding.
- lh0
- No compression method is applied to the source data.
- lh1
- This method was introduced in LHarc version 1.
- It uses a 4 KiB sliding window with a maximum match length of 60 bytes. Dynamic Huffman encoding is used.
- lh2
- lh1 variant. This method uses an 8 KiB sliding window with a maximum match length of 256 bytes. Dynamic Huffman encoding is used.
- lh3
- lh2 variant with Static Huffman.
- lh4, lh5, lh6, lh7
- Methods 4, 5, 6 and 7 use sliding windows of 4, 8, 32 and 64 KiB respectively, each with a maximum match length of 256 bytes. Static Huffman encoding is used. lh5 was first introduced in LHarc 2, followed by lh6 in LHA 2.66 (MS-DOS) and lh7 in LHA 2.67 beta (MS-DOS). LHA itself never compresses into lh4.
- lhd
- Technically this is not a compression method, but it is used in .LZH archives to indicate that the compressed object is an empty directory.
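The parameters listed above can be collected into a small lookup table, sketched here in Python. The names `CANONICAL_METHODS` and `window_size` are hypothetical, and the lh3 entry assumes that "lh2 variant" means it keeps the lh2 window and match length.

```python
KIB = 1024

# method -> (sliding window in bytes, maximum match length in bytes, Huffman kind)
CANONICAL_METHODS = {
    "lh1": (4 * KIB, 60, "dynamic"),
    "lh2": (8 * KIB, 256, "dynamic"),
    "lh3": (8 * KIB, 256, "static"),   # assumption: inherits lh2's sizes
    "lh4": (4 * KIB, 256, "static"),
    "lh5": (8 * KIB, 256, "static"),
    "lh6": (32 * KIB, 256, "static"),
    "lh7": (64 * KIB, 256, "static"),
}

# Entries that carry no compressed stream at all.
STORED_METHOD = "lh0"      # data stored without compression
DIRECTORY_MARKER = "lhd"   # empty directory entry


def window_size(method: str) -> int:
    """Sliding window size in bytes; 0 for entries with no compression."""
    if method in (STORED_METHOD, DIRECTORY_MARKER):
        return 0
    return CANONICAL_METHODS[method][0]


print(window_size("lh5"))  # 8192
```

A decoder could use such a table to size its history buffer before reading any compressed data, and to reject method IDs it does not recognise.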
Joe Jared extensions
Joe Jared extended LZSS to use larger dictionaries.
- lh8, lh9, lha, lhb, lhc, lhe
- Dictionary (sliding window) sizes are 64, 128, 256, 512, 1024, 2048 KiB respectively.
Jared ported LZH to Atari. That lh8 uses the same dictionary size as lh7 was an oversight. Files using the higher-numbered methods are unlikely to exist, as Jared only considered them planned features.[3]
UNLHA32 extensions
UNLHA32.DLL uses its own method for testing purposes.
- lhx
- It uses a 128–256 KiB dictionary.
PMarc extensions
These compression methods were introduced by PMarc, a CP/M archiver written by Miyo. Such archives usually have a .PMA extension.
- pc1
- PopCom compressed executable archive. Details unknown.
- pm0
- No compression method is applied to the source data.
- pm1
- 8 KiB sliding window, static Huffman. Seldom generated; the decompressor was reverse-engineered.[4]
- pm2
- lh5 variant with a 4 KiB sliding window.
- pms
- Indicates a PMarc self-extracting archive. It should be skipped to reveal the real format.
LArc extensions
LArc uses the same file format as .LZH but the extension ".LZS". It was written by Kazuhiko Miki, Haruhiko Okumura and Ken Masuyama.[5] The program appears to predate LZH. It uses a binary search tree for LZ matching.[6]
- lzs
- It uses a 2 KiB sliding window with a maximum match length of 17 bytes.
- lz2
- Similar to lzs, except that the dictionary size and match length can be changed.
- lz3
- Unknown.
- lz4
- No compression method is applied to the source data.
- lz5
- It uses a 4 KiB sliding window with a maximum match length of 17 bytes.
- lz7, lz8
- Unknown.
Common implementations appear to only support lzs, lz5, plus the storage-only lz4.
Issues
LHICE/ICE
Copies of LHICE marked as version 1.14 are in circulation. According to Okumura, LHICE was not written by Yoshi.[7]
y2k11 bug
Because of a bug, some utilities read DOS timestamps in Level 0 and 1 headers that fall after the year 2011 as 1980, and need to be patched. The bug interprets the unsigned 7-bit year bitfield, which counts years from 1980, as a 5-bit number, so the year wraps after 1980 + 31 = 2011. The maximum year should instead be 1980 + 127 = 2107.[8][9]
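The effect of the truncated year field can be reproduced with a short Python sketch. It relies on the standard DOS date layout (the top 7 bits of the 16-bit date word hold the year minus 1980, followed by 4 bits of month and 5 bits of day). The function names are hypothetical, and masking to 5 bits is only an assumed model of how the buggy utilities behave.

```python
def encode_dos_date(year: int, month: int, day: int) -> int:
    """Pack a date into a 16-bit DOS date word (7-bit year offset, 4-bit month, 5-bit day)."""
    return ((year - 1980) << 9) | (month << 5) | day


def decode_dos_year(date_word: int, year_bits: int = 7) -> int:
    """Extract the year, keeping only the low `year_bits` bits of the year field."""
    year_field = (date_word >> 9) & ((1 << year_bits) - 1)
    return 1980 + year_field


date_2012 = encode_dos_date(2012, 1, 1)
print(decode_dos_year(date_2012))               # 2012 (correct 7-bit read)
print(decode_dos_year(date_2012, year_bits=5))  # 1980 (truncated 5-bit read)
```

The year 2012 is stored as an offset of 32, which needs a sixth bit. A 5-bit mask drops that bit and leaves an offset of 0, which decodes to 1980.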
The newer Level 2 and 3 headers use a 32-bit Unix time instead, which suffers from the Year 2038 problem.[10]
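The 2038 limit follows directly from the width of the field, as the Python sketch below illustrates. It assumes the four bytes are little-endian and are read as a signed value, which is how Unix time has traditionally been handled; neither detail is stated in the text above, and `unix32_to_datetime` is a hypothetical helper.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def unix32_to_datetime(raw: bytes) -> datetime:
    """Decode 4 little-endian bytes as a signed 32-bit count of seconds since the epoch."""
    (seconds,) = struct.unpack("<i", raw)
    return EPOCH + timedelta(seconds=seconds)


# Largest value that fits, and the bit pattern one second later.
last_ok = unix32_to_datetime(struct.pack("<I", 2**31 - 1))
wrapped = unix32_to_datetime(struct.pack("<I", 2**31))

print(last_ok)  # 2038-01-19 03:14:07+00:00
print(wrapped)  # 1901-12-13 20:45:52+00:00
```

A reader that treats the same four bytes as unsigned would not wrap in 2038, so the practical impact depends on how each implementation decodes the field.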
Header size
According to Micco, the author of the popular LHA library UNLHA32.DLL, many LHA implementations do not check the length of LHA file headers when reading an archive. Two problems can follow: naive implementations that assume the 4 KB maximum size from the original specification may suffer a buffer overrun, and antivirus software may skip files with such large headers and fail to scan them for viruses. A similar problem exists with ARJ. Micco reported the problem to Japanese authorities, but they did not consider it a valid vulnerability.[11]
Micco went so far as to end development of UNLHA32 and advise people to abandon the format. Nevertheless, Micco returned in 2017 to fix a DLL hijacking issue.
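The missing check described in this section amounts to validating a declared header length before using it. The Python sketch below shows the idea only: `read_header`, `MAX_HEADER_SIZE` and the way the declared size reaches the function are all hypothetical, and the code does not parse any real LHA header fields.

```python
import io
from typing import BinaryIO

MAX_HEADER_SIZE = 4096  # the 4 KB limit attributed to the original specification


def read_header(stream: BinaryIO, declared_size: int,
                max_size: int = MAX_HEADER_SIZE) -> bytes:
    """Read `declared_size` header bytes, refusing sizes the caller cannot hold."""
    # Validate the length taken from the archive before trusting it.
    if declared_size < 0 or declared_size > max_size:
        raise ValueError(f"header size {declared_size} outside 0..{max_size}")
    data = stream.read(declared_size)
    # A short read means the archive ends inside the header.
    if len(data) != declared_size:
        raise ValueError("archive truncated inside header")
    return data


archive = io.BytesIO(b"x" * 100)
print(len(read_header(archive, 64)))  # 64
```

Whether an oversized header is rejected, as here, or handled with a larger buffer is a design choice. A scanner would probably want to flag such a file, not silently skip it.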
References
- "Archived copy". Archived from the original on 2007-08-19. Retrieved 2007-10-05.
- "Windows 7 で (LZH の圧縮に Microsoft) フォルダーのアドインをインストールできません". Support.microsoft.com. Retrieved 2016-07-17.
- Jared (1998). lzhformat.html
- "fragglet/lhasa". GitHub.
- "圧縮データの拡張子 ".LZS" | 圧縮・解凍ソフトのガイド". Lzh-zip.com. Retrieved 2016-07-17.
- "Data Compression Algorithms of LARC and LHarc". GameDev.net.
- "History of Data Compression in Japan". Oku.edu.mie-u.ac.jp. Retrieved 12 July 2016.
- "Aminet - util/arc/lha138pch.lha". Aminet.net. Retrieved 12 July 2016.
- "Aminet - util/arc/lha_68k.lha". Aminet.net. Retrieved 12 July 2016.
- Nifty's LHA Format Notes, Other data formats.
- "LZH書庫のヘッダー処理における脆弱性について(2010年版)". micco.mars.jp.
External links
- A history of data compression in Japan
- LHA Notes Document about LHA.
- jLHA – LHA library for Java
- LHA for Unix (in Japanese)
- Hacking Data Compression Lesson 11 LZH, LZARI, and LZB
- NSRL Magic File contains PMarc info
- Explzh Current Windows 7 archiver for LZH/LHA. (Besides LZH it supports RAR, Zip, 7Z, ACE, Tar, Cab & others)
- lhasa a cross-platform, open source LHA decompressor (+UNLHA32, PMArc, LArc extensions)
- lzh format document describing LZH header format.