.PU
.TH bzip2 1
.SH NAME
bzip2, bunzip2 \- a block-sorting file compressor, v1.0.7
.br
bzcat \- decompresses files to stdout
.br
bzip2recover \- recovers data from damaged bzip2 files

.SH SYNOPSIS
.ll +8
.B bzip2
.RB [ " \-cdfkqstvzVL123456789 " ]
[
.I "filenames \&..."
]
.ll -8
.br
.B bunzip2
.RB [ " \-fkvsVL " ]
[
.I "filenames \&..."
]
.br
.B bzcat
.RB [ " \-s " ]
[
.I "filenames \&..."
]
.br
.B bzip2recover
.I "filename"

.SH DESCRIPTION
.I bzip2
compresses files using the Burrows-Wheeler block sorting
text compression algorithm, and Huffman coding. Compression is
generally considerably better than that achieved by more conventional
LZ77/LZ78-based compressors, and approaches the performance of the PPM
family of statistical compressors.

The command-line options are deliberately very similar to
those of
.IR "GNU gzip" ,
but they are not identical.

.I bzip2
expects a list of file names to accompany the
command-line flags. Each file is replaced by a compressed version of
itself, with the name "original_name.bz2".
Each compressed file
has the same modification date, permissions, and, when possible,
ownership as the corresponding original, so that these properties can
be correctly restored at decompression time. File name handling is
naive in the sense that there is no mechanism for preserving original
file names, permissions, ownerships or dates in filesystems which lack
these concepts, or have serious file name length restrictions, such as
MS-DOS.

.I bzip2
and
.I bunzip2
will by default not overwrite existing
files. If you want this to happen, specify the \-f flag.

If no file names are specified,
.I bzip2
compresses from standard
input to standard output. In this case,
.I bzip2
will decline to
write compressed output to a terminal, as this would be entirely
incomprehensible and therefore pointless.

.I bunzip2
(or
.IR "bzip2 \-d" )
decompresses all
specified files. Files which were not created by
.I bzip2
will be detected and ignored, and a warning issued.
.I bzip2
attempts to guess the filename for the decompressed file
from that of the compressed file as follows:

       filename.bz2    becomes   filename
       filename.bz     becomes   filename
       filename.tbz2   becomes   filename.tar
       filename.tbz    becomes   filename.tar
       anyothername    becomes   anyothername.out

If the file name does not end in one of the recognised endings,
.IR .bz2 ,
.IR .bz ,
.I .tbz2
or
.IR .tbz ,
.I bzip2
complains that it cannot
guess the name of the original file, and uses the original name
with
.I .out
appended.
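.PP
The naming rules above can be sketched as a small Python function.
This is only an illustration of the table, not the code
.I bzip2
itself uses; the names SUFFIX_RULES and guess_output_name are
hypothetical.
.PP
.nf
```python
# Suffix rules copied from the table above.
SUFFIX_RULES = [
    (".bz2", ""),
    (".bz", ""),
    (".tbz2", ".tar"),
    (".tbz", ".tar"),
]


def guess_output_name(compressed_name):
    """Return the name a decompressed file would get under the rules above."""
    for suffix, replacement in SUFFIX_RULES:
        if compressed_name.endswith(suffix):
            return compressed_name[: -len(suffix)] + replacement
    # Unrecognised ending: keep the original name and append .out.
    return compressed_name + ".out"
```
.fi
.PP
Under these rules, guess_output_name("archive.tbz2") gives "archive.tar",
while a name with no recognised ending simply gains ".out".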

As with compression, supplying no
filenames causes decompression from
standard input to standard output.

.I bunzip2
will correctly decompress a file which is the
concatenation of two or more compressed files. The result is the
concatenation of the corresponding uncompressed files. Integrity
testing (\-t)
of concatenated
compressed files is also supported.
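.PP
The concatenation property can be tried without the command-line tool.
The sketch below assumes Python's standard bz2 module (a wrapper around
libbzip2) as a stand-in for
.I bunzip2;
the variable names are illustrative.
.PP
.nf
```python
import bz2

first = bz2.compress(b"first file contents. ")
second = bz2.compress(b"second file contents.")

# Joining two complete compressed streams gives a valid multi-stream file.
combined = first + second

# bz2.decompress reads every stream and returns the joined plain data.
restored = bz2.decompress(combined)
```
.fi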

You can also compress or decompress files to the standard output by
giving the \-c flag. Multiple files may be compressed and
decompressed like this. The resulting outputs are fed sequentially to
stdout. Compression of multiple files
in this manner generates a stream
containing multiple compressed file representations. Such a stream
can be decompressed correctly only by
.I bzip2
version 0.9.0 or
later. Earlier versions of
.I bzip2
will stop after decompressing
the first file in the stream.

.I bzcat
(or
.IR "bzip2 \-dc" )
decompresses all specified files to
the standard output.

.I bzip2
will read arguments from the environment variables
.I BZIP2
and
.IR BZIP ,
in that order, and will process them
before any arguments read from the command line. This gives a
convenient way to supply default arguments.

Compression is always performed, even if the compressed
file is slightly
larger than the original. Files of less than about one hundred bytes
tend to get larger, since the compression mechanism has a constant
overhead in the region of 50 bytes. Random data (including the output
of most file compressors) is coded at about 8.05 bits per byte, giving
an expansion of around 0.5%.
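.PP
Both effects are easy to observe. The following sketch assumes
Python's standard bz2 module in place of the
.I bzip2
command and uses pseudo-random bytes as a stand-in for random data;
exact sizes will vary with the library version, so none are quoted here.
.PP
.nf
```python
import bz2
import random


def expansion_ratio(data):
    """Compressed size divided by original size (above 1.0 means growth)."""
    return len(bz2.compress(data)) / len(data)


tiny = b"hello"
rng = random.Random(0)  # fixed seed so the run is repeatable
noise = bytes(rng.getrandbits(8) for _ in range(200000))

tiny_ratio = expansion_ratio(tiny)    # dominated by the constant overhead
noise_ratio = expansion_ratio(noise)  # slightly above 1.0
```
.fi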

As a self-check for your protection,
.I bzip2
uses 32-bit CRCs to
make sure that the decompressed version of a file is identical to the
original. This guards against corruption of the compressed data, and
against undetected bugs in
.I bzip2
(hopefully very unlikely). The
chance of data corruption going undetected is microscopic, about one
in four billion for each file processed. Be aware, though, that
the check occurs upon decompression, so it can only tell you that
something is wrong. It can't help you
recover the original uncompressed
data. You can use
.I bzip2recover
to try to recover data from
damaged files.
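.PP
A sketch of the check in action, assuming Python's standard bz2 module
rather than the
.I bzip2
command. The byte offset used to reach the stored CRC is an assumption
about the stream layout (4-byte stream header, 6-byte block marker, then
the block CRC) and is not stated in this manual page.
.PP
.nf
```python
import bz2

original = b"block-sorting compressors reward repetition. " * 50
packed = bytearray(bz2.compress(original))

# Assumed layout: byte 10 is the first byte of the first block's CRC.
# Flipping one bit there makes the stored CRC disagree with the data.
packed[10] ^= 0x01


def decompresses_cleanly(blob):
    """Return True if blob decompresses without an integrity error."""
    try:
        bz2.decompress(bytes(blob))
    except (OSError, ValueError):
        return False
    return True


damaged_ok = decompresses_cleanly(packed)
```
.fi
.PP
The failure reports only that something is wrong; as noted above, it
does not recover the data.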

Return values: 0 for a normal exit, 1 for environmental problems (file
not found, invalid flags, I/O errors, etc.), 2 to indicate a corrupt
compressed file, 3 for an internal consistency error (e.g. a bug) which
caused
.I bzip2
to panic.
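.PP
A script that wraps
.I bzip2
might translate these statuses as follows. This is a sketch only: the
helper names describe_exit and check_archive are hypothetical, and
check_archive assumes a
.I bzip2
executable is available on the search path.
.PP
.nf
```python
import subprocess

# Meanings taken from the paragraph above.
EXIT_MEANINGS = {
    0: "normal exit",
    1: "environmental problem (file not found, invalid flags, I/O error)",
    2: "corrupt compressed file",
    3: "internal consistency error",
}


def describe_exit(code):
    """Map a bzip2 exit status to a short description."""
    return EXIT_MEANINGS.get(code, "unexpected status")


def check_archive(path):
    """Run bzip2 -t on path and describe the outcome."""
    result = subprocess.run(["bzip2", "-t", path], capture_output=True)
    return describe_exit(result.returncode)
```
.fi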
.SH OPTIONS
.TP
.B \-c --stdout
Compress or decompress to standard output.
.TP
.B \-d --decompress
Force decompression.
.IR bzip2 ,
.I bunzip2
and
.I bzcat
are
really the same program, and the decision about what actions to take is
done on the basis of which name is used. This flag overrides that
mechanism, and forces
.I bzip2
to decompress.
.TP
.B \-z --compress
The complement to \-d: forces compression, regardless of the
invocation name.
.TP
.B \-t --test
Check integrity of the specified file(s), but don't decompress them.
This really performs a trial decompression and throws away the result.
.TP
.B \-f --force
Force overwrite of output files. Normally,
.I bzip2
will not overwrite
existing output files. Also forces
.I bzip2
to break hard links
to files, which it otherwise wouldn't do.

bzip2 normally declines to decompress files which don't have the
correct magic header bytes. If forced (\-f), however, it will pass
such files through unmodified. This is how GNU gzip behaves.
.TP
.B \-k --keep
Keep (don't delete) input files during compression
or decompression.
.TP
.B \-s --small
Reduce memory usage, for compression, decompression and testing. Files
are decompressed and tested using a modified algorithm which only
requires 2.5 bytes per block byte. This means any file can be
decompressed in 2300k of memory, albeit at about half the normal speed.

During compression, \-s selects a block size of 200k, which limits
memory use to around the same figure, at the expense of your compression
ratio. In short, if your machine is low on memory (8 megabytes or
less), use \-s for everything. See MEMORY MANAGEMENT below.
.TP
.B \-q --quiet
Suppress non-essential warning messages. Messages pertaining to
I/O errors and other critical events will not be suppressed.
.TP
.B \-v --verbose
Verbose mode -- show the compression ratio for each file processed.
Further \-v's increase the verbosity level, spewing out lots of
information which is primarily of interest for diagnostic purposes.
.TP
.B \-L --license -V --version
Display the software version, license terms and conditions.
.TP
.B \-1 (or \-\-fast) to \-9 (or \-\-best)
Set the block size to 100 k, 200 k ... 900 k when compressing. Has no
effect when decompressing. See MEMORY MANAGEMENT below.
The \-\-fast and \-\-best aliases are primarily for GNU gzip
compatibility. In particular, \-\-fast doesn't make things
significantly faster.
And \-\-best merely selects the default behaviour.
.TP
.B \--
Treats all subsequent arguments as file names, even if they start
with a dash. This is so you can handle files with names beginning
with a dash, for example: bzip2 \-- \-myfilename.
.TP
.B \--repetitive-fast --repetitive-best
These flags are redundant in versions 0.9.5 and above. They provided
some coarse control over the behaviour of the sorting algorithm in
earlier versions, which was sometimes useful. 0.9.5 and above have an
improved algorithm which renders these flags irrelevant.

.SH MEMORY MANAGEMENT
.I bzip2
compresses large files in blocks. The block size affects
both the compression ratio achieved, and the amount of memory needed for
compression and decompression. The flags \-1 through \-9
specify the block size to be 100,000 bytes through 900,000 bytes (the
default) respectively. At decompression time, the block size used for
compression is read from the header of the compressed file, and
.I bunzip2
then allocates itself just enough memory to decompress
the file. Since block sizes are stored in compressed files, it follows
that the flags \-1 to \-9 are irrelevant to and so ignored
during decompression.
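.PP
The stored block size can be inspected directly. The sketch below
assumes Python's standard bz2 module, whose compresslevel argument
corresponds to the flags \-1 to \-9, and assumes the stream header
consists of the bytes "BZh" followed by an ASCII digit; that layout
comes from the bzip2 file format, not from this manual page. The name
recorded_level is hypothetical.
.PP
.nf
```python
import bz2

payload = b"example payload " * 100


def recorded_level(compressed):
    """Read the block-size digit (1 to 9) from a bzip2 stream header."""
    # Assumed header layout: the bytes "BZh" followed by an ASCII digit.
    if compressed[:3] != b"BZh":
        raise ValueError("not a bzip2 stream")
    return int(chr(compressed[3]))


levels = [
    recorded_level(bz2.compress(payload, compresslevel=n))
    for n in (1, 5, 9)
]
```
.fi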

Compression and decompression requirements,
in bytes, can be estimated as:

       Compression:   400k + ( 8 x block size )

       Decompression: 100k + ( 4 x block size ), or
                      100k + ( 2.5 x block size )
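.PP
These estimates can be written out as a short calculation. The sketch
assumes that "k" means 1000 bytes, in line with the 100,000 to 900,000
byte block sizes given above; the function names are illustrative.
.PP
.nf
```python
K = 1000  # assumption: "k" in the formulas means 1000 bytes


def block_size(level):
    """Block size in bytes for flags -1 through -9."""
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    return level * 100 * K


def compress_memory(level):
    """Estimated bytes needed to compress at the given level."""
    return 400 * K + 8 * block_size(level)


def decompress_memory(level, small=False):
    """Estimate for normal decompression, or for -s when small is True."""
    factor = 2.5 if small else 4
    return int(100 * K + factor * block_size(level))
```
.fi
.PP
For the default level 9 this gives 7600k for compression, 3700k for
normal decompression and 2350k with \-s.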

Larger block sizes give rapidly diminishing marginal returns. Most of
the compression comes from the first two or three hundred k of block
size, a fact worth bearing in mind when using
.I bzip2
on small machines.
It is also important to appreciate that the decompression memory
requirement is set at compression time by the choice of block size.

For files compressed with the default 900k block size,
.I bunzip2
will require about 3700 kbytes to decompress. To support decompression
of any file on a 4 megabyte machine,
.I bunzip2
has an option to
decompress using approximately half this amount of memory, about 2300
kbytes. Decompression speed is also halved, so you should use this
option only where necessary. The relevant flag is \-s.

In general, try to use the largest block size memory constraints allow,
since that maximises the compression achieved. Compression and
decompression speed are virtually unaffected by block size.

Another significant point applies to files which fit in a single block
-- that means most files you'd encounter using a large block size. The
amount of real memory touched is proportional to the size of the file,
since the file is smaller than a block. For example, compressing a file
20,000 bytes long with the flag \-9 will cause the compressor to
allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560
kbytes of it. Similarly, the decompressor will allocate 3700k but only
touch 100k + 20000 * 4 = 180 kbytes.
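.PP
The same arithmetic as a sketch, assuming "k" means 1000 bytes and
that a file larger than one block touches memory for a full block.
The function names are illustrative.
.PP
.nf
```python
K = 1000  # assumption: "k" means 1000 bytes


def touched_when_compressing(file_size, block_size=900 * K):
    """Bytes touched: fixed cost plus 8 bytes per input byte in a block."""
    return 400 * K + 8 * min(file_size, block_size)


def touched_when_decompressing(file_size, block_size=900 * K):
    """Bytes touched: fixed cost plus 4 bytes per output byte in a block."""
    return 100 * K + 4 * min(file_size, block_size)
```
.fi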

Here is a table which summarises the maximum memory usage for different
block sizes. Also recorded is the total compressed size for 14 files of
the Calgary Text Compression Corpus totalling 3,141,622 bytes. This
column gives some feel for how compression varies with block size.
These figures tend to understate the advantage of larger block sizes for
larger files, since the Corpus is dominated by smaller files.

           Compress   Decompress   Decompress   Corpus
    Flag     usage      usage       -s usage     Size

     -1      1200k       500k         350k      914704
     -2      2000k       900k         600k      877703
     -3      2800k      1300k         850k      860338
     -4      3600k      1700k        1100k      846899
     -5      4400k      2100k        1350k      845160
     -6      5200k      2500k        1600k      838626
     -7      6100k      2900k        1850k      834096
     -8      6800k      3300k        2100k      828642
     -9      7600k      3700k        2350k      828642

.SH RECOVERING DATA FROM DAMAGED FILES
.I bzip2
compresses files in blocks, usually 900kbytes long. Each
block is handled independently. If a media or transmission error causes
a multi-block .bz2
file to become damaged, it may be possible to
recover data from the undamaged blocks in the file.

The compressed representation of each block is delimited by a 48-bit
pattern, which makes it possible to find the block boundaries with
reasonable certainty. Each block also carries its own 32-bit CRC, so
damaged blocks can be distinguished from undamaged ones.
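.PP
Searching for the delimiter can be sketched as a bit-by-bit scan. This
is not the code of
.I bzip2recover;
it is an illustration that assumes Python's standard bz2 module, and it
assumes the 48-bit pattern has the value 0x314159265359 and that bits
are stored most significant first. Both assumptions come from the bzip2
file format rather than from this manual page.
.PP
.nf
```python
import bz2
import random

BLOCK_MAGIC = 0x314159265359  # assumed value of the 48-bit block marker


def block_bit_offsets(data):
    """Return the bit offsets at which the 48-bit block marker appears."""
    mask = (1 << 48) - 1
    window = 0
    bit_index = 0
    offsets = []
    for byte in data:
        for shift in range(7, -1, -1):  # most significant bit first
            window = ((window << 1) | ((byte >> shift) & 1)) & mask
            bit_index += 1
            if bit_index >= 48 and window == BLOCK_MAGIC:
                offsets.append(bit_index - 48)
    return offsets


# Pseudo-random input larger than one 100k block, compressed with level 1.
rng = random.Random(1)
sample = bytes(rng.getrandbits(8) for _ in range(150000))
offsets = block_bit_offsets(bz2.compress(sample, compresslevel=1))
```
.fi
.PP
The first block starts right after the 4-byte stream header, so its
marker is found at bit offset 32; later blocks need not start on a byte
boundary, which is why the scan works bit by bit.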

.I bzip2recover
is a simple program whose purpose is to search for
blocks in .bz2 files, and write each block out into its own .bz2
file. You can then use
.I bzip2
\-t
to test the
integrity of the resulting files, and decompress those which are
undamaged.

.I bzip2recover
takes a single argument, the name of the damaged file,
and writes a number of files "rec00001file.bz2",
"rec00002file.bz2", etc, containing the extracted blocks.
The output filenames are designed so that the use of
wildcards in subsequent processing -- for example,
"bzip2 \-dc rec*file.bz2 > recovered_data" -- processes the files in
the correct order.
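.PP
The same post-processing can be sketched in Python, assuming the
standard bz2, glob and os modules. The function name join_recovered and
its default pattern are hypothetical; the sketch skips pieces that fail
to decompress instead of stopping.
.PP
.nf
```python
import bz2
import glob
import os


def join_recovered(directory, pattern="rec*.bz2"):
    """Decompress recovered pieces in name order, skipping damaged ones."""
    pieces = []
    skipped = []
    # Zero-padded numbers mean plain lexical sorting matches block order.
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        with open(path, "rb") as handle:
            blob = handle.read()
        try:
            pieces.append(bz2.decompress(blob))
        except (OSError, ValueError):
            skipped.append(os.path.basename(path))
    return b"".join(pieces), skipped
```
.fi
.PP
The second return value lists the pieces that were left out, so the
caller knows where the recovered data has gaps.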

.I bzip2recover
should be of most use dealing with large .bz2
files, as these will contain many blocks. It is clearly
futile to use it on damaged single-block files, since a
damaged block cannot be recovered. If you wish to minimise
any potential data loss through media or transmission errors,
you might consider compressing with a smaller
block size.

.SH PERFORMANCE NOTES
The sorting phase of compression gathers together similar strings in the
file. Because of this, files containing very long runs of repeated
symbols, like "aabaabaabaab ..." (repeated several hundred times) may
compress more slowly than normal. Versions 0.9.5 and above fare much
better than previous versions in this respect. The ratio between
worst-case and average-case compression time is in the region of 10:1.
For previous versions, this figure was more like 100:1. You can use the
\-vvvv option to monitor progress in great detail, if you want.

Decompression speed is unaffected by these phenomena.

.I bzip2
usually allocates several megabytes of memory to operate
in, and then charges all over it in a fairly random fashion. This means
that performance, both for compressing and decompressing, is largely
determined by the speed at which your machine can service cache misses.
Because of this, small changes to the code to reduce the miss rate have
been observed to give disproportionately large performance improvements.
I imagine
.I bzip2
will perform best on machines with very large caches.

.SH CAVEATS
I/O error messages are not as helpful as they could be.
.I bzip2
tries hard to detect I/O errors and exit cleanly, but the details of
what the problem is sometimes seem rather misleading.

This manual page pertains to version 1.0.7 of
.IR bzip2 .
Compressed data created by this version is entirely forwards and
backwards compatible with the previous public releases, versions
0.1pl2, 0.9.0, 0.9.5, 1.0.0, 1.0.1, 1.0.2 and above, but with the following
exception: 0.9.0 and above can correctly decompress multiple
concatenated compressed files. 0.1pl2 cannot do this; it will stop
after decompressing just the first file in the stream.

.I bzip2recover
versions prior to 1.0.2 used 32-bit integers to represent
bit positions in compressed files, so they could not handle compressed
files more than 512 megabytes long. Versions 1.0.2 and above use
64-bit ints on some platforms which support them (GNU supported
targets, and Windows). To establish whether or not bzip2recover was
built with such a limitation, run it without arguments. In any event
you can build yourself an unlimited version if you can recompile it
with MaybeUInt64 set to be an unsigned 64-bit integer.
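.PP
The 512 megabyte figure follows from counting bits rather than bytes.
A short worked calculation, assuming the 32-bit position counter is
unsigned and that a megabyte here means 1024 * 1024 bytes (the function
name is illustrative):
.PP
.nf
```python
BITS_PER_BYTE = 8


def max_file_bytes(position_bits):
    """Largest file whose bit positions fit in an unsigned counter."""
    return (2 ** position_bits) // BITS_PER_BYTE


limit_32 = max_file_bytes(32)
limit_32_in_mib = limit_32 // (1024 * 1024)
```
.fi
.PP
A 32-bit counter addresses 2^32 bits, which is 2^29 bytes, or 512
megabytes.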
.SH AUTHOR
Julian Seward, jseward@acm.org.

https://sourceware.org/bzip2/

The ideas embodied in
.I bzip2
are due to (at least) the following
people: Michael Burrows and David Wheeler (for the block sorting
transformation), David Wheeler (again, for the Huffman coder), Peter
Fenwick (for the structured coding model in the original
.IR bzip ,
and many refinements), and Alistair Moffat, Radford Neal and Ian Witten
(for the arithmetic coder in the original
.IR bzip ).
I am much
indebted for their help, support and advice. See the manual in the
source distribution for pointers to sources of documentation. Christian
von Roques encouraged me to look for faster sorting algorithms, so as to
speed up compression. Bela Lubkin encouraged me to improve the
worst-case compression performance.
Donna Robinson XMLised the documentation.
The bz* scripts are derived from those of GNU gzip.
Many people sent patches, helped
with portability problems, lent machines, gave advice and were generally
helpful.