The Linux Newbie Guide
All rights reserved, please indicate the source when citing
 

 File Compression

1.0 Introduction to File Compression
1.1 Common Compressed File Formats
       .gz Files
           gzip : Compress/Decompress .gz files
           gunzip : Decompress .gz files
           zcat : Read .gz compressed files
       .bz2 Files
           bzip2 : Compress/Decompress .bz2 files
           bunzip2 : Decompress .bz2 files
           bzcat : Read .bz2 compressed files
           bzip2recover : Recover data from damaged .bz2 files
       .xz Files
           xz : Compress/Decompress .xz files
           unxz : Decompress .xz files
           xzcat : Read .xz compressed files
       .Z Files
           compress : Compress/Decompress .Z files
           uncompress : Decompress .Z files
       .zip Files
           zip : Compress files into zip format
           unzip : Decompress zip files
           zipinfo : List information about zip files
1.2 File Archiving
       .tar Files
           tarball : A compressed tar file
           tar : Archive/Extract files from a tar file
           tar bomb
  1.0 Introduction to File Compression

The purpose of compressing files is to reduce the size of data. Even users who rarely work on a PC come across compressed files in various forms, such as MP3 music or photos taken with a smartphone. However, formats like MP3 for music or JPG for photos use "lossy compression," which discards information that is hard to perceive, sacrificing some quality in exchange for a much smaller file. Some audiophiles claim to hear the distortion caused by lossy compression in MP3s and prefer not to listen to digital music in that format. Similarly, many professional photographers prefer working with undistorted RAW files.

However, most data cannot tolerate any distortion. For example, if you deposit $1000 in a bank, you wouldn't accept it becoming $900 because of compression. Compression that allows no distortion is called "lossless compression." Lossless compression restores the data exactly upon decompression, but its compression ratio is usually lower than that of lossy compression.

Now, why is data compressible? Let's take an example: it is said that the longest place name in the United States is "Chargoggagoggmanchauggagoggchaubunagungamaugg," which consists of 45 letters. If you look closely, you'll notice that the letter groups "gg" and "ago" repeat several times. To compress this place name, we can use "!" to represent "gg" and "@" to represent "ago," giving "Chargo!@!manchau!@!chaubunagungamau!", which is only 36 characters long. Other rules could store this place name in even fewer bytes.
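The substitution described above can be tried out directly. The following is only a toy sketch in Python: the function names shrink and restore are made up for this illustration, and real compressors such as gzip build their substitution tables automatically rather than by hand.

```python
# Toy dictionary substitution, following the place-name example above.
# The symbols "!" and "@" are safe only because they never occur in the name.
NAME = "Chargoggagoggmanchauggagoggchaubunagungamaugg"

# Order matters: "gg" is replaced first, then "ago" is searched in the result.
RULES = [("gg", "!"), ("ago", "@")]


def shrink(text, rules=RULES):
    """Replace each repeated fragment with its one-character symbol."""
    for fragment, symbol in rules:
        text = text.replace(fragment, symbol)
    return text


def restore(text, rules=RULES):
    """Undo shrink() by expanding the symbols in reverse order."""
    for fragment, symbol in reversed(rules):
        text = text.replace(symbol, fragment)
    return text


packed = shrink(NAME)
print(len(NAME), NAME)
print(len(packed), packed)
print(restore(packed) == NAME)
```

Note that whoever decompresses the name must know the same two rules, which is why the compressed data and the decompression software have to agree on a format.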

Compression efficiency varies depending on the nature of the data. Some compression techniques are particularly effective for text files, while others achieve higher compression ratios for executable files. However, compressing a file that has already been compressed, whether by a lossy or a lossless method, may actually increase its size. The variety of compression software exists because different programs use different compression algorithms.
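The effect of compressing already-compressed data can be observed with a short experiment. This is a minimal sketch using Python's standard gzip module; the sample data (20,000 random lowercase letters from a fixed seed) is an assumption chosen for the illustration, and the exact byte counts will differ for other data.

```python
import gzip
import random
import string

# Deterministic sample: 20,000 random lowercase letters (hypothetical test data).
rng = random.Random(0)
data = "".join(rng.choice(string.ascii_lowercase) for _ in range(20000)).encode("ascii")

once = gzip.compress(data)    # first pass removes most of the redundancy
twice = gzip.compress(once)   # second pass has almost nothing left to remove

print("original:        ", len(data), "bytes")
print("compressed once: ", len(once), "bytes")
print("compressed twice:", len(twice), "bytes")

# Both layers must be undone to get the original bytes back.
assert gzip.decompress(gzip.decompress(twice)) == data
```

The second pass mostly adds another header and trailer around data it cannot shrink, so the result is expected to be slightly larger than the output of the first pass.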





1.1 Common Compressed File Formats

As the saying goes, whoever caused the problem is the one to fix it: a file should be decompressed with the software that compressed it. But how do you know which software was used to compress a file? Fortunately, this can usually be determined from the file extension. In general, file extensions in UNIX/Linux are only a hint. For example, a plain text file composed entirely of ASCII characters does not necessarily have the extension ".txt". To determine the file type reliably, the file command is usually used. Compression tools in UNIX/Linux, however, rely heavily on file extensions.

Different compression commands usually have their own file extensions, which indicate that a file is compressed and which compression format was used. Common extensions for compressed files in UNIX/Linux include ".gz", ".bz2", ".zip", ".Z", and ".xz", which are mainly used by free or open-source software. Some compressed file formats with licensing restrictions, such as ".rar" and ".arj", are generally avoided.
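When the extension is missing or wrong, the format can still be recognized from the first few bytes of the file, which is roughly how the file command identifies compressed data. The sketch below is an illustration only: the function name guess_format is hypothetical, and the table covers just the five formats discussed in this chapter.

```python
import bz2
import gzip
import lzma

# Leading "magic" bytes of each format discussed in this chapter.
MAGIC = [
    (b"\x1f\x8b", ".gz"),
    (b"BZh", ".bz2"),
    (b"\xfd7zXZ\x00", ".xz"),
    (b"\x1f\x9d", ".Z"),
    (b"PK\x03\x04", ".zip"),   # a non-empty zip archive
]


def guess_format(head):
    """Return the likely extension for data starting with `head`, or None."""
    for magic, extension in MAGIC:
        if head.startswith(magic):
            return extension
    return None


sample = b"hello, compression\n" * 10
print(guess_format(gzip.compress(sample)))
print(guess_format(bz2.compress(sample)))
print(guess_format(lzma.compress(sample)))
print(guess_format(sample))
```

For a file on disk, reading the first six bytes in binary mode and passing them to guess_format is enough for this table.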

In addition, the UNIX/Linux world makes wide use of "archiving," which means packaging multiple files or directories into a single file for convenient transmission or storage. Archiving itself does not involve compression, so most archives are compressed afterwards to reduce their size. The most common archiving tool is tar, which uses the file extension ".tar". A file that is both archived and compressed is referred to as a "tarball," which may have file extensions such as ".tgz", ".tbz", ".taz", ".txz", etc.



.gz Files
A ".gz" file is a file compressed by gzip, a format that is very common on UNIX/Linux systems.


.bz2 Files
A file with the extension ".bz2" has been compressed by bzip2. bzip2 is a newer compressor than gzip. It usually achieves a higher compression ratio and has become widely used, its usage is almost the same as that of gzip, and it comes with a recovery tool for damaged files. It can be regarded as a substitute for gzip. However, it has no equivalent of the gzip -r option for recursing into directories, so to compress a whole directory the usual approach is to make a tarball.



.xz Files
A file with the extension ".xz" has been compressed by xz, which usually achieves a higher compression ratio than bzip2 and can be regarded as a substitute for bzip2.
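The three formats above can be compared on the same input. This is a rough sketch using Python's standard gzip, bz2, and lzma modules (lzma writes the .xz format by default); the sample text is invented for the illustration, and the ranking of the three formats depends heavily on the data being compressed.

```python
import bz2
import gzip
import lzma

# Hypothetical sample text; real-world ratios depend heavily on the data.
text = ("The quick brown fox jumps over the lazy dog. " * 2000).encode("utf-8")

codecs = {
    ".gz": (gzip.compress, gzip.decompress),
    ".bz2": (bz2.compress, bz2.decompress),
    ".xz": (lzma.compress, lzma.decompress),
}

sizes = {}
for extension, (compress, decompress) in codecs.items():
    packed = compress(text)
    assert decompress(packed) == text   # lossless: the round trip is exact
    sizes[extension] = len(packed)
    print(f"{extension:5} {len(packed):7} bytes ({len(packed) / len(text):.2%} of original)")
```

Running the same loop on a log file, a source tree, or a binary gives a better idea of which format suits a particular kind of data.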


.Z Files
A ".Z" file is a file compressed with the compress utility, a very old compression program that was once popular but has a low compression ratio by today's standards and has gradually been phased out. Many newer Linux distributions no longer include the compress utility. However, if you need to decompress a ".Z" file, there's no need to worry: gzip can also decompress ".Z" files. Because some people may still be using very old versions of UNIX/Linux, the traditional compress command is still worth introducing.


.zip Files
The ".zip" format is a cross-platform compression format commonly encountered not only in UNIX/Linux but also in DOS/Windows, macOS on Apple computers, and the now-discontinued IBM OS/2. Microsoft Windows, from Windows XP onwards, even includes built-in functionality for extracting and creating .zip files.

Another distinguishing feature of ".zip" files is something that formats like .gz, .bz2, or .Z do not offer: the ability to both compress and archive. In other words, a ".zip" file not only compresses files but also packs multiple files, including directories, into a single file.
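The combination of archiving and compression in one ".zip" file can be illustrated with Python's standard zipfile module. This is a minimal sketch: the member names and contents are hypothetical, and the archive is built in memory so that nothing is written to disk.

```python
import io
import zipfile

# Hypothetical file names and contents, kept in memory for the sketch.
files = {
    "docs/readme.txt": "zip stores many files in one archive.\n" * 50,
    "docs/notes/todo.txt": "each member is compressed on its own.\n" * 50,
}

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    for name, content in files.items():
        archive.writestr(name, content)   # directory paths are part of the name

with zipfile.ZipFile(buffer) as archive:
    for info in archive.infolist():
        print(info.filename, info.file_size, "->", info.compress_size)
    restored = {name: archive.read(name).decode("utf-8") for name in archive.namelist()}

print(restored == files)
```

Each member keeps its own path and its own compressed size, which is the kind of per-file information that a listing of a zip archive reports.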







  1.2 File Archiving

Linux tar is a command-line utility for file archiving, optionally combined with compression. "tar" stands for "tape archive," as it was originally designed for creating archives on magnetic tape drives. However, it is now commonly used for creating and managing archives on various storage media, including hard drives, solid-state drives, and network shares.

The tar command allows you to combine multiple files and directories into a single archive file, which can then be compressed using various compression algorithms such as gzip or bzip2. Tar archives preserve file permissions, ownership, and directory structure, making them ideal for creating backups, transferring files, or distributing collections of files.

One of the advantages of using tar is its compatibility with various compression formats. It can create uncompressed tar archives, as well as compressed archives using gzip (.tar.gz), bzip2 (.tar.bz2), xz (.tar.xz), and other compression algorithms. This flexibility allows users to choose the compression format that best suits their needs in terms of file size and compression speed.
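The pairing of one archive format with several compressors can be sketched with Python's standard tarfile module, whose write modes correspond to the tarball extensions listed above. The directory name "project" and its files are hypothetical and exist only in a temporary directory created for the example.

```python
import os
import tarfile
import tempfile

# tarfile write modes and the tarball extension each one matches.
MODES = {".tar": "w", ".tar.gz": "w:gz", ".tar.bz2": "w:bz2", ".tar.xz": "w:xz"}

listings = {}
with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "project")      # hypothetical directory name
    os.makedirs(os.path.join(source, "src"))
    for relative in ("README", os.path.join("src", "main.c")):
        with open(os.path.join(source, relative), "w") as handle:
            handle.write("example line\n" * 500)

    for extension, mode in MODES.items():
        path = os.path.join(workdir, "project" + extension)
        with tarfile.open(path, mode) as archive:
            archive.add(source, arcname="project")  # adds the directory recursively
        with tarfile.open(path, "r:*") as archive:  # "r:*" detects the compression
            listings[extension] = sorted(archive.getnames())
        print(f"{extension:9} {os.path.getsize(path):6} bytes")

print(listings[".tar.gz"])
```

All four archives contain the same member list; only the outer compression layer, and therefore the file size, differs.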

The tar command provides a wide range of options and flags to control various aspects of the archiving and compression process. It allows you to specify file and directory exclusions, preserve symbolic links and special file attributes, set compression levels, and more.

In summary, Linux tar is a versatile and powerful tool for creating, managing, and compressing file archives in the Linux environment. It is widely used for tasks such as backups, software distribution, and file transfers, offering flexibility and efficiency in handling large collections of files and directories.



.tar Files
The original purpose of the ".tar" file format was to bundle multiple files (including directories) into a single file for convenient backup onto magnetic tape. Therefore, it is called a "tape archive" or tar file. Tar files typically have the ".tar" extension.
