Saturday, 30 March 2013

Large File Support in Linux

To support files larger than 2 GiB on 32-bit systems such as x86, PowerPC and MIPS, a number of changes had to be made to the kernel and the C library. This is called Large File Support (LFS). LFS should now be complete in Linux, and this article gives a short overview of the current status.
64-bit systems like Alpha, IA64 and x86-64 have no problems with large files but support the new interfaces as well. There, the new interface is mainly an alias for the normal interface.
LFS is implemented by the Linux kernel and the GNU C library (aka glibc).

Limits

LFS raises the maximum file size. On 32-bit systems the limit is 2^31 bytes (2 GiB), but by using the LFS interface on filesystems that support LFS, applications can handle files as large as 2^63 bytes.
On 64-bit systems the file size limit is 2^63 bytes unless a filesystem (like NFSv2) supports less.
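The two limits follow directly from the width of a signed file offset. As a small illustration (Python is used here only for brevity; the names MAX_OFF_32, MAX_OFF_64 and fits_in_offset are made up for this sketch and are not part of any LFS interface):

```python
# Illustrative only: models the offset limits quoted above.
MAX_OFF_32 = 2**31 - 1   # largest offset a signed 32-bit off_t can hold
MAX_OFF_64 = 2**63 - 1   # largest offset a signed 64-bit off64_t can hold
GIB = 1024**3


def fits_in_offset(offset: int, bits: int) -> bool:
    """Return True if a non-negative offset fits in a signed integer of `bits` bits."""
    return 0 <= offset <= 2**(bits - 1) - 1


print(fits_in_offset(2 * GIB - 1, 32))  # last byte position below 2 GiB
print(fits_in_offset(2 * GIB, 32))      # one byte too far for a 32-bit offset
print(fits_in_offset(2 * GIB, 64))      # no problem for a 64-bit offset
```

This is why the filesystem list further down so often says "2 GiB - 1": it is the largest value a signed 32-bit offset can represent.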

LFS in Glibc 2.1.3 and Glibc 2.2

The LFS interface in glibc 2.1.3 is complete, but the implementation is not. The 2.1.3 implementation also contains some bugs, e.g. ftello64 is broken. If you want to use the LFS interface, you need a glibc that has been compiled against headers from a kernel with LFS support.
Since glibc 2.1.3 was released before LFS support went into Linux 2.3.X/2.4.0-testX, some fixes had to be made to glibc to support the kernel routines. The current stable release of glibc is glibc 2.2.3 (2.2 was released in November 2000), and it supports all the features of Linux 2.4.0. Glibc 2.2.x is now used by most of the major distributions in their latest releases (e.g. SuSE 7.2, Red Hat 7.1). Glibc 2.2 supports the following features that glibc 2.1.3 doesn't:
  • getdents64 system call
  • 64 bit file locking interface (see below for details)
Programs compiled against glibc 2.1.3 will work on an LFS system; there's no need to recompile them (with the exception of the 64 bit fcntl locking). Only glibc needs to be updated to support LFS.
Note that glibc 2.0 and libc5 do not support LFS at all.

Locking on Large Files is Not Supported with fcntl/lockf in Glibc 2.1.x

Locking via fcntl/lockf doesn't work with large files in glibc 2.1.3. Kernel support was added in Linux 2.4.0-test7 and required incompatible changes to glibc; only glibc 2.2 handles them. This means:
  • You can't use the flags F_GETLK64, F_SETLK64 and F_SETLKW64 with fcntl when you use glibc 2.1.x. If your programs use them now, they fail. They also need to be recompiled with glibc 2.2, which supports these fcntl flags.
  • lockf64 only works on files < 2 GiB with glibc 2.1.x; it works with glibc 2.2, and no recompilation is needed.
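A quick way to check whether locking beyond 2 GiB works on a given system is to try it. The following is a minimal sketch, not a definitive test: it uses Python's fcntl.lockf as a stand-in for the C interfaces named above, and whether that call ends up in the 64-bit locking path depends on how the interpreter and C library were built (an assumption here). The function name lock_region_beyond_2gib is hypothetical.

```python
import fcntl
import os


def lock_region_beyond_2gib(path: str) -> bool:
    """Try to take and release an exclusive lock on a byte range past 2 GiB."""
    start = 2**31 + 4096   # offset that does not fit in a signed 32-bit off_t
    length = 4096
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Locks may cover a range beyond the current end of the file.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, length, start, os.SEEK_SET)
        fcntl.lockf(fd, fcntl.LOCK_UN, length, start, os.SEEK_SET)
        return True
    except OSError:
        # The filesystem or C library refused the request for this range.
        return False
    finally:
        os.close(fd)
```

Calling lock_region_beyond_2gib("/tmp/lockdemo") (the path is just an example) returns True when the range could be locked and False when the system refused it.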

LFS in the Linux Kernel

Since Linux 2.4.0-test7 most of the LFS interface is included in the kernel. The open problems and restrictions are described below.

File Systems

We can separate two levels of LFS compliance in the file systems:
  1. Full support for files > 2 GiB and O_LARGEFILE
  2. Limited LFS support: it gives proper EINVAL/EFBIG/EOVERFLOW error messages when you try to use O_LARGEFILE or positions > 2 GiB.
At least the second level should be generally reachable, but it takes some work to audit all the weird file systems.
Some bugs in NFSv2 regarding (2) have been fixed already, but some are missing (like the O_LARGEFILE check). Other file systems probably miss it too. A complete audit of all file systems is needed (see also the 2.4 kernel TODO page at http://linux24.sourceforge.net/).
The situation about the different filesystems used in Linux 2.4.0 and later can be summarized as follows:
ext2/ext3
Full support for LFS
NFSv2
Cannot handle LFS due to protocol restrictions (limited to 2 GiB - 1); limited LFS support but expect some bugs
NFSv3
The protocol is ok, but I'm not sure about the Linux implementation status
ReiserFS 3.5.x (not part of the kernel, separate patch)
Does not support LFS
ReiserFS 3.6.x (part of kernel 2.4.1 and newer)
Full support for LFS if the new on disk format is used. This format is incompatible to the format used by 3.5.x (see below for some more details).
coda
Does not work with LFS (local cache issues, protocol is ok)
UFS
Full support for LFS (although not complete vs. O_LARGEFILE flag use)
minix
limited to 2 GiB - 1 (the file size is limited to 65804 MiB; note that the filesystem size is limited to 64 MiB, but holes are allowed)
SysV (aka SCO)
limited to 2 GiB - 1
msdos
limited to 2 GiB - 1
umsdos
based on msdos, limited to 2 GiB - 1
smbfs
Older protocols are limited to 4 GiB - 1. SMB extensions allow 64 bit filesystems. Linux smbfs implementation is currently limited to 2 GiB - 1.
NCPfs
protocol is limited to 4 GiB - 1, Linux implementation to 2 GiB - 1
JFS
Should work with LFS (for details about JFS see http://oss.software.ibm.com/developer/opensource/jfs)
XFS
Should work with LFS (for details about XFS see http://linux-xfs.sgi.com/projects/xfs/)
other file systems
I don't have any information yet, feel free to send me updates.
Note for ext2
When files > 2 GiB are created on an ext2 file system, a read-only compatibility flag is set, so older kernels will mount that file system read-only.
Note for ReiserFS
Chris Mason wrote:
Disks formatted with the current 2.2 code are called our 3.5 disk format. They will not support large files under any kernel (even the 2.4 code).
But, you can mount a 3.5 disk format under the 2.4 kernel code, and use -o conv. This will turn on large file support for the old disks, but only new files will be allowed to grow past 2 GiB.
Once you mount with -o conv, you can't mount under 2.2 any more. We are testing a back port of the LFS disk format to 2.2, it should be ready soon. It has the same -o conv mount option that our 2.4 code has, so all the same rules will apply.

rlimit64 Is Not Supported

The Linux kernel doesn't support a 64-bit rlimit system call yet. Glibc supports getrlimit64 and setrlimit64 but wraps values that are too large to RLIM_INFINITY.
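To see which file size limit a process actually runs with, the limit can be queried directly. A minimal sketch using Python's resource module (chosen for brevity; which getrlimit variant it reaches underneath depends on the build and is not something this example can show). The helper name file_size_limit is made up for illustration.

```python
import resource


def file_size_limit():
    """Return the soft RLIMIT_FSIZE in bytes, or None if it is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_FSIZE)
    return None if soft == resource.RLIM_INFINITY else soft


limit = file_size_limit()
print("unlimited" if limit is None else f"{limit} bytes")
```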

Using LFS

To use LFS, user programs have to use the LFS API. This requires changing and recompiling the programs. The API is documented in the glibc manual (the libc info pages), which can be read with e.g. "info libc".
In a nutshell, to use LFS you can choose any of the following:
  • Compile your programs with "gcc -D_FILE_OFFSET_BITS=64". This forces all file access calls to use the 64 bit variants. Several types change also, e.g. off_t becomes off64_t. It's therefore important to always use the correct types and to not use e.g. int instead of off_t. For portability with other platforms you should use getconf LFS_CFLAGS which will return -D_FILE_OFFSET_BITS=64 on Linux platforms but might return something else on e.g. Solaris. For linking, you should use the link flags that are reported via getconf LFS_LDFLAGS. On Linux systems, you do not need special link flags.
  • Define _LARGEFILE_SOURCE and _LARGEFILE64_SOURCE. With these defines you can use the LFS functions like open64 directly.
  • Use the O_LARGEFILE flag with open to operate on large files.
A complete documentation of the feature test macros like _FILE_OFFSET_BITS and _LARGEFILE_SOURCE is in the glibc manual (run e.g. "info libc 'Feature Test Macros'").
The LFS API is also documented in the LFS standard which is available at http://ftp.sas.com/standards/large.file/x_open.20Mar96.html.
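Whether a particular filesystem accepts offsets beyond 2 GiB can be probed by writing a single byte just past that mark, which creates a sparse file and uses almost no disk space. The sketch below is written in Python for brevity; a C program would do the same with open, lseek and write after choosing one of the options above. It assumes a platform where os.O_LARGEFILE exists (the getattr fallback covers those where it does not), and the function name probe_large_file is hypothetical.

```python
import errno
import os


def probe_large_file(path: str):
    """Write one byte just past the 2 GiB mark and return the resulting file size.

    Returns None if the filesystem or a resource limit refuses the operation.
    The probe file is removed again in either case.
    """
    flags = os.O_RDWR | os.O_CREAT | getattr(os, "O_LARGEFILE", 0)
    fd = os.open(path, flags, 0o600)
    try:
        os.lseek(fd, 2**31, os.SEEK_SET)  # first offset beyond a signed 32-bit off_t
        os.write(fd, b"\0")               # sparse file of 2 GiB + 1 bytes
        return os.fstat(fd).st_size
    except OSError as exc:
        # The error codes a filesystem with limited LFS support should report.
        if exc.errno in (errno.EINVAL, errno.EFBIG, errno.EOVERFLOW):
            return None
        raise
    finally:
        os.close(fd)
        os.unlink(path)
```

A return value of 2^31 + 1 means the write succeeded; None means the filesystem reported one of the EINVAL/EFBIG/EOVERFLOW errors mentioned in the File Systems section.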

LFS and Libraries other than Glibc

Be careful when using _FILE_OFFSET_BITS=64 to compile a library, or a program that calls a library, if any of the library's interfaces use off_t. With _FILE_OFFSET_BITS=64 glibc will change the type of off_t to off64_t. You can either change the interface to always use off64_t or provide a different function when _FILE_OFFSET_BITS=64 is used (as glibc does). Otherwise, make sure that both library and program are built with the same _FILE_OFFSET_BITS setting. Note that glibc itself is aware of the _FILE_OFFSET_BITS setting, so there is no problem with it, but there might be problems with other libraries.
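What goes wrong when program and library disagree about the size of off_t can be modelled with a byte layout. The sketch below is purely illustrative: the record (an int followed by an off_t) is a hypothetical library structure, and the layouts assume little-endian 32-bit x86 without padding. Python's struct module is used only to make the bytes visible.

```python
import struct

# Hypothetical record passed across a library interface:
#   struct record { int id; off_t pos; };
LAYOUT_32 = "<ii"   # off_t is 32 bits (library built without _FILE_OFFSET_BITS=64)
LAYOUT_64 = "<iq"   # off_t is 64 bits (program built with _FILE_OFFSET_BITS=64)


def record_sizes():
    """Return the size in bytes of the record under each setting."""
    return struct.calcsize(LAYOUT_32), struct.calcsize(LAYOUT_64)


def misread_position(pos: int) -> int:
    """Pack with a 64-bit off_t, then read the record as if off_t were 32 bits."""
    packed = struct.pack(LAYOUT_64, 7, pos)
    _id, pos32 = struct.unpack(LAYOUT_32, packed[:struct.calcsize(LAYOUT_32)])
    return pos32


print(record_sizes())
print(misread_position(2**32 + 5))  # the upper half of the offset is silently lost
```

Small offsets survive the mismatch, which is why such bugs tend to show up only once files actually grow large.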

Distributions with LFS Support

SuSE 7.0

Release 7.0 of SuSE Linux supports LFS on all supported platforms. The kernel of SuSE 7.0 is based on Linux 2.2.16.
The LFS support in the SuSE Linux kernel is the same as in the development kernel 2.4.0-test1 for the file systems which are in both kernels; glibc supports all the features of the kernel. The file systems that differ are ReiserFS (so far only in SuSE, the 2.2 port doesn't support LFS) and NFSv3 (not available in SuSE 7.0). This means that you need to use ext2 as file system for LFS.
Both Linux 2.4.0-test1 and SuSE 7.0 do not support the getdents64 system call and the 64 bit locking interface. These are only implemented in Linux 2.4.0-test8 and newer.

SuSE 7.1

Release 7.1 of SuSE Linux supports LFS on all supported platforms. SuSE 7.1 comes with kernels based on 2.4.0 and 2.2.18.
The 2.2.18 kernel supports LFS with the ext2 file system. The 2.4.0 kernel supports LFS with the ext2 and NFSv3 filesystems and additionally with the ReiserFS filesystem if the new ReiserFS format (incompatible with the 2.2 format) is used instead of the default 2.2 format.
SuSE 7.1 comes with glibc 2.2, which supports the full LFS interface. Only the 2.2.18 kernel lacks support for 64-bit file locking and the getdents64 call.

SuSE 7.2 and newer

The kernel support for LFS is like the one in 7.1.

Other Distributions

Since I can't verify each and every distribution, I have to trust others for the following information.

Debian

The current stable release (Debian 3.0, codename "woody") has LFS support.

Red Hat

The beta called Fisher was the first to have LFS support (thanks to Russ Marshall). Current Red Hat releases like Red Hat 8 have LFS support.
Tim Small <tim@digitalbrain.com> sent the following special combo-gotcha for Red Hat 6.2 (and probably other older distros as well):
The 'ulimit' command which is built into bash 1.x (the default for Red Hat 6.2) uses the 32 bit versions of the system calls. The way that glibc currently behaves means that requests to the 32bit setrlimit, or getrlimit will translate 'unlimited' to '2^31 - 1' in both directions (I would argue that setting a limit to RLIM_INFINITY using the 32bit interface should end up in a call to the 64 bit setrlimit variant with the 64 bit RLIM_INFINITY).
The default PAM configuration for sshd (/etc/pam.d/sshd), includes the line:
session    required     /lib/security/pam_limits.so
Which fiddles about with various limits (using the 32bit versions of the calls).
If you log-in using ssh, and use bash 1.x to view the limits, you will be told that your file size is unlimited, when it is in fact set to 2097151 (1024 byte) blocks!
Workaround:
  • Either:
    • Comment out the line in /etc/pam.d/sshd (note that limits set in /etc/security/limits.conf will no longer be effective for ssh logins)
    • Or: Rebuild the pam package with 64 bit support
  • Install the bash2 RPM
  • Either:
    • rename the old bash, and symlink /bin/bash2 to /bin/bash (you may want to keep /bin/sh pointing at the old bash, if you are worried about compatibility)
    • Or: use vipw to change users over to /bin/bash2
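The figure of 2097151 blocks in the report above is simply the 2^31 - 1 byte limit expressed in the 1024-byte blocks that ulimit uses. A small worked check (the names LIMIT_BYTES, BLOCK and limit_in_blocks are made up for this sketch):

```python
LIMIT_BYTES = 2**31 - 1   # what 'unlimited' turns into through the 32-bit calls
BLOCK = 1024              # block size used in the report above


def limit_in_blocks(limit_bytes: int, block: int = BLOCK) -> int:
    """Convert a byte limit to whole blocks, rounding down."""
    return limit_bytes // block


print(limit_in_blocks(LIMIT_BYTES))
```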

Other...

I don't have any other information yet. Feel free to send me detailed information about distributions that support LFS.

Some Other Often Requested Data about Filesystems

Please send me information to fill in the missing bits.

Maximum On-Disk Sizes of the Filesystems

Filesystem                                                              File Size Limit       Filesystem Size Limit
ext2/ext3 with 1 KiB blocksize                                          16448 MiB (~ 16 GiB)  2048 GiB (= 2 TiB)
ext2/3 with 2 KiB blocksize                                             256 GiB               8192 GiB (= 8 TiB)
ext2/3 with 4 KiB blocksize                                             2048 GiB (= 2 TiB)    8192 GiB (= 8 TiB)
ext2/3 with 8 KiB blocksize (Systems with 8 KiB pages like Alpha only)  65568 GiB (~ 64 TiB)  32768 GiB (= 32 TiB)
ReiserFS 3.5                                                            2 GiB                 16384 GiB (= 16 TiB)
ReiserFS 3.6 (as in Linux 2.4)                                          1 EiB                 16384 GiB (= 16 TiB)
XFS                                                                     8 EiB                 8 EiB
JFS with 512 Bytes blocksize                                            8 EiB                 512 TiB
JFS with 4 KiB blocksize                                                8 EiB                 4 PiB
NFSv2 (client side)                                                     2 GiB                 8 EiB
NFSv3 (client side)                                                     8 EiB                 8 EiB
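Several of the ext2/ext3 file size figures can be reproduced from the layout of the inode's block map. The sketch below assumes the usual ext2 layout of 12 direct block pointers plus one single, one double and one triple indirect block, with 4-byte block numbers; the function name ext2_blockmap_limit is made up for illustration. It reproduces the 1 KiB, 2 KiB and 8 KiB rows. For 4 KiB blocks it yields roughly 4 TiB, so the 2 TiB in the table must come from a different limit (my assumption: a 32-bit count of 512-byte sectors per file).

```python
KIB, MIB, GIB = 1024, 1024**2, 1024**3


def ext2_blockmap_limit(block_size: int) -> int:
    """Bytes addressable through an ext2 inode's block map (assumed layout)."""
    ptrs = block_size // 4                     # 4-byte block numbers per indirect block
    blocks = 12 + ptrs + ptrs**2 + ptrs**3     # direct + single + double + triple indirect
    return blocks * block_size


for bs in (1 * KIB, 2 * KIB, 4 * KIB, 8 * KIB):
    print(bs, "byte blocks:", ext2_blockmap_limit(bs) // MIB, "MiB")
```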
Note Kernel Limitations: The table above describes limitations of the on-disk format. The following kernel limits exist:
  • On 32-bit systems with Kernel 2.4.x: The size of a file and of a block device is limited to 2 TiB. By using LVM, several block devices can be combined, which allows larger file systems to be handled.
  • 64-bit systems: The sizes of a filesystem and of a file are limited to 2^63 bytes (8 EiB). But there might be hardware driver limits that do not allow access to such large devices.
  • Kernel 2.6: For both 32-bit systems with option CONFIG_LBD set and for 64-bit systems: The size of a file system is limited to 2^73 bytes (far too much for today). On 32-bit systems (without CONFIG_LBD set) the size of a file is limited to 2 TiB. Note that not all filesystems and hardware drivers might handle such large filesystems.
Note in the above: 1024 Bytes = 1 KiB; 1024 KiB = 1 MiB; 1024 MiB = 1 GiB; 1024 GiB = 1 TiB; 1024 TiB = 1 PiB; 1024 PiB = 1 EiB (check http://physics.nist.gov/cuu/Units/binary.html)
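The 2 TiB and 2^73 figures in the kernel limits are consistent with counting 512-byte sectors in a 32-bit or a 64-bit integer. That reading is an assumption of this sketch, not something stated above, and the names SECTOR, TIB and max_device_bytes are made up for illustration.

```python
SECTOR = 512        # bytes per sector (assumed)
TIB = 1024**4


def max_device_bytes(sector_count_bits: int, sector_size: int = SECTOR) -> int:
    """Bytes addressable with a sector counter of the given width."""
    return 2**sector_count_bits * sector_size


print(max_device_bytes(32) // TIB, "TiB with a 32-bit sector count")
print(max_device_bytes(64) == 2**73)   # 64-bit sector count
```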

Maximum Number of Partitions

An IDE disk has 64 minors; one is used for the full disk, so 63 partitions are possible. A SCSI disk has 16 minors and therefore at most 15 partitions.
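The arithmetic can be written down in a few lines. This sketch assumes that the minors of each disk form one contiguous range whose first number stands for the whole disk; the function names are hypothetical.

```python
IDE_MINORS_PER_DISK = 64
SCSI_MINORS_PER_DISK = 16


def max_partitions(minors_per_disk: int) -> int:
    """One minor is reserved for the whole disk, the rest are partitions."""
    return minors_per_disk - 1


def split_minor(minor: int, minors_per_disk: int):
    """Return (disk index, partition number); partition 0 means the whole disk."""
    return divmod(minor, minors_per_disk)


print(max_partitions(IDE_MINORS_PER_DISK), max_partitions(SCSI_MINORS_PER_DISK))
print(split_minor(17, SCSI_MINORS_PER_DISK))  # second SCSI disk, first partition
```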
