Journaling for ext2fs, alpha release 0.0.3a

Released 24 August, 2000
Stephen Tweedie <sct@redhat.com>


*** Nobody accepts any responsibility if the use of this code damages
*** your filesystem, corrupts data, creates a black hole or turns you
*** into a sperm whale.  If I had a lawyer he'd probably have told me to
*** say this.  You have been warned.


Changes in this release
-----------------------

in 0.0.3a:

Development release. 

Adds new journal_abort() API.  journal_start and journal_stop can now
return errors if the journal has been aborted.  Aborting the journal
results in all current transactions running to completion, at which
point they will return -EIO.  Future transactions will return -EROFS.

The ext3_panic and ext3_error filesystem fatal error mechanisms will use
journal_abort() to fail cleanly, as will certain exceptional journal
internal errors (such as journal IO failures and unrecoverable
out-of-memory errors).

A second new journal API allows an error code to be set in the journal
on journal_abort().  ext3_error() can set this in order to record the
detection of errors on the filesystem without having to write back the
superblock (which we cannot necessarily do due to write ordering
constraints).

Added a few bug fixes:

 * Make sure that no new transactions get started once the filesystem
   has been remounted readonly (there was previously a window between
   checking the readonly flag and starting the transaction)

 * Added a one-line orphan list handling fix from Andreas Dilger:
   removing and re-adding an inode on the orphan list (eg. doing multiple
   truncates on a file) could corrupt the list.

 * Quota fix --- add the capability for quotas to be marked
   writethrough, to make sure that we journal all quota updates
   immediately rather than waiting for deferred quota file writeback.
   (This also prevents recursive transactions from crossing filesystems
   if a transaction tries to reuse an inode with dirty quota entries
   attached to it and flushes those to disk.)

in 0.0.2f:

Fix two bugs introduced in 0.0.2e:
 directory corruption bug on extending directory files
 fix possible OOPS on illegal directory operations

Include "rmdir" operations in the orphan-handling code

in 0.0.2e:

Port forward to current (2.2.17pre9, and Red Hat errata 2.2.16-3)
kernels

Merge in a number of ext2 fixes from 2.2.15+:
 * NFS versioning 
 * Set directory type information correctly on sockets

Fix a number of buffer leaks in recovery (prevents set_blocksize errors
on mounting filesystems)

sync(2) waits for current transactions correctly

Set the superblock s_dirt flag on all transaction completions

Fixed the order of asserts and buffer writes in fs/buffer.c: this was
causing false assertion failures on Mylex raid controllers

Delete the filesystem commit timer on unmount in all cases

Include Andreas Dilger's implementation of the "orphan list" code:
  The orphan list maintains an on-disk list of inodes that need to be
  cleaned up on recovery, including:
  * Deletion of unlinked, but still opened, files after a reboot;
  * Completion after recovery of truncates which were in progress but 
    which had to be split across a transaction boundary

in 0.0.2d:

Port forward to 2.2.15pre15

Fixed a missing lock_journal in journal_forget() (this could potentially
confuse the transaction engine when doing deletes under heavy memory
pressure).

Merged in a small directory IO error handling fix from ext2

Included Andrea Arcangeli's elevator IO scheduling changes.  This should
improve the performance of ext3 on non-SCSI devices substantially.


in 0.0.2c:

Lots of fixes to the way we set the filesystem's NEEDS_RECOVERY flag.
It should basically get this right now.  This flag is the thing that
prevents you from accidentally running e2fsck on a filesystem which
still needs kernel attention after a crash or unclean shutdown.

Fixed releasing of the journal inode when unmounting a readonly ext3
filesystem.


in 0.0.2b:

Fixed a lockup when the VFS is trying to reclaim dirty inodes


in 0.0.2a:

Fixed a nasty bug in truncate.  If truncate overflowed the
transaction, the transaction state machine could get seriously
confused.

in 0.0.2:

Bug fixes.  Lots of bug fixes.  Buckets of them.

It works on >1K blocksize filesystems.  It recovers reliably.  It
survives log wraps properly during recovery.  mknod() works properly: it
will no longer turn /dev into a socket if used on your root filesystem.  

This one survives under load quite happily.  A 50-client dbench run
completes reliably.

So basically, this is the first usable ext3 release.

Note that there are two major places where the implementation is not
complete: clean handling of all errors (in particular out-of-memory and
IO errors), and performance (there is still a lot of debugging code in
place, and all data is journaled as part of the testing cycle).  But it
is usable: I've been running it on all of my laptop's filesystems for
over a week now.


Future Milestones
-----------------

0.0.3 to deal gracefully with memory or disk failures.
0.0.4 to deal with metadata-only journaling
0.0.5 to disable the extra debugging code and add performance tuning
0.1 to be released once all of that is solid.


Introduction
------------

What is journaling?

    * It means you don't have to fsck after a crash.  Basically.

What works?

    * Journaling to a journal file on the journaled filesystem

    * Automatic recovery when the filesystem is remounted

    * All VFS operations (including quota) should be journaled

    * All data updates are also journaled


What is left to be done?

    * Journaling of metadata only.  Currently everything is journaled,
      including data, resulting in a performance drop as all data gets
      written twice.

      Journaling of metadata only is supported but is not enabled.  It
      turns out to involve several extra complications in the journaling
      buffer state, so I'm testing the simpler case first to get that
      reliable on its own.

    * Journaling to an off-filesystem device, eg. NVRam

    * Automatic reclamation of unlinked but still-referenced files on
      reboot

    * Error recovery.  You will see that the source is marked quite
      carefully where there are potential IO or memory allocation errors
      which can disrupt things, but the code to respond to that (either
      to remount the fs readonly or to abort and panic) remains to be
      added. 

    * Decent documentation!

    * A few internal cleanups: migrating the extra buffer_head fields to
      a separate jfs_buffer_info field in particular.

    * e2fsprogs tools.  e2fsck needs to know about the journal (but see
      below). 

How to apply
------------

This README should have come with two diffs for kernel version
linux-2.2.17pre9:

  -rw-rw-r-- 1 sct sct 364703 Jul  5 20:41 linux-2.2.17pre9.ext3.diff
  -rw-rw-r-- 1 sct sct 218556 Jul  5 20:40 linux-2.2.17pre9.kdb.diff

and for the "2.2.16-3" errata kernel distributed as Red Hat 6.2 updates:

  -rw-rw-r-- 1 sct sct 364703 Jul  5 20:41 linux-2.2.17pre9.ext3.diff
  -rw-rw-r-- 1 sct sct 218556 Jul  5 20:40 linux-2.2.17pre9.kdb.diff

as well as an incremental diff to take ext3-0.0.2d up to 0.0.2e for
those of you just wanting to see what has changed in this release.

The kdb diff of each set is a copy of SGI's kdb kernel debugger patches.
Apply this first if you want kdb.  The ext3 diff is the ext3 filesystem
itself.  If you apply it without the kdb diff, you will get a couple of
rejects (the ext3 diff includes a kdb module for interrogating jfs data
structures) --- ignore those.
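
The ordering above can be sketched as a small helper that only prints
the commands rather than running them.  This is an illustration, not
part of the release: the "patch -p1" strip level and the "../" location
of the diffs are assumptions, so check them against your own tree.

```shell
#!/bin/sh
# Print (do not run) the patch commands in the order described above:
# the optional kdb diff first, then the ext3 diff.
# Assumptions: run from the top of the kernel tree, diffs live one
# directory up, and both apply with "patch -p1".
print_patch_steps() {
    kernel=$1       # kernel version prefix, e.g. linux-2.2.17pre9
    want_kdb=$2     # "yes" to include the kdb debugger patch
    if [ "$want_kdb" = "yes" ]; then
        echo "patch -p1 < ../$kernel.kdb.diff"
    fi
    echo "patch -p1 < ../$kernel.ext3.diff"
}

print_patch_steps linux-2.2.17pre9 yes
```

Passing "no" as the second argument prints only the ext3 step; that is
the case where the rejects mentioned above are expected.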

If you can't apply kernel patches, stop reading this now.  Right now!

Now, configure the kernel, saying YES to "Enable Second extended fs
development code" (I *assume* you want it!), and build it.


What next?
----------

Now, you want to make a journaled filesystem (recommended) or journal an
existing one (for the exceptionally stupid/brave).  Great.  Go right
ahead, make a new ext2 filesystem if you need to, and mount the
filesystem you want to journal.  (Except see below for special
instructions for the root filesystem).

Be aware that the jfs patch does _not_ change the ext2 code.  Rather, it
makes a copy of ext2 called ext3, and all the fancy footwork takes place
in that.  You don't have to run ext3 on all your valuable filesystems:
just use it on the throwaway ones.

Now, create a journal file.  I don't know how big it should be yet: the
rules of thumb have yet to be established!  However, try (say) 2MB for a
small filesystem on a 486; maybe up to 30MB on a big 18G 10krpm Cheetah.
Or whatever you want.  You need at least 1024 blocks for the journal, so
on a filesystem with a blocksize of 4k the minimum journal is 4MB.
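
The minimum-size arithmetic can be sketched as follows.  The function
name is made up for illustration; the only figure taken from the text
is the 1024-block minimum.

```shell
#!/bin/sh
# Smallest usable journal, in kilobytes, for a given filesystem block
# size.  The 1024-block minimum is the figure quoted above.
min_journal_kb() {
    blocksize=$1        # filesystem block size in bytes
    min_blocks=1024     # minimum number of journal blocks
    echo $(( min_blocks * blocksize / 1024 ))
}

min_journal_kb 1024     # 1k blocks
min_journal_kb 4096     # 4k blocks
```

For 4k blocks this gives 4096 KB, matching the 4MB minimum above.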

You'll need to make sure that the file is preallocated, so use something
like:

	dd if=/dev/zero of=/mnt/sparefs/journal.dat bs=1k count=10000

assuming you want a 10MB journal on a 1k ext2 filesystem mounted on
/mnt/sparefs.  You need to find the journal inode's inode number, too:

	ls -i /mnt/sparefs/journal.dat

For a newly created filesystem, this will probably show

        12 journal.dat

OK, 12 is the expected number for a clean fs.  You might want to do a
"chmod 400 journal.dat" right now to make sure that nobody will be
able to poke around in the journal once it is running (don't worry,
ext3 will be able to write to the journal even if you specify a
read-only access mode for the file).
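
The steps so far (preallocate, restrict the mode, look up the inode
number) could be wrapped up roughly like this.  It is only a sketch:
the function name is hypothetical, and it assumes a dd that accepts
"bs=1k" and an ls whose "-i" output starts with the inode number.

```shell
#!/bin/sh
# Preallocate a journal file, restrict its mode, and print its inode
# number on standard output.
make_journal() {
    journal=$1      # path of the journal file to create
    size_kb=$2      # journal size in 1k units
    # Write real zeroes so that every block is allocated (no holes).
    dd if=/dev/zero of="$journal" bs=1k count="$size_kb" 2>/dev/null || return 1
    chmod 400 "$journal" || return 1
    # "ls -i" prints "<inode> <name>"; keep only the inode number.
    ls -i "$journal" | awk '{ print $1 }'
}

# Example, using the path from the text:
#   make_journal /mnt/sparefs/journal.dat 10000
```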

Now, umount as ext2.  Take a deep breath.  Now mount as ext3, giving it
the inode number of the file to be used as the journal:

	mount -t ext3 /dev/sdb2 /mnt/sparefs -o journal=12

Bingo.  That's it.  Enjoy!

Note: The "-o journal=<nnn>" bit is only necessary when creating a new
journal the first time you mount a filesystem as ext3.  Do _not_ add
it to /etc/fstab: it will do no good at all there.
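
To avoid mistyping the inode number, the one-time mount command can be
generated rather than typed.  A minimal sketch, with a hypothetical
function name; it prints the command instead of running it, and has to
be run while the filesystem is still mounted as ext2 so that the
journal file is visible.

```shell
#!/bin/sh
# Print (do not run) the one-time mount command that creates the
# journal.  Umount the ext2 filesystem before running what it prints.
first_mount_cmd() {
    device=$1       # block device, e.g. /dev/sdb2
    mountpoint=$2   # mount point, e.g. /mnt/sparefs
    journal=$3      # path of the preallocated journal file
    inode=$(ls -i "$journal" | awk '{ print $1 }')
    # An empty result means the journal file could not be listed.
    [ -n "$inode" ] || return 1
    echo "mount -t ext3 $device $mountpoint -o journal=$inode"
}

# Example, using the names from the text:
#   first_mount_cmd /dev/sdb2 /mnt/sparefs /mnt/sparefs/journal.dat
```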

Warning: the journaling will get _seriously_ confused if you try to 
delete the journal file.  Future versions of ext3 will protect this 
automatically, but for now you probably want to make it into an
immutable file to guard it:

	chattr +i /mnt/sparefs/journal.dat

Setting the immutable bit will not prevent the filesystem from writing
to the journal internally, but it will stop any other processes from
modifying or removing the journal.


Creating a journal on your root filesystem
------------------------------------------

How do you add the "-o journal=<nnn>" to the mount options for the
root filesystem?  Obviously, / gets mounted for you by the kernel, so
you can't add it on the mount command.  However, ext3 comes with a
new kernel boot option, "rootflags=", which lets you specify any
options you want to be used when / is mounted.

To create the journal on your root filesystem, then, you want to boot
once with the rootflags option.  When creating the journal, it is also
important to mount the root in read-write mode.  So, the kernel
command line options you want to add will look like this:

	rw rootflags=journal=12

if your journal.dat is inode number 12.  If you are using LILO as your
boot loader, you can either specify these options at the boot prompt,
or you can force LILO to add new temporary kernel options just for the
next boot only: if the LILO kernel image is called "ext3", then you
can run

	/sbin/lilo -R ext3 rw rootflags=journal=12

and reboot to get the kernel to build your journal on the root
filesystem.
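
As a rough sketch, the two strings above can be assembled from the
inode number with a sanity check on the way.  The function names are
invented for illustration, and nothing here runs lilo: it only prints
what you would type.

```shell
#!/bin/sh
# Build the kernel command line options for creating the root journal.
root_journal_opts() {
    inode=$1        # inode number of the journal file on /
    case $inode in
        ''|*[!0-9]*)
            echo "inode must be a number" >&2
            return 1
            ;;
    esac
    echo "rw rootflags=journal=$inode"
}

# Print the LILO command that applies those options to the next boot only.
lilo_once_cmd() {
    label=$1        # LILO image label, e.g. ext3
    inode=$2
    opts=$(root_journal_opts "$inode") || return 1
    echo "/sbin/lilo -R $label $opts"
}

lilo_once_cmd ext3 12
```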


How to fsck
-----------

Right now, e2fsck will reject an uncleanly unmounted ext3 partition.
However, if you umount an ext3 filesystem cleanly, you can fsck it using
a version of fsck which understands the journal flags: you'll want
e2fsprogs-1.17 or later, which you can get from the ext2 web pages at

	http://web.mit.edu/tytso/www/linux/ext2.html
	
You can now run e2fsck quite happily on the filesystem, *as long as the
filesystem was unmounted cleanly*.  If it wasn't, then you'll need to
get the kernel code to recover the journal from the disk by mounting the
filesystem (even a readonly mount will cause a journal recovery to
happen) and umounting it again (or, for the root filesystem, remounting
it readonly with "mount -o remount,ro /").
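
The sequence for a non-root filesystem can be sketched as below.  The
helper is hypothetical and only prints the commands; the "-o ro" mount
relies on the statement above that a readonly mount is enough to
trigger recovery.

```shell
#!/bin/sh
# Print (do not run) the commands needed before e2fsck is safe.
fsck_steps() {
    device=$1       # block device, e.g. /dev/sdb2
    mountpoint=$2   # mount point, e.g. /mnt/sparefs
    clean=$3        # "yes" if the filesystem was unmounted cleanly
    if [ "$clean" != "yes" ]; then
        # Let the kernel replay the journal, then unmount again.
        echo "mount -t ext3 -o ro $device $mountpoint"
        echo "umount $mountpoint"
    fi
    echo "e2fsck $device"
}

fsck_steps /dev/sdb2 /mnt/sparefs no
```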

However, the whole point is that you don't HAVE to run e2fsck after a
crash, right?



How to move back from ext3 to ext2
----------------------------------

It's quite easy.  If you unmount an ext3 filesystem cleanly, then you
can remount it as ext2 without any other commands.  If you crash and are
left with an unclean ext3 filesystem, on the other hand, the filesystem
will prevent you from mounting it as ext2: it is not safe to mount it
until you have recovered the journal, and the only way to do that for
now is to mount it as ext3.

However, if for any reason you do have an ext3 filesystem which you want
to convert permanently back to ext2, whether it was cleanly unmounted or
not, you can use "debugfs" from e2fsprogs-1.17 or later to do it.
First, run debugfs and open the filesystem (the -w flag means open for
write, and the -f flag forces it to open the filesystem even if there
are unknown journal flags set):

    [root@sarek /root]# debugfs
    debugfs 1.18, 11-Nov-1999 for EXT2 FS 0.5b, 95/08/09
    debugfs:  open -f -w /dev/sdb1 

Now, use "features" to see which feature bits are set on the filesystem:

    debugfs:  features
    Filesystem features: has_journal filetype sparse_super

We want to clear the journal bits, then we can quit:

    debugfs:  features -has_journal -needs_recovery
    Filesystem features: filetype sparse_super
    debugfs:  quit
    [root@sarek /root]#

That's it!
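
If you have several filesystems to convert, the debugfs commands from
the session above can be generated instead of typed.  This is only a
sketch with an invented function name: it prints the three commands
and leaves it to you to feed them to debugfs.  Whether your debugfs
accepts commands on standard input is an assumption to verify first.

```shell
#!/bin/sh
# Emit the debugfs commands used in the session above for one device.
ext2_revert_script() {
    device=$1       # block device holding the ext3 filesystem
    echo "open -f -w $device"
    echo "features -has_journal -needs_recovery"
    echo "quit"
}

ext2_revert_script /dev/sdb1
```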


Known Bugs
----------

Lots of stuff is missing, in particular the ext3-aware fsck tools with
built-in filesystem recovery. All of the other bugs are currently
unknown.  Good luck finding them.



Enjoy.
--Stephen.
