Merge Mailman - or other - mbox Archives

Situation

Mailman archives list email in an mbox-formatted mailbox. Several situations can leave you with two copies of such an mbox file (a file to which new emails are simply appended) that forked at some point in the past:

  • two mail servers, migration from one to the other
  • two mail servers, one of them crashed and one restored from backup
  • other cases of desynchronized mbox folders

Problem

An mbox folder is just one (sometimes large) file containing several emails one after another.

So it is not easy to tell which emails are present in which folder, and it is even nastier to insert a missing email into a folder at a defined position.
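
To make the problem concrete: every message in an mbox file starts with a separator line beginning with "From " (followed by sender and date), so counting those lines gives a rough message count. A minimal sketch with a made-up two-message mbox; the addresses, contents and the temporary directory are assumptions for illustration only:

```shell
#!/bin/bash
# Build a tiny sample mbox (illustrative content, not real mail).
tmp=$(mktemp -d)
cat > "$tmp/sample.mbox" << 'EOF'
From alice@example.org Sun Nov 27 20:00:00 2011
Subject: first

Hello

From bob@example.org Sun Nov 27 20:05:00 2011
Subject: second

World

EOF

# Every message starts with a "From " separator at the beginning of a line.
count=$(grep -c '^From ' "$tmp/sample.mbox")
echo "messages: $count"
```

The count is only approximate if body lines starting with "From " were not escaped when the mails were written to the folder.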

Solution

So the first step is to split the folder into individual emails. This is where formail (it comes with procmail) comes in handy: it splits an mbox folder and calls a given tool for every mail, passing the mail on the tool's stdin.

Now one needs a tool that simply catches this input and writes it to files with enumerated filenames. As I need this functionality from time to time, I wrote one and called it 'enumcat':

  #!/bin/bash
  
  pfx="enumcat."
  seqf=".seq"
  c=0
  width=6
  
  usage () {
  cat << EOF
  	$0 [ -p prefix ] [ -s sequencefile ] [ -c count ] [ -w width ]
  	-p prefix         file name prefix, may include path. default: 'enumcat.'
  	-s sequencefile   sequence file, may include path. default: '.seq'
  	-c count          counter start. default: 0
  	-w width          length of numbers. default: 6
  
  EOF
  }
  
  doopt () {
      local x
      while [ $# -gt 0 ]; do
          x=$1; shift
          # echo P: $# "'$x'" : $@
  
          case "$x" in
          -p) pfx="$1"
              shift
              ;;
          -s) seqf="$1"
              shift
              ;;
          -c) c="$1"
              shift
              ;;
          -w) width="$1"
              shift
              ;;
          *)
              usage
              exit 1
              ;;
          esac
      done
  
  }
  
  doopt "$@"
  
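  # continue counting from the sequence file if it already exists
  [ -e "$seqf" ] && c=$(< "$seqf")
  c=$(( c + 1 ))
  echo "$c" > "$seqf"
  
  # store the mail from stdin under the next enumerated name
  cat > "$pfx$( printf "%0${width}d" "$c" )"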
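
The counter logic can be tried in isolation. The following is a minimal sketch, not part of enumcat itself: next_name is a hypothetical helper that reads the sequence file, increments the number, stores it again and prints the zero-padded file name.

```shell
#!/bin/bash
# Hypothetical helper illustrating the sequence-file counter.
next_name () {
    local seqf=$1 pfx=$2 width=$3 c=0
    # Resume from the stored counter if the sequence file already exists.
    [ -e "$seqf" ] && c=$(< "$seqf")
    c=$(( c + 1 ))
    echo "$c" > "$seqf"
    printf "%s%0${width}d\n" "$pfx" "$c"
}

tmp=$(mktemp -d)
first=$(next_name "$tmp/.seq" "email." 6)
second=$(next_name "$tmp/.seq" "email." 6)
echo "$first $second"
```

The counter lives in a file because formail starts the tool anew for every mail, so a shell variable would not survive from one mail to the next.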

Now we can split the mbox mail folder. This has to be done for both folders (enumcat must be executable and found in your PATH):

mkdir /tmp/folder1
formail -s enumcat -s /tmp/folder1/.seq -p /tmp/folder1/email. < /my/original/folder.mbox
mkdir /tmp/folder2
formail -s enumcat -s /tmp/folder2/.seq -p /tmp/folder2/email. < /my/other/folder.mbox

Next, prepare the comparison and compare:

cd /tmp/folder1 && md5sum email.* > md5sum.all
cd /tmp/folder2 && md5sum email.* > md5sum.all
diff /tmp/folder1/md5sum.all /tmp/folder2/md5sum.all
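
When reading the result, keep in mind how diff marks its lines: "<" lines come from the first file given, ">" lines from the second. A small sketch with two made-up checksum lists in the "hash  filename" layout that md5sum produces (the hashes and the list file names are placeholders, not real output):

```shell
#!/bin/bash
# Two made-up checksum lists; list1 and list2 stand in for the md5sum.all files.
tmp=$(mktemp -d)
printf '%s\n' 'aaa  email.000001' 'bbb  email.000002' > "$tmp/list1"
printf '%s\n' 'aaa  email.000001' 'ccc  email.000002' 'ddd  email.000003' > "$tmp/list2"

# "<" marks lines found only in the first file, ">" lines found only in the second.
# With the leading marker, the file name is the third whitespace-separated field.
only1=$(diff "$tmp/list1" "$tmp/list2" | awk '/^</ { print $3 }')
only2=$(diff "$tmp/list1" "$tmp/list2" | awk '/^>/ { print $3 }')
echo "only in list1: $only1"
echo "only in list2:" $only2
```

Because both folders were enumerated independently, the same file name can show up on both sides, as email.000002 does here: the name is identical but the content differs.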

Collect the files unique to folder1 (in diff output, lines marked '<' belong to the first file given, so folder1 has to come first):

F=$( diff /tmp/folder1/md5sum.all /tmp/folder2/md5sum.all | awk '/^</ { print $3 }' )

The files are enumerated, and the enumeration represents the flow of time. The files in $F should therefore sort before the files carrying the same numbers in the other folder. To achieve this we go one number back and generate file names that sort between that number and the following one. Shell sorting helps us here:

cd /tmp/folder1
FIRSTFILE=$( head -1 <<< "$F" )
FIRSTNUM=${FIRSTFILE#*.}            # numeric suffix, e.g. 000123
FIRSTBASE=${FIRSTFILE%"$FIRSTNUM"}  # prefix, e.g. email.
FLEN=${#FIRSTNUM}                   # width of the number
FCOUNT=$(( $( wc -l <<< "$F" ) ))   # how many files to rename
FCLEN=${#FCOUNT}                    # digits needed for the extra counter
# 10# forces decimal, so leading zeros are not read as octal
PREVNUM=$( printf "%0${FLEN}d" $(( 10#$FIRSTNUM - 1 )) )
CNT=0
FNEW=""
for i in $F ; do
    NFILE=${FIRSTBASE}${PREVNUM}$( printf "%0${FCLEN}d" $CNT )
    mv "$i" "$NFILE"
    FNEW="$FNEW $NFILE"
    CNT=$(( CNT + 1 ))
done
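
A worked example of the naming scheme, as a sketch with assumed file names: the other folder already holds email.000002 and email.000003, and two files unique to this folder start at number 000003. The new names are built from the previous number plus an extra counter digit, so they sort right after email.000002 and before email.000003.

```shell
#!/bin/bash
# Illustrative names only; no files are touched here.
first=email.000003
num=${first#*.}                                   # 000003
base=${first%"$num"}                              # email.
prev=$(printf "%0${#num}d" $(( 10#$num - 1 )))    # 000002

new1=${base}${prev}0    # first renamed file
new2=${base}${prev}1    # second renamed file

# Sort the existing and the new names the way a shell glob would (C locale).
order=$(printf '%s\n' email.000003 "$new2" email.000002 "$new1" \
        | LC_ALL=C sort | paste -s -d ' ' -)
echo "$order"
```

The sorted list reads email.000002 email.0000020 email.0000021 email.000003, which is exactly the order in which the files are concatenated later.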

$FNEW now contains the new names of the files to be moved to /tmp/folder2 (-i just to be really sure nothing gets overwritten):

mv -i $FNEW /tmp/folder2/

Now glue it together:

mv /my/other/folder.mbox /my/other/folder.mbox.BAK
cat /tmp/folder2/email.* > /my/other/folder.mbox
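
A quick sanity check after gluing is to compare the number of split files with the number of "From " separator lines in the merged folder. A minimal sketch on made-up data (three generated single-message files in a temporary directory; all names and contents are assumptions):

```shell
#!/bin/bash
# Made-up split directory with three single-message files.
tmp=$(mktemp -d)
for n in 1 2 3; do
    printf 'From user%s@example.org Sun Nov 27 20:00:00 2011\nSubject: mail %s\n\nbody %s\n\n' \
        "$n" "$n" "$n" > "$tmp/email.00000$n"
done

# Glue the pieces together in glob (sorted) order.
cat "$tmp"/email.* > "$tmp/merged.mbox"

# Count the split files and the message separators in the merged folder.
set -- "$tmp"/email.*
files=$#
msgs=$(grep -c '^From ' "$tmp/merged.mbox")
echo "files: $files, messages: $msgs"
```

If the two numbers differ on real data, a mail body probably contains an unescaped line starting with "From ", or a file was lost on the way.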

That's it. Please make sure nothing (neither your mail server nor your mailing list manager) writes to the folders while this is going on. ;-)

Remarks

  • You need bash >= 3.0 (or possibly 3.2) for this; the here-strings (<<<) used above are not available in a plain POSIX sh.
  • If you use this on mbox mail folders that differ in other ways, you will still get a unified merged folder - but without the original flow of time reconstructed.
project/merge-mailman-mbox-archives.txt · Last modified: 2011/11/27 20:00 by 109.192.98.64