Sending errors in a roll up digest email with log4net

On production, there are many alerts in a short period. It’s nice to get error emails, but you would want to avoid receiving hundreds of emails in a few minutes, when something fails. Also, people get used to alert emails and later nobody would care about the blizzard of emails.

So what we want are

  1. Roll up notifications of faults occurring in a short period into a summary email
  2. But the first fault should be emailed straight away
  3. The thresholds are two. First, it waits for certain number of faults. Second, though the buffer is not full, after a certain period of time, the logger sends all the faults in the buffer. Practically, it will wait for 5 minutes. Even within 5 minutes, if the nubmer of faults exceeds 100, then the batch will be sent.
  4. Include contextual events. We also want to see 10 preceding events before the error to understand the context of the fault.

We will use log4net, a superb logging framework that handles the most of the features we want, just out of box.

Let me go straight to the code.

public class SmtpCachingAppender : SmtpAppender
{
    private bool _timedFlush;
    System.Threading.Timer _timer;
    private string _originalSubject;
    private readonly IList<LoggingEvent> _thresholdEvents;
    private readonly List<LoggingEvent> _allEvents;

    public SmtpCachingAppender()
    {
        FlushCount = 20;
        FlushInterval = new TimeSpan(0, 5, 0);
        _thresholdEvents = new List<LoggingEvent>();
        _allEvents = new List<LoggingEvent>();
    }

    public TimeSpan FlushInterval { get; set; }
    public int FlushCount { get; set; }

    /// <summary>
    /// Create a timer that fires to force flushing cached log events
    /// via SMTP at a specified interval.
    /// </summary>
    public override void ActivateOptions()
    {
        if (FlushInterval > TimeSpan.Zero)
            _timer = new System.Threading.Timer(OnTimer, null, FlushInterval, FlushInterval); 

        base.ActivateOptions();
    }
    
    protected override void SendBuffer(LoggingEvent[] events)
    {
        _thresholdEvents.Add(events.Last());
        _allEvents.AddRange(events);

        bool isFirstError = _thresholdEvents.Count == 1;
        if (isFirstError)
            Send(_allEvents);

        bool isBatchReady = ((FlushCount != 0) && (_thresholdEvents.Count >= FlushCount)) || _timedFlush;
        if (isBatchReady)
        {
            Send(_allEvents);
            ResetCacheBuffer();
        }
    }

    private void Send(IEnumerable<LoggingEvent> events)
    {
        _originalSubject = Subject;

        PrepareSubject();
        base.SendBuffer(events.ToArray());

        Subject = _originalSubject;
    }

    private void ResetCacheBuffer()
    {
        _thresholdEvents.Clear();
        _allEvents.Clear();
        _timedFlush = false;
    }

    private void PrepareSubject()
    {

        var matches = Regex.Matches(Subject, @"%property{(\w+)}");
        foreach (Match match in matches)
        {
            if (!_thresholdEvents[0].Properties.Contains(match.Groups[1].Value))
                continue;

            Subject = Subject.Replace(match.Value, _thresholdEvents[0].Properties[match.Groups[1].Value].ToString());
        }

        if (_thresholdEvents.Count > 1)
        {
            Subject = "SUMMARY " + Subject + " - " + _thresholdEvents.Count + " errors";
        }
    }


    private void OnTimer(Object stateInfo)
    {
        _timedFlush = true;
        Flush(true);
    }

}

SmtpCachingAppender inherits SmtpAppender. SmtpAppender already has feature for contextual events. So we can benefit from it.

When the logger logs any event, SendBuffer(LoggingEvent[] events) is called. why is the parameter events collection? It’s because of “buffersize” setting. if “buffersize” is set to 0, only 1 event is passed. If it is 10, 10 events are passed, and the last event is the threshold event (in this case, it will be error) and all previous events are contextual ones.

There are two lists, _thresholdEvents and _allEvents. _allEvents will have all events including all previous events even though they are not ERROR event (there would be some DEBUG, INFO, WARN, …). The batch will only consider the number threshold events, so we need a separate list, _thresholdEvents.

When either of the two conditions are met, we call base.SendBuffer(…), FlushInterval is done or the number of threshold events exceed the specified FlushCount.

The setup for this appender is like this.

  <appender name="RollUpErrorEmail" type="SmtpCachingAppender">
    <to value="operation@andrewchaa.me.uk" />
    <from value="backend@yourreader.net" />
    <subject value="Your Service ERROR (local from homemachine)" />
    <smtpHost value="SMTP-HOST" />
    <bufferSize value="10" />
    <lossy value="true" />
    <FlushInterval value="00:05:00" />
    <FlushCount value="100" />

    <evaluator type="log4net.Core.LevelEvaluator">
      <threshold value="ERROR"/>
    </evaluator>

    <layout type="log4net.Layout.PatternLayout">
      <conversionPattern value="%utcdate [%thread] %-5level %logger user=%property{user} - %message%n" />
    </layout>
  </appender>

The usage of each configuration values are

  • bufferSize: the number events you want to see, including the contextual ones
  • lossy: true. If lossy is false, every 10 event (by the buffersize) will be logged. so if you want to see the threshold event at the end of the list, lossy should be false. Then it ignore any non-threshold events but include only 9 preceding events from ERROR event
  • threshold: set the event type. it can be DEBUG, WARN, INFO, ERROR, …
  • FlushInterval: it is currently set to 5 minutes
  • FlushCount: currently set to 100. if the number of faults becomes 100, the email will be sent.

There was a sample code for cachingsmtpappender. I added the first email bit and also changed it a little bit. Hope this helps anyone interested.

Sending errors in a roll up digest email with log4net

2 thoughts on “Sending errors in a roll up digest email with log4net

  1. Cihan Yaşar says:

    Thanks for the info, but when OnTimer method is executed by the timer, it calls Flush(bool ) method of BufferingAppenderSkeleton class. And it does nothing, because the underlying buffer is already cleared when it calls SendBuffer method. Thus pratically it could wait more than 5 mins if no error occurs after timer elapsed. After 5 mins even if the FlushCount is lower than the limit transmission of a digest email will be done.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s