The profound rantings of the one like Tom Atkinson… and now art gallery and shop.

Get rid of load-balanced domains with a regular expression: GA Webmail Referral Traffic Source Rollup Filter

The problem? Nasty load-balanced domains in Google Analytics reports like this:

Source Visits
/ referral 149
/ referral 131
/ referral 43
/ referral 25
/ referral 23

The solution? Clean nicely segmented source lines that "roll up" into one:

Source/Medium Visits
Webmail ( / email 643
Webmail ( / email 258
Webmail ( / email 105
Webmail ( / email 13
Webmail ( / email 12
Webmail ( / email 23

Cleaning these up requires two filters: a Campaign Source filter and a Medium rollup filter.

Both filters use the same core regex, which in theory consolidates 99% of the world's webmail systems without pulling in any false positives.

The Magic Regex Code

Here is the filter regex that I've been using in my production Google Analytics advanced filters since 15 Feb 2012 to clean these up, that is, to roll up the load-balanced domains that you often get in referrals:

My starting point:


This grabs any domain with, say, "mail" in it, but runs a check on the ending of the domain: it needs to have at least 2 dots after the keyword and a TLD between 2 and 4 chars long. It will miss domains that lack those extra dots, and it will also miss "Mail Campaign \ email" because that is not a proper domain. So far so good 🙂

My improvement with exceptions (This is the one to use!):


The domain of the webmail platform is in capture group $A2.

My improved regex has a new exception section at the end that lets some special cases through the filter using a hard-coded approach. These cases skip the safety net and autonomy provided by the wide-open keyword matching paired with a domain name restriction ensuring "two more dots and a TLD".

Maybe that is excessive use of regex, but at least you can be sure you can now see your word of mouth / word of email traffic nice and tidy!

How does it work?

I'll break the formula down in sections:


This looks for the really obvious and common keywords in webmail services, the main one being mail. On its own, this will match real webmail hosts, but it will also grab a huge number of other domains that you really don't want to catch with this filter. If we were looking for live (but we aren't in this case), then any site that merely has live somewhere in its name would get picked up in the crossfire. I use a ^ in front of imp so that domains that only contain imp somewhere after the start don't get caught.

Basically, the first part of the domain name (sn124w.snt124.mail) gets deleted by this filter, so you could stand to lose quite a lot of data with the "mail" and "imp" keywords if these were the only parts of the expression! So the next bit of the filter is designed to pass the domain through another difficult test involving the dots and TLDs...


This makes sure that the part of the domain after mail or zimbra or whatever always has two dots and a TLD (top-level domain extension, e.g. .nz or .jp).

The (.*) part means match anything, including nothing, and the \. means there must be a literal dot, so (.*)\.(.*)\. means there have to be at least two more dots in the domain name after the keyword, such as live. That is how Answering Oliver gets through the test for live. The next part, .{2,4}, is all about the top-level domain or TLD. These can be 2, 3, or 4 letters long, like .co, .com and .mobi. The curly braces specify how many times the previous token can appear, written {min,max}; here the previous token is . which means any single character (but not nothing).
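To make the "keyword plus two more dots plus TLD" test concrete, here is a minimal Python sketch. The keyword group (mail|zimbra|^imp) is my assumption pieced together from the keywords mentioned in this post, not the full production regex, and the hostnames are made-up examples. Python's re engine may also behave slightly differently from the one Google Analytics uses, so treat this as an illustration only.

```python
import re

# Assumed keyword group (hypothetical; only the keywords named in the post).
KEYWORDS = r"(mail|zimbra|^imp)"
# Tail described in the text: at least two more dots, then 2 to 4 characters.
TAIL = r"(.*)\.(.*)\..{2,4}"
PATTERN = re.compile(KEYWORDS + TAIL)


def looks_like_webmail(host: str) -> bool:
    """Return True if the host passes both the keyword and the dot/TLD test."""
    return PATTERN.search(host) is not None


# Hypothetical hostnames, chosen to exercise each part of the pattern.
for host in [
    "sn124w.snt124.mail.example.com",  # keyword followed by two more dots
    "imp.example.co.nz",               # imp at the very start of the host
    "shrimp.example.com",              # imp in the middle, blocked by ^
    "mail.example",                    # keyword but only one more dot
]:
    print(host, looks_like_webmail(host))
```

The first two hosts pass and the last two do not, which is the safety net at work: a keyword alone is never enough to trigger the rollup.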

Then in the middle is a pipe | which cuts the regex open and allows some really hard-to-match exceptions through for smaller webmail systems. The only reason you see one of them also appear on the left of the central | is that it is such a wacky domain name that it doesn't match the "mail" keyword, which gets most of the webmail systems on the planet. Germans aye? 🙂

* (Hi Devon! Thanks for sending traffic to Stray Travel I found you researching this post)


Campaign Source Filter

Advanced filter.

Field A -> Extract A: Campaign Source:


Field B -> Extract B: [leave both blank]

Output To -> Constructor: Campaign Source: Webmail ($A2.$A3)

GA Webmail Rollup Filter


Medium Rollup Filter

Advanced filter.

Field A -> Extract A: Campaign Source:


Field B -> Extract B: [leave both blank]

Output To -> Constructor: Campaign Medium: email

GA Webmail Rollup Filter



Additional reference domains to check:

The domains below are the really rare webmail clients that are hard to extract:

This is why I needed to grab the full domain with (.*\..{2,4}) versus the first version's (.*)\.(.*)\..{2,4}, which would have only grabbed "sinamail". Now we get

These can be checked with: sina\.com\.cn|go\.mail\.ru|mail\.com|promail\.co\.nz|web\.de|outlook\.com
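As a rough sketch of how that check could be run outside Google Analytics, the Python snippet below applies the same expression to a list of referrer hostnames. The hostnames are invented for illustration, and I am assuming an unanchored search, which is how I read the expression above.

```python
import re

# The check expression from the post, used as an unanchored search.
CHECK = re.compile(
    r"sina\.com\.cn|go\.mail\.ru|mail\.com|promail\.co\.nz|web\.de|outlook\.com"
)

# Hypothetical referrer hostnames.
referrers = [
    "mail.sina.com.cn",
    "go.mail.ru",
    "www.promail.co.nz",
    "example.org",
]

# Keep only the hosts that contain one of the hard-to-extract domains.
flagged = [host for host in referrers if CHECK.search(host)]
print(flagged)
```

Because nothing is anchored, mail\.com will also hit a host such as hotmail.com, so this is best used as a quick eyeball check rather than a strict test.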

Errors and Issues

Currently the filter incorrectly tracks the following referrals as word of email:

I will update the filter to address this issue at some point.


Thanks to Olivier Resoneo for the original inspiration (French). His code was:

Group all French-language webmail under the main domain name
Custom filter
Field A: Campaign Source: (messag|courrie|zimbra|ima?p|mail|prd[0-9]+)(.*)\.(.*)\..{2,4}
Field B: (nothing) -
Output To -> Constructor: Campaign Source: Webmail - $A3
You can also adapt this to force the medium to 'email' when Campaign Source matches, AND set Campaign to email-non-taggue for example, to get the medium/source/campaign triplet
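For anyone who wants to try Olivier's filter away from Google Analytics, here is a small Python sketch that mimics it: search the Campaign Source with his Field A regex and, on a match, overwrite it with the "Webmail - $A3" constructor. The function name rollup_source and the hostnames are hypothetical, and Python's re engine may assign capture groups differently from the engine Google Analytics uses, so this is only an approximation of the filter's behaviour.

```python
import re

# Olivier's Field A expression, copied as quoted above.
FIELD_A = re.compile(
    r"(messag|courrie|zimbra|ima?p|mail|prd[0-9]+)(.*)\.(.*)\..{2,4}"
)


def rollup_source(source: str) -> str:
    """Mimic the advanced filter: rewrite the source only when Field A matches."""
    match = FIELD_A.search(source)
    if match is None:
        return source  # no match, leave the Campaign Source untouched
    # Python writes the third capture group as \3 where GA writes $A3.
    return match.expand(r"Webmail - \3")


# Hypothetical Campaign Source values.
print(rollup_source("sn124w.snt124.mail.example.com"))
print(rollup_source("zimbra.example.fr"))
print(rollup_source("example.org"))
```

With these inputs the first two sources roll up to the same "Webmail - example" label, while example.org passes through unchanged.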
Posted by tomachi on May 6th, 2012 filed in Google Analytics, Online Marketing