The Funk Blog

Some stuff that some people wrote

Get rid of live.com load-balanced domains with a regular expression: GA Webmail Referral Traffic Source Rollup Filter

The problem? Nasty load balanced domains in Google Analytics reports like this:


Source                                              Visits
36ohk6dgmcd1n-c.c.yom.mail.yahoo.net / referral     149
mail.google.com / referral                          131
du114w.dub114.mail.live.com / referral              43
du103w.dub103.mail.live.com / referral              25
sn124w.snt124.mail.live.com / referral              23

The solution? Clean nicely segmented source lines that “roll up” into one:


Source/Medium                   Visits
Webmail (live.com) / email      643
Webmail (yahoo.com) / email     258
Webmail (google.com) / email    105
Webmail (aol.com) / email       13
Webmail (libero.it) / email     12
Webmail (laposte.net) / email   23

To clean these up requires two filters:

  • Webmail Source Rollup (search and replace with Webmail (brand))
  • Webmail Medium Rollup (swap / referral with / email as medium for further rolling up!)

Both these filters use the same core regex, which I’ve worked out to consolidate 99% of the world’s webmail systems without, in theory, pulling in any false positives.

The Magic Regex Code

Here is the filter regex that I’ve been using in my production Google Analytics advanced filters since 15 Feb 2012 to clean up, or roll up, the load-balanced domains that you often get in referrals:

My starting point:

(messag|courrier|zimbra|imp|mail)(.*)\.(.*)\..{2,4}

This grabs any domain with, say, “mail” in it, but then runs a check on the rest of the domain: there must be at least two dots after the keyword, followed by a TLD between 2 and 4 chars long. It will miss go.mail.ru and mail.com, because only one dot follows “mail” in each. It will also miss “Mail Campaign \ email” because this is not a proper domain. So far so good 🙂
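If you want to try the starting-point regex outside GA, here is a rough Python sketch. It assumes GA's partial matching behaves like `re.search`, which I have not verified against GA's own engine, and the `matches` helper is just a name I made up for the test.

```python
import re

# The starting-point regex from above, unchanged.
START = re.compile(r"(messag|courrier|zimbra|imp|mail)(.*)\.(.*)\..{2,4}")

def matches(source):
    """Return True if the pattern is found anywhere in the source string."""
    return START.search(source) is not None

# A few of the sources mentioned in this post.
for source in [
    "sn124w.snt124.mail.live.com",
    "mail.google.com",
    "go.mail.ru",
    "mail.com",
    "Mail Campaign \\ email",
]:
    print(source, "->", matches(source))
```

The first two sources match; the last three do not, because fewer than two dots follow the keyword.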

My improvement with exceptions (This is the one to use!):

(messag|courrier|zimbra|^imp|mail).*\.(.*\..{2,4}|go\.mail\.ru|promail\.co\.nz|service\.mail\.com|3c\.web\.de|outlook\.com)

The domain of the webmail platform ends up in capture group $A2.

My improved regex has a new exception section at the end to allow some special cases (go.mail.ru, 3c.web.de, service.mail.com) to get through the filter using a hard-coded approach that skips the safety net and autonomy provided by the wide open keyword matching paired with a domain name restriction ensuring “two more dots and a TLD”.
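Here is a small Python sketch of what the source rollup does with the second capture group. It is an approximation only: Python's `re` stands in for GA's regex engine, and `rollup_source` is a hypothetical helper, not anything GA provides.

```python
import re

# The improved regex from above, split across lines only for readability.
ROLLUP = re.compile(
    r"(messag|courrier|zimbra|^imp|mail).*\."
    r"(.*\..{2,4}|go\.mail\.ru|promail\.co\.nz|service\.mail\.com"
    r"|3c\.web\.de|outlook\.com)"
)

def rollup_source(source):
    """Return 'Webmail (<group 2>)' if the source matches, else None."""
    m = ROLLUP.search(source)
    if m is None:
        return None
    return "Webmail ({})".format(m.group(2))

for source in ["du114w.dub114.mail.live.com", "mail.google.com", "www.funk.co.nz"]:
    print(source, "->", rollup_source(source))
```

In this sketch the bare hosts go.mail.ru and service.mail.com return None, because a keyword has to be followed by at least one more dot before the hard-coded alternatives are tried, so it is worth confirming in GA itself that those exceptions fire. Hosts with several labels after the keyword may also split differently depending on the regex engine.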

Maybe that is excessive use of regex, but at least you can be sure you can now see your word of mouth / word of email traffic nice and tidy!

How does it work?

I’ll break the formula down in sections:

(messag|courrier|zimbra|^imp|mail)

This looks for the really obvious and common keywords in webmail services, the main one being mail. This will match sn124w.snt124.mail.live.com but it will also grab emailchimp.com and a huge number of others that you really don’t want to catch with this filter. If we were looking for live (but we aren’t in this case), then a site like www.answeringoLIVEr.com would get picked up in the crossfire. I use a ^ in front of imp so that domains like dimpost.wordpress.com don’t get caught.
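To see how greedy the keyword part is on its own, and what the ^ in front of imp buys you, here is a quick Python sketch. The host imp.example.com is a made-up example, and Python's `re` is only standing in for GA here.

```python
import re

# Keyword part only, with and without the ^ anchor on "imp".
KEYWORDS = re.compile(r"(messag|courrier|zimbra|^imp|mail)")
KEYWORDS_UNANCHORED = re.compile(r"(messag|courrier|zimbra|imp|mail)")

def keyword_hit(pattern, source):
    """Return the keyword that matched, or None if nothing matched."""
    m = pattern.search(source)
    return m.group(1) if m else None

for source in [
    "sn124w.snt124.mail.live.com",
    "emailchimp.com",
    "dimpost.wordpress.com",
    "imp.example.com",  # hypothetical host that starts with "imp"
]:
    print(source, keyword_hit(KEYWORDS, source), keyword_hit(KEYWORDS_UNANCHORED, source))
```

With the anchor, dimpost.wordpress.com is left alone while a host that starts with imp still matches; without it, the "imp" inside "dimpost" would be picked up.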

Basically the first part of the domain name (sn124w.snt124.mail) gets deleted by this filter, so you could stand to lose quite a lot of data with the “mail” and “imp” keywords if these were the only parts of the expression! So the next bit of the filter is designed to pass the domain through another difficult test involving the dots and TLDs…

(.*)\.(.*)\..{2,4}

This makes sure that the domain bits after mail or zimbra or whatever always have two dots and a TLD (top level domain extension, e.g. .nz or .jp). That matches the end bit of sn124w.snt124.mail.live.com and the end of www.funk.co.nz.

The (.*) part means match anything, including nothing, and the \. means there must be a dot, so (.*)\.(.*)\. means there have got to be at least two more dots in the domain name after the keyword. That is how Answering Oliver escapes the test for “live”: only one dot follows it. The next part, .{2,4}, is all about the top level domain or TLD. These can be 2, 3, or 4 letters long, like .co, .com, and .mobi. The curly braces specify how many times the previous character . (which means any single char you like, but not nothing) can appear, as {min,max}.
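The two pieces of that tail can be poked at separately in Python. This is a sketch only: `has_two_dots_and_tld` and `tld_length_ok` are names I invented, and the `fullmatch` on the TLD is stricter than the real filter, which does not anchor the end of the string.

```python
import re

# The "two more dots and a TLD" tail, tested against what follows the keyword.
TAIL = re.compile(r"(.*)\.(.*)\..{2,4}")
# The TLD length check on its own.
TLD = re.compile(r".{2,4}")

def has_two_dots_and_tld(rest):
    """True if the text after the keyword starts a valid tail."""
    return TAIL.match(rest) is not None

def tld_length_ok(tld):
    """True if the whole string is 2 to 4 characters long."""
    return TLD.fullmatch(tld) is not None

print(has_two_dots_and_tld(".live.com"))  # what follows "mail" in mail.live.com
print(has_two_dots_and_tld("r.com"))      # what follows "live" in answeringoliver.com
print([tld_length_ok(t) for t in ["c", "co", "com", "mobi", "museum"]])
```

The first call finds two dots and passes; the second finds only one and fails, which is the Answering Oliver case.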

Then in the middle is a pipe | which cuts the regex open and lets some really hard to match exceptions through for smaller webmail systems. The only reason you see the web.de one also appear on the left of the central | is that it is such a wacky domain name that it doesn’t match the “mail” keyword, which gets most of the webmail systems on the planet. Germans aye? 🙂

* (Hi Devon! Thanks for sending traffic to Stray Travel I found you researching this post)

Screenshots

Campaign Source Filter

Advanced filter.

Field A -> Extract A: Campaign Source:

(messag|courrier|zimbra|^imp|mail).*\.(.*\..{2,4}|go\.mail\.ru|promail\.co\.nz|service\.mail\.com|3c\.web\.de|outlook\.com)

Field B -> Extract B: [leave both blank]

Output To -> Constructor: Campaign Source: Webmail ($A2)

GA Webmail Rollup Filter

Medium Rollup Filter

Advanced filter.

Field A -> Extract A: Campaign Source:

(messag|courrier|zimbra|^imp|mail).*\.(.*\..{2,4}|go\.mail\.ru|promail\.co\.nz|service\.mail\.com|3c\.web\.de|outlook\.com)

Field B -> Extract B: [leave both blank]

Output To -> Constructor: Campaign Medium: email
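Put together, the two filters behave roughly like this Python sketch. It is a simulation under my own assumptions, not GA's implementation: `apply_webmail_filters` is a hypothetical function and Python's `re` stands in for GA's engine.

```python
import re

# Same regex as used in both filters above.
ROLLUP = re.compile(
    r"(messag|courrier|zimbra|^imp|mail).*\."
    r"(.*\..{2,4}|go\.mail\.ru|promail\.co\.nz|service\.mail\.com"
    r"|3c\.web\.de|outlook\.com)"
)

def apply_webmail_filters(source, medium):
    """Return the (source, medium) pair after both rollups."""
    m = ROLLUP.search(source)  # matched once, on the original source
    if m is None:
        return source, medium  # not webmail: leave the hit untouched
    return "Webmail ({})".format(m.group(2)), "email"

print(apply_webmail_filters("du103w.dub103.mail.live.com", "referral"))
print(apply_webmail_filters("www.funk.co.nz", "referral"))
```

Note that the match is taken on the original source. The rewritten label, for example Webmail (live.com), no longer matches the regex in this sketch, so if GA applies the filters one after another it may matter that the medium filter sees the source before the source filter rewrites it.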

GA Webmail Rollup Filter

Additional reference domains to check:

The domains below are the really rare webmail clients that are hard to extract:

mail175-236.sinamail.sina.com.cn
go.mail.ru
service.mail.com
promail.co.nz
3d.web.de
ch1prd0310.outlook.com

This is why I needed to grab the full domain with (.*\..{2,4}) rather than the first version’s (.*)\.(.*)\..{2,4}, which would have grabbed only “sinamail”. Now we get sinamail.sina.com.cn.

These can be checked with: sina\.com\.cn|go\.mail\.ru|mail\.com|promail\.co\.nz|web\.de|outlook\.com
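That check regex is easy to run over the list above in Python. This is a minimal sketch, assuming a plain substring-style search is what you want; `is_rare_webmail` is just a name for this example.

```python
import re

# The check regex from above, unchanged.
CHECK = re.compile(
    r"sina\.com\.cn|go\.mail\.ru|mail\.com|promail\.co\.nz|web\.de|outlook\.com"
)

# The reference domains exactly as listed in this post.
REFERENCE_DOMAINS = [
    "mail175-236.sinamail.sina.com.cn",
    "go.mail.ru",
    "service.mail.com",
    "promail.co.nz",
    "3d.web.de",
    "ch1prd0310.outlook.com",
]

def is_rare_webmail(source):
    """True if the source contains one of the hard-to-extract domains."""
    return CHECK.search(source) is not None

for domain in REFERENCE_DOMAINS:
    print(domain, "->", is_rare_webmail(domain))
```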

Errors and Issues

Currently the filter incorrectly tracks the following referral as word of email:

dailymail.co.uk

I will update the filter to address this issue at some point.
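Until then, here is a Python sketch that reproduces the false positive and shows one possible workaround: checking a small exclusion list before the regex. The `EXCLUDED_HOSTS` set and `is_webmail` helper are my own illustration, not a GA feature, and this is not necessarily how the filter will end up being fixed.

```python
import re

# The improved regex from above, unchanged.
ROLLUP = re.compile(
    r"(messag|courrier|zimbra|^imp|mail).*\."
    r"(.*\..{2,4}|go\.mail\.ru|promail\.co\.nz|service\.mail\.com"
    r"|3c\.web\.de|outlook\.com)"
)

# Hypothetical list of known false positives (exact hostnames only).
EXCLUDED_HOSTS = {"dailymail.co.uk"}

def is_webmail(source):
    """True if the source looks like webmail and is not explicitly excluded."""
    if source in EXCLUDED_HOSTS:
        return False
    return ROLLUP.search(source) is not None

print(ROLLUP.search("dailymail.co.uk") is not None)  # the raw regex is fooled
print(is_webmail("dailymail.co.uk"))
print(is_webmail("sn124w.snt124.mail.live.com"))
```

The exclusion is an exact hostname match, so a variant such as a www. prefix would need its own entry.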

References

Thanks to Olivier Resoneo for the original inspiration (French). His code was:

Group all French-language webmail under the main domain name
Custom filter
Advanced
Field A: Campaign Source: (messag|courrie|zimbra|ima?p|mail|prd[0-9]+)(.*)\.(.*)\..{2,4}
Field B: (nothing) –
Output To -> Constructor: Campaign Source: Webmail – $A3
Field A Required: Yes
Field B Required: No
Override Output Field: Yes
Case Sensitive: No
You can also adapt this to force the medium to ’email’ when Campaign Source matches, AND set Campaign to email-non-taggue for example, to get the medium/source/campaign triplet.