Get rid of live.com load-balanced domains with a regular expression: GA Webmail Referral Traffic Source Rollup Filter

Posted by tomachi on May 6th, 2012 filed in Google Analytics, Online Marketing

The problem? Nasty load balanced domains in Google Analytics reports like this:


Source Visits
36ohk6dgmcd1n-c.c.yom.mail.yahoo.net / referral 149
mail.google.com / referral 131
du114w.dub114.mail.live.com / referral 43
du103w.dub103.mail.live.com / referral 25
sn124w.snt124.mail.live.com / referral 23

The solution? Clean nicely segmented source lines that “roll up” into one:


Source/Medium Visits
Webmail (live.com) / email 643
Webmail (yahoo.com) / email 258
Webmail (google.com) / email 105
Webmail (aol.com) / email 13
Webmail (libero.it) / email 12
Webmail (laposte.net) / email 23

To clean these up requires two filters:

  • Webmail Source Rollup (search and replace with Webmail (brand))
  • Webmail Medium Rollup (swap / referral with / email as medium for further rolling up!)

Both these filters use the core regex I’ve figured out, which in theory consolidates 99% of the world’s webmail systems without pulling in any false positives.

The Magic Regex Code

Here is the filter regex that I’ve been using in my production Google Analytics advanced filters since 15 Feb 2012 to clean up these referrals, that is, to roll up the load-balanced domains that you often get:

My starting point:

(messag|courrier|zimbra|imp|mail)(.*)\.(.*)\..{2,4}

This grabs any domain with say “mail” in it, but runs a check on the ending of the domain: it needs to have at least 2 dots after it and a TLD between 2 and 4 chars long. It will miss go.mail.ru and mail.com. It will also miss “Mail Campaign \ email” because this is not a proper domain. So far so good :)
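To see the starting-point pattern in action outside GA, here is a minimal sketch using Python's `re` module. It assumes that `re.search` is a fair stand-in for GA's partial matching; the helper name `matches_start` is made up for this illustration.

```python
import re

# The starting-point pattern, copied from the post.
START = re.compile(r"(messag|courrier|zimbra|imp|mail)(.*)\.(.*)\..{2,4}")

def matches_start(source):
    """Return True if the starting-point pattern matches anywhere in source."""
    return START.search(source) is not None

samples = [
    "sn124w.snt124.mail.live.com",  # keyword, then two dots and a TLD
    "mail.google.com",              # keyword, then two dots and a TLD
    "go.mail.ru",                   # only one dot after "mail"
    "mail.com",                     # only one dot after "mail"
]
for s in samples:
    print(f"{s:30} {matches_start(s)}")
```

The first two sources match and the last two do not, which lines up with the misses described above.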

My improvement with exceptions (This is the one to use!):

(messag|courrier|zimbra|^imp|mail).*\.(.*\..{2,4}|go\.mail\.ru|promail\.co\.nz|service\.mail\.com|3c\.web\.de|outlook\.com)

Domain of webmail platform is in capture group $A2.
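A quick way to compare what the two patterns capture is to run both over the same source. This is only a sketch in Python's `re`, on the assumption that it captures the same way GA does for this simple case. The names `OLD` and `NEW` are just labels for this example.

```python
import re

# Starting-point pattern: the brand ends up in the third group.
OLD = re.compile(r"(messag|courrier|zimbra|imp|mail)(.*)\.(.*)\..{2,4}")

# Improved pattern: the whole domain ends up in the second group ($A2 in GA).
NEW = re.compile(
    r"(messag|courrier|zimbra|^imp|mail).*\."
    r"(.*\..{2,4}|go\.mail\.ru|promail\.co\.nz|service\.mail\.com"
    r"|3c\.web\.de|outlook\.com)"
)

source = "sn124w.snt124.mail.live.com"
old_match = OLD.search(source)
new_match = NEW.search(source)

print(old_match.group(3))  # brand only, without the TLD
print(new_match.group(2))  # brand plus TLD
```

For this source the old pattern yields `live` and the improved one yields `live.com`.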

My improved regex has a new exception section at the end to allow some special cases (go.mail.ru, 3c.web.de, service.mail.com) to get through the filter using a hard-coded approach that skips the safety net and autonomy provided by the wide open keyword matching paired with a domain name restriction ensuring “two more dots and a TLD”.

Maybe that is excessive use of regex, but at least you can be sure you can now see your word of mouth / word of email traffic nice and tidy!

How does it work?

I’ll break the formula down in sections:

(messag|courrier|zimbra|^imp|mail)

This looks for the really obvious and common keywords in webmail services, the main one being mail. This will match sn124w.snt124.mail.live.com but it will also grab emailchimp.com and a huge number of others that you really don’t want to catch with this filter. If we were looking for live (but we aren’t in this case), then a site like www.answeringoLIVEr.com would get picked up in the crossfire. I use a ^ in front of imp so that domains like dimpost.wordpress.com don’t get caught.
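The effect of the `^` in front of `imp` is easy to check in isolation. The sketch below tests only the keyword group, in Python's `re`, which I'm assuming treats `^` the same way GA does. `imp.example.org` is a made-up domain used purely for illustration.

```python
import re

# Keyword group as used in the improved filter (imp must start the string).
ANCHORED = re.compile(r"(messag|courrier|zimbra|^imp|mail)")
# Same group with imp allowed anywhere, as in the starting-point pattern.
UNANCHORED = re.compile(r"(messag|courrier|zimbra|imp|mail)")

def keyword(pattern, source):
    """Return the keyword the pattern finds in source, or None."""
    m = pattern.search(source)
    return m.group(1) if m else None

for source in [
    "sn124w.snt124.mail.live.com",
    "emailchimp.com",
    "dimpost.wordpress.com",
    "imp.example.org",  # hypothetical domain that starts with "imp"
]:
    print(f"{source:30} {keyword(ANCHORED, source)!s:6} {keyword(UNANCHORED, source)}")
```

Only the unanchored version picks up `dimpost.wordpress.com`. Both versions still find `mail` inside `emailchimp.com`, which is why the keyword test cannot stand on its own.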

Basically the first part of the domain name (sn124w.snt124.mail) gets deleted by this filter, so you could stand to lose quite a lot of data with the “mail” and “imp” keywords if these were the only parts of the expression! So the next bit of the filter is designed to pass the domain through another difficult test involving the dots and TLDs…

(.*)\.(.*)\..{2,4}

This makes sure that the domain bits after mail or zimbra or whatever always have two dots and a TLD (top level domain extension, e.g. .nz, .jp). This matches the end bit of sn124w.snt124.mail.live.com and the end of www.funk.co.nz.

The (.*) part means match anything, including nothing, and the \. means there must be a dot, so (.*)\.(.*)\. means there have to be at least two more dots in the domain name after the keyword (after the “live”, say). That is how Answering Oliver escapes the test for live: only one dot follows the keyword in www.answeringoliver.com. The next part, .{2,4}, is all about the top level domain or TLD. These can be 2, 3, or 4 letters long, like .co, .com and .mobi. The curly braces specify how many times the previous character, the . (which means any single character), can appear, written as {min,max}.
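The “two more dots and a TLD” test can be tried on its own. The sketch below is a simplification: it cuts the domain at the first occurrence of a keyword and checks what is left against the tail pattern. The helper `tail_passes` is a made-up name, and Python's `re` is assumed to be close enough to GA here.

```python
import re

# The tail of the starting-point pattern: two dots, then 2 to 4 characters.
TAIL = re.compile(r"(.*)\.(.*)\..{2,4}")

def tail_passes(keyword, domain):
    """Check the text after the first occurrence of keyword against TAIL."""
    _, found, rest = domain.partition(keyword)
    return bool(found) and TAIL.match(rest) is not None

print(tail_passes("mail", "sn124w.snt124.mail.live.com"))  # ".live.com" is left
print(tail_passes("live", "www.answeringoliver.com"))      # "r.com" is left
print(tail_passes("mail", "emailchimp.com"))               # "chimp.com" is left

# The {min,max} quantifier on its own: 2 to 4 of any character.
for tld in ["c", "co", "com", "mobi"]:
    print(tld, re.fullmatch(r".{2,4}", tld) is not None)

# Note: inside TAIL the quantifier is not anchored to the end of the string,
# so a longer ending still satisfies it on its first few characters.
```

Only the first domain passes the tail test, so the other two are left alone even though they contain a keyword.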

Then in the middle is a pipe | which cuts the regex open and allows some really hard to match exceptions through for smaller webmail systems. Only reason you see the web.de one also appear on the left of the central | is because this is such a whacky domain name that it doesn’t match the “mail” which gets most of the webmail systems on the planet. Germans aye? :)

* (Hi Devon! Thanks for sending traffic to Stray Travel I found you researching this post)

Screenshots

Campaign Source Filter

Advanced filter.

Field A -> Extract A: Campaign Source:

(messag|courrier|zimbra|^imp|mail).*\.(.*\..{2,4}|go\.mail\.ru|promail\.co\.nz|service\.mail\.com|3c\.web\.de|outlook\.com)

Field B -> Extract B: [leave both blank]

Output To -> Constructor: Campaign Source: Webmail ($A2.$A3)

GA Webmail Rollup Filter


Medium Rollup Filter

Advanced filter.

Field A -> Extract A: Campaign Source:

(messag|courrier|zimbra|^imp|mail).*\.(.*\..{2,4}|go\.mail\.ru|promail\.co\.nz|service\.mail\.com|3c\.web\.de|outlook\.com)

Field B -> Extract B: [leave both blank]

Output To -> Constructor: Campaign Medium: email
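To preview what the two filters do together, here is a rough offline simulation over the sample rows from the top of this post. It is a sketch, not how GA runs filters: it uses Python's `re`, it uses the `Webmail ($A2)` constructor from my follow-up comment below, and it applies both rewrites from a single match, which sidesteps any question of filter ordering in GA. `apply_filters` is a made-up helper.

```python
import re
from collections import Counter

ROLLUP = re.compile(
    r"(messag|courrier|zimbra|^imp|mail).*\."
    r"(.*\..{2,4}|go\.mail\.ru|promail\.co\.nz|service\.mail\.com"
    r"|3c\.web\.de|outlook\.com)"
)

def apply_filters(source, medium):
    """Rewrite (source, medium) the way the two rollup filters intend."""
    m = ROLLUP.search(source)
    if m is None:
        return source, medium            # not webmail: leave untouched
    return f"Webmail ({m.group(2)})", "email"

rows = [
    ("36ohk6dgmcd1n-c.c.yom.mail.yahoo.net", "referral", 149),
    ("mail.google.com", "referral", 131),
    ("du114w.dub114.mail.live.com", "referral", 43),
    ("du103w.dub103.mail.live.com", "referral", 25),
    ("sn124w.snt124.mail.live.com", "referral", 23),
]

totals = Counter()
for source, medium, visits in rows:
    totals[apply_filters(source, medium)] += visits

for (source, medium), visits in totals.most_common():
    print(f"{source} / {medium} {visits}")
```

The three live.com rows collapse into a single line, while a source with no webmail keyword passes through unchanged.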

GA Webmail Rollup Filter


Additional reference domains to check:

The domains below are the really rare webmail clients that are hard to extract:

mail175-236.sinamail.sina.com.cn
go.mail.ru
service.mail.com
promail.co.nz
3d.web.de
ch1prd0310.outlook.com

This is why I needed to grab the full domain with (.*\..{2,4}) versus the first version’s (.*)\.(.*)\..{2,4}, which would have only grabbed “sinamail”. Now we get sinamail.sina.com.cn.

These can be checked with: sina\.com\.cn|go\.mail\.ru|mail\.com|promail\.co\.nz|web\.de|outlook\.com
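One way to run that check offline is to put the reference domains through both the check pattern and the rollup pattern side by side. This is a sketch in Python's `re`, not a GA feature, and the `report` dictionary is just for this example. The two columns need not agree, so any domain the rollup column misses is worth re-testing in GA itself.

```python
import re

CHECK = re.compile(
    r"sina\.com\.cn|go\.mail\.ru|mail\.com|promail\.co\.nz|web\.de|outlook\.com"
)
ROLLUP = re.compile(
    r"(messag|courrier|zimbra|^imp|mail).*\."
    r"(.*\..{2,4}|go\.mail\.ru|promail\.co\.nz|service\.mail\.com"
    r"|3c\.web\.de|outlook\.com)"
)

reference_domains = [
    "mail175-236.sinamail.sina.com.cn",
    "go.mail.ru",
    "service.mail.com",
    "promail.co.nz",
    "3d.web.de",
    "ch1prd0310.outlook.com",
]

# For each domain: (found by the check pattern, found by the rollup pattern)
report = {
    d: (CHECK.search(d) is not None, ROLLUP.search(d) is not None)
    for d in reference_domains
}
for domain, (in_check, in_rollup) in report.items():
    print(f"{domain:34} check={in_check!s:5} rollup={in_rollup}")
```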

Errors and Issues

Currently the filter is incorrectly tracking as word of email the following referrals:

dailymail.co.uk

I will update the filter to address this issue at some point.
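In the meantime, the false positive is easy to reproduce offline. The sketch below uses Python's `re` and shows why dailymail.co.uk slips through: it contains “mail” and is followed by two dots and a short ending. The `KNOWN_FALSE_POSITIVES` set and `is_webmail` helper are hypothetical, meant for checking exported data rather than as a fix inside GA.

```python
import re

ROLLUP = re.compile(
    r"(messag|courrier|zimbra|^imp|mail).*\."
    r"(.*\..{2,4}|go\.mail\.ru|promail\.co\.nz|service\.mail\.com"
    r"|3c\.web\.de|outlook\.com)"
)

m = ROLLUP.search("dailymail.co.uk")
print(m.groups())  # keyword and captured "domain" for the false positive

# Hypothetical exclusion list for offline checks only.
KNOWN_FALSE_POSITIVES = {"dailymail.co.uk"}

def is_webmail(source):
    """Rollup match, minus sources known to be ordinary websites."""
    return source not in KNOWN_FALSE_POSITIVES and ROLLUP.search(source) is not None

print(is_webmail("dailymail.co.uk"))
print(is_webmail("sn124w.snt124.mail.live.com"))
```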

References

Thanks to Olivier Resoneo for the original inspiration (French). His code was:

Group all French-language webmail under the main domain name
Custom filter
Advanced
Field A: Campaign Source: (messag|courrie|zimbra|ima?p|mail|prd[0-9]+)(.*)\.(.*)\..{2,4}
Field B: (nothing) -
Output To -> Constructor : Campaign Source : Webmail – $A3
Yes
No
Yes
No
You can also adapt this to force the medium to ‘email’ when Campaign Source matches, AND set Campaign to email-non-taggue for example, to get the medium/source/campaign triplet


3 Responses to “Get rid of live.com load-balanced domains with a regular expression: GA Webmail Referral Traffic Source Rollup Filter”

  1. Olivier RESONEO Says:

    Hi,

    Thanks for these improvements based on my original regexp.

    I would however appreciate if you can *quote* my original work
    It is available here
    https://gist.github.com/b34f2920fe20cd844e63

    Regards,

    Olivier

  2. Malhar Says:

    Hey,

    I am having trouble using the filter as described. In the first case, when creating the campaign source filter, why is the output “campaign source”- webmail ($A2.$A3)? Could you explain?

  3. tomachi Says:

    Hi Benchprep,
    campaign source is the variable that provides the “source” parameter in GA reports. Without the rollup filter these will be like sn124w.snt124.mail.live.com / referral.

    I’ve put “webmail” at the front so they sort together. $A2 is a regular expression “capture group”, which is a temporary storage area for data while the filter is running in real time. Capture groups in regex are surrounded by (round brackets), and all the capture group variables from the first Field A -> Extract A filter start with $A. The first capture group $A1 is discarded and not used, so you will not see “zimbra” in the reports. The second capture group $A2 gets written into the output, as well as $A3…. oh hang on, I’ve discovered an error in my post.

    You might like to try this regex:
    (messag|courrier|zimbra|imp|mail).*\.(.*\..{2,4}|go\.mail\.ru|promail\.co\.nz|service\.mail\.com|3c\.web\.de|outlook\.com)
    and the output constructor: Webmail ($A2)
    This outputs in reports like this: Webmail (live.com) / referral
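    For anyone who wants to see the capture groups for themselves, here is a small sketch in Python's `re`. It assumes GA numbers groups the same way, so `$A1` and `$A2` correspond to groups 1 and 2. The `constructor.replace` call only mimics the Output To -> Constructor field.

```python
import re

FINAL = re.compile(
    r"(messag|courrier|zimbra|imp|mail).*\."
    r"(.*\..{2,4}|go\.mail\.ru|promail\.co\.nz|service\.mail\.com"
    r"|3c\.web\.de|outlook\.com)"
)

m = FINAL.search("sn124w.snt124.mail.live.com")

print(FINAL.groups)  # how many capture groups the pattern defines
print(m.groups())    # their contents: ($A1, $A2)

# Mimic the constructor by substituting the second group into the template.
constructor = "Webmail ($A2)"
output = constructor.replace("$A2", m.group(2))
print(output)
```

    The pattern defines only two groups, so there is nothing for `$A3` to refer to.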