Posted by russangular
Given this blog's readership, chances are good you will spend some time this week looking at backlinks in one of the growing number of link data tools. We know backlinks continue to be one of the most important parts of Google's ranking algorithm, if not the most important. We tend to take these link data sets at face value, though, in part because they are all we have. But when your rankings are on the line, is there a better way to figure out which data set is best? How should we go about assessing link indexes like Moz, Majestic, Ahrefs, and SEMrush for quality? Historically, there have been four common approaches to this question of index quality...
- Breadth: We might choose to look at the number of linking root domains any given service reports. We know that referring domains correlates strongly with search rankings, so it makes sense to judge a link index by how many unique domains it has discovered and indexed.
- Depth: We also might choose to look at how deep the web has been crawled, looking more at the total number of URLs in the index, rather than the diversity of referring domains.
- Link Overlap: A more sophisticated approach might count the number of links an index has in common with Google Webmaster Tools.
- Freshness: Finally, we might choose to look at the freshness of the index. What percentage of links in the index are still live?
There are a number of really good studies (some newer than others) using these techniques that are worth checking out when you get a chance:
- BuiltVisible analysis of Moz, Majestic, GWT, Ahrefs and Search Metrics
- SEOBook comparison of Moz, Majestic, Ahrefs, and Ayima
- MatthewWoodward study of Ahrefs, Majestic, Moz, Raven and SEO Spyglass
- Marketing Signals analysis of Moz, Majestic, Ahrefs, and GWT
- RankAbove comparison of Moz, Majestic, Ahrefs and Link Research Tools
- StoneTemple study of Moz and Majestic
While these are all excellent applications of the methodologies above, they share a particular limitation. They miss one of the most important metrics we need to determine the value of a link index: proportional representation to Google's link graph. So here at Angular Marketing, we decided to take a closer look.
Proportional representation to Google Search Console data
So, why is it important to determine proportional representation? Many of the most important and valued metrics we use are built on proportional models. PageRank, MozRank, CitationFlow and Ahrefs Rank are proportional in nature. The score of any one URL in the data set is relative to the other URLs in the data set. If the data set is biased, the results are biased.
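To make that concrete, here is a minimal sketch (a toy calculation, not any provider's actual formula) of how the same URL can receive a different relative score depending on which index it is scored in:

```python
# Toy illustration of a proportional metric: a URL's score is its share of
# all links recorded in the index, so the score depends on what else was crawled.

def relative_score(link_counts, url):
    """Score a URL as its fraction of all links recorded in the index."""
    total = sum(link_counts.values())
    return link_counts[url] / total

# Same page, same real-world links, two different (hypothetical) indexes:
broad_index  = {"a.com": 100, "b.com": 300, "c.com": 600}  # c.com was crawled
biased_index = {"a.com": 100, "b.com": 300}                # c.com was never crawled

print(relative_score(broad_index, "a.com"))   # 0.10
print(relative_score(biased_index, "a.com"))  # 0.25 -- inflated by the biased sample
```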
A Visualization
Link graphs are biased by their crawl prioritization. Because there is no full representation of the Internet, every link graph, even Google's, is a biased sample of the web. Imagine for a second that the picture below is of the web. Each dot represents a page on the Internet, and the dots surrounded by green represent a fictitious index by Google of certain sections of the web.
Of course, Google isn't the only organization that crawls the web. Other organizations like Moz, Majestic, Ahrefs, and SEMrush have their own crawl prioritizations which result in different link indexes.
In the example above, you can see different link providers trying to index the web like Google. Link data provider 1 (purple) does a good job of building a model that is similar to Google's. It isn't very big, but it is proportional. Link data provider 2 (blue) has a much larger index, and likely has more links in common with Google than link data provider 1, but it is highly disproportional. So, how would we go about measuring this proportionality? And which data set is the most proportional to Google?
Methodology
The first step is to determine a basis of comparison for the analysis. Google doesn't give us very much information about their link graph. All we have is what is in Google Search Console. The best source we can use is referring domain counts. In particular, we want to look at what we call referring domain link pairs. A referring domain link pair would be something like ask.com->mlb.com: 9,444, which means that ask.com links to mlb.com 9,444 times.
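As a rough illustration, you can think of this data as a table keyed by (linking domain, linked domain). The representation below is an assumption made for the sake of the sketch, not the actual storage format used in the study:

```python
# Referring domain link pairs as a simple mapping:
# (linking_domain, linked_domain) -> number of links between them.
# The ask.com -> mlb.com figure comes from the example above; the second
# entry is a made-up value purely for illustration.

referring_domain_pairs = {
    ("ask.com", "mlb.com"): 9444,
    ("espn.com", "mlb.com"): 1500,  # hypothetical
}

google_count = referring_domain_pairs[("ask.com", "mlb.com")]  # 9444
```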
Steps
- Determine the referring domain link pairs and counts for 100+ sites in Google Search Console
- Determine the same for Ahrefs, Moz, Majestic Fresh, Majestic Historic, SEMrush
- Compare the referring domain link pairs of each data set to Google's, assuming a Poisson distribution (a rough sketch of this comparison follows the list)
- Run simulations of each data set's performance against each other (e.g., Moz vs. Majestic, Ahrefs vs. SEMrush, Moz vs. SEMrush, and so on)
- Analyze the results
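The exact model behind the comparison step isn't published, but here is a minimal sketch of what a Poisson-based comparison could look like: rescale a provider's pair counts to Google's total, then score how likely Google's counts are under those rescaled counts as Poisson rates. The function names and the rescaling step are assumptions for illustration, not the authors' actual code.

```python
import math

def poisson_logpmf(k, lam):
    """log P(X = k) for X ~ Poisson(lam)."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def proportionality_score(google_pairs, provider_pairs):
    """Total log-likelihood of Google's pair counts under the provider's
    rescaled counts as Poisson rates; higher (less negative) = more proportional."""
    shared = set(google_pairs) & set(provider_pairs)
    # Rescale so raw index size doesn't dominate the comparison.
    scale = (sum(google_pairs[p] for p in shared)
             / sum(provider_pairs[p] for p in shared))
    return sum(poisson_logpmf(google_pairs[p], provider_pairs[p] * scale)
               for p in shared)

# Example usage: score each provider against the same Search Console pairs, e.g.
# proportionality_score(gsc_pairs, moz_pairs) vs.
# proportionality_score(gsc_pairs, majestic_pairs)
```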
Results
When placed head-to-head, there seem to be some clear winners at first glance. Moz edges out Ahrefs in the direct comparison, but across the board the two fare quite evenly. Moz, Ahrefs, and SEMrush seem to be far better than Majestic Fresh and Majestic Historic. Is that really the case? And why?
It turns out there is an inverse relationship between index size and proportional relevancy. This might seem counterintuitive: shouldn't the bigger indexes be closer to Google? Not exactly.
What does this mean?
Each organization has to create a crawl prioritization strategy. When you discover millions of links, you have to prioritize which ones you might crawl next. Google has a crawl prioritization, and so do Moz, Majestic, Ahrefs, and SEMrush. There are lots of different things you might choose to prioritize (a hypothetical blend of these signals is sketched after the list below)...
- You might prioritize link discovery. If you want to build a very large index, you could prioritize crawling pages on sites that have historically provided new links.
- You might prioritize content uniqueness. If you want to build a search engine, you might prioritize finding pages that are unlike any you have seen before. You could choose to crawl domains that historically provide unique data and little duplicate content.
- You might prioritize content freshness. If you want to keep your search engine recent, you might prioritize crawling pages that change frequently.
- You might prioritize content value, crawling the most important URLs first based on the number of inbound links to that page.
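As a hypothetical sketch (the signals, weights, and URL here are illustrative assumptions, not any provider's actual strategy), a crawler might blend these priorities into a single score and feed it into a priority queue:

```python
import heapq

def crawl_priority(stats, w_discovery=0.4, w_unique=0.2, w_fresh=0.2, w_value=0.2):
    """Blend normalized (0-1) signals into one crawl priority score."""
    return (w_discovery * stats["new_link_discovery"]
            + w_unique * stats["content_uniqueness"]
            + w_fresh * stats["change_frequency"]
            + w_value * stats["inbound_link_value"])

# heapq is a min-heap, so push negative scores to pop the highest-priority URL first.
frontier = []
heapq.heappush(frontier, (-crawl_priority({"new_link_discovery": 0.8,
                                           "content_uniqueness": 0.3,
                                           "change_frequency": 0.5,
                                           "inbound_link_value": 0.9}),
                          "https://example.com/hub-page"))
neg_score, next_url = heapq.heappop(frontier)  # crawl this URL next
```

Shift the weights and you get a very different frontier, which is exactly why two large indexes can diverge so much from one another and from Google.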
Chances are, an organization's crawl priority will blend some of these features, but it's difficult to design one exactly like Google's. Imagine for a moment that instead of crawling the web, you want to climb a tree. You have to come up with a tree-climbing strategy.
- You decide to climb the longest branch you see at each intersection.
- One friend of yours decides to climb the first new branch he reaches, regardless of how long it is.
- Your other friend decides to climb the first new branch she reaches only if she sees another branch coming off of it.
Despite having different climbing strategies, everyone chooses the same first branch, and everyone chooses the same second branch. There are only so many different options early on.
But as the climbers go further and further along, their choices eventually produce differing results. This is exactly the same for web crawlers like Google, Moz, Majestic, Ahrefs and SEMrush. The bigger the crawl, the more the crawl prioritization will cause disparities. This is not a deficiency; this is just the nature of the beast. However, we aren't completely lost. Once we know how index size is related to disparity, we can make some inferences about how similar a crawl priority may be to Google.
Unfortunately, we have to be careful in our conclusions. We only have a few data points to work with, so it is very difficult to be certain about this part of the analysis. In particular, it seems strange that Majestic would get better relative to its index size as it grows, unless Google holds on to old data (which might be an important discovery in and of itself). Most likely, we simply can't draw conclusions at that level of detail yet.
So what do we do?
Let's say you have a list of domains or URLs for which you would like to know their relative values. Your process might look something like this (a sketch in code follows the list)...
- Check Open Site Explorer to see if all the URLs are in its index. If so, you are looking at metrics most likely to be proportional to Google's link graph.
- If any of the links do not occur in the index, move to Ahrefs and use their Ahrefs Rank if all you need is a single PageRank-like metric.
- If any of the links are missing from Ahrefs's index, or you need something related to trust, move on to Majestic Fresh.
- Finally, use Majestic Historic for (by leaps and bounds) the largest coverage available.
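Here is a minimal sketch of that fallback process. The fetch_* functions are placeholders for however you query each index; none of them are real API calls:

```python
def lookup_link_metrics(urls, fetch_moz, fetch_ahrefs,
                        fetch_majestic_fresh, fetch_majestic_historic):
    """Try the most Google-proportional index first, then fall back to broader ones.
    Each fetch_* placeholder takes a list of URLs and returns {url: metrics}."""
    sources = [fetch_moz, fetch_ahrefs, fetch_majestic_fresh, fetch_majestic_historic]
    all_results = []
    for fetch in sources:
        results = fetch(urls)
        all_results.append(results)
        if all(url in results for url in urls):  # every URL found in this index
            return results
    # No single index covered everything: merge, letting the more
    # Google-proportional sources win on conflicts.
    merged = {}
    for results in reversed(all_results):
        merged.update(results)
    return merged
```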
It is important to point out that the likelihood that all the URLs you want to check are in a single index increases as the accuracy of the metric decreases. Given the size of Majestic's data, you can't ignore them: you are less likely to get null-value answers from their index than from the others. If anything rings true, it is that once again it makes sense to get data from as many sources as possible. You won't get the most proportional data without Moz, the broadest data without Majestic, or everything in between without Ahrefs.
What about SEMrush? They are making progress, but they don't publish any relative statistics that would be useful in this particular case. Maybe we can hope to see more from them soon given their already promising index!
Recommendations for the link graphing industry
All we hear about these days is big data; we almost never hear about good data. I know the teams at Moz, Majestic, Ahrefs, SEMrush, and others are interested in mimicking Google, but I would love to see some organization stand up against the allure of more data in favor of better data: data more like Google's. It could begin with testing various crawl strategies to see whether they produce results more similar to the data shared in Google Search Console. Having the most Google-like data is certainly a crown worth winning.
Credits
Thanks to Diana Carter at Angular for assistance with data acquisition and Andrew Cron with statistical analysis. Thanks also to the representatives from Moz, Majestic, Ahrefs, and SEMrush for answering questions about their indices.