Monday, September 24, 2007

Wikipedia NOFOLLOW Argumentation - A View Back

This was a scrap or stub on my Cumbrowski.com site with the title "What the F**k is REL=NO FOLLOW?" for a while, and I decided to remove it from there. Wikipedia is now using nofollow and the discussion is a thing of the past. However, the arguments remain valuable as long as the rel=nofollow attribute is out there.

I decided to post the discussion here on my blog, where I have already made a number of posts related to Wikipedia, my activities at Wikipedia, and Wikipedia issues and discussions.

It's a lot to read and not for everybody, but worthwhile for anybody who is interested in the NOFOLLOW debate in general.

Cheers!
Carsten aka Roy/SAC

------------------------------------------------------------------------------------------------------------------

Originally written April 24, 2006

What the F**k is REL="NO FOLLOW"? - Original Proposition

Presented to WikiProject Spam on 4/17/2006.

Wikipedia is not the only site that suffers from so-called "link spam". Every site, and especially every blog, that offers anonymous visitors the ability to interact, comment or contribute (and often even encourages it) has a common problem: people who use these features for their own personal advantage, without the goal of contributing for the benefit of others. What were rare cases of abuse in the past have become frequent occurrences that are by now more than just annoyances. They have become a problem.

It is the same type of problem, with similar reasons for its existence, as email spam. Talk was not enough anymore. Tools and mechanisms had to be developed to reduce the negative impact of spam. The purpose of link spam is not as apparent as that of email spam, though. Email spam is usually sent with the goal of getting the recipient to open and read the email, which contains a commercial offer, in the hope that the reader acts and buys the offered product or service. Email spam has the goal of generating instant revenue and profit.

The Difference between eMail Spam and Link Spam

Link spam does not. A blog comment that is completely irrelevant to the blog article and contains a short message and a link to a commercial offer is intended neither for the article's author nor for its readers. If they respond to the offer, great, but that was not the spammer's original intent. The link is not meant to attract humans. It is intended to attract the invisible automated programs called "spiders" or "bots" that all major search engines such as Google, Yahoo!, MSN (MS Live) and Ask.com use to gather web content. That content is processed and later returned to users in the search engine results pages (SERPs) if the search engine considers it relevant for the keyword or phrase the user entered. The results that are considered most relevant are returned first. It is the goal of every search engine to rank the web pages that match the user's search query by relevance to the topic the user is searching for.

How Do Search Engines Work? What Is Their Goal?

Search engines use mind-boggling algorithms to calculate the "relevance" and thus the "ranking" of every indexed web page relative to the words and phrases found on the page. If two pages contain the word "science", the search engine must decide which of the two it believes to be more important and more relevant, so that it can show that page first when a user enters the search term "science" on the search engine's website. If you search for "science" at Google.com, over 4 million(!) web pages are found. Google must decide which of the 4+ million pages it should show to the user first. It tries, of course, to return first the ones that most likely contain the information the user is looking for. How do search engines determine the ranking of each page? How do they determine that page A is shown 5th for the term "science" and page B 4,000,000th? Both are obviously about "science" or they would not be considered for the results at all. Google, for example, determines the actual ranking using over 100 criteria.

One of the most important criteria is the so-called "PageRank" of a page. PageRank was introduced by Google and made them what they are today. The PageRank algorithm revolutionized search engines and produced fantastically accurate results. Read the original scientific paper on PageRank, "The Anatomy of a Large-Scale Hypertextual Web Search Engine" by the Google founders Sergey Brin and Lawrence Page, or Page Rank Explained by Phil Craven, to learn about the mathematical background of PageRank.

Search Engine Ranking - Google PageRank

The actual equations are very complicated, but the general concept is surprisingly simple. In simple words, the PageRank of a page gets higher the more other pages and sites link to it. Every link is a vote from one page for another. The linked-to page gains PageRank while the linking page loses a bit of its PageRank. I believe you are starting to understand what I am getting at and what the intentions of the link spammer are. Right, he wants to get a "link" or "vote" for his commercial website so that search engines like Google think that the page is more important.

The higher the rank of the linking page itself, the stronger the vote. A link from CNN.com's homepage is certainly a stronger vote for a web page than a link from a personal page at Geocities.com. That is the reason why more popular sites are targeted by link spammers more than less popular ones. Wikipedia is obviously very popular, thus a link from Wikipedia is worth a lot more than a link from a less popular site. Spammers are not only targeting public sites to get inbound links; they also create artificial link farms and purchase links from webmasters who are willing to cash in on their site's popularity. The search engines have actually become very smart at detecting artificial inbound link inflation, which makes link farms a lot less effective and can even cause the website that benefits from them to be penalized or banned from the search engine index.
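
The voting idea can be sketched in a few lines of Python. This is only a minimal sketch of a normalized variant of the simplified PageRank iteration from the Brin and Page paper, not Google's actual ranking code. Each page splits its vote evenly among the pages it links to, and the damping factor of 0.85 is the value the paper mentions. The page names, the tiny example web and the way pages without outbound links are handled are assumptions made purely for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by passing votes along links repeatedly.

    links maps each page name to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the same small base value.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # The vote of a page is split evenly among its outbound links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page without outbound links spreads its
                # vote evenly over all pages, so no rank gets lost.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical five-page web: "popular" is linked to by two pages,
# "personal" by none. Each of them links to one shop.
web = {
    "popular": ["shop_a"],
    "personal": ["shop_b", "popular"],
    "fan": ["popular"],
    "shop_a": [],
    "shop_b": [],
}
ranks = pagerank(web)
```

In this toy web, ranks["shop_a"] ends up higher than ranks["shop_b"], because shop_a receives the whole vote of a well-linked page while shop_b receives only half the vote of a page that nobody links to.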

Wikipedia is the perfect Target

Wikipedia is the perfect target for spammers to get inbound links to their site(s) without risking a penalty from the search engines, because it is almost impossible for the search engines to determine whether a link at Wikipedia was added because it is really relevant to the topic or just by a spammer to increase his PageRank. Blogs have the same problem, and Google developed a simple-to-implement mechanism that lets the blogger or webmaster eliminate the whole benefit of having an outbound link on those sites for the sole purpose of gaining PageRank, which is the only reason a spammer tries to place a link in the first place. Links can still be added and used by human visitors who are interested and click them. Search engine spiders that visit the page, on the other hand, will simply ignore the link; it will not count as a vote for the target website.

How is that done? Very simple: add the attribute rel="nofollow" to the HTML link tag.

<a href="http://www.website.com">Link Anchor</a> becomes
<a href="http://www.website.com" rel="nofollow">Link Anchor</a>
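
How a wiki or blog engine might apply this change automatically can be sketched as follows. This is a minimal Python sketch and not MediaWiki's actual implementation. The function name add_nofollow, and the rule that "external" means an absolute http:// or https:// URL, are assumptions made for illustration. A regular expression is good enough for a sketch; a real engine would more likely set the attribute while it renders its own link markup.

```python
import re

# Matches the opening <a ...> tag of a link.
A_TAG = re.compile(r"<a\b[^>]*>", re.IGNORECASE)
# Matches an href attribute whose value is an absolute http(s) URL.
EXTERNAL_HREF = re.compile(r"""\shref\s*=\s*["']?https?://""", re.IGNORECASE)
# Matches an already present rel attribute.
REL_ATTR = re.compile(r"\srel\s*=", re.IGNORECASE)


def add_nofollow(html):
    """Add rel="nofollow" to every link tag that points to an external URL."""

    def rewrite(match):
        tag = match.group(0)
        if not EXTERNAL_HREF.search(tag):
            return tag  # internal or relative link: leave it untouched
        if REL_ATTR.search(tag):
            return tag  # already has a rel attribute: left alone in this sketch
        # Insert the attribute just before the closing ">".
        return tag[:-1] + ' rel="nofollow">'

    return A_TAG.sub(rewrite, html)


before = '<a href="http://www.website.com">Link Anchor</a>'
after = add_nofollow(before)
```

Applied to the first link tag shown above, the function returns the second one. Links without an absolute URL, such as internal wiki links, pass through unchanged.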

Conclusion


As you can see, it is not hard to do at all. The change to the Wikipedia code is absolutely minor. The gains and benefits are beyond question. Does this solve the problem completely? No, of course not! But it will significantly reduce the issue, because a huge number of links that are added just because of PageRank will not be added anymore. The benefits of having an outgoing link from Wikipedia to a site are severely reduced, but of course not completely eliminated. There remains the benefit of human traffic clicking the link. In that case the link had better be highly relevant to the article, or it will be removed quickly by the wiki users anyway (without the need for an editor to take action).


I hope this clarifies the subject a bit more and finds some open ears somewhere, and that one of the wiki developers finally spends the necessary minutes (a few hours at the most) to implement this feature, saving the thousands and thousands of hours wasted by hundreds of editors who probably have better things to do and could use the saved time for more important contributions to Wikipedia.

--Roy-SAC 11:31, 17 April 2006 (UTC)
You can find some background information about me and my professional qualifications on my Professional Homepage to support the credibility of the statements made in this article. My email is available there as well, if you have any questions or anything else you would like to discuss with me outside the User Discussion Page.

The Discussion - Introduction and Summary

Copy/Backup of comments posted at Wikipedia_talk:WikiProject_Spam.

How to save hundreds or thousands of hours by spending just a few

Roy: I took the time to summarize and explain an important aspect of link spam on my user discussion page below. Some editors expressed the opinion in the past that the proposed solution will not help to significantly reduce the problem, which I vehemently reject. Even if the impact is not as large as I expect, it will still be large enough to justify the work necessary to implement the solution. Being an enterprise solution developer myself, I feel qualified to state that implementing the solution can only be a matter of hours, an amount of time that will with absolute certainty be saved many times over in the future when it comes to link spam removal.

This will not happen immediately, because word about the change has to get around and reach the potential link spammers first. Unless it is picked up by the media and other channels (bloggers etc.), a gradual impact should be expected. I invite everybody interested in this to join the discussion. Wikipedia developers and admins are more than welcome to join as well.
--Roy-SAC 11:55, 17 April 2006 (UTC)

Rhobite - Reasonable and respected Wikipedia Admin

Rhobite: You're acting like nofollow is a perfect solution to spam, but it isn't. Wikipedia has already had a large discussion about using nofollow. Mediawiki already has the technical ability to insert rel="nofollow" into links, but the community decided against it. See Wikipedia:Nofollow. Rhobite 15:01, 17 April 2006 (UTC)

Roy: Hi Rhobite. Thanks for the Move to the Discussion Section. It is not a perfect solution, but a working solution for one (major) part of the problem at hand.

I will go over the comments at Wikipedia:Nofollow in detail. It's been over a year now since the vote. The nofollow attribute was quite new back then, and traffic to Wikipedia has more than quadrupled since last year. I assume the issue today is also several times bigger than it was back then.

The solution works very well for other systems and sites such as blogs and has reduced the issue a lot. Spammers are now creating the blogs themselves via programs (using APIs) though :(. That is a different problem which requires a different solution.

The nofollow attribute does not diminish the true purpose of an honestly placed link. For a visitor who clicks on it (and hopefully finds some more useful content), it works the same as a link without the attribute. This little attribute restores the original idea of hyperlinking, when links were only placed on sites for visitors to follow, not computer programs.

Google is the no. 1 search engine worldwide with a 50-60% market share, despite the attempts of Yahoo!, MSN and Ask.com to compete with Google in the search engine game. Yahoo threw in the towel this January. Ask.com was gaining, but only a bit, and MSN is still working on getting their new search up and running. The situation did not get better, it got worse. The rel="nofollow" attribute should be added automatically by the Wikipedia engine to ANY external link (URLs starting with "http://"), regardless of whether it is in an article, discussion page, user page or system page.

There should be NO on/off switch. This should be announced loud and clear to the public, also explaining what it does and what it does NOT do. I bet you $100 that this will reduce the amount of link spam you get here at Wikipedia by at least a double-digit percentage.

Since the current policy pretty much considers most external links as spam (see the recommendation to link to the Yahoo Dir or Dmoz only, and that's it), the total number of external links placed across Wikipedia is a realistic measurement for evaluating the effects of adding the rel="nofollow" attribute to all external links.
Since this is a topic I know quite a lot about, I thought it is something I can contribute well to. Since I am shooting myself in the foot by proposing and pushing for something like this, any suspicion of a hidden agenda on my part can pretty much be ruled out. I do believe in the need for valuable external links that enrich the content of an article at Wikipedia or provide proof for statements made in one.
I don't see any reason why the attribute should NOT be added except the reason that you want Wikipedia to be part of the Ranking Game. I can imagine that some Wikipedians do not like the idea, especially the ones that have a personal interest in some of the external links to their own personal/business websites. --Roy-SAC 15:51, 17 April 2006 (UTC)

Rhobite: My objection remains the same as a year ago: It doesn't deter spammers. Pagerank isn't the sole reason people spam Wikipedia. This is a very visible site, and if I were a spammer I would want to be linked from here, even if it didn't improve my Pagerank. A link from a prominent Wikipedia article could generate a lot of revenue for an unscrupulous person. Furthermore, Wikipedia can and should improve the Pagerank of good, relevant links. Nofollow punishes operators of useful sites for the actions of spammers. Rhobite 16:57, 17 April 2006 (UTC)

Roy: It will certainly not deter all of them, probably not even the majority of them, but it will for sure deter some of them. If something becomes less lucrative, fewer people will be tempted by it. That is an undeniable fact.

You are probably qualified to provide some rough numbers here. Let me ask you this: how much spam is removed by members of the SPAM project across all pages of Wikipedia every month? Let's be very pessimistic and assume that only 1% of the spammers are deterred by the fact that they only gain from a link via visitors who read the article and actually click on that link, but gain nothing else in the long run, such as a higher rank in the Google SERPs and (a lot) more visitors from there.

How much time would 1% less spam save? Put that number next to the time it takes to implement the nofollow attribute (which is already in the code, as you mentioned). And also, how many FEWER links that should be in an article would get removed because of the suspicion that the person who added them had a more selfish intent than was actually the case?

You say that it will not deter any spammer at all, which means that the amount of spam would remain the same if the nofollow attribute were added. This statement is based on what? Intuition? Facts? Show them to me. I can PROVE to you that the reduction or, even better, the complete elimination of the PageRank value of a link will deter people from knowingly adding links for selfish reasons.

If you get the chance, talk to a DMOZ editor of an important commercial category. He will tell you that he still gets more submissions than he can handle, but he will also tell you that it is much less since Google devalued links to sites that are listed at Dmoz in their ranking algorithm. The "punishment" of useful sites will be less of an issue than you think. Regular sites that cannot be changed by every John and Joe out there will still link to those sites.

People who discover a site because of the link from Wikipedia will also pick up the URL and link to it (I have done that myself more than once). Once a site reaches a certain popularity, PageRank becomes less of a factor for the ranking. An increase from a PageRank of 6 to a rank of 7, for example, is huge, and it gets harder and harder, up to impossible, to get to a rank of 9 (there are maybe 1 or 2 dozen sites in the world that have that).

Let's summarize. It will certainly reduce spam if implemented consistently across the site and made public; it is easy to implement, because the Wikipedia code is already prepared for it; and last but not least, the effect on valuable (authority or popular) sites is minimal. If you disagree, explain why. --71.195.125.110 20:49, 17 April 2006 (UTC)
...

Stevietheman - Active Wikipedian

Well, I spent the time reading the complete Nofollow page from the intro to the votes and finally the comments. There was a lot of clutter (on both sides of the argument). I "stripped" out the comments that clearly showed that the writer had no clue about the meaning/purpose of the non-W3C-standard rel=nofollow attribute, or about spamming (link spamming and spamming in general) and especially not about Search Engine Optimization (SEO), in particular Google.

The remaining "on the topic" facts and arguments for both opinions were overwhelmingly in favor of keeping the attribute enabled. I was surprised to find out that "only" 41% voted to keep the newly implemented feature in Wikipedia (which was obviously "enabled" by default after the update that contained it was installed) and 61% voted for its removal (deactivation).

I have to speculate to explain this result. I guess a lot of the votes must have been based on "feelings" rather than facts, or other motives must have been a factor. But hey, I am irritated by the fact that you, Rhobite, somebody who is affected by the spam every single day as part of Wikipedia's first line of defense against link spam, are against the use of the attribute.

Anything that makes your life easier without violating any of your basic beliefs and opinions should be welcomed and even embraced by you. Is the spam problem not that bad? You should know best. Please tell me.

Btw, I think you did a great job fixing the grammar of my additions to the Affiliate marketing article about a month ago. You have great language skills, and you should use those skills more often on article content rather than wasting them on banal link spam removals.

I am working on improving my writing skills though (it is my second language after all). Thanks. --roy<sac> Talk! .oOo. 09:23, 18 April 2006 (UTC)

Stevietheman: In a democracy, or rather, a wikicracy, no one person can decide which votes to accept and which to set aside. We all apply our own value judgments when voting. The bottom line is that the wikicracy said we're not doing nofollow, and that's that. — Stevie is the man! Talk | Work 22:39, 18 April 2006 (UTC)

Roy: You are absolutely right about the democracy. The voting/election process in a democracy is essentially very simple. Everybody who is part of the society has one vote. All votes are counted equally. The value of a vote cannot be reduced or increased based on qualitative criteria. Emotions and feelings influence our decisions (votes), although most people try to be as objective as possible when it comes to that.

I just noticed that for that particular vote, emotions and feelings must have played a major role, because the objective information that was available at the time, and that should have played a major role in the decision-making process, conflicts with the actual votes.

"wikicracy said" ... "and that's that" sounds very absolute to me. Things that involve larger groups of human beings have a tendency to change over time. Those changes make it necessary for all of us to frequently check and adjust our opinions on things. They can confirm existing opinions, but they can also make it necessary to question an opinion as a whole and change it completely. Ignoring the changes and refusing to check whether the current opinion is still as valid as before has led to no good in the past.

World history is full of cases where absolutism, ignorance and stagnation caused a lot of pain and suffering, only to eventually end very suddenly and very violently.--roy<sac> Talk! .oOo. 04:22, 19 April 2006 (UTC)

Stevietheman: Even as somebody who detests link spam, I have always objected to using "rel=nofollow". The central reason is that by using it, Wikipedia is basically saying "We wish to not contribute any information to search engines that may aid in people finding the material they are seeking." In short, this would be an anti-search, anti-Internet move in my opinion. The value of search comes from how web documents relate to each other. Extricating the tremendously important resource that is the Wikipedia from this overall process would in turn remove a lot of value from Internet search. And I will jump up and down and up and down again if that helps in preventing the Wikipedia from ever making such a foolhardy decision to implement nofollow.

Now, add to the above the other common reasons for being against it, including "doing this won't really deter spam", which I also agree with. — Stevie is the man! Talk | Work 22:36, 18 April 2006 (UTC)

Summarizing Statement and Conclusion

Roy: I disagree with the statement that rel=nofollow is anti-search and anti-Internet. I agree that it will have some impact on search, to be precise, on search results at Google.com. It will have a positive and a negative impact, with the negative one declining over time to something negligible.

The positive impact is that the junk that is currently in Wikipedia will lose ranking and hopefully be replaced by more relevant content in the Google SERPs (I am referring to ANY part/page of the Wikipedia site that is accessible by the public, not just articles).

The negative impact is that good content that is being linked to may drop as well, but I strongly believe that for highly relevant, good external sources linked to from active and live article pages the drop will be marginal.
"Real" high-quality content sites and pages very often have a pretty high and honest (intended) PageRank. The loss of the vote from the one link at Wikipedia will have little or no impact.

Furthermore, PageRank is very specific to Google. Ranking based on the evaluation of "back links" is a very small factor for the Yahoo! search engine and virtually no factor for MSN. Google is the only search engine where it really matters, but Google has a 50-60% market share.

The rel=nofollow attribute was introduced by Google itself for sites that meet certain criteria. Wikipedia certainly fits the description of sites where Google recommends the use of the attribute. This contradicts the statement that the use of the rel=nofollow attribute is anti-search.

Nor is it anti-Internet; on the contrary, it is as pro-Internet as it can possibly get. Links to other websites were never intended for programs and scripts. They were meant for human visitors from the beginning. The rel=nofollow attribute will not change this but remind people of the true purpose of linking between websites. Back to the roots.

This article by Gary McHugh called "Stinking Linking Thinking" from a month ago hits the nail on the head. It explains very well the original intentions behind the use of the HREF attribute in HTML. A friendly reminder for everybody who has all but forgotten this after all those years of mutilation, rape and abuse of those beautifully simple and user-friendly tools.

Last but not least, I would still like to know some of the facts and details that led you to the following opinion: "doing this won't really deter spam". So far it looks to me only like a belief or feeling without any objective grounds to stand on. I hope you can help me with that one. --roy<sac> Talk! .oOo. 05:39, 19 April 2006 (UTC)

Roy: Here is an interesting post about the "nofollow" attribute by Matt Cutts (who is a senior engineer at Google). He blogged about it here. Arguments coming from such a highly knowledgeable and respected authority might convince some of you more than I was able to. --roy<sac> Talk! .oOo. 13:07, 19 May 2006 (UTC)

WikiProject:Spam Opinions and Facts wanted (Invitation)

After writing longer and longer invitations to join the discussion and provide input and data on some Wikipedians' Talk pages, I ended up with this rather long one, which I intend to keep posting on the Talk pages of other users who I believe can contribute to the collection of facts and past experiences. I encourage anybody who wants to help and knows a Wikipedian who might be able to provide valuable input for this cause to grab this paragraph and post it on that Wikipedian's Talk page, or simply link to it. Here is the link code:

[[User_talk:Cumbrowski#WikiProject:Spam_Opinions_and_Facts_wanted_(Invitation)|<u>'''WikiProject:Spam''' Opinions and Facts wanted (Invitation)</u>]]
The Link will look like this: WikiProject:Spam Opinions and Facts wanted (Invitation)

Hello my fellow Wikipedian!
I know the following text is long (no kiddin'), but I thought I'd rather present the details upfront than have you guess them. There is no "due date", which means that there is no need to rush or to drop the things you are currently doing :). I'd rather have you take your time with it, when you have the time and are in the mood for it, than rush over it without giving it much thought and dump it on the done pile.

Introduction and Summary

I am looking for Wikipedians who are interested in and knowledgeable about the issue of link spam at Wikipedia to express their opinion on some of my recommendations for reducing it, which are based on my research and my experiences with it due to my professional background. I believe that you are one of them and fit the "profile" perfectly :).

My Opinion and my Request to you

It seems to be an "old" and "done" subject; a vote about it was even conducted about 15 months ago. Everything I found out and collected about it makes it seem like an open issue rather than something that was settled for good. Too few facts were presented and not much (if any) quantifiable/measurable information was provided.
I would like you to go over the stuff I have collected and consolidated so far and provide your point of view on it. If you have already done so in the past, simply point me to it so that I can check it out.

I am also looking for some statistical information to be able to assess the real extent of the problem (and not just the perceived one) as well as its development over an extended period of time. If you already have anything like this or know how to get it, let me know. If you don't, but can point me to directions and/or people that can, let me know as well.

Tech-Stuff

It's really appreciated. You can get technical with me; I have the necessary background for it. You can check that on my User Page. I come from the Microsoft/IIS/SQL Server/VB/.NET environment, but I have a general understanding of the technology and the ideas behind it, which are mostly platform independent. I know basic PHP and recently installed the latest MediaWiki version 1.5.8 and MySQL Server for Windows version 5.0.19 on a Windows 2003 Server with IIS6 and the PHP5 extension. I can use this installation for tests or script development, the results of which might then be used at the live Wikipedia, probably scripts for data collection and assessment only. I do not intend to develop anything that changes processes and features of Wikipedia.org. If something that could be used in the future happens to come out of it, fine. I do not intend to write anything for myself; whatever comes out of it will be public domain (open source without any restriction on its use at all).

My Intentions and Goals

I wrote similar invitations on the Talk pages of other Wikipedians I came across, but this one is the most detailed version, explaining my intentions and the purpose of the whole thing at great length and depth. I would appreciate it if you would invite other interested Wikipedians who are authorities in this area to give their input as well. For now, I would like to keep those who know only a few details and have only general/common knowledge about this kind of stuff out of the discussion, to prevent it from getting dispersed right at the beginning and turned into a rhetorical discussion. Nothing will come out of it if one "belief" group only argues against another based on speculation and feelings rather than facts and solid numbers. An open-for-all discussion will have to happen at some point, but it should come later, when enough data and information are available to provide solid ground for a general discussion that at least has a chance of ending in actions that will benefit everybody at Wikipedia and its many users in the long run.

Sincerely --roy<sac> Talk! .oOo. 05:41, 20 April 2006 (UTC)
