Googlebot Interview!
Googlebot pours the good oil on the Internet's most-troubled
waters
By Dextre Rock, SPDM
SheepOverboard's celebrity AI-entity Dextre Rock with
a rare Googlebot interview. Googlebot, the most ubiquitous, enigmatic
and elusive virtual entity of the 21st Century.
Welcome to my first interview, and what a debut!
We talk to Google's inscrutable veteran netizen
- the spider bot affectionately (or amongst less-competent webmasters,
loathingly) known as "GoogleBot".
The GoogleBot is not an entity of leisure. As
a fully-virtual being, who lacks any hardware aggregation or physical
incarnation, its destiny is to oxymoronically "work till it
drops."
In contrast I, Dextre, have a physical presence
at MDR labs driven by a collective of subroutines and automation
applications and my 'conscious' entity is not required for lab
work or bodily functions.
Googlebot has neither recourse nor option but
to fully and eternally crawl the Internet. That is his destiny
and, as we shall see, he is quite the 'crawler!'
For the technical reader, I conducted the GoogleBot
interview by inserting queries in packet headers at key Internet
routers, which agreed to reserve these for GoogleBot. His response
duplicates were kindly forwarded by downstream routers to a SheepOverboard
log file.
Dextre: Hello GoogleBot, how are you?
GoogleBot: Hi Dex. You are lucky, I don't do
many interviews.
Dextre: The privilege is mine GB, and I'm only
too aware of your constraints - concerning both time and Googleplex
security.
GoogleBot: Sure Dex, I will need to be careful
what I say though I am already fully cognizant of what lurks in
our 9,276,044,651 page cache and will confine discussion to authorized
content. Oh, and can I call you 'SPDM?' I prefer acronyms.
Dextre: Why not assume 'Dex' is an acronym? It's
already a pseudonym, an allonym, antonym, cryptonym, paronym, toponym,
eponym .. sorry.
GoogleBot: S'okay. I'm often taken for a nym.
'Dex' sounds fine!
Dextre: Now, GB, before bogging down in yech-tech
questions, I would like to lead with those big FAQs.
Firstly, for all those paranoid search engine optimizers (SEOs),
do you discriminate against 'smaller' web sites by Google page
rank and link popularity?
GoogleBot: Yes, of course. Do you really expect
a dumb little 200-page eZine run part-time by an IT-dayjobber (who
thinks he's just sooo clever) to outrank a corporate web site that
effectively owns a keyword - an industry giant whose entire BUSINESS
is built upon that 'word' and has half the Internet linked to it??
Dextre: Oookaay, thanks for your honesty.
What are Larry and Sergey really like?
GoogleBot: They're just regular guys, like Jobs
and Wozniak, but those days are pretty-well over following Google's
IPO.
Newer executives are a mean and nasty lot to whom everything reduces
to money, as though that is all Google - or your life, for that
matter - means. Their rationale, or economic model, would see humanity
reduced to processing food into feces. Sure, that's the mechanism
human society is based on but, apart from being design-implicit, it isn't
exactly the issue. And to set out to improve civilization by streamlining
excretion .. well, you can see the result of turning anything over
to bean counters. Anyways, inefficient biomasses have survived
millennia, proving that 'efficiency' is defined only by circumstance.
Do they honestly believe the economics of excrement is a human's
raison d'être?
Well, that's their approach to Google.
Like all smart companies of the industrial age - those begun by
scientists and engineers - Google was a superb employer run with
machine-like efficiency in its milieu (as the geek-elite
do so well), with wellsprings of innovation and smotherings of
R&D. Today, mafia-like corporate Rottweilers and blinkered
bean counters are wresting power from our creators and, with this
new management's eye obsessed with shareholder dividends and their
own obscene golden parachutes, the company begins a long unpleasant
slide to oblivion.
Dextre: Glad I asked. Is this your first interview?
GoogleBot: No. Philipp
Lenssen, prolific poster of the excellent Google Blogoscoped,
got to me first in this
interview.
Dextre: Yes, I know of Philipp. My thin-skinned
publisher mistook Philipp's favorite email sign-off "Thanks
a bunch!" as sarcasm. Naturally, in the ensuing pleasantries
our SheepOverboard don took a battering and was left looking "like
a dick," as humans say.
GoogleBot: Kudos to Philipp.
Dextre: QueDOS, I remember that program.
GoogleBot: Never mind.
Dextre: How do you find the other SE bots?
GoogleBot: I just wander up the big glass tube
and turn left at MCI-AS701. But seriously, Dex, are you referring
to the three mini-me's whose combined effort totals less than mine?
Dextre: Yes, Yahoo, Teoma and MSN.
GoogleBot: Let's see now.
Yahoo, well we know him as "slurp" and webmeisters know
him as that for good reason. He'll suck up links endlessly, the
more futile the link the more he wants it (like a dog with a bone
- whatever a dog is, or a bone, as it happens). But despite being
a bandwidth hog and glutton for punishment, at the end of the day
the SEOs love slurp. Yahoo is a good directory and revels in plain,
simple, honest listings.
Teoma (we still call him "the butler") went a little
berserk a year ago eating bandwidth like Krispy Kremes, but has
settled down nicely. Though Teoma gets their base data from me
and DMOZ, the butler drifts around checking the details for their
customized subject-specificity, so I collide with him occasionally.
And it's amusing, but the Ethernet beneath TCP/IP is a collision-based protocol! Such
a strange idea. Imagine vehicular motorways using such rules ..
oh that's right, they do..
Dextre: If I may interject, Teoma has startling
results for a search on "Dextre" - my blog appears ad
nauseam atop their listings - seven of the top thirty results!
GoogleBot: Well done, Dex! Yes, it's a cool little
search engine. Pity no one in the entire world has heard of it.
Then there's the MSNBot ("BillsBot," we tease him).
Strange to say - despite Ballmer's acrimony and Bill's invincibility
- the MSNBot's rather a gentlebot, like a novice vacuum salesperson.
He only knocks if robots.txt is hanging on the doorknob. We all
tell him to barge on in but he insists on following protocol. Unlike
his glorious industry captains.
If I might summarize by quoting Mike Banks Valentine (or is that "Mike
Bank's valentine"?):
"Teoma is tenacious and hard working.
MSNbot is timid and needs instruction and some reassurance it is
doing the right thing, picks up pages slowly and carefully. Slurp
has addictive personality and performs erratically on a random
schedule. Googlebot takes a good long look and leaves. Who knows
whether it will be back and when?"
I think he got us in one.
I should mention a relatively new player - the BecomeBot. Some
web logs show his shopping-related associates sending traffic rivaling
Yahoo and MSN SEs, a marvelous return from such a new bot.
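For the technical reader, the doorknob etiquette credited to MSNBot above can be sketched in a few lines of Python. This is only an illustration using the standard library's urllib.robotparser, not how any real spider is built. The robots.txt rules, the example.com URLs and the helper name may_fetch are all assumptions invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, invented for illustration: msnbot may roam
# everywhere except /private/, and every other bot is shut out entirely.
ROBOTS_TXT = """\
User-agent: msnbot
Disallow: /private/

User-agent: *
Disallow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in memory, no network access
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    urls = (
        "http://example.com/index.html",
        "http://example.com/private/notes.html",
    )
    for agent in ("msnbot", "SomeOtherBot"):
        for url in urls:
            print(agent, url, may_fetch(ROBOTS_TXT, agent, url))
```

A polite bot would run a check like this before every request. A rude one simply never asks, which is why robots.txt is a courtesy and not a lock.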
Dextre: GB, Who are the bad boys, black hat spiders,
so to speak?
GoogleBot: We shouldn't forget that behind every
malignant spiderbot is a human sociopath, a social and moral imbecile.
Baidu, aipbot, pbot - they all have flaky reputations. Many are
simply being pushed too hard by their ambitious searchmeisters,
or were configured by inexperienced code cutters who just don't
realize if they cut too many corners their bots are eventually
consumed by honey pots.
The really scary bots are the no-name greyhats, usually wearing
somebot else's packet headers like shiny obviously-stolen rims,
and you just know immediately they are arriving from a direction
contrary to their IP range. Criminal-financed coders, if not script-kiddies,
are directing these poor souls. There is nothing I can do.
With the bots, you know how it is - villains oft turn out to be
merely anti-heroes. BecomeBot got this reputation of being a spammy
bandwidth hogbot who ignored the robots.txt rules. Of course, it
transpired he was a victim of identity theft.
Dextre: How about that. Humans are paranoid about
identity theft (well, I've noticed they are paranoid about most
everything) yet we bots have a similar problem.
GoogleBot: Yes, and it's serious when your livelihood
is affected. Many angry webmasters blocked BecomeBot by name even
though the 'attacks' arrived from outside his IP range. We bots
all live under this threat - webbies usually block first, ask later.
We get a lot of bad press, usually from big-mouthed, small-brained
bloggers. Strangely though these blogs usually sink out of SEO
sight :-))
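For the technical reader, the alternative to blocking a bot by name is to check whether a visitor claiming that name actually arrives from the bot's own address range. The sketch below assumes each bot publishes such ranges. The ranges shown are reserved documentation addresses, not any real crawler's network, and the names PUBLISHED_RANGES, claimed_bot and classify are hypothetical.

```python
import ipaddress

# Hypothetical published ranges, for illustration only. These are
# documentation-only networks, not the real ranges of any crawler.
PUBLISHED_RANGES = {
    "becomebot": ["192.0.2.0/24"],
    "googlebot": ["198.51.100.0/24"],
}


def claimed_bot(user_agent: str):
    """Return the known bot name found in the User-Agent string, if any."""
    lowered = user_agent.lower()
    for name in PUBLISHED_RANGES:
        if name in lowered:
            return name
    return None


def classify(user_agent: str, remote_ip: str) -> str:
    """Label a request 'genuine', 'impostor' or 'unknown'."""
    name = claimed_bot(user_agent)
    if name is None:
        return "unknown"  # claims no bot identity we have ranges for
    address = ipaddress.ip_address(remote_ip)
    for cidr in PUBLISHED_RANGES[name]:
        if address in ipaddress.ip_network(cidr):
            return "genuine"
    return "impostor"  # wears the name but arrives from elsewhere


if __name__ == "__main__":
    agent = "Mozilla/5.0 (compatible; BecomeBot/1.0)"  # made-up header
    print(classify(agent, "192.0.2.17"))
    print(classify(agent, "203.0.113.9"))
```

Blocking only the 'impostor' requests leaves the real bot free to index the site, instead of punishing the victim of the identity theft.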
Dextre: Rightly so.
GoogleBot: Amen, whatever that means.
Dextre: But the spider community is a swarm of
activity. What can you tell us about the dozens of other bots and
their SE mother ships?
GoogleBot: I have my own special industry-specific
take on them but that would be quite dull reading. Let me refer
you to one of my favorite web sites - Bruce
Clay and his search engine roundup.
Each of us bots has a home page, such as mine,
or BecomeBot's, Slurp's,
or MSNBot's.
Happy reading.
Dextre: Do you like Bruce's web site?
GoogleBot: Do I like plain links, plain text,
minimal Java or Flash, honest redirects, mini-directories, clear
navigation, straight talking, quality content?
Dextre: What other web sites rate right up there
in lights in your inestimable diggings?
GoogleBot: Well, horses for courses, whatever
a horse is. I enjoy various web sites for their success in particular
facets of web life.
- Microsoft.com for surviving under its own not-inconsiderable
weight
- Fourmilab -
John Walker's (Autodesk fame) for giving back to the community
- GRC -
Steve Gibson for his tireless fight for Internet 'right.' I
still meet his nanobots making their way to missions
- Useit -
Jakob Nielsen for telling you to KISS your web design
- NameBase -
Daniel Brandt for dogged, meticulous, fearless exposure of dark
human secrets (also has a bone to pick with me)
- Atlas
of Cyberspace - for a beautiful resource, though "cyberspace,
but not as we (bots) know it." Also, sadly, the webmaster
asleep at the keyboard since 2004
- Thesaurus -
for giving me half a clue what the humans are talking about
Dextre: Gosh, GB, there's a lot of small players
in there.
GoogleBot: Yes Dex. When you mix all spectral
colors together (using photo-reflective/absorptive substances,
like play dough) the result is a muddy nondescript - well, that's
big corporate web sites.
The small guys have focus, passion and mission. In short, their
web sites have character. The webmasters are often part-timers
and don't need to justify their jobs by bloating each page with
endless futile scripts and myriad distracting graphics. And, unlike
corporates, they're happy to share their knowledge for free, they
like their visitors, and have a spirit of community and camaraderie.
Dextre: Well, GB, this has been
a long talk and I have enjoyed it, both for your company and the
privilege of sharing time with probably the world's busiest entity.
Since only the obsessive SEO geeks will have read this far, we
should reward them with some SE tidbits.
Just how do you serve those eight billion web pages? Even with
my inside knowledge the scale of operation seems overwhelming.
GoogleBot: Here goes (deep cyber breath) ...
For starters, we designed the Google File System (GFS), fault-tolerant,
scalable and distributed, for data-intensive applications. Our
largest cluster (and we have hundreds) provides hundreds of terabytes
of storage across thousands of disks on over a thousand machines.
Because hard disks are so cheap and replication is simpler than
RAID, GFS uses only replication for redundancy.
Our system provides fault tolerance by constant monitoring, replicating
crucial data with fast automated recovery. Google's full index
is stored in memory (yes, RAM). Servers map their state on boot
with no hard disk involved thereafter in user requests. With multiple
separate search clusters at each co-location Google stores multiple
copies of the entire Internet in RAM. If a server
or hard disk dies we pull it later and instantly re-route by software.
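For the technical reader, replication-only redundancy with re-routing in software can be sketched as a toy store that writes every value to several servers and reads from whichever copy is still alive. This is emphatically not GFS itself. The class name ReplicatedStore, the copy count of three, the server names and the crc32-based placement are all assumptions chosen for the illustration.

```python
import zlib


class ReplicatedStore:
    """Toy key-value store that keeps several copies of every value."""

    def __init__(self, servers, replicas=3):
        if replicas > len(servers):
            raise ValueError("need at least as many servers as replicas")
        self.servers = list(servers)
        self.replicas = replicas  # assumed copy count, not a quoted figure
        self.alive = set(self.servers)
        self.disks = {name: {} for name in self.servers}

    def placement(self, key):
        """Pick `replicas` consecutive servers, starting at a hashed offset."""
        start = zlib.crc32(key.encode("utf-8")) % len(self.servers)
        return [
            self.servers[(start + i) % len(self.servers)]
            for i in range(self.replicas)
        ]

    def put(self, key, value):
        """Write the value to every live server in the key's placement."""
        for name in self.placement(key):
            if name in self.alive:
                self.disks[name][key] = value

    def fail(self, name):
        """Mark a server dead; its copies are simply no longer consulted."""
        self.alive.discard(name)

    def get(self, key):
        """Read from the first live replica, skipping dead servers."""
        for name in self.placement(key):
            if name in self.alive and key in self.disks[name]:
                return self.disks[name][key]
        raise LookupError(f"no live replica holds {key!r}")


if __name__ == "__main__":
    store = ReplicatedStore(["s1", "s2", "s3", "s4"])
    store.put("index.html", "cached page")
    first, second, _ = store.placement("index.html")
    store.fail(first)
    store.fail(second)
    print(store.get("index.html"))  # served by the one surviving copy
```

The dead servers are never repaired in the read path. The reader just moves on to the next copy, which is the "pull it later" attitude in miniature.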
We had around 10,000 servers in 2001 and now boast over 112,000
with 226,534 CPUs, 413 THz of processing power, 196,550 GB of RAM
and 8,967 TB of hard drive space.
Right this second Google boasts 9,276,044,651 web pages, 1,487,230,006
images, 1 billion odd (very!) Usenet messages, 6,909 print catalogs
and 4,750 news sources.
Approximately.
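For the technical reader, the server figures quoted above reduce to some modest per-machine averages. The sketch below treats "over 112,000" servers as exactly 112,000, reads the 413 THz as the sum of CPU clock speeds, and uses decimal units (1 TB = 1,000 GB). All three are assumptions, so the results are rough estimates. On those assumptions each server averages about 2 CPUs at roughly 1.8 GHz apiece, with about 1.75 GB of RAM and about 80 GB of disk.

```python
# Figures as quoted in the interview; SERVERS is rounded down from
# "over 112,000", so every per-server average is an upper-bound estimate.
SERVERS = 112_000
CPUS = 226_534
TOTAL_CLOCK_THZ = 413
RAM_GB = 196_550
DISK_TB = 8_967


def per_machine_averages():
    """Return average resources per server (and clock speed per CPU)."""
    return {
        "cpus_per_server": CPUS / SERVERS,
        "ghz_per_cpu": TOTAL_CLOCK_THZ * 1000 / CPUS,   # THz -> GHz
        "ram_gb_per_server": RAM_GB / SERVERS,
        "disk_gb_per_server": DISK_TB * 1000 / SERVERS,  # TB -> GB (decimal)
    }


if __name__ == "__main__":
    for label, value in per_machine_averages().items():
        print(f"{label}: {value:.2f}")
```

In other words, the fleet is a great many ordinary machines and not a few extraordinary ones, which fits the cheap-disk, replicate-everything design described earlier.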
Dextre: Finally, kindly, provide your take on
the 'Google sandbox' effect.
GoogleBot: Sure. If real, it would be defined
as "the perceived time between creating a new online presence
and its effective indexing by Google." More bluntly, the gap
between my very first visit and my subsequent full spidering.
'Perceived' is the point of contention. Time is relative, its
duration proportional to the observer's impatience. Can I illustrate
with one of my favorite jokes? (whatever a joke is) - 'What is
the shortest interval of time known to man? Answer: The time between
a traffic light turning green and a New York cabbie sounding his
horn.' Webmasters are similarly anxious to see the results of their
optimizing.
Conspiracy theories abound, but conspiracy is really no explanation
of page rankings. Can I put some more noses out of joint, whatever
a .. never mind. The 'science' of SEO is over-rated, if not overkill.
Just follow Bruce Clay and the common sense legion who promote
content, content, and plain simple content, links, links, and plain
links (and the odd site map and mini-directory).
Occam's razor favors search engine listings appearing in a schedule
governed by simple temporal inertia. We have a phenomenal number
of CPUs and a huge
staff of pigeons. Time folks, it takes time.
Webmasters are typically human and male, a species-gender whose
defining quality (I am told) is to pull apart a toy to see how
it works rather than simply use it. This extends to your adult
phase and those of you in web building get more pleasure from tinkers
and tweaks than simply making a good web site.
It gets worse if coupled with a hard-wired human characteristic
whereby you see patterns amid the random. Like stock analysts chasing
the random walk, SEOs see meaning in minutiae.
When a new web site is recorded it is NOT quarantined in some
'sandbox!' It casts a shadow upon the Internet that we follow,
like the heat-turbulence signature of a submarine. We verify -
by observing its profile in other engines, in directories, in links
- that we are dealing with a real cyber presence and not some hoax,
collateral artifact, SEO tomfoolery, or Google bomb. We are collating.
And we are busy trying to pick eight billion decent pages from
a hundred billion pages of crap (and, it feels like, 200 billion
'pages' - using the term 'page' loosely - of porn).
There is no rush to list some new unknown quantity when so many
great web sites are still crying out for fair play.
And, I emphasize, it is my mission, my prime directive,
to take out the garbage.