First of all, you should understand the basic
differences between a "spider" and an
"indexer".
Spider
Finds web pages and follows links. Page content
does not matter at this stage; it will be indexed
later. The spider won't crawl password-protected
sites, usually avoids dynamic content because of
"spider traps", and respects robots.txt and
meta tag exclusions.
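As a rough illustration of the robots.txt check described above, the following sketch uses Python's standard `urllib.robotparser` module. The robots.txt content, the `ExampleBot` user agent, and the example.com URLs are hypothetical, and a real spider will apply its own rules on top of this.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site (not from the article).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
print(may_crawl(parser, "ExampleBot", "http://www.example.com/index.html"))
print(may_crawl(parser, "ExampleBot", "http://www.example.com/private/a.html"))
```

Running a check like this against your own robots.txt is a quick way to confirm that you are not excluding pages you want indexed.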
Indexer
Primary job is to evaluate and score
content (text, meta tag info, anchor
text of incoming links, etc.). Removes
duplicates and aliases and filters spam.
When the indexer detects spam, it adds
the domain to its "do not crawl list".
Navigation Suggestions
If your navigation relies on JavaScript, graphics,
Flash, etc., make sure you also use text links
in the footer of your page for the navigation.
Spiders love hierarchies and users prefer
a search box. Have both. Always, always,
always link back to your home page.
The reason? So the spider can find the
rest of your site and lost visitors
can re-orient themselves easily.
Settings
Don't require cookies or session IDs to browse your site.
Doing so can harm your site's ability
to be indexed. Why? Since the spider
does not accept either, your content
management system may start to feed
the spider the same page over and over
again with different session IDs. This
causes a "spider trap": an endless loop that
eventually times out, after which the spider
backs out. A strong warning sign is one engine
showing a page saturation that is out of
proportion to the others. That is usually a sign
that a "spider trap" has occurred. If it isn't
corrected, you risk having your site dropped by
that search engine within the next 90 days due
to "indexing bloat".
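To see why session IDs inflate the number of URLs a spider encounters, here is a minimal sketch that strips session parameters from URLs and counts how many distinct pages remain. The parameter names in `SESSION_PARAMS` and the example URLs are assumptions for illustration; your content management system may use different names.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed session parameter names; adjust to match your own system.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def strip_session_id(url: str) -> str:
    """Return the URL with any session-ID query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


def count_distinct_pages(urls) -> int:
    """Count the pages left once session IDs are ignored."""
    return len({strip_session_id(url) for url in urls})


# Three URLs that a spider would see as different, but that are one page.
crawled = [
    "http://www.example.com/catalog?item=42&sid=abc123",
    "http://www.example.com/catalog?item=42&sid=def456",
    "http://www.example.com/catalog?item=42&sid=ghi789",
]
print(count_distinct_pages(crawled))  # 1
```

If a list of crawled URLs from your server logs collapses to far fewer pages after this kind of normalization, that suggests the spider is being fed the same content under many addresses.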
If you use a 404 error page, use only one generic
page designed for that purpose, but do not, for
any reason, redirect the 404 to your home page.
Ensure that the title of this page is "Error 404".
This will alert the spider, and it will move on
without indexing the error page.
If you have moved your site or moved content,
use a permanent 301 redirect to point to the
new content/domain. This informs the spider,
which will make note of the change and crawl the
site properly on its next visit. Doing so also
transfers all of the link credit from the old site
to the new one. User bookmarks and all old links
still function. 301 redirects can be left in place
for long periods of time, which is why they are
known as "permanent redirects".
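The 404 and 301 advice above can be sketched as a small routing function. This is only an illustration under assumed data: the `PAGES` and `MOVED` tables, the paths, and the HTML are hypothetical, and a real site would configure this in its web server or framework.

```python
from http import HTTPStatus

# Hypothetical site content and moved-page table for illustration.
PAGES = {
    "/": "<html><head><title>My Books</title></head><body>Home</body></html>",
    "/catalog.html": "<html><head><title>Catalog</title></head><body>Books</body></html>",
}
MOVED = {"/old-catalog.html": "/catalog.html"}

# One generic error page, titled "Error 404", with a link back home.
ERROR_PAGE = (
    "<html><head><title>Error 404</title></head>"
    '<body>Page not found. <a href="/">Home</a></body></html>'
)


def respond(path: str):
    """Return (status, headers, body) for a requested path."""
    if path in MOVED:
        # Permanent redirect: tells the spider where the content lives now.
        return HTTPStatus.MOVED_PERMANENTLY, {"Location": MOVED[path]}, ""
    if path in PAGES:
        return HTTPStatus.OK, {"Content-Type": "text/html"}, PAGES[path]
    # Serve a real 404 status instead of redirecting to the home page.
    return HTTPStatus.NOT_FOUND, {"Content-Type": "text/html"}, ERROR_PAGE


status, headers, body = respond("/old-catalog.html")
print(int(status), headers["Location"])  # 301 /catalog.html
```

The key design point is that a missing page returns a 404 status with the generic error page, while moved content returns a 301 with the new location.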
Avoid Excessive URL Depth
Having a deep site structure decreases the chance
that the spider will find all of your
pages. Very deep URLs tend not to rank
as well.
www.mybooks.com/order-of-the-phoenix.html
This would be classified as Depth One. That does
NOT mean all pages should sit on the root of the
domain; a depth of two or three levels would be
acceptable.
www.mybooks.com/uk/fiction/childrens/jkrawlings/harry-potter/order-of-the-phoenix
This is six levels and probably would NOT
be crawled.
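The depth counts above can be reproduced with a short helper that counts the path segments in a URL. This is a sketch of one reasonable way to measure depth, matching the two examples in this section; the function name `url_depth` is made up for illustration, and search engines may count depth differently.

```python
from urllib.parse import urlsplit


def url_depth(url: str) -> int:
    """Count the non-empty path segments in a URL."""
    if "://" not in url:
        # Add a scheme so the host is not mistaken for part of the path.
        url = "http://" + url
    path = urlsplit(url).path
    return len([segment for segment in path.split("/") if segment])


print(url_depth("www.mybooks.com/order-of-the-phoenix.html"))  # 1
print(url_depth(
    "www.mybooks.com/uk/fiction/childrens/jkrawlings/harry-potter/order-of-the-phoenix"
))  # 6
```

Running this over a list of your own URLs makes it easy to spot pages that sit deeper than two or three levels.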
Database-Driven Sites
Static URLs get crawled, and dynamic
pages that have incoming links from
static pages will get crawled. However,
links between dynamic pages are often
problematic and sometimes do not get
crawled. Limit the "URL Depth" when
using a dynamic-to-static internal linking
strategy. It is suggested to use the
"Trusted Feed" program for Yahoo! Search.
I highly recommend Evelyn Hepner at Position
Technologies. Your site will need
a minimum of 250 indexable pages. While
you do have to pay for every click,
using the XML feed could be more
cost-effective than changing the entire
architecture of your site.
Index-Friendly Pages
It is vital that your site has unique
content. The titles that you use should
be page specific, meaning that the title
of each page should be unique to the
content on that page. This is also true
with your meta tags, specifically with
the description and keywords. If you
update the page's content in the future,
review your title, description and keyword
tags to ensure they are still relevant
to the content. Only separate pages
when there is separate content. Yahoo!
would rather see one long page than
five small pages.
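One way to check the "unique title per page" advice is to extract each page's title and group pages that share one. The sketch below uses Python's standard `html.parser`; the page paths and HTML are hypothetical, and it compares titles case-insensitively as an assumption.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html: str) -> str:
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


def duplicate_titles(pages: dict) -> dict:
    """Map each title used more than once to the pages that use it."""
    by_title = defaultdict(list)
    for url, html in pages.items():
        by_title[extract_title(html).lower()].append(url)
    return {t: sorted(urls) for t, urls in by_title.items() if len(urls) > 1}


# Hypothetical pages: two of them share the same generic title.
pages = {
    "/a.html": "<html><head><title>My Books</title></head></html>",
    "/b.html": "<html><head><title>My Books</title></head></html>",
    "/c.html": "<html><head><title>Order of the Phoenix</title></head></html>",
}
print(duplicate_titles(pages))
```

Any title that appears in the result is a candidate for rewriting so that it describes only the content on its own page.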
You should avoid "spam" at all costs. This
includes using "doorway pages" and "doorway
domains". Keyword stuffing is another
area that should be avoided. Hidden
text (text that is the same color as
the background), hidden links, and even
deceptive CSS can be detected by the
indexer and viewed as spam. Link Farms,
massive domain interlinking, which includes
off-topic links (which tend to dilute
valuable links), and cloaking are also
areas to avoid.
Report "Spam" to Yahoo!
To report spam, send an email with as
much information as possible (the keywords
used in the search, the offending URL, and
why it is considered spam) to: reportsearchspam@yahoo-inc.com.
Review Yahoo! Content Guidelines
It is highly recommended that you review
the content guidelines from Yahoo! at least
once per quarter to stay up to date.
Designing in Flash
Yahoo! does not currently crack open
Flash files to either follow links or
extract textual elements. Even with
the SDK from Macromedia, extracting
text from SWF files provided little,
if any, value. It was found that content
providers weren't optimizing the content
for the search engines.
Test Your Site
You can use programs like Anawave's
WebSnake to mimic a "crawl" through
your website to determine if a spider
could deep crawl your site.
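To give a feel for what such a test crawl does, here is a minimal sketch that follows text links through an in-memory copy of a site and reports pages that no link reaches. It is not how WebSnake or any real spider works internally; the `site` dictionary, its paths, and its HTML are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect href values from plain <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(site: dict, start: str = "/") -> set:
    """Breadth-first walk over internal links, starting at the home page."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        parser = LinkParser()
        parser.feed(site.get(page, ""))
        for link in parser.links:
            if link in site and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


def unreachable(site: dict, start: str = "/") -> list:
    """List pages that exist but are never linked from a crawled page."""
    return sorted(set(site) - crawl(site, start))


# Hypothetical three-page site; one page has no incoming text link.
site = {
    "/": '<a href="/fiction.html">Fiction</a>',
    "/fiction.html": '<a href="/">Home</a>',
    "/orphan.html": '<a href="/">Home</a>',
}
print(unreachable(site))  # ['/orphan.html']
```

Pages reported as unreachable are ones a spider following text links from the home page would likely miss.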
