Inktomi, otherwise known as "Slurp", is the search bot for Yahoo! It finds pages and follows links. The quality of page content does not matter to Slurp; it just crawls pages. Content quality is evaluated later, when the algorithm kicks in. Slurp won't crawl password-protected sites, usually avoids dynamic content due to "spider traps", and respects robots.txt and meta tag exclusions.
Spider
Finds pages, follows links, and the page content does not matter. It won't crawl password protected sites, usually avoids dynamic content due to "spider traps", and respects robots.txt and meta tag exclusions.
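As a rough illustration of how a robots.txt exclusion is interpreted, the sketch below uses Python's standard `urllib.robotparser`. The robots.txt content, the `www.example.com` URLs, and the helper name `allowed_for` are made up for this example; only the "Slurp" user-agent name comes from the text above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site (not taken from any real site).
ROBOTS_TXT = """\
User-agent: Slurp
Disallow: /private/

User-agent: *
Disallow:
"""

def allowed_for(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given user agent may fetch the URL under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    print(allowed_for("Slurp", "http://www.example.com/private/report.html"))
    print(allowed_for("Slurp", "http://www.example.com/index.html"))
```

A well-behaved spider performs a check like this before requesting each page, so anything under a disallowed path is skipped for that spider.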
Indexer
Primary job is to evaluate and score content (text, meta tag info, anchor text of incoming links, etc.). Removes duplicates and aliases and filters spam. When the indexer detects spam, it adds the domain to its "do not crawl list".
Navigation
Suggestions
If your navigation is JavaScript, graphics, Flash, etc. make sure you use text links in the footer of your page for the navigation. Spiders love hierarchies and users prefer a search box. Have both. Always, always, always link back to your home page. The reason? So the spider can find the rest of your site and lost visitors can re-orient themselves.
Settings
Don't require cookies or session IDs. Doing so can harm your site's ability to be indexed. Why? Since the spider does not accept either, your content management system may start to feed the spider the same page over and over again with different session IDs. This causes a "spider trap": an endless loop for the spider, which eventually times out and backs out. A strong warning sign is one engine showing a page saturation that is out of proportion to the others; that is a red flag that a spider trap occurred. You risk having your site dropped by that search engine within the next 90 days due to "indexing bloat".
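One way to spot this problem in your own crawl logs is to strip the session parameter from each URL and see which pages collapse into the same address. The sketch below is only an illustration: the parameter names in `SESSION_PARAMS` are assumptions (your CMS may use something else), and both function names are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed session parameter names; adjust to whatever your CMS actually uses.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}

def strip_session_id(url: str) -> str:
    """Return the URL with any session-ID query parameters removed."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

def find_session_duplicates(urls):
    """Group URLs that differ only by session ID; keep groups larger than one."""
    groups = {}
    for url in urls:
        groups.setdefault(strip_session_id(url), []).append(url)
    return {canonical: dupes for canonical, dupes in groups.items() if len(dupes) > 1}
```

If one canonical address shows up with many different session IDs in a spider's request log, that spider is probably being fed the same page repeatedly.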
If you use a 404 error page, use only one generic page designed for that purpose, but do not, for any reason, redirect the 404 to your home page. Ensure that the title of this page is "Error 404"; this will alert the spider, and it will move on without indexing.
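A quick self-check for this advice might look like the sketch below. It assumes you have already requested a URL that does not exist on your site and captured the status code, response headers, and body; the function name `audit_missing_page` and the exact wording of the messages are hypothetical.

```python
import re

REDIRECT_CODES = (301, 302, 303, 307, 308)

def audit_missing_page(status: int, headers: dict, body: str) -> list:
    """Return a list of problems found in the response for a nonexistent URL."""
    problems = []
    # Assumes header names are given in their usual capitalization.
    if status in REDIRECT_CODES or "Location" in headers:
        problems.append("missing page redirects instead of returning 404")
    elif status != 404:
        problems.append(f"expected status 404, got {status}")
    match = re.search(r"<title>(.*?)</title>", body, re.IGNORECASE | re.DOTALL)
    title = match.group(1).strip() if match else ""
    if title != "Error 404":
        problems.append(f"title should be 'Error 404', got {title!r}")
    return problems
```

An empty list means the error page follows both suggestions: it answers with a real 404 and carries the expected title.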
If you have moved your site or moved content, use a permanent 301 to point to the new content/domain. This will inform the spider and it will make note and crawl the site properly on its next visit. Doing so also has benefits as it will transfer all of the link credit from the old site to the new one. User bookmarks still function, as well as all old links. 301s can be left for long periods of time.
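How you configure a 301 depends on your web server, so the following is only a sketch of the idea as a tiny WSGI application. The `MOVED` mapping, its paths, and the `www.example.com` target are invented for illustration.

```python
# Hypothetical mapping of old paths to their new permanent locations.
MOVED = {
    "/old-catalog.html": "http://www.example.com/catalog/",
}

def app(environ, start_response):
    """Minimal WSGI app: answer moved paths with a permanent (301) redirect."""
    path = environ.get("PATH_INFO", "/")
    target = MOVED.get(path)
    if target is not None:
        # 301 tells spiders and browsers that the move is permanent.
        start_response("301 Moved Permanently", [("Location", target)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"current content"]
```

Because the status is 301 rather than a temporary redirect, old bookmarks and old inbound links keep working while pointing visitors and spiders at the new address.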
Avoid
Excessive URL Depth
Having a deep site decreases the chance that the spider will find all of your pages. Very deep URLs tend not to rank as well, and make it difficult for visitors to email to others.
www.mybooks.com/order-of-the-phoenix.html
This would be classified as Depth One. It is NOT suggested here to have all pages on the root of the domain. Having depth of two or three levels would be acceptable.
www.mybooks.com/uk/fiction/childrens/jkrawlings/harry-potter/order-of-the-phoenix
This is six levels and probably would NOT be crawled.
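The depth counts used above can be reproduced by counting the path segments of a URL. This is a minimal sketch; the function name `url_depth` is made up, and it assumes that depth simply means the number of non-empty path segments.

```python
from urllib.parse import urlsplit

def url_depth(url: str) -> int:
    """Count the non-empty path segments of a URL (assumed meaning of 'depth')."""
    if "://" not in url:
        # urlsplit only recognizes the host when a scheme is present.
        url = "http://" + url
    path = urlsplit(url).path
    return len([segment for segment in path.split("/") if segment])

if __name__ == "__main__":
    for url in (
        "www.mybooks.com/order-of-the-phoenix.html",
        "www.mybooks.com/uk/fiction/childrens/jkrawlings/harry-potter/order-of-the-phoenix",
    ):
        print(url_depth(url), url)
```

Running a check like this over a list of your site's URLs is a quick way to flag pages that sit deeper than two or three levels.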
Database-Driven Sites
Static URLs get crawled, and dynamic pages that have incoming links from static pages will get crawled. However, links between dynamic pages are often problematic and sometimes do not get crawled. Limit the "URL Depth" when using a dynamic-to-static internal linking strategy. It is suggested to use the "Trusted Feed" program for Yahoo! Search. I highly recommend Evelyn Hepner at Position Technologies. Your site will need a minimum of 250 indexable pages. While you do have to pay for every click, having the XML feed could be more cost-effective than changing the entire architecture of your site.
Index Friendly Pages
It is vital that your site has unique content. The titles that you use should be page specific, meaning that the title of each page should be unique to the content on that page. This is also true with your meta tags, specifically with the description and keywords. If you update the page's content in the future, review your title, description and keyword tags to ensure they are still relevant to the content. Only separate pages when there is separate content. Yahoo! would rather see one long page than five small pages.
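To check the "unique title per page" rule across a site, you could collect each page's title and look for repeats. The sketch below assumes you already have the HTML of each page in a dictionary keyed by path; the class and function names are hypothetical, and titles are compared case-insensitively as a simplifying assumption.

```python
from collections import defaultdict
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text inside the <title> element of one page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def find_duplicate_titles(pages: dict) -> dict:
    """Map each repeated title (lowercased) to the list of pages using it."""
    by_title = defaultdict(list)
    for path, html in pages.items():
        parser = TitleParser()
        parser.feed(html)
        parser.close()
        by_title[parser.title.strip().lower()].append(path)
    return {title: paths for title, paths in by_title.items() if len(paths) > 1}
```

The same approach could be extended to the description and keywords meta tags, which the text above says should also be page specific.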
You should avoid "spam" at all costs. This includes using "doorway pages" and "doorway domains". Keyword stuffing is another area that should be avoided. Hidden text (text that is the same color as the background), hidden links, and even deceptive CSS can be detected by the indexer and viewed as spam. Link farms, massive domain interlinking, which includes off-topic links (which tend to dilute valuable links), and cloaking are also areas to avoid.
Report "Spam" to Yahoo!
To report spam, send an email with as much information as possible (the keywords used in the search, the offending URL, and why it is considered spam) to: reportsearchspam@yahoo-inc.com.
Review Yahoo! Content Guidelines
It is highly recommended that you review the content guidelines from Yahoo! at least once per quarter to stay up to date.
Designing in Flash
Yahoo! does not currently crack open Flash files to either follow links or extract textual elements. Even with the SDK from Macromedia, extracting text from SWF files provided little, if any, value. It was found that content providers weren't optimizing the content for the search engines.
Test Your Site
You can use programs like Anawave's WebSnake to mimic a "crawl" through your website to determine if a spider could deep crawl your site.
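If you want a feel for what such a test does, the sketch below walks an in-memory copy of a site breadth-first, following only plain `<a href>` links, and reports pages it never reaches. It is not how WebSnake or any real spider works internally; the `site` dictionary format, the function names, and the `max_depth` cutoff (measured in clicks from the start page) are all assumptions made for illustration.

```python
from collections import deque
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect href values from plain <a> tags, the links a spider can follow."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(site: dict, start: str = "/", max_depth: int = 3) -> dict:
    """Breadth-first walk of site; returns {path: clicks from start}."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if seen[page] >= max_depth:
            continue  # do not follow links beyond the depth limit
        parser = LinkParser()
        parser.feed(site.get(page, ""))
        for href in parser.links:
            if href in site and href not in seen:
                seen[href] = seen[page] + 1
                queue.append(href)
    return seen

def unreachable(site: dict, start: str = "/", max_depth: int = 3) -> list:
    """Pages in site that the simulated crawl never reached."""
    return sorted(set(site) - set(crawl(site, start, max_depth)))
```

Pages that show up in `unreachable` are either orphaned (nothing links to them with a text link) or buried too many clicks from the start page, which are the situations a deep-crawl test is meant to reveal.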
