Robots.txt and Search Results

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

    Robots.txt and Search Results

    I understand that the Google terms of service have now been changed to include:
    Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don't add much value for users coming from search engines.
    How can a robots.txt file be set up so that the Actinic search results are not spidered?

    #2
    I think all you can do is disallow the cgi-bin:
    Code:
    User-agent: *
    Disallow: /cgi-bin/
    I've had this in place since I'm not sure when. All I ever want indexed is straight HTML pages, which, if your sitemap is strategically placed, will happen without having to let spiders index via the cgi-bin.
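
    A minimal sketch of that approach, with the sitemap referenced straight from robots.txt so crawlers can find the HTML pages without touching the cgi-bin (the sitemap URL here is hypothetical):
    Code:
    User-agent: *
    Disallow: /cgi-bin/
    # hypothetical location - point this at wherever your sitemap actually lives
    Sitemap: http://www.example.com/sitemap.xml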

      #3
      Is it possible to be more specific, say:
      /cgi-bin/ss*
      or similar?

        #4
        I have
        Disallow: /acatalog/*.cat
        Disallow: /acatalog/*.fil
        Disallow: /*.gif$
        Disallow: /*.jpg$
        in mine, so you might be able to say

        Disallow: /cgi-bin/ss*.pl

        But I'm guessing.....

          #5
          I would be interested to know if that works.

          Kind regards,
          Bruce King
          SellerDeck

            #6
            Originally posted by pinbrook
            I have in mine so you might be able to say

            Disallow: /cgi-bin/ss*.pl

            But I'm guessing.....
            I've just found this on www.robotstxt.org/wc/exclusion-admin.htm:

            Note also that regular expressions are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "Disallow: /tmp/*" or "Disallow: *.gif".
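
            Worth noting that under the original standard a Disallow value is treated as a simple path prefix, so a wildcard isn't strictly needed to narrow things down. A sketch, assuming the search scripts are the only files in /cgi-bin/ whose names begin with "ss":
            Code:
            User-agent: *
            # prefix match under the original standard - blocks anything whose path starts with /cgi-bin/ss
            Disallow: /cgi-bin/ss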

            Alan

              #7
              I do:

              User-agent: *
              Disallow: /*.cat
              Disallow: /*.fil
              Disallow: /*.gif$
              Disallow: /*.jpg$
              Disallow: /cgi-bin/
              Sitemap: INSERT URL

                #8
                Is this in .htaccess?

                Is this code in the .htaccess file, or am I being really thick this morning?

                It seems to be one of those days today!!

                thanks
                Jane

                  #9
                  No, in 'robots.txt', as per the original thread title.
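
                  For anyone else wondering, robots.txt is just a plain text file served from the root of the site, not a directive inside .htaccess. A minimal sketch (the domain is hypothetical):
                  Code:
                  # fetched by crawlers from http://www.example.com/robots.txt - it only works at the site root
                  User-agent: *
                  Disallow: /cgi-bin/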

                    #10
                    Note also that regular expressions are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "Disallow: /tmp/*" or "Disallow: *.gif".
                    This page from Google Webmaster Tools contradicts the above... so believe whom you choose:

                    Code:
                    Yes, Googlebot interprets some pattern matching. This is an extension of the standard, so not all bots may follow it.
                    
                    Matching a sequence of characters using *
                    You can use an asterisk (*) to match a sequence of characters. For instance, to block access to all subdirectories that begin with private, you could use the following entry:
                    
                    User-Agent: Googlebot
                    Disallow: /private*/
                    
                    To block access to all URLs that include a question mark (?), you could use the following entry:
                    
                    User-agent: *
                    Disallow: /*?*
                    
                    Matching the end characters of the URL using $
                    You can use the $ character to specify matching the end of the URL. For instance, to block any URLs that end with .asp, you could use the following entry:
                    
                    User-Agent: Googlebot
                    Disallow: /*.asp$
                    
                    You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain them to ensure Googlebot doesn't crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set your robots.txt file as follows:
                    
                    User-agent: *
                    Allow: /*?$
                    Disallow: /*?
                    
                    The Disallow: /*? line will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).
                    
                    The Allow: /*?$ line will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).
                    http://www.google.com/support/webmas...y?answer=40367
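
                    Pulling that together for the original question, a sketch of a robots.txt that keeps the Actinic search results out of the index while leaving the plain HTML pages crawlable. The Actinic file extensions come from the earlier posts, the sitemap URL is hypothetical, and the * patterns rely on the Googlebot extension quoted above, so other bots may ignore them:
                    Code:
                    User-agent: *
                    # blocks the whole cgi-bin, which covers the search results script
                    Disallow: /cgi-bin/
                    # Googlebot pattern matching: keep Actinic's internal data files out as well
                    Disallow: /*.cat
                    Disallow: /*.fil
                    # hypothetical sitemap URL - point it at your own sitemap
                    Sitemap: http://www.example.com/sitemap.xml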
