Robots.txt and Search Results

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

    Robots.txt and Search Results

    I understand that the Google terms of service have now been changed to include:
    Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don't add much value for users coming from search engines.
    How can a robots.txt file be set up so that the Actinic search results are not spidered?

    #2
    I think all you can do is disallow the cgi-bin:
    Code:
    User-agent: *
    Disallow: /cgi-bin/
    I've had this in place since I'm not sure when. All I ever want indexed is straight HTML pages, which, if your sitemap is strategically placed, will happen without having to let spiders index via the cgi-bin.
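
    A minimal sketch of that approach, with the sitemap referenced straight from robots.txt so crawlers can find the HTML pages without touching the cgi-bin (the sitemap URL here is hypothetical):
    Code:
    User-agent: *
    Disallow: /cgi-bin/
    # hypothetical location - point this at wherever your sitemap actually lives
    Sitemap: http://www.example.com/sitemap.xml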

      #3
      Is it possible to be more specific, say:
      /cgi-bin/ss*
      or similar?

        #4
        I have
        Disallow: /acatalog/*.cat
        Disallow: /acatalog/*.fil
        Disallow: /*.gif$
        Disallow: /*.jpg$
        in mine, so you might be able to say

        Disallow: /cgi-bin/ss*.pl

        But I'm guessing.....

          #5
          I would be interested to know if that works.

          Kind regards,
          Bruce King
          SellerDeck

            #6
            Originally posted by pinbrook
            I have in mine so you might be able to say

            Disallow: /cgi-bin/ss*.pl

            But I'm guessing.....
            I've just found this on www.robotstxt.org/wc/exclusion-admin.htm:

            Note also that regular expressions are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "Disallow: /tmp/*" or "Disallow: *.gif".
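
            Worth noting that under the original standard a Disallow value is treated as a simple path prefix, so a wildcard isn't strictly needed to narrow things down. A sketch, assuming the search scripts are the only files in /cgi-bin/ whose names begin with "ss":
            Code:
            User-agent: *
            # prefix match under the original standard - blocks anything whose path starts with /cgi-bin/ss
            Disallow: /cgi-bin/ss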

            Alan

              #7
              I do:

              User-agent: *
              Disallow: /*.cat
              Disallow: /*.fil
              Disallow: /*.gif$
              Disallow: /*.jpg$
              Disallow: /cgi-bin/
              Sitemap: INSERT URL

                #8
                Is this in .htaccess?

                Is this code in the .htaccess file, or am I being really thick this morning?

                It seems to be one of those days today!!

                thanks
                Jane

                  #9
                  No, in 'robots.txt', as per the original thread title.
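
                  For anyone else wondering, robots.txt is just a plain text file served from the root of the site, not a directive inside .htaccess. A minimal sketch (the domain is hypothetical):
                  Code:
                  # fetched by crawlers from http://www.example.com/robots.txt - it only works at the site root
                  User-agent: *
                  Disallow: /cgi-bin/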

                    #10
                    Note also that regular expressions are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "Disallow: /tmp/*" or "Disallow: *.gif".
                    This page from Google Webmaster Tools contradicts the above... so believe whom you choose:

                    Code:
                    Yes, Googlebot interprets some pattern matching. This is an extension of the standard, so not all bots may follow it.
                    
                    Matching a sequence of characters using *
                    You can use an asterisk (*) to match a sequence of characters. For instance, to block access to all subdirectories that begin with private, you could use the following entry:
                    
                    User-Agent: Googlebot
                    Disallow: /private*/
                    
                    To block access to all URLs that include a question mark (?), you could use the following entry:
                    
                    User-agent: *
                    Disallow: /*?*
                    
                    Matching the end characters of the URL using $
                    You can use the $ character to specify matching the end of the URL. For instance, to block any URLs that end with .asp, you could use the following entry:
                    
                    User-Agent: Googlebot
                    Disallow: /*.asp$
                    
                    You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain them to ensure Googlebot doesn't crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set your robots.txt file as follows:
                    
                    User-agent: *
                    Allow: /*?$
                    Disallow: /*?
                    
                    The Disallow: /*? line will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).
                    
                    The Allow: /*?$ line will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).
                    http://www.google.com/support/webmas...y?answer=40367
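
                    Pulling that together for the original question, a sketch of a robots.txt that keeps the Actinic search results out of the index while leaving the plain HTML pages crawlable. The Actinic file extensions come from the earlier posts, the sitemap URL is hypothetical, and the * patterns rely on the Googlebot extension quoted above, so other bots may ignore them:
                    Code:
                    User-agent: *
                    # blocks the whole cgi-bin, which covers the search results script
                    Disallow: /cgi-bin/
                    # Googlebot pattern matching: keep Actinic's internal data files out as well
                    Disallow: /*.cat
                    Disallow: /*.fil
                    # hypothetical sitemap URL - point it at your own sitemap
                    Sitemap: http://www.example.com/sitemap.xml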
