
8th February 2003

HTTP FUNCTIONS AND GOOGLE SEARCHING

Brian Grainger

brian@grainger1.freeserve.co.uk



There are times when I get exasperated by the Internet and wonder if the whole network will fall down around us. There are other times when I cannot praise it highly enough. This article followed from an occurrence of exasperation, which was quickly followed by a period of investigation using Google, and then by elation when the recovery technique, based on the information found, proved successful. The article discusses some functionality of HTTP and also how one might quickly home in on the information required by use of a judicious search technique.

I guess most of my exasperation with the Internet is caused by one of two happenings. In the first instance, I receive too much unwanted guff around the core information that I actually want to view. Spam e-mail is the main source of this, but web pages prettied up with endless GIFs, JPEGs and, worse, Macromedia Flash items also contribute. The second instance is when I have spent a session downloading pages to view at my leisure offline, only to find, when I try to view them offline, that the pages are not in the cache. Usually it is just one or two pages that are lost, except with Netscape, where everything seems to be lost when you exit Navigator/Communicator. Because I pay for my Internet access on a 'pay as you go' basis, offline reading is of primary importance to keep costs down. However, this is not the only purpose of offline viewing. If I want to copy information for use elsewhere, it is handy to have an offline copy, rather than copying everything viewed online just in case it might be needed.

All sorts of web design strategies seem to be used by (usually business) information providers that compromise my ability to view offline.

I use Silicon.com to get computer news. Until recently, they have wanted to include lots of useless adverts around the small amount of text that I really want. Consequently, I devised a strategy of stopping the download when I think the text has arrived and only adverts are still coming down. This usually works, but sometimes the pages are not in the cache - is it because of my strategy?

Problems also occur with sites that use frames. The pages of the frame set are quite often not cached, so I cannot view offline. Sometimes, if I follow the exact same path to the pages that I did when online, they magically appear.

Imagine my consternation recently when, after having had a good look round one site, I found every single page was unavailable when viewing offline. All that time had been wasted and I was not going to waste more by viewing everything online. However, it did get me thinking that it cannot be accidental that ALL pages of a site were unavailable offline. It looked like a deliberate design ploy to avoid caching pages - something that I did not know was possible. I was determined to find out more and, as is normal now, see what information I could find on the web to fill this gap in my knowledge.

The biggest problem that people, especially novices, have with the web is finding information amongst the millions of pages out there. AOL and the like gain their revenue by making it 'simple' for their customers. They do this by making it simple for customers to access the pages in AOL's favour, such as those of companies who have given AOL money to put their pages in the AOL channels. This technique is no good for web research, where a good search technique is needed. How does one gain such a technique? I don't know, although I do seem to have a knack of finding things I want quickly. Here is what I did in this case.

I tried to frame a question, in natural language, of what I wanted. It came out as:

How do I create web pages that are not cached?

Now the first tip for searching is to use jargon, if you know it! I guess this is one reason novices fail. The word cache is crucial in the above question. Would 'able to be viewed offline' have the same effect?

Having framed the question, determine what the essential items are: what cannot be removed without changing the question completely, and what cannot be removed without greatly widening the range of possibilities the resulting question will cover. In this example I considered 'web pages' and 'not cached' to be crucial to the question. However, these two items alone would also cover software bugs or set-up problems, such as having too small a cache size, and I did not wish the data I wanted to be swamped by irrelevant answers. I therefore deemed 'create' to be an essential word as well.

Having got the essential words follow these guidelines:

  • Always use the singular rather than the plural (if an 's' is added to form the plural, both forms will be returned by searching for the singular).
  • Similarly, use present tense rather than past. ('not cache' will cover 'not cached' as well).
  • Avoid the use of abbreviations. They tend to pick up all sorts of things you would not believe possible.
  • Know your search engine. Know how to tell it to search for exact phrases, such as "not cache", rather than 'not' and 'cache'. Know how to search for ALL the words and phrases, rather than any one of them. Google makes this pretty easy on the advanced search page, and if you look at the query Google shows in the search box of the results page, you will soon see how to do it without going to the advanced search page.

In my research, therefore, I asked Google to search for: create "web page" "not cache"

It is interesting to note that the addition of the word 'create' reduced the number of hits from 17,600 to 1570, which confirmed my feeling on its importance.
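As an aside, the exact-phrase quoting survives into the URL that the search engine builds for the query. Here is a small illustrative Python sketch (mine, not the article's; the `q` parameter is an assumption about Google's query URL) showing how the quotes and spaces are encoded:

```python
from urllib.parse import urlencode

# The query used in the article: create "web page" "not cache"
# Double quotes mark exact phrases; urlencode percent-escapes them.
terms = 'create "web page" "not cache"'
query_string = urlencode({"q": terms})
print(query_string)  # q=create+%22web+page%22+%22not+cache%22
```

The quotes become %22 and the spaces become +, so the phrase grouping is preserved all the way to the server.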

Most of the items that were found on the first two pages were relevant and provided various pieces of information. Item 13, which pointed to the 'Most Valuable Professionals' site, provided a page with various links into the Microsoft Knowledge Base. One of these links:

http://support.microsoft.com/default.aspx?scid=kb;EN-US;Q234067

provided the most complete answer to the question, all in the one web page.

It is fair to say that my research had shown some valid uses for not caching pages, especially when the information on the pages changes rapidly. In this case you would want the user to collect the latest information from the web server, rather than what was in their cache, which may be out of date. In saying this I've just realised I should have done this with the ICPUG front page and What's New page! However, some responses to my query also said it is not a good idea to apply the no-cache rule to a whole site! In any event, what does Microsoft say about how to avoid pages being cached?

A couple of ways to do this are enacted on the server side, via an Active Server Page (.asp), and one way is enacted on the client side, using HTML meta tags.

You can use .asp pages to define commands that are sent via HTTP from server to browser. I used to think that all that was sent between server and browser was the HTML page, but the medium used to send the page, the Hypertext Transfer Protocol (HTTP), also sends header information to the browser.
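To illustrate the point (this is my sketch, not from the article), a raw HTTP response carries header lines before the HTML body, and the caching directives travel in those header lines:

```python
# A minimal, hypothetical HTTP 1.1 response: header lines first,
# then a blank line, then the HTML page itself.
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Expires: -1\r\n"
    "\r\n"
    "<HTML><BODY>Hello</BODY></HTML>"
)

# The blank line separates the headers from the body.
headers, body = raw_response.split("\r\n\r\n", 1)
print(headers.splitlines()[2])  # Expires: -1
print(body)                     # <HTML><BODY>Hello</BODY></HTML>
```

The browser renders only the body; the header lines, including any caching directives, are consumed behind the scenes.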

The first thing that can be done to force the browser to get the page from the server is to put an expiry time on the page. In this case the page is still cached, but if the page has expired the browser will not take it from the cache to view it. The Expires header, sent via the .asp page, performs this task. In the special case of the Expires header set to -1, the browser will always go back to the server to get the page required. In all these cases you could still view the cached page in offline viewing mode. The web site I was trying to access was clearly not using this method.

The second thing that can be done is to tell the browser NOT to cache the page by setting the Cache-Control header. This is also done via the .asp page. If the header is set to 'no-cache' then the page will not be stored in the cache at all. Because of this, Microsoft says the Cache-Control header should be used sparingly and that setting the Expires header to -1 is preferred.

The Cache-Control header was introduced with HTTP 1.1, which means that servers using HTTP 1.0 will not be able to use this technique. For this reason Internet Explorer supported a Pragma: no-cache header. This works exactly like the Cache-Control header, but only when communicating over a secure connection (https://). This was really a workaround, as the HTTP specification does not define this usage for the Pragma: no-cache header; it defines it as a means of telling proxy servers not to prevent important requests from reaching the destination web server!
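The three header techniques described above can be summarised in a short Python sketch. The header names and values are the real ones from the discussion; the function itself is only my illustration, not Microsoft's code:

```python
def no_cache_headers(strategy: str) -> dict:
    """Response headers for each of the three techniques described above."""
    if strategy == "expire":
        # Page is cached but marked as already expired: the browser
        # revalidates when online, yet the copy stays viewable offline.
        return {"Expires": "-1"}
    if strategy == "cache-control":
        # HTTP 1.1 only: the page is never stored in the cache at all.
        return {"Cache-Control": "no-cache"}
    if strategy == "pragma":
        # Pre-HTTP 1.1 Internet Explorer workaround; honoured only
        # over secure (https) connections.
        return {"Pragma": "no-cache"}
    raise ValueError("unknown strategy: " + strategy)

# Microsoft's stated preference is the expiry approach:
print(no_cache_headers("expire"))  # {'Expires': '-1'}
```

The key practical difference is that only the Cache-Control approach keeps the page out of the cache entirely, which is why it defeats offline viewing.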

The final method of telling the browser to act on web pages in a certain way is via meta tags in the HTML page itself. I am not sure if this method is restricted to Microsoft browsers but I suspect it is. Here is an example of their use, in the HEAD section of a web page.

<HTML>

<HEAD>
<META HTTP-EQUIV="Pragma" CONTENT="no-cache">
<META HTTP-EQUIV="Expires" CONTENT="-1">
</HEAD>

<BODY>
...
</BODY>

</HTML>

The meta tags work in the same way as described earlier for the appropriate HTTP header equivalent. In this example the page would not be cached when requested over a secure connection. Over a non-secure connection the page is cached but treated as immediately expired.

This final method of communicating with the browser is not implemented in Internet Explorer 4 or later.

Cache-Control headers are the recommended method of communication, and this has to be done on the server side, either via .asp pages or by setting the server configuration appropriately. Both of these means are outside my control, so I am afraid the ICPUG front page and What's New page will continue to be cached!

To get back to my initial problem, it seems clear that the web site I was accessing had set its server to add the Cache-Control header automatically to any page requested by a browser. The problem now revolved around what I could do about it.

Here a vague memory came back to me. Cache-Control headers are, as I said above, only applicable to HTTP 1.1. I vaguely remembered something about HTTP 1.1 in the settings for Internet Explorer, so I had another look at my IE settings. I fired up IE 4 and, under the View menu, went to the Internet Options entry before clicking on the Advanced tab. (In later versions of IE, Internet Options appears on the Tools menu.) There it was: at the end of the list of options was a group of HTTP 1.1 settings, and 'Use HTTP 1.1' was selected by default.

What would happen if I deselected 'Use HTTP 1.1'? Would the browser then use HTTP 1.0, in which case Cache-Control headers have no effect? Only one way to find out. I made the adjustment to the options and logged on to the Internet.

I went to the web site, downloaded a few pages and then disconnected from the Internet. I then went into IE in offline mode. Yes, the pages were still there in the cache!
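The behaviour I observed can be sketched as a toy caching decision in Python, assuming (as the article does) that a browser with HTTP 1.1 deselected simply ignores the Cache-Control header:

```python
def page_stored_in_cache(use_http_1_1: bool, response_headers: dict) -> bool:
    """Toy model of the caching decision described in the article.

    An HTTP 1.1 browser honours 'Cache-Control: no-cache' and keeps the
    page out of its cache; with HTTP 1.1 deselected the header is
    ignored, so the page is stored and remains viewable offline.
    """
    no_cache = response_headers.get("Cache-Control", "").lower() == "no-cache"
    if use_http_1_1 and no_cache:
        return False  # never written to the cache
    return True  # cached (perhaps marked expired, but still viewable offline)

headers = {"Cache-Control": "no-cache"}
print(page_stored_in_cache(True, headers))   # False - pages lost to offline viewing
print(page_stored_in_cache(False, headers))  # True - pages back in the cache
```

This is only a model of the observed behaviour, of course, not of any browser's actual cache implementation.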

Elation - problem solved - and a little more knowledge of web communication gained. The perfect ending to what started out as a very irritating experience.
