HTTrack

Submitted by Peter on Wed, 2009-12-16 14:35

HTTrack copies Web sites to your local disk for backup and offline viewing. WebHTTrack is the Linux version and WinHTTrack is the Windows version. Both are free, so you do not have to juggle licences, and both are open source, which is safer because anyone can inspect the code.

This article is based on the Linux version. The Windows version is almost the same: there are minor installation differences, but after installation the two work the same way. You can download both from www.httrack.com.

There are a small number of legitimate uses for Web site copying.

  • Backup
  • Present
  • Read
  • Study
  • Test

Backup

If you have an old HTML-based Web site, you can back it up with HTTrack. After you upgrade the site to Drupal, there is no reason to back up the HTML representation of the files. Instead, back up the database and the image upload directory.

Present

You have your new Drupal site and want to present it to people with sections highlighted. You could back up the site using HTTrack, then edit the HTML to add highlighting.

You can also create a version to be shipped on CD as a product catalogue.

A public site could be backed up regularly then used as point-in-time evidence.

Read

You want to read a free online manual while sitting in a park, without paying for expensive wireless access. You can copy the manual at home or in the office using cheap broadband, then read it at your leisure in a different location.

Is there a free online manual read frequently by a lot of your staff? You could set up a proxy server to cache the site containing the manual or you could copy the manual to your local server using HTTrack. Think about the site owner missing out on advertising revenue because of your copy. The owner might not make enough money to continue working on the manual. Consider a donation to keep the manual development active.

Study

Copying a small part of a book or Web site is fair use if you are a student studying the subject presented by the material you are copying. Copying a small part is also legitimate for reviews.

Copying the whole site might be legitimate if the site owner asked you to quote on redeveloping the site. Copying the site is legitimate when you are paid to redevelop it, and the copy will be good proof that you transferred everything from the existing site.

Test

A site uses a particular theme or navigation and you want to test a change. You could make a copy then edit the copy. In most cases you will need only a small sample to test your change, not the whole site.

Install

Start Ubuntu 9.10. Other distributions of Linux and earlier versions of Ubuntu might require a more complicated installation.

Select Applications > Ubuntu Software Center.

Ubuntu software center showing a search box for finding applications

Search for httrack.

Search result showing WinHTTrack logo and description

Select WebHTTrack.

WebHTTrack selected with right arrow displayed

Select the right arrow.

The rest is based on WebHTTrack version 3.43.5-1ubuntu1.

webhttrack screen with version and Install button

Select Install.

The authentication screen pops up. Type in your password and select Authenticate.

Authentication screen with space for your password

WebHTTrack is now installed. Close the Ubuntu Software Center.

webhttrack installed message

Select Applications > Internet.

There are two entries for HTTrack and both are duplicated.

Browse Mirrored Websites is the link to see the Web site copies. All it does is open Firefox at the index page of a directory named websites in your home directory.

WebHTTrack Website Copier is the program to copy Web sites. Select the link to open WebHTTrack in Firefox and to start copying Web sites.

httrack welcome page with language selection

You can select a language; English is the default. Select Next >>.

Select an existing project or create a new project. You can give a project a category. You can change the base path, which defaults to /home/example/websites. Select Next >>.

project and category selection boxes

Type in the URL of the Web site you want to copy. I will use the example of the Inkscape keyboard and mouse reference version 0.46 at http://www.inkscape.org/doc/keys046.html.

url entry screen

Look in the options to check that they fit what you want.

httrack options with several options and many links to other option pages

I like to select [*] Get HTML files first!

Under Browser ID, you can change the browser identity from the default:
Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)

There is an HTML footer you might like to remove or change and the default is:

Scan Rules lets you include and exclude files. The default excludes doubleclick but not Google or any other usage trackers. You will want to expand the default:
+*.png +*.gif +*.jpg +*.css +*.js -ad.doubleclick.net/*
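
How the rules combine is easier to see in code. The following is a rough Python sketch, not HTTrack's own matcher. It assumes the rules are read left to right with the last matching rule winning, and it approximates HTTrack's wildcards with Python's fnmatch. The function name decide, the example links, and the www.google-analytics.com host are made up for illustration.

```python
from fnmatch import fnmatchcase

# The default scan rules quoted above.
DEFAULT_RULES = "+*.png +*.gif +*.jpg +*.css +*.js -ad.doubleclick.net/*"


def decide(link: str, rules: str, default: bool = False) -> bool:
    """Return True if the link would be accepted under the rules.

    Assumption: rules are applied in order and the last rule that
    matches the link decides the outcome. Links that match no rule
    get the value of `default`.
    """
    verdict = default
    for rule in rules.split():
        sign, pattern = rule[0], rule[1:]
        if sign not in "+-":
            continue  # ignore anything that is not an include/exclude rule
        if fnmatchcase(link, pattern):
            verdict = (sign == "+")
    return verdict


# Hypothetical expanded rule set that also excludes one tracker host.
EXPANDED_RULES = DEFAULT_RULES + " -www.google-analytics.com/*"

for link in (
    "www.example.com/logo.png",
    "ad.doubleclick.net/banner.gif",
    "www.google-analytics.com/ga.js",
):
    print(link, decide(link, DEFAULT_RULES), decide(link, EXPANDED_RULES))
```

Under these assumptions the doubleclick image is rejected by the default rules, while the tracker script still passes because of +*.js. It is rejected only once a matching exclude rule is appended.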

There are options to limit the depth of the copy, the inclusion of external files, the speed of the copy, and a lot of other things to keep you out of trouble with various sites.
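
The same limits can be written down as a command line, which is handy for keeping a record of the settings you chose. The Python sketch below only builds the argument list and does not run anything. It assumes the command-line httrack program is installed alongside WebHTTrack and that -O (output path), -rN (depth), -AN (maximum rate in bytes per second) and -F (browser identity) behave as I recall from the httrack help text, so check httrack --help before relying on it. The function name httrack_command, the output directory and the limit values are illustrative.

```python
import shlex
from typing import List, Optional, Sequence


def httrack_command(
    url: str,
    output_dir: str,
    depth: Optional[int] = None,
    max_rate: Optional[int] = None,
    user_agent: Optional[str] = None,
    rules: Sequence[str] = (),
) -> List[str]:
    """Build an httrack argument list without running it."""
    cmd = ["httrack", url, "-O", output_dir]
    if depth is not None:
        cmd.append(f"-r{depth}")      # limit the depth of the copy
    if max_rate is not None:
        cmd.append(f"-A{max_rate}")   # limit the speed of the copy
    if user_agent is not None:
        cmd += ["-F", user_agent]     # browser identity sent to the site
    cmd += list(rules)                # scan rules such as +*.png
    return cmd


cmd = httrack_command(
    "http://www.inkscape.org/doc/keys046.html",
    "/home/example/websites/inkscape",  # hypothetical project directory
    depth=2,                            # illustrative value
    max_rate=25000,                     # illustrative value
    rules=["+*.png", "-ad.doubleclick.net/*"],
)
# shlex.join quotes the wildcards so the line can be pasted into a shell.
print(shlex.join(cmd))
```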

Select Next >>.

There is an option to save your settings but not start the copying.

Select Start >>.

At the end of the process there is a link to browse the Web site copy.
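
You can also open the copy later without going through WebHTTrack. This is a minimal Python sketch, assuming the copies live in the default websites directory in your home directory and that the index page is named index.html. The file name and the function names mirror_index and open_mirror are assumptions for illustration.

```python
import webbrowser
from pathlib import Path
from typing import Optional


def mirror_index(base: Optional[Path] = None) -> Path:
    """Return the index page at the top of the mirror directory."""
    # Assumption: the default base path is ~/websites.
    if base is None:
        base = Path.home() / "websites"
    return base / "index.html"


def open_mirror(base: Optional[Path] = None) -> bool:
    """Open the mirror index in the default browser.

    Returns False without opening anything if the index page is missing.
    """
    index = mirror_index(base)
    if not index.is_file():
        return False
    # resolve() makes the path absolute, which as_uri() requires.
    return webbrowser.open(index.resolve().as_uri())
```

Calling open_mirror() with no argument tries the default location. Pass a different Path if you changed the base path when creating the project.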

Experiment with a small site of your own. Copy only free open material.

Conclusion

HTTrack is a useful tool for Web developers, and may be useful for Web site owners and for people placing reference material on intranets.