Saturday, March 01, 2003

CodeBit: Using Perl to Generate a Table of Contents for HTML Pages

This script creates a Table of Contents page for a set of HTML documents. It reads every file listed on the command line (wildcards are OK), searches each one for HTML heading tags (<H1>, <H2>, and so on), and writes a nested list of links to fulltoc.htm.

To index an entire directory use: perl toc.pl *.html
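
The command above relies on the shell to turn *.html into a list of file names. Unix shells do that, but the Windows command prompt hands the pattern to Perl unchanged. Here is a minimal sketch of one way around that, assuming you want the script to expand patterns itself. The expand_args helper is a hypothetical name and is not part of the script below.

```perl
use strict;
use warnings;

# Hypothetical helper: expand any wildcard patterns that are
# still present in the argument list. Arguments without wildcard
# characters pass through unchanged, even if no such file exists.
# A pattern that matches nothing contributes no file names.
sub expand_args {
    my(@args) = @_;
    return map { ($_ =~ m/[*?\[]/) ? glob($_) : ($_) } @args;
}

my(@files) = expand_args(@ARGV);
print("$_\n") foreach @files;
```

Note that Perl's built-in glob splits its argument on whitespace, so a pattern containing spaces would need extra quoting.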

NOTE: The files will be read in alphabetical order, which may not be the order you need. If it isn't, cut and paste the entries in the resulting HTML until the order is correct.
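
If your files are numbered, alphabetical order puts ch10.html ahead of ch2.html. As an alternative to rearranging the output by hand, here is a rough sketch that sorts by the first number in each file name. It assumes a naming scheme like ch1.html, ch2.html, and so on; the by_chapter_number helper is a hypothetical name, not part of the script.

```perl
use strict;
use warnings;

# Hypothetical helper: order file names by the first number they
# contain. Names without a number are treated as 0, and ties
# fall back to plain string order.
sub by_chapter_number {
    my(@names) = @_;
    my(%number);
    foreach my $name (@names) {
        $number{$name} = ($name =~ m/(\d+)/) ? $1 : 0;
    }
    return sort { $number{$a} <=> $number{$b} or $a cmp $b } @names;
}

# Prints ch1.html, ch2.html, ch10.html, one per line.
print("$_\n") foreach by_chapter_number("ch10.html", "ch2.html", "ch1.html");
```

To try it in the script, sort(@ARGV) in the foreach line could become by_chapter_number(@ARGV).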

#!/usr/bin/perl -w
#
# To index an entire directory use: 
#     perl toc.pl *.html
#
use strict;

# holds the name of each file
# as it is being processed.
my($file);       

# holds the text of the heading
# (from the anchor tag).
my($heading);   
                
# holds the last heading level
# for comparison.
my($oldLevel);   
                
# holds each line of the file 
# as it is being processed.
my($line);      
                
# used as temporary variables 
# to shorten script line widths
my($match);     
my($href);      

# holds the name of the heading 
# from the anchor tag.
my($name);      
                
# holds the level of the current heading.
my($newLevel);  

# First, I open an output file and print the 
# beginning of the HTML that is needed.
#
my($outputFile) = "fulltoc.htm";
open(OUT, ">$outputFile")
    or die("Cannot create $outputFile: $!\n");
print OUT ("<HTML><HEAD><TITLE>");
print OUT ("Detailed Table of Contents\n");
print OUT ("</TITLE></HEAD><BODY>\n");

# Now, loop through every file on the command 
# line looking for heading tags. When one is 
# found, look for an anchor tag so that the NAME 
# attribute can be used. The NAME attribute might 
# be different from the actual heading.
#
foreach $file (sort(@ARGV)) {
    # skip anything that is not an HTML file.
    next unless $file =~ m/\.html?$/i;
    print("$file\n");
    open(INP, "$file")
        or die("Cannot read $file: $!\n");
    print OUT ("<UL>\n");
    $oldLevel = 1;
    while (<INP>) {
        if (m!(<H\d>.+?</H\d>)!i) {
            # remove anchors from header.
            $line = $1;
            $match = '<A NAME="(.+?)">(.+?)</A>';
            if ($line =~ m!$match!i) {
                $name = $1;
                $heading = $2;
            }
            else {
                $match = '<H\d>(.+?)</H\d>';
                $line =~ m!$match!i;
                $name = $1;
                $heading = $1;
            }
            # the /i keeps lowercase tags such 
            # as <h2> from being missed here.
            m!<H(\d)>!i;
            $newLevel = $1;
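            # close or open one list per level 
            # of difference, so that a jump such 
            # as H3 to H1 stays balanced.
            while ($oldLevel > $newLevel) {
                print OUT ("</UL>\n");
                $oldLevel--;
            }
            while ($oldLevel < $newLevel) {
                print OUT ("<UL>\n");
                $oldLevel++;
            }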
            $oldLevel = $newLevel;
            $href = "\"$file#$name\"";
            print OUT ("<LI>");
            print OUT ("<A HREF=$href>");
            print OUT ("$heading</A>\n");
        }
    }
    while ($oldLevel--) {
        print OUT ("</UL>\n");
    }
    close(INP);
}

# End the HTML document and close the output file.
#
print OUT ("</BODY></HTML>");
close(OUT);
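
The list nesting is the easiest part of a script like this to get wrong, so it helps to check it away from any real files. Here is a minimal sketch, assuming heading levels of 1 or more. The toc_lists helper is a hypothetical name; it takes the heading levels in document order and returns the list tags that would be printed for them.

```perl
use strict;
use warnings;

# Hypothetical helper: given heading levels in document order,
# return the <UL>, </UL> and <LI> tags needed to nest them.
# One list is opened or closed per level of difference, so the
# tags stay balanced even when a level is skipped.
# Assumes every level is 1 or greater.
sub toc_lists {
    my(@levels) = @_;
    my(@tags) = ("<UL>");
    my($depth) = 1;
    foreach my $level (@levels) {
        while ($depth > $level) { push(@tags, "</UL>"); $depth--; }
        while ($depth < $level) { push(@tags, "<UL>");  $depth++; }
        push(@tags, "<LI>");
    }
    # close every list that is still open.
    push(@tags, "</UL>") while $depth-- > 0;
    return @tags;
}

# An H1, then an H3, then another H1.
print(join(" ", toc_lists(1, 3, 1)), "\n");
```

For the levels 1, 3, 1 this prints three <UL> tags and three </UL> tags, so the lists balance even though the H2 level was skipped.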