Classical Gas


I recently uploaded some new music to my website: "Country Comfort" by Rod Stewart. When this song came out, I thought it was amazing that Rod Stewart was able to get somebody of John Hartford's stature to sing backing vocals. John Hartford wrote "Gentle on My Mind," a huge hit for Glen Campbell. Hartford was also on the Glen Campbell show, where he played "Classical Gas" on his guitar while the TV audience watched a slide show of world events unfold on their screens at about 10 frames per second. This was a popular segment on the Glen Campbell show, since the slide-show format was already familiar from TV news shows and "Classical Gas" had already been a big hit for Mason Williams. I recreated that segment from the Glen Campbell show, but with some differences: the background music is the Mason Williams version of "Classical Gas," and the slide show consists of images coming off the Internet, live.

The program (a web crawler, or web spider) that gets these images is called crawler.pl and runs on my server all day long. crawler.pl gathers the URLs of image files and stores them in a file called data.txt. This web page displays those pictures using a technique called AJAX (Asynchronous JavaScript and XML). AJAX lets you update a web page without having to reload it; only the images are reloaded.
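
To make that handoff concrete, here is a minimal sketch of the write side (the image URL is hypothetical, and it assumes data.txt sits wherever the web page reads it from):

use strict;
use warnings;

# crawler.pl overwrites data.txt with the most recent image URL it has
# found; the web page polls that file and swaps the URL into its <img> tag.
my $image_url = 'http://example.com/pictures/waterfall.jpg';  # hypothetical

open(my $fh, '>', 'data.txt') or die "cannot open data.txt: $!";
print $fh $image_url;   # one URL per write; each new find replaces the old
close($fh);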

I thought I could just download some scripts off the Internet that would do all the work for me. I did a search for a web crawler written in Perl and found one with amazingly simple code, but it didn't quite work. I rewrote the web crawler and named it crawler.pl; here's my code:

use strict;
use warnings;
use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use Time::HiRes qw(sleep);

# Variable declarations: @links is the work queue of pages to visit,
# %crawled records pages we have already seen, $cur_link is the page
# currently being crawled, and $var holds that page's content.
my (@links, %crawled, $cur_link, $var, $link_var, $temp, $pic, $ua);
$ua = LWP::UserAgent->new;
$ua->timeout(120);

my $ArrCntr = -1;
# Fallback sites: if we ever hit a page without any links we might run out
# of pages to crawl, so we can restart from one of these.
my @UrlArr = (
    "http://marvel.com/",
    "http://cwtv.com/",
    "http://exoticspotter.com/",
    "http://huffingtonpost.com/",
    "http://relax.ru/",
    "http://stylebistro.com/",
    "http://soulnaked.com/",
    "http://blingee.com/",
    "http://www.world-of-waterfalls.com/",
);

# Seed the work queue with the fallback sites, then put the site named on
# the command line (if any) at the front of the queue.
unshift(@links, $UrlArr[$_]) for 0 .. $#UrlArr;
unshift(@links, $ARGV[0]) if defined $ARGV[0];

TOP:
                $cur_link = shift(@links);
                # If the queue is empty, fall back to the next seed site,
                # wrapping around when we run off the end of @UrlArr.
                if (not defined $cur_link)
                {
                    ++$ArrCntr;
                    $ArrCntr = 0 if $ArrCntr >= scalar(@UrlArr);
                    $cur_link = $UrlArr[$ArrCntr];
                }
               
                print"Just shifted out ".$cur_link."\n";                               
                $crawled{$cur_link} = 1 if defined $cur_link;
                 #print "Just got crawled value\n";               
	        my $request = new HTTP::Request('GET', $cur_link);
                #print "Just made request to web page: $!\n";
                my $response;
                if ($response = $ua->request($request))
                {
                    print "Just got a response from the web page\n";
                }
                else
                {
	  print "$!\n";
                }           
                print "Get the page contents\n";     
                $var = $response->content();
                $link_var = $var;   
                print "parse the image tags out of the content\n";
                my @p_pics =$var =~ /<img src=\"[^>]+>/g;
                #if ther are are no images on this page, skip it.
                my $arraySize = @p_pics;
                
                my $source = "";
                foreach $temp (@p_pics)
                {
                     # Strip the leading <img src=" (10 characters), then keep
                     # everything up to the closing quote: just the URL itself.
                     my $local_temp = substr $temp, 10;
                     my $char_pos = index($local_temp, '"');
                     $temp = substr $local_temp, 0, $char_pos;
                     if(index($temp, "http") == -1)
                       {
		          my $first = substr($temp, 0, 1);
                          if ($first eq '/')
                          {
                             $temp=$cur_link.$temp;
                          }
                          elsif ($first eq '.')
                          {
                              $temp = substr($temp, 3);
                              my $result = rindex($temp, '/');
                              $temp = substr($temp, 0, $result);
                              $temp = $cur_link.$temp;
                          }
                          else
                          {
                             $temp=$cur_link.'/'.$temp;
                          }
                     } 
                     # Keep only URLs that actually point at an image file.
                     next unless $temp =~ /\bhttps?:[^)'"\s]+\.(?:jpg|jpeg|gif|png)/i;
                     # Only interested in files that are > 64K in size.
                     my ($type, $size) = head($temp);
                     # Print the URL to a file so a web page can use it as the
                     # src for an img tag. This must be a local filesystem path
                     # (wherever the web page reads data.txt from) -- open()
                     # cannot write to a URL.
                     if ((defined $size) && ($size > 65536))
                     {
                        open(MYFILE, '>', 'data.txt') or die "cannot open data.txt: $!";
                        print MYFILE $temp;
                        close(MYFILE);
                        print "Just wrote ".$temp." to data.txt\n";
                        sleep(0.1);
                     }
                }                
               
                # In the next line we extract all the links in the current page.
                my @p_links = $link_var =~ /<a href=\"(.*?)\">/g;
                foreach $temp (@p_links)
                {
                        if ((!($temp =~ /^http/)) && ($temp =~ /^\//))
                        {
                                # This corrects site-relative addresses like "/index.aspx".
                                $temp = $cur_link.$temp;
                        }
                        if ($temp =~ m/([\d\w-.]+?\.(a[cdefgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmnoz]|e[ceghrst]|f[ijkmnor]|g[abdefghilmnpqrstuwy]|h[kmnrtu]|i[delmnoqrst]|j[emop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrstwy]|qa|r[eouw]|s[abcdeghijklmnortuvyz]|t[cdfghjkmnoprtvwz]|u[augkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|aero|arpa|biz|com|coop|edu|info|int|gov|mil|museum|name|net|org|pro|ru|us|xxx)(\b|\W(?<!&|=)(?!\.\s|\.{3}).*?))(\s|$)/)
                        {                                                                                 
                           # Add the link to the main @links queue.
                           print "Just found a link: ".$temp."\n";
                           # Trim any query string, and anything after a space.
                           my $interog_spot = index($temp, '?');
                           $temp = substr($temp, 0, $interog_spot) if $interog_spot != -1;
                           $interog_spot = index($temp, ' ');
                           $temp = substr($temp, 0, $interog_spot) if $interog_spot != -1;
                           my $arraySize = @links;
                           print "array size = ".$arraySize."\n";

                           if (not defined $crawled{$temp})
                           {
                              print "adding ".$temp." to the front of the links array\n";
                              unshift(@links, $temp);
                           }
                           print "Already traversed ".$temp." \n" if defined $crawled{$temp};
                        }                      
                }                 
goto TOP;    # loop back and crawl the next page; runs until killed

The above Perl script runs continuously on my server and keeps updating a file called data.txt with the URL of an image file. The program is run from the command prompt like so:

    perl crawler.pl http://www.louvre.fr/       # this yields pictures from the Louvre web site

To get the pictures from the website, this web page opens the data.txt file, reads the URL, and updates the <img> tag with the new URL as the src value. As I mentioned earlier in this blog, I use a technique called AJAX that lets me update just the image file without having to reload the web page. The code for this web page is mostly from a YouTube video I watched about using AJAX; a few modifications were still necessary (naturally), but the code I got from http://www.youtube.com/watch?v=IZGO13Sx_c8 was great. You can view the source code for this web page to see the AJAX code.
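
For reference, the gist of that AJAX code looks something like this (a minimal sketch, not the exact code from the video; the element id "slide", the 5-second polling interval, and the data.txt location are my assumptions):

function updateImage() {
    var xhr = new XMLHttpRequest();
    xhr.onreadystatechange = function () {
        if (xhr.readyState === 4 && xhr.status === 200) {
            // data.txt holds a single image URL; swap it into the img tag
            document.getElementById("slide").src = xhr.responseText;
        }
    };
    // Cache-busting timestamp so the browser fetches a fresh copy each time
    xhr.open("GET", "cgi-bin/data.txt?t=" + new Date().getTime(), true);
    xhr.send();
}
setInterval(updateImage, 5000);   // check for a new picture every 5 seconds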
