Not too long ago I wrote a blog post on how I was using iThreads to make sure my little girl's memorial wouldn't hang. Since I was only crawling one page, it was pretty simple: the main routine is the main thread (whether you like it or not, because that's the way Perl runs your program), and it keeps creating a new thread when the old one finishes. The new thread runs the code in a subroutine that goes through all the pictures of Citron in a folder I created just for this purpose. Perl takes care of all the cleanup, thread management, etc. However, that design doesn't seem to work well for multiple folders at different URLs.
Take a look at my blog post CrawlingWithThreads.html to see the code I was talking about in the last paragraph. In that code I have the statement $thr->join(). "This will wait for the corresponding thread to complete its execution. When the thread finishes, ->join() will return the return value(s) of the entry point function" (http://perldoc.perl.org/threads.html). That sounds like exactly what I want to happen, and it does for Citron's memorial page, but there are websites with crawl traps and other problems that cause a premature death of the crawling thread; main keeps waiting for it to return, so the crawler hangs. Fortunately, iThreads have methods that let you check on what they are doing, so I can just check on the thread in a loop to see whether it's still running. But if I'm going to do that, why go through the overhead of using ->join()? iThreads have a method ->detach(): "Makes the thread unjoinable, and causes any eventual return value to be discarded. When the program exits, any detached threads that are still running are silently terminated."
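Before the full crawler, here is a minimal sketch of the detach-and-poll idea on its own. The worker sub, the run_with_deadline name, and the deadline parameter are stand-ins made up for illustration; they are not part of the crawler, which polls without any deadline.

```perl
use strict;
use warnings;
use threads;
use Time::HiRes qw(sleep time);

# Hypothetical worker standing in for the crawl routine:
# it just sleeps for the requested number of seconds.
sub worker {
    my ($seconds) = @_;
    sleep($seconds);
    return;
}

# Start a detached thread and poll it until it finishes or the
# deadline (in seconds) passes. Returns 1 if the thread finished
# in time, 0 if we gave up waiting.
sub run_with_deadline {
    my ($seconds, $deadline) = @_;
    my $thr = threads->create(\&worker, $seconds);
    $thr->detach();
    my $start = time();
    while ($thr->is_running()) {
        return 0 if time() - $start > $deadline;
        sleep(0.1);
    }
    return 1;
}

print run_with_deadline(0.2, 2) ? "finished\n" : "gave up\n";
```

Because the thread is detached, main never blocks in ->join(); it only looks at ->is_running(), so it stays free to give up on a thread that takes too long.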
use strict;
use warnings;
use LWP::Simple;
use LWP::UserAgent;
use Time::HiRes qw(sleep);
use threads;
# Now we will define variables. "links" is the list we are using,
# "cur_link" is the link of the current page, and
# "var" is used to take in the page content.
my (@links, %crawled, $cur_link, $var, $link_var, $temp, $pic, $ua);
$ua = LWP::UserAgent->new;
$ua->timeout(10);
our $request;
our $response;
my $ArrCntr = -1;
my @UrlArr;
$UrlArr[0] = "http://www.marvel.com";
$UrlArr[1] = "http://cwtv.com/";
$UrlArr[2] = "http://exoticspotter.com/";
$UrlArr[3] = "http://maserati.com/";
$UrlArr[4] = "http://formula1.ferrari.com/history";
$UrlArr[5] = "http://stylebistro.com/";
$UrlArr[6] = "http://soulnaked.com/";
$UrlArr[7] = "http://blingee.com/blingee";
$UrlArr[8] = "http://www.world-of-waterfalls.com/";
$UrlArr[9] = "http://www.dailymail.co.uk/femail/article-2629235/Beautiful-Burlesque-Dancers-captured-camera-stunning-series-New-York-based-photographer.html";
$UrlArr[10] = "http://ilovegermanshepherds.tumblr.com/";
$UrlArr[11] = "http://readme.ru/";
$UrlArr[12] = "http://www.chevrolet.com/";
$UrlArr[13] = "http://racermag.kinja.com/the-most-beautiful-racecars-of-all-time-its-your-choi-1506825620";
$UrlArr[14] = "http://www.airliners.net/search/photo.search?album=20939";
$UrlArr[15] = "http://simplylingerie.tumblr.com/";
$UrlArr[16] = "http://www.cutestpaw.com/articles/50-cute-cats-make-your-life-happier/";
$UrlArr[17] = "http://www.cutestpaw.com/tag/cats/";
$UrlArr[18] = "http://halloffame.hooters.com/";
$UrlArr[19] = "http://copypast.ru/2007/09/11/krasivye_zhenshhiny_87_foto.html";
$UrlArr[20] = "http://dccomics.com";
$UrlArr[21] = "http://relax.ru/post/106604/Poslednie-fotografii-bespodobnoy-Dzhennifer-Eniston-ot-zhurnala-Hollywood-Reporter.html?feed=new";
$UrlArr[22] = "http://just-lingerie.tumblr.com/";
$UrlArr[23] = "DavePics/pics.html";
$UrlArr[24] = "http://relax.ru";
$UrlArr[25] = "http://qip.ru";
$UrlArr[26] = "http://ferrari.com";
$UrlArr[27] = "http://swimsuit.si.com/swimsuit/models/kate-upton/photos/1";
$UrlArr[28] = "http://www.elle.com/";
$UrlArr[29] = "http://www.cybernetplaza.com/formal-dresses.asp?gclid=CJf3jNzw074CFZBxOgodAU4Adw";
$UrlArr[30] = "http://www.racingsportscars.com/make/photo/Maserati.html";
$UrlArr[31] = "http://www.pinterest.com/kravitzt/pin-up-cheesecake-photos/";
$UrlArr[32] = "http://www.refinery29.com/53717?utm_source=taboola&utm_medium=adsales&utm_content=beauty_slideshows#slide";
$UrlArr[33] = "http://bendelaney.me";
$UrlArr[34] = "http://www.bugatti.com/en/tradition/100-years-of-bugatti/stories-of-a-century/from-the-racetrack-to-the-road.html";
$UrlArr[35] = "http://deviantart.com";
$UrlArr[36] = "http://www.bwotd.com/category/clothed/";
$UrlArr[37] = "http://huffingtonpost.com";
$UrlArr[38] = "http://sportscarheaven.tumblr.com/";
$UrlArr[39] = "CitronGallery/AllPics.html";
$UrlArr[40] = "http://brasonly.tumblr.com/";
$UrlArr[41] = "http://lovefrenchbulldogs.tumblr.com/";
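# Main loop: cycle through the URLs forever, one crawler thread at a time.
while (++$ArrCntr > -1) {
    # Wrap around once we run past the last valid index.
    if ($ArrCntr >= scalar(@UrlArr)) {
        $ArrCntr = 0;
    }
    unshift(@links, $UrlArr[$ArrCntr]);
    my ($thr) = threads->create(\&crawl);
    $thr->detach();
    # Poll instead of calling join(), so a dead thread can't hang main.
    while ($thr->is_running()) {
        sleep(1);
    }
}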
sub crawl {
    $cur_link = shift(@links);
    # Fall back to the next URL in the list if nothing was queued.
    ++$ArrCntr if not defined $cur_link;
    if ($ArrCntr >= scalar(@UrlArr)) {
        $ArrCntr = 0;
    }
    $cur_link = $UrlArr[$ArrCntr] if not defined $cur_link;
    $crawled{$cur_link} = 1 if defined $cur_link;
    # Build the request; kill the thread if that fails.
    if ($request = HTTP::Request->new('GET', $cur_link)) {
        #print "Get worked\n";
    }
    else {
        #print "Get failed\n";
        threads->exit();
        return 0;
    }
    #print "Now get a response request\n";
    sleep(1.00);
    $response = $ua->simple_request($request);
    if ($response->is_success) {
        #print "Got response\n";
        #$response->decoded_content;
    }
    else {
        #print "Request failed\n";
        threads->exit();
    }
    #print "Get the page contents\n";
    $var = $response->content();
    $link_var = $var;
    #print "parse the image tags out of the content\n";
    my @p_pics = $var =~ /<img src=\"[^>]+>/g;
    #if there are no images on this page, skip it.
    my $arraySize = scalar(@p_pics);
    my $source = "";
    my $cntr = 0;
    foreach $temp (@p_pics) {
        # Strip the leading '<img src="' (10 characters) and keep
        # everything up to the closing quote.
        my $local_temp = substr $temp, 10;
        my $char_pos = index($local_temp, '"');
        $temp = substr $local_temp, 0, $char_pos;
        # Turn relative image paths into absolute URLs.
        if (index($temp, "http") == -1) {
            my $first = substr($temp, 0, 1);
            if ($first eq '/') {
                $temp = $cur_link.$temp;
            }
            elsif ($first eq '.') {
                $temp = substr($temp, 3);
                my $result = rindex($temp, '/');
                $temp = substr($temp, 0, $result);
                $temp = $cur_link.$temp;
            }
            else {
                $temp = $cur_link.'/'.$temp;
            }
        }
        # Skip anything that doesn't look like an image URL.
        next unless $temp =~ /\bhttps?:[^)''"\s]+\.(?:jpg|JPG|jpeg|JPEG|gif|GIF|png|PNG)/;
        # Only interested in files that are > 64K in size.
        # head() returns (type, length, ...) in list context; in scalar
        # context it only returns a true value, so take the second element.
        my $size = 0;
        (undef, $size) = head($temp);
        #print temp to a file so a web page can use it as the src for an img tag.
        if ((defined $size) && ($size > 65536)) {
            if (open(my $fh, '>', 'data.txt')) {
                print $fh $temp;
                close($fh);
                print "Just wrote ".$temp." to data.txt\n";
            }
            sleep(0.25);
            #print "Just slept for 0.25 seconds\n";
        }
        else {
            #print "file is too small to use\n";
        }
        #print "At the bottom of the loop\n";
        #print "$cntr\n";
        #print "$arraySize\n";
        if (++$cntr >= scalar(@p_pics)) {
            #print "Exiting loop\n";
            last;
        }
    }
    threads->exit();
}
Notice that at the end of the crawling subroutine I kill the thread with the statement threads->exit(). The main routine keeps monitoring the thread to see whether it's still running, so once it sees that the thread has died it creates a new one. The other thing to notice about my code is that I check whether the "GET" request was created successfully; if it wasn't, the code kills the thread and exits the subroutine. Next, my code checks whether it gets a response from the web page, i.e., it makes sure that it can get the source code of the page, and if it can't, it kills the thread and returns to main. I've had this code running for several days now without incident, so it seems to be the way to go.
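For anyone who wants to try the monitor-and-restart loop without hitting real websites, here is a small bounded sketch of it. It is an assumption-laden variation, not my crawler: crawl_one, next_index, run_rounds, and the example.com URLs are hypothetical names, the worker only sleeps, and each URL is handed to the thread as an argument instead of going through a shared array.

```perl
use strict;
use warnings;
use threads;
use Time::HiRes qw(sleep);

# Hypothetical stand-in for the real crawl routine: it receives
# its URL as an argument and only pretends to fetch it.
sub crawl_one {
    my ($url) = @_;
    sleep(0.05);
    return;
}

# Advance an index through a list of $count items, wrapping to 0
# after the last valid index ($count - 1).
sub next_index {
    my ($index, $count) = @_;
    return ($index + 1) % $count;
}

# Run $rounds crawls, one detached thread at a time, and return
# the URLs in the order they were scheduled.
sub run_rounds {
    my ($urls, $rounds) = @_;
    my @scheduled;
    my $index = -1;
    for (1 .. $rounds) {
        $index = next_index($index, scalar @$urls);
        my $thr = threads->create(\&crawl_one, $urls->[$index]);
        $thr->detach();
        # Wait for the thread to die before starting the next one.
        sleep(0.01) while $thr->is_running();
        push @scheduled, $urls->[$index];
    }
    return @scheduled;
}

my @order = run_rounds([ 'http://example.com/a', 'http://example.com/b' ], 5);
print join("\n", @order), "\n";
```

The modulo in next_index keeps the counter inside the valid range of the list, and the fixed number of rounds is only there so the sketch terminates; the real crawler loops forever.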