Web crawler links/page logic in PHP#1
I'm writing a basic crawler that simply caches pages with PHP.

All it does is use file_get_contents to fetch the contents of a webpage and a regex to pull out all the links of the form <a href="URL">DESCRIPTION</a>. At the moment it returns:

The problem I'm having is figuring out the logic behind determining whether the page link is local or sussing out whether it may be in a completely different local directory.

It could be any number of combinations, e.g. href="../folder/folder2/blah/page.html" or href="google.com" or href="page.html". The possibilities are endless.

What would be the correct algorithm to approach this? I don't want to lose any data that could be important.

posted date: 2008-12-11 14:45:00

Re: Web crawler links/page logic in PHP#2
I worked out a solution to this problem. Click to view my topic...

Hope that helps.

posted date: 2008-12-11 14:45:01

Re: Web crawler links/page logic in PHP#3
First of all, regex and HTML don't mix. Use:

    $doc = new DOMDocument();
    $doc->loadHTML($source);
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href'); // the link target
    }

Links that may go outside your site start with a protocol or //, e.g.

    http://example.com/
    //example.com/

Note that href="google.com" is a link to a local file.

But if you want to create a static copy of a site, why not just use wget?
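A slightly fuller sketch of the DOM approach, in case it helps. The function name extractLinks is made up for illustration, and it assumes $source is a non-empty HTML string. Real pages are rarely valid HTML, so libxml warnings are collected internally instead of being printed.

```php
<?php
// Hypothetical helper: returns the href of every <a> tag found in $source.
function extractLinks(string $source): array
{
    $doc = new DOMDocument();

    // Collect libxml parse warnings internally rather than emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($source);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        // Anchors without an href (e.g. named anchors) are skipped.
        if ($a->hasAttribute('href')) {
            $links[] = $a->getAttribute('href');
        }
    }
    return $links;
}

$html = '<p><a href="page.html">One</a> <a href="//example.com/">Two</a> <a name="x">No href</a></p>';
print_r(extractLinks($html));
```

The hrefs come back exactly as written in the page, so they still have to be classified and resolved against the page's own URL before fetching.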

posted date: 2008-12-11 14:55:00

Re: Web crawler links/page logic in PHP#4
You would have to look for http:// in the href. Otherwise, you could check whether it starts with ./ or any combination of "./" (such as ../). If you don't find a "/", then you would have to assume that it's a file. Would you like a script for this?
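The checks described above could be sketched roughly like this. The function name classifyHref and the labels it returns are invented for illustration. Accepting https:// and treating a leading // as absolute are assumptions added on top of the description, the latter borrowed from the earlier reply.

```php
<?php
// Hypothetical helper: a rough classification of an href by its prefix.
function classifyHref(string $href): string
{
    // Explicit http:// or https:// scheme.
    if (preg_match('#^https?://#i', $href) === 1) {
        return 'absolute';
    }
    // Protocol-relative URL such as //example.com/ (assumption: treat as absolute).
    if (strpos($href, '//') === 0) {
        return 'absolute';
    }
    // Starts with ./ or ../ : relative to the current directory.
    if (strpos($href, './') === 0 || strpos($href, '../') === 0) {
        return 'relative';
    }
    // Single leading slash: relative to the site root.
    if (strpos($href, '/') === 0) {
        return 'root-relative';
    }
    // No slash anywhere: assume a file in the current directory.
    if (strpos($href, '/') === false) {
        return 'file';
    }
    // Anything else, e.g. folder/page.html, is still a relative path.
    return 'relative';
}

foreach (['http://example.com/a', '../folder/page.html', 'page.html', 'google.com'] as $href) {
    echo $href, ' => ', classifyHref($href), PHP_EOL;
}
```

This is only prefix matching, so hrefs such as mailto: or javascript: links end up labelled 'file' and would need filtering separately.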

posted date: 2008-12-11 17:38:00

Re: Web crawler links/page logic in PHP#5
Sure, that would be a great help! :)

posted date: 2008-12-11 17:53:00

Re: Web crawler links/page logic in PHP#6
Let's first consider the properties of local links. These will either be:

- relative, with no scheme and no host, or
- absolute, with a scheme of 'http' or 'https' and a host that matches the machine from which the script is running

That's all the logic you'd need to identify if a link is local.

Use the parse_url function to separate out the different components of a URL to identify the scheme and host.
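A minimal sketch of that test, assuming the host to compare against is passed in as $siteHost (a made-up parameter, as is the function name isLocalLink). Two choices here go beyond the rule above and are assumptions: scheme-only links such as mailto: are treated as not local, and a protocol-relative link is judged by its host alone.

```php
<?php
// Hypothetical helper: true if $href points at the site identified by $siteHost.
function isLocalLink(string $href, string $siteHost): bool
{
    $parts = parse_url($href);
    if ($parts === false) {
        // parse_url gives up on seriously malformed URLs; treat those as not local.
        return false;
    }

    $scheme = isset($parts['scheme']) ? strtolower($parts['scheme']) : null;
    $host   = isset($parts['host']) ? strtolower($parts['host']) : null;

    // Relative link: no scheme and no host.
    if ($scheme === null && $host === null) {
        return true;
    }
    // A scheme without a host (mailto:, javascript:) is not a page on this site.
    if ($host === null) {
        return false;
    }
    // Absolute link: only http and https count.
    if ($scheme !== null && $scheme !== 'http' && $scheme !== 'https') {
        return false;
    }
    // Host must match the site being crawled (case-insensitive).
    return $host === strtolower($siteHost);
}

var_dump(isLocalLink('../folder/folder2/blah/page.html', 'example.com'));
var_dump(isLocalLink('http://example.com/page.html', 'example.com'));
var_dump(isLocalLink('https://google.com/', 'example.com'));
```

The host comparison is exact, so www.example.com and example.com count as different sites unless they are normalised first.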

posted date: 2008-12-15 11:16:00

Re: Web crawler links/page logic in PHP#7
Be careful with parse_url, it fails really easily :P It returns false on seriously malformed URLs, so check the result before using it.

posted date: 2008-12-26 19:23:00

