Thursday, 25 April 2013

How to Scrape a Web Page Using PHP and cURL

There are many reasons why a PHP programmer might wish to grab content from a Web page by "scraping" the site; that is, using a script to capture the page's HTML (and, optionally, the images and other content that appear on the page when it is displayed in a browser). For instance, the programmer might want to monitor a certain page and send out an alert when the page has been modified, or index the page for use in a search engine. This article will describe how to scrape a Web page using PHP (a programming language) and cURL (a library for transferring data to and from URLs, which PHP exposes through an extension). Note that not all installations of PHP have cURL enabled. Call the function phpinfo() and look for a "curl" section in its output to determine whether cURL is available to your script.
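
As an alternative to reading through the phpinfo() output, a script can test for the extension itself. The sketch below relies only on PHP's built-in extension_loaded() and function_exists(); the helper name curl_is_available() is made up for illustration and is not part of PHP.

```php
<?php
// Hypothetical helper (not part of PHP): reports whether the cURL
// extension is loaded and its entry-point function is callable.
function curl_is_available(): bool
{
    return extension_loaded('curl') && function_exists('curl_init');
}

if (curl_is_available()) {
    echo "cURL is available\n";
} else {
    echo "cURL is not enabled in this PHP installation\n";
}
```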

First, a few words of warning:

1) The content of a Web page may be copyrighted, and it may be illegal for you to use or reproduce that content without the consent of the copyright holder. It is your responsibility to determine whether scraping a particular Web page could get you into legal trouble.

2) Because scraping a Web page uses up the bandwidth of the site on which the page appears, you should avoid any scraping activity that might consume the site's available resources or expend its available bandwidth. Generally, scraping a site several times a day will not cause a problem. But scraping a site several times a second assuredly will.
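
One way to honor the second warning is to enforce a minimum pause between requests. This is a minimal sketch, not a complete rate limiter: the helper name wait_for_next_request() and the interval you pass to it are assumptions chosen for illustration.

```php
<?php
// Hypothetical helper: sleeps just long enough that at least
// $minIntervalSeconds pass between consecutive calls.
// Returns the number of seconds it decided to sleep.
function wait_for_next_request(float $minIntervalSeconds): float
{
    static $lastCall = null; // time of the previous call, if any

    $slept = 0.0;
    if ($lastCall !== null) {
        $elapsed = microtime(true) - $lastCall;
        if ($elapsed < $minIntervalSeconds) {
            $slept = $minIntervalSeconds - $elapsed;
            usleep((int) round($slept * 1000000)); // usleep() takes microseconds
        }
    }
    $lastCall = microtime(true);

    return $slept;
}
```

Calling wait_for_next_request(2.0) immediately before each request in a loop would keep the script to roughly one request every two seconds.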

A simple scraping script

PHP has several functions that interface with cURL to scrape Web pages.

Below is a simple script that captures the HTML from a specified URL and stores it in a variable called $data:

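<?php
// URL of the page to scrape
$url = "http://www.example.com";

// Start a cURL session and point it at the URL
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
// Return the response as a string instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// $data holds the HTML on success, or false on failure
$data = curl_exec($ch);
curl_close($ch);

?>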

The function curl_setopt() sets options that control how your cURL request is processed. For instance, you can use it to specify whether the Web page's headers should be returned along with the HTML (CURLOPT_HEADER), or whether your script should send a particular Referer header (CURLOPT_REFERER).

For a complete rundown of the options that can be set using curl_setopt(), view the entry for this function in the PHP manual.
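
The two options mentioned above, returning headers and sending a referrer, might be set as in the following sketch. The CURLOPT_* constants are real PHP constants; the helper name configure_scraper(), the 10-second timeout, and the URLs in the usage comment are placeholders chosen for illustration.

```php
<?php
// Hypothetical helper: applies a few options to an existing cURL handle.
// Returns true only if every option was accepted.
function configure_scraper($ch, string $url, string $referer): bool
{
    return curl_setopt($ch, CURLOPT_URL, $url)
        && curl_setopt($ch, CURLOPT_RETURNTRANSFER, true) // return the response as a string
        && curl_setopt($ch, CURLOPT_HEADER, true)         // include response headers in the output
        && curl_setopt($ch, CURLOPT_REFERER, $referer)    // send this value as the Referer header
        && curl_setopt($ch, CURLOPT_TIMEOUT, 10);         // give up after 10 seconds
}

// Example usage (performs a real HTTP request, so it is left commented out):
// $ch = curl_init();
// if (configure_scraper($ch, 'http://www.example.com', 'http://www.example.org/')) {
//     $response = curl_exec($ch); // response headers followed by the HTML body
// }
```

With CURLOPT_HEADER enabled, the string returned by curl_exec() begins with the response headers, so the script has to separate them from the HTML before parsing.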

Once you have captured the Web page's HTML in the $data variable, you can parse it. Because curl_exec() returns false when the request fails, check for that value before parsing.
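
Before parsing, it can help to confirm that the request actually succeeded. The sketch below wraps the earlier script in a hypothetical helper named fetch_page(); the helper, its default timeout, and the decision to accept only a 200 status code are assumptions rather than part of the original script.

```php
<?php
// Hypothetical helper: fetches $url and returns the body as a string,
// or null if the transfer failed or the server did not answer 200 OK.
function fetch_page(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);

    $data = curl_exec($ch);
    if ($data === false) {
        // curl_error() describes what went wrong (DNS failure, timeout, ...)
        error_log('cURL error: ' . curl_error($ch));
        return null;
    }

    // HTTP status code of the last response, e.g. 200 or 404
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // The handle is released when $ch goes out of scope.
    return $status === 200 ? $data : null;
}
```

A caller would then write something like $data = fetch_page($url); and skip parsing whenever the result is null.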

For instance, you can use regular expressions to isolate a specific portion of the Web page, or you can use a simple string-matching function such as strpos() to determine whether a certain term appears anywhere in the content.
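
Both approaches can be sketched on a small sample string. The HTML below is a stand-in for the $data returned by curl_exec(), and the search term is arbitrary. Regular expressions are adequate for simple, predictable markup like this, but they tend to break on more complex pages.

```php
<?php
// Sample HTML standing in for the $data returned by curl_exec().
$data = '<html><head><title>Example Domain</title></head>'
      . '<body><p>This domain is for use in examples.</p></body></html>';

// Regular expression: capture the text between <title> and </title>.
// The "i" flag ignores case; the "s" flag lets "." match newlines.
$title = null;
if (preg_match('/<title>(.*?)<\/title>/is', $data, $matches) === 1) {
    $title = trim($matches[1]);
}

// strpos() returns the position of the first match, or false if there
// is none. Compare with !== false, because a match at position 0 would
// otherwise be mistaken for "not found".
$term = 'examples';
$found = strpos($data, $term) !== false;

echo "Title: " . ($title ?? '(none)') . "\n";          // Title: Example Domain
echo "Contains '$term': " . ($found ? 'yes' : 'no') . "\n"; // Contains 'examples': yes
```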

Source: http://voices.yahoo.com/how-scrape-web-page-using-php-curl-5442987.html

Note:

Delta Ray is an experienced web scraping consultant and writes articles on Yelp Data Scraping, LinkedIn Profile Scraping, Yellowpages Data Scraping, eBay Product Scraping, Website Harvesting, IMDb Data Scraping, Yelp Review Scraping, Tripadvisor Data Scraping, LinkedIn Email Scraping and Screen Scraping Services.
