A compilation of various scripts to help data mine the web.
WebScrapingScripts.com offers ready-to-use scripts you can copy directly into your source code.
Getting started
Preface
This website is not a tutorial. While I do my best to document the code, this page is meant to be a repository of scripts for common web scraping needs: a reference guide.
Note: All scripts and snippets of code here are provided as-is. I confirm that each one works at the time of publishing, but the web is a dynamic place and things do change.
This script makes use of the cURL library to scrape the title of this website.
<?php
// Set vars
$url = "http://webscrapingscripts.com/";
$extract_from = "<title>";
$extract_to = "</title>";
// Return the text between $start and $end, or false if either marker is missing
function extractString($string, $start, $end){
    $ini = strpos($string, $start);
    if ($ini === false) return false;
    $ini += strlen($start);
    // Position of the end marker, searched for after the start marker
    $endPos = strpos($string, $end, $ini);
    if ($endPos === false) return false;
    return substr($string, $ini, $endPos - $ini);
}
// Fetch the page with cURL
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// Note: the next two options disable SSL certificate checks; remove them for HTTPS sites where the certificate matters
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
$output = curl_exec($ch);
$info = curl_getinfo($ch);
curl_close($ch);
if ($info['http_code'] == 200){
    $title = extractString($output, $extract_from, $extract_to);
    if ($title !== false){
        print "The scraped title is: $title\r\n";
        //TODO: do something with the scraped data
    }else{
        print "Error: Could not find the extract params in the page\r\n";
    }
}else{
    print "Error: Page did not return expected 200 OK\r\n";
}
?>
Should return:
The scraped title is: Web Scraping Scripts
Scrape from behind login
This script will scrape the title from behind a login form.
TODO: Coming soon
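While the full script is pending, the general idea can be sketched in Python using only the standard library. This is a rough sketch under stated assumptions, not the site's eventual script: it assumes the target site uses a plain HTML form that sets a session cookie on login, and every URL, form field name, and function name below is a hypothetical placeholder.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical placeholders: replace with the real site's values
LOGIN_URL = "http://example.com/login"
PROTECTED_URL = "http://example.com/members"
CREDENTIALS = {"username": "user", "password": "secret"}


def build_opener_with_cookies():
    """Create an opener that keeps cookies, so the login session persists."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def encode_form(fields):
    """URL-encode form fields into the bytes body of a POST request."""
    return urllib.parse.urlencode(fields).encode("utf-8")


def extract_title(html):
    """Return the text between <title> and </title>, or None if not found."""
    start = html.find("<title>")
    if start == -1:
        return None
    start += len("<title>")
    end = html.find("</title>", start)
    if end == -1:
        return None
    return html[start:end]


def scrape_behind_login(opener, login_url, protected_url, fields):
    """Log in, then fetch a protected page and return its title."""
    # Passing data makes urllib send a POST request
    opener.open(login_url, data=encode_form(fields)).close()
    # The opener replays any session cookie set by the login response
    with opener.open(protected_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_title(html)
```

A call would look like `scrape_behind_login(build_opener_with_cookies(), LOGIN_URL, PROTECTED_URL, CREDENTIALS)`. Real login forms often add hidden fields such as CSRF tokens, which this sketch does not handle.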
Scrape Various Twitter Data
This script will scrape various data sets from Twitter.
TODO: Coming soon
Python
Here lie all the Python scripts.
Simple Scrape (requests)
In this basic example we scrape the title using the requests library.
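import requests
url = "http://webscrapingscripts.com/"
extract_from = "<title>"
extract_to = "</title>"
page = requests.get(url)
if page.status_code == 200:
    # page.text is the decoded body (page.content is raw bytes in Python 3)
    html = page.text
    start = html.find(extract_from)
    # Look for the closing marker only after the opening one
    end = html.find(extract_to, start + len(extract_from)) if start != -1 else -1
    if end != -1:
        title = html[start + len(extract_from):end]
        print("The scraped title is: " + title)
        # TODO: do something with the scraped data
    else:
        print("Error: Could not find the extract params in the page")
else:
    print("Error: Page did not return expected 200 OK")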
Should return:
The scraped title is: Web Scraping Scripts
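The marker-based extraction used in these scripts can also be factored into a small reusable helper. The following is a minimal sketch rather than part of the original scripts: the function name `extract_string` and the sample HTML are assumptions made for illustration, modelled on the `extractString` function from the PHP example.

```python
def extract_string(text, start, end):
    """Return the text between the first `start` marker and the next `end` marker.

    Returns None if either marker is missing (hypothetical helper, modelled
    on the PHP extractString function).
    """
    begin = text.find(start)
    if begin == -1:
        return None
    begin += len(start)
    # Search for the end marker only after the start marker
    finish = text.find(end, begin)
    if finish == -1:
        return None
    return text[begin:finish]


# Sample input, made up for illustration
sample_html = "<html><head><title>Web Scraping Scripts</title></head></html>"
print("The scraped title is: " + str(extract_string(sample_html, "<title>", "</title>")))
```

Because the helper takes any string, it works the same on a downloaded page body as on the sample above, and returning None lets the caller decide how to report a missing marker.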
Perl
Here lie all the Perl scripts.
Simple Scrape (LWP module)
In this basic example we scrape the title using the Perl LWP module.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
my $request_url = "http://webscrapingscripts.com/";
my $extract_from = '<title>';
my $extract_to = '</title>';
my $title = '';
# Build a GET request with a custom User-Agent header
my $req = HTTP::Request->new(GET => $request_url);
$req->header("User-Agent" => "Perl Scraper 1.0");
my $ua = LWP::UserAgent->new;
my $response = $ua->request($req);
# Use the public accessor instead of reading the object's internal _content field
my $respContent = $response->decoded_content;
$respContent = '' unless defined $respContent;
if ( (my $from_index = index($respContent, $extract_from)) != -1){
    my $start = $from_index + length($extract_from);
    # Look for the closing marker only after the opening one
    if ( (my $to_index = index($respContent, $extract_to, $start)) != -1){
        $title = substr($respContent, $start, $to_index - $start);
        printf("Title is %s\r\n", $title);
        #
        #TODO do something with the data
        #
    }else{
        print "Error: extract_to string not found in response.\r\n";
    }
}else{
    print "Error: extract_from string not found in response.\r\n";
}
Should return:
Title is Web Scraping Scripts
Request / Contact
Don't see a possible scraping implementation here?