Getting started

Preface

This website is not a tutorial. While I do my best to document the code, this page is meant to be a reference guide: a repository of scripts for common web scraping needs.

Note: All scripts and snippets of code here are provided as-is. I confirm that each one works at the time of publishing, but the web is a dynamic place and things do change.

E-mail me at conwebscrapingscripts.com.

Connect with me on LinkedIn here.

Simple Scrape (cURL)

This PHP script uses the cURL library to scrape the title of this website.

<?php
// set vars
$url = "http://webscrapingscripts.com/";
$extract_from = "<title>";
$extract_to = "</title>";

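// return the text between $start and $end, or false if either marker is missing
function extractString($string, $start, $end){
    $ini = strpos($string, $start);
    if ($ini === false) return false;
    $ini += strlen($start);
    // position of the end marker, searching from just after the start marker
    $endPos = strpos($string, $end, $ini);
    if ($endPos === false) return false;

    return substr($string, $ini, $endPos-$ini);
}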

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// NOTE: certificate checks are disabled here for simplicity; re-enable them for production use
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
$output = curl_exec ($ch);
$info = curl_getinfo($ch);
curl_close ($ch);
if ($info['http_code'] == 200){
    $title = extractString($output, $extract_from, $extract_to);
    if ($title !== false){
        print "The scraped title is: $title\r\n";
        //TODO: do something with the scraped data
    }else{
        print "Error: Could not find the extract strings in the page\r\n";
    }
}else{
    print "Error: Page did not return expected 200 OK\r\n";
}
?>

Should return:

The scraped title is: Web Scraping Scripts

Scrape from behind login

This script will scrape the title from behind a login form.

TODO: Coming soon
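
A minimal sketch of the usual approach, written in Python with only the standard library, and not the site's own script: log in with a form POST, keep the session cookie, then fetch the protected page with the same cookie. It assumes a plain HTML login form with no CSRF token or JavaScript step. The function names, the URLs, and the form field names are all hypothetical and must be adapted to the target site.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Build an opener that stores cookies between requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def encode_form(fields):
    """URL-encode form fields into a POST body (bytes)."""
    return urllib.parse.urlencode(fields).encode("utf-8")


def scrape_title_behind_login(login_url, page_url, fields):
    """Log in with a form POST, then return the title of a protected page.

    All three arguments are placeholders (assumptions): the real login URL,
    page URL, and field names depend on the site being scraped.
    """
    opener, _jar = make_session()
    # POST the credentials; any session cookie the server sets lands in the jar
    opener.open(login_url, data=encode_form(fields), timeout=10).close()
    # the same opener sends the stored cookie along with the next request
    with opener.open(page_url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    start = html.find("<title>")
    end = html.find("</title>", start + len("<title>")) if start != -1 else -1
    if start == -1 or end == -1:
        return None  # one of the markers was not found
    return html[start + len("<title>"):end]
```

A call might look like `scrape_title_behind_login("https://example.com/login", "https://example.com/account", {"username": "demo", "password": "secret"})`, where every value is a placeholder.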

Scrape Various Twitter Data

This script will scrape various data sets from Twitter.

TODO: Coming soon

Python

Here lie all the Python scripts.

Simple Scrape (requests)

In this basic example we scrape the title using the requests library.

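import requests

url = "http://webscrapingscripts.com/"
extract_from = "<title>"
extract_to = "</title>"

page = requests.get(url)

if page.status_code == 200:
    html = page.text  # response body decoded to a string
    # locate the start marker, then the end marker that follows it
    start = html.find(extract_from)
    end = html.find(extract_to, start + len(extract_from)) if start != -1 else -1
    if start != -1 and end != -1:
        title = html[start + len(extract_from):end]
        print("The scraped title is: " + title)
    else:
        print("Error: Could not find the extract strings in the page")
else:
    print("Error: Page did not return expected 200 OK")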

Should return:

The scraped title is: Web Scraping Scripts
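
Searching for the literal strings `<title>` and `</title>` misses pages that write the tag with attributes or in upper case. As an alternative technique (not the method used in the script above), the sketch below uses Python's standard `html.parser` module to pull out the title. The names `TitleParser` and `extract_title` are made up for this illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # tag names arrive lowercased, so attributes and casing do not matter
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the page title, or None if no complete <title> element is found."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    if not parser.done:
        return None
    return "".join(parser.parts).strip()
```

To use it with the script above, pass the decoded response body, for example `extract_title(page.text)`.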

Perl

Here lie all the Perl scripts.

Simple Scrape (LWP module)

In this basic example we scrape the title using the Perl LWP module.

#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;

my $request_url = "http://webscrapingscripts.com/";

my $extract_from = '<title>';
my $extract_to = '</title>';
my $title = '';

my $req = HTTP::Request->new(GET => $request_url);
$req->header("User-Agent" => "Perl Scraper 1.0");

my $ua = LWP::UserAgent->new;
my $response = $ua->request($req);
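# use the public accessor rather than reaching into the object's internals
my $respContent = $response->decoded_content;

if ( (my $from_index = index($respContent, $extract_from)) != -1){
  # search for the closing string only after the opening one
  if ( (my $to_index = index($respContent, $extract_to, $from_index+length($extract_from))) != -1){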
    $title = substr($respContent, $from_index+length($extract_from),
            $to_index-$from_index-length($extract_from));
    printf("Title is %s\r\n", $title);
    #
    #TODO do something with the data
    #
  }else{
    print "Error: extract_to string not found in response.\r\n";
  }
}else{
  print "Error: extract_from string not found in response.\r\n";
}

Should return:

Title is Web Scraping Scripts

Request / Contact

Don't see the scraping implementation you need?

Request it by emailing conwebscrapingscripts.com.

WEB SCRAPING SCRIPTS

A compilation of various scripts to help data mine the web.

WebScrapingScripts.com offers ready-to-use scripts you can copy directly into your source code.
