Getting started

Preface

This website is not a tutorial. While I do my best to document the code, this page is meant to be a reference guide: a repository of scripts for common web scraping needs.

Note: All scripts and snippets of code here are provided as-is. I confirmed that they worked at the time of publishing, but the web is a dynamic place and things change.

E-mail me at conwebscrapingscripts.com.

Connect with me on LinkedIn here.

Simple Scrape (cURL)

This script makes use of the cURL library to scrape the title of this website.

<?php
// set vars
$url = "http://webscrapingscripts.com/";
$extract_from = "<title>";
$extract_to = "</title>";

function extractString($string, $start, $end){
    $ini = strpos($string, $start);
    if ($ini === false) return false;
    $ini += strlen($start);
    $len = strpos($string, $end, $ini);
    if ($len === false) return false;

    return substr($string, $ini, $len-$ini);
}

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
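// NOTE: the next two lines turn off SSL certificate checks; remove them for HTTPS sites whose identity matters
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);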
$output = curl_exec ($ch);
$info = curl_getinfo($ch);
curl_close ($ch);
if ($info['http_code'] == 200){
    $title = extractString($output, $extract_from, $extract_to);
    if ($title !== false){
        print "The scraped title is: $title\r\n";
        //TODO: do something with the scraped data
    }else{
        print "Error: Could not find the extract params in the page\r\n";
    }
}else{
    print "Error: Page did not return expected 200 OK\r\n";
}
?>

Should return:

The scraped title is: Web Scraping Scripts

Scrape from behind login

This script will scrape the title from behind a login form.

TODO: Coming soon
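
Until the full script is published, the general approach can be sketched in Python. This is a hedged outline, not the author's script: it assumes the site uses a plain HTML form login that sets a session cookie. The function name `scrape_title_behind_login` and the `credentials` field names are hypothetical placeholders. The `session` argument is expected to behave like `requests.Session`, meaning it has `post` and `get` methods that return objects with `status_code` and `text`.

```python
def scrape_title_behind_login(session, login_url, page_url, credentials):
    """Log in with a form POST, then fetch a protected page and return its title.

    `session` must keep cookies between calls, as requests.Session does.
    Returns None if either request fails or no <title> element is found.
    All names here are illustrative assumptions, not a published API.
    """
    # Submit the login form; the session stores any cookie the site sets.
    login = session.post(login_url, data=credentials)
    if login.status_code != 200:
        return None

    # Request the protected page with the same session (and its cookies).
    page = session.get(page_url)
    if page.status_code != 200:
        return None

    html = page.text
    start = html.find("<title>")
    if start == -1:
        return None
    start += len("<title>")
    # Look for the closing tag only after the opening tag.
    end = html.find("</title>", start)
    if end == -1:
        return None
    return html[start:end]
```

With the requests library installed, a call might look like `scrape_title_behind_login(requests.Session(), login_url, page_url, {"username": "me", "password": "secret"})`, where the URLs and form field names depend entirely on the target site. Many sites return 200 even when a login fails, and some require a hidden CSRF token in the form, so a real script needs site-specific checks.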

Scrape Various Twitter Data

This script will scrape various data sets from Twitter.

TODO: Coming soon

Python

Here lie all the Python scripts.

Simple Scrape (requests)

In this basic example we scrape the title using the requests library.

import requests

url = "http://webscrapingscripts.com/"
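extract_from = "<title>"
extract_to = "</title>"

page = requests.get(url)

if page.status_code == 200:
    # page.text is the decoded response body (page.content is raw bytes in Python 3)
    html = page.text
    start = html.find(extract_from)
    # look for the closing marker only after the opening marker
    end = html.find(extract_to, start + len(extract_from)) if start != -1 else -1
    if start != -1 and end != -1:
        title = html[start + len(extract_from):end]
        print("The scraped title is: " + title)
    else:
        print("Error: Could not find the extract params in the page")
else:
    print("Error: Page did not return expected 200 OK")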

Should return:

The scraped title is: Web Scraping Scripts
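
The string slicing in the script above can be factored into a small helper that mirrors the PHP `extractString` function from the cURL example. What follows is a minimal sketch rather than one of the original scripts: the name `extract_between` is hypothetical, and the sample HTML is made up so the snippet runs without a network connection.

```python
def extract_between(text, start, end):
    """Return the text between the first `start` marker and the next `end` marker.

    Returns None when either marker is missing. The name extract_between is
    hypothetical; the logic is modeled on the PHP extractString helper.
    """
    begin = text.find(start)
    if begin == -1:
        return None
    begin += len(start)
    # Search for the end marker only after the start marker.
    finish = text.find(end, begin)
    if finish == -1:
        return None
    return text[begin:finish]


# Made-up sample page, used here instead of a live request.
sample = "<html><head><title>Web Scraping Scripts</title></head></html>"
print(extract_between(sample, "<title>", "</title>"))
```

Returning `None` instead of an empty string lets the caller tell "marker not found" apart from "title is empty", which is the same distinction the PHP version makes with `false`.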

Perl

Here lie all the Perl scripts.

Simple Scrape (LWP module)

In this basic example we scrape the title using the Perl LWP module.

#!/usr/bin/perl

use LWP::UserAgent;

my $request_url = "http://webscrapingscripts.com/";

my $extract_from = '<title>';
my $extract_to = '</title>';
my $title = '';

my $req = HTTP::Request->new(GET => $request_url);
$req->header("User-Agent" => "Perl Scraper 1.0");

my $ua = LWP::UserAgent->new;
my $response = $ua->request($req);
# use the public accessor instead of reaching into the object's internals
my $respContent = $response->decoded_content;

if ( (my $from_index = index($respContent, $extract_from)) != -1){
  # search for the closing marker only after the opening marker
  if ( (my $to_index = index($respContent, $extract_to, $from_index+length($extract_from))) != -1){
    $title = substr($respContent, $from_index+length($extract_from),
            $to_index-$from_index-length($extract_from));
    printf("Title is %s\r\n", $title);
    #
    #TODO do something with the data
    #
  }else{
    print "Error: extract_to string not found in response.\r\n";
  }
}else{
  print "Error: extract_from string not found in response.\r\n";
}

Should return:

Title is Web Scraping Scripts

Request / Contact

Don't see the scraping implementation you need?

Request it by emailing conwebscrapingscripts.com.

Copyright (c) 2019 Web Scraping Scripts / Tom Chmielarz