Maine Web Scraping: September 2014

How to login to website and extract data using PHP

I have installed the tiny tiny rss on to my computer (Windows) and also have Xampp installed (localhost).

I want to be able to use PHP to extract data from the Tiny tiny RSS webpage.

I have tried this it which just opens the front page:

<?php
$homepage = file_get_contents('my install tiny tiny rss url');
echo $homepage;
?>

But how do I login and extract the data.

You can use cURL to send post data and headers. To login you need to replicate the exact data exchange between the client and the server.

Source: http://stackoverflow.com/questions/20611918/how-to-login-to-website-and-extract-data-using-php

PDF Scraping: Making Modern File Formats More Accessible

Data scraping is the process of automatically sorting through information contained on the internet inside html, PDF or other documents and collecting relevant information to into databases and spreadsheets for later retrieval. On most websites, the text is easily and accessibly written in the source code but an increasing number of businesses are using Adobe PDF format (Portable Document Format: A format which can be viewed by the free Adobe Acrobat software on almost any operating system. See below for a link.). The advantage of PDF format is that the document looks exactly the same no matter which computer you view it from making it ideal for business forms, specification sheets, etc.; the disadvantage is that the text is converted into an image from which you often cannot easily copy and paste. PDF Scraping is the process of data scraping information contained in PDF files. To PDF scrape a PDF document, you must employ a more diverse set of tools.

There are two main types of PDF files: those built from a text file and those built from an image (likely scanned in). Adobe's own software is capable of PDF scraping from text-based PDF files but special tools are needed for PDF scraping text from image-based PDF files. The primary tool for PDF scraping is the OCR program. OCR, or Optical Character Recognition, programs scan a document for small pictures that they can separate into letters. These pictures are then compared to actual letters and if matches are found, the letters are copied into a file. OCR programs can perform PDF scraping of image-based PDF files quite accurately but they are not perfect.

Once the OCR program or Adobe program has finished PDF scraping a document, you can search through the data to find the parts you are most interested in. This information can then be stored into your favorite database or spreadsheet program. Some PDF scraping programs can sort the data into databases and/or spreadsheets automatically making your job that much easier.

Quite often you will not find a PDF scraping program that will obtain exactly the data you want without customization. Surprisingly a search on Google only turned up one business, (the amusingly named ScrapeGoat.com http://www.ScrapeGoat.com) that will create a customized PDF scraping utility for your project. A handful of off the shelf utilities claim to be customizable, but seem to require a bit of programming knowledge and time commitment to use effectively. Obtaining the data yourself with one of these tools may be possible but will likely prove quite tedious and time consuming. It may be advisable to contract a company that specializes in PDF scraping to do it for you quickly and professionally.

Let's explore some real world examples of the uses of PDF scraping technology. A group at Cornell University wanted to improve a database of technical documents in PDF format by taking the old PDF file where the links and references were just images of text and changing the links and references into working clickable links thus making the database easy to navigate and cross-reference. They employed a PDF scraping utility to deconstruct the PDF files and figure out where the links were. They then could create a simple script to re-create the PDF files with working links replacing the old text image.

A computer hardware vendor wanted to display specifications data for his hardware on his website. He hired a company to perform PDF scraping of the hardware documentation on the manufacturers' website and save the PDF scraped data into a database he could use to update his webpage automatically.

PDF Scraping is just collecting information that is available on the public internet. PDF Scraping does not violate copyright laws.

PDF Scraping is a great new technology that can significantly reduce your workload if it involves retrieving information from PDF files. Applications exist that can help you with smaller, easier PDF Scraping projects but companies exist that will create custom applications for larger or more intricate PDF Scraping jobs.

Source:http://ezinearticles.com/?PDF-Scraping:-Making-Modern-File-Formats-More-Accessible&id=193321

Scraping dynamic content from query

The results from timed events used to be downloadable as a single file, but the host company doesn't provide this service anymore. Instead, they post the results on their website: a query is made to their database and the results are displayed 50 lines at a time. When you choose to view the next 50 results, from what I can tell, no new SQL request is sent, but the next 50 lines are displayed.

How can I scrape the entire set of results from (for example): http://www.bluechipresults.com.au/results.aspx?CId=11&RId=599

IE or Firefox

Source:http://stackoverflow.com/questions/25395554/scraping-dynamic-content-from-query

Data Scraping vs. Data Crawling

One of our favorite quotes has been- ‘If a problem changes by an order, it becomes a totally different problem’ and in this lies the answer to- what’s the difference between scraping and crawling?

Crawling usually refers to dealing with large data-sets where you develop your own crawlers (or bots) which crawl to the deepest of the web pages. Data scraping on the other hand refers to retrieving information from any source (not necessarily the web). It’s more often the case that irrespective of the approaches involved, we refer to extracting data from the web as scraping (or harvesting) and that’s a serious misconception.

Below are some differences in our opinion- both evident and subtle

Scraping data does not necessarily involve the web. Data scraping could refer to extracting information from a local machine, a database, or even if it is from the internet, a mere “Save as” link on the page is also a subset of the data scraping universe. Crawling on the other hand differs immensely in scale as well as in range. Firstly, crawling = web crawling which means on the web, we can only “crawl” data. Programs that perform this incredible job are called crawl agents or bots or spiders (please leave the other spider in spiderman’s world). Some web spiders are algorithmically designed to reach the maximum depth of a page and crawl them iteratively (did we ever say scrape?).
Web is an open world and the quintessential practising platform of our right to freedom. Thus a lot of content gets created and then duplicated. For instance, the same blog might be posted on different pages and our spiders don’t understand that. Hence, data de-duplication (affectionately dedup) is an integral part of data crawling. This is done to achieve two things- keep our clients happy by not flooding their machines with the same data more than once, and saving our own servers some space. However, dedup is not necessarily a part of data scraping.
One of the most challenging things in the web crawling space is to deal with coordination of successive crawls. Our spiders have to be polite with the servers that they hit so that they don’t piss them off and this creates an interesting situation to handle. Over a period of time, our intelligent spiders have to get more intelligent (and not crazy!) and learn to know when and how much to hit a server in order to crawl data on its web pages while complying with its politeness policies.
Finally, different crawl agents are used to crawl different websites and hence you need to ensure they don’t conflict with each other in the process. This situation never arises when you intend to just scrape data.

On a concluding note, scraping represents a very superficial node of crawling which we call extraction and that again requires few algorithms and some automation in place.

P.S. This post does not intend to offend anyone who uses the terms ‘scraping’ and ‘crawling’ interchangeably, but purely wishes to create awareness for those interested in the Big Data domain. And sorry! We couldn’t help being biased towards the word “crawl” because that’s what feeds us :).

Source:http://promptcloud.com/blog/data-scraping-vs-data-crawling/

Monday, 15 September 2014

How to login to website and extract data using PHP

PDF Scraping: Making Modern File Formats More Accessible

Scraping dynamic content from query

Data Scraping vs. Data Crawling