One of our favorite quotes has been: ‘If a problem changes by an order
of magnitude, it becomes a totally different problem.’ In this lies the
answer to the question: what’s the difference between scraping and
crawling?
Crawling usually refers to dealing with large data sets, where you
develop your own crawlers (or bots) that crawl to the deepest of the web
pages. Data scraping, on the other hand, refers to retrieving
information from any source (not necessarily the web). More often than
not, irrespective of the approach involved, we refer to any extraction
of data from the web as scraping (or harvesting), and that’s a serious
misconception.
Below are some differences, in our opinion, both evident and subtle:
- Scraping data does not necessarily involve the web. Data scraping
could refer to extracting information from a local machine or a
database. Even when the data does come from the internet, a mere “Save
as” link on the page is also a subset of the data scraping universe.
Crawling, on the other hand, differs immensely in scale as well as in
range. For one, crawling means web crawling: on the web, we can only
“crawl” data. Programs that perform this incredible job are called
crawl agents, bots, or spiders (please leave the other spider in
Spider-Man’s world). Some web spiders are algorithmically designed to
reach the maximum depth of a site and crawl its pages iteratively (did
we ever say scrape?).
- The web is an open world and the quintessential place to exercise our
right to freedom. Thus a lot of content gets created and then
duplicated. For instance, the same blog post might appear on different
pages, and our spiders don’t understand that. Hence, data
de-duplication (affectionately, dedup) is an integral part of data
crawling. This is done to achieve two things: keep our clients happy by
not flooding their machines with the same data more than once, and save
our own servers some space. However, dedup is not necessarily a part of
data scraping.
- One of the most challenging things in the web crawling space is
coordinating successive crawls. Our spiders have to be polite with the
servers they hit so that they don’t annoy them, and this creates an
interesting situation to handle. Over time, our intelligent spiders
have to get more intelligent (and not crazy!) and learn when and how
often to hit a server in order to crawl the data on its web pages while
complying with its politeness policies.
- Finally, different crawl agents are used to crawl different websites,
and hence you need to ensure they don’t conflict with each other in the
process. This situation never arises when you intend to just scrape
data.
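To make the dedup and politeness points above a little more concrete,
here is a minimal Python sketch. It is only an illustration under our
own assumptions, not how any production spider works: the names
content_fingerprint, Deduplicator and PolitenessScheduler are
hypothetical, dedup is approximated by hashing whitespace-normalized
text (real crawlers often need fuzzier matching), and politeness is
reduced to a fixed minimum delay per host, whereas actual servers may
publish their own policies.

```python
import hashlib
from urllib.parse import urlparse


def content_fingerprint(text):
    """Hash lowercased, whitespace-normalized text so trivial copies match."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class Deduplicator:
    """Remembers fingerprints of content that has already been seen."""

    def __init__(self):
        self._seen = set()

    def is_new(self, text):
        """Return True the first time a piece of content shows up."""
        fingerprint = content_fingerprint(text)
        if fingerprint in self._seen:
            return False
        self._seen.add(fingerprint)
        return True


class PolitenessScheduler:
    """Tracks, per host, the earliest time the next request is allowed."""

    def __init__(self, min_delay_seconds=2.0):
        # Assumed fixed delay; a real policy would vary per server.
        self.min_delay = min_delay_seconds
        self._next_allowed = {}

    def wait_time(self, url, now):
        """Seconds to wait before fetching url; 0.0 means go ahead."""
        host = urlparse(url).netloc
        return max(0.0, self._next_allowed.get(host, 0.0) - now)

    def record_fetch(self, url, now):
        """Note that url's host was hit at time `now`."""
        host = urlparse(url).netloc
        self._next_allowed[host] = now + self.min_delay
```

A crawl loop would call wait_time() before each request, sleep for that
long, fetch the page, call record_fetch(), and hand the page on only if
is_new() says the content has not been seen before. A one-off scrape
needs none of this bookkeeping.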
On a concluding note, scraping represents a very superficial node of
crawling, which we call extraction, and that again requires a few
algorithms and some automation in place.
P.S. This post does not intend to offend anyone who uses the terms
‘scraping’ and ‘crawling’ interchangeably, but purely wishes to raise
awareness among those interested in the Big Data domain. And sorry! We
couldn’t help being biased towards the word “crawl” because that’s what
feeds us :).
Source: http://promptcloud.com/blog/data-scraping-vs-data-crawling/