The Ultimate Tutorial On How To Do Web Scraping – hackernoon.com


Aurken Bilbao (@aurkenb)

Founder @ ZenRows.com. Entrepreneur with a deep technical background and 15+ years in startups, security & banking.

Web Scraping is the process of automatically collecting web data with specialized software.

Every day, trillions of gigabytes of data are created, making it impossible to keep track of every new data point. At the same time, more and more companies worldwide rely on external data sources to nurture their knowledge and gain a competitive advantage. Keeping up with that pace manually is not possible.

That’s where Web Scraping comes into play.

What is Web Scraping Used For?

As communication between systems becomes critical, APIs are growing in popularity. An API is a gateway a website exposes so that other systems can communicate with it, opening up functionality to the public. Unfortunately, many services don't provide an API, and others only allow limited functionality.

Web Scraping overcomes this limitation. It collects information from all over the internet without the restrictions of an API.

Web scraping is therefore used in a wide range of scenarios:

Price Monitoring

  • E-commerce: tracking competitors' prices and availability.
  • Stocks and financial services: detecting price changes, volume activity, anomalies, etc.

Lead Generation

  • Extract contact information: names, email addresses, phone numbers, or job titles.
  • Identify new opportunities, e.g., on Yelp, YellowPages, Crunchbase, etc.

Market Research

  • Real Estate: supply/demand analysis, market opportunities, trending areas, price variation.
  • Automotive/Cars: dealer distribution, most popular models, best deals, supply by city.
  • Travel and Accommodation: available rooms, hottest areas, best discounts, prices by season.
  • Job Postings: most in-demand jobs, industries on the rise, biggest employers, supply by sector, etc.
  • Social Media: tracking brand presence and rising influencers. New acquisition channels, audience targeting, etc.
  • City Discovery: track new restaurants, commercial streets, shops, trending areas, etc.

Aggregation

  • News from many sources.
  • Price comparison across providers, e.g., insurance, travel, or legal services.
  • Banking: bringing all account information into one place.

Inventory and Product Tracking

  • Collect product details and specs.
  • New products.

SEO (Search Engine Optimization): keyword relevance and performance, competition tracking, brand relevance, ranking of new players.

ML/AI – Data Science: Collect massive amounts of data to train machine learning models; image recognition, predictive modeling, NLP.

Bulk downloads: extracting PDFs or images at scale.

Web Scraping Process

Web Scraping mostly works like standard HTTP client-server communication.

The browser (client) connects to a website (server) and requests the content. The server then returns HTML content, a markup language both sides understand. The browser is responsible for rendering HTML to a graphical interface.
That’s it. Easy, isn’t it?

There are more content types, but let's focus on this one for now. Let's dig deeper into how the underlying communication works; it'll come in handy later on.

Request – made by the browser

A request is a piece of text the browser sends to the website. It consists of four elements:

  • URL: the specific address on the website.
  • Method: there are two main types: GET to retrieve data and POST to submit data (usually forms).
  • Headers: User-Agent, Cookies, browser language, all go here. They are one of the most important and trickiest parts of the communication. Websites rely heavily on this data to determine whether a request comes from a human or a bot.
  • Body: commonly user-generated input. Used when submitting forms.
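
As a rough illustration, here is a minimal sketch using Python's requests library; the URL, headers, and form fields are placeholder values.

```python
import requests

# Placeholder URL used only for illustration.
url = "https://example.com/products"

# Headers the server will inspect: User-Agent, language, cookies, etc.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

# GET: retrieve data from a specific address on the website.
response = requests.get(url, headers=headers)

# POST: submit data, usually a form, sent in the request body.
form_data = {"search": "laptops", "page": "1"}
response = requests.post("https://example.com/search", headers=headers, data=form_data)
```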

Response – returned by the server

When a website responds to a browser, it returns three items.

  • HTTP Code: a number indicating the status of the request. 200 means everything went OK, the infamous 404 means the URL was not found, and 500 is an internal server error. You can learn more about HTTP status codes elsewhere.
  • The content: HTML, responsible for rendering the website. Auxiliary content types include CSS styles (appearance), images, XML, JSON, or PDF; they improve the user experience.
  • Headers: just like request headers, these play a crucial role in the communication. Among other things, they instruct the browser via "Set-Cookie" to store cookies. We will get back to that later.
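
Continuing the same requests-based sketch, the response object exposes all three items (the URL is still a placeholder):

```python
import requests

response = requests.get("https://example.com/products")

# HTTP code: 200 (OK), 404 (not found), 500 (server error), etc.
print(response.status_code)

# Response headers, including any "Set-Cookie" values the server returns.
print(response.headers.get("Content-Type"))
print(response.headers.get("Set-Cookie"))

# The content itself: usually HTML, sometimes JSON, XML, and so on.
print(response.text[:500])  # first 500 characters of the HTML
```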

Up to this point, this reflects an ordinary client-server process. Web Scraping, though, adds a new concept: data extraction.

Data Extraction – Parsing

HTML is just a long text. Once we have the HTML, we want to obtain specific data and structure it to make it usable. Parsing is the process of extracting selected data and organizing it into a well-defined structure.

Technically, HTML is a tree structure. Upper elements (nodes) are parents, and the lower ones are children. Two popular technologies facilitate walking the tree to extract the most relevant pieces:

  • CSS Selectors: broadly used to modify the look of websites. Powerful and easy to use.
  • XPath: more powerful but harder to use; not well suited for beginners.
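
For instance, here is a minimal parsing sketch using BeautifulSoup for CSS selectors and lxml for XPath; the selectors and class names are made up for illustration.

```python
import requests
from bs4 import BeautifulSoup
from lxml import html

raw_html = requests.get("https://example.com/products").text

# CSS selectors with BeautifulSoup: easy to read and write.
soup = BeautifulSoup(raw_html, "html.parser")
titles = [node.get_text(strip=True) for node in soup.select("h2.product-title")]

# XPath with lxml: more powerful, with a steeper learning curve.
tree = html.fromstring(raw_html)
prices = tree.xpath('//span[@class="price"]/text()')

print(titles, prices)
```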

The extraction process begins by analyzing a website. Some elements are valuable at first sight. For example, Title, Price, or Description are all easily visible on the screen. Other information, though, is only visible in the HTML code:

  • Hidden inputs: these commonly contain information such as internal IDs that are quite valuable (see the sketch after this list).
Image: Hidden inputs on Amazon products

  • XHR: websites execute requests in the background to enhance the user experience. These often return rich content already structured in JSON format.
Image: Asynchronous request on Instagram

  • JSON inside HTML: JSON is a commonly used data-interchange format. It often appears within the HTML code to serve other services, like analytics or marketing.
Image: JSON within HTML on Alibaba

  • HTML attributes: they add semantic meaning to other HTML elements.
Image: HTML attributes on Craigslist
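
Below is a sketch of how such hidden data can be pulled out of the HTML; the element names (a hidden product-id input, a JSON-LD script tag) are hypothetical examples, not the real markup of the sites shown above.

```python
import json

from bs4 import BeautifulSoup

# A tiny made-up HTML snippet standing in for a real product page.
raw_html = """
<html><body>
  <input type="hidden" name="product-id" value="B0123456">
  <script type="application/ld+json">{"name": "Laptop A", "offers": {"price": "999.00"}}</script>
</body></html>
"""

soup = BeautifulSoup(raw_html, "html.parser")

# Hidden inputs: invisible on screen, but present in the HTML.
hidden = soup.select_one('input[type="hidden"][name="product-id"]')
if hidden:
    print("Internal ID:", hidden["value"])

# JSON inside HTML: structured data embedded in a <script> tag.
script = soup.select_one('script[type="application/ld+json"]')
if script:
    data = json.loads(script.string)
    print(data.get("name"), data.get("offers"))
```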

Once the data is structured, it is stored in a database for later use. At this stage, we can export it to other formats such as Excel or PDF, or transform it to make it available to other systems.
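
A minimal sketch of that export step, assuming the parsed records are plain Python dictionaries and pandas is available:

```python
import pandas as pd

# Structured records produced by the parsing step (made-up example data).
records = [
    {"title": "Laptop A", "price": 999.0},
    {"title": "Laptop B", "price": 749.5},
]

df = pd.DataFrame(records)
df.to_csv("products.csv", index=False)     # plain CSV export
df.to_excel("products.xlsx", index=False)  # Excel export (requires openpyxl)
```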

Web Scraping Challenges

Such a valuable process does not come free of obstacles, though.

Websites actively try to avoid being tracked and scraped, and it's common for them to build protective measures. High-traffic websites put advanced, industry-grade anti-scraping solutions in place. This protection makes the task extremely challenging.

These are some of the challenges web scrapers face when dealing with relevant websites (low-traffic websites are usually low value and thus have weak anti-scraping systems):

IP Rate Limit

All devices connected to the internet have an identifying address, called an IP address. It's like an ID card. Websites use this identifier to count the number of requests coming from a device and try to block it. Imagine an IP requesting 120 pages per minute: that's two requests per second, and real users cannot browse at such a pace. So to scrape at scale, we need to introduce a new concept: proxies.
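
Before reaching for proxies, the simplest mitigation is to pace requests like a human would; here is a sketch with randomized delays (the timings and URLs are made-up example values):

```python
import random
import time

import requests

urls = [f"https://example.com/products?page={page}" for page in range(1, 6)]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    # Wait a few seconds between requests instead of hammering the server.
    time.sleep(random.uniform(2, 5))
```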

Rotating Proxies

A proxy, or proxy server, is a computer on the internet with its own IP address that sits between the requester and the website. It hides the original request IP behind the proxy's IP, tricking the website into thinking the request comes from somewhere else. Proxies are typically used in vast pools, switching between IPs depending on various factors. Skilled scrapers tune this process and select proxies depending on the target domain, geolocation, and so on.
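
A minimal rotation sketch with the requests library; the proxy addresses are placeholders, since real pools usually come from a proxy provider:

```python
import random

import requests

# Placeholder proxy pool; in practice these come from a proxy provider.
proxy_pool = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]

def fetch(url: str) -> requests.Response:
    proxy = random.choice(proxy_pool)  # a different exit IP per request
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

response = fetch("https://example.com/products")
print(response.status_code)
```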

Headers / Cookies validation

Remember Request/Response Headers? A mismatch between the expected and actual values tells the website something is wrong. The more headers are shared between browser and server, the harder it gets for automated software to communicate smoothly without being detected. It gets even more challenging when websites return a "Set-Cookie" header and expect the browser to send that cookie back in the following requests.
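
One common approach is a session that sends consistent, browser-like headers and automatically replays whatever cookies the server sets; a sketch with placeholder values:

```python
import requests

session = requests.Session()

# Consistent, browser-like headers; mismatched values are an easy way to get flagged.
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml",
})

# The first response may include "Set-Cookie"; the session stores it...
session.get("https://example.com/")

# ...and sends it back automatically on the following requests.
response = session.get("https://example.com/products")
print(session.cookies.get_dict())
```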

Ideally, you’d want to make requests with as few headers as possible. Unfortunately, something it’s not possible leading to another challenge:

Reverse Engineering Headers / Cookies generation

Advanced websites don’t respond if Headers and Cookies are not in place, forcing us to reverse-engineering. Reverse engineering is the process of understanding how a process’ built to try to simulate it. It requires tweaking IPs, User-Agent (browser identification), Cookies, etc.

Javascript Execution

Most websites these days rely heavily on JavaScript, a programming language executed in the browser. It adds extra difficulty to data collection, as many tools don't support JavaScript. Websites perform complex calculations in JavaScript to ensure a browser really is a browser, leading us to:

Headless Browsers

A headless browser is a web browser without a graphical user interface, controlled programmatically. It requires a lot of RAM and CPU, making the process much more expensive. Selenium and Puppeteer (created by Google) are two of the most widely used tools for the task. You guessed it: Google is the largest web scraper in the world.
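
A minimal headless sketch with Selenium driving Chrome (assuming Selenium 4+ and a local Chrome install); Puppeteer would be the JavaScript equivalent:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless=new")  # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/products")
    # JavaScript has executed by now, so dynamically rendered content is in the DOM.
    titles = driver.find_elements(By.CSS_SELECTOR, "h2.product-title")  # made-up selector
    print([t.text for t in titles])
finally:
    driver.quit()
```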

Captcha / reCAPTCHA (Developed by Google)

A Captcha is a challenge test used to determine whether or not the user is human. It used to be an effective way to stop bots. Some companies, like Anti-Captcha and 2Captcha, offer solutions to bypass Captchas. They provide OCR (Optical Character Recognition) services and even human labor to solve the puzzles.

Pattern Recognition

When collecting data, you may feel tempted to take the easy way and follow a regular pattern. That's a huge red flag for websites. Completely arbitrary requests are not reliable either. How would someone land directly on page 8? They almost certainly visited page 7 first; otherwise, something looks off. Nailing a realistic path is tricky.
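
In practice, that means simulating a believable navigation path rather than jumping straight to deep pages; a sketch, with made-up URLs and timings:

```python
import random
import time

import requests

session = requests.Session()

# Walk from the landing page through listing pages in order,
# with irregular pauses, instead of requesting page 8 out of nowhere.
session.get("https://example.com/")
for page in range(1, 9):
    session.get(f"https://example.com/products?page={page}")
    time.sleep(random.uniform(1.5, 6.0))  # human-like, non-uniform pacing
```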

Conclusion

Hopefully, this gives you an overview of what data automation looks like. We could talk about it forever, but we will get deeper into the details in the coming posts.

Data collection at scale is full of secrets. Keeping up the pace is arduous and expensive. It’s hard, very hard.

A preferred solution is to use a batteries-included service like ZenRows that turns websites into data. We offer a hassle-free API that takes care of all the work, so you only need to care about the data. We urge you to try it for FREE.

We are delighted to help and can even tailor a custom solution that works for you.

Disclaimer: Aurken Bilbao is Founder of Zenrows.com

Previously published at https://www.zenrows.com/blog/what-is-web-scraping/




