How AI & proxies drive web scraping

As public online data acquisition becomes increasingly important to decision-making, AI, web scraping and proxies will continue to find their way into business activities. While the inclusion of AI in web scraping is still fairly new, some data acquisition companies are already harnessing the power of machine learning.

In fact, proxies themselves are already being used in fast-growing industries like ecommerce and cybersecurity in one way or another, says Tomas Montvilas, the chief commercial officer (CCO) at Oxylabs, a proxy service provider:

“In short, a proxy acts as an intermediary that accepts connection requests from its user and forwards them to a destination server. That means that servers – in most cases, plain old websites – see the proxy as the original source of the request. In web scraping, proxies are mostly used for data request distribution and anonymity.

“There is no way to overstate the importance of proxies for certain business models. Some profit models rely on external data gathering (e.g. Semrush, who do SEO monitoring). These companies essentially sell data analysis software or the data itself.

“However, tried-and-true industries such as retail and financial services are beginning to incorporate public data gathering into their processes. Public data allows these businesses to gain a competitive advantage and drive additional growth.

“Proxies are a necessity for any business that wants to acquire high-quality public data. There are numerous ways they make the entire gathering process more reliable. Certain data is displayed differently based on the perceived location or device of the visitor (e.g. the price of an iPhone in the UK vs the price in Singapore). Proxies allow businesses to gather accurate information by harnessing the power of different IP addresses.”
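
To make the mechanics concrete, here is a minimal Python sketch of routing the same request through proxies in different countries, so the destination site serves its localised content. The proxy endpoints and product URL are placeholders invented for illustration, not real Oxylabs services.

```python
# Minimal sketch: route the same request through country-specific proxies so the
# target site serves its localised content. Proxy endpoints and the product URL
# are illustrative placeholders, not real services.
import requests

PROXIES_BY_COUNTRY = {
    "uk": "http://user:pass@uk.proxy.example.com:8080",
    "sg": "http://user:pass@sg.proxy.example.com:8080",
}
PRODUCT_URL = "https://shop.example.com/iphone"  # hypothetical product page

def fetch_via(country: str) -> str:
    """Fetch the page through a proxy in the given country; the destination
    server sees the proxy's IP, so location-based content reflects that country."""
    proxy = PROXIES_BY_COUNTRY[country]
    response = requests.get(
        PRODUCT_URL,
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    for country in PROXIES_BY_COUNTRY:
        html = fetch_via(country)
        print(country, len(html), "bytes")  # compare the localised responses
```

Comparing the responses side by side would surface exactly the kind of location-dependent pricing Montvilas describes.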

Building blocks of web scraping

Public data gathering, on the face of it, is a rather simple process. An application goes through a list of URLs, downloads the data stored there, and eventually provides an output of everything that has been downloaded. 
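
In its simplest form, and leaving proxies aside for a moment, that loop can be sketched in a few lines of Python; the URLs and output file are placeholders for illustration.

```python
# Bare-bones gathering loop: iterate over a list of URLs, download each page,
# and output everything that was collected. URLs and the file name are placeholders.
import json
import requests

URLS = [
    "https://example.com/page-1",
    "https://example.com/page-2",
]

def gather(urls):
    results = []
    for url in urls:
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            results.append({"url": url, "body": response.text})
        except requests.RequestException as exc:
            # In a production pipeline, proxies, retries and rotation keep
            # access consistent; here the failure is simply recorded.
            results.append({"url": url, "error": str(exc)})
    return results

if __name__ == "__main__":
    with open("output.json", "w", encoding="utf-8") as fh:
        json.dump(gather(URLS), fh, indent=2)
```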

Montvilas continues: “However, public data gathering processes need consistent access to accurate data. Different types of proxies help applications handle most aspects related to data access and accuracy. If they are looking for a simple solution, businesses generally choose between residential and data centre proxies, depending on the data source.

“AI and machine learning-based solutions are still quite rare in the web scraping industry. Currently, machine learning is mostly being used to automate certain tricky processes where trial and error would otherwise be used. For example, with our Next-Gen Residential Proxy solution we have created AI-based models that greatly increase data acquisition success rates for our clients.”

There are many different proxy types used in web scraping. We asked Montvilas to briefly describe the primary types and their main use cases.

Residential proxies

“Residential proxies are the IP addresses of computers, phones, or other devices granted by ISPs to regular customers. These devices become proxies whenever their users install the related software and consent to its terms of service.

“We have sourced our 100 million+ residential proxy pool mostly by using a Tier A+ acquisition model. Put simply, it is the process of gaining IPs from consenting, aware users of a dedicated application and providing a monetary reward to them for any traffic use.”

Residential proxies are widely used by businesses that need rotating IP addresses and city-level targeting. “Some of our residential proxy users are ad verification businesses. Fighting against ad fraud means checking various websites from different locations and devices to determine whether ads are being displayed faithfully. Our development teams worked hard to provide global coverage and city-level targeting in our residential proxy pool, making it a great fit for ad verification businesses.

“We predict that proxy use for this business model is only going to increase from here onwards. An unfortunate reality is that ad fraud is on the rise: the cost of ad fraud is predicted to grow from $19 billion in 2018 to $44 billion in 2022. Residential proxies simply cannot be replaced by anything else, necessitating greater use over time if the trends continue. There are even businesses whose model is completely reliant on them. For example, Trivago, a renowned accommodation comparison service, needs residential proxies to accurately deliver location-based pricing.”

Next-Gen Residential Proxies

Next-Gen Residential Proxies are a product unique to Oxylabs: an industry innovation that adds AI and machine learning to proxies.

“We developed Next-Gen Residential Proxies as an advanced version of residential proxies for those who are struggling with acquiring public data from complex targets. Our goal with Next-Gen Residential Proxies is to help businesses achieve 100 per cent data delivery success rates, making them perfect for targets with high failure rates such as ecommerce platforms.

Figure 1. Source: Oxylabs

“We know that AI & ML have garnered a lot of hype in the IT sector over recent years. However, hype means nothing if there are no results to show for it. Therefore, in order to ensure the success and effectiveness of our AI & ML innovation, we created an advisory board that guides us through our development processes. Our advisory board is composed of people who are actively involved in PhD-level research on AI or work with companies that are machine learning industry leaders.

“Next-Gen Residential Proxies are proof that AI and machine learning do have their place in public web scraping. Currently, our solution has two primary features that employ AI: dynamic fingerprinting and adaptive parsing. The former is an automated process that picks the best way to send an HTTP request to maximise success rates; the latter is the process of automatically structuring data found in ecommerce product pages and returning a structured result.”
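
To give a rough intuition for what dynamic fingerprinting automates, the sketch below hand-rolls the trial-and-error it replaces: cycle through candidate request profiles and keep whichever one the target accepts. The header profiles are invented for illustration and bear no relation to Oxylabs' actual ML-driven models.

```python
# Hand-rolled stand-in for the idea behind dynamic fingerprinting: try different
# request "fingerprints" (header profiles) and keep the first one the target
# accepts. The profiles below are invented for illustration only.
import requests

FINGERPRINTS = [
    {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
     "Accept-Language": "en-GB,en;q=0.9"},
    {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
     "Accept-Language": "en-US,en;q=0.8"},
]

def fetch_with_best_fingerprint(url: str) -> requests.Response:
    last_error: Exception = RuntimeError("no fingerprints configured")
    for headers in FINGERPRINTS:
        try:
            response = requests.get(url, headers=headers, timeout=10)
            if response.status_code == 200:
                return response  # this profile worked; an ML model would learn from it
            last_error = RuntimeError(f"blocked with status {response.status_code}")
        except requests.RequestException as exc:
            last_error = exc
    raise last_error
```

Adaptive parsing addresses the other half of the problem: rather than raw HTML, the response comes back already structured into fields (for ecommerce product pages, attributes such as title and price).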

Data centre proxies

Unlike residential IPs, data centre proxies are generally created by businesses that have access to reliable server infrastructure. Dozens of data centre proxies can be created from a single machine, making them a lot cheaper than their residential counterparts. Additionally, data centres have more reliable and faster internet connections than any device a regular consumer might have.

“Data centre proxies are the backbone of businesses that need to go through vast arrays of information on a daily basis. Data centre proxies are most commonly utilised in areas where access to data is not geographically restricted and traffic by IP is not as actively tracked. For example, brand protection companies comprise a large portion of our data centre proxy users.

“Performing daily brand protection activities (e.g. scanning the internet for counterfeit products) usually involves web scraping lots of data-heavy websites such as ecommerce platforms. Thus, using data centre proxies with the highest possible speeds and uptime is key to optimal business performance.”
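
As a sketch of what that kind of high-volume scanning can look like, the snippet below fans requests out across a small pool of data centre proxies with a thread pool. The proxy addresses and listing URLs are placeholders invented for illustration.

```python
# Sketch of high-throughput scanning through a pool of data centre proxies,
# e.g. checking many marketplace listings. Proxy addresses and listing URLs
# are illustrative placeholders.
from concurrent.futures import ThreadPoolExecutor
import requests

DATACENTRE_PROXIES = [
    "http://user:pass@dc1.proxy.example.com:60000",
    "http://user:pass@dc2.proxy.example.com:60000",
]
LISTING_URLS = [f"https://marketplace.example.com/item/{i}" for i in range(100)]

def fetch(task):
    index, url = task
    proxy = DATACENTRE_PROXIES[index % len(DATACENTRE_PROXIES)]  # simple round robin
    response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
    return url, response.status_code

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=20) as pool:
        for url, status in pool.map(fetch, enumerate(LISTING_URLS)):
            print(status, url)
```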

Real-Time Crawler

Real-Time Crawler exists as an out-of-the-box solution for public data acquisition. Instead of requiring businesses to develop a web data acquisition tool in-house and manage proxies themselves, Real-Time Crawler does everything except the data analysis.

“While Real-Time Crawler is not a proxy, it utilises proxies to allow its users to perform their requests. Of course, it benefits from all the advancements we have made with AI and machine learning. For example, Real-Time Crawler takes advantage of AI-powered dynamic fingerprinting, just like Next-Gen Residential Proxies.

“As a solution, Real-Time Crawler can be considered a data API. Users send highly customisable HTTP requests to scrape data according to their needs. These requests can contain many different parameters, such as proxy location, device and result language.”
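
By way of illustration, a request to a scraping data API of this kind might look like the Python snippet below. The endpoint, parameter names and credentials are assumptions made for the example, not Oxylabs' documented interface.

```python
# Hypothetical call to a scraping "data API": one HTTP request carrying the
# target and scraping parameters, with a structured result in the response.
# The endpoint and parameter names are assumptions for illustration only.
import requests

payload = {
    "url": "https://shop.example.com/search?q=headphones",
    "geo_location": "United Kingdom",  # perceived visitor location
    "device": "desktop",               # perceived device type
    "locale": "en-gb",                 # result language
    "parse": True,                     # ask for structured output
}

response = requests.post(
    "https://scraper-api.example.com/v1/queries",  # placeholder endpoint
    json=payload,
    auth=("API_USERNAME", "API_PASSWORD"),
    timeout=60,
)
response.raise_for_status()
print(response.json())  # structured data, ready for analysis
```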

All types of businesses use Real-Time Crawler as their primary source of external web data, including any business that needs to monitor search engines, ecommerce platforms, or other websites.

“In ecommerce, data acquired from Real-Time Crawler is often used for price tracking and analysis, modelling market trends, and doing platform-specific keyword research. Real-Time Crawler is tailored for businesses that want to quickly kickstart their public external data gathering without the hassle of managing and maintaining gathering tools.

“Use cases with search engines vary, but most are heavily related to SEO. Predictions about optimisation can often be made only by reverse engineering ranking algorithms from data, making Real-Time Crawler a candidate for some SaaS businesses in the SEO industry.”

Rising tides in the proxy industry

Proxies are here to stay. With the Covid pandemic accelerating the movement from retail to ecommerce for nearly all businesses, daily proxy traffic is projected only to rise from here onwards.

“Our internal data reveals a meteoric rise in proxy traffic use in Q4 of 2020 alone. During Q4, traffic use increased to previously unseen heights. For example, on Black Friday residential proxy traffic shot up by 301 per cent, while data centre proxy traffic rose by 97 per cent compared to the same period in 2019. Additionally, surges in traffic began a week in advance of Black Friday in 2020, compared to a day in advance in 2019. Therefore, as we can clearly see, more and more companies are getting involved in public data gathering in order to stay relevant and attain profitable insights.

“Enquiries regarding various ecommerce and scraping use cases, including from some well-known names in the industry, rose exponentially over the past year. While Real-Time Crawler hasn’t struggled to meet demand, it has been stress tested numerous times by the rising need for data.”

Web scraping and proxy use are expected to continue rising as businesses look to unlock the insights provided by online public data. As AI and machine learning become increasingly popular, the effectiveness of external data acquisition is only going to increase. Businesses that want to keep raising profits will need to implement public data gathering and analysis in one way or another.

Tomas Montvilas

Tomas Montvilas is the chief commercial officer at Oxylabs, a leading big data infrastructure and proxy solutions provider. He is an expert in organisational growth with over seven years of experience in leadership roles in the areas of sales, marketing, product development and digital transformation.
