Sophisticated Web Scraping with Bright Data

    Craig Buckler

    It’s easy to make requests for structured data served by REST or GraphQL APIs. Scraping arbitrary data from any web page is more of a chore, but it opens up further opportunities. Bright Data provides services that make scraping easier, more reliable, and more practical.

    We created this article in partnership with Bright Data. Thank you for supporting the partners who make SitePoint possible.

    Scraping data is a web developer superpower that puts you above the capabilities of ordinary web users. Do you want to find the cheapest flight, the most discounted hotel room, or the last remaining next-generation games console? Mortal users must search manually at regular intervals, and they need a heavy dose of luck to bag a bargain. Web scraping lets you automate the process: a bot can scrape data every few seconds, alert you when thresholds are exceeded, and even auto-buy a product in your name.

    For a quick example, the following Bash command uses curl to fetch the HTML content returned by the SitePoint blog index page, then pipes the result through grep to return links to the most recent articles:

    curl -s 'https://www.sitepoint.com/blog/' | \
      grep -o '<article[^>]*>\s*<a href="[^"]*"'
    

    A program could run a similar process every day, compare with previous results, and alert you when SitePoint publishes a new article.
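
    As a rough sketch of that automation (the file names and alert step are illustrative placeholders; you could schedule the script with cron):

    #!/usr/bin/env bash
    # Fetch today's article links and compare them with the previous run.
    curl -s 'https://www.sitepoint.com/blog/' | \
      grep -o '<article[^>]*>\s*<a href="[^"]*"' > links.today

    # Alert when the list changes; swap echo for a real notification (email, webhook).
    if ! diff -q links.today links.yesterday > /dev/null 2>&1; then
      echo "SitePoint has published new articles"
    fi

    mv links.today links.yesterday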

    Before you jump in and attempt to scrape content from all your favorite sites, try using curl with a Google search or Amazon link. The chances are you’ll receive an HTTP 503 Service Unavailable status with a short HTML error response (you can test this with the command shown after the list below). Sites often place barriers to prevent scraping, such as:

    • checking the user agent, cookies, and other HTTP headers to ensure a request originates from a user’s browser and not a bot
    • generating content using JavaScript-powered Ajax requests so the HTML has little information
    • requiring the user to interact with the page before displaying content — such as scrolling down
    • requiring a user to log in before showing content — such as most social media sites
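
    To see these barriers in action, check the status code that a Google search returns to a bare curl request. The exact response varies with Google’s bot detection, so treat this as a quick probe rather than a guaranteed outcome:

    # Print only the HTTP status code for a Google search request.
    # Expect a redirect or error status, or a consent/CAPTCHA page, rather than results.
    curl -s -o /dev/null -w '%{http_code}\n' 'https://www.google.com/search?q=sitepoint'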

    You can fix most issues using a headless browser — a real browser installation that you control using a driver to emulate user interactions such as opening a tab, loading a page, scrolling down, clicking a button, and so on.
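
    Drivers such as Puppeteer and Playwright give you scripted control over a real browser. Even without a driver, headless Chrome can dump a page’s DOM after its JavaScript has run, which already defeats the “empty HTML shell” barrier. A minimal sketch (the binary may be named google-chrome, chromium, or chrome depending on your platform):

    # Render the page in headless Chrome and save the post-JavaScript DOM.
    google-chrome --headless --dump-dom 'https://www.sitepoint.com/blog/' > rendered.html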

    Your code will become more complex, but that’s not the end of your problems. Some sites:

    • are only available on certain connections, such as a mobile network
    • limit content to specific countries by checking the requester’s IP address (for example, bbc.co.uk is available to UK visitors, but visitors from other countries are redirected to bbc.com, which carries advertising and has less content)
    • block repeated requests from the same IP address
    • use CAPTCHAs or similar techniques to identify bots
    • use services such as Cloudflare, which can prevent bots detected on one site from infiltrating another

    You’ll now need proxy servers for appropriate countries and networks, ideally with a pool of IP addresses to evade detection. We’re a long way from the simplicity of curl combined with a regular expression or two.
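
    Most HTTP tooling supports proxies directly. With curl it’s the -x (or --proxy) option; the host, port, and credentials below are hypothetical placeholders:

    # Route the request through an authenticated proxy (placeholder values).
    curl -x 'http://username:password@proxy.example.com:8080' 'https://www.sitepoint.com/blog/'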

    Fortunately, Bright Data provides a solution for these technical issues, and it promises to “convert websites into structured data”. Bright Data offers reliable scraping options over robust network connections, which you can configure in minutes.

    No-code Bright Data Datasets

    Bright Data datasets are the easiest way to get started if you require data from:

    • ecommerce platforms such as Walmart and various Amazon sites (.com, .de, .es, .fr, .it, .in, or .co.uk)
    • social media platforms including Instagram, LinkedIn, Twitter, and TikTok
    • business sites including LinkedIn, Crunchbase, Stack Overflow, Indeed, and Glassdoor
    • directories such as Google Maps Business
    • other sites such as IMDb

    Typical uses of a dataset are:

    • monitoring competitor pricing
    • tracking your best-selling products
    • identifying investment opportunities
    • gathering competitive intelligence
    • analyzing customer feedback
    • protecting your brands

    In most cases, you’ll want to import the data into databases or spreadsheets to perform your own analysis.

    Datasets are priced according to complexity, the analysis required, and the number of records. A site such as Amazon.com lists millions of products, so grabbing every record is expensive. However, you’re unlikely to need everything: you can filter a dataset using a custom subset to return only the records of interest. The following example searches for SitePoint book titles using the string Novice to Ninja. It returns far fewer records, so it’s available for a few pennies.

    dataset custom subset

    You can receive the resulting data via email, webhook, Amazon S3, Google Cloud Storage, Microsoft Azure Storage, or SFTP, either on a one-off or scheduled basis.

    Custom Datasets and the Web Scraper IDE

    You can scrape custom data from any website using a collector — a JavaScript program which controls a web browser on Bright Data’s network.

    The demonstration below illustrates how to search Twitter for the #sitepoint hashtag and return a list of tweets and metadata in JSON format. This collector will be started using an API call, so you first need to head to your account settings and create a new API token.

    create API token

    Bright Data will send you an email with a confirmation number. Enter it into the panel and you’ll see your token (a 36-character hex GUID). Copy it and ensure you’ve stored it safely: you won’t see it again and will need to generate a new token if you lose it.
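
    A simple way to keep the token out of your code is to export it as an environment variable that scripts can reference. The value below is a placeholder in the same 36-character format:

    # Add to your shell profile (such as ~/.bashrc); the value is a placeholder.
    export BRIGHTDATA_API_TOKEN="12345678-9abc-def0-1234-56789abcdef0"

    The curl examples later in this article could then use -H "Authorization: Bearer $BRIGHTDATA_API_TOKEN" instead of a literal token.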

    Head to the Collectors panel in the Data collection platform menu and choose a template. We’re using Twitter in this example but you can select any you require or create a custom collector from scratch:

    Bright Data collector

    This leads to the Web Scraper IDE where you can view and edit the collector JavaScript code. Bright Data provides API commands such as:

    • country(code) to use a device in a specific country
    • emulate_device(device) to emulate a specific phone or tablet
    • navigate(url) to open a URL in the headless browser
    • wait_network_idle() to wait for outstanding requests to finish
    • wait_page_idle() to wait until no further DOM requests are being made
    • click(selector) to click a specific element
    • type(selector, text) to enter text into an input field
    • scroll_to(selector) to scroll to an element so it’s visible
    • solve_captcha() to solve any CAPTCHAs displayed
    • parse() to parse the page data
    • collect() to add data to the dataset

    A help panel is available, although the code will be familiar if you’ve programmed a headless browser or written integration tests.

    In this case, the Twitter template code needs no further editing.

    Bright Data code editor

    Scroll to the bottom and click the Input panel to delete the example hashtags and define your own (such as #SitePoint). Now click the Preview button to watch the code execute in a browser. It will take a minute or two to fully load Twitter and scroll down the page to render a selection of results.

    Bright Data collector preview

    The Output panel displays the captured and formatted results once execution is complete. You can download the data as well as examine the run log, browser console, network requests, and errors.

    Bright Data output

    Return to the Collectors panel using the menu or the back arrow at the top. Your new collector is shown.

    Bright Data integration

    Click the Integrate to your system button and choose these options:

    • the Realtime (single request) collection frequency
    • JSON as the format
    • API download as the delivery

    Bright Data JSON API settings

    Click Update to save the integration settings and return to the Collectors panel.

    Now, click the three-dot menu next to the collector and choose Initiate by API.

    Bright Data API initiation

    The Initiate by API panel shows two curl request commands.

    Bright Data curl example

    The first command executes the Twitter hashtag collector. It requires the API token you created above. Add it at the end of the Authorization: Bearer header. For example:

    curl \
      -H "Authorization: Bearer 12345678-9abc-def0-1234-56789abcdef0" \
      -H "Content-Type: application/json" \
      -d '{"Hashtag - #":"#SitePoint"}' \
      "https://api.brightdata.com/dca/trigger_immediate?collector=abc123"
    

    It returns a JSON response with a job response_id:

    {
      "response_id": "c3910b166f387775934ceb4e8lbh6cc",
      "how_to_use": "https://brightdata.com/api/data-collector#real_time_collection"
    }
    

    You must pass the job response_id in the URL of the second curl command, along with your API token in the Authorization header:

    curl \
      -H "Authorization: Bearer 12345678-9abc-def0-1234-56789abcdef0" \
      "https://api.brightdata.com/dca/get_result?response_id=c3910b166f387775934ceb4e8lbh6cc"
    

    The API returns a pending message while the collector is executing:

    {
      "pending": true,
      "message": "Request is pending"
    }
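
    Rather than rerunning the second command by hand, a short shell script can trigger the collector and poll until the result is ready. This is a sketch using jq plus the placeholder token, collector ID, and input shown above:

    # Trigger the collector and capture the job's response_id (placeholder values).
    RESPONSE_ID=$(curl -s \
      -H "Authorization: Bearer 12345678-9abc-def0-1234-56789abcdef0" \
      -H "Content-Type: application/json" \
      -d '{"Hashtag - #":"#SitePoint"}' \
      "https://api.brightdata.com/dca/trigger_immediate?collector=abc123" \
      | jq -r '.response_id')

    # Poll every ten seconds until the response is no longer a pending object.
    while true; do
      RESULT=$(curl -s \
        -H "Authorization: Bearer 12345678-9abc-def0-1234-56789abcdef0" \
        "https://api.brightdata.com/dca/get_result?response_id=$RESPONSE_ID")
      echo "$RESULT" | jq -e 'type == "object" and .pending == true' > /dev/null || break
      sleep 10
    done

    echo "$RESULT" | jq .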
    

    It will eventually return a JSON result containing tweet data when the collector has finished executing. You can import this information into your own systems as necessary:

    [
      {
        "post": "https://twitter.com/UserOne/status/111111111111",
        "date": "2022-10-17T19:09:00.000Z",
        "Author": "UserOne",
        "post body": "Tweet one content",
        "likes": 0,
        "comments": 0,
        "Shares": 0,
        "input": {
          "Hashtag - #": "#SitePoint"
        }
      },
      {
        "post": "https://twitter.com/UserTwo/status/2222222222222",
        "date": "2022-10-08T13:28:16.000Z",
        "Author": "UserTwo",
        "post body": "Tweet two content",
        "likes": 0,
        "comments": 0,
        "Shares": 0,
        "input": {
          "Hashtag - #": "#SitePoint"
        }
      },...
    ]
    

    The result is also available from the Bright Data panels.

    Bright Data result

    Bright Data Proxies

    You can leverage Bright Data’s proxy network if your requirements go further than scraping websites. Example use cases:

    • you have an Android app you want to test on a mobile network in India
    • you have a server app which needs to download data as if it’s a user in one or more countries outside the server’s real location

    A range of proxies are available, including these:

    • Residential proxies: a rotating set of IPs on real devices installed in residential homes
    • ISP proxies: static and rotating residential IPs hosted in high-speed data centers
    • Datacenter proxies: static and rotating datacenter IPs
    • Mobile proxies: rotating IPs on real mobile 3G, 4G, and 5G devices
    • Web Unlocker proxy: an automated unlocking system using the residential network, which includes CAPTCHA solving
    • SERP API proxy: an option for collecting data from search engine results

    Each offers options such as auto-retry, request limiting, IP rotation, IP blocking, bandwidth reduction, logging, success metrics, and proxy bypassing. Prices range from $0.60 to $40 per GB depending on the network.

    The easiest way to get started is to use the browser extension for Chrome or Firefox. You can configure the extension to use any specific proxy network, so it’s ideal for testing websites in specific locations.

    Bright Data proxy extension

    For more advanced use, you’ll need the Proxy Manager. This is a proxy installed on your own device that acts as an intermediary between your application and the Bright Data network. It uses command-line options to dynamically control the configuration before it authenticates you and connects to a real proxy.

    Versions are available for Linux, macOS, Windows, Docker, and as a Node.js npm package. The source code is available on GitHub. Example scripts on the Bright Data site illustrate how you can use the proxy in shell scripts (curl), Node.js, Java, C#, Visual Basic, PHP, Python, Ruby, Perl, and other languages.
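
    Once the Proxy Manager is running, applications simply point at the local port it exposes instead of configuring Bright Data credentials themselves. Assuming a default install listening on port 24000 (check your own configuration):

    # Send a request through a locally running Proxy Manager instance.
    curl -x 'http://127.0.0.1:24000' 'https://www.sitepoint.com/blog/'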

    Proxy use can become complicated, so Bright Data suggests you contact your account manager to discuss your requirements.

    Conclusion

    Scraping data has become increasingly difficult over the years as websites attempt to thwart bots, crackers, and content thieves. The added complication of location, device, and network-specific content makes the task more challenging.

    Bright Data offers a cost-effective route through these scraping challenges. You can obtain useful data immediately and adopt other services as your requirements evolve. The Bright Data network is reliable, flexible, and efficient, so you only pay for data you successfully extract.