
How to Scrape 10 Online Shops within 30 Minutes Using Scrapy and Apache Airflow

Get the Sourced Data You Need to Kick-Start an App Project

  • You are a developer.
  • You would love to create a wonderful web application.
  • You are completely dedicated to your project.

Even if you tick all these boxes, you still need a domain-specific dataset before you write a single line of code. That is because contemporary applications consume large amounts of data, in real time or in batches, to provide value to their users.

In this blog, we explain the workflow we use to generate such datasets. You will see how we handle automated data scraping of different websites with no manual intervention.

Our objective is to produce a dataset for a price comparison web app. The product category we will use as an example is handbags. For this application, price and product data for handbags need to be collected from various online sellers every day. Although some sellers offer an API to access the required details, not all of them do, so web scraping it is!

In this example, we will create web spiders for 10 sellers using Scrapy and Python. Then we will automate the procedure with Apache Airflow, so no manual involvement is needed to execute the whole workflow periodically.

A Live Demo Web App with Source Code

You can find all of the associated source code in the GitHub repository.

Our Web Scraping Workflow

Before we start any web data scraping project, we need to define which sites the project will cover. We decided to include 10 websites that are among the most visited online stores in Turkey for handbags. You can find them in our GitHub repository.

Step 1: Install Scrapy and Set Up Project Folders

You need to install Scrapy on your computer and create a Scrapy project before building any spiders.

Project Files & Folders

We created a folder structure on the local computer to keep the project files neatly organized in separate folders.

The 'csvFiles' folder contains one CSV file per website to be scraped. The spiders read the 'starting URLs' from these CSV files to initiate scraping, so we do not need to hard-code them in the spiders.

The 'fashionWebScraping' folder holds the Scrapy spiders along with helper scripts such as 'pipelines.py', 'settings.py', and 'items.py'. We need to modify a few of these helper scripts for the scraping procedure to run successfully.

Extracted product images are saved in the 'images_scraped' folder.

During web data scraping, all product data such as price, name, product links, and image links is saved in JSON files inside the 'jsonFiles' folder.

There are also utility scripts that handle specific tasks:

  • 'deldub.py' detects and removes duplicate product items in the JSON files after the extraction ends.
  • 'deleteFiles.py' deletes all JSON files produced in the previous scraping session.
  • 'jsonPrep.py' is another utility script that detects and removes null line items in the JSON files after the extraction ends.
  • 'jsonToes.py' populates a remote Elasticsearch cluster by reading from the JSON files. It provides a full-text, real-time search experience.
  • 'sitemap_gen.py' generates a sitemap covering the product links.

Step 2: Understand Each Site's URL Structure and Prepare the CSV Files with Starting URLs

After creating the project folders, the next step is to populate the CSV files with starting URLs for every website we want to scrape.

Nearly every e-commerce site uses pagination to navigate users through its product lists. Each time you move to the next page, a page parameter in the URL increases. Have a look at the example URL below, where a 'page' parameter is used.

We will use a {} placeholder to iterate over URLs by incrementing the value of 'page'. We will also use a 'gender' column in the CSV file to define the gender category of each URL.

The final CSV file would therefore look like this:
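For illustration only (the real URLs are in the repository; the domain and category paths below are placeholders), such a CSV file could contain rows like these:

    url,gender
    https://www.example-store.com/handbags-women?page={},women
    https://www.example-store.com/handbags-men?page={},men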

The same principle applies to the rest of the sites in the project.

Step 3: Modify 'settings.py' and 'items.py'

To do the web scraping, we need to modify 'items.py' to define the 'item objects' that are used to store the extracted data.

To define common output data formats, Scrapy offers an Item class. Item objects are simple containers used to collect the scraped data. They provide a dictionary-like API with a convenient syntax for declaring their available fields.

(source: scrapy.org)
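As a minimal sketch (the field names below are our assumption, based on the data we collect: gender, product name, price, product link, and image links; the version in the repository may differ), 'items.py' could look like this:

    # fashionWebScraping/items.py -- sketch, field names are assumptions
    import scrapy

    class FashionwebscrapingItem(scrapy.Item):
        gender = scrapy.Field()
        productName = scrapy.Field()
        price = scrapy.Field()
        productLink = scrapy.Field()
        imageLink = scrapy.Field()
        # Used by Scrapy's built-in ImagesPipeline (see 'settings.py' below)
        image_urls = scrapy.Field()
        images = scrapy.Field()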

After that, we need to modify 'settings.py'. This is necessary to customize the image pipeline and the spiders' behavior.

The Scrapy settings allow you to customize the behavior of all Scrapy components, including the core, extensions, pipelines, and spiders themselves.

(source: scrapy.org)

'settings.py' and 'items.py' apply to all spiders in the project.
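A minimal sketch of the relevant 'settings.py' entries, assuming the built-in ImagesPipeline is used to download product images into the 'images_scraped' folder (the exact values in the repository may differ):

    # fashionWebScraping/settings.py -- excerpt, values are illustrative
    BOT_NAME = 'fashionWebScraping'
    SPIDER_MODULES = ['fashionWebScraping.spiders']
    NEWSPIDER_MODULE = 'fashionWebScraping.spiders'

    # Download product images through the built-in images pipeline (needs Pillow)
    ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
    IMAGES_STORE = 'images_scraped'

    # Throttle requests so we stay polite to the target sites
    ROBOTSTXT_OBEY = True
    DOWNLOAD_DELAY = 1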

Step 4: Making Spiders

Spiders are classes that define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e., follow links) and how to extract structured data from the pages (i.e., scrape items). In other words, spiders are the place where you define the custom behavior for crawling and parsing pages for a particular site (or, in some cases, a group of sites).

(source: scrapy.org)

A blank spider file can be generated with Scrapy's 'genspider' command. Now it is time to write the code in the 'fashionBOYNER.py' file:

The spider class has two functions: 'start_requests' and 'parse_product_pages'.

In 'start_requests', we read the starting URL data from the CSV file we produced earlier. We then iterate over the {} placeholder to pass the product-page URLs to the 'parse_product_pages' function.

We also pass the 'gender' metadata to 'parse_product_pages' through the 'Request' method, using the meta={'gender': row['gender']} argument.
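A sketch of the spider's 'start_requests' function, assuming a CSV layout like the illustration in Step 2 (the file name, column names, and page range are assumptions; 'parse_product_pages' is sketched in the next step):

    # fashionWebScraping/spiders/fashionBOYNER.py -- sketch
    import csv
    import scrapy

    class FashionBoynerSpider(scrapy.Spider):
        name = 'fashionBOYNER'

        def start_requests(self):
            # Read the starting URLs for this site from its CSV file
            with open('csvFiles/fashionBOYNER.csv') as f:
                for row in csv.DictReader(f):
                    # Fill the {} placeholder with increasing page numbers
                    for page in range(1, 21):
                        yield scrapy.Request(
                            url=row['url'].format(page),
                            callback=self.parse_product_pages,
                            meta={'gender': row['gender']},
                        )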

In the 'parse_product_pages' function, we do the actual web extraction and populate the Scrapy items with the extracted data.

We use XPath to locate the HTML sections that contain the product data on a web page.

The first XPath expression selects the entire product listing on the page currently being scraped. All the necessary product data is contained in its 'div' child elements.

We then loop over 'content' to reach the individual products and store them in Scrapy items. With relative XPath expressions, we can easily find the required HTML elements inside 'content'.
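A sketch of 'parse_product_pages'; the XPath expressions are placeholders, since the real class names have to be taken from each site's markup. It continues the spider class above and assumes 'from fashionWebScraping.items import FashionwebscrapingItem':

    # inside class FashionBoynerSpider:
        def parse_product_pages(self, response):
            # The first XPath grabs the whole product listing on the page
            content = response.xpath('//div[contains(@class, "product-list")]/div')

            # Loop over the listing and fill one Scrapy item per product
            for product in content:
                item = FashionwebscrapingItem()
                item['gender'] = response.meta['gender']
                item['productName'] = product.xpath('.//a/@title').get()
                item['price'] = product.xpath('.//span[contains(@class, "price")]/text()').get()
                item['productLink'] = response.urljoin(product.xpath('.//a/@href').get() or '')
                item['imageLink'] = product.xpath('.//img/@src').get()
                item['image_urls'] = [item['imageLink']] if item['imageLink'] else []
                yield item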

Step 5: Run the Spiders and Store the Extracted Data in JSON Files

During the scraping procedure, every product item is saved to a JSON file. Each website gets its own JSON file, which is populated with data on every spider run.

Using the jsonlines format can be more memory-efficient than plain JSON, especially if you scrape many web pages in one session.

Note that the JSON file names begin with 'rawdata', indicating that the next step is to check and validate the extracted raw data before using it in the application.
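One way to get this output is a per-spider feed export; a minimal sketch, assuming a recent Scrapy version (2.1+) and the file naming described above:

    # Inside the spider class: write the raw output as jsonlines
    custom_settings = {
        'FEEDS': {
            'jsonFiles/rawdata_fashionBOYNER.jsonl': {'format': 'jsonlines'},
        },
    }

The same output can also be produced from the command line with Scrapy's '-o' option.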

Step 6: Clean and Validate the Extracted Data in the JSON Files

After the extraction procedure ends, there may be some items you need to remove from the JSON files before using them in the application.

Some line items may have duplicate values or null fields. Both cases need a correction step, which we handle with 'deldub.py' and 'jsonPrep.py'.

'jsonPrep.py' looks for line items with null values and removes them when detected.

After the null line items have been removed, the results are saved in the 'jsonFiles' folder with file names beginning with 'prepdata'.
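A minimal sketch of the idea behind 'jsonPrep.py', assuming one JSON object per line (file names are illustrative; the script in the repository may differ):

    # jsonPrep.py -- sketch: drop line items that contain null or empty fields
    import json

    def prep(in_path, out_path):
        with open(in_path) as src, open(out_path, 'w') as dst:
            for line in src:
                line = line.strip()
                if not line:
                    continue
                item = json.loads(line)
                # Keep the item only if every field has a usable value
                if all(value not in (None, '', []) for value in item.values()):
                    dst.write(json.dumps(item) + '\n')

    prep('jsonFiles/rawdata_fashionBOYNER.jsonl',
         'jsonFiles/prepdata_fashionBOYNER.jsonl')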

'deldub.py' looks for duplicate line items and removes them when detected.
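And a minimal sketch of the deduplication step in 'deldub.py'; here duplicates are identified by the product link, which is our assumption, and the output file name is illustrative:

    # deldub.py -- sketch: remove duplicate line items from a jsonlines file
    import json

    def dedup(in_path, out_path, key='productLink'):
        seen = set()
        with open(in_path) as src, open(out_path, 'w') as dst:
            for line in src:
                item = json.loads(line)
                if item.get(key) in seen:
                    continue  # duplicate product: skip it
                seen.add(item.get(key))
                dst.write(json.dumps(item) + '\n')

    dedup('jsonFiles/prepdata_fashionBOYNER.jsonl',
          'jsonFiles/finaldata_fashionBOYNER.jsonl')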

Automate the Entire Scraping Workflow Using Apache Airflow

With the scraping procedure defined, we can move on to workflow automation. We will use Apache Airflow, a Python-based workflow automation tool originally created at Airbnb.

We will provide the terminal commands to install and configure Apache Airflow.

Generating a DAG file

In Airflow, a DAG (Directed Acyclic Graph) is a collection of the tasks you want to run, organized in a way that reflects their relationships and dependencies.

For example, a simple DAG could consist of three tasks: A, B, and C. It might say that A has to run successfully before B can run, but C can run anytime. It might say that task A times out after 5 minutes and that B can be restarted up to 5 times if it fails. It might also say that the workflow runs every night at 10 pm but should not start until a certain date.

The DAG, defined in a Python file, only organizes the task flow; we do not define the actual work of the tasks in the DAG definition itself.

Let's make a 'dags' folder with an empty Python file in it and start defining the workflow in Python code.
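A minimal DAG skeleton as a sketch; the DAG id, schedule, start date, and retry settings are assumptions rather than the repository's exact values:

    # dags/fashion_scraping_dag.py -- sketch
    from datetime import datetime, timedelta
    from airflow import DAG

    default_args = {
        'owner': 'airflow',
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
    }

    dag = DAG(
        dag_id='fashion_web_scraping',
        default_args=default_args,
        start_date=datetime(2021, 1, 1),
        schedule_interval='@daily',  # run the workflow once a day
        catchup=False,
    )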

Airflow provides many operators for describing the tasks in a DAG file. The most commonly used ones are listed below.

  • BashOperator: executes a bash command
  • EmailOperator: sends emails
  • SimpleHttpOperator: sends HTTP requests
  • Sensor: waits for a certain file, time, database row, S3 key, and more

We plan to use only 'BashOperator', since we will complete the individual tasks with Python scripts invoked from bash.

By following the tutorial, we created a bash script for every task. You can find them in the GitHub repository.
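A sketch of how the tasks could be wired into the DAG above using 'BashOperator' (the import path shown is for Airflow 2.x; task ids and script paths are illustrative, not the repository's actual names):

    from airflow.operators.bash import BashOperator

    # One task per workflow step, each running one of the bash scripts.
    # The trailing space after '.sh' stops Airflow from treating the path
    # as a Jinja template file.
    delete_old_files = BashOperator(
        task_id='delete_old_files',
        bash_command='/path/to/scripts/deleteFiles.sh ',
        dag=dag,
    )
    run_spiders = BashOperator(
        task_id='run_spiders',
        bash_command='/path/to/scripts/runSpiders.sh ',
        dag=dag,
    )
    clean_json = BashOperator(
        task_id='clean_json',
        bash_command='/path/to/scripts/cleanJson.sh ',
        dag=dag,
    )

    # Delete last session's output, then scrape, then clean and validate
    delete_old_files >> run_spiders >> clean_json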

To start the DAG workflow, we have to run the Airflow scheduler. It runs with the configuration specified in the 'airflow.cfg' file. The scheduler monitors every task in every DAG placed in the 'dags' folder and triggers task execution once the dependencies are met.

Once the Airflow scheduler is running, we can check the status of the tasks by visiting http://0.0.0.0:8080 in the browser. Airflow offers a user interface where we can view and monitor the scheduled DAGs.

Conclusion

We have walked through our web scraping workflow from start to finish.

Hopefully, it will help you grasp the fundamentals of web scraping and workflow automation.

For more details, contact Actowiz Solutions. You can also reach us for all your mobile app scraping and web scraping service requirements.
