How to Scrape 10 Online Shops within 30 Minutes Using Scrapy and Airflow

Get the source data you need to kick-start an app project

  • You are a developer.
  • You would love to create a wonderful web application.
  • You are completely dedicated to your project.

Even if you tick all these boxes, you still need a domain-specific dataset before you write a single line of code, because contemporary applications consume large amounts of data, in real time or in batches, to provide value to their users.

In this blog, we explain our workflow for generating such datasets, and you will see how we automate the scraping of multiple websites with no manual intervention.

Our objective is to produce a dataset for a price-comparison web app. The product category we use as an example is handbags. For this application, price and product data for handbags must be collected from various online sellers every day. Although some sellers offer an API for accessing the required details, not all of them do, so web scraping is unavoidable.

In this example, we create web spiders for 10 sellers using Scrapy and Python, then automate the procedure with Apache Airflow so that the whole pipeline runs periodically without manual involvement.

A Live Demo Web App with Source Code

All the associated source code is available in the GitHub repository.

Our Web Scraping Workflow

Before starting any web data scraping project, we need to define which sites the project will cover. We decided to include 10 websites that are the most visited online stores in Turkey for handbags; you can see the list in our GitHub repository.

Step 1: Install Scrapy and Set Up Project Folders


Before writing any spiders, you need to install Scrapy on your machine ('pip install scrapy') and create a Scrapy project ('scrapy startproject fashionWebScraping').

Project Files & Folders


We created a folder structure on the local machine to keep the project files neatly organized in separate folders.

The ‘csvFiles’ folder contains one CSV file for each website to be scraped. The spiders read their ‘starting URLs’ from these CSV files, so we do not need to hard-code them in the spiders.

The ‘fashionWebScraping’ folder holds the Scrapy spiders together with helper scripts such as ‘pipelines.py’, ‘settings.py’, and ‘items.py’. A few of these helper scripts have to be modified for the scraping procedure to run successfully.

Scraped product images are saved in the ‘images_scraped’ folder.


During scraping, all the product data, such as price, name, product links, and image links, is saved in JSON files inside the ‘jsonFiles’ folder.

There are also utility scripts for specific tasks:

  • ‘deldub.py’ detects and removes duplicate product entries from the JSON files after data extraction ends.
  • ‘deleteFiles.py’ deletes all the JSON files produced in the previous scraping session.
  • ‘jsonPrep.py’ is another utility script that detects and removes line items with null values from the JSON files after data extraction ends.
  • ‘jsonToes.py’ populates a remote Elasticsearch cluster by reading the JSON files, providing a full-text, real-time search experience (a sketch follows this list).
  • ‘sitemap_gen.py’ generates a sitemap covering the product links.
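
The Elasticsearch step can be pictured with a short sketch. This is not the repository's actual ‘jsonToes.py’, just a minimal illustration assuming the official elasticsearch Python client, a local cluster, and illustrative index and file names:

```python
# Bulk-index scraped items (one JSON object per line) into Elasticsearch.
import json
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

def actions(path="jsonFiles/finaldata_fashionBOYNER.jsonl"):
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Each JSON line becomes one document in the 'products' index.
            yield {"_index": "products", "_source": json.loads(line)}

helpers.bulk(es, actions())
```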

Step 2: Understand Each Site's URL Structure and Populate the CSV Files with Starting URLs


After creating the project folders, the next step is to populate the CSV files with starting URLs for every website we want to scrape.

Nearly every e-commerce site provides pagination for navigating users through its product lists. Each time you navigate to the next page, a page parameter in the URL increases; the sketch below includes an example of a URL where such a ‘page’ parameter is used.

We use a {} placeholder in the URL to iterate over pages by incrementing the value of ‘page’. We also use a ‘gender’ column in the CSV file to define the gender category of each URL.

The final CSV file therefore contains one row per starting URL, holding the templated URL and its gender category. The same principle applies to the rest of the sites in the project.
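
A minimal sketch of such a CSV file and of how the {} placeholder is expanded is shown below; the shop URL, file name, and page range are illustrative assumptions, not the exact values from the repository:

```python
import csv

# Illustrative contents of csvFiles/exampleShop.csv:
# url,gender
# https://www.example-shop.com/women-handbags?page={},women
# https://www.example-shop.com/men-handbags?page={},men

with open("csvFiles/exampleShop.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        for page in range(1, 4):  # increment the 'page' parameter
            print(row["url"].format(page), row["gender"])
```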

Step 3: Modify ‘settings.py’ and ‘items.py’


To do the web scraping, we first modify ‘items.py’ to define the ‘item objects’ that will store the extracted data.

Scrapy provides the Item class to describe common output data formats. Item objects are simple containers used to collect the scraped data. They provide a dictionary-like API with a convenient syntax for declaring their available fields.

(Source: scrapy.org)

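A minimal ‘items.py’ sketch is shown below. The field names are assumptions based on the data this article says is collected; the repository's actual class may differ:

```python
import scrapy

class FashionWebScrapingItem(scrapy.Item):
    # Fields populated by the spiders (names are illustrative).
    gender = scrapy.Field()
    productName = scrapy.Field()
    price = scrapy.Field()
    productLink = scrapy.Field()
    imageLink = scrapy.Field()
    # Fields required by Scrapy's built-in ImagesPipeline.
    image_urls = scrapy.Field()
    images = scrapy.Field()
```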

After that, we need to modify ‘settings.py’. This is necessary to customize the image pipeline and the spiders' behavior.

The Scrapy settings allow you to customize the behaviour of all Scrapy components, including the core, extensions, pipelines, and the spiders themselves.

(Source: scrapy.org)
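
A minimal ‘settings.py’ sketch follows, assuming the stock ImagesPipeline is used to download product images into the ‘images_scraped’ folder described earlier; the throttling values are illustrative:

```python
BOT_NAME = "fashionWebScraping"

SPIDER_MODULES = ["fashionWebScraping.spiders"]
NEWSPIDER_MODULE = "fashionWebScraping.spiders"

# Be polite to the target sites.
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1

# Download product images via the built-in images pipeline.
ITEM_PIPELINES = {"scrapy.pipelines.images.ImagesPipeline": 1}
IMAGES_STORE = "images_scraped"
```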

‘settings.py’ and ‘items.py’ are shared by all the spiders in the project.

Step 4: Create the Spiders


Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items). In other words, spiders are the place where you define the custom behaviour for crawling and parsing pages for a particular site (or, in some cases, a group of sites).

(Source: scrapy.org)

An empty spider file can be generated with Scrapy's ‘genspider’ command, for example ‘scrapy genspider fashionBOYNER <domain>’. We then write our code in the resulting ‘fashionBOYNER.py’ file.


The spider class has two methods: ‘start_requests’ and ‘parse_product_pages’.

In ‘start_requests’, we read the starting-URL data from the corresponding CSV file that we produced earlier. We then fill in the {} placeholder and pass the resulting product-page URLs on to the ‘parse_product_pages’ callback.

We also pass the ‘gender’ metadata to ‘parse_product_pages’ through the ‘Request’ method, using the meta={'gender': row['gender']} argument.
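
A minimal sketch of this part of the spider is shown below; the CSV file name and the page range are illustrative assumptions:

```python
import csv
import scrapy

class FashionBoynerSpider(scrapy.Spider):
    name = "fashionBOYNER"

    def start_requests(self):
        with open("csvFiles/fashionBOYNER.csv", newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                for page in range(1, 51):  # fill the {} placeholder page by page
                    yield scrapy.Request(
                        url=row["url"].format(page),
                        callback=self.parse_product_pages,
                        meta={"gender": row["gender"]},  # forward the category
                    )
```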


In ‘parse_product_pages’, we do the actual web extraction and populate the Scrapy items with the extracted data.

We use XPath to locate the HTML sections that contain the product data on a web page.

The first XPath expression extracts the entire product listing from the page currently being scraped; all the necessary product data is contained in ‘div’ content elements.


We then loop over ‘content’ to reach the individual products and store them in Scrapy items. Relative XPath expressions make it easy to find the required HTML elements inside each ‘content’ node, as the sketch below shows.

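The sketch continues the FashionBoynerSpider class from above; the XPath class names and selectors are illustrative assumptions, since each site needs its own expressions:

```python
    # Method of the FashionBoynerSpider class sketched above.
    # FashionWebScrapingItem is the item class from Step 3
    # (from fashionWebScraping.items import FashionWebScrapingItem).
    def parse_product_pages(self, response):
        # First XPath: grab every product block on the page.
        content = response.xpath('//div[contains(@class, "product-item")]')
        for product in content:
            item = FashionWebScrapingItem()
            item["gender"] = response.meta["gender"]
            # Relative XPaths locate the fields inside each product block.
            item["productName"] = product.xpath('.//a/@title').get()
            item["price"] = product.xpath('.//span[@class="price"]/text()').get()
            item["productLink"] = response.urljoin(product.xpath('.//a/@href').get())
            item["imageLink"] = product.xpath('.//img/@src').get()
            item["image_urls"] = [item["imageLink"]]  # for the images pipeline
            yield item
```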

Step 5: Run the Spiders and Store the Extracted Data in JSON Files


During this scraping procedure, every product item is saved to a JSON file; each website has its own JSON file, which is populated with data on every spider run.

Using the JSON Lines format can be more memory-efficient than plain JSON, particularly when you scrape many web pages in one session, because items are appended line by line instead of being held in memory as one large array.

Note that each JSON file name begins with ‘rawdata’, indicating that the next step is to check and validate the extracted raw data before using it in the application.
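
One way to get this behaviour is a per-spider feed export; the sketch below assumes Scrapy 2.4 or later (for the FEEDS setting with 'overwrite') and mirrors the ‘rawdata’ naming convention:

```python
import scrapy

class FashionBoynerSpider(scrapy.Spider):
    name = "fashionBOYNER"
    # Write items as JSON Lines to a per-spider 'rawdata' file.
    custom_settings = {
        "FEEDS": {
            "jsonFiles/rawdata_%(name)s.jsonl": {
                "format": "jsonlines",
                "overwrite": True,
            }
        }
    }
```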

Step 6: Clean and Validate the Extracted Data in the JSON Files


After the extraction procedure ends, there may be some items you need to remove from the JSON files before using them in the application.

Some line items may have duplicate values or null fields. Both cases need a correction step, which we handle with ‘deldub.py’ and ‘jsonPrep.py’.

‘jsonPrep.py’ looks for line items with null values and removes them when detected. A sketch of the idea, with explanations, is given below.

After the null line items are removed, the results are saved in the ‘jsonFiles’ folder under file names that begin with ‘prepdata’.
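
This is not the repository's exact script, just a minimal sketch of the idea, assuming JSON Lines input and illustrative file names:

```python
import json

def remove_nulls(in_path="jsonFiles/rawdata_fashionBOYNER.jsonl",
                 out_path="jsonFiles/prepdata_fashionBOYNER.jsonl"):
    """Keep only items whose fields are all populated."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines left by interrupted runs
            item = json.loads(line)
            # Drop the item if any field is null or empty.
            if all(v not in (None, "", []) for v in item.values()):
                dst.write(json.dumps(item, ensure_ascii=False) + "\n")
```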


‘deldub.py’ looks for duplicate line items and removes them when detected. A sketch of the idea, with explanations, is given below.
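
Again a minimal sketch rather than the repository's exact script; keying on the product link and the ‘finaldata’ output name are assumptions, and any unique field would work:

```python
import json

def remove_duplicates(in_path="jsonFiles/prepdata_fashionBOYNER.jsonl",
                      out_path="jsonFiles/finaldata_fashionBOYNER.jsonl"):
    """Keep only the first occurrence of each product link."""
    seen = set()
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            item = json.loads(line)
            key = item.get("productLink")
            if key in seen:
                continue  # duplicate: skip it
            seen.add(key)
            dst.write(json.dumps(item, ensure_ascii=False) + "\n")
```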

Automate the Entire Scraping Workflow Using Apache Airflow

Having defined the scraping procedure, we can move on to workflow automation. We will use Apache Airflow, a Python-based automation tool originally developed at Airbnb.

Apache Airflow is installed and configured from the terminal; ‘pip install apache-airflow’ installs the package, after which the metadata database must be initialized.

Generating a DAG file

In Airflow, a DAG (Directed Acyclic Graph) is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.

For instance, a simple DAG could consist of three tasks: A, B, and C. It could say that A has to run successfully before B can run, but C can run anytime. It could say that task A times out after 5 minutes, and B can be restarted up to 5 times in case it fails. It might also say that the workflow runs every night at 10 pm, but should not start until a certain date.

A DAG is defined in a Python file and only organizes the task flow; we do not implement the actual work of the tasks within the DAG file itself.

Let's make a ‘dags’ folder with an empty Python file and start defining the workflow in Python code.


Airflow provides many operators for describing the tasks in a DAG file. Commonly used ones include:

  • BashOperator: executes a bash command
  • EmailOperator: sends an email
  • SimpleHttpOperator: sends an HTTP request
  • Sensor: waits for a certain file, time, database row, S3 key, and more

We will use only the ‘BashOperator’, since all our tasks are carried out by Python scripts invoked from bash.


Following the tutorial, we created bash scripts for every task; you can find them in the GitHub repository.
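
A minimal DAG sketch wiring these steps together with BashOperators is shown below. The import paths assume Airflow 2.x, and the script paths, task names, and schedule are illustrative assumptions:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="fashion_scraping",
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",  # run the whole pipeline once a day
    catchup=False,
) as dag:
    delete_old = BashOperator(task_id="delete_old_json",
                              bash_command="python ~/project/deleteFiles.py")
    # Trailing space stops Airflow treating the .sh path as a Jinja template.
    crawl = BashOperator(task_id="run_spiders",
                         bash_command="bash ~/project/runSpiders.sh ")
    prep = BashOperator(task_id="remove_nulls",
                        bash_command="python ~/project/jsonPrep.py")
    dedub = BashOperator(task_id="remove_duplicates",
                         bash_command="python ~/project/deldub.py")

    # Each step runs only after the previous one succeeds.
    delete_old >> crawl >> prep >> dedub
```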

To begin a DAG workflow, we have to run the Airflow scheduler (‘airflow scheduler’). It executes with the configuration specified in the ‘airflow.cfg’ file. The scheduler monitors every task in every DAG placed in the ‘dags’ folder and triggers execution once the dependencies are met.

While the scheduler is running, we can watch the status of the tasks by starting the Airflow webserver and visiting http://0.0.0.0:8080 in a browser; Airflow provides a user interface for viewing and monitoring the scheduled DAGs.


Conclusion

We have walked through our web scraping workflow from start to finish.

Hopefully, it helps you grasp the fundamentals of web scraping and workflow automation.

For more details, contact Actowiz Solutions. You can also reach us for all your mobile app scraping and web scraping service requirements.
