Get the source data you need to kick-start an app project
- You are a developer.
- You would love to create a wonderful web application.
- You are completely dedicated to your project.
Although you tick all these points, you still need a domain-specific dataset before you write a single line of code. That is because contemporary applications consume huge amounts of data, in real time or in batches, to provide value to their users.
In this blog, we explain our workflow for generating such datasets. You will see how we handle automated data scraping of different websites with no manual intervention.
Our objective is to produce a dataset for a Price Comparison Web App. The product category we will use as an example is handbags. For this application, price and product data for handbags need to be collected from various online sellers every day. Although some sellers offer an API to access all the required details, not all of them follow the same route. Therefore, web scraping is unavoidable!
In this example, we will create web spiders for 10 sellers using Scrapy and Python. Then we will automate the procedure using Apache Airflow, so no manual involvement is required to execute the whole procedure periodically.
A Live Demo Web App with Source Code
You can find all the associated source code in the GitHub repository.
Our Web Scraping Workflow
Before we start any web data scraping project, we need to define which sites will be covered. We have decided to include 10 websites that are among the most visited online stores for handbags in Turkey. You can see them in our GitHub repository.
Step 1: Install Scrapy and Set Up Project Folders
You need to install Scrapy on your computer and create a Scrapy project before writing any spiders.
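A minimal sketch of these commands, assuming pip is available and using the 'fashionWebScraping' project name described below, looks like this:

```bash
# Install Scrapy into the current Python environment
pip install scrapy

# Create a new Scrapy project named 'fashionWebScraping'
scrapy startproject fashionWebScraping
```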
Project Files & Folders
We created a folder structure on the local computer to keep the project files neatly organized in separate folders.
The 'csvFiles' folder contains one CSV file for each website scraped. The spiders read these CSV files to get their 'starting URLs', so we do not need to hard-code them in the spiders.
The 'fashionWebScraping' folder contains the Scrapy spiders along with helper scripts such as 'pipelines.py', 'settings.py', and 'items.py'. We need to modify a few of these helper scripts to execute the scraping procedure successfully.
The product images that are extracted are saved in the 'images_scraped' folder.
During the web scraping procedure, all the product data, such as price, name, product links, and image links, is saved in JSON files within the 'jsonFiles' folder.
There are also utility scripts that carry out specific tasks:
- 'deldub.py' detects and removes duplicate product data in the JSON files after data extraction ends.
- 'deleteFiles.py' deletes all the JSON files produced in the previous scraping session.
- 'jsonPrep.py' detects and deletes null line objects in the JSON files after data extraction ends.
- 'jsonToes.py' populates ElasticSearch clusters at a remote location by reading from the JSON files. This provides a full-text, real-time search experience.
- 'sitemap_gen.py' generates a sitemap that covers the product links.
Step 2: Understand Each Site's URL Structure and Prepare the CSV Files with Starting URLs
After creating the project folders, the next step is to populate the CSV files with starting URLs for every website we want to scrape.
Nearly every e-commerce site provides pagination to navigate users through its product lists. Each time you navigate to the next page, a 'page' parameter in the URL increases.
We will use a {} placeholder to iterate over the URLs by incrementing the value of 'page'. We will also use a 'gender' column in the CSV file to define the gender category of each URL.
The final CSV file therefore looks something like this:
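The layout below is a hypothetical illustration; the actual seller URLs are in the repository's CSV files:

```
url,gender
https://www.example-seller.com/handbags-women?page={},women
https://www.example-seller.com/handbags-men?page={},men
```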
The same principles apply to the rest of the sites in the project.
Step 3: Modify 'settings.py' and 'items.py'
For the web scraping, we first need to modify 'items.py' to define the 'item objects' that are used to store the extracted data.
Scrapy provides the Item class to describe common output data formats. Item objects are simple containers used to collect the scraped data. They provide a dictionary-like API with a convenient syntax for declaring their available fields.
(Source: scrapy.org)
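A minimal sketch of 'items.py' is shown below. The class and field names are assumptions based on the data we collect (name, price, links, images, and gender), not necessarily the exact names used in the repository:

```python
import scrapy


class FashionItem(scrapy.Item):
    # Each field acts like a dictionary key for the scraped data
    product_name = scrapy.Field()
    price = scrapy.Field()
    product_link = scrapy.Field()
    gender = scrapy.Field()
    image_urls = scrapy.Field()  # read by Scrapy's ImagesPipeline
    images = scrapy.Field()      # filled in by the ImagesPipeline after download
```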
After that, we need to modify 'settings.py'. This is necessary to customize the image pipeline and the behavior of the spiders.
The Scrapy settings allow you to customize the behaviour of all Scrapy components, including the core, extensions, pipelines, and the spiders themselves.
(Source: scrapy.org)
'settings.py' and 'items.py' apply to all the spiders in the project.
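A minimal 'settings.py' sketch that enables the built-in image pipeline and points it at the 'images_scraped' folder could look like this; the politeness values are illustrative assumptions (the image pipeline also needs Pillow installed):

```python
BOT_NAME = 'fashionWebScraping'

SPIDER_MODULES = ['fashionWebScraping.spiders']
NEWSPIDER_MODULE = 'fashionWebScraping.spiders'

# Be polite to the target sites (illustrative values)
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1

# Enable Scrapy's built-in image pipeline and store images locally
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = 'images_scraped'
```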
Step 4: Creating the Spiders
Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items). In other words, spiders are the place where you define the custom behaviour for crawling and parsing pages for a particular site (or, in some cases, a group of sites).
(Source: scrapy.org)
The shell command below creates a blank spider file; we can then write our code in the 'fashionBOYNER.py' file.
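As a sketch, the spider file can be generated with Scrapy's 'genspider' command; the seller domain shown here is an assumption for illustration:

```bash
# Run from inside the Scrapy project folder
scrapy genspider fashionBOYNER www.boyner.com.tr
```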
The spider class has two functions: 'start_requests' and 'parse_product_pages'.
In the 'start_requests' function, we read the CSV file that we prepared earlier to get the starting-URL data. We then fill the {} placeholder to build the product-page URLs and pass them to the 'parse_product_pages' function.
We also pass the 'gender' metadata to the 'parse_product_pages' function through the 'Request' method, using the 'meta={'gender': row['gender']}' argument.
In the 'parse_product_pages' function, we do the actual extraction and populate the Scrapy items with the scraped data.
We use XPath to locate the HTML sections that contain the product data on a web page.
The first XPath expression scrapes the entire product listing from the page currently being scraped. All the necessary product data is contained within the 'div' elements of 'content'.
We then loop over 'content' to reach the individual products and store them in Scrapy items. Using XPath expressions, we can easily find the essential HTML elements within 'content'.
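Putting the two functions together, a sketch of the spider could look like the following. The CSV path, column names, page range, and all XPath expressions are illustrative assumptions; the real selectors depend on each seller's HTML markup:

```python
import csv

import scrapy

from fashionWebScraping.items import FashionItem  # item class sketched in Step 3


class FashionBoynerSpider(scrapy.Spider):
    name = 'fashionBOYNER'

    def start_requests(self):
        # Read the starting URLs from the CSV file prepared in Step 2
        # (file path and column names are assumptions for illustration)
        with open('csvFiles/BOYNER.csv', newline='', encoding='utf-8') as f:
            for row in csv.DictReader(f):
                # Fill the {} placeholder with increasing page numbers
                for page in range(1, 21):
                    yield scrapy.Request(
                        row['url'].format(page),
                        callback=self.parse_product_pages,
                        meta={'gender': row['gender']},
                    )

    def parse_product_pages(self, response):
        # The outer XPath selects the product listing on the current page;
        # the class names below are placeholders for the seller's real markup
        content = response.xpath("//div[@class='product-list']/div")

        for product in content:
            item = FashionItem()
            item['gender'] = response.meta['gender']
            item['product_name'] = product.xpath(".//h3/text()").get()
            item['price'] = product.xpath(".//span[@class='price']/text()").get()
            item['product_link'] = response.urljoin(product.xpath(".//a/@href").get())
            item['image_urls'] = [response.urljoin(product.xpath(".//img/@src").get())]
            yield item
```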
Step 5: Run the Spiders and Store the Extracted Data in JSON Files
During the scraping procedure, every product item is saved to a JSON file. Every website gets its own JSON file, which is populated with data on every spider run.
Using the JSON Lines format can be more memory-efficient than plain JSON, particularly if you scrape many web pages in one session.
Note that the JSON file names begin with 'rawdata', indicating that the next step is to check and validate the extracted raw data before using it in the application.
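As a sketch, a spider run that exports its items to a per-site JSON Lines file could look like this; the exact file name is an assumption that follows the 'rawdata' convention:

```bash
# Export the scraped items in JSON Lines format, one file per website
scrapy crawl fashionBOYNER -o jsonFiles/rawdata_BOYNER.jl
```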
Step 6: Clean and Validate the Extracted Data in the JSON Files
After the extraction procedure ends, you may have some items that need to be removed from the JSON files before they are used in the application. You may have line items with duplicate values or null fields. Both cases need a correction procedure, which we handle with 'deldub.py' and 'jsonPrep.py'.
'jsonPrep.py' looks for line items with null values and removes them when detected. You can find the code, with explanations, in the repository; a sketch is given below:
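A minimal sketch of the idea, reading a JSON Lines file and dropping records with null or empty fields, might look like this; the file names are hypothetical and follow the 'rawdata'/'prepdata' convention:

```python
import json

# Hypothetical input/output names following the 'rawdata'/'prepdata' convention
src = 'jsonFiles/rawdata_BOYNER.jl'
dst = 'jsonFiles/prepdata_BOYNER.jl'

with open(src, encoding='utf-8') as fin, open(dst, 'w', encoding='utf-8') as fout:
    for line in fin:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Keep the record only if none of its fields are null or empty
        if all(value not in (None, '', []) for value in record.values()):
            fout.write(json.dumps(record, ensure_ascii=False) + '\n')
```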
After the null line items are removed, the results are saved in the 'jsonFiles' folder with file names beginning with 'prepdata'.
'deldub.py' looks for duplicate line items and removes them when detected. You can find the code, with explanations, in the repository; a sketch is given below:
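A rough sketch of the de-duplication idea, assuming the product link identifies a duplicate, is shown below; the file names are again hypothetical:

```python
import json

# Hypothetical file names for this illustration
src = 'jsonFiles/prepdata_BOYNER.jl'
dst = 'jsonFiles/finaldata_BOYNER.jl'

seen = set()
with open(src, encoding='utf-8') as fin, open(dst, 'w', encoding='utf-8') as fout:
    for line in fin:
        record = json.loads(line)
        # Use the product link as the uniqueness key (an assumption for illustration)
        key = record.get('product_link')
        if key in seen:
            continue  # drop duplicate records
        seen.add(key)
        fout.write(json.dumps(record, ensure_ascii=False) + '\n')
```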
Automate the Entire Scraping Workflow Using Apache Airflow
Once the scraping procedure is defined, we can move on to workflow automation. We will use Apache Airflow, a Python-based workflow automation tool originally created by Airbnb.
We use terminal commands to install and configure Apache Airflow.
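A sketch of the installation and initialization, assuming a recent Airflow 2.x release installed with pip, looks roughly like this:

```bash
# Install Apache Airflow and initialize its metadata database
pip install apache-airflow
airflow db init

# Create an admin user for the web UI (the values are illustrative)
airflow users create --username admin --password admin \
    --firstname Admin --lastname User --role Admin --email admin@example.com
```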
Generating a DAG file
In Airflow, a DAG (Directed Acyclic Graph) is a collection of the tasks you want to run, organized in a way that reflects their relationships and dependencies.
For example, a simple DAG could consist of three tasks: A, B, and C. It might say that A has to run successfully before B can run, but C can run anytime. It might say that task A times out after 5 minutes, and B can be restarted up to 5 times in case it fails. It might also say that the workflow runs every night at 10 pm, but should not start until a certain date.
The DAG itself is defined in a Python file, which only organizes the task flow; we do not define the actual work of the tasks inside the DAG file.
Let's create a 'dags' folder with an empty Python file and start defining the workflow in Python.
Airflow provides many operators for describing the tasks in a DAG file, such as 'BashOperator' and 'PythonOperator'. We plan to use only 'BashOperator', as we carry out the individual tasks with Python scripts invoked from the shell.
Following this approach, we generated bash scripts for every task. You can find them in the GitHub repository.
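A sketch of the DAG file using 'BashOperator' (with Airflow 2-style imports) is given below. The DAG id, task names, schedule, and script paths are assumptions that mirror the workflow described above (delete old files, run the spiders, clean and de-duplicate the JSON, then push the data to Elasticsearch):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    'owner': 'airflow',
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# Run the whole scraping workflow once a day (illustrative schedule)
with DAG(
    dag_id='fashion_web_scraping',
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval='@daily',
    catchup=False,
) as dag:

    delete_old_files = BashOperator(
        task_id='delete_old_files',
        bash_command='python /path/to/deleteFiles.py',
    )
    run_spiders = BashOperator(
        task_id='run_spiders',
        # Hypothetical helper script that runs all ten spiders in turn;
        # the trailing space stops Airflow treating the string as a template file
        bash_command='bash /path/to/runSpiders.sh ',
    )
    prep_json = BashOperator(
        task_id='prep_json',
        bash_command='python /path/to/jsonPrep.py',
    )
    remove_duplicates = BashOperator(
        task_id='remove_duplicates',
        bash_command='python /path/to/deldub.py',
    )
    push_to_elasticsearch = BashOperator(
        task_id='push_to_elasticsearch',
        bash_command='python /path/to/jsonToes.py',
    )

    # Task order mirrors the workflow described above
    (delete_old_files >> run_spiders >> prep_json
     >> remove_duplicates >> push_to_elasticsearch)
```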
To start the DAG workflow, we have to run the Airflow scheduler. It runs with the configuration specified in the 'airflow.cfg' file. The scheduler monitors every task in every DAG placed in the 'dags' folder and triggers a task's execution once its dependencies are met.
Once the scheduler is running, we can follow the status of the tasks by visiting http://0.0.0.0:8080 in the browser. Airflow provides a user interface in which we can view and monitor the scheduled DAGs.
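A sketch of the commands to start the scheduler and the web UI (on the default port 8080) is shown below:

```bash
# Start the scheduler, which triggers tasks whose dependencies are met
airflow scheduler

# In a separate terminal, start the web UI on port 8080
airflow webserver --port 8080
```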
Conclusion
We have shown our web scraping workflow from start to finish. Hopefully, it will help you grasp the fundamentals of web scraping combined with workflow automation.
For more details, contact Actowiz Solutions. You can also reach out to us for all your mobile app scraping and web scraping service requirements.