Website Feature Engineering at Scale: PySpark, Python & Snowflake

Introduction and Problem
Imagine you’re staring at a database containing thousands of merchants across multiple countries, each with its own website. Your goal? Identify the top candidates to partner with in a new business proposal. Manually browsing each site is impossible at scale, so you need an automated way to gauge “how good” each merchant’s online presence is. Enter the website quality score: a numeric feature (0-10) that captures key aspects of a site’s professionalism, content depth, navigability, and visible product listings with prices. By integrating this score into your machine learning pipeline, you gain a powerful signal that helps your model distinguish the highest-quality merchants and dramatically improve selection accuracy.
Table of Contents
- Introduction and Problem
- Technical Implementation
- Legal & Ethical Considerations
- Getting Started
- Fetch HTML Script in Python
- Assign a Quality Score Script in PySpark
- Conclusion
- Disclaimer
Technical Implementation
Legal & Ethical Considerations
Be a good citizen of the web.
- This scraper only counts words, links, images, scripts, and simple “contact/about/price” flags; it does not extract or store any private or sensitive data.
- Throttle responsibly: use modest concurrency (e.g. CONCURRENT_REQUESTS ≤ 10), insert small pauses between batches, and avoid hammering the same domain.
- Retention policy: once you’ve computed your features or scores, purge raw HTML within a reasonable window (e.g. after 7-14 days).
- For very large runs, or if you plan to share extracted HTML, consider reaching out to site owners for permission or notifying them of your usage.
Getting Started
Here’s your folder structure once you clone the repository https://github.com/lucasbraga461/feat-eng-websites/ :
Code block 1. GitHub repository folder structure
├── src
│ ├── helpers
│ │ └── snowflake_data_fetch.py
│ ├── p1_fetch_html_from_websites.py
│ └── process_data
│ ├── s1_gather_initial_table.sql
│ └── s2_create_table_with_website_feature.sql
├── notebooks
│ └── ps_website_quality_score.ipynb
├── data
│ └── websites_initial_table.csv
├── README.md
├── requirements.txt
├── venv
├── .gitignore
└── .env
Ideally, your dataset should be in Snowflake. To give an idea of how to prepare it in case it comes from different tables, refer to src/process_data/s1_gather_initial_table.sql; here’s a snippet of it:
Code block 2. s1_gather_initial_table.sql
CREATE OR REPLACE TABLE DATABASE.SCHEMA.WEBSITES_INITIAL_TABLE AS
(
    SELECT DISTINCT COUNTRY, WEBSITE_URL
    FROM DATABASE.SCHEMA.COUNTRY_ARG_DATASET
    WHERE WEBSITE_URL IS NOT NULL
) UNION ALL (
    SELECT DISTINCT COUNTRY, WEBSITE_URL
    FROM DATABASE.SCHEMA.COUNTRY_BRA_DATASET
    WHERE WEBSITE_URL IS NOT NULL
) UNION ALL (
    [...]
    SELECT DISTINCT COUNTRY, WEBSITE_URL
    FROM DATABASE.SCHEMA.COUNTRY_JAM_DATASET
    WHERE WEBSITE_URL IS NOT NULL
)
;
Here’s what this initial table should look like:
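For illustration only, a couple of hypothetical rows (placeholder URLs, not real merchants) would look like this:

| COUNTRY | WEBSITE_URL |
| --- | --- |
| ARG | https://www.examplemerchant.com.ar |
| BRA | https://www.examplemerchant.com.br |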
Fetch HTML Script in Python
With the data ready, here’s how to call the script, assuming your data lives in Snowflake:
Code block 3. p1_fetch_html_from_websites.py using Snowflake dataset
cd ~/Document/GitHub/feat-eng-websites
python3 src/p1_fetch_html_from_websites.py -c BRA --use_snowflake
- The Python script expects the Snowflake table to be DATABASE.SCHEMA.WEBSITES_INITIAL_TABLE, which can be adjusted for your use case in the code itself.
That will open a window in your browser asking you to authenticate to Snowflake. Once you authenticate, it pulls the data from the designated table and proceeds with fetching the website content.
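For reference, here is a minimal sketch of what that browser-based (SSO) pull could look like with the Snowflake Python connector. The account, user, and query below are placeholders, not the repository’s actual helper code in src/helpers/snowflake_data_fetch.py:

import pandas as pd
import snowflake.connector

def fetch_initial_table(country: str) -> pd.DataFrame:
    # "externalbrowser" opens a browser window for SSO authentication
    conn = snowflake.connector.connect(
        account="YOUR_ACCOUNT",   # placeholder
        user="YOUR_USER",         # placeholder
        authenticator="externalbrowser",
    )
    try:
        cur = conn.cursor()
        cur.execute(
            "SELECT COUNTRY, WEBSITE_URL "
            "FROM DATABASE.SCHEMA.WEBSITES_INITIAL_TABLE "
            "WHERE COUNTRY = %s",
            (country,),
        )
        # Requires the connector's pandas extra (snowflake-connector-python[pandas])
        return cur.fetch_pandas_all()
    finally:
        conn.close()

df_bra = fetch_initial_table("BRA")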
If you choose to pull this data from a CSV file instead, omit the --use_snowflake flag and call it this way:
Code block 4. p1_fetch_html_from_websites.py using CSV dataset
cd ~/Document/GitHub/feat-eng-websites
python3 src/p1_fetch_html_from_websites.py -c BRA
GIF 1. Running p1_fetch_html_from_websites.py
Here’s why this script is more effective at fetching website content compared to a basic approach; see Table 1:
Table 1. Advantages of this Fetch HTML script comparing with a basic implementation
| Technique | Basic Approach | This script (p1_fetch_html_from_websites.py) |
| --- | --- | --- |
| HTTP fetching | Blocking requests.get() calls one-by-one | Async I/O with asyncio + aiohttp to issue many requests in parallel and overlap network waits |
| User-Agent | Single default UA header for all requests | Rotates through a list of real browser UA strings to evade basic bot detection and throttling |
| Batching | Load & process the entire URL list in one go | Splits into chunks via BATCH_SIZE so you can checkpoint, limit memory use, and recover mid-run |
| Retries & Timeouts | Rely on library defaults or crash on slow/unresponsive servers | Explicit MAX_RETRIES and TIMEOUT settings to retry transient failures and bound per-request wait times |
| Concurrency limit | Sequential or unbounded parallel calls (risking overload) | CONCURRENT_REQUESTS + aiohttp.TCPConnector + asyncio.Semaphore to throttle max in-flight connections |
| Event loop | Single loop reused, which can hit “bound to a different loop” errors when restarting | Creates a fresh asyncio event loop per batch to avoid loop/semaphore binding errors and ensure isolation |
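To make those techniques concrete, here is a condensed, illustrative sketch of that async fetching pattern. It is not the repository’s exact code; the constants (CONCURRENT_REQUESTS, BATCH_SIZE, MAX_RETRIES, TIMEOUT) and the UA list are assumed values:

import asyncio
import random
import aiohttp

CONCURRENT_REQUESTS = 10   # modest cap on in-flight requests
BATCH_SIZE = 200           # URLs per batch, so runs can be checkpointed
MAX_RETRIES = 3
TIMEOUT = 15               # seconds per request
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",          # truncated placeholders;
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",     # use full browser UA strings
]

async def fetch_one(session, semaphore, url):
    headers = {"User-Agent": random.choice(USER_AGENTS)}  # rotate UA strings
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            async with semaphore:  # throttle concurrent connections
                async with session.get(url, headers=headers) as resp:
                    return url, await resp.text()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            if attempt == MAX_RETRIES:
                return url, None           # give up after the last retry
            await asyncio.sleep(2 * attempt)  # simple backoff between retries

async def fetch_batch(urls):
    timeout = aiohttp.ClientTimeout(total=TIMEOUT)
    connector = aiohttp.TCPConnector(limit=CONCURRENT_REQUESTS)
    semaphore = asyncio.Semaphore(CONCURRENT_REQUESTS)
    async with aiohttp.ClientSession(connector=connector, timeout=timeout) as session:
        tasks = [fetch_one(session, semaphore, u) for u in urls]
        return await asyncio.gather(*tasks)

def fetch_all(urls):
    results = []
    for i in range(0, len(urls), BATCH_SIZE):
        batch = urls[i:i + BATCH_SIZE]
        # fresh event loop per batch avoids "bound to a different loop" issues
        loop = asyncio.new_event_loop()
        try:
            results.extend(loop.run_until_complete(fetch_batch(batch)))
        finally:
            loop.close()
    return results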
It’s generally better to store raw HTML in a proper database (Snowflake, BigQuery, Redshift, Postgres, etc.) rather than in CSV files. A single page’s HTML can easily exceed spreadsheet limits (e.g. Google Sheets caps at 50,000 characters per cell), and managing hundreds of pages would bloat and slow down CSVs. While we include a CSV option here for quick demos or minimal setups, large-scale scraping and feature engineering are far more reliable and performant when run in a scalable data warehouse like Snowflake.
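If you do land the raw HTML in Snowflake rather than CSV, here is a short sketch of how a scraped batch could be written with the connector’s pandas helper. The table name, database, and schema are placeholders:

import pandas as pd
from snowflake.connector.pandas_tools import write_pandas

def save_batch(conn, rows):
    # conn: an authenticated snowflake.connector connection (see the earlier sketch)
    # rows: list of (url, html) tuples returned by the fetcher
    df = pd.DataFrame(rows, columns=["WEBSITE_URL", "HTML_CONTENT"])
    write_pandas(
        conn,
        df,
        table_name="WEBSITE_SCRAPED_DATA_BRA",
        database="DATABASE",
        schema="SCHEMA",
        auto_create_table=True,  # create the table on first write
    )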
Once you run it for, say, BRA, ARG, and JAM, this is what your data folder will look like:
Code block 5. Folder structure once you’ve run it for ARG, BRA, and JAM
├── data
│ ├── website_scraped_data_ARG.csv
│ ├── website_scraped_data_BRA.csv
│ ├── website_scraped_data_JAM.csv
│ └── websites_initial_table.csv
Refer to Figure 2 to see what the first script generates, i.e. the table website_scraped_data_BRA. Note that one of the columns is html_content, a very large field since it holds the entire HTML content of the website.
Assign a Quality Score Script in PySpark
Because each page’s HTML can be massive, and you’ll have hundreds or thousands of pages, you can’t efficiently process or store all that raw text in flat files. Instead, we hand off to Snowpark, Snowflake’s DataFrame engine with a PySpark-like API, for scalable feature extraction. See notebooks/ps_website_quality_score.ipynb for a ready-to-run example: just select the Python kernel in Snowflake and import the built-in Snowpark libraries to spin up your session (see Code block 6).
Code block 6. Snowpark session setup in the Snowflake notebook
import pandas as pd
from bs4 import BeautifulSoup
import re
from tqdm import tqdm
import snowflake.snowpark as snowpark
from snowflake.snowpark.functions import col, lit, udf
from snowflake.snowpark.context import get_active_session
session = get_active_session()
Each market speaks its own language and follows different conventions, so we bundle all those rules into a simple country-specific config. For each country we define the contact/about keywords and price‐pattern regexes that signal a “good” merchant site, then point the script at the corresponding Snowflake input and output tables. This makes the feature extractor fully data-driven, reusing the same code for every region with just a change of config.
Code block 7. Config file
country_configs = {
    "ARG": {
        "name": "Argentina",
        "contact_keywords": ["contacto", "contáctenos", "observaciones"],
        "about_keywords": ["acerca de", "sobre nosotros", "quiénes somos"],
        "price_patterns": [r'ARS\s?\d+', r'\$\s?\d+', r'\d+\.\d{2}\s?\$'],
        "input_table": "DATABASE.SCHEMA.WEBSITE_SCRAPED_DATA_ARG",
        "output_table": "DATABASE.SCHEMA.WEBSITE_QUALITY_SCORES_ARG"
    },
    "BRA": {
        "name": "Brazil",
        "contact_keywords": ["contato", "fale conosco", "entre em contato"],
        "about_keywords": ["sobre", "quem somos", "informações"],
        "price_patterns": [r'R\$\s?\d+', r'\d+\.\d{2}\s?R\$'],
        "input_table": "DATABASE.SCHEMA.WEBSITE_SCRAPED_DATA_BRA",
        "output_table": "DATABASE.SCHEMA.WEBSITE_QUALITY_SCORES_BRA"
    },
    [...]
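As a quick illustration (hypothetical variable names) of how such a config drives the rest of the notebook, each run simply looks up its country block and points Snowpark at the right tables:

# Pick the configuration for the current run
cfg = country_configs["BRA"]

# Snowpark DataFrame over that country's scraped-HTML table
scraped_df = session.table(cfg["input_table"])
print(cfg["name"], scraped_df.count(), "pages to score")

# Downstream, the computed scores are written to cfg["output_table"]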
Before we can register and use our Python scraper logic inside Snowflake, we first create a stage, a persistent storage area, by running the DDL in Code Block 8. This creates a named location @STAGE_WEBSITES under your DATABASE.SCHEMA, where we’ll upload the UDF package (including dependencies like BeautifulSoup and lxml). Once the stage exists, we deploy the extract_features_udf there, making it available to any Snowflake session for HTML parsing and feature extraction. Finally, we set the country_code variable to kick off the pipeline for a specific country, before looping through other country codes as needed.
Code block 8. Create a stage to store the UDFs
-- CREATE STAGE DATABASE.SCHEMA.STAGE_WEBSITES;
country_code = "BRA"
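Before walking through the actual definition in Code block 9, here is a minimal sketch of how a permanent Snowpark UDF can be registered against that stage. The function body, return type, and package list are assumptions based on the description in this section, not the repository’s exact code:

import json
from snowflake.snowpark.types import StringType

def extract_features(html_content: str) -> str:
    # The real logic parses the HTML with BeautifulSoup; a stub keeps the sketch short
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_content or "", "lxml")
    text = soup.get_text(" ", strip=True)
    features = {
        "word_count": len(text.split()),
        "num_links": len(soup.find_all("a")),
        "num_images": len(soup.find_all("img")),
    }
    return json.dumps(features)

# Create the stage once (the commented DDL above), then register the UDF there
session.sql("CREATE STAGE IF NOT EXISTS DATABASE.SCHEMA.STAGE_WEBSITES").collect()

extract_features_udf = session.udf.register(
    extract_features,
    name="EXTRACT_FEATURES_UDF",
    return_type=StringType(),
    input_types=[StringType()],
    packages=["beautifulsoup4", "lxml"],   # resolved from Snowflake's Anaconda channel
    is_permanent=True,
    stage_location="@DATABASE.SCHEMA.STAGE_WEBSITES",
    replace=True,
)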
Now, in this part of the code (refer to Code block 9), we define the UDF ‘extract_features_udf’ that extracts information from the HTML content. Here’s what this part of the code does:
- Defines the Snowpark UDF called ‘extract_features_udf’ that lives in the Snowflake stage folder previously created
- Parses the raw HTML with the well-known BeautifulSoup library
- Extracts text features:
  - Total word count
  - Page title length
  - Flags for ‘contact’ and ‘about’ pages
- Extracts structural features:
  - Number of links
  - Number of images
  - Number of scripts