Programmatic SEO

Data Preparation

Once you have finalized your keywords (head terms and modifiers), it’s time to collect all the data required for the programmatic build process. In this section, you will learn about the different sources and methods you can use to collect data for your programmatic SEO project.

If you already have the data, you can skip this section. From here on, I will assume that you don’t.

The data collection process depends on the topics that you have finalized and the type of posts that you will be creating.

For example, if the topic you have finalized is best places to visit in {city}, then you will need at least the following data points:

  • Name of the city
  • Country/state the city is in
  • List of all the interesting places
  • Best time of the year to visit
  • Distance from the nearest airport/railway station
  • Estimated cost
  • Good hotels in the city to stay in, etc.
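The bullet list above translates directly into one data record per generated page. A minimal sketch in Python (the class and field names are illustrative, not a fixed schema):

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class CityPage:
    """One record per generated "best places to visit in {city}" page.
    Field names are invented for illustration -- adapt them to your dataset."""
    city: str
    country: str
    places: list[str] = field(default_factory=list)
    best_time_to_visit: str = ""
    distance_from_airport_km: float | None = None
    estimated_cost_usd: float | None = None
    hotels: list[str] = field(default_factory=list)

# Example record:
paris = CityPage(
    city="Paris",
    country="France",
    places=["Eiffel Tower", "Louvre Museum"],
    best_time_to_visit="April to June",
)
print(paris.city, len(paris.places))
```

One such record feeds one page template, so collecting the data really means filling a list of these records.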

There will rarely be a single site or method that gives you all the required data points for your project. Let’s go through the main sources in detail:

1. Use your country's government data

The governments of most countries provide several datasets and APIs that you can use in your project. Generally, they are completely free to use, but make sure to read the terms and conditions first.

For example, the US government's data website https://data.gov/ has more than 300,000 datasets related to agriculture, commerce, defense, energy, health, transportation, business, etc.

gov data.png

Another example is the Indian government’s site https://data.gov.in/, which offers thousands of datasets related to health, sports, biotechnology, travel, tourism, technology, and other sectors. Most of the data is available in simple CSV format, but thousands of APIs are available too.
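Since most of this government data ships as CSV, loading it into a usable structure takes only the standard library. A quick sketch (the columns here are invented for illustration):

```python
import csv
import io

# A tiny inline sample in the shape of a typical government CSV export
# (column names and values are invented for illustration).
sample = """city,state,population
Mumbai,Maharashtra,12442373
Pune,Maharashtra,3124458
"""

# csv.DictReader maps each row to a dict keyed by the header row.
rows = list(csv.DictReader(io.StringIO(sample)))

# Index by city so a page template can look records up by keyword.
by_city = {row["city"]: row for row in rows}
print(by_city["Pune"]["population"])  # -> 3124458
```

For a real file, replace `io.StringIO(sample)` with `open("cities.csv", newline="")`.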

2. Look for other public data

There are many other datasets on various topics that are publicly available. Most of these public datasets can be found on Google and other search engines by searching for “{topic} datasets”.

Before using any public data, make sure you have proper rights to use the data in your projects.

a. Find datasets on GitHub

There are thousands of datasets available on GitHub that you can find using the following two methods:

  • Search directly on GitHub.com using terms that describe the dataset you need. For example, if I am looking for food nutrition data, I’d search “food nutrition data” on GitHub.
  • Use the site: search operator on Google to restrict results to GitHub. For example, if I am looking for data on Indian cities, I’d google site:github.com indian cities data (as you can see in the screenshot below).

site github search.png

To get started, here is a GitHub repository called Awesome Public Datasets that collects hundreds of datasets across many categories. Most of the listed datasets are completely free to use.
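GitHub also exposes a REST search API, so you can query for candidate dataset repositories programmatically instead of through the web UI. A sketch that just builds the request URL (actually fetching it needs network access and, for higher rate limits, an auth token):

```python
from urllib.parse import urlencode

def github_repo_search_url(query: str, per_page: int = 10) -> str:
    """Build a GitHub repository-search API request URL for a dataset query."""
    params = urlencode({"q": query, "per_page": per_page})
    return f"https://api.github.com/search/repositories?{params}"

url = github_repo_search_url("indian cities data")
print(url)
# -> https://api.github.com/search/repositories?q=indian+cities+data&per_page=10

# To fetch (requires network):
# import json, urllib.request
# results = json.load(urllib.request.urlopen(url))
```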

b. Find public .xls files

There are many Excel files indexed on Google that you can search for by using the filetype: search operator.

plant species xls.png

For example, if I am trying to find .xls files related to plant species, I’d google filetype:xls plant species, as you can see in the screenshot above. The results are .xls files that you can download with a single click.

You can also find numerous PDF files using the same method, but extracting data from PDFs is a bit more complicated.

c. Find public Google Sheets spreadsheets

Just like public .xls files, there are tons of Google Sheets documents that you can search for using the site: operator.

google spreadsheet site search.png

For example, if you’re looking for data on animal species, then google the term site:docs.google.com/spreadsheets animal species, and a bunch of Google Sheets spreadsheets will appear.
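The site: and filetype: tricks from the last two subsections are easy to script when you need one query per keyword in a long list. A small helper (purely illustrative):

```python
def operator_query(terms, site=None, filetype=None):
    """Assemble a Google search query string with site:/filetype: operators."""
    parts = []
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    parts.append(terms)
    return " ".join(parts)

print(operator_query("plant species", filetype="xls"))
# -> filetype:xls plant species
print(operator_query("animal species", site="docs.google.com/spreadsheets"))
# -> site:docs.google.com/spreadsheets animal species
```

Generate one query per head term and work through them manually, or feed them into whatever search tooling you already use.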

d. Search through Reddit communities

Reddit has some active communities where you can request and share datasets on various topics. The most popular is r/datasets, and some others are r/opendata, r/data, and r/DataHoarder.

For example, I found a massive dataset of more than 107 million journal articles, 4.7 TiB in size, in one of these communities. All of it can be downloaded for free, as you can see in the screenshot below.

millions journal articles.png

e. Look for public APIs

Just like public datasets, you can also find public APIs to use in your programmatic SEO projects. One of the biggest API marketplaces is RapidAPI, where you will find thousands of free and premium APIs.

alpha vantage.png

For example, Alpha Vantage is a free API on RapidAPI that you can use to receive stock, forex, technical indicators, and cryptocurrency data.
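As a sketch of what calling such an API looks like, here is how a request URL for Alpha Vantage's daily stock time-series endpoint can be built. This is shown against Alpha Vantage's own endpoint with its public demo key; going through RapidAPI instead changes the host and auth headers:

```python
from urllib.parse import urlencode

def daily_series_url(symbol: str, api_key: str) -> str:
    """Build an Alpha Vantage TIME_SERIES_DAILY request URL.
    (Direct endpoint shown; via RapidAPI the host and headers differ.)"""
    params = urlencode({
        "function": "TIME_SERIES_DAILY",
        "symbol": symbol,
        "apikey": api_key,
    })
    return f"https://www.alphavantage.co/query?{params}"

url = daily_series_url("IBM", "demo")
print(url)

# To fetch the JSON (requires network):
# import json, urllib.request
# data = json.load(urllib.request.urlopen(url))
```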

f. Use Google Dataset Search tool

Google has a lesser-known tool called Dataset Search, where you can look for the datasets you need. Type your query into the search box, hit enter, and you will be presented with tons of useful data.

google dataset search.png

For example, I was looking for some data on COVID-19 and found an interesting one, for free.

3. Hire an experienced scraper

Data scraping is a complicated and time-consuming process, so unless you’re an expert, hiring an experienced scraper to collect the data for your programmatic SEO project can be another good option.

The person you hire should be able to help you with scraping as well as cleaning the scraped data, so that you can focus on what matters most. Make sure you communicate your requirements clearly beforehand.

data scraping.png

On freelancing platforms like Upwork, you can hire web scraping experts for as low as $10 per hour.
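Even if you outsource the scraping, it helps to understand the core pattern so you can specify requirements and sanity-check the deliverable: parse HTML into structured records. A minimal standard-library sketch (the sample HTML is invented):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (text, href) pairs from anchor tags -- the heart of most scrapers."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None

# Invented sample page for illustration:
html = '<ul><li><a href="/paris">Paris</a></li><li><a href="/rome">Rome</a></li></ul>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # -> [('Paris', '/paris'), ('Rome', '/rome')]
```

Real jobs add fetching, pagination, rate limiting, and data cleaning on top of this, which is exactly where an experienced freelancer earns their rate. Also make sure the scraping itself is permitted by the target site's terms of service.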