Why the need for a URL database?

Our company has developed a state-of-the-art, highly accurate machine learning model for categorizing websites according to the IAB taxonomy (440+ IAB categories, covering both Tier 1 and Tier 2).

We use it to classify 340 million domains on the internet according to the IAB taxonomy.

Our clients use this offline URL database of categorized domains with great success in their own applications and across a wide range of use cases.

You can purchase this offline database yourself. It is available here:

https://www.websitecategorizationapi.com/url_database.php

Distribution of categorizations for a sample of domains from our website category database

Here is the distribution of Tier 1 categories in a sample of around 800,000 domains taken from the offline database:

The most frequent domain categories are Business and Finance, Sports, Personal Finance, Hobbies & Interests, Technology and Computing, and Events and Attractions.

For Tier 2 categories, the distribution is:

Business is the most common category, followed by Arts and Crafts and Travel Type.

A large, actively maintained URL database is an important part of many services in web content filtering, AdTech marketing, cybersecurity, brand safety and contextual targeting, to name just a few fields. For each URL, such a database stores categorized attributes like content category, safety classification and others.

In the rest of this article we will introduce you to the typical structure of a URL database and how it is obtained.

Number of URLs and domains on the internet

First, let us address the question of URLs and domains. According to 2021 statistics from Verisign, more than 360 million domains are registered. Of those, around 15% are active.

Of course, each domain can have many subpages, which brings the number of URLs into the billions. As an example, the January 2022 batch from the well-known crawler organization Common Crawl (https://commoncrawl.org/2022/02/january-2022-crawl-archive-now-available/) found 2.95 billion web pages, and that is just a single crawl data set. Common Crawl is a great resource for analyzing website data and has many useful tools available for quicker parsing.

For example, if you want to find all URLs that contain a string like “/pricing.php”, you can use the Common Crawl columnar index and parse it. The columnar index data is stored in Parquet format. You can learn more about the Parquet format here: https://parquet.apache.org
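The filtering step itself can be sketched in a few lines of Python. This is only an illustration under assumptions: the rows are plain dictionaries with a "url" key standing in for rows of the index, the function name is hypothetical, and reading the actual Parquet files (which needs a Parquet reader library) is not shown.

```python
from urllib.parse import urlsplit


def urls_with_path_fragment(rows, fragment):
    """Return the URLs whose path contains `fragment`.

    `rows` is any iterable of dicts with a "url" key. This row shape is an
    assumption made for illustration, not the exact index schema.
    """
    matches = []
    for row in rows:
        url = row["url"]
        # Match on the path only, so a query string that merely mentions
        # the fragment does not count as a hit.
        if fragment in urlsplit(url).path:
            matches.append(url)
    return matches


# Made-up sample rows standing in for index data.
sample_rows = [
    {"url": "https://example.com/pricing.php"},
    {"url": "https://example.org/about"},
    {"url": "https://shop.example.net/en/pricing.php?plan=pro"},
    {"url": "https://example.com/search?q=/pricing.php"},
]

print(urls_with_path_fragment(sample_rows, "/pricing.php"))
```

Matching on the parsed path rather than on the raw URL string is a design choice here: it keeps out URLs such as the last sample row, where the string only appears in a search query.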

The sheer number of URLs on today’s web, with millions of new ones added each day, means that any kind of URL categorization must be automated.

Machine learning is by far the best choice for this kind of classification, so most URL categorizers, whether for content filtering or safety, are supervised machine learning models belonging to the class of text classification models.
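To make "supervised text classification" concrete, here is a minimal sketch of a multinomial Naive Bayes classifier in plain Python. It is a toy stand-in chosen for brevity, not the model behind our database; the class name, the four training texts and their labels are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Toy multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token -> count
        self.doc_counts = Counter()              # label -> number of texts
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        # Tokens never seen in training carry no information, so skip them.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            # Log prior of the label plus log likelihood of each token.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training data: page text paired with a category label.
texts = [
    "new sedan and suv models with hybrid engine reviews",
    "car dealership offers on tyres brakes and engine service",
    "football league scores and match highlights",
    "tennis tournament results and player rankings",
]
labels = ["Automotive", "Automotive", "Sports", "Sports"]

model = NaiveBayesTextClassifier().fit(texts, labels)
print(model.predict("engine review of the latest suv"))             # Automotive
print(model.predict("match results from the football tournament"))  # Sports
```

A production categorizer would be trained on far more labeled pages and would typically use a stronger model, but the workflow is the same: labeled text in, a predicted category out.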

URL database – what kind of categories do we assign to URLs?

The type of categories that we assign to URLs depends on our objective. Let us say that our URL database will be used by an AdTech company, or in marketing in general (if you are interested in learning more about AdTech, there is an introduction to its main parts at the end of this blog post).

We have an advertiser A from a specific industry, e.g. Automotive, who wants to place ads on publishers' websites. For the ads to convert better, the advertiser should preferably place them on publisher websites with Automotive content. Visitors to such websites are more likely to be interested in ads from an Automotive company.

But how do we know which webpages have Automotive content?

This is where the URL database helps.

It is generated by applying a machine learning classifier to billions of URLs, identifying the content category of each URL, and storing the results in the URL database.
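The lookup described above can be sketched as follows. The database here is a hypothetical in-memory mapping from host name to category, and the function names and sample hosts are made up for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical slice of a URL database: host name -> content category.
URL_DATABASE = {
    "cars.example.com": "Automotive",
    "news.example.org": "News and Politics",
    "garage.example.net": "Automotive",
}


def category_for(url, database):
    """Look up the category stored for the host of `url`, or None."""
    host = urlsplit(url).hostname
    if host is None:
        return None
    # Treat "www.example.com" and "example.com" as the same site.
    if host.startswith("www."):
        host = host[4:]
    return database.get(host)


def publishers_for(advertiser_category, publisher_urls, database):
    """Keep only the publisher URLs whose category matches the advertiser."""
    return [
        url
        for url in publisher_urls
        if category_for(url, database) == advertiser_category
    ]


publisher_urls = [
    "https://www.cars.example.com/reviews/suv",
    "https://news.example.org/politics",
    "https://garage.example.net/tips",
    "https://unknown.example.io/",
]

print(publishers_for("Automotive", publisher_urls, URL_DATABASE))
```

URLs whose host is not in the database are simply skipped in this sketch; a real integration would need to decide how to treat such unknown sites.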

How can we store data in a URL database?

The AdTech company now just has to download the URL database and integrate it into its existing app.

The categorization data itself can be stored in a variety of ways: in an SQL database, in a NoSQL store, or simply in text files, e.g. with all URLs or domains that belong to a specific category stored in the same text file.
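As an illustration of the SQL option, here is a minimal sketch using SQLite from the Python standard library. The table name, column names and sample rows are assumptions made for this example and do not describe the layout of our database.

```python
import sqlite3


def build_database(rows):
    """Create an in-memory SQLite database from (domain, category, confidence) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE url_categories ("
        " domain TEXT PRIMARY KEY,"
        " category TEXT NOT NULL,"
        " confidence REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO url_categories VALUES (?, ?, ?)", rows)
    # An index on category keeps "all domains in category X" queries fast.
    conn.execute("CREATE INDEX idx_category ON url_categories (category)")
    conn.commit()
    return conn


def domains_in_category(conn, category):
    """Return all domains stored under `category`, sorted by name."""
    cursor = conn.execute(
        "SELECT domain FROM url_categories WHERE category = ? ORDER BY domain",
        (category,),
    )
    return [row[0] for row in cursor]


def group_by_category(conn):
    """Group domains per category, mirroring a one-text-file-per-category layout."""
    grouped = {}
    query = "SELECT domain, category FROM url_categories ORDER BY domain"
    for domain, category in conn.execute(query):
        grouped.setdefault(category, []).append(domain)
    return grouped


# Made-up sample rows.
rows = [
    ("cars.example.com", "Automotive", 0.93),
    ("scores.example.org", "Sports", 0.88),
    ("garage.example.net", "Automotive", 0.81),
]
conn = build_database(rows)
print(domains_in_category(conn, "Automotive"))
print(group_by_category(conn))
```

Each list returned by `group_by_category` could be written to its own text file to get the text-file layout mentioned above.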

URL Category Taxonomy for Web Content

When the objective is to classify web content, one can use either a custom taxonomy or a taxonomy that is standard in the respective industry.

For marketing, the common standard for classifying content is the IAB taxonomy. The latest revision is available here:

https://iabtechlab.com/press-releases/tech-lab-releases-content-taxonomy-3-0/

If one is interested in classifying website content from the Ecommerce field, the Google product taxonomy may be more appropriate:

https://www.google.com/basepages/producttype/taxonomy-with-ids.en-US.txt

The URL categorization service from https://www.websitecategorizationapi.com supports both taxonomies.

You can try out both classifiers (for general content and for Ecommerce) here:

https://www.websitecategorizationapi.com/demo_dashboard_iab/

https://www.websitecategorizationapi.com/text-classification

Free URL Categorization Database

If you only want to check the categories of the top 500 domains in the world, we offer this open source database here:

https://www.websitecategorizationapi.com/sample_categorized_domains.csv

Conclusion

In this blog post we explained how URL databases are used and how they are created. We also provided a link to a URL database service that you can use in your apps, services or for other purposes.

Website categorization API categories

Website categorization services usually use IAB categories.

Here are the top IAB1 categories:

The list of IAB2 categories is much larger and includes hundreds of categories.

Adding a small selection below:

IAB1 and IAB2 categories are included in the website categorization API, which can be used from Node.js as well.

 

Frequently asked questions

What is a URL database?

A URL database is a collection of URLs or links to subpages, usually with some attributes determined for each of them, e.g. content category, language, author, root domain, the IP address it resides on, number of tokens (content length), topics mentioned in the URL, and others.

How do I find the URL category?

Follow these steps: 1. Decide which taxonomy is most appropriate (IAB or Ecommerce). 2. Submit your URL to the WebsitecategorizationAPI tool (in the dashboard) or use our API endpoints for this purpose. 3. You will obtain the results within 10 seconds, and you can use either the main predicted category or all categories whose confidence is higher than your chosen threshold.
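The last step can be sketched as below. The shape of the predictions, a list of (category, confidence) pairs, is an assumption made for illustration and not the documented response format of the API; the function name and the default threshold are hypothetical too.

```python
def select_categories(predictions, threshold=0.5):
    """Return the main category and all categories above `threshold`.

    `predictions` is a list of (category, confidence) pairs. This shape is
    an assumption for illustration, not the actual API response format.
    """
    if not predictions:
        return None, []
    # Highest confidence first; the top entry is the main predicted category.
    ranked = sorted(predictions, key=lambda pair: pair[1], reverse=True)
    main_category = ranked[0][0]
    above_threshold = [cat for cat, conf in ranked if conf > threshold]
    return main_category, above_threshold


# Made-up predictions for a single URL.
predictions = [
    ("Sports", 0.20),
    ("Automotive", 0.70),
    ("Technology & Computing", 0.55),
]

main, confident = select_categories(predictions, threshold=0.5)
print(main)       # Automotive
print(confident)  # ['Automotive', 'Technology & Computing']
```

A higher threshold returns fewer but more reliable categories; a lower one returns broader coverage at the cost of more borderline labels.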

AdTech Glossary

AdTech is a rapidly growing industry, with new business models and technologies being developed every day. We created a taxonomy of these categories so that you can easily navigate this space.

– Advertisers: A company that wants to sell goods or services to another company. They may want to advertise their products, or they may be looking for ways to improve their own internal processes through data collection and analysis.

– Ad Networks: An intermediary that aggregates ad space from many online publishers and sells it to advertisers. These networks usually have thousands or millions of websites under their umbrella, and are able to serve ads that are relevant to the displayed content and to the user’s interests at the time of the visit.

– Demand Side Platforms (DSP): A platform that lets advertisers and agencies manage digital advertising campaigns and buy ad impressions programmatically across many publishers and ad exchanges, typically through automated auctions.

– Supply Side Platforms (SSP): A platform that publishers, such as newspapers or news sites, use to sell their ad space through an automated auction process in which advertisers bid how much they are willing to pay per impression delivered.