Our company has developed a sophisticated machine learning model that categorizes websites according to the widely used IAB taxonomy, which includes over 440 categories and supports both Tier 1 and Tier 2 categories.
Using this model, we have classified millions of website domains and made the results available in the form of an Offline Categorization Database.
Many of our clients, ranging from small startups and unicorns to multinationals, have found value in integrating this offline URL Database into their applications, serving a variety of use cases.
We invite you to explore the potential of large-scale, precise website categorization and consider our offline URL Database as a tool to enhance your business operations.
You can find more information and purchase options here:
https://www.websitecategorizationapi.com/url_database.php
Distribution of categorizations for 5 million domains from our website category database
Here is a distribution of Tier 1 categories for 5 million domains taken from the offline database:
The most frequent domain categories are Business and Finance, Sports, Personal Finance, Hobbies & Interests, Technology and Computing, and Events and Attractions.
The distribution of Tier 2 categories for a sample of top websites (only the top 50 categories are shown, to keep the chart readable) is:
Business is the most common category, followed by Arts and Crafts and Travel Type.
An important part of many services in the fields of web content filtering, AdTech marketing, cybersecurity, brand safety and contextual targeting (to name just a few) is a large and actively maintained URL database which contains, for each URL, categorized attributes such as content category, safety classification and others.
In the rest of this article we will introduce you to the typical structure of a URL database and how it is obtained.
Number of URLs and domains on the internet
First, let us address the question of URLs and domains. Did you know that more than 360 million domains are registered, according to 2021 statistics from Verisign? Of those, around 15% are active.
Of course, each domain can have many subpages, which puts the number of URLs in the billions. As an example, the January 2022 batch from the well-known crawler organization Common Crawl (https://commoncrawl.org/2022/02/january-2022-crawl-archive-now-available/) found 2.95 billion web pages, and that is just a single crawl data set. Common Crawl is a great resource for analyzing website data and has many useful tools available for quicker parsing.
For example, if you want to find all URLs that contain a string like “/pricing.php”, you can use the Common Crawl columnar index and parse it. The columnar index data is stored in Parquet format. You can learn more about the Parquet format here: https://parquet.apache.org
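The filtering step can be sketched as follows. This is only a minimal illustration, assuming the index rows have already been read (for example with a Parquet reader) into dictionaries that carry a `url` key. The helper `find_urls_containing` and the sample rows are hypothetical and not part of any Common Crawl tooling.

```python
from typing import Iterable, Iterator


def find_urls_containing(rows: Iterable[dict], needle: str) -> Iterator[str]:
    """Yield every URL whose text contains `needle`."""
    for row in rows:
        url = row.get("url", "")
        if needle in url:
            yield url


# Hypothetical rows standing in for records read from the columnar index.
sample_rows = [
    {"url": "https://example.com/pricing.php"},
    {"url": "https://example.com/about.html"},
    {"url": "https://example.org/shop/pricing.php?plan=pro"},
]

matches = list(find_urls_containing(sample_rows, "/pricing.php"))
print(matches)
```

Because the function is a generator, it can be fed rows in a streaming fashion, which matters when a single crawl contains billions of records.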
The sheer number of URLs existing on today’s web, with millions of new ones added each day, means that any kind of categorization of URLs must be automated.
With machine learning being by far the best choice for classification, most URL categorizers, whether for content filtering or safety, are supervised machine learning models belonging to the class of text classification models.
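To make the idea of a supervised text classifier concrete, here is a toy sketch using a naive Bayes model written in plain Python. It is not the model behind our database; naive Bayes is used here only because it is short enough to read in full, and the training sentences and category labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count words per category from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocabulary = set()
    for text, category in examples:
        words = text.lower().split()
        word_counts[category].update(words)
        doc_counts[category] += 1
        vocabulary.update(words)
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the category with the highest log-probability for `text`."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for category in doc_counts:
        # Start from the log prior of the category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[category] = score
    return max(scores, key=scores.get)


# Invented training data; a real categorizer needs far more labeled text.
training_data = [
    ("new car engine review and dealership prices", "Automotive"),
    ("electric car battery range test", "Automotive"),
    ("football match score and league table", "Sports"),
    ("tennis tournament final match results", "Sports"),
]

model = train(training_data)
print(classify("used car dealership", model))  # prints "Automotive"
```

In practice the input text would be the content extracted from a web page, and the labels would come from the chosen taxonomy.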
URL database – what kind of categories do we assign to URLs?
The type of categories that we assign to URLs depends on our objective. Let us say that our URL database will be used by an AdTech company or in marketing in general (if you are interested in learning more about AdTech, we provide an introduction to its main parts at the end of this blog post).
Suppose we have an advertiser A from a specific industry, e.g. Automotive, who wants to place ads on publishers' websites. For the ads to convert better, the advertiser should preferably place them on publisher websites whose content is from the Automotive sector. Visitors of such websites are more likely to be interested in ads from an Automotive company.
But how do we know which webpages have Automotive content?
This is where the URL database helps.
It is generated by applying a machine learning classifier to billions of URLs, identifying the content category of each URL, and then storing the result in the URL database.
How can we store data in a URL database?
The AdTech company now just has to download the URL database and integrate it into its existing app.
The categorization data itself can be stored in a variety of ways. It can be stored in an SQL database, in a NoSQL store or simply in text files, e.g. by storing all URLs or domains that belong to a specific category in the same text file.
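As an illustration of the SQL option, the sketch below uses Python's built-in `sqlite3` module. The table name `url_categories`, its columns, and the sample rows are assumptions made up for this example; they do not describe the layout of our offline database.

```python
import sqlite3

# In-memory database for illustration; a real integration would use a file or server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE url_categories ("
    " domain TEXT PRIMARY KEY,"
    " category TEXT NOT NULL,"
    " confidence REAL NOT NULL)"
)

# Hypothetical rows; the domains, categories and scores are invented.
rows = [
    ("cars.example", "Automotive", 0.94),
    ("loans.example", "Personal Finance", 0.88),
    ("garage.example", "Automotive", 0.81),
]
conn.executemany("INSERT INTO url_categories VALUES (?, ?, ?)", rows)
# Index the category column so lookups by category stay fast as the table grows.
conn.execute("CREATE INDEX idx_category ON url_categories (category)")


def domains_in_category(connection, category):
    """Return all domains stored under `category`, sorted alphabetically."""
    cursor = connection.execute(
        "SELECT domain FROM url_categories WHERE category = ? ORDER BY domain",
        (category,),
    )
    return [row[0] for row in cursor]


print(domains_in_category(conn, "Automotive"))  # ['cars.example', 'garage.example']
```

A query like this is what the advertiser from the Automotive example would run to get the list of publisher domains to target.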
URL Category Taxonomy for Web Content
When the objective is the classification of web content, one can use either one's own custom taxonomy or a taxonomy that is standard in the respective industry.
For marketing, the common standard for classifying content is the IAB taxonomy. The latest revision is available here:
https://iabtechlab.com/press-releases/tech-lab-releases-content-taxonomy-3-0/
If one is interested in classifying website content from the Ecommerce field, then the product taxonomy from Google may be more appropriate:
https://www.google.com/basepages/producttype/taxonomy-with-ids.en-US.txt
The URL categorization service from https://www.websitecategorizationapi.com supports both taxonomies.
You can try out both classifiers (for general content and for Ecommerce) here:
https://www.websitecategorizationapi.com/demo_dashboard_iab/
https://www.websitecategorizationapi.com/text-classification
Free URL Categorization Database
If you are interested in checking out only the categories of the top 500 domains in the world, we offer this open source database here:
https://www.websitecategorizationapi.com/sample_categorized_domains.csv
Conclusion
In this blog post we explained how URL databases are used and how they are created. We also provided a link to a URL database service that you can use in your apps, services or for other purposes.
Website categorization API categories
Website categorization services usually use IAB categories.
Here are the top IAB1 categories:
Real Estate, Fine Art, Pop Culture, Home & Garden, Business and Finance, Hobbies & Interests, Events and Attractions, Personal Finance, Travel, Careers, Shopping, Family and Relationships, Education, Healthy Living, Television, Books and Literature, Technology & Computing, Style & Fashion, Pets, Movies, News and Politics, Automotive, Religion & Spirituality, Sports, Food & Drink, Medical Health, Science, Music and Audio, Video Gaming
The list of IAB2 categories is much larger and includes hundreds of categories.
Adding a small selection below:
Design, Celebrity Homes, Houses, Home Improvement, Arts and Crafts, Real Estate Buying and Selling, Economy, Interior Decorating, Vacation Properties, Smart Home, Apartments, Insurance, Industrial Property, Gardening, Personal Debt, Industries, Home Appliances, Remodeling & Construction, Images/Galleries, Travel Type, Personal Celebrations & Life Events, Landscaping, Home Security, Bars & Restaurants, Home Utilities, Retail Property, Parks & Nature, Outdoor Decorating, Horror Movies, Amusement and Theme Parks, Land and Farms, Career Planning, Fashion Events, Senior Health, Museums & Galleries, Party Supplies and Decorations, Extreme Sports, Personal Care, Diving, Comedy TV, Birds, Cats, Reality TV, Indie and Arthouse Movies, Dining Out, Cinemas and Events, Business, Eldercare
Both IAB1 and IAB2 categories are included in the website categorization API, which can also be used from Node.js.
Frequently asked questions
What is a URL database?
A URL database is a collection of URLs or links to subpages, usually with some attributes determined for each of them, e.g. content category, language, author, root domain, the IP address it resides on, number of tokens (content length), topics mentioned in the URL, and others.
How do I find the URL category?
Follow these steps: 1. Decide which taxonomy is most appropriate (IAB or Ecommerce). 2. Submit your URL to the WebsitecategorizationAPI tool (in the dashboard) or use our API endpoints for this purpose. 3. You will obtain the result within 10 seconds, and you can use either the main predicted category or all categories whose confidence is higher than your set threshold.
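The last step can be sketched in a few lines of Python. The shape of the predictions (a list of dictionaries with `category` and `confidence` keys) is an assumption made for this example and may differ from the actual API response; `select_categories` is a hypothetical helper.

```python
def select_categories(predictions, threshold):
    """Return the top category and all categories with confidence above `threshold`."""
    ranked = sorted(predictions, key=lambda p: p["confidence"], reverse=True)
    main = ranked[0]["category"] if ranked else None
    confident = [p["category"] for p in ranked if p["confidence"] > threshold]
    return main, confident


# Hypothetical classifier output for a single URL.
predictions = [
    {"category": "Technology & Computing", "confidence": 0.41},
    {"category": "Automotive", "confidence": 0.83},
    {"category": "Hobbies & Interests", "confidence": 0.12},
]

main, confident = select_categories(predictions, threshold=0.4)
print(main)       # Automotive
print(confident)  # ['Automotive', 'Technology & Computing']
```

A higher threshold gives fewer but more reliable categories, while a lower one captures secondary topics of the page at the cost of more noise.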
AdTech Glossary
AdTech is a rapidly growing industry, with new business models and technologies being developed every day. We created a taxonomy of these categories so that you can easily navigate this space.
– Advertisers: A company that wants to sell goods or services to another company. They may want to advertise their products, or they may be looking for ways to improve their own internal processes through data collection and analysis.
– Ad Networks: A collection of online publishers who deliver ads based on the content that is displayed on their website or application. These networks usually have thousands or millions of websites under their umbrella, and are able to serve ads that are relevant to the user’s interests at the time they visit them.
– Demand Side Platforms (DSP): A company that manages digital advertising campaigns for its clients by buying ad space from publishers and then reselling it at a higher price than what they paid for it in order to make a profit off of each impression sold through their platform.
– Supply Side Platforms (SSP): A company or network that sells ad space directly from publishers, such as newspapers or news sites, through an automated auction process in which advertisers bid how much they are willing to pay per impression delivered.