Agency Malta is a web crawler that crawls local marketing agency websites and indexes all their projects and blog posts, making it the only place where you can do things like “browse all E-Commerce websites built in Malta” or ask “Who’s been doing remarkable work for a particular client?”
The website is useful for businesses that are looking for a service provider to deliver on a particular goal. If a business wants to create a new iOS app to complement its offering, it’s only natural that it would want to review the available service providers. Agency Malta makes this possible, and easy.
Often, ideas start off in a whirl of enthusiasm and late-night ideation and coding sessions, but when the initial momentum wears off these ideas are forgotten, only to be remembered guiltily and fleetingly.
As a self-proclaimed maker of things, I set myself a challenge for this project: I wanted to build it and ship it as a solo project, as an exercise in discipline and product launching.
Building a web crawler isn’t a trivial task, but luckily there are many resources available to help make this task more manageable.
Some of the problems encountered and solved include correctly identifying project and blog pages, tagging projects internally for better searching and filtering, and saving the correct meta information for each scraped page.
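To give a feel for the first of those problems, here is a minimal sketch of how a crawled URL might be sorted into “project”, “blog” or “other”. It is an illustration only, not the code Agency Malta runs: the keyword lists, the `classify_page` name and the rule that a single-segment path is a listing page are all assumptions made for the example.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical path keywords; a real crawler would tune these per site.
PROJECT_HINTS = ("project", "portfolio", "work", "case-stud")
BLOG_HINTS = ("blog", "news", "article", "insights")


def classify_page(url: str, og_type: Optional[str] = None) -> str:
    """Return 'blog', 'project' or 'other' for a crawled URL."""
    segments = [s for s in urlparse(url).path.lower().split("/") if s]

    # An explicit og:type of "article" is treated as a strong blog signal.
    if og_type == "article":
        return "blog"

    # Assumption: /blog or /work on its own is a listing page, not an item.
    if len(segments) < 2:
        return "other"

    section = segments[0]
    if any(section.startswith(hint) for hint in BLOG_HINTS):
        return "blog"
    if any(section.startswith(hint) for hint in PROJECT_HINTS):
        return "project"
    return "other"
```

For instance, `classify_page("https://example.com/work/acme-rebrand")` returns `"project"`, while the bare `https://example.com/blog` listing page falls through to `"other"`. Heuristics like these will misfire on some sites, which is part of why the problem is not trivial.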
The crawler uses a variety of techniques to crawl local marketing agency websites and index useful information to be used when presenting the projects and blog posts in my application. Luckily, most webmasters mark up their web pages using modern best practices, so even though the “semantic web” of Web 2.0 didn’t pan out exactly, robots are still able to glean lots of information from well marked-up pages.
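As a rough illustration of what “well marked-up” buys a crawler, the sketch below pulls the page title and any `<meta>` tags (including Open Graph properties such as `og:type`) out of an HTML document using only Python’s standard library. The `MetaExtractor` and `extract_meta` names are made up for this example, and the real crawler may well gather its metadata differently.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects the <title> text and <meta> name/property values."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Open Graph uses "property"; classic meta tags use "name".
            key = attrs.get("property") or attrs.get("name")
            content = attrs.get("content")
            if key and content:
                # Keep the first value seen for each key.
                self.meta.setdefault(key.lower(), content.strip())

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.meta.setdefault("title", data.strip())


def extract_meta(html: str) -> dict:
    """Return a dict of metadata found in the given HTML string."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return parser.meta
```

Fed a page whose head contains `<meta property="og:type" content="article">`, `extract_meta` returns a dictionary with an `"og:type"` key set to `"article"`, which is exactly the kind of signal that helps tell a blog post from a project page.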
The web crawler relies heavily on Redis for queue management, which lets it crawl thousands of web pages every day while remaining performant. Laravel Horizon was the perfect tool here, allowing me to monitor the background tasks that the crawler was initiating.
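The queueing idea itself is simple enough to sketch. The example below deliberately swaps Redis and Laravel’s queue workers for an in-memory deque and set, purely so it can run on its own. `CrawlQueue`, `crawl` and the `fetch_links` callback are hypothetical names, and skipping already-seen URLs is an assumption about how such a crawler would typically behave rather than a description of the real one.

```python
from collections import deque


class CrawlQueue:
    """In-memory stand-in for a Redis-backed queue of URLs to crawl."""

    def __init__(self):
        self._pending = deque()
        self._seen = set()

    def push(self, url):
        """Enqueue a URL unless it was queued before; return True if added."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._pending.append(url)
        return True

    def pop(self):
        """Return the oldest pending URL, or None when the queue is empty."""
        return self._pending.popleft() if self._pending else None

    def __len__(self):
        return len(self._pending)


def crawl(seed_urls, fetch_links, max_pages=100):
    """Visit pages breadth-first, queueing each newly discovered link once."""
    queue = CrawlQueue()
    for url in seed_urls:
        queue.push(url)

    visited = []
    while len(queue) and len(visited) < max_pages:
        url = queue.pop()
        visited.append(url)
        # fetch_links is supplied by the caller and returns a page's links.
        for link in fetch_links(url):
            queue.push(link)
    return visited
```

In a Redis-backed setup the pending list and the seen set would live in Redis, so that several workers can share them and the queue survives a restart, which is where a dashboard like Horizon becomes useful.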