Have you ever needed to scan your website for specific words, for example to find restricted terms? This is a perfect job for Python and Scrapy: with only a few lines of code you can automate the task.
Installation
First install Scrapy and the required dependencies. I used pip to do the job.
$ pip install Scrapy
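If the installation succeeded, the scrapy command-line tool is now available. A quick way to verify it (the version number shown here is just an example, yours will differ):
$ scrapy version
Scrapy 2.11.0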
Create a Scrapy Project
Before you can start spidering your website, you have to create a Scrapy project. Change into the directory where you want to store the project and run the following command:
$ scrapy startproject wordlist_scrapper
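The command generates a small project scaffold. With a recent Scrapy release it looks roughly like this; the exact set of files can vary slightly between versions:
wordlist_scrapper/
    scrapy.cfg
    wordlist_scrapper/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py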
The Spider Script
Open the project in your preferred editor and create a new file "spider.py" in the "spiders" folder. Put in the following code:
import re

from scrapy.item import Item
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


def find_all_substrings(string, sub):
    # Return the start index of every occurrence of "sub" in "string".
    return [match.start() for match in re.finditer(re.escape(sub), string)]


class WebsiteSpider(CrawlSpider):
    name = "webcrawler"
    allowed_domains = ["www.phooky.com"]
    start_urls = ["http://www.phooky.com"]
    # Follow every link on the site and run check_buzzwords on each response.
    rules = [Rule(LinkExtractor(), follow=True, callback="check_buzzwords")]

    crawl_count = 0
    words_found = 0

    def check_buzzwords(self, response):
        self.__class__.crawl_count += 1

        wordlist = [
            "Lorem",
            "dolores",
            "feugiat",
        ]

        # Skip binary responses (images, PDFs, ...) that cannot be decoded.
        if getattr(response, "encoding", None) is None:
            return Item()

        url = response.url
        data = response.text

        # Print one "word;url" line for every occurrence of every word.
        for word in wordlist:
            occurrences = find_all_substrings(data, word)
            for _ in occurrences:
                self.__class__.words_found += 1
                print(word + ";" + url)

        return Item()

    def _requests_to_follow(self, response):
        # Only extract links from text responses; binary responses
        # have no "encoding" attribute.
        if getattr(response, "encoding", None) is not None:
            return CrawlSpider._requests_to_follow(self, response)
        return []
Replace "allowed_domains", "start_urls" and "wordlist" with your own data.
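For example, to scan your own site for a few restricted words, the relevant lines could look like this (the domain and the words below are just placeholders):

allowed_domains = ["www.example.com"]
start_urls = ["https://www.example.com"]
wordlist = [
    "confidential",
    "internal",
    "deprecated",
]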
Crawl your Website
Now you can run the spider and redirect its output to a CSV file:
$ scrapy crawl webcrawler > wordlist.csv
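By default Scrapy writes its log to stderr, so the redirect captures only the printed word/URL lines. If you still see too much noise, you can lower the log level with the -s option:
$ scrapy crawl webcrawler -s LOG_LEVEL=ERROR > wordlist.csv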
The result is a CSV file with two columns: the word that was found and the URL of the page on your website where it appears.
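With the example wordlist above, the file might contain lines like these (the URLs are made up for illustration):
Lorem;http://www.phooky.com/about
dolores;http://www.phooky.com/blog/first-post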
Of course, this is a very simple script, but I'm sure you can imagine how powerful Scrapy is. You can find more information in the official Scrapy documentation: https://docs.scrapy.org/
Cover image by Lorenzo Cafaro | close-up-code-coding-computer-23989 | CC0 License