Summary
Sometimes you may need a list of the IP addresses that have accessed your website from a particular source, such as a bot. On Linux, entries in the Apache access log look like this:
180.76.15.5 - - [25/Jul/2016:15:30:48 +0100] "GET / HTTP/1.1" 301 612 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
180.76.15.144 - - [25/Jul/2016:15:31:30 +0100] "GET / HTTP/1.1" 301 612 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
23.79.233.23 - - [25/Jul/2016:15:37:14 +0100] "GET / HTTP/1.1" 301 598 "-" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
23.79.233.24 - - [25/Jul/2016:15:37:19 +0100] "GET / HTTP/1.1" 302 582 "-" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
180.76.15.153 - - [25/Jul/2016:15:40:32 +0100] "GET / HTTP/1.1" 301 561 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
180.76.15.28 - - [25/Jul/2016:15:40:33 +0100] "GET / HTTP/1.1" 302 545 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
141.212.122.128 - - [25/Jul/2016:15:47:26 +0100] "GET / HTTP/1.1" 301 542 "-" "Mozilla/5.0 zgrab/0.x"
141.212.122.128 - - [25/Jul/2016:15:47:27 +0100] "GET / HTTP/1.1" 302 526 "http://91.192.193.170:80/" "Mozilla/5.0 zgrab/0.x"
64.138.2.85 - - [25/Jul/2016:15:48:12 +0100] "GET / HTTP/1.1" 301 598 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0"
64.138.2.85 - - [25/Jul/2016:15:48:13 +0100] "GET / HTTP/1.1" 302 582 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0"
From entries like these, you might decide to block Baiduspider if it is placing an increased load on your website.
Step-by-Step Guide
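You can obtain the list of IP addresses with the following command. It selects the matching log lines, keeps the first space-separated field (the client IP address), and removes duplicates: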
grep "baidu.com" access_log | cut -d' ' -f1 | sort | uniq
Alternatively, you can search for the spider's name. These commands are case-sensitive; add the '-i' switch to the grep command to make the match case-insensitive:
grep "Baiduspider" access_log | cut -d' ' -f1 | sort | uniq
grep "YandexBot" access_log | cut -d' ' -f1 | sort | uniq
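As a minimal sketch of the case-insensitive variant mentioned above, the following builds a small sample log from four of the entries shown in the Summary and searches it with '-i'. The temporary file and the variable names (sample_log, bot_ips) are assumptions made for illustration; on a real server you would point grep at your own access_log instead.

```shell
# Write a few of the sample entries to a temporary file so the example is self-contained.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
180.76.15.5 - - [25/Jul/2016:15:30:48 +0100] "GET / HTTP/1.1" 301 612 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
180.76.15.144 - - [25/Jul/2016:15:31:30 +0100] "GET / HTTP/1.1" 301 612 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
23.79.233.23 - - [25/Jul/2016:15:37:14 +0100] "GET / HTTP/1.1" 301 598 "-" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
141.212.122.128 - - [25/Jul/2016:15:47:26 +0100] "GET / HTTP/1.1" 301 542 "-" "Mozilla/5.0 zgrab/0.x"
EOF

# -i makes the match case-insensitive, so "baiduspider" also matches "Baiduspider".
# cut keeps the first space-separated field (the client IP); sort | uniq removes duplicates.
bot_ips=$(grep -i "baiduspider" "$sample_log" | cut -d' ' -f1 | sort | uniq)
printf '%s\n' "$bot_ips"

# Remove the temporary sample file.
rm -f "$sample_log"
```

Only the two Baiduspider entries match; the Chrome and zgrab entries are ignored because their user agent strings do not contain the search term.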
Each command returns a single, de-duplicated list of the IP addresses that have accessed your website as Baidu, or as whichever spider you searched for. For example:
grep "baidu.com" access_log | cut -d' ' -f1 | sort | uniq
123.125.71.76
123.125.71.86
123.125.71.90
180.76.15.135
180.76.15.139
180.76.15.140
180.76.15.141
180.76.15.144
180.76.15.145
180.76.15.147
180.76.15.153
180.76.15.157
180.76.15.158
180.76.15.159
180.76.15.16
180.76.15.22
180.76.15.26
180.76.15.28
180.76.15.29
180.76.15.34
180.76.15.5
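The list above is sorted as text, which is why 180.76.15.16 appears after 180.76.15.159. If numeric order is easier to read, one option is to sort on each dot-separated octet as a number. The sketch below is an optional variation rather than part of the original procedure, and the variable names (ips, sorted_ips) are assumptions made for illustration.

```shell
# A few addresses taken from the output above, with one repeated on purpose.
ips='180.76.15.5
180.76.15.144
180.76.15.28
123.125.71.90
180.76.15.5'

# sort -u removes the duplicate; the second sort compares each dot-separated
# octet as a number, so 180.76.15.5 is listed before 180.76.15.28.
sorted_ips=$(printf '%s\n' "$ips" | sort -u | sort -t. -k1,1n -k2,2n -k3,3n -k4,4n)
printf '%s\n' "$sorted_ips"
```

In the real pipeline, the same two sort commands could replace the final "sort | uniq" stage.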
WARNING: Be careful that you do not block any legitimate IP addresses using this method. You can confirm the owner of an IP address by performing a WHOIS lookup against it.
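The WHOIS check can be sketched as a small shell function, assuming the whois command-line client is installed. The function name show_owner is hypothetical, and the field names it filters on are a best guess: WHOIS output differs between registries, so you may need to read the full output instead.

```shell
# show_owner is a hypothetical helper, not part of any standard tool.
# It runs a WHOIS lookup for one IP address and keeps only the lines
# that commonly carry ownership details.
show_owner() {
    # Field names vary between registries, so this pattern is an assumption.
    whois "$1" | grep -i -E '^(netname|orgname|org-name|organization|descr|country):'
}

# Example usage (requires network access), using an address from the list above:
# show_owner 180.76.15.5
```

Run it against each address you intend to block and check that the reported owner matches the spider you expect before adding any rules.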