
Relieving High Server Load by Blocking Search Bots


Over the years, whenever the site was slow to load, I always went through the same routine. I checked traffic logs to see if it was due to a sudden increase in traffic. If not, I looked at the MySQL slow query log to see if slow queries were bogging down the database. If that wasn't the culprit, I took a look at the access logs for any irregular activity from search engine bots.

Most of the time, it wasn’t due to a sudden spike in traffic. Even when Roger Ebert mentioned Wopular on his Twitter feed, which had hundreds of thousands of followers, the server handled it just fine.

It was usually a combination of the last two factors: a search bot got into a section of the site that had one or more slow queries and spanned many pages. As the bot went through each page, the requests started to pile up and eventually caused a traffic jam.

Well, over the years I've fixed many of those pages, so my slow query log tends to be pretty small. The remaining slow queries usually aren't slow enough to give the server fits: they finish within a couple of seconds under load, and in less than a second when run by themselves.

But once in a while I still noticed heavy loads on the Wopular server. Because they weren't due to high traffic or slow queries, there wasn't much I could do other than wait them out. They usually cleared up after about a day, and during those times the site stayed up, just slow.

For MoviesWithButter, it's a different story. These heavy loads tend to take the site down entirely, so waiting them out is not an option.

I started reading up on bots and found that Baidu and Yandex, search engines for China and Russia respectively, tend to be overly aggressive in spidering pages. Since I wasn't getting many referrals from them, if any, I decided to forbid them from crawling either site.

I did it with both the robots.txt and .htaccess files, just to make sure they were blocked. With the robots.txt file, you'll have to wait a couple of days to see if it works. With the .htaccess file, it's immediate.
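For reference, here is a minimal sketch of what the robots.txt entries can look like, assuming you want to disallow both bots site-wide (Baiduspider and Yandex are the user-agent tokens those crawlers identify themselves with):

    # Block Baidu's crawler from the whole site
    User-agent: Baiduspider
    Disallow: /

    # Block Yandex's crawler from the whole site
    User-agent: Yandex
    Disallow: /

Well-behaved bots re-fetch robots.txt periodically, which is why this route takes a couple of days to kick in.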

I tried two methods with the .htaccess file: one using RewriteRule and another using SetEnvIfNoCase. I couldn’t get the RewriteRule method working, but was successful with SetEnvIfNoCase. I mainly followed the directions on this page for the latter method.
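As a rough sketch of the SetEnvIfNoCase approach (assuming Apache 2.2-style Order/Allow/Deny directives, which is what the .htaccess guides of that era used):

    # Flag any request whose User-Agent matches the offending bots
    SetEnvIfNoCase User-Agent "Baiduspider" bad_bot
    SetEnvIfNoCase User-Agent "Yandex" bad_bot

    # Serve a 403 to flagged requests; let everyone else through
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot

The nice part of this method is that the match is on the User-Agent header, so it works even when the bot ignores robots.txt.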

Afterwards, I checked the Apache access logs to make sure there was a "403" (forbidden) status code instead of a "200" (everything's good) next to URLs requested by the offending search bots. With the SetEnvIfNoCase method, there was.
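A quick way to spot-check this, assuming a combined-format access log at the usual /var/log/apache2/access.log path, is to count the status codes returned to those bots:

    # Tally status codes ($9 in the combined log format) for requests from Baidu or Yandex
    grep -iE "baiduspider|yandex" /var/log/apache2/access.log | awk '{print $9}' | sort | uniq -c

After the block takes effect, the 403s should dominate and the 200s should taper off.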

Once I modified the .htaccess file and confirmed that it worked, the effects were felt immediately: the server's load dropped by a factor of three to five, and CPU usage was halved.

 
