Relieving High Server Load By Blocking Search Bots

Over the years, whenever the site was slow to load, I always went through the same routine. I checked traffic logs to see if it was due to a sudden increase in traffic. If not, I would look at the MySQL slow query log to see if there were slow queries bogging down the database. If that wasn’t the culprit, then I would take a look at the access logs to see if there was any irregular activity from search engine bots.

Most of the time, it wasn’t due to a sudden spike in traffic. Even when Roger Ebert mentioned Wopular on his Twitter feed, which had hundreds of thousands of followers, the server handled it just fine.

It’s usually a combination of the other two causes: a search bot gets into a section of the site that has one or more slow queries and spans many pages. As the bot goes through each page, the queries start to pile up and eventually cause a traffic jam.

Well, over the years I’ve fixed many of those pages, so my slow query logs tend to be pretty small. The queries that remain are usually not slow enough to give the server fits. They finish within a couple of seconds, and when tested by themselves, they take less than a second.

But I still noticed heavy loads on the server once in a while on Wopular. Because they weren’t due to high traffic or slow queries, there was nothing much I could do other than wait them out. Things were usually fine after about a day. During those times, the site would still be up, just slow.

For MoviesWithButter, it’s a different story. The heavy loads tend to take the site down, so waiting them out is not an option.

I started reading up on bots and found that Baidu and Yandex, search engines for China and Russia respectively, tend to be overly aggressive in spidering pages. Since I wasn’t getting many referrals from them, if any, I decided to forbid them from crawling either site.

I did it with both the robots.txt and .htaccess files, just to make sure they’re blocked. With the robots.txt file, you’ll have to wait a couple of days to see if it works. With the .htaccess file, it’s immediate.
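For illustration, a robots.txt that asks both crawlers to stay out of the whole site could look like the sketch below. The post doesn’t show the actual file, so the user-agent tokens here ("Baiduspider" for Baidu and "Yandex" for Yandex’s crawlers) are assumptions based on the names those bots commonly announce.

```
# Ask Baidu's crawler to stay out of the entire site
User-agent: Baiduspider
Disallow: /

# Ask Yandex's crawlers to stay out of the entire site
User-agent: Yandex
Disallow: /
```

Keep in mind that robots.txt is only a request. A bot has to fetch the file again and choose to honor it, which is likely why the effect takes a while to show up.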

I tried two methods with the .htaccess file: one using RewriteRule and another using SetEnvIfNoCase. I couldn’t get the RewriteRule method working, but was successful with SetEnvIfNoCase. I mainly followed the directions on this page for the latter method.
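A minimal sketch of the SetEnvIfNoCase approach is shown below. It is not the exact file used on either site: the user-agent patterns and the `bad_bot` variable name are assumptions, and it presumes mod_setenvif is enabled and that the host allows these directives in .htaccess.

```
# Flag requests whose User-Agent header contains either bot name (case-insensitive)
SetEnvIfNoCase User-Agent "Baiduspider" bad_bot
SetEnvIfNoCase User-Agent "Yandex" bad_bot

# Apache 2.2-style access control: allow everyone except flagged requests
Order Allow,Deny
Allow from all
Deny from env=bad_bot

# On Apache 2.4 without mod_access_compat, the equivalent would be:
# <RequireAll>
#     Require all granted
#     Require not env bad_bot
# </RequireAll>
```

Flagged requests get a 403 response before any page code or database query runs, which is what takes the pressure off the server.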

Afterwards, I checked the Apache logs to make sure that there was a “403” (forbidden) code instead of “200” (everything’s good) next to URLs accessed by the offending search bots. There was, with the SetEnvIfNoCase method.
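Eyeballing the log works, but the same check can be scripted. The sketch below tallies response codes per bot, assuming the access log uses Apache’s "combined" format. The function name and the bot substrings are hypothetical choices for this example, not something from the original setup.

```python
import re
from collections import Counter

# Apache "combined" log format:
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def count_bot_statuses(lines, bots=("baiduspider", "yandex")):
    """Return {bot: Counter({status_code: hits})} for the given log lines.

    `bots` holds lowercase substrings to look for in the User-Agent field
    (assumed names, adjust to whatever shows up in your own logs).
    """
    counts = {bot: Counter() for bot in bots}
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in combined format
        agent = match.group("agent").lower()
        for bot in bots:
            if bot in agent:
                counts[bot][match.group("status")] += 1
    return counts
```

To use it, open the access log and pass the file object to `count_bot_statuses`. If the block is working, new hits from each bot should show up under "403" rather than "200".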

Once I had modified the .htaccess file and confirmed that it worked, the effects were immediate. The server’s load dropped by a factor of three to five, and CPU usage was halved.
