Scraping For Scooters From Hamrobazaar

Hamrobazaar.com is one of the most popular forum-based e-commerce websites in Nepal. You can find all sorts of items on the site, but finding the right item for sale, say a scooter, can be quite challenging because of the number of entries and the many ways sellers spell the same word.

So, here is a Scrapy script to the rescue!

 


import scrapy
from scrapy.http import Request

from hb_scrape.items import HbScrapeItem


class scooty(scrapy.Spider):
    name = "scooty"

    def start_requests(self):
        # Spellings (and common misspellings) that sellers use for "scooter".
        filters = ["scooty", "scooter", "scoter", "scotty"]
        linkUrl = 'http://hamrobazaar.com/search.php?do_search=Search&searchword={0}&Search.x=0&Search.y=0&catid_search=0'

        # One search request per spelling.
        for word in filters:
            yield Request(linkUrl.format(word), self.parse)

    def parse(self, response):
        print(response.url)
        print('~~~~~~~~~~~~begin----------------------------')

        # Ad links sit in table cells with one of two background colours.
        for color in ('#ECF0F6', '#F2F4F9'):
            for sel in response.xpath('//td[@bgcolor="%s"]/a' % color):
                aLink = sel.xpath('@href').extract_first()
                # Skip anchors without an href and links containing 'useritems'.
                if aLink and 'useritems' not in aLink:
                    yield Request(response.urljoin(aLink), self.parseAdLink)

        # Follow the "Next" link only when the page actually has one.
        nextLink = response.xpath('//u[contains(text(),"Next")]')
        nextAlink = nextLink.xpath('../../@href').extract_first()
        if nextAlink:
            yield Request(response.urljoin(nextAlink), self.parse)

    def parseAdLink(self, response):
        item = HbScrapeItem()
        title = response.xpath('//span[@class="title"]//text()').extract()
        item['adTitle'] = ''.join(title)

        # Each field below is read the same way: find the cell holding the
        # label, then take the text of the second cell in the same row.
        adPostDateLabel = response.xpath('//td[contains(text(),"Ad Post Date:")]')
        item['adPostDate'] = adPostDateLabel.xpath('../td[2]/text()').extract_first()

        adViewsLabel = response.xpath('//td[contains(text(),"Ad Views:")]')
        item['adViewsCount'] = adViewsLabel.xpath('../td[2]/text()').extract_first()

        sellerLabel = response.xpath('//td[contains(text(),"Sold by:")]')
        item['seller'] = sellerLabel.xpath('../td[2]/text()').extract_first()

        sellerPhoneLabel = response.xpath('//td[contains(text(),"Mobile Phone:")]')
        item['sellerPhone'] = sellerPhoneLabel.xpath('../td[2]/text()').extract_first()

        sellerAddressLabel = response.xpath('//td[contains(text(),"Location:")]')
        address = sellerAddressLabel.xpath('../td[2]/text()').extract()
        item['address'] = ' '.join(address)

        priceLabel = response.xpath('//td[contains(text(),"Price:")]')
        item['price'] = priceLabel.xpath('../td[2]//text()').extract_first()

        makeYearLabel = response.xpath('//td[contains(text(),"Make Year:")]')
        item['makeYear'] = makeYearLabel.xpath('../td[2]/text()').extract_first()

        lotNumLabel = response.xpath('//td[contains(text(),"Lot No:")]')
        item['lotNumber'] = lotNumLabel.xpath('../td[2]/text()').extract_first()

        featuresLabel = response.xpath('//td[contains(text(),"Features:")]')
        item['features'] = featuresLabel.xpath('../td[2]/text()').extract_first()

        item['adUrl'] = response.url

        yield item

Web Scraping Quotes From Goodreads

Introduction

Goodreads is a very good resource for information about books, authors, and interesting quotations.

In this post, I will share a piece of code that will allow you to scrape quotations from the site. The code is written for Python's Scrapy framework.

Getting Started

To get started with scraping quotes from your favorite author, first search for the author's name in the quotes section.

Quote Search Section

Once you have searched for the author's name, inspect the displayed results to find the CSS selectors and XPath expressions that point to the data you want to scrape.

Looking For Xpaths

Code For Spider

Now that we know where the data is, the next step is to create a spider that will scrape it from this page. A spider in Scrapy is a class that defines how to crawl a site and extract data from its pages. You can find more information in the Scrapy documentation.

We want to loop over each "quoteDetails" section to get the author and quote text.


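for sel in response.css('div.quoteDetails'):
    # extract() returns a list of text fragments;
    # extract_first() returns a single string, or None if nothing matched.
    quote = sel.css('div.quoteText::text').extract()
    author = sel.css('div.quoteText a::text').extract_first()
    item = GoodreadsItem()
    item['author'] = author
    item['quote'] = quote
    yield item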
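
Because `extract()` returns every text node directly under `div.quoteText`, `quote` is a list of fragments that may include surrounding whitespace and the dash that precedes the author's name. Here is a minimal sketch of a tidy-up step, assuming fragments shaped like the sample below. The `clean_quote` helper and the sample data are hypothetical and not part of the original spider.

```python
def clean_quote(fragments):
    """Join text fragments, then strip whitespace, the trailing dash, and quote marks."""
    # Keep only fragments that contain something other than whitespace.
    text = ' '.join(part.strip() for part in fragments if part.strip())
    # Drop a trailing attribution dash (horizontal bar, em dash, or hyphen).
    text = text.rstrip('\u2015\u2014- ')
    # Remove surrounding curly or straight double quotation marks.
    return text.strip('\u201c\u201d"')


# Hypothetical fragments, shaped like what extract() might return.
sample_fragments = ['\n      \u201cExample quote text.\u201d\n  ', '\n    \u2015\n  ', '\n']
cleaned = clean_quote(sample_fragments)
```

With a helper like this, the loop could store `clean_quote(quote)` in `item['quote']` instead of the raw list.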

Each quote gets extracted as a “GoodreadsItem” object.

Next, to scrape data from the next page, the following code can be used:


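nextPageLink = response.xpath('//a[@class="next_page"]/@href').extract_first()
# extract_first() returns None when there is no next page, so test for
# None instead of calling len() on the result.
if nextPageLink is not None:
    nextPageFullUrl = response.urljoin(nextPageLink)
    print(nextPageFullUrl)
    # Assumes this code runs inside the spider's parse() callback
    # and that scrapy has been imported.
    yield scrapy.Request(nextPageFullUrl, callback=self.parse)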

Conclusion

That’s all the code needed for scraping. It’s quite easy and fun to scrape with Scrapy. Good luck!