Using Scrapy to scrape the MIT OpenCourseWare site for syllabi


Posted on 16 Aug 2022.


Answers (2)

manpreet 2 years ago

 

My spider to scrape the MIT OpenCourseWare site for syllabi doesn't work. Can someone please help me figure out what's wrong with it? The .*'s in the allow pattern are meant to match every course. Is this right?

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from opensyllabi.items import OpensyllabiItem

class MITSpider(CrawlSpider):
    name = 'mit'
    allowed_domains = ['ocw.mit.edu']
    start_urls = ['http://ocw.mit.edu/courses']
    rules = [Rule(SgmlLinkExtractor(allow=['/.*/.*/syllabus']), 'parse_syllabus')]

    def parse_syllabus(self, response):
        x = HtmlXPathSelector(response)

        syllabus = OpensyllabiItem()
        syllabus['url'] = response.url
        syllabus['body'] = x.select("//div[@id='course_inner_section']").extract()
        return syllabus
manpreet 2 years ago

Try:

rules = [
    Rule(SgmlLinkExtractor(allow=r'/[^/]+/[^/]+/syllabus'), 'parse_syllabus'),
    Rule(SgmlLinkExtractor()),
]

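The second rule has no callback, so it simply follows every link it finds, starting with the links on the first page; the first rule then picks out the syllabus pages. With only the syllabus rule, the crawler has nothing to follow unless the start page links directly to a syllabus. Watch out, though: following everything means a lot of links.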
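To see how the two allow patterns differ, here is a minimal sketch using only Python's re module. It assumes the link extractor applies each allow pattern as an unanchored regex search against the full URL; the three URLs are hypothetical examples, not pages checked against the real site.

```python
import re

# The pattern from the question and the tighter one from the answer.
LOOSE = re.compile(r'/.*/.*/syllabus')
TIGHT = re.compile(r'/[^/]+/[^/]+/syllabus')

# Hypothetical URLs, made up for illustration only.
urls = [
    'http://ocw.mit.edu/courses',
    'http://ocw.mit.edu/courses/mathematics/some-course/syllabus',
    'http://ocw.mit.edu/syllabus',
]


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


# Map each URL to (matched by LOOSE, matched by TIGHT).
results = {url: (matches(LOOSE, url), matches(TIGHT, url)) for url in urls}

for url, (loose, tight) in results.items():
    print(f'{url}: loose={loose} tight={tight}')
```

Neither pattern matches the start URL itself, which is why a rule that follows other links is needed. The loose pattern also accepts the third URL, because .* can match an empty string and can span slashes (here it uses the slashes in http://), while [^/]+ insists on two non-empty path segments before /syllabus.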
