How to write a crawler?


I have been thinking about writing a simple crawler that would crawl our NPO's websites and content and produce a list of its findings. Does anybody have any thoughts on how to do this? Where do you point the crawler to get started? How does it send back its findings and still keep crawling? How does it know what it finds, etc.?

Posted on 16 Aug 2022.


Answers (1)

manpreet (Best Answer, 2 years ago)
You'll be reinventing the wheel, to be sure. But here's the basics:

  * A list of unvisited URLs - seed this with one or more starting pages
  * A list of visited URLs - so you don't go around in circles
  * A set of rules for URLs you're not interested in - so you don't index the whole Internet

Put these in persistent storage, so you can stop and start the crawler without losing state.

Algorithm is:

    while (list of unvisited URLs is not empty) {
        take URL from list
        remove it from the unvisited list and add it to the visited list
        fetch content
        record whatever it is you want to about the content
        if content is HTML {
            parse out URLs from links
            foreach URL {
                if it matches your rules
                  and it's not already in either the visited or unvisited list
                    add it to the unvisited list
            }
        }
    }
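For anyone who wants something to run, here is one way the loop above might look in Python using only the standard library. Treat it as a rough sketch, not a finished crawler: the names `crawl`, `LinkParser`, `allowed`, `allowed_hosts` and `max_pages` are made up for this example, the "rules" are reduced to a simple host whitelist, and the only thing recorded per page is its content type and size. State is kept in memory here, so the persistent storage part is left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download url and return (content_type, text)."""
    with urlopen(url, timeout=10) as resp:
        content_type = resp.headers.get_content_type()
        charset = resp.headers.get_content_charset() or "utf-8"
        return content_type, resp.read().decode(charset, errors="replace")


def allowed(url, allowed_hosts):
    """Hypothetical rule set: only http(s) URLs on whitelisted hosts."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and parts.netloc in allowed_hosts


def crawl(seeds, allowed_hosts, fetch=fetch, max_pages=100):
    unvisited = deque(seeds)   # URLs still to fetch
    visited = set()            # URLs already fetched
    known = set(seeds)         # visited + unvisited, for fast duplicate checks
    findings = {}              # whatever you want to record per URL

    while unvisited and len(visited) < max_pages:
        url = unvisited.popleft()
        visited.add(url)
        try:
            content_type, body = fetch(url)
        except OSError:        # network and HTTP errors from urlopen
            findings[url] = None
            continue
        findings[url] = {"type": content_type, "size": len(body)}

        if content_type == "text/html":
            parser = LinkParser()
            parser.feed(body)
            for href in parser.links:
                # Resolve relative links and drop any #fragment.
                link = urldefrag(urljoin(url, href)).url
                if allowed(link, allowed_hosts) and link not in known:
                    known.add(link)
                    unvisited.append(link)
    return visited, findings
```

Something like `visited, findings = crawl(["https://example.org/"], {"example.org"})` would start it (the URL is a placeholder). To stop and restart without losing state, `unvisited`, `visited` and `findings` would need to be written to a file or database between runs, which this sketch does not do.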
