Creating a bot/crawler

Web Technologies · Web Development · asked 2 years ago

I would like to make a small bot to automatically and periodically surf a few partner websites. This would save several hours for a lot of employees here. The bot must be able to:

- connect to each website,
- on some of them, log in as a user,
- access and parse a particular piece of information on the website.

The bot must be integrated into our website and take its settings (user to use, …) from our website's data. Eventually it must sum up the parsed information. Preferably this should be done on the client side, not on the server.

I tried Dart last month and loved it, so I would like to do this in Dart. But I am a bit lost:

- Can I use a Document class object for each website I want to parse? Could this be headless, or should I use the Chrome/Dartium API to control the web browser (I'd like to avoid this)?
- I've been reading this thread: https://groups.google.com/a/dartlang.org/forum/?fromgroups=#!searchin/misc/crawler/misc/TkUYKZXjoEg/Lj5uoH3vPgIJ
- Is using https://github.com/dart-lang/html5lib a good idea for my case?


Answers (1)

manpreet · Best Answer · 2 years ago
There are two parts to this:

1. Get the page from the remote site.
2. Read the page into a class that you can parse.

For the first part, if you are planning on running this client-side, you are likely to run into cross-site issues, in that your page, served from server X, cannot request pages from server Y unless the correct headers are set. See "CORS with Dart, how do I get it to work?" and "Dart application and cross domain policy": the site in question needs to be returning the correct CORS headers.

Assuming that you can actually get the pages from the remote site client-side, you can use HttpRequest to retrieve the actual content:

```dart
// snippet of code...
new HttpRequest.get("http://www.example.com", (req) {
  // process the req.responseText
});
```

You can also use HttpRequest.getWithCredentials. If the site has some custom login, then you will probably have problems, as you will likely have to HTTP POST the username and password from your site to their server.

This is where the second part comes in. You can process your HTML using the DocumentFragment.html(...) constructor, which gives you a nodes collection that you can iterate and recurse through. The example below shows this for a static block of HTML, but you could use the data returned from the HttpRequest above.

```dart
import 'dart:html';

void main() {
  var d = new DocumentFragment.html("""<div>Foo</div>""");

  // print the content of the top-level nodes
  d.nodes.forEach((node) => print(node.text)); // prints "Foo"

  // real world - use recursion to go down the hierarchy.
}
```

I'm guessing (not having written a spider before) that you'd want to pull out specific tags at specific locations/depths to sum up as your results, and also add the URLs in hyperlinks to a queue that your bot will navigate into.
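To make that last point concrete, here is a minimal sketch of the recursion, assuming the values you want live in table cells and that new pages are reached through plain href links; the collect helper, the TD test and the queue are illustrative choices, not part of the answer above:

```dart
import 'dart:html';

// Walk the parsed nodes, print the cells we care about and queue any links.
// Hypothetical helper for illustration; adjust the tag test to the real site.
void collect(Node node, List<String> queue) {
  if (node is Element && node.tagName == 'TD') {
    print(node.text); // a value to sum up later
  }
  if (node is AnchorElement) {
    var href = node.getAttribute('href');
    if (href != null) queue.add(href); // a page for the bot to visit next
  }
  node.nodes.forEach((child) => collect(child, queue));
}

void main() {
  // NodeTreeSanitizer.trusted skips the default sanitizer, which would
  // otherwise strip attributes and links it does not recognise.
  var fragment = new DocumentFragment.html("""
      <table><tr><td>42</td><td>58</td></tr></table>
      <a href="/reports/next">next page</a>
  """, treeSanitizer: NodeTreeSanitizer.trusted);

  var queue = <String>[];
  fragment.nodes.forEach((node) => collect(node, queue));
  print(queue); // URLs still to crawl
}
```

In a real run you would feed the responseText from the HttpRequest into DocumentFragment.html instead of a static string, and keep popping URLs off the queue to fetch next.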

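On the fetching side, if your SDK no longer has the HttpRequest.get and getWithCredentials constructors used in the answer, the static helpers HttpRequest.postFormData and HttpRequest.getString cover the same ground. Below is a minimal log-in-then-fetch sketch, assuming the partner site accepts a form-style login and returns permissive CORS headers; the URLs and field names are made up:

```dart
import 'dart:async';
import 'dart:html';

// Hypothetical partner-site endpoints and form fields, for illustration only.
// withCredentials: true asks the browser to send/store the session cookie,
// which only works if the remote site returns matching CORS headers.
Future<String> logInAndFetch(String user, String password) async {
  await HttpRequest.postFormData(
    'https://partner.example.com/login',
    {'username': user, 'password': password},
    withCredentials: true,
  );

  return HttpRequest.getString(
    'https://partner.example.com/report',
    withCredentials: true,
  );
}

void main() {
  logInAndFetch('bot-user', 'secret').then((html) {
    // hand the html to DocumentFragment.html(...) for parsing, as above
    print(html.length);
  });
}
```

Note that withCredentials only helps if the remote site both allows your origin and allows credentials in its CORS response; otherwise the browser will still block the request.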