  1. Boards
  2. Message Board Help
  3. Scraping GameFAQs

User Info: Lifeinsteps

Lifeinsteps
1 week ago#1
I received a warning for attempting to program a (just for fun/personal use) scraper for GameFAQs to get some information about games.

When I saw the warning, I stopped running the program right away, since I was threatened with an IP block. However, I noticed that the warning says the problem is not scraping the website as such, but scraping with a script which "does not identify itself or falsifies headers".

The script I had written was in Python, and I was using urllib.request to retrieve pages from the top ranked games area.

Does the warning mean that I should identify with a different "User-agent" in my script and this would be allowed behavior (for instance, if I simply set the User-agent to '*')? Or does the warning simply mean: 'Only certain bots are allowed?' Also, if that's what it means, would limiting the rate of HTML requests to, say, 2 a minute keep it from upsetting the server (and the server owner(s))?

Until I get the answer I won't run the script anymore. I'm not trying to harm the website (on the contrary, I'm trying to pull the information because I like the website); I'm just a college programmer and was playing around with a personal project.
(edited 1 week ago)

User Info: Error1355

Error1355
1 week ago#2
Moderators really don't have anything to do with this, sorry.
Welcome home, shed your skin and expose your bones.
Take my hand, follow us into the black so far that we can't get back.

User Info: Lifeinsteps

Lifeinsteps
1 week ago#3
No worries! I was hoping that the admin would notice the thread and have a look, but failing that, I found a less harmful solution that should be okay with everyone (downloading the pages by hand and running my script on those saved pages).
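
For anyone wanting to try the same offline approach, here is a minimal sketch of running a script over pages saved by hand. It assumes the pages were saved as .html files in one folder, and it only pulls each page's <title> as a stand-in, since the thread does not say which fields the original script extracted. The names TitleParser, extract_title, and titles_from_folder are hypothetical.

```python
from html.parser import HTMLParser
from pathlib import Path


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is recorded; later ones are ignored.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html_text):
    """Return the stripped title text of one HTML document."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title.strip()


def titles_from_folder(folder):
    """Map each saved .html file name in `folder` to its page title."""
    results = {}
    for path in sorted(Path(folder).glob("*.html")):
        text = path.read_text(encoding="utf-8", errors="replace")
        results[path.name] = extract_title(text)
    return results
```

Since this never touches the network, it sends no requests to the site at all. Swapping the <title> logic for whatever elements hold the game information would be the next step.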

User Info: Devin Morgan

Devin Morgan
1 week ago#4
@Lifeinsteps

You'll want to include an email address in the user-agent as a point of contact. That should ease up the restrictions you're seeing, just make sure you're using a reasonable load rate and avoid multi-threaded requests.
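
As a rough illustration of that advice using urllib.request (the module mentioned in the first post), the sketch below sets a User-Agent that carries a contact email and fetches pages strictly one at a time with a pause between them. The bot name, the email address, and the 30-second delay (2 requests a minute, the figure floated in the first post) are all illustrative assumptions; the thread does not state what rate or User-Agent format the site actually accepts.

```python
import time
import urllib.request

# Hypothetical values: none of these are confirmed by GameFAQs.
BOT_NAME = "personal-game-stats-bot/0.1"
CONTACT_EMAIL = "you@example.com"
DELAY_SECONDS = 30.0  # 2 requests a minute


def build_request(url, contact_email=CONTACT_EMAIL):
    """Return a Request whose User-Agent names the bot and a contact address."""
    user_agent = f"{BOT_NAME} (contact: {contact_email})"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_all(urls, delay=DELAY_SECONDS,
              opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch pages sequentially (no threads), pausing between requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # wait before every request after the first
        with opener(build_request(url)) as response:
            pages[url] = response.read()
    return pages
```

The opener and sleep parameters exist only so the loop can be exercised without hitting a real server; in normal use they can be left at their defaults.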

User Info: LusterSoldier

LusterSoldier
5 days ago#5
Devin Morgan posted...
You'll want to include an email address in the user-agent as a point of contact. That should ease up the restrictions you're seeing, just make sure you're using a reasonable load rate and avoid multi-threaded requests.


Providing an email address is the kind of thing you'd do if you were running a bot for commercial use, not personal use like the topic creator was doing. Since the operator of the bot has an actual GameFAQs account, would providing his GameFAQs username in the user-agent be acceptable? This site does have a PM and notification feature and he can be alerted by PM or a notification if the bot is requesting pages too frequently.
Luster Soldier --- ~Shield Bearer~ | ~Data Analyst~
Popular at school, but not as cool as Advokaiser, Guru Champ!
(edited 5 days ago)

User Info: Error1355

Error1355
5 days ago#6
I'm fairly sure that Devin knows what he is talking about with regard to this.

User Info: vlado_e

vlado_e
5 days ago#7
LusterSoldier posted...
Since the operator of the bot has an actual GameFAQs account, would providing his GameFAQs username in the user-agent be acceptable?

Consider that this practice can be put to malicious use. Somebody could put any username there and make a lot of requests to try to get that user banned. Or they could just tie it to a sockpuppet account to avoid any problems.

Sure, an email address is also abusable, but at least you cannot (easily) impersonate another user with it.
We do what we must / because we can. / For the good of all of us. / Except the ones who are dead.